CB>Even my 8 year old Microsoft Quick C, a compiler hardly known for
CB>aggressive optimization, and designed to run on even an XT, will do that
CB>conversion.
Borland compilers fail to perform such an optimization.
CB>The first one, I felt it was 'unfair' of you to unroll the pointer
CB>example 4 times but not the indexing one. So, I unrolled it 4 times.
CB>And I got 6,150,160 bytes per second. That's about 10.5 cycles per
CB>byte. And again, the index version is faster than the pointer version.
Yes, I am aware of that. I was, however, showing the different methods of
writing a loop. On your machine the index version performed faster than the
pointer version. It did _NOT_ do so on my machine. That is why I am not
very happy about letting a compiler do the 'dirty work' for me.
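For reference, a minimal sketch of the two loop styles being compared, both unrolled four times (function names are mine, not from the thread; n is assumed to be a multiple of 4, as in the benchmark). Which one wins depends on the compiler and CPU, which is exactly the point above:

```c
#include <stddef.h>

/* Index version, unrolled 4x. */
void copy_index(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}

/* Pointer version, unrolled 4x. */
void copy_ptr(unsigned char *dst, const unsigned char *src, size_t n)
{
    const unsigned char *end = src + n;
    while (src < end) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
```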
CB>The second optimization I did was notice that you were still messing
CB>with transferring a byte at a time. That's rather slow. So, I changed
CB>your unrolled pointer loop to 32 bit integers and got 11,168,389 bytes
CB>transferred per second. Double your best optimization on my computer.
CB>That works out to about 6 cycles per byte transferred. (You would also
CB>need to deal with making sure the data is aligned properly, since that
CB>is a major performance killer, and on some computers, even fatal. But
CB>it can be dealt with.)
Your program would have performed incorrectly.
Take note of '#define 99999': 99999 bytes is not a multiple of 4, so a loop
that moves only whole 32-bit words either drops the last three bytes or
writes past the end of the buffer.
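A sketch of a 32-bit copy that stays correct for sizes like 99999 by finishing the odd bytes separately (my illustration, not code from the thread; alignment fix-up at the front is omitted):

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes 32 bits at a time, then finish the 0..3 leftover
   bytes one at a time.  memcpy through a temporary avoids
   unaligned-access traps on strict architectures; most x86
   compilers reduce it to plain loads and stores. */
void copy32(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t words = n / 4;
    while (words--) {
        unsigned int w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        src += 4;
        dst += 4;
    }
    switch (n % 4) {          /* tail bytes a word copy would miss */
    case 3: *dst++ = *src++;  /* fall through */
    case 2: *dst++ = *src++;  /* fall through */
    case 1: *dst++ = *src++;
    }
}
```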
CB>The next optimization is, of course, the most obvious one.... I used
CB>memcpy(). I did it just as a way of benchmarking the performances.
CB>And, within my timing resolution, I got very similar results to my
CB>unrolled loop above. (There were enough cases where the integer unrolled
CB>loop was faster than the inline rep movsd of memcpy() that I'm not
CB>entirely sure it was timing resolution problems.) (And when the data can
CB>fit entirely into the cache, this drops down to about 1.5 cycles per
CB>byte. On a Pentium, it would be around 1/3rd cycle per byte.) When the
CB>data fit entirely into the L2, and especially the L1 cache, memcpy() was
CB>faster. The rest of the time, they were the same.
So that means, 'who cares, it's fast on my machine', right?
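For anyone wanting to repeat the comparison, a crude bytes-per-second harness in the spirit of the numbers being thrown around (my sketch; results depend heavily on whether the buffer fits in L1/L2 cache, as noted above):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Repeatedly memcpy() an n-byte buffer and return bytes copied per
   second, or 0.0 if allocation fails or the clock did not advance.
   clock() resolution is coarse, so use a large rep count. */
double bench_memcpy(size_t n, int reps)
{
    unsigned char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { free(src); free(dst); return 0.0; }
    memset(src, 0xAB, n);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, n);
    clock_t t1 = clock();

    free(src);
    free(dst);
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0 ? (double)n * reps / secs : 0.0;
}
```

The same harness can time any of the hand-rolled loops for an apples-to-apples comparison on a given machine.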
CB>(I could have made a couple of additional optimizations. First,
CB>pre-warm the cache, by making sure the data we are about to copy is
CB>actually in the cache. That can reduce memory latency. The second
CB>might have been to use the floating point registers (or the MMX
CB>registers) since they can safely transfer 8 bytes at a time, instead of
CB>just 4 bytes. Optimizations such as those have been shown to give
CB>significant improvements in real code. I didn't bother, because it
CB>wasn't worth the effort for an example.)
You could also add code patching to speed it up by 50%.
I am aware of these optimizations, Carey.
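The cache pre-warming idea mentioned above can be sketched portably by touching one byte per cache line before the copy proper (my illustration; the 64-byte line size is a guess, real code would query the CPU):

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed line size, not queried */

/* Touch one byte per cache line of the source so the lines are
   already resident when the copy runs.  The volatile access keeps
   the compiler from deleting the "useless" reads. */
void copy_prewarmed(unsigned char *dst, const unsigned char *src, size_t n)
{
    volatile const unsigned char *probe = src;
    for (size_t i = 0; i < n; i += CACHE_LINE)
        (void)probe[i];          /* prefetch by touching */
    memcpy(dst, src, n);
}
```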
... Error 87 - Tagline out of characters...
--- Ezycom V1.48g0 01fd016b
---------------
* Origin: Fox's Lair BBS Bris Aus +61-7-38033908 V34+ Node 2 (3:640/238)