echo: c_plusplus
to: CAREY BLOODWORTH
from: HERMAN SCHONFELD
date: 1997-05-02 19:43:00
subject: DJGPP OPTIMIZATIONS 1/2

CB>Even my 8 year old Microsoft Quick C, a compiler hardly known for
CB>aggressive optimization, and designed to run on even an XT, will do that
CB>conversion.
Borland compilers fail to perform such an optimization.
CB>The first one, I felt it was 'unfair' of you to unroll the pointer
CB>example 4 times but not the indexing one.  So, I unrolled it 4 times.
CB>And I got 6,150,160 bytes per second.  That's about 10.5 cycles per
CB>byte.  And again, the index version is faster than the pointer version.
Yes, I am aware of that. I was, however, showing the different methods of
writing a loop. On your machine the index version performed faster than
the pointer version. It did _NOT_ do this on my machine. That is why I am
not very happy about letting a compiler do the 'dirty-work' for me.
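[For readers following along: the two styles being argued about look
roughly like this. A minimal sketch, not either author's actual code;
both versions assume the length is a multiple of four, as the unrolled
loops being timed did.]

```c
#include <stddef.h>

/* Index-based copy, unrolled 4x (assumes n is a multiple of 4). */
void copy_index(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}

/* Pointer-based copy, unrolled 4x (same assumption about n). */
void copy_ptr(unsigned char *dst, const unsigned char *src, size_t n)
{
    const unsigned char *end = src + n;
    while (src < end) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
```

[Which one wins depends on how well the compiler folds the index
arithmetic into addressing modes - which is exactly why the two
machines disagreed.]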
CB>The second optimization I did was notice that you were still messing
CB>with transferring a byte at a time.  That's rather slow. So, I changed
CB>your unrolled pointer loop to 32 bit integers and got 11,168,389 bytes
CB>transferred per second.  Double your best optimization on my computer.
CB>That works out to about 6 cycles per byte transferred. (You would also
CB>need to deal with making sure the data is aligned properly, since that
CB>is a major performance killer, and on some computers, even fatal.  But
CB>it can be dealt with.)
Your program would've performed incorrectly.
Take note of '#define 99999'
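[The catch being pointed at: 99999 is not a multiple of four, so a loop
that only moves 32-bit words leaves the last three bytes uncopied. A
sketch of a word copy with a byte tail - my illustration, not code from
either poster; uint32_t is the modern spelling, on DJGPP's gcc an
unsigned long was the 32-bit type.]

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes as 32-bit words plus a byte tail, so odd sizes like
   99999 (99999 % 4 == 3) are copied completely.  Assumes both buffers
   are 4-byte aligned; misaligned 32-bit accesses are slow on a 486
   and fatal on some other CPUs, as the quoted text notes. */
void copy32(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint32_t *d = (uint32_t *)dst;
    const uint32_t *s = (const uint32_t *)src;
    size_t words = n / 4;
    size_t tail  = n % 4;

    while (words--)
        *d++ = *s++;

    dst = (unsigned char *)d;
    src = (const unsigned char *)s;
    while (tail--)
        *dst++ = *src++;
}
```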
CB>The next optimization is, of course, the most obvious one.... I used
CB>memcpy().  I did it just as a way of benchmarking the performances.
CB>And, within my timing resolution, I got very similar results to my
CB>unrolled loop above. (There were enough cases where the integer unrolled
CB>loop was faster than the inline rep movsd of memcpy() that I'm not
CB>entirely sure it was timing resolution problems.) (And when the data can
CB>fit entirely into the cache, this drops down to about 1.5 cycles per
CB>byte.  On a Pentium, it would be around 1/3rd cycle per byte.)  When the
CB>data fit entirely into the L2, and especially the L1 cache, memcpy() was
CB>faster. The rest of the time, they were the same.
So that means, 'who cares, it's fast on my machine', right?
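[For what it is worth, throughput figures like the ones being traded
can be measured with nothing fancier than clock(). A rough harness -
not Carey's actual benchmark; the size and pass count here are made
up, and results will swing with cache size exactly as the quote says.]

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough memcpy() throughput in bytes/second over 'passes' copies of
   'n' bytes; returns 0 if the clock resolution is too coarse. */
double bench_memcpy(size_t n, int passes)
{
    unsigned char *src = malloc(n), *dst = malloc(n);
    clock_t t0, t1;
    double rate = 0.0;
    int i;

    if (src && dst) {
        memset(src, 0x55, n);
        t0 = clock();
        for (i = 0; i < passes; i++)
            memcpy(dst, src, n);
        t1 = clock();
        if (t1 > t0)
            rate = (double)n * passes /
                   ((double)(t1 - t0) / CLOCKS_PER_SEC);
    }
    free(src);
    free(dst);
    return rate;
}
```

[Call it as e.g. bench_memcpy(99999, 1000); repeated copies of the
same 99999-byte buffer will mostly hit the cache, which is the very
effect under dispute.]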
CB>(I could have made a couple of additional optimizations.  First,
CB>pre-warm the cache, by making sure the data we are about to copy is
CB>actually in the cache.  That can reduce memory latency.  The second
CB>might have been to use the floating point registers (or the MMX
CB>registers) since they can safely transfer 8 bytes at a time, instead of
CB>just 4 bytes.  Both optimizations such as those have been shown to give
CB>significant improvements in code.  I didn't bother doing because it
CB>wasn't worth the effort for an example.)
You could also add code patching to speed it up by 50%.
I am aware of these optimizations, Carey.
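[The cache pre-warm Carey describes is usually just a loop touching one
byte per cache line before the real copy starts. A sketch under stated
assumptions: the 32-byte line size matches a 486/Pentium but is a
guess for anything else, and the helper name is mine.]

```c
#include <stddef.h>

/* The volatile sink keeps the compiler from deleting the reads. */
static volatile unsigned char sink;

/* Touch one byte per (assumed 32-byte) cache line so the source is
   resident in the cache before the copy loop runs. */
void prewarm(const unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 32)
        sink = p[i];
}
```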
... Error 87 - Tagline out of caracters...
--- Ezycom V1.48g0 01fd016b
---------------
* Origin: Fox's Lair BBS Bris Aus +61-7-38033908 V34+ Node 2 (3:640/238)

SOURCE: echomail via exec-pc
