echo: c_plusplus
to: HERMAN SCHONFELD
from: CAREY BLOODWORTH
date: 1997-04-30 21:55:00
subject: DJGPP OPTIMIZATIONS 1/2

(talking about *256 vs <<8)
HS>That basic example does not get optimized by most compilers, Darin.
Such as what compilers?
Most compilers do.  You'd be hard pressed to find even a vaguely recent
compiler that doesn't.  Strength reduction, constant folding, etc. are
classic optimizations.
Even my eight-year-old Microsoft QuickC, a compiler hardly known for
aggressive optimization and designed to run on even an XT, will do that
conversion.
I used to have a 1983 C compiler designed to run on a 64k computer, and
even it would do something like that, when it thought it was
appropriate.
Things like that are simply part of what's called 'an optimizing
compiler'.  The optimizations they make are not always appropriate, but
they at least try to do things like that.
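(A tiny made-up illustration of the kind of rewrite I mean; whether a
given compiler actually does it depends on the compiler and its settings:)

    /* Strength reduction: a decent compiler will emit the same code for
       both, turning the multiply by a power of two into a shift on its own. */
    unsigned scale_mul(unsigned x)   { return x * 256; }  /* what you write         */
    unsigned scale_shift(unsigned x) { return x << 8;  }  /* what it tends to emit  */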
HS>Most compilers optimize code well, but just about all don't optimize
HS>the fastest.
They _can't_.  There are simply too many variations in the CPUs.  An
Intel 486 is going to behave very differently than a Cyrix 486, or a
486DLC, etc.  They may all be '486 compatible', but they all behave
differently and run code with different cycles per instruction.  There
is simply no way they can optimize for all of the possible clones, plus
differences in cache size, performance, write back vs. write through,
etc.  All they can do is either aim towards the middle, or optimize for
a very specific cpu/platform combination.
If you are compiling for a line with only a few variations (like the 68k
line) then optimization is much simpler and more consistent.  But with
the x86 line, with the Intel and clone chips, there are 3 or 4 different
8086s, a couple of 8088s, several 80286s, 7 or 8 80386 versions, perhaps
a dozen different 486s, 4 or 5 '586'-class chips, a couple of 686-class
chips, and dozens of different motherboard designs, each with different
cache behavior and size.  You just can't optimize very well because
there is no single point to aim towards.  It's like a shotgun compared
to a rifle.
Even if you limit yourself to just a single 'number', such as a 386 or
486, there is still such an incredibly wide variety of behavior and
performance that it simply is not possible to optimize the code for all
versions.  A Cyrix 486, Intel 486, and an AMD 486 are all going to have
different cycle times for their instructions.  Code that runs optimally
on one can suffer a major performance penalty on another.
As an example, my Cyrix 486 multiplies much, much faster than a standard
Intel 486.  (There are other cases, of course, and there are also cases
where mine is slower than a standard 486.)  Mine can do a 32-bit
multiply in 7 cycles (and an 8- or 16-bit one in 3 cycles).  A standard
486 takes a minimum of 13 cycles for any size multiply and as many as 42
cycles for a 32-bit mul.
If I optimize my code to use multiplications in some program, then on
other computers the performance could be significantly slower.  Perhaps
I use multi-dimensional arrays whose dimensions aren't powers of two, so
the indices have to be multiplied.  Or maybe there is some calculation
where I could get reasonable results by working in powers of two (so I
could use shifts), but instead, knowing my CPU can multiply fast, I go
ahead and do it directly.  Or maybe I'm performing graphics calculations
and tune them for my CPU's ability to do quick multiplies.
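(A made-up illustration of that first case; the array sizes here are
mine, just to show the trade-off:)

    /* A row width of 100 forces a multiply in every index calculation. */
    char grid[200][100];
    char get_narrow(int row, int col) { return grid[row][col]; }      /* row*100 + col  */

    /* Padding the rows to 128 lets the compiler use a shift instead, at
       the cost of wasted memory.  Which version wins depends on the CPU. */
    char grid_padded[200][128];
    char get_padded(int row, int col) { return grid_padded[row][col]; } /* (row<<7) + col */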
HS>Thats the kind of code that I often see in source code.
HS>Not very fast is it?
Unless it's in a loop, the odds are very good that it's good enough.  In
my testing, with my compiler and CPU, it runs at 4/11ths the speed of
the best one.  If it is only done once, or even a few hundred times, the
difference is likely to be negligible compared to the rest of the
program's execution time.
HS>Watcom results :-
HS>---------------------------+
HS>LOOP  |   386   |  486     |
HS>------|---------|----------+
HS>loop1 | 81.1 cy | 35.3 cy  |
HS>loop2 | 38.7 cy | 29.7 cy  |
HS>loop3 | 15.9 cy | 9.5  cy  |
HS>---------------------------+
Well.... I ran your examples of optimizations.  Rather than counting
cycles, I just measured bytes per second, which is easier to calculate
and shows relative performance just as well.  If you want cycle counts,
I'm running a 66MHz 486, so feel free to convert...  I used DJGPP 2.6.3.
(Because of timing resolution, the bytes/sec results do vary between
runs, but they stay within 1% of the reported results.)
Example 1 (with the array indexing) did 4,000,983 bytes per second.
About 16.5 cycles per byte.
Example 2 (with the pointers) did 3,792,598 bytes per second.  This is
about 18 cycles per byte.  Notice that in this case, using pointers
instead of the simpler array indexing is _SLOWER_, even though your
timings above show it to be faster.  This is an excellent example of how
the classic 'wisdom' that 'using a pointer is always faster' simply
isn't true.
Example 3 (with pointers unrolled 4 times) got 5,688,898 bytes per
second.  About 11.8 cycles per byte.
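(For reference, the three variants being timed presumably look something
like the sketches below.  I'm reconstructing them from the descriptions,
so the details may not match Herman's exact code.)

    /* loop1: plain array indexing */
    void copy_index(char *dst, const char *src, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* loop2: pointers */
    void copy_ptr(char *dst, const char *src, int n)
    {
        while (n--)
            *dst++ = *src++;
    }

    /* loop3: pointers, unrolled 4 times (assumes n is a multiple of 4) */
    void copy_ptr4(char *dst, const char *src, int n)
    {
        while (n > 0) {
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
            dst[3] = src[3];
            dst += 4;  src += 4;  n -= 4;
        }
    }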
HS>Now that I have taught you, you may actually want to read the thread and
HS>then come back with your apology.
I then did three of my own improvements.
For the first one, I felt it was 'unfair' of you to unroll the pointer
example 4 times but not the indexing one.  So I unrolled it 4 times, and
I got 6,150,160 bytes per second.  That's about 10.5 cycles per byte.
And again, the index version is faster than the pointer version.
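(Something along these lines, with the same multiple-of-4 caveat as
before:)

    /* array indexing, unrolled 4 times */
    void copy_index4(char *dst, const char *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            dst[i]   = src[i];
            dst[i+1] = src[i+1];
            dst[i+2] = src[i+2];
            dst[i+3] = src[i+3];
        }
    }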
For the second optimization, I noticed that you were still transferring
a byte at a time.  That's rather slow.  So I changed your unrolled
pointer loop to use 32-bit integers and got 11,168,389 bytes transferred
per second, double your best optimization on my computer.  That works
out to about 6 cycles per byte transferred.  (You would also need to
make sure the data is aligned properly, since misalignment is a major
performance killer, and on some computers even fatal.  But it can be
dealt with.)
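(Roughly what I did; this sketch assumes n is a multiple of 16 and both
buffers are 4-byte aligned, which is the alignment issue I just
mentioned.)

    /* unrolled pointer loop, moved up to 32-bit words (long is 32 bits here) */
    void copy_long4(char *dst, const char *src, int n)
    {
        long *d = (long *)dst;
        const long *s = (const long *)src;
        int words = n / 4;
        while (words > 0) {
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            d[3] = s[3];
            d += 4;  s += 4;  words -= 4;
        }
    }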
The next optimization is, of course, the most obvious one: I used
memcpy().  I did it just as a way of benchmarking the performance.  And,
within my timing resolution, I got very similar results to my unrolled
loop above.  (There were enough cases where the unrolled integer loop
was faster than the inline rep movsd of memcpy() that I'm not entirely
sure it was just timing resolution.)  When the data fit entirely into
the L2, and especially the L1 cache, memcpy() was faster; the rest of
the time they were the same.  (With the data entirely in the cache, this
drops to about 1.5 cycles per byte.  On a Pentium, it would be around a
third of a cycle per byte.)
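(Something like this is all the 'benchmark' amounts to; the buffer size
and repeat count below are mine, picked just so the run lasts long
enough for clock()'s coarse resolution.)

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define BUFSIZE (1024L * 1024L)   /* 1 MB, so it won't fit in the cache */
    #define REPEATS 64

    static char src[BUFSIZE], dst[BUFSIZE];

    int main(void)
    {
        clock_t t0, t1;
        double bytes = (double)BUFSIZE * REPEATS;
        int i;

        t0 = clock();
        for (i = 0; i < REPEATS; i++)
            memcpy(dst, src, BUFSIZE);
        t1 = clock();

        printf("%.0f bytes per second\n",
               bytes * CLOCKS_PER_SEC / (double)(t1 - t0));
        return 0;
    }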
(I could have made a couple of additional optimizations.  The first is
to pre-warm the cache by making sure the data we are about to copy is
actually in the cache; that can reduce memory latency.  The second would
have been to use the floating point registers (or the MMX registers),
since they can safely transfer 8 bytes at a time instead of just 4.
Both optimizations have been shown to give significant improvements in
real code.  I didn't bother with them because it wasn't worth the effort
for an example.)
There are several key points to this reply.  And none of them are 'my
code / compiler is better than yours'.  (I'm not at all questioning the
timing results you got on your system.  I'm just pointing out that you
wouldn't get your results on my computer or compiler.)
(Continued to next message)
--- QScan/PCB v1.19b / 01-0162
---------------
* Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

SOURCE: echomail via exec-pc
