echo: c_plusplus
to: HERMAN SCHONFELD
from: JERRY COFFIN
date: 1997-04-25 11:30:00
subject: DJGPP OPTIMIZATIONS

On (23 Apr 97) Herman Schonfeld wrote to Darin McBride...
 HS> Most compilers optimize code well, but just about all don't optimize
 HS> the fastest.
 HS> Since you obviously fail to comprehend this I shall demonstrate.
[ demo code elided ]
 HS> See if Watcom will do that for you.. I think not..
 HS> Watcom results :-
 HS> ---------------------------+
 HS> LOOP  |   386   |  486     |
 HS> ------|---------|----------+
 HS> loop1 | 81.1 cy | 35.3 cy  |
 HS> loop2 | 38.7 cy | 29.7 cy  |
 HS> loop3 | 15.9 cy | 9.5  cy  |
 HS> ---------------------------+
 HS> cy are cycles incase you haven't noticed.
 HS> Now that I have taught you, you may actually want to read the thread
 HS> and then come back with your apology.
Nobody needs to apologize to you at all.  You are the one who needs to
go back and read what's been written.  The mere fact that you get one
set of results with your compiler doesn't prove a thing about what
anybody else is going to get with another compiler.  For the sake of
comparison, I ran your code through MS C.  The first loop, which you
assumed would be the slowest, was in fact consistently the fastest.
Your "optimizations" consistently slowed the code down.  In fact, with
MS C, the first loop compiled to:
    mov     eax, DWORD PTR ?bufSize@@3HA    ; bufSize
    push    edi
    test    eax, eax
; Line 15
    jle     SHORT $L158
    mov     esi, OFFSET FLAT:?buf2@@3PADA   ; buf2
    mov     edi, OFFSET FLAT:?buf1@@3PADA   ; buf1
    mov     ecx, eax
    shr     ecx, 2
    rep     movsd
    mov     ecx, eax
    and     ecx, 3
    rep     movsb
Note that the bulk of the move is done as efficiently as a 486 possibly
can: with a `rep movsd'.
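For anyone who doesn't read assembly, the listing above amounts to
roughly the following: copy bufSize >> 2 dwords, then the bufSize & 3
bytes left over.  This is only a sketch of the idea, not what any
compiler literally emits; the function name and its parameters are made
up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): mirrors the shape of
// the listing, which moves n >> 2 dwords and then n & 3 single bytes.
void copy_dwords_then_bytes(char* dst, const char* src, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);

    std::size_t dwords = n >> 2;   // the count used for `rep movsd`
    std::size_t tail   = n & 3;    // the count used for `rep movsb`

    for (std::size_t i = 0; i < dwords; ++i) {
        std::uint32_t word;
        // A fixed 4-byte memcpy sidesteps alignment and aliasing problems.
        std::memcpy(&word, src, sizeof word);
        std::memcpy(dst, &word, sizeof word);
        src += sizeof word;
        dst += sizeof word;
    }

    for (std::size_t i = 0; i < tail; ++i)
        *dst++ = *src++;           // leftover 0..3 bytes
}
```

With n = 11, for example, that is two dword moves followed by three
byte moves.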
By contrast, your "optimized" code produced a mess; the resulting code
is over 5 times as long, and roughly 20% slower.
Now, if you write only for Watcom, your "optimization" might be useful.
If you want to produce good code with nearly every compiler on earth,
and optimal code with most, consider using:
    memcpy(buf1, buf2, bufSize);
It's pretty rare that this will produce poorer code than an explicit
loop; with many compilers it will do considerably better.  Come to that,
most decent optimizers know how to unroll loops on their own, and most
produce better code for the unrolled loop than you can explicitly.
Generally if you think you need to unroll a loop by hand, you really
just need to learn to use your compiler.
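To make the comparison concrete, here is a minimal sketch of the two
approaches side by side.  The demo code under discussion isn't
reproduced here, so the loop is a stand-in rather than a quote, and the
function names are invented for illustration.  The thing to get right
is that the length passed to memcpy is the number of bytes to copy, not
the size of the variable that holds the count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for a plain byte-at-a-time copy loop (hypothetical name; the
// original demo code is not available).
void copy_with_loop(char* dst, const char* src, int count)
{
    for (int i = 0; i < count; ++i)
        dst[i] = src[i];
}

// Same job done by the library; the compiler or runtime is free to pick
// the best block-move sequence for the target.
void copy_with_memcpy(char* dst, const char* src, int count)
{
    assert(count >= 0);
    if (count > 0)
        std::memcpy(dst, src, static_cast<std::size_t>(count));
}
```

Both functions leave the same bytes in dst; which one is faster depends
on the compiler and target, so time it with your own tools before
believing anybody's table, mine included.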
This raises the question: has Watcom's compiler _really_ gotten this much
worse since I used it last?  At one time, it had a perfectly good
optimizer, but if your results are truly indicative of the best the
compiler can do, it's gotten a LOT worse in the last several years.
    Later,
    Jerry.
... The Universe is a figment of its own imagination.
--- PPoint 1.90
---------------
* Origin: Point Pointedly Pointless (1:128/166.5)

SOURCE: echomail via exec-pc