On (23 Apr 97) Herman Schonfeld wrote to Darin McBride...
HS> Most compilers optimize code well, but just about all don't optimize
HS> the fastest.
HS> Since you obviously fail to comprehend this I shall demonstrate.
[ demo code elided ]
HS> See if Watcom will do that for you.. I think not..
HS> Watcom results :-
HS> ------+---------+---------+
HS> LOOP  | 386     | 486     |
HS> ------+---------+---------+
HS> loop1 | 81.1 cy | 35.3 cy |
HS> loop2 | 38.7 cy | 29.7 cy |
HS> loop3 | 15.9 cy |  9.5 cy |
HS> ------+---------+---------+
HS> cy are cycles incase you haven't noticed.
HS> Now that I have taught you, you may actually want to read the thread
HS> and then come back with your apology.
Nobody needs to apologize to you at all. You are the one who needs to
go back and read what's been written. The mere fact that you get one
set of results with your compiler doesn't prove a thing about what
anybody else is going to get with another compiler. For the sake of
comparison, I ran your code through MS C. The first loop, which you
assumed would be the slowest, was in fact consistently the fastest.
Your "optimizations" consistently slowed the code down. In fact, with
MS C, the first loop compiled to:
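        mov     eax, DWORD PTR ?bufSize@@3HA    ; bufSize
        push    edi
        test    eax, eax
; Line 15
        jle     SHORT $L158                     ; skip the copy if bufSize <= 0
        mov     esi, OFFSET FLAT:?buf2@@3PADA   ; buf2 (source)
        mov     edi, OFFSET FLAT:?buf1@@3PADA   ; buf1 (destination)
        mov     ecx, eax
        shr     ecx, 2                          ; ecx = bufSize / 4
        rep     movsd                           ; copy 4 bytes at a time
        mov     ecx, eax
        and     ecx, 3                          ; ecx = bufSize % 4
        rep     movsb                           ; copy the 0-3 leftover bytes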
Note that the majority of the move is done as efficiently as a 486 can
possibly do: with a `rep movsd'.
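For anybody who doesn't read assembly, here is a rough C sketch of what
that listing does: move the bulk of the buffer four bytes at a time,
then mop up the zero to three bytes left over. The function name and
parameters are hypothetical, made up for illustration; this is not the
compiler's source and not the elided demo code.

```c
#include <string.h>

/* Hypothetical helper, not from the original post: copies n bytes the
   way the listing does, in 4-byte chunks first, then the leftovers. */
void copy_dwords_then_bytes(char *dst, const char *src, int n)
{
    if (n <= 0)                 /* mirrors the test/jle guard */
        return;

    int dwords = n >> 2;        /* shr ecx, 2 : number of 4-byte chunks */
    int tail   = n & 3;         /* and ecx, 3 : 0-3 bytes left over     */

    for (int i = 0; i < dwords; i++) {
        memcpy(dst, src, 4);    /* stands in for one movsd; a fixed-size
                                   memcpy sidesteps alignment problems  */
        dst += 4;
        src += 4;
    }

    while (tail-- > 0)          /* stands in for rep movsb */
        *dst++ = *src++;
}
```

Calling copy_dwords_then_bytes(buf1, buf2, bufSize) should leave buf1
holding the same bytes a plain byte-by-byte loop would; the only
difference is how many bytes move per step.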
By contrast, your "optimized" code produced a mess; the resulting code
is over 5 times as long, and roughly 20% slower.
Now, if you write only for Watcom, your "optimization" might be useful.
If you want to produce good code with nearly every compiler on earth,
and optimal code with most, consider using:
memcpy(buf1, buf2, bufSize);
It's pretty rare that this will produce poorer code than an explicit
loop; with many compilers it will do considerably better. Come to that,
most decent optimizers know how to unroll loops on their own, and most
produce better code from their own unrolling than you will get by
unrolling the loop by hand. Generally, if you think you need to unroll
a loop by hand, you really just need to learn to use your compiler.
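To make the comparison concrete, here is a minimal sketch of the two
approaches side by side. The names buf1, buf2 and bufSize come from the
listing above; BUF_MAX, the function names and the exact shape of the
loop are assumptions on my part, since the demo code isn't reproduced
here.

```c
#include <string.h>

#define BUF_MAX 4096            /* assumed capacity, for illustration only */

char buf1[BUF_MAX];             /* destination */
char buf2[BUF_MAX];             /* source */
int  bufSize = BUF_MAX;         /* number of bytes to copy */

/* Assumed shape of a plain byte-at-a-time loop. */
void copy_with_loop(void)
{
    for (int i = 0; i < bufSize; i++)
        buf1[i] = buf2[i];
}

/* The same copy, left to the library and the compiler. */
void copy_with_memcpy(void)
{
    if (bufSize > 0)
        memcpy(buf1, buf2, (size_t)bufSize);
}
```

Both functions leave buf1 with the same contents. Which one is faster
depends on the compiler and the switches you give it, so look at the
generated code or time it yourself before deciding.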
This raises the question: has Watcom's compiler _really_ gotten this
much worse since I used it last? At one time, it had a perfectly good
optimizer, but if your results are truly indicative of the best the
compiler can do, it's gotten a LOT worse in the last several years.
Later,
Jerry.
... The Universe is a figment of its own imagination.
--- PPoint 1.90
---------------
* Origin: Point Pointedly Pointless (1:128/166.5)