echo: c_plusplus
to: HERMAN SCHONFELD
from: CAREY BLOODWORTH
date: 1997-04-22 21:28:00
subject: Low level optimization 1/2

Let's work through a simple project.  Let's say that we need to copy
some data from one array to another.  Basically, just like memcpy()
does, except we are doing it ourselves.  And it has to be fast because
we are doing this in a loop, and it will be done thousands of times.
Let's also say that it is a nice power of 2, somewhere between 128 and
4096 bytes.  (That simplifies the example a bit, although it wouldn't
take much effort to actually allow any lengths.)
The first thing we would write is probably something like:
void memcpy(char to[], char from[], int len)
{
    while (len--) to[len] = from[len];   /* copy a byte at a time, backwards */
}
Of course, on some computers, array indexing is slow, so instead, you
might do:
void memcpy(char *to, char *from, int len)
{
    while (len--) *to++ = *from++;
}
That might be faster on some systems.  Of course, other systems deal
with array indexing just as easily, especially if they have too few
spare registers to keep the pointers handy.
(I'll leave aside things like assigning register variables etc., because
a decent compiler, with reasonable optimization switches enabled, will
do fairly well all by itself.  And many compilers totally ignore the
register keyword anyway.)
That works well and is reliable.  We can optimize that a bit by
realizing we are only doing a char at a time, and that we could move
words at a time.  So, we'd have:
void memcpy(char *to, char *from, int len)
{
    int *t = (int *)to;
    int *f = (int *)from;
    len /= sizeof(int);                  /* count in words, not bytes */
    while (len--) *t++ = *f++;
}
That should get us an improvement.  Of course, there is a catch... the
data may not be aligned on word boundaries.  On a non-cached processor,
that would result in double the number of memory accesses!  On a cached
processor, the cache would mostly hide the extra work.  But on some
processors, such as the 68k, a misaligned access causes a fatal
exception!  Bummer!!  For the sake of the rest of the example, we'll
say the data is always aligned.  (We could test for it and copy a
couple of single bytes first if it isn't.)
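In fact, that test-and-fix-up version might look something like this.
Just a sketch, and the name memcpy_aligned is made up for the example.
Note it can only fix up 'to'; if the two pointers are misaligned
relative to each other, you're stuck with byte copies no matter what.
void memcpy_aligned(char *to, char *from, int len)
{
    int *t, *f;
    /* copy single bytes until 'to' sits on a word boundary */
    while (len && ((unsigned long)to % sizeof(int)) != 0) {
        *to++ = *from++;
        len--;
    }
    t = (int *)to;
    f = (int *)from;
    /* now move whole words */
    while (len >= (int)sizeof(int)) {
        *t++ = *f++;
        len -= sizeof(int);
    }
    /* and pick up whatever bytes are left over */
    to = (char *)t;
    from = (char *)f;
    while (len--) *to++ = *from++;
}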
Then we might try unrolling the loop.  We could do, oh, 4 transfers.
void memcpy(char *to, char *from, int len)
{
    int *t = (int *)to;
    int *f = (int *)from;
    len = len / sizeof(int) / 4;         /* len is a power of 2, so it divides evenly */
    while (len--) {
        *t++ = *f++; *t++ = *f++;
        *t++ = *f++; *t++ = *f++;
    }
}
That would certainly cut down on loop overhead.
And what if we could do the data in bigger 'gulps'?  Well, the floating
point registers can load and store 8 bytes (a 64-bit value) at a time.
On some systems with fast FPUs, that could be significantly faster.  It
would also allow the integer part of the loop to run concurrently with
the transfer itself.  The same idea applies to the MMX extensions now
appearing in several high end CPUs.  Of course, both of those would
carry a severe performance penalty (or fail outright) if the computer
has a slow FPU, or no FPU or MMX instructions at all.
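Just to illustrate, an FPU-based copy in C might look like this.  (A
sketch only; real versions did this in assembly with FILD/FISTP, the
64-bit integer load/store, because pushing arbitrary bit patterns
through double loads can mangle values that happen to look like NaNs.)
void memcpy_fpu(char *to, char *from, int len)
{
    double *t = (double *)to;      /* doubles are 8 bytes wide */
    double *f = (double *)from;
    len /= sizeof(double);
    while (len--) *t++ = *f++;     /* 8 bytes per transfer */
}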
The next improvement you might decide to make is using the built in x86
REP prefix with an appropriate string instruction.  Makes sense.
Unfortunately, that makes the program non-portable, and many processors
don't have an equivalent method.  But if the program is only ever going
to run on an x86, no problem.  Of course... although REP is faster on
some processors, on others it can actually be slower than a basic
unrolled loop!
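On a compiler with inline assembly it might look like this (GCC syntax
here, purely as a sketch; your compiler's syntax will differ):
void memcpy_rep(char *to, char *from, int len)
{
    long cnt = len / 4;               /* MOVSL moves 4 bytes per count */
    __asm__ __volatile__(
        "cld\n\t"                     /* clear direction flag: copy forward */
        "rep movsl"                   /* *EDI++ = *ESI++, ECX times */
        : "+D" (to), "+S" (from), "+c" (cnt)
        :
        : "memory", "cc");
}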
Then you'll probably consider doing this inline, saving a function
call.  How useful that is depends on the situation, but it will
probably improve things by well under 1%, since function calls don't
really take _that_ much time.
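If you do want to try it, something like this would do.  (A sketch;
'inline' isn't part of standard C, though many compilers accept it as
an extension, and in C++ it is standard.)
static inline void memcpy_inline(char *to, char *from, int len)
{
    while (len--) *to++ = *from++;   /* same loop, minus the call overhead */
}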
Then you might consider the effects of the cache.  If the data could
somehow be guaranteed to be in the cache when you need it, you could
save considerable memory latency.  Hmmmm.... Well, there is a technique
called 'warming the cache', where you load parts of your data into the
cache before you operate on them.  It depends on the length of your
cache line, but typically, reading every 16th byte will very quickly
pull the entire array into the cache, which can save quite a bit of
time when you then read the whole thing to do the copy.  The principle
could be applied to all of the above methods.  Savings can range
anywhere from 20 to 30%.  Of course, on non-cached systems performance
would suffer.  It would even suffer on systems with small caches, like
the 486DLC, whose cache ranges from just 1k to 16k depending on the
model.
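A warming pass might look something like this (a sketch, assuming a
16-byte cache line; returning the sum just keeps a clever compiler
from throwing the reads away):
int warm_cache(const char *p, int len)
{
    int i, sum = 0;
    for (i = 0; i < len; i += 16)   /* one read per assumed cache line */
        sum += p[i];                /* each read pulls a whole line in */
    return sum;
}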
Each of the ideas above can give better results than the standard one
on some systems.  But on other systems, they will give significantly
worse results.  So which is the 'Best' optimization to make???  That
depends too much on the situation, the processors that
(Continued to next message)
--- QScan/PCB v1.19b / 01-0162
* Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)
