HS>CB>Even my 8 year old Microsoft Quick C, a compiler hardly known for
HS>CB>aggressive optimization, and designed to run on even an XT, will do
HS>CB>that conversion.
HS>Borland compilers fail to perform such an optimization.
Even Turbo C++ 3.0 will do it! And that's Borland's _LOW_ end compiler.
And TC 3.0 is, what, 5 years old? Older TCs, such as TC 1.0, might not.
But not many people have those anymore.
If you don't have a compiler that will do something as simple as
strength reduction, constant folding, etc., then pick up a copy of the
free DJGPP. Judging from the tests you post, it gives quite a bit
better results than whatever compiler you are using.
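Just so we're talking about the same thing, here's a quick sketch (mine,
written for this message, not anything from the earlier posts) of what
those two optimizations do by hand. A compiler worth anything does this
for you:

/* Before: a constant expression and an indexed multiply inside the
   loop. */
void scale1(int *dst, const int *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * (60 * 60);
}

/* After: constant folding turns 60*60 into 3600 at compile time, and
   strength reduction replaces the indexed address calculation
   (base + i*sizeof(int)) with plain pointer increments. */
void scale2(int *dst, const int *src, int n)
{
    const int *end = src + n;
    while (src < end)
        *dst++ = *src++ * 3600;
}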
HS>Yes, I am aware of that. I was however showing the different methods of
HS>writing a loop. On your machine the index version performed faster than
HS>the pointer version. It did _NOT_ do this on my machine. That is why I
HS>am not very happy about letting a compiler do the 'dirty-work' for me.
But you can't decide in advance which will be better on what CPU.
You've got to actually run it on all those different CPUs. And if you
happen to be distributing source, rather than executable, you don't even
have the option of optimizing it specifically for your compiler, because
as those examples you gave showed, although your optimizations improved
things for your particular compiler, they made things worse for most
people.
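For anyone who missed the earlier messages, the two loop styles being
argued about look roughly like this (my reconstruction, not the exact
code HS posted):

/* Index version. */
void copy_i(char *dst, const char *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pointer version. */
void copy_p(char *dst, const char *src, int n)
{
    while (n-- > 0)
        *dst++ = *src++;
}

A good optimizer compiles both to the same code anyway; on a compiler
that doesn't, which one wins depends entirely on the CPU, which is
exactly the point.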
HS>Your program would've performed incorrectly.
HS>Take note of '#define 99999'
I noticed that. And I noted that in the message. But since it was only
an _EXAMPLE_, I chose not to mess with it and just mention that if you
really did it, you would have to deal with it. It would have increased
the complexity, but not helped or hurt the run time. Dealing with those
things isn't hard. I would have needed to check at the beginning to
make sure the pointer was aligned properly, and to deal with a few bytes
of misaligned data at both the beginning and the end. (Which could be up
to 3 bytes for a 32 bit machine, or 7 for a 64 bit machine.)
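To show it really isn't hard, here's roughly what that looks like on a
32 bit machine (a sketch written for this message, not the code from the
earlier example):

/* Copy n bytes, doing the bulk of the work a long at a time. */
void copy_long(char *dst, const char *src, unsigned n)
{
    /* Head: copy bytes until src sits on a long boundary
       (up to 3 of them on a 32 bit machine). */
    while (n > 0 && ((unsigned long)src & (sizeof(long) - 1)) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Bulk: whole longs.  (Assumes dst ended up aligned too, or that
       the machine tolerates misaligned stores.) */
    while (n >= sizeof(long)) {
        *(long *)dst = *(const long *)src;
        dst += sizeof(long);
        src += sizeof(long);
        n -= sizeof(long);
    }
    /* Tail: whatever bytes are left over. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}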
HS>CB>The next optimization is, of course, the most obvious one.... I used
HS>CB>memcpy(). I did it just as a way of benchmarking the performances.
HS>So that means, 'who cares, it's fast on my machine', right?
More like, 'the library writer probably already put in a lot of work
making it run fast, so why should I waste my time reinventing the wheel
for a 0.0000001% improvement'. I'd much rather spend my time working on
the _algorithm_ and improve things by a factor of 2 than slave over
tweaking for a 1/1000th of 1 percent improvement that may or may not
show up, depending on what system you run it on.
HS>You could also add code patching to speed it up by 50%.
Depends on what you mean by code patching. That means different things
to different people.
I am assuming you know enough to not mean self modifying code...
That's a rather dangerous practice. Now, obviously self modifying code
is definitely processor family specific, so you don't have the
portability problems, like if you were going to post something in the
echo, but even within a single family (like the x86) there are dangers.
If the patch is too close to what's being prefetched and decoded, it can
cause major pipeline stalls, _ASSUMING_ the CPU even catches it. (In
fact, self modifying code is the way to tell the difference between an
8086 and an 8088, because their prefetch queues are different sizes: the
8086 will _not_ catch the modified code and will execute the old code.)
Also, on some systems, it'll cause the program to crash. And some OSs
write protect (with the MMU) the program itself to prevent changes in
the program. And it would play major havoc on a multi-user/multitasking
OS that shares a single copy of the code among multiple tasks/processes
running it. And some CPUs have separate code and data address spaces and you
wouldn't even be able to modify it. (Of course, as I said, since self
modifying code is processor specific, you wouldn't run into that
situation.)
--- QScan/PCB v1.19b / 01-0162
* Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)