HS>CB>Even my 8 year old Microsoft Quick C, a compiler hardly known for
HS>CB>aggressive optimization, and designed to run on even an XT, will
HS>CB>do that conversion.
HS>Borland compilers fail to perform such an optimization.
CB>Even Turbo C++ 3.0 will do it! And that's Borland's _LOW_ end
CB>compiler. And TC 3.0 is, what, 5 years old. Older TCs, such as
CB>TC 1.0, might not. But not many people have it anymore.
CB>If you don't have a compiler that will do something as simple as
CB>strength reduction, constant folding, etc., then pick up a copy of the
CB>free DJGPP. Judging from the tests you post, it does give quite a bit
CB>better results than what ever compiler you are using.
HS>Yes, I am aware of that. I was however showing the different methods
HS>of writing a loop. On your machine the index version performed faster
HS>than the pointer version. It did _NOT_ do this on my machine. That is
HS>why I am not very happy about letting a compiler do the 'dirty-work'
HS>for me.
CB>But you can't decide in advance which will be better on what CPU.
CB>You've got to actually run it on all those different CPUs. And if you
CB>happen to be distributing source, rather than executable, you don't even
CB>have the option of optimizing it specifically for your compiler, because
CB>as those examples you gave showed, although your optimizations improved
CB>things for your particular compiler, it made things worse for most
CB>people.
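Just to make the comparison concrete, the two styles being argued about look roughly like this (hypothetical copy loops, not the code either of us posted). Which one wins depends on the compiler and the CPU, which is exactly the point about not being able to decide in advance:

```c
#include <stddef.h>

/* Index version: the compiler sees dst[i]/src[i] and can pick
   whatever addressing mode suits the target CPU. */
void copy_indexed(char *dst, const char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pointer version: the increments are written out by hand. */
void copy_pointer(char *dst, const char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```

On one CPU the indexed addressing is free and the index version wins; on another the extra register pressure hurts and the pointer version wins. A good optimizer often compiles both to the same thing anyway.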
HS>Your program would've performed incorrectly.
HS>Take note of '#define 99999'
CB>I noticed that. And I noted that in the message. But since it was only
CB>an _EXAMPLE_, I chose not to mess with it and just mention that if you
CB>really did it, you would have to deal with it. It would have increased
CB>the complexity, but not helped or hurt the run time. Dealing with those
CB>things isn't hard. I would have needed to check at the beginning to
CB>make sure it was aligned properly, and at both the beginning and the end
CB>deal with a few bytes of misaligned data. (Which could be up to 3 bytes
CB>for a 32 bit machine, or 7 for a 64 bit machine.)
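What that head/tail handling looks like, sketched out (this is my reconstruction of the idea, not CB's actual routine): copy single bytes until the pointer is word aligned, move 32 bit words through the middle, then mop up the leftover bytes:

```c
#include <stddef.h>
#include <stdint.h>

void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* Head: byte copies until dst is 4-byte aligned -- up to 3
       bytes on a 32 bit machine, up to 7 with 8-byte words. */
    while (n > 0 && ((uintptr_t)dst & 3) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Middle: a 32 bit word at a time. Only safe if src landed on
       the same alignment; a real routine would fall back otherwise. */
    if (((uintptr_t)src & 3) == 0) {
        while (n >= 4) {
            *(uint32_t *)dst = *(const uint32_t *)src;
            dst += 4;
            src += 4;
            n -= 4;
        }
    }

    /* Tail: whatever bytes are left over. */
    while (n--)
        *dst++ = *src++;
}
```

More complex than a byte loop, but as CB says, no harder to get right, and the middle loop moves 4 bytes per iteration instead of 1.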
HS>CB>The next optimization is, of course, the most obvious one.... I
HS>CB>used memcpy(). I did it just as a way of benchmarking the
HS>CB>performance.
HS>So that means. 'who cares, it's fast on my machine', right?
CB>More like, 'the library writer probably already put in a lot of work
CB>making it run fast, so why should I waste my time reinventing the wheel
CB>for a 0.0000001% improvement'. I'd much rather spend my time working on
CB>the _algorithm_ and improve things by a factor of 2 than slave over
CB>tweaking for a 1/1000th of 1 percent improvement that may or may not
CB>show up, depending on what system you run it on.
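Which in practice means the whole hand-tuned loop above collapses to one library call (the frame size below is made up for illustration; think a 320x200 byte-per-pixel screen):

```c
#include <string.h>

#define SCREEN_BYTES 64000   /* hypothetical 320x200, 1 byte/pixel */

/* The library writer already did the alignment and word-size
   tricks -- possibly in assembly -- so one call replaces the loop. */
void blit(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, SCREEN_BYTES);
}
```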
HS>You could also add code patching to speed it up by 50%.
CB>Depends on what you mean by code patching. That means different things
CB>to different people.
CB>I am assuming you know enough to not mean self modifying code...
CB>That's a rather dangerous practice. Now, obviously self modifying code
CB>is definitely processor family specific, so you don't have the
CB>portability problems, like if you were going to post something in the
CB>echo, but even within a single family (like the x86) there are dangers.
CB>If the patch is too close to what's being prefetched and decoded, it can
CB>cause major pipeline stalls, _ASSUMING_ the CPU even catches it. (In
CB>fact, self modifying code is the way to tell the difference between an
CB>8086 and an 8088: their prefetch queues are different sizes, so the
CB>8086 will _not_ catch the modified code and will execute the old code.)
CB>Also, on some systems, it'll cause the program to crash. And some OSs
CB>write protect (with the MMU) the program itself to prevent changes.
CB>And it would play major havoc on a multi-user/tasking OS that shares
CB>a single piece of code among multiple tasks/processes running it. And
CB>some CPUs have separate code and data address spaces, so you
CB>wouldn't even be able to modify it. (Of course, as I said, since self
CB>modifying code is processor specific, you wouldn't run into that
CB>situation.)
Yes, I am aware of that, that is why I suggest compiling specifically for a
machine. I don't mind spending a few hours compiling a dozen executables. If
you've ever used code patching (the correct way) in your own code, you would
notice that the improvement is far too great to ignore.
My pure-C interpolated voxel mapper ran at 310 fps.
My C + asm interpolated voxel mapper ran at 294 fps.
My C + asm with code patching interpolated voxel mapper ran at a whopping
641 fps!
That's well over a 50% improvement.
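One portable cousin of that trick, for anyone who doesn't want to rewrite instructions at run time: select the tuned routine once at startup through a function pointer, so the inner loop never re-tests the parameter. This is not the voxel mapper's actual code; all names below are made up for illustration:

```c
/* Two variants of the same inner routine -- imagine one tuned per
   CPU or per scale factor, as in the voxel mapper above. */
int scale_x8(int x) { return x << 3; }  /* power-of-two factor: shift */
int scale_x3(int x) { return x * 3; }   /* general factor: multiply */

/* 'Patch' the dispatch once, up front, instead of branching on the
   factor inside the inner loop for every pixel. */
int (*scale)(int);

void pick_scale(int factor)
{
    scale = (factor == 8) ? scale_x8 : scale_x3;
}
```

Real code patching writes the constant or opcode straight into the instruction stream and avoids even the indirect call, which is where the bigger wins come from, but it carries every danger CB listed above.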
... Get OS/2 2.0 - The best Windows tip around!
--- Ezycom V1.48g0 01fd016b
---------------
* Origin: Fox's Lair BBS Bris Aus +61-7-38033908 V34+ Node 2 (3:640/238)