XStream wrote:
Thanks everyone, that gives me a lot to think about. I'm not going to try to optimize every single instruction; I'm really just trying to learn how to identify where such optimizations would be helpful and, in some cases, a must.
I've noticed that the more I write code that I want to be as fast as possible (or usually just a bit faster), the more I find that worrying about the assembler level is mostly useless. Sure, you can shave a few percent off some code by carefully tuning your inner loops, but you should never do that unless you are convinced there is no way to improve the code at a higher level.
It turns out that more often than not you can optimize code at a higher level, and sometimes such optimizations do involve speed/memory/binary-size tradeoffs, but not always. One thing that often pays off is to treat all the math as math, not code, and simplify it as math before writing any code.
One example is evaluating an n-th degree polynomial. You could do it the straightforward way with about 2*n multiplies and n additions, as a_n*x^n + a_(n-1)*x^(n-1) + ... + a_0, or you could group the terms (Horner's rule) and save about half the multiplies: a_0 + x*(a_1 + x*(a_2 + x*(...))).
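As a sketch, the grouped form in C++ might look like this, assuming the coefficients are stored highest degree first in a plain array (the function name is just for illustration):

Code: Select all

// Evaluate a[0]*x^n + a[1]*x^(n-1) + ... + a[n] in the grouped (Horner) form:
// ((a[0]*x + a[1])*x + a[2])*x + ... + a[n]  -- n multiplies, n additions.
double evalPoly(const double* a, int n, double x)
{
    double result = a[0];
    for(int i = 1; i <= n; i++)
        result = result * x + a[i];
    return result;
}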
In some cases this is not the way to go, however. Sometimes you can trade a little accuracy for speed and use derivatives instead: add the n-th derivative to the (n-1)-th, then that value to the (n-2)-th, and so on until you reach the actual function value. Of course you now take on not only accuracy problems, which can accumulate, but also possible cache pressure from the added variables. On the other hand, if you trace short enough spans that the accumulated error stays acceptable, and you only need a low n (say 4) so everything fits in registers, this may well be the fastest way: just n additions per step.
The downside, of course, is that you now have to calculate the derivatives first, which might be more trouble than it's worth. The point is: it doesn't matter how long an instruction takes if you can eliminate it altogether.
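This incremental scheme is usually called forward differencing. Here is a minimal sketch for a cubic, assuming the starting differences d0..d3 have already been computed from the polynomial and the step size (the function name and parameter names are mine, just for illustration):

Code: Select all

// Trace a cubic f(t0 + k*h) for k = 0..steps-1 using only additions per step.
// d0 = f(t0); d1, d2, d3 are the first, second and third forward differences
// at t0 for step size h (the third difference of a cubic is constant).
// Computing those starting values is the extra work mentioned above.
void traceCubic(double d0, double d1, double d2, double d3,
                int steps, double* out)
{
    for(int k = 0; k < steps; k++)
    {
        out[k] = d0;   // current sample
        d0 += d1;      // value advances by the first difference
        d1 += d2;      // first difference advances by the second
        d2 += d3;      // second difference advances by the constant third
    }
}

For a degree-n polynomial you keep n+1 running values and do n additions per sample, which is exactly the trade described above.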
An even more interesting problem is the cache problem. In some recent code I wrote something like this:
Code: Select all
for(int x = 0; x < xmax; x++)
for(int y = 0; y < ymax; y++)
dosomething(x, y);
The problem (of course) was that I was working with a two-dimensional array, stored in memory so that each row comes right after the previous one. Since this happened to be in a few inner loops, switching the two for-loops around (so that x was the inner one) sped the code up by about 20%. The first version was taking many more cache misses, since it was touching memory at "random" locations instead of consecutive ones.
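For comparison, the swapped version looks like this, assuming the array is laid out row by row with x indexing within a row:

Code: Select all

// x is now the inner loop, so consecutive iterations read consecutive
// memory inside one row instead of jumping from row to row.
for(int y = 0; y < ymax; y++)
    for(int x = 0; x < xmax; x++)
        dosomething(x, y);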
For the same reason, avoiding buffer sizes of 2^n can be a good idea, even though wrapping from the end to the beginning of a 2^n buffer is a single bitwise AND while any other size needs a test. If you use several buffers, the "slower" code (with the test) is often actually faster: cache lines are assigned by address bits, so several 2^n-sized buffers tend to keep mapping onto the same cache lines and evicting each other, while non-2^n sizes cause fewer of those conflicts.
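A sketch of the two wrapping styles, with made-up sizes and function names:

Code: Select all

// Advance a ring-buffer index, power-of-two size: the wrap is a single AND.
inline int advancePow2(int index)
{
    const int SIZE = 1024;            // must be a power of two
    return (index + 1) & (SIZE - 1);
}

// Non-power-of-two size: the wrap needs a test, but several such buffers
// are less likely to keep landing on the same cache lines.
inline int advanceOther(int index)
{
    const int SIZE = 1000;
    return (index + 1 < SIZE) ? index + 1 : 0;
}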
I would conclude that individual instruction timings are no longer all that relevant, except possibly for some floating-point instructions like fsin/fcos (which can take over 200 clocks).
PS. Yeah I know this is again one of the "less well written" posts. Sorry about that.