Fastest / Smallest Code

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
XStream

Fastest / Smallest Code

Post by XStream »

Hey,

Can anyone point me to docs that explain the best ways to achieve a certain goal?

For example, which is better / faster:

Code: Select all

add   eax, 0x2
or

Code: Select all

inc   eax
inc   eax
I have seen different code that does it one way or the other; it would be nice to be able to work out which instructions are the most efficient and consume the fewest bytes.

Cheers.
Curufir

Re:Fastest / Smallest Code

Post by Curufir »

Use the first one; the speed bonus comes from the CPU being able to do another operation at the same time.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:Fastest / Smallest Code

Post by Candy »

XStream wrote: Hey,

Can anyone point me to docs that explain the best ways to achieve a certain goal?

For example, which is better / faster:

Code: Select all

add   eax, 0x2
or

Code: Select all

inc   eax
inc   eax
I have seen different code that does it one way or the other; it would be nice to be able to work out which instructions are the most efficient and consume the fewest bytes.

Cheers.
Depends on what's limiting your speed. If it's cache speed (memory speed), you're better off with the second one on 32-bit x86, since it takes only 2 bytes while the first takes 3 bytes (if encoded with intelligence, using the imm8 form) and 6 if done without. On AMD64 you're better off with the first: the 1-byte inc encoding has been removed (those opcodes became REX prefixes), so the two incs end up as 4 bytes against 3 for the add.

If you're limited by operations more than by cache size (such as in tight loops where you must limit the operation count), use the first. It might take a byte more, but it leaves an execution unit free. If you have one to spare at that point, take the second again. On P4s, take the second one just about any time: it logically has four integer pipelines which can all do an inc (although they are related, so they might not pipeline that well).

The conclusion would be that for older computers (up to the Pentium) the second was best because it was smaller, but starting with the superscalar Pentium MMX or PPro/P2 it's best to choose the first, because it takes fewer instructions and less decoding time, on a P4 even less cache space, and it's faster (no dependencies), so there's almost no case left for the second. Guess why AMD64 removed the 1-byte inc/dec encodings?

As with all tiny assembly-level things, try to benchmark every place you apply it to see which is faster. Test on as many of the computers your program is going to run on as possible, insofar as they differ. Then decide intelligently which one actually is faster there.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re:Fastest / Smallest Code

Post by Solar »

The book you're looking for is named "IA-32 Intel Architecture Optimization Reference Manual". AMD has released a similar book. However, such minute optimization only makes sense in a place you have positively identified as a bottleneck. Everywhere else, you'll probably spend more time finding the "best" way to do something than all your users combined will ever "lose" to the second-best way.
Every good solution is obvious once you've found it.
Schol-R-LEA

Re:Fastest / Smallest Code

Post by Schol-R-LEA »

First off, unless this happens to be the critical bottleneck in your inner loop, optimizing such a small difference gains you nothing. The difference is probably less than two cycles over the entire program run. While tight code is admirable, this sort of micro-optimization is likely to cost you more time than the difference in the code would ever save.

Second, the only real way to determine the difference is to profile it, and even then the actual performance will vary with the CPU model, the speed of the bus, the RAM, the L1 and L2 caches, and the overall code it is part of. Don't expect to find hard and fast answers, as what is fastest in one situation may be slower in another.

If you are serious about optimized code, then I recommend reading the Graphics Programming Black Book by Michael Abrash. While it is definitely showing its age, the basic principles and techniques it describes are still relevant for any kind of programming.
XStream

Re:Fastest / Smallest Code

Post by XStream »

Thanks everyone, that gives me a lot to think about. I am not going to try to optimize every single instruction; I am really just trying to learn how to identify where such optimizations would be helpful and, in some cases, a must.

Cheers. :)
mystran

Re:Fastest / Smallest Code

Post by mystran »

XStream wrote: Thanks everyone, that gives me a lot to think about. I am not going to try to optimize every single instruction; I am really just trying to learn how to identify where such optimizations would be helpful and, in some cases, a must.
I've noticed that the more I've written code that I want to be as fast as possible (or usually a bit faster), the more I've realized that caring about the assembly level is mostly useless. Sure, you can shave a few percent off some code by carefully tuning your inner loops, but one should never do that unless one is totally convinced that there is no way to improve the code at a higher level.

It turns out that more often than not you can optimize code at a higher level, and sometimes such optimizations involve speed/memory/binary-size trade-offs, but not always. One thing that often pays off is to think of all math as math, not code, and then try to simplify it as math.

One example might be evaluating an n-th degree polynomial. You could certainly do about 2n multiplies and n additions, as in a1*x^n + a2*x^(n-1) + ..., or you could group the terms and save half the multiplies, as in ((a1*x + a2)*x + a3)*x + ...

In some cases this is not the way to go, however. Sometimes you can trade some accuracy for speed and use another approach based on derivatives: add the n-th derivative to the (n-1)-th, then that value to the (n-2)-th, and so on until you hit the actual function. Of course you now take on not only accuracy problems, which can accumulate, but also possible cache pressure from the added variables. On the other hand, if you trace short enough spans not to lose too much accuracy, and you only need a low n (say 4) so the values fit in registers, then this might well be the fastest way: just n additions per step.

The downside, of course, is that you now have to calculate the derivatives, which might be more trouble than it's worth. The point is: it's irrelevant how long an instruction takes if you can eliminate it altogether.

An even more interesting problem is the cache-problem: in some recent code I wrote something like:

Code: Select all

for(int x = 0; x < xmax; x++)
  for(int y = 0; y < ymax; y++)
    dosomething(x, y);
The problem (of course) was that I was doing stuff that involved a two-dimensional array, stored in memory so that each row came right after the previous one. Since this happened in a few inner loops, swapping the two for-loops (so that x was the inner one) sped the code up by about 20%. The first version was of course taking many more cache misses, since it touched memory at "random" locations instead of consecutive ones.

For the same reason, avoiding buffer sizes of 2^n can be a good idea, even though "wrapping" from the end to the beginning of a 2^n buffer is a bitwise AND while anything else needs a test. If you use several buffers, the "slower" code (with the test) is often actually faster: caches are aligned, so non-2^n buffer sizes result in fewer cache evictions.

I would conclude that individual instruction timings are no longer relevant, except possibly for some floating-point instructions like fsin/fcos (which can take over 200 clocks).

PS. Yeah I know this is again one of the "less well written" posts. Sorry about that.
Curufir

Re:Fastest / Smallest Code

Post by Curufir »

I see only three real reasons for dropping down to assembly language while coding.

1. Amusement. This is my personal reason, because I just find writing in assembly language a lot of fun.

2. Space. If you're doing work where extra memory would vastly increase the cost of the end product, then assembly might be your only option, e.g. if the original spec allows only a 4K RAM chip, you are likely to start by throwing away the compiler. 99.9% of programmers will never be able to use this as a reasonable argument for assembly language.

3. Speed. If you've pored over your code, isolated the inner loops, optimized the algorithms as much as you can, and your code still can't run within the required speed parameters, then, and only then, does speed become a valid reason to use assembly. Chances are that unless you're a pretty good assembly language programmer, the compiler is going to beat whatever you come up with anyway.

As far as x86 OS dev goes, only the first reason holds any weight (IMHO). Anyone using assembly because they think it will automatically make their code faster is fooling themselves.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:Fastest / Smallest Code

Post by Candy »

mystran wrote: An even more interesting problem is the cache-problem: in some recent code I wrote something like:

Code: Select all

for(int x = 0; x < xmax; x++)
  for(int y = 0; y < ymax; y++)
    dosomething(x, y);
The problem (of course) was that I was doing stuff that involved a two-dimensional array, stored in memory so that each row came right after the previous one. Since this happened in a few inner loops, swapping the two for-loops (so that x was the inner one) sped the code up by about 20%. The first version was of course taking many more cache misses, since it touched memory at "random" locations instead of consecutive ones.

For the same reason, avoiding buffer sizes of 2^n can be a good idea, even though "wrapping" from the end to the beginning of a 2^n buffer is a bitwise AND while anything else needs a test. If you use several buffers, the "slower" code (with the test) is often actually faster: caches are aligned, so non-2^n buffer sizes result in fewer cache evictions.
I concur completely. I did this once in QBasic code, with lots of manually assembled inline machine code, which used VESA on old computers to do some graphics work. There the video memory is even slower, and each cache miss hurts even more because you have to bank-switch, but in the end the problem is the same. The program got several times faster when I filled a square with horizontal lines instead of vertical ones: horizontal lines are contiguous in memory, whereas the verticals cut straight through a number of banks.

The result of this entire discussion should be: you can optimize at the assembly level, but optimizing at a higher level will almost always yield better solutions.

Even if your C code can't be optimized any better, reformulating the solution to the problem so it doesn't need that code at all makes the bottleneck disappear too ;) :)