Can anyone point me to docs that explain the best ways to achieve a certain goal?
For example, which is better / faster:
Code:
add eax, 0x2
or
Code:
inc eax
inc eax
Cheers.
Depends on what's limiting your speed. If it's cache speed (memory speed), you're better off with the second one on x86-32, since the two incs take only 2 bytes, while the add takes 3 bytes if the assembler is smart enough to use the sign-extended 8-bit immediate, and 6 if it uses a full 32-bit immediate. On AMD64 you're better off with the first, since the 1-byte inc encoding has been removed (those opcodes became REX prefixes), so the two incs end up being 4 bytes instead of 2, against 3 for the add.
XStream wrote: Hey,
Can anyone point me to docs that explain the best ways to achieve a certain goal?
For example, which is better / faster:
Code:
add eax, 0x2
or
Code:
inc eax
inc eax
I have seen different code that does it one way or the other; it would be nice to be able to work out the most efficient and the least byte-consuming instructions.
Cheers.
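The byte counts in the reply above can be checked by writing the encodings out by hand. This is only a sketch: the array names and the `bytes_saved` helper are made up for illustration, and the bytes are the standard encodings as far as I know them (the EAX-only short form `05 imm32` would make the full-immediate add 5 bytes rather than 6).

```c
#include <stddef.h>

/* Hand-assembled encodings; the array names are illustrative only. */

/* add eax, 2 using the sign-extended 8-bit immediate form (83 /0 ib) */
static const unsigned char add_imm8[] = { 0x83, 0xC0, 0x02 };

/* add eax, 2 using the generic 32-bit immediate form (81 /0 id) */
static const unsigned char add_imm32[] = { 0x81, 0xC0, 0x02, 0x00, 0x00, 0x00 };

/* inc eax; inc eax in 32-bit mode, where 0x40 is the 1-byte inc eax */
static const unsigned char inc2_x86[] = { 0x40, 0x40 };

/* inc eax; inc eax in 64-bit mode, where 0x40..0x4F are REX prefixes,
 * so inc has to use the 2-byte form (FF /0) */
static const unsigned char inc2_amd64[] = { 0xFF, 0xC0, 0xFF, 0xC0 };

/* Bytes saved by picking `candidate` over `baseline` (negative if larger). */
static long bytes_saved(size_t baseline, size_t candidate)
{
    return (long)baseline - (long)candidate;
}
```

For example, `bytes_saved(sizeof add_imm8, sizeof inc2_x86)` gives the saving of the inc pair over the short add in 32-bit mode, and the same call with `inc2_amd64` shows the inc pair losing in 64-bit mode. Size is only one factor, of course; this says nothing about how fast either sequence executes.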
I've noticed that the more I've coded stuff that I want to be as fast as possible (or usually a bit faster), the more I've found that caring about the assembler level is totally useless. Sure, you can shave a few percent off some code by carefully tuning your inner loops, but one should never do that unless one is totally convinced that there is no way to improve the code on a higher level.
XStream wrote: Thanks everyone, that gives me a lot to think about. I am not going to be trying to optimize every single instruction; I am really just trying to learn how to identify where such optimizations would be helpful and in some cases a must.
I concur completely. I did this once in QBasic code, with lots of manually compiled inline assembly, which used VESA on old computers to do some graphical work. Now, video memory is even slower, and each cache miss is exaggerated because you have to bank switch, but in the end the problem is the same. The program got a few factors (whole factors) faster when I used horizontal lines to fill a square instead of vertical lines. Horizontal lines are contiguous in memory, whereas vertical lines cut straight through a number of banks.
mystran wrote: An even more interesting problem is the cache problem: in some recent code I wrote something like:
Code:
for(int x = 0; x < xmax; x++)
    for(int y = 0; y < ymax; y++)
        dosomething(x, y);
The problem (of course) was that I was doing stuff which involved a two-dimensional array, stored in memory so that each row came right after the previous one. Since this happened to be in a few inner loops, switching the places of the two for-loops (so that x was inner) sped up the code about 20%. The first version was of course taking many more cache misses, since it was touching memory at "random" locations instead of consecutive locations.
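To make the loop swap concrete, here is a minimal sketch. The `grid` array, its dimensions and the summing body are stand-ins I made up for the unspecified `dosomething(x, y)`; the only point is which index moves in the inner loop.

```c
#include <stddef.h>

enum { XMAX = 512, YMAX = 512 };   /* hypothetical dimensions */

/* Row-major storage: grid[y][x] and grid[y][x + 1] are adjacent in memory. */
static int grid[YMAX][XMAX];

/* Fill with a known pattern so both traversals can be compared. */
static void fill_grid(void)
{
    for (int y = 0; y < YMAX; y++)
        for (int x = 0; x < XMAX; x++)
            grid[y][x] = x + y;
}

/* x outer, y inner (the original order): every inner step jumps a whole
 * row (XMAX ints) ahead, so consecutive accesses are far apart. */
static long sum_column_order(void)
{
    long total = 0;
    for (int x = 0; x < XMAX; x++)
        for (int y = 0; y < YMAX; y++)
            total += grid[y][x];
    return total;
}

/* y outer, x inner (the swapped order): the inner loop walks
 * consecutive addresses within one row. */
static long sum_row_order(void)
{
    long total = 0;
    for (int y = 0; y < YMAX; y++)
        for (int x = 0; x < XMAX; x++)
            total += grid[y][x];
    return total;
}
```

Both functions return the same sum; only the memory access pattern differs. How much the swap helps depends on the array size, the cache and the compiler, so time it on the real code rather than trusting a fixed percentage.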
For the same reason, avoiding buffer sizes of 2^n might be a good idea, even though "wrapping" from the end to the beginning of a 2^n buffer is a single bitwise AND, while any other size needs a test. It happens that if you use several buffers, the "slower" code (with the test) is often actually faster: caches are indexed by address bits, so several aligned 2^n buffers tend to land on the same cache lines and evict each other, while non-2^n buffers result in fewer such evictions.
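The two wrapping strategies mentioned above look roughly like this. The function names are hypothetical, and the sketch only shows the index arithmetic; it does not demonstrate the cache effect, which depends on where the buffers end up in memory.

```c
#include <stddef.h>

/* Wrap with a bitwise AND. Only correct when size_pow2 is a power of two,
 * because size_pow2 - 1 is then a mask of the low bits. */
static size_t next_index_pow2(size_t i, size_t size_pow2)
{
    return (i + 1) & (size_pow2 - 1);
}

/* Wrap with a test. Works for any non-zero size, at the cost of a
 * compare (and possibly a branch) per step. */
static size_t next_index_any(size_t i, size_t size)
{
    i++;
    return (i == size) ? 0 : i;
}
```

With a size of 8, `next_index_pow2(7, 8)` wraps to 0; with a size of 10, `next_index_any(9, 10)` does the same through the test. Whether the second form ends up faster with several buffers is something to measure on the target machine.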