Hi,
Tosi wrote:What I do is write it all in C and let the compiler optimize it for me. I am serious, the compiler-generated code is much faster than almost anything I could write in assembly.
By definition, "best case" assembly language is always at least as good as anything any compiler could generate, given that an assembly language programmer can start with the compiler's output and "tweak" it (testing between tweaks to see whether each tweak actually helped).
However, in general (and especially for large projects) no assembly language programmer is going to waste a huge amount of time trying to get the fastest code possible for all of the code, and because of this most assembly language programmers are frequently beaten by compilers. It's not that they can't beat the compiler, it's that it isn't worth the time it would take. The other thing is that highly optimised assembly is much, much harder to maintain, and the last thing you want is an assembly language programmer who spends a huge amount of time making the code less maintainable.
Mostly, you want to find the bottlenecks and only spend time improving those. It doesn't matter much whether the project is 100% assembly (where most code is "average" and a small part is optimised) or primarily C with a few pieces of optimised inline assembly.
Dario wrote:AFAIK, modern CPUs implement out-of-order pipelines, so that kind of optimization is "not" necessary and it only slows down programming since there are hundreds of ways to combine instructions.
In my opinion, an Intel Atom is a modern CPU. It doesn't do any instruction reordering, speculative execution, or register renaming.
Even for CPUs that do implement out-of-order pipelines, there are typically restrictions on how much the CPU can reorder, so code that puts instructions in a good order to begin with can still help.
bewing wrote:- When looping, is it better to test for ZF at the end, or do CMP? Meaning should the loop counter go up to the destination or down to zero? Or maybe JCXZ at the beginning, and JMP at the end?
It's always best to count down, and test ZF or SF at the end. Under typical usage, doing that instead of using CMP will increase code speed by about 5%.
The thing nobody has mentioned yet, and that GCC does an *awful* job of, is using your registers heavily. Using memory (including the stack) is somewhere between 3 and 500 times slower than using a register -- depending on caching. You can get more speed improvement by hand coding your ASM to use registers properly than by any other optimization.
For most loops, there's usually something inside the loop that can be used to determine loop termination, so that you don't need to use a separate register. For an example, consider:
Code: Select all
mov edi,startAddress                  ;address of the first page table entry
mov ecx,0                             ;entry number
mov eax,PG_PRESENT | physAddress      ;entry for the first page
cld
.next:
stosd                                 ;store entry, EDI += 4
add ecx,1
add eax,0x00001000                    ;next 4 KiB physical page
cmp ecx,(endAddress - startAddress)/4
jb .next
You could use this instead:
Code: Select all
mov edi,endAddress-4
mov ecx,(endAddress - startAddress)/4    ;number of entries
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)  ;entry for the last page
std
.next:
stosd
sub ecx,1
lea eax,[eax-0x00001000]                 ;LEA doesn't modify flags, so JNE still tests the SUB
jne .next
But, why do you need ECX at all?
Code: Select all
mov edi,startAddress
mov eax,PG_PRESENT | physAddress
cld
.next:
stosd
add eax,0x00001000
cmp edi,endAddress
jb .next
In this case, doing things in reverse doesn't make much difference:
Code: Select all
mov edi,endAddress-4
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)
std
.next:
stosd
sub eax,0x00001000
cmp edi,startAddress
jae .next
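For what it's worth, all four versions compute the same thing. Here's the same loop as a C sketch (hypothetical names — `table` pointing at startAddress, `n` entries):

```c
#include <stdint.h>

#define PG_PRESENT 0x001

/* Fill n page table entries starting at table; entry i maps the
   4 KiB physical page at physAddress + i * 0x1000.
   Names are hypothetical, for illustration only. */
static void fill_entries(uint32_t *table, uint32_t n, uint32_t physAddress)
{
    for (uint32_t i = 0; i < n; i++) {
        table[i] = PG_PRESENT | (physAddress + i * 0x00001000);
    }
}
```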
There are also cases where the loop must happen in a specific direction. For an example, consider "memmove()" where the source data overlaps the destination data.
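A minimal C sketch of that (not the real C library's implementation) — pick the copy direction based on whether the destination is above or below the source:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal memmove sketch: copy forwards when dest is below src,
   backwards when dest is above src, so overlapping bytes aren't
   overwritten before they're read. */
static void *my_memmove(void *dest, const void *src, size_t n)
{
    uint8_t *d = dest;
    const uint8_t *s = src;

    if (d < s) {
        for (size_t i = 0; i < n; i++) d[i] = s[i];      /* like CLD + MOVSB */
    } else if (d > s) {
        for (size_t i = n; i > 0; i--) d[i-1] = s[i-1];  /* like STD + MOVSB */
    }
    return dest;
}
```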
bewing wrote:@Dario: if used properly, on x86 you can have (4) byte-size variables, (4) word variables, and (2) dword variables all stored in registers at the same time, and still have a couple registers open to do calculations in. So that isn't all that small a number -- just so long as you are sneaky about it, and keep your variable sizes as small as possible.
ah, al, bh, bl, ch, cl, dh, dl = 8 byte-sized variables (with false-dependency problems)
esi/si, edi/di, ebp/bp, esp/sp all count as either word variables or dword variables (but not both)
That gives a maximum of 12 registers, sort of (but really only 8 truly independent general purpose registers). Itanium has 128 general registers, PowerPC has 32, MIPS and SPARC have 31, and ARM has 16.
64-bit x86 has twice the number of general purpose registers (16 of them, where 4 of them can be used as pairs of byte registers to get to 20 "sort of" registers); which is still fewer than most modern CPUs.
a5498828 wrote:- Is using rep instructions a bad idea? It's tempting to use "rep xxxxx" instead of writing a custom loop, but is it possible for hand-pipelined code to actually be faster?
Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. You need to consider what the CPU supports (no point trying to use SSE on an 80486 for e.g.), the startup overhead, the throughput, the effect on caches, the size of the data, etc.
a5498828 wrote:- Is using add/sub instead of inc/dec a good idea? Forget about flags; let's say I'm incrementing a loop counter. Should I use inc or add, and why?
Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. For example, if the bottleneck is instruction fetch then maybe inc/dec will be faster (because it's smaller), if the dependency on flags makes no difference (e.g. plenty of instructions that don't touch flags in between) then it might not make any difference, and if there's a dependency on flags somewhere that can't be avoided then add/sub might be better.
a5498828 wrote:- When using adc is using sahf/lahf a good idea? What I mean is that between adc's I have to increment the address to which I add. And if using inc is bad for some reason (it doesn't touch CF, but takes 8 of them in long mode) I have to use add, which destroys CF.
So is sahf or pushf better? Or the other way around, like saving the CF status in a local variable and then stc/clc depending on it? How would you do a simple large carried addition?
There's still no "rule of thumb" that applies to every possible situation. However, your example makes me think you're doing something wrong. For example:
Code: Select all
add byte [edi],0x34
inc edi
adc byte [edi],0x12
inc edi
Should probably be:
Code: Select all
add byte [edi],0x34
adc byte [edi+1],0x12
add edi,2
Or maybe:
Code: Select all
add word [edi],0x1234
add edi,2
a5498828 wrote:- In 32-bit mode should I use only ebx and ebp for addressing, like in 16-bit mode? I've heard it's much faster because of legacy.
Probably not — that restriction is a left-over from 16-bit addressing, where only BX, BP, SI and DI could appear in an address. In 32-bit mode any general purpose register can be used as a base or index, and I can't see how sticking to EBX and EBP would make any difference.
a5498828 wrote:- When dividing small data which fits into 1 register should I use div/idiv, or do it manually? Same in the case of mul/imul. Will the best possible optimized code beat hardware division/multiplication?
For multiply/divide by a constant, in general (depending on things like whether you can break up the instruction dependencies) I'd use a combination of shifts, add/sub and LEA with no more than 3 instructions. Any more than that and it's probably better to use MUL/IMUL/DIV/IDIV anyway. For example, multiplying EAX by 10 can be:
Code: Select all
lea eax,[eax*4+eax] ;eax = value * 5
*other instruction/s that don't rely on EAX here *
add eax,eax ;eax = value * 10
*other instruction/s that don't rely on EAX here *
However, for multiplying EAX by 4619 it'd be faster to use MUL, because it'd take too many of the simpler instructions.
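The same strength reduction in C, just to show the arithmetic (a compiler will do this for you automatically):

```c
#include <stdint.h>

/* Multiply by 10 using shift/add, as in the LEA + ADD sequence above:
   x * 10 == (x * 5) * 2 == ((x << 2) + x) << 1 */
static uint32_t mul10(uint32_t x)
{
    uint32_t x5 = (x << 2) + x;  /* like lea eax,[eax*4+eax] */
    return x5 + x5;              /* like add eax,eax */
}
```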
For multiplying/dividing by a variable, use MUL/IMUL/DIV/IDIV. The only other option that might be worth considering is if the values are tiny and the operation needs to be done a lot, where a lookup table would be small enough and would remain in L1 cache, especially if it frees up registers and allows you to do more in parallel. For example (four 8-bit multiplications at once):
Code: Select all
movzx eax,byte [value1a]
movzx ebx,byte [value2a]
movzx ecx,byte [value3a]
movzx edx,byte [value4a]
mov ah,[value1b]
mov bh,[value2b]
mov ch,[value3b]
mov dh,[value4b]
mov ax,[table+eax*2]
mov bx,[table+ebx*2]
mov cx,[table+ecx*2]
mov dx,[table+edx*2]
mov [result1],ax
mov [result2],bx
mov [result3],cx
mov [result4],dx
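The same idea as a C sketch (hypothetical names; note that with full 8-bit operands the table is 65536 * 2 = 128 KiB, so in practice the value ranges need to be much smaller than that for the table to stay in L1):

```c
#include <stdint.h>

/* Hypothetical multiply-by-lookup-table sketch. The index is the two
   8-bit operands packed into 16 bits, exactly like the result of
   "movzx eax,byte [value1a]" followed by "mov ah,[value1b]". */
static uint16_t mul_table[65536];

static void init_table(void)
{
    for (uint32_t a = 0; a < 256; a++)
        for (uint32_t b = 0; b < 256; b++)
            mul_table[(b << 8) | a] = (uint16_t)(a * b);
}

/* Like "mov ax,[table+eax*2]" */
static uint16_t table_mul(uint8_t a, uint8_t b)
{
    return mul_table[((uint32_t)b << 8) | a];
}
```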
a5498828 wrote:- In long mode I want to load ax with a 2-byte value. Which is better: mov with a 0x66 prefix, or movzx?
There's no "rule of thumb" that applies to every possible situation. For example, if you're using the high bits of RAX for something then you can't use MOVZX; but "movzx eax, word [something]" will break any dependency on the previous operation that modified RAX/EAX.
a5498828 wrote:- I want to copy 7 bytes in long mode. What is better: 7x mov byte; 3x mov byte + 1x mov with 0x66; or 1x mov dword + 1x mov with 0x66 + 1x mov byte?
The best option would probably be to increase the size to 8 bytes and ensure it's aligned correctly; then use "mov rax,[from]" followed by "mov [to],rax".
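A C sketch of the same trick (this assumes the buffers are padded so that touching the 8th byte is harmless — that's my assumption, and it's not always safe):

```c
#include <stdint.h>
#include <string.h>

/* Copy 7 bytes by rounding up to one 8-byte load and store, like
   "mov rax,[from]" / "mov [to],rax". Requires that reading and
   writing the 8th byte is permitted (e.g. padded buffers). */
static void copy7_as_8(void *to, const void *from)
{
    uint64_t tmp;
    memcpy(&tmp, from, 8);  /* typically compiles to a single 8-byte load */
    memcpy(to, &tmp, 8);    /* and a single 8-byte store */
}
```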
Cheers,
Brendan