about performance and code design [8086+]
I have more than a few problems with writing good code, and I would like to ask some questions here.
For example, where can I learn in detail about the pipeline? Which instructions should I use in what order, so the CPU can recognize them and optimize execution?
- Is using rep instructions a bad idea? It's tempting to use rep xxxxx instead of writing a custom loop, but is it possible to write pipelined code that is actually faster?
- Is using add/sub instead of inc/dec a good idea? Forget about flags; let's say I'm incrementing a loop counter. Should I use inc or add, and why?
- When using adc, is using sahf/lahf a good idea? What I mean is that between adc's I have to increment the address to which I add. And if using inc is bad for some reason (it doesn't touch CF, but takes 8 of them in long mode), I have to use add, which destroys CF.
So is sahf or pushf better? Or maybe another way, like saving the CF status in a local variable and then doing stc/clc depending on it? How would you do simple large carried addition?
- When looping, is it better to test for ZF at the end, or do a cmp? Meaning, should the loop counter go up to the destination or down to zero? Or maybe jcxz at the beginning and jmp at the end?
- In 32-bit mode, should I use only ebx and ebp for addressing, like in 16-bit mode? I've heard it's much faster because of legacy.
- When dividing small data which fits into 1 register, should I use div/idiv, or do it manually? Same in the case of mul/imul. Will the best possible optimized code beat hardware division/multiplication?
- In long mode I want to load ax with a 2-byte value. What is better: mov with a 0x66 prefix, or movzx?
- I want to copy 7 bytes in long mode. What is better: 7x mov byte, 3x mov byte + 1x mov 0x66, or 1x mov dword + 1x mov 0x66 + 1x mov byte?
- Combuster
Re: about performance and code design [8086+]
a5498828 wrote: For example, where can i learn in details about pipeline? What instructions to use in what order, so the cpu can recognize them and optimize execution.

Both Intel and AMD have optimisation manuals. The Graphics Programming Black Book goes into great detail about the 486/586 (with some hints about older processors) and is probably a good read to get some background knowledge and the thinking patterns.
a5498828 wrote: - Is using rep instructions bad idea? Its tempting instead of creating custom loop use rep xxxxx, but is it possible to write pipelined code to actually be faster?

SSE block operations are generally faster for long sequences, and slower for short sequences. The rep opcodes are better for code size. Try to use the widest mode as much as possible.
a5498828 wrote: - Is using add/sub instead of inc/dec a good idea? Forget about flags, lets say im incrementing loop counter. Should i use inc or add, and why?

Using inc/dec followed by a conditional branch causes a partial register stall on the flags register on modern processors, in which case add/sub is better. On CPUs where the stall does not exist, inc/dec is better (and shorter). Decrementing counters allows you to change add; cmp; jl into the shorter sub; jnz.
a5498828 wrote: - When using adc is using sahf/lahf a good idea? What i mean is between adc's i have to increment address to wich i add. And if using inc is bad for some reason (it doesnt touch CF, but takes 8 of them in long mode) i have to use add wich detroy CF.

Learn to use LEA.
a5498828 wrote: So sahf or pushf is better? Or maybe other way, like setting CF status in local variable, and then stc/clc depending on it? How you would do simple large carried addition?

Still depends on all the circumstances you haven't mentioned.
a5498828 wrote: - When dividing small data wich fit into 1 register should i use div/idiv, or do it manually? Same in case of mul/imul. Will best possible optimized code bead hardware division/multiplication?

Divide by a power of two? Use shr/sar. Divide by a constant? Try multiplying with its inverse. Divide a byte? Use aam.
In the end, what matters more on modern processors is instruction scheduling - try to maximize the number of instructions between where a register is written and where it is used. Avoid complex instructions, and if you have to use them, set them up as soon as possible, do something else you need, then merge the results. If you do this right, it will benefit just about every CPU core out there.
Even better: use O(1) algorithms where possible
Re: about performance and code design [8086+]
Thanks for the interest and the quick answer; I'll look into that book.

Combuster wrote: Learn to use LEA

How do I detect carry then? Unless I copy the old register and then do a cmp, but that's slower than swapping CF into memory.

Combuster wrote: divide by a power of two? use shr/sar. divide by constant? try multiplying with its inverse. divide a byte? use aam.

By an unknown value.
Re: about performance and code design [8086+]
I recommend reading Agner Fog.
Re: about performance and code design [8086+]
a5498828 wrote: For example, where can i learn in details about pipeline? What instructions to use in what order, so the cpu can recognize them and optimize execution.

AFAIK, modern CPUs implement out-of-order pipelines, so that kind of optimization is "not" necessary, and it only slows down programming since there are hundreds of ways to combine instructions.
____
Dario
Re: about performance and code design [8086+]
What I do is write it all in C and let the compiler optimize it for me. I am serious, the compiler-generated code is much faster than almost anything I could write in assembly. This is especially true for newer versions of GCC, with the -march flag. If your machine is detected correctly, -march=native will tune the generated code for performance for your system. Even if it's not detected correctly, you can at least figure out what your processor is (e.g., core2) and pass that as "-march=core2". Looking at the generated code, it makes use of SIMD and other extensions when necessary, and also uses certain idioms that are faster on modern machines.
Outside of compiler optimization, your best hope for speed is your choice of algorithm rather than whether you code it in assembly. Get a good book on computer science and algorithms, take a class, or even just search the web; Wikipedia works for basic descriptions of algorithms and their average and worst-case running times in most cases.
Re: about performance and code design [8086+]
a5498828 wrote: - When looping, is it better to test for ZF at the end, or do cmp? Meaning should loop counter go up to destination or down to zero? Or maybe jcxz at begginig, and jm at the end?

It's always best to count down, and test ZF or SF at the end. Under typical usage, doing that will increase code speed about 5%, instead of using CMP.
The thing nobody has mentioned yet, and that GCC does an *awful* job of, is using your registers heavily. Using memory (including the stack) is somewhere between 3 and 500 times slower than using a register -- depending on caching. You can get more speed improvement by hand coding your ASM to use registers properly than by any other optimization.
Re: about performance and code design [8086+]
I can say that C is not good. I don't care if it produces better code than I do. The only advantage of C is that I can compile code for many CPUs.
But still, any OS-specific operation (I/O, registry, console) I have to implement separately, just creating a common interface between the specific functions and the real code.
And not all common things easily done in asm can be done as easily in C.
For example, modulo. This relies heavily on the carry flag and bit manipulation, which C isn't capable of.
So a programmer who uses C has his job easier when writing simple code, but anything more complicated gets nearly impossible.
Re: about performance and code design [8086+]
bewing wrote: The thing nobody has mentioned yet, and that GCC does an *awful* job of, is using your registers heavily

x86/x64 specific, or in general? Because x86 has a really small register file.
____
Dario
Re: about performance and code design [8086+]
@Dario: if used properly, on x86 you can have (4) byte-size variables, (4) word variables, and (2) dword variables all stored in registers at the same time, and still have a couple registers open to do calculations in. So that isn't all that small a number -- just so long as you are sneaky about it, and keep your variable sizes as small as possible.
Re: about performance and code design [8086+]
Hi,

Tosi wrote: What I do is write it all in C and let the compiler optimize it for me. I am serious, the compiler-generated code is much faster than almost anything I could write in assembly.

By definition, for the "best case", assembly language is always at least as good as anything any compiler could generate, given the fact that it's possible for an assembly language programmer to start with the output of the compiler and "tweak" it (while testing whether the tweaking helped between tweaks).
However, in general (and especially for large projects) no assembly language programmer is going to waste a huge amount of time trying to get the fastest code possible for all of the code, and because of this most assembly language programmers are frequently beaten by compilers. It's not that they can't beat the compiler; it's that it's not worth the time it would take. The other thing is that highly optimised assembly code is much, much harder to maintain, and an assembly language programmer who spends a huge amount of time reducing code maintainability is the last thing you want.
Mostly, you want to find the bottlenecks and only spend time improving them. It doesn't matter much if the project is 100% assembly (where most code is "average" and a small part is optimised) or if the project is primarily C with a few pieces in optimised inline assembly.
Dario wrote: AFAIK, modern CPUs implement out-of-order pipelines, so that kind of optimization is "not" necessary and it only slows down programming since there are hundreds of ways to combine instructions.

In my opinion, an Intel Atom is a modern CPU. It doesn't do any instruction reordering, speculative execution, or register renaming.
Even for CPUs that do implement out-of-order pipelines, there are typically restrictions on what the CPU is capable of, and code that tries to put instructions in an optimal order can still help.
a5498828 wrote: - When looping, is it better to test for ZF at the end, or do cmp? Meaning should loop counter go up to destination or down to zero? Or maybe jcxz at begginig, and jm at the end?
bewing wrote: It's always best to count down, and test ZF or SF at the end. Under typical usage, doing that will increase code speed about 5%, instead of using CMP.

For most loops, there's usually something inside the loop that can be used to determine loop termination, so that you don't need to use a separate register. For an example, consider:
Code: Select all
mov edi,startAddress
mov ecx,0
mov eax,PG_PRESENT | physAddress
cld
.next:
stosd
add ecx,1
add eax,0x00001000
cmp ecx,(endAddress - startAddress)/4
jb .next

You could use this instead:
Code: Select all
mov edi,endAddress-4
mov ecx,(endAddress - startAddress)/4
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)
std
.next:
stosd
sub ecx,1
lea eax,[eax-0x00001000]
jne .next

But why do you need ECX at all?
Code: Select all
mov edi,startAddress
mov eax,PG_PRESENT | physAddress
cld
.next:
stosd
add eax,0x00001000
cmp edi,endAddress
jb .next

In this case, doing things in reverse doesn't make much difference:
Code: Select all
mov edi,endAddress-4
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)
std
.next:
stosd
sub eax,0x00001000
cmp edi,startAddress
jae .next

There are also cases where the loop must happen in a specific direction. For an example, consider "memmove()" where the source data overlaps the destination data.
bewing wrote: @Dario: if used properly, on x86 you can have (4) byte-size variables, (4) word variables, and (2) dword variables all stored in registers at the same time, and still have a couple registers open to do calculations in. So that isn't all that small a number -- just so long as you are sneaky about it, and keep your variable sizes as small as possible.

ah, al, bh, bl, ch, cl, dh, dl = 8 byte-sized variables (with false dependency problems)
esi/si, edi/di, ebp/bp, esp/sp all count as either word variables or dword variables (but not both)
That gives a maximum of 12 registers, sort of (but only 8 general purpose registers that are truly independent, really). Itanium has 128 general registers. PowerPC has 32 of them, MIPS and SPARC have 31 of them, and ARM has 16 of them.
64-bit x86 has twice the number of general purpose registers (16 of them, where you could use 4 of them as pairs of byte registers to get to 20 "sort of" registers); which is still less than most modern CPUs.
a5498828 wrote: - Is using rep instructions bad idea? Its tempting instead of creating custom loop use rep xxxxx, but is it possible to write pipelined code to actually be faster?

Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. You need to consider what the CPU supports (no point trying to use SSE on an 80486, for example), the startup overhead, the throughput, the effect on caches, the size of the data, etc.
a5498828 wrote: - Is using add/sub instead of inc/dec a good idea? Forget about flags, lets say im incrementing loop counter. Should i use inc or add, and why?

Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. For example, if the bottleneck is instruction fetch then maybe inc/dec will be faster (because it's smaller), if the dependency on flags makes no difference (e.g. plenty of instructions that don't touch flags in between) then it might not make any difference, and if there's a dependency on flags somewhere that can't be avoided then add/sub might be better.
a5498828 wrote: - When using adc is using sahf/lahf a good idea? What i mean is between adc's i have to increment address to wich i add. And if using inc is bad for some reason (it doesnt touch CF, but takes 8 of them in long mode) i have to use add wich detroy CF. So sahf or pushf is better? Or maybe other way, like setting CF status in local variable, and then stc/clc depending on it? How you would do simple large carried addition?

There's still no "rule of thumb" that applies to every possible situation. However, your example makes me think you're doing something wrong. For example:
Code: Select all
add byte [edi],0x34
inc edi
adc byte [edi],0x12
inc edi

Should probably be:
Code: Select all
add byte [edi],0x34
adc byte [edi+1],0x12
add edi,2

Or maybe:
Code: Select all
add word [edi],0x1234
add edi,2
a5498828 wrote: - In 32bit mode should i use only ebx and ebp for addressing like in 16 bit mode? Ive heard its much faster because of legacy.

Probably not. I can't see how it'd make any difference.
a5498828 wrote: - When dividing small data wich fit into 1 register should i use div/idiv, or do it manually? Same in case of mul/imul. Will best possible optimized code bead hardware division/multiplication?

For multiply/divide by a constant, in general (depending on things like whether you can break up the instruction dependencies) I'd use a combination of shifts, add/sub and LEA with no more than 3 instructions. Any more than that and it's probably better to use MUL/IMUL/DIV/IDIV anyway. For example, multiplying EAX by 10 can be:
Code: Select all
lea eax,[eax*4+eax] ;eax = value * 5
*other instruction/s that don't rely on EAX here *
add eax,eax ;eax = value * 10
*other instruction/s that don't rely on EAX here *

However, for multiplying EAX by 4619 it'd be faster to MUL because it'd take too many simpler instructions.
For multiplying/dividing by a variable, use MUL/IMUL/DIV/IDIV. The only other option that might be worth considering is if the values are tiny and the operation needs to be done a lot, where a lookup table would be small enough and would remain in L1 cache, especially if it frees up registers and allows you to do more in parallel. For example (four 8-bit multiplications at once):
Code: Select all
movzx eax,byte [value1a]
movzx ebx,byte [value2a]
movzx ecx,byte [value3a]
movzx edx,byte [value4a]
mov ah,[value1b]
mov bh,[value2b]
mov ch,[value3b]
mov dh,[value4b]
mov ax,[table+eax*2]
mov bx,[table+ebx*2]
mov cx,[table+ecx*2]
mov dx,[table+edx*2]
mov [result1],ax
mov [result2],bx
mov [result3],cx
mov [result4],dx
a5498828 wrote: - In long mode i want to load ax with 2 byte value. What is better: mov with 0x66 prefix, or movzx?

There's no "rule of thumb" that applies to every possible situation. For example, if you're using the high bits of RAX for something then you can't use MOVZX, but "movzx eax, word [something]" will break any dependency on the previous operation that modified RAX/EAX.
a5498828 wrote: - I want to copy 7 bytes in long mode. What is better, 7x mov byte, 3x mov byte + 1x mov 0x66 or 1x mov dword + 1x mov 0x66 + 1x mov byte?

The best option would probably be to increase the size of it to 8 bytes and ensure it is aligned correctly; then use "mov rax,[from]" then "mov [to],rax".
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Owen
Re: about performance and code design [8086+]
Brendan wrote: Itanium has 128 general registers. PowerPC has 32 of them, MIPs and SPARC have 31 of them, and ARM has 16 of them.

ARM has 15 general purpose registers; remember that R15 is the PC (though this does often reduce register pressure when you're doing position-independent code and such). As with most architectures, you'll lose another (R13) to the stack pointer. In leaf functions you may lose R14 too, since the link register holds the return address (in non-leaf functions, it's normally optimal to just write LR down onto the stack using STMDB and load it straight back into PC with LDMIA when you save and restore the other callee-save registers).
Also remember that in Thumb-1 mode, many instructions cannot be performed on the top 8 registers. In Thumb-2 mode, this has mostly been rectified (though the versions which operate on the top 8 will produce bigger code, so you should assign the most popular operations to the bottom 8 where possible). ARM mode is fully orthogonal.
ARM mode features a fused add and shift. The three operand form of this (i.e. R1 = R2 + R3 << Constant) is generally one cycle; the four operand (R1 = R2 + R3 << R4) is generally two cycle.
As with most RISC architectures, you can't embed a full-width constant in an instruction. However, ARM "modified immediate constants" cover the majority of requirements, being able to express any 8-bit value combined with any even rotation (or its bitwise inverse). These are generated using the MOV and/or MVN instructions. Failing that, the assembler will generate a PC-relative load (using the LDR pseudo-instruction lets the assembler optimize this).
SPARC is a special case because of the register windows. Both it and MIPS hardwire a register to zero (Because zero is useful). Itanium has a register stack, which poses other interesting properties; Itanium registers can also be in a "Not a thing" state; this has some really interesting results.
(And I realise I have digressed quite widely from the topic here!)
Re: about performance and code design [8086+]
Brendan wrote: ah, al, bh, bl, ch, cl, dh, dl = 8 byte sized variables (with false dependency problems)
esi/si, edi/di, ebp/bp, esp/sp all count as either word variables or dword variables (but not both)
That gives a maximum of 12 registers, sort of (but only 8 general purpose registers that are truly independent really).

I disagree. You are forgetting the BSWAP opcode. Adding an extra BSWAP between operations slows things down (of course), but it's still faster than going to memory. So if you forget about leaving a register open for a stack frame, and forget about leaving a couple open to do math in, you have:
AL, AH, upper eax for a bswapped word, BL, BH, upper ebx for a bswapped word, ditto for ecx, ditto for edx, and then six more words stored in lower and upper ebp, esi, and edi. For a grand total of 8 byte registers and 10 word registers = 18 total. Or, if you want to play with XCHG, too, you can have a semi-convenient maximum of 24 byte registers + 4 word registers = 28 total. Which is a lot more than 12, if you just use some imagination. And it's an incredible amount more than the *THREE* that GCC typically uses for doing most things. Yeah, it doesn't match the other CPUs out there. I find that it doesn't take more than 10 to do almost any subroutine.
Of course, playing with BSWAP/XCHG gets confusing and unmaintainable in hand-coded ASM. Which is why I'm writing a compiler that will handle the registers in this way for you.
Re: about performance and code design [8086+]
Hi,
bewing wrote: I disagree. You are forgetting the BSWAP opcode. Adding an extra BSWAP between operations slows things down (of course), but it's still faster than going to memory.

If you're going to count BSWAP, then you might as well call it 256 general purpose registers (that are 1-bit each). Hmm - maybe a few more if you can use some of the flags in EFLAGS.
Of course you'd have to do the same for all the other architectures for a fair comparison.
Cheers,
Brendan
Re: about performance and code design [8086+]
Yes, if you have enough boolean variables and you are short on GP registers, they can sometimes be stored in individual bits. But there are problems getting them into and back out of the shifted bit positions in a time faster than what you could get with one stack operation.
And most CPUs do not have BSWAP or XCHG. Many do have fast multi-bit ROL/ROR opcodes that can do similar things, though. But my point above was "how do you optimize code to easily beat what GCC will ever give you", and the answer is "do clever things to have more registers available, and USE them."
And most CPUs do not have BSWAP or XCHG. Many do have fast multi-bit ROL/ROR opcodes that can do similar things though. But my point above was "how do you optimize code to easily beat what GCC will ever give you". And the answer is "do clever things to have more registers available, and USE them."