about performance and code design [8086+]
I have more than a few problems with writing good code, and I would like to ask some questions here.
For example, where can I learn in detail about the pipeline? Which instructions should I use in what order, so the CPU can recognize them and optimize execution?
- Is using rep instructions a bad idea? It's tempting to use rep xxxxx instead of writing a custom loop, but is it possible to write pipelined code that is actually faster?
- Is using add/sub instead of inc/dec a good idea? Forget about flags; let's say I'm incrementing a loop counter. Should I use inc or add, and why?
- When using adc, is using sahf/lahf a good idea? What I mean is that between adc's I have to increment the address to which I add. And if using inc is bad for some reason (it doesn't touch CF, but takes 8 of them in long mode), I have to use add, which destroys CF.
So is sahf or pushf better? Or maybe another way, like saving the CF status in a local variable and then doing stc/clc depending on it? How would you do simple large carried addition?
- When looping, is it better to test for ZF at the end, or do a cmp? Meaning, should the loop counter go up to the destination or down to zero? Or maybe jcxz at the beginning and jmp at the end?
- In 32-bit mode, should I use only ebx and ebp for addressing, like in 16-bit mode? I've heard it's much faster because of legacy.
- When dividing small data which fits into 1 register, should I use div/idiv, or do it manually? Same in the case of mul/imul. Will the best possible optimized code beat hardware division/multiplication?
- In long mode I want to load ax with a 2-byte value. What is better: mov with a 0x66 prefix, or movzx?
- I want to copy 7 bytes in long mode. What is better: 7x mov byte, 3x mov byte + 1x mov 0x66, or 1x mov dword + 1x mov 0x66 + 1x mov byte?
- Combuster
Re: about performance and code design [8086+]
a5498828 wrote: For example, where can i learn in details about pipeline? What instructions to use in what order, so the cpu can recognize them and optimize execution.

Both Intel and AMD have optimisation manuals. The Graphics Programming Black Book goes into great detail about the 486/586 (with some hints about older processors) and is probably a good read to get some background knowledge and the thinking patterns.
a5498828 wrote: - Is using rep instructions bad idea? Its tempting instead of creating custom loop use rep xxxxx, but is it possible to write pipelined code to actually be faster?

SSE block operations are generally faster for long sequences, and slower for short sequences. The rep opcodes are better for code size. Try to use the widest mode as much as possible.
a5498828 wrote: - Is using add/sub instead of inc/dec a good idea? Forget about flags, lets say im incrementing loop counter. Should i use inc or add, and why?

Using inc/dec followed by a conditional branch causes a partial register stall on the flags register on modern processors, in which case add/sub is better. On CPUs where the stall does not exist, inc/dec is better (and shorter). Decrementing counters allows you to change add; cmp; jl into the shorter sub; jnz.
a5498828 wrote: - When using adc is using sahf/lahf a good idea? What i mean is between adc's i have to increment address to wich i add. And if using inc is bad for some reason (it doesnt touch CF, but takes 8 of them in long mode) i have to use add wich detroy CF.

Learn to use LEA.
a5498828 wrote: So sahf or pushf is better? Or maybe other way, like setting CF status in local variable, and then stc/clc depending on it? How you would do simple large carried addition?

Still depends on all the circumstances you haven't mentioned.
a5498828 wrote: - When dividing small data wich fit into 1 register should i use div/idiv, or do it manually? Same in case of mul/imul. Will best possible optimized code bead hardware division/multiplication?

Divide by a power of two? Use shr/sar. Divide by a constant? Try multiplying with its inverse. Divide a byte? Use aam.
In the end, what matters more on modern processors is instruction scheduling - try to maximize the number of instructions between where a register is written and where it is used. Avoid complex instructions, and if you have to use them, set them up as soon as possible, do something else you need, then merge the results. If you do this right, it will benefit just about every CPU core out there.
Even better: use O(1) algorithms where possible
Re: about performance and code design [8086+]
Thanks for the interest and the quick answer; I'll look into that book.

Combuster wrote: Learn to use LEA

How do I detect carry then? Unless I copy the old register and then do a cmp, but that's slower than swapping CF into memory.

Combuster wrote: divide by a power of two? use shr/sar. divide by constant? try multiplying with its inverse. divide a byte? use aam.

By an unknown value.
Re: about performance and code design [8086+]
I recommend reading Agner Fog.
Re: about performance and code design [8086+]
a5498828 wrote: For example, where can i learn in details about pipeline? What instructions to use in what order, so the cpu can recognize them and optimize execution.

AFAIK, modern CPUs implement out-of-order pipelines, so that kind of optimization is "not" necessary, and it only slows down programming since there are hundreds of ways to combine instructions.
____
Dario
Re: about performance and code design [8086+]
What I do is write it all in C and let the compiler optimize it for me. I am serious, the compiler-generated code is much faster than almost anything I could write in assembly. This is especially true for newer versions of GCC, with the -march flag. If your machine is detected correctly, -march=native will tune the generated code for performance for your system. Even if it's not detected correctly, you can at least figure out what your processor is (e.g., core2) and pass that as "-march=core2". Looking at the generated code, it makes use of SIMD and other extensions when necessary, and also uses certain idioms that are faster on modern machines.
Outside of compiler optimization, your best hope for speed is your choice of algorithm rather than whether you code it in assembly. Get a good book on computer science and algorithms, take a class, or even just search the web; Wikipedia works for basic descriptions of algorithms and their average and worst-case running times in most cases.
Re: about performance and code design [8086+]
a5498828 wrote: - When looping, is it better to test for ZF at the end, or do cmp? Meaning should loop counter go up to destination or down to zero? Or maybe jcxz at begginig, and jm at the end?

It's always best to count down, and test ZF or SF at the end. Under typical usage, doing that will increase code speed about 5%, instead of using CMP.
The thing nobody has mentioned yet, and that GCC does an *awful* job of, is using your registers heavily. Using memory (including the stack) is somewhere between 3 and 500 times slower than using a register -- depending on caching. You can get more speed improvement by hand coding your ASM to use registers properly than by any other optimization.
Re: about performance and code design [8086+]
I can say that C is not good. I don't care if it produces better code than I do. The only advantage of C is that I can compile code for many CPUs.
But still, any OS-specific operation (I/O, registry, console) I have to implement separately, just creating a common interface between the specific functions and the real code.
And not all common things easily done in asm can be done as easily in C.
For example, modulo. This relies heavily on the carry flag and bit manipulation, which C isn't capable of.
So a programmer who uses C has his job easier when writing simple code, but anything more complicated gets nearly impossible.
Re: about performance and code design [8086+]
bewing wrote: The thing nobody has mentioned yet, and that GCC does an *awful* job of, is using your registers heavily

x86/x64 specific, or in general? Because x86 has a really small register file.
____
Dario
Re: about performance and code design [8086+]
@Dario: if used properly, on x86 you can have (4) byte-size variables, (4) word variables, and (2) dword variables all stored in registers at the same time, and still have a couple registers open to do calculations in. So that isn't all that small a number -- just so long as you are sneaky about it, and keep your variable sizes as small as possible.
Re: about performance and code design [8086+]
Hi,

Tosi wrote: What I do is write it all in C and let the compiler optimize it for me. I am serious, the compiler-generated code is much faster than almost anything I could write in assembly.

By definition, for the "best case", assembly language is always at least as good as anything any compiler could generate, given the fact that it's possible for an assembly language programmer to start with the output of the compiler and "tweak" it (while testing whether the tweaking helped between tweaks).
However, in general (and especially for large projects) no assembly language programmer is going to waste a huge amount of time trying to get the fastest code possible for all of the code, and because of this most assembly language programmers are frequently beaten by compilers. It's not that they can't beat the compiler; it's that it's not worth the time it would take. The other thing is that highly optimised assembly code is much, much harder to maintain, and an assembly language programmer who spends a huge amount of time reducing code maintainability is the last thing you want.
Mostly, you want to find the bottlenecks and only spend time improving them. It doesn't matter much if the project is 100% assembly (where most code is "average" and a small part is optimised) or if the project is primarily C with a few pieces in optimised inline assembly.
Dario wrote: AFAIK, modern CPUs implement out-of-order pipelines, so that kind of optimization is "not" necessary and it only slows down programming since there are hundreds of ways to combine instructions.

In my opinion, an Intel Atom is a modern CPU. It doesn't do any instruction reordering, speculative execution, or register renaming.
Even for CPUs that do implement out-of-order pipelines, there are typically restrictions on what the CPU is capable of, and code that tries to put instructions in an optimal order can still help.
a5498828 wrote: - When looping, is it better to test for ZF at the end, or do cmp? Meaning should loop counter go up to destination or down to zero? Or maybe jcxz at begginig, and jm at the end?
bewing wrote: It's always best to count down, and test ZF or SF at the end. Under typical usage, doing that will increase code speed about 5%, instead of using CMP.

For most loops, there's usually something inside the loop that can be used to determine loop termination, so that you don't need to use a separate register. For an example, consider:
Code: Select all
mov edi,startAddress
mov ecx,0
mov eax,PG_PRESENT | physAddress
cld
.next:
stosd
add ecx,1
add eax,0x00001000
cmp ecx,(endAddress - startAddress)/4
jb .next

You could use this instead:
Code: Select all
mov edi,endAddress-4
mov ecx,(endAddress - startAddress)/4
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)
std
.next:
stosd
sub ecx,1
lea eax,[eax-0x00001000]
jne .next

But why do you need ECX at all?
Code: Select all
mov edi,startAddress
mov eax,PG_PRESENT | physAddress
cld
.next:
stosd
add eax,0x00001000
cmp edi,endAddress
jb .next

In this case, doing things in reverse doesn't make much difference:
Code: Select all
mov edi,endAddress-4
mov eax,PG_PRESENT | (physAddress + ((endAddress - startAddress)/4 - 1) * 0x00001000)
std
.next:
stosd
sub eax,0x00001000
cmp edi,startAddress
jae .next

There are also cases where the loop must happen in a specific direction. For an example, consider "memmove()" where the source data overlaps the destination data.
bewing wrote: @Dario: if used properly, on x86 you can have (4) byte-size variables, (4) word variables, and (2) dword variables all stored in registers at the same time, and still have a couple registers open to do calculations in. So that isn't all that small a number -- just so long as you are sneaky about it, and keep your variable sizes as small as possible.

ah, al, bh, bl, ch, cl, dh, dl = 8 byte-sized variables (with false dependency problems)
esi/si, edi/di, ebp/bp, esp/sp all count as either word variables or dword variables (but not both)
That gives a maximum of 12 registers, sort of (but only 8 general purpose registers that are truly independent, really). Itanium has 128 general registers. PowerPC has 32 of them, MIPS and SPARC have 31 of them, and ARM has 16 of them.
64-bit x86 has twice the number of general purpose registers (16 of them, where you could use 4 of them as pairs of byte registers to get to 20 "sort of" registers); which is still less than most modern CPUs.
a5498828 wrote: - Is using rep instructions bad idea? Its tempting instead of creating custom loop use rep xxxxx, but is it possible to write pipelined code to actually be faster?

Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. You need to consider what the CPU supports (no point trying to use SSE on an 80486, for example), the startup overhead, the throughput, the effect on caches, the size of the data, etc.
a5498828 wrote: - Is using add/sub instead of inc/dec a good idea? Forget about flags, lets say im incrementing loop counter. Should i use inc or add, and why?

Sometimes it's a bad idea, sometimes it's a good idea. There's no "rule of thumb" that applies to every possible situation. For example, if the bottleneck is instruction fetch then maybe inc/dec will be faster (because it's smaller), if the dependency on flags makes no difference (e.g. plenty of instructions that don't touch flags in between) then it might not make any difference, and if there's a dependency on flags somewhere that can't be avoided then add/sub might be better.
a5498828 wrote: - When using adc is using sahf/lahf a good idea? What i mean is between adc's i have to increment address to wich i add. And if using inc is bad for some reason (it doesnt touch CF, but takes 8 of them in long mode) i have to use add wich detroy CF. So sahf or pushf is better? Or maybe other way, like setting CF status in local variable, and then stc/clc depending on it? How you would do simple large carried addition?

There's still no "rule of thumb" that applies to every possible situation. However, your example makes me think you're doing something wrong. For example:
Code: Select all
add byte [edi],0x34
inc edi
adc byte [edi],0x12
inc edi

Should probably be:
Code: Select all
add byte [edi],0x34
adc byte [edi+1],0x12
add edi,2

Or maybe:
Code: Select all
add word [edi],0x1234
add edi,2
a5498828 wrote: - In 32bit mode should i use only ebx and ebp for addressing like in 16 bit mode? Ive heard its much faster because of legacy.

Probably not. I can't see how it'd make any difference.
a5498828 wrote: - When dividing small data wich fit into 1 register should i use div/idiv, or do it manually? Same in case of mul/imul. Will best possible optimized code bead hardware division/multiplication?

For multiply/divide by a constant, in general (depending on things like whether you can break up the instruction dependencies) I'd use a combination of shifts, add/sub and LEA with no more than 3 instructions. Any more than that and it's probably better to use MUL/IMUL/DIV/IDIV anyway. For example, multiplying EAX by 10 can be:
Code: Select all
lea eax,[eax*4+eax] ;eax = value * 5
*other instruction/s that don't rely on EAX here *
add eax,eax ;eax = value * 10
*other instruction/s that don't rely on EAX here *

However, for multiplying EAX by 4619 it'd be faster to MUL because it'd take too many simpler instructions.
For multiplying/dividing by a variable, use MUL/IMUL/DIV/IDIV. The only other option that might be worth considering is if the values are tiny and the operation needs to be done a lot, where a lookup table would be small enough and would remain in L1 cache, especially if it frees up registers and allows you to do more in parallel. For example (four 8-bit multiplications at once):
Code: Select all
movzx eax,byte [value1a]
movzx ebx,byte [value2a]
movzx ecx,byte [value3a]
movzx edx,byte [value4a]
mov ah,[value1b]
mov bh,[value2b]
mov ch,[value3b]
mov dh,[value4b]
mov ax,[table+eax*2]
mov bx,[table+ebx*2]
mov cx,[table+ecx*2]
mov dx,[table+edx*2]
mov [result1],ax
mov [result2],bx
mov [result3],cx
mov [result4],dx
a5498828 wrote: - In long mode i want to load ax with 2 byte value. What is better: mov with 0x66 prefix, or movzx?

There's no "rule of thumb" that applies to every possible situation. For example, if you're using the high bits of RAX for something then you can't use MOVZX, but "movzx eax, word [something]" will break any dependency on the previous operation that modified RAX/EAX.
a5498828 wrote: - I want to copy 7 bytes in long mode. What is better, 7x mov byte, 3x mov byte + 1x mov 0x66 or 1x mov dword + 1x mov 0x66 + 1x mov byte?

The best option would probably be to increase the size of it to 8 bytes and ensure it is aligned correctly; then use "mov rax,[from]" then "mov [to],rax".
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Owen
Re: about performance and code design [8086+]
Brendan wrote: Itanium has 128 general registers. PowerPC has 32 of them, MIPs and SPARC have 31 of them, and ARM has 16 of them.

ARM has 15 general purpose registers; remember that R15 is the PC (though this does often reduce register pressure when you're doing position-independent code and such). As with most architectures, you'll lose another (R13) to the stack pointer. In leaf functions you may lose R14 too, since the link register holds the return address (in non-leaf functions, it's normally optimal to just write LR down onto the stack using STMDB and load it straight back into PC with LDMIA when you save and restore the other callee-save registers).
Also remember that in Thumb-1 mode, many instructions cannot be performed on the top 8 registers. In Thumb-2 mode, this has mostly been rectified (though the versions which operate on the top 8 will produce bigger code, so you should assign the most popular operations to the bottom 8 where possible). ARM mode is fully orthogonal.
ARM mode features a fused add and shift. The three operand form of this (i.e. R1 = R2 + R3 << Constant) is generally one cycle; the four operand (R1 = R2 + R3 << R4) is generally two cycle.
As with most RISC architectures, you can't embed a full-width constant in an instruction. However, ARM "modified immediate constants" cover the majority of requirements, being able to express any 8-bit value combined with any even rotation (or its bitwise inverse). These are generated using the MOV and/or MVN instructions. Failing that, the assembler will generate a PC-relative load (using the LDR pseudo-instruction lets the assembler optimize this).
SPARC is a special case because of the register windows. Both it and MIPS hardwire a register to zero (Because zero is useful). Itanium has a register stack, which poses other interesting properties; Itanium registers can also be in a "Not a thing" state; this has some really interesting results.
(And I realise I have digressed quite widely from the topic here!)
Re: about performance and code design [8086+]
Brendan wrote: ah, al, bh, bl, ch, cl, dh, dl = 8 byte sized variables (with false dependency problems)
esi/si, edi/di, ebp/bp, esp/sp all count as either word variables or dword variables (but not both)
That gives a maximum of 12 registers, sort of (but only 8 general purpose registers that are truly independent really).

I disagree. You are forgetting the BSWAP opcode. Adding an extra BSWAP between operations slows things down (of course), but it's still faster than going to memory. So if you forget about leaving a register open for a stack frame, and forget about leaving a couple open to do math in, you have:
AL, AH, upper eax for a bswapped word, BL, BH, upper ebx for a bswapped word, ditto for ecx, ditto for edx, and then six more words stored in lower and upper ebp, esi, and edi. For a grand total of 8 byte registers and 10 word registers = 18 total. Or, if you want to play with XCHG, too, you can have a semi-convenient maximum of 24 byte registers + 4 word registers = 28 total. Which is a lot more than 12, if you just use some imagination. And it's an incredible amount more than the *THREE* that GCC typically uses for doing most things. Yeah, it doesn't match the other CPUs out there. I find that it doesn't take more than 10 to do almost any subroutine.
Of course, playing with BSWAP/XCHG gets confusing and unmaintainable in hand-coded ASM. Which is why I'm writing a compiler that will handle the registers in this way for you.
Re: about performance and code design [8086+]
Hi,
bewing wrote: I disagree. You are forgetting the BSWAP opcode. Adding an extra BSWAP between operations slows things down (of course), but it's still faster than going to memory.

If you're going to count BSWAP, then you might as well call it 256 general purpose registers (that are 1-bit each). Hmm - maybe a few more if you can use some of the flags in EFLAGS.
Of course you'd have to do the same for all the other architectures for a fair comparison.
Cheers,
Brendan
Re: about performance and code design [8086+]
Yes, if you have enough boolean variables and you are short on GP registers, they can sometimes be stored in individual bits. But there are problems getting them into and back out of the shifted bit positions in a time faster than what you could get with one stack operation.
And most CPUs do not have BSWAP or XCHG. Many do have fast multi-bit ROL/ROR opcodes that can do similar things, though. But my point above was "how do you optimize code to easily beat what GCC will ever give you", and the answer is "do clever things to have more registers available, and USE them."
And most CPUs do not have BSWAP or XCHG. Many do have fast multi-bit ROL/ROR opcodes that can do similar things though. But my point above was "how do you optimize code to easily beat what GCC will ever give you". And the answer is "do clever things to have more registers available, and USE them."