Assembly: code optimization
Posted: Fri Mar 11, 2011 6:43 am
by Teehee
Hi.
In a 32-bit system, is there any speed difference if I use 16-bit (or 8-bit) registers instead of 32-bit ones?
Let's suppose a simple piece of 32-bit code, and the same code using 16-bit registers (just to illustrate):
Code: Select all
xor eax,eax
xor ecx,ecx
@@: add eax,5
sub eax,5
inc ecx
cmp ecx,255
jne @b
Code: Select all
xor ax,ax
xor cx,cx
@@: add ax,5
sub ax,5
inc cx
cmp cx,255
jne @b
Is there any possibility that the second code is faster than the first one?
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 7:15 am
by a5498828
The most efficient registers are the 32-bit ones on a 32-bit CPU (or the 64-bit ones if your CPU supports them).
The 8-bit registers are, I believe, as fast as the native ones, just limited to 8 bits. Using 16-bit registers on a machine supporting 32/64-bit operation is slow, though, since every 16-bit instruction needs an operand-size prefix.
So if your CPU supports long mode, the 16-bit registers are mostly legacy stuff; the datapath operates natively on 32/64-bit values.
Full-speed 16-bit operation you only get up to the 80286, I guess.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 8:04 am
by Owen
On x86, in general, using the GPRs with any width provides the same speed. However:
- This is not completely the case. Some operations are slower on the 8-bit and particularly the 16-bit registers.
- For purposes of dependency tracking, the partial registers alias: xL = xH = ExX = RxX are all treated as the same register.
- If you do
Code: Select all
movl (some_address), %eax
add $1, %ax
then the add will be stalled while waiting for the completion of the load
- A few operations, such as DIV, are quicker on the smaller registers (Time taken ∝ register size in bits)
- Operand size prefixes add an extra byte to an instruction, thus wasting some instruction fetch bandwidth
This leads to a few rules of thumb:
- Always use 32-bit registers for n ≤ 32 bit operations, and 64-bit registers for 33 ≤ n ≤ 64 bit operations
- If doing a division, consider if doing a smaller divide and mov[zs]x would be a good idea
- Always load the full target register. Note that in 64-bit mode, writing to an Exx register always zeroes the upper 32 bits of the corresponding Rxx.
- Generally avoid the xH registers. They alias with their corresponding xL register, are in a less useful position, and cannot be used in any REX-prefixed instruction. Remember, however, that you get DIL/SIL/SPL*/BPL*/RnL in 64-bit mode
- In C, on x86, you may get performance benefits by casting to (int16_t) before doing a division. Don't count on the compiler to understand that however. Expect doing this to hurt other architectures, as they don't get free result truncation.
For optimizing, repeat after me: I will not use 16-bit LEA. I will not use 16-bit LEA†. Got that?
Performance information here is taken from AMD's optimization guide & Agner Fog's x86 optimization reference.
* I don't see much use for these two. Well, I accept BPL: you may be eliding your frame pointer. SPL, on the other hand, seems rather useless
† Or anything involving 16-bit addressing
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 10:37 am
by bewing
There are a few other small issues that can complicate what Owen said -- in a real world that is more complicated than your code examples. The upper 16bits of the Exx registers can be used to store variables. These variables can be accessed much more quickly than going to memory. Using the Exx registers preferentially will trash the upper parts of the registers for only a tiny performance boost, whereas using them for variable storage will give you a much better performance boost. And the same goes for the xH registers -- not using them gives you a small boost, using them carefully to store variables gives you much more.
But doing this makes your code more complex, and harder to read and maintain.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 11:45 am
by Owen
bewing wrote:There are a few other small issues that can complicate what Owen said -- in a real world that is more complicated than your code examples. The upper 16bits of the Exx registers can be used to store variables. These variables can be accessed much more quickly than going to memory. Using the Exx registers preferentially will trash the upper parts of the registers for only a tiny performance boost, whereas using them for variable storage will give you a much better performance boost. And the same goes for the xH registers -- not using them gives you a small boost, using them carefully to store variables gives you much more.
But doing this makes your code more complex, and harder to read and maintain.
L1 cache access on modern processors is at least as fast as storing data in the upper half of an Exx register, and in some cases I can see it being much faster. General access to the upper 16 bits would involve
EDIT: I did a microbenchmark which determined that this is not the case, at least on a Core 2. Regardless, I still assert that doing this anywhere except the most heavily used functions is a net loss due to the higher I-cache footprint
Code: Select all
mov %eax, %ebx
shr $16, %ebx
# ... code working on %ebx ...
shl $16, %ebx
and $0x0000ffff, %eax
or %ebx, %eax
vs. an rSP-relative stack access, which can generally be speculated very well and so costs ~1 cycle, never mind being significantly smaller in code size.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 5:31 pm
by Teehee
Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 6:26 pm
by Owen
Teehee wrote:Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
No. Depends a lot upon the processor. An Athlon 64 will (best case) execute that in 2 clocks, same for a Core 2. An i7 will possibly do that in 1. A 386 will probably take somewhere in the region of 8-12. A Pentium will probably take 4.
There's no hard and fast rule. See your processor's optimization guide.
Re: Assembly: code optimization
Posted: Sat Mar 12, 2011 6:03 am
by Tosi
Anything I say about optimization is probably out of date, since the only processors I ever had to heavily optimize for were 8086-80386 real mode and protected mode for 386 under DOS.
I think some operations, mostly arithmetic, are faster when using eax as a destination. They might also generate slightly smaller opcodes.
For newer processors, one important optimization is to order the instructions to maximize parallelization. So the following:
Code: Select all
mov eax, [esp + 4]
shr eax, 5
mov ebx, [esp + 8]
and ebx, 0xF0F0F0F0
could be coded better as:
Code: Select all
mov eax, [esp + 4]
mov ebx, [esp + 8]
shr eax, 5
and ebx, 0xF0F0F0F0
The goal is to allow the processor to pipeline as many instructions as possible. If an instruction uses EAX right after another instruction that writes it, the second has to wait for the first to finish. Of course, there are always the classics such as unrolling small loops, improving cache performance, and optimizing for branch prediction. For information, there are the Intel and AMD optimization manuals, and lots of websites; just google something like "<your processor> optimization".
Re: Assembly: code optimization
Posted: Sat Mar 12, 2011 7:40 am
by Owen
For the first: use of rAX will help with instruction size for quite a few instructions, with the corresponding resultant decrease in instruction bandwidth requirements.
For the second: it's much less the case on modern superscalar out-of-order processors than it is on in-order cores (such as the P1 and Atom). Having said that, it cannot hurt, and will probably help if, for one reason or another, the instruction decoders have become the bottleneck inside the processor (e.g. when I-cache/memory latency is in play).
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 7:03 am
by turdus
Just to make it clear: x86_64 uses 32-bit operands by default. If you want to use 64-bit operands, your instruction will be 1 byte longer (it needs a register extension prefix, REX; see the Intel or AMD manuals), and a longer instruction takes more time to fetch and decode. The idea behind that is locality: most applications are happy with 4G of memory and with variable ranges of -2 billion to 2 billion.
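For example, the same ADD encodes as follows in long mode (byte values per the AMD/Intel manuals; verify with your assembler's listing output):

```
add eax, 1    ; 83 C0 01       3 bytes (default 32-bit operand size)
add rax, 1    ; 48 83 C0 01    4 bytes (REX.W prefix, 48h)
```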
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 2:23 pm
by Teehee
turdus wrote:The idea behind that is locality, most applications are happy with 4G memory and variable ranges -2billion to 2billion.
Sorry, i didn't understand this part.
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 2:49 pm
by Tosi
Most applications can be compiled for 32-bit protected mode, or run in long mode using only the 32-bit operations, with little impact on performance. In fact, the largest speed improvements I've seen from optimization come from using SSE instructions, or from using a compiler that can do so automatically. SSE works from any mode (even real mode, I think).
Outside of some modern games, what application needs over 4 GB of RAM? Definitely not an operating system, unless it is for a specific domain.
A typical operating system should use as little memory as possible, and in general be as invisible to the user applications as possible except in system calls.
Re: Assembly: code optimization
Posted: Sat Mar 19, 2011 3:05 am
by Brendan
Hi,
Tosi wrote:Outside of some modern games, what application needs over 4 GB of RAM?
Database management systems (e.g. mySQL) would have to be at the top of that list.
Tosi wrote:Definitely not an operating system, unless it is for a specific domain.
A typical operating system should use as little memory as possible, and in general be as invisible to the user applications as possible except in system calls.
Actually, no.
Free RAM is RAM that is being wasted. An OS should try to make sure that there's very little free RAM, by using any RAM that applications don't need to improve performance. This includes doing things like caching as much file data as possible to minimise the amount of (slow) disk IO needed. If/when an application actually does need the RAM you'd remove something from your caches, etc and give it to the application. If/when an application doesn't need that RAM anymore you'd make the RAM available to your caches, etc (and maybe consider pre-fetching any data that is likely to be needed so it's already in RAM if/when the data is needed).
Imagine a computer that has 3.5 GiB of RAM that's being used as an HTTP/FTP server (where lots of random people are accessing lots of random files). The only software that's running is the kernel, the HTTP/FTP server and a few small daemons; and (combined) they only use 512 MiB of RAM, which leaves 3 GiB of RAM "free". The OS is a 32-bit OS, kernel-space is limited to 2 GiB, and the VFS cache is limited to 1 GiB. The hard disk is a 1 TiB RAID array and it's half full. How much of the 512 GiB of file data can you cache in your 1 GiB VFS cache? How much RAM is actually wasted?
Cheers,
Brendan
Re: Assembly: code optimization
Posted: Sat Mar 19, 2011 6:48 am
by turdus
Teehee wrote:turdus wrote:The idea behind that is locality, most applications are happy with 4G memory and variable ranges -2billion to 2billion.
Sorry, i didn't understand this part.
I'm not a native English speaker, sorry. I was trying to say that it's very unlikely that code jumps over a 4G range (as a matter of fact, 99% of jumps are within 64k; that's called locality), and a normal everyday application does not have to calculate with numbers outside -2^31 to 2^31. That's why AMD's engineers decided to make the default register size 32 bits in 64-bit long mode. If you want to override this you can, but you have to use a special prefix, which makes the code longer, and longer code takes more time to decode.
In other words the default accumulator register in long mode is NOT rax, but eax.
I hope it's clear.
Re: Assembly: code optimization
Posted: Wed Mar 23, 2011 4:27 pm
by JamesM
Owen wrote:Teehee wrote:Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
No. Depends a lot upon the processor. An Athlon 64 will (best case) execute that in 2 clocks, same for a Core 2. An i7 will possibly do that in 1. A 386 will probably take somewhere in the region of 8-12. A Pentium will probably take 4.
There's no hard and fast rule. See your processor's optimization guide.
Firstly, i7 has an issue width of 3, as far as I know.
Secondly, it'll take at least n clock ticks to complete where n is the length of the i7's pipeline.