Assembly: code optimization
Posted: Fri Mar 11, 2011 6:43 am
by Teehee
Hi.
In a 32-bit system, is there any speed difference if I use 16-bit (or 8-bit) registers instead of 32-bit ones?
Let's suppose a simple piece of 32-bit code, and the same code using 16-bit registers (just to illustrate):
Code: Select all
xor eax,eax
xor ecx,ecx
@@: add eax,5
sub eax,5
inc ecx
cmp ecx,255
jne @b
Code: Select all
xor ax,ax
xor cx,cx
@@: add ax,5
sub ax,5
inc cx
cmp cx,255
jne @b
Is there any possibility that the second code is faster than the first one?
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 7:15 am
by a5498828
The most efficient registers are the 32-bit ones on a 32-bit CPU (or the 64-bit ones if your CPU supports them).
The 8-bit registers are, I believe, as fast as the native ones, just limited to 8 bits. Using 16-bit registers on a machine supporting 32/64-bit operation is slow, though, since every 16-bit instruction needs an operand-size prefix.
So if your CPU supports long mode, the 16-bit registers are mostly legacy stuff; the datapath operates natively on 32/64-bit values.
Full-speed 16-bit operation you only get up to the 80286, I guess.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 8:04 am
by Owen
On x86, in general, using the GPRs with any width provides the same speed. However:
- This is not completely the case. Some operations are slower on the 8-bit and particularly the 16-bit registers.
- For purposes of dependency tracking, the partial registers alias: xL = xH = ExX = RxX are all treated as the same register.
- If you do
Code: Select all
movl (some_address), %eax
add $1, %ax
then the add will be stalled while waiting for the completion of the load
- A few operations, such as DIV, are quicker on the smaller registers (Time taken ∝ register size in bits)
- Operand size prefixes add an extra byte to an instruction, thus wasting some instruction fetch bandwidth
This leads to a few rules of thumb:
- Always use 32-bit registers for n ≤ 32 bit operations, and 64-bit registers for 33 ≤ n ≤ 64 bit operations
- If doing a division, consider if doing a smaller divide and mov[zs]x would be a good idea
- Always load the full target register. Note that in 64-bit mode, writing to an Exx register always zeroes the upper 32 bits of the corresponding Rxx.
- Generally avoid the xH registers. They alias with their corresponding xL register, are in a less useful position, and cannot be used in any REX-prefixed instruction. Remember, however, that you get DIL/SIL/SPL*/BPL*/RnL in 64-bit mode
- In C, on x86, you may get performance benefits by casting to (int16_t) before doing a division. Don't count on the compiler to understand that however. Expect doing this to hurt other architectures, as they don't get free result truncation.
For optimizing, repeat after me: I will not use 16-bit LEA. I will not use 16-bit LEA†. Got that?
Performance information here is taken from AMD's optimization guide & Agner Fog's x86 optimization reference.
* I don't see much use for these two. Well, I accept BPL: you may be eliding your frame pointer. SPL, on the other hand, seems rather useless
† Or anything involving 16-bit addressing
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 10:37 am
by bewing
There are a few other small issues that can complicate what Owen said -- in a real world that is more complicated than your code examples. The upper 16bits of the Exx registers can be used to store variables. These variables can be accessed much more quickly than going to memory. Using the Exx registers preferentially will trash the upper parts of the registers for only a tiny performance boost, whereas using them for variable storage will give you a much better performance boost. And the same goes for the xH registers -- not using them gives you a small boost, using them carefully to store variables gives you much more.
But doing this makes your code more complex, and harder to read and maintain.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 11:45 am
by Owen
bewing wrote:There are a few other small issues that can complicate what Owen said -- in a real world that is more complicated than your code examples. The upper 16bits of the Exx registers can be used to store variables. These variables can be accessed much more quickly than going to memory. Using the Exx registers preferentially will trash the upper parts of the registers for only a tiny performance boost, whereas using them for variable storage will give you a much better performance boost. And the same goes for the xH registers -- not using them gives you a small boost, using them carefully to store variables gives you much more.
But doing this makes your code more complex, and harder to read and maintain.
L1 cache access on modern processors is at least as fast as storing data in the upper half of an Exx register, and in some cases I can see it being much faster. General access to the upper 16 bits would involve
EDIT: I did a microbenchmark which determined that this is not the case, at least on a Core 2. Regardless, I still assert that doing this anywhere except the most heavily used functions is a net loss due to the higher I-cache footprint
Code: Select all
mov %eax, %ebx
shr $16, %ebx
# ... code working on %ebx ...
shl $16, %ebx
and $0x0000ffff, %eax
or %ebx, %eax
vs. an rSP-relative stack access, which can generally be speculated very well and so costs ~1 cycle, never mind being significantly smaller in code size.
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 5:31 pm
by Teehee
Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
Re: Assembly: code optimization
Posted: Fri Mar 11, 2011 6:26 pm
by Owen
Teehee wrote:Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
No. Depends a lot upon the processor. An Athlon 64 will (best case) execute that in 2 clocks, same for a Core 2. An i7 will possibly do that in 1. A 386 will probably take somewhere in the region of 8-12. A Pentium will probably take 4.
There's no hard and fast rule. See your processor's optimization guide.
Re: Assembly: code optimization
Posted: Sat Mar 12, 2011 6:03 am
by Tosi
Anything I say about optimization is probably out of date, since the only processors I ever had to heavily optimize for were 8086-80386 real mode and protected mode for 386 under DOS.
I think some operations, mostly arithmetic, are faster when using eax as a destination. They might also generate slightly smaller opcodes.
For newer processors, one important optimization is to order the instructions to maximize parallelization. So the following:
Code: Select all
mov eax, [esp + 4]
shr eax, 5
mov ebx, [esp + 8]
and ebx, 0xF0F0F0F0
could be coded better as:
Code: Select all
mov eax, [esp + 4]
mov ebx, [esp + 8]
shr eax, 5
and ebx, 0xF0F0F0F0
The goal is to allow the processor to pipeline as many instructions as possible. If an instruction uses EAX right after another instruction that writes it, the second has to wait for the first to finish. Of course, there are always the classics such as unrolling small loops, improving cache performance, and optimizing for branch prediction. For information, there are the Intel and AMD optimization manuals, and lots of websites; just google something like "<your processor> optimization".
Re: Assembly: code optimization
Posted: Sat Mar 12, 2011 7:40 am
by Owen
For the first: use of rAX will help with instruction size for quite a few instructions, with the corresponding resultant decrease in instruction bandwidth requirements.
For the second: it's much less the case on modern superscalar out-of-order processors than it is on in-order cores (such as the P1 and Atom). Having said that, it cannot hurt, and will probably help if, for one reason or another, the instruction decoders have become the bottleneck inside the processor (e.g. when I-cache/memory latency is in play).
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 7:03 am
by turdus
Just to make it clear: x86_64 uses 32-bit operands by default. If you want to use 64-bit operands, your instruction will be 1 byte longer (it needs a register extension prefix, REX; see the Intel or AMD manuals), and a longer instruction takes more time to fetch and decode. The idea behind that is locality: most applications are happy with 4G of memory and with variable ranges of -2 billion to 2 billion.
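For example, the same ADD encodes as follows in long mode (byte values per the AMD/Intel manuals; verify with your assembler's listing output):

```
add eax, 1    ; 83 C0 01       3 bytes (default 32-bit operand size)
add rax, 1    ; 48 83 C0 01    4 bytes (REX.W prefix, 48h)
```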
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 2:23 pm
by Teehee
turdus wrote:The idea behind that is locality, most applications are happy with 4G memory and variable ranges -2billion to 2billion.
Sorry, i didn't understand this part.
Re: Assembly: code optimization
Posted: Fri Mar 18, 2011 2:49 pm
by Tosi
Most applications can be compiled for 32-bit protected mode, or run in long mode using only the 32-bit operations, with little impact on performance. In fact, the largest speed improvements I've seen from optimization come from using SSE instructions, or from using a compiler that can do so automatically. SSE works from any mode (even real mode, I think).
Outside of some modern games, what application needs over 4 GB of RAM? Definitely not an operating system, unless it is for a specific domain.
A typical operating system should use as little memory as possible, and in general be as invisible to the user applications as possible except in system calls.
Re: Assembly: code optimization
Posted: Sat Mar 19, 2011 3:05 am
by Brendan
Hi,
Tosi wrote:Outside of some modern games, what application needs over 4 GB of RAM?
Database management systems (e.g. mySQL) would have to be at the top of that list.
Tosi wrote:Definitely not an operating system, unless it is for a specific domain.
A typical operating system should use as little memory as possible, and in general be as invisible to the user applications as possible except in system calls.
Actually, no.
Free RAM is RAM that is being wasted. An OS should try to make sure that there's very little free RAM, by using any RAM that applications don't need to improve performance. This includes doing things like caching as much file data as possible to minimise the amount of (slow) disk IO needed. If/when an application actually does need the RAM you'd remove something from your caches, etc and give it to the application. If/when an application doesn't need that RAM anymore you'd make the RAM available to your caches, etc (and maybe consider pre-fetching any data that is likely to be needed so it's already in RAM if/when the data is needed).
Imagine a computer that has 3.5 GiB of RAM that's being used as an HTTP/FTP server (where lots of random people are accessing lots of random files). The only software that's running is the kernel, the HTTP/FTP server and a few small daemons; and (combined) they only use 512 MiB of RAM, which leaves 3 GiB of RAM "free". The OS is a 32-bit OS, kernel-space is limited to 2 GiB, and the VFS cache is limited to 1 GiB. The hard disk is a 1 TiB RAID array and it's half full. How much of the 512 GiB of file data can you cache in your 1 GiB VFS cache? How much RAM is actually wasted?
Cheers,
Brendan
Re: Assembly: code optimization
Posted: Sat Mar 19, 2011 6:48 am
by turdus
Teehee wrote:turdus wrote:The idea behind that is locality, most applications are happy with 4G memory and variable ranges -2billion to 2billion.
Sorry, i didn't understand this part.
I'm not a native English speaker, sorry. I was trying to say that it's very unlikely that code jumps over a 4G range (as a matter of fact, 99% of jumps are within 64k; that's called locality), and a normal everyday application does not have to calculate with numbers outside -2^31 to 2^31. That's why AMD's engineers decided to make the default register size 32 bits in 64-bit long mode. If you want to override this you can, but you have to use a special prefix, which makes the code longer, and longer code takes more time to decode.
In other words the default accumulator register in long mode is NOT rax, but eax.
I hope it's clear.
Re: Assembly: code optimization
Posted: Wed Mar 23, 2011 4:27 pm
by JamesM
Owen wrote:Teehee wrote:Thanks all, very useful information.
---
This code has 4 bytes:
Code: Select all
use32
inc eax
inc ecx
inc ebx
inc edx
In a 32-bit system, does that mean the CPU will execute these 4 instructions in a single step (clock?)?
No. Depends a lot upon the processor. An Athlon 64 will (best case) execute that in 2 clocks, same for a Core 2. An i7 will possibly do that in 1. A 386 will probably take somewhere in the region of 8-12. A Pentium will probably take 4.
There's no hard and fast rule. See your processor's optimization guide.
Firstly, i7 has an issue width of 3, as far as I know.
Secondly, it'll take at least n clock ticks to complete where n is the length of the i7's pipeline.