x86-64 is 64 bit?

stephenj · Post by **stephenj** » Sun May 23, 2010 11:03 am

I've been working on an x86-64 instruction encoder lately (I'm not putting a front end onto it to make it a full assembler), and I've learned that x86-64 lacks some 64-bit features (such as add rcx, 0x123456789AB, and the memory address limitations).

My question is, as far as anyone knows, does Intel/AMD plan on changing the architecture to support more 64-bit operations? Or is it going to remain semi-64, semi-32 bit, even in Long Mode?

Of course, "mov rax, 0x123456789AB / add rcx, rax" would do the trick for my earlier example. But I'd rather not have to handle special cases.
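
The special case described above could be handled inside the encoder along these lines. This is a minimal sketch in Python, not the poster's encoder: the function names, the small register table, and the choice of rax as a scratch register are all assumptions made for illustration. It emits `add r/m64, imm32` (REX.W 81 /0) when the constant fits in a sign-extended 32-bit immediate, and otherwise falls back to `mov r64, imm64` followed by `add r/m64, r64`.

```python
import struct

# Register numbers as used in ModRM/REX (only a few listed; hypothetical table).
REGS = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3, "r8": 8, "r9": 9}

def rex_w(reg=0, rm=0):
    """REX prefix with W=1; R and B carry bit 3 of the reg and rm fields."""
    return 0x48 | ((reg >> 3) << 2) | (rm >> 3)

def modrm_direct(reg, rm):
    """ModRM byte for register-direct addressing (mod = 0b11)."""
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)

def mov_r64_imm64(dst, imm):
    """mov r64, imm64: REX.W B8+rd followed by 8 immediate bytes."""
    d = REGS[dst]
    return bytes([rex_w(rm=d), 0xB8 + (d & 7)]) + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)

def add_r64_r64(dst, src):
    """add r/m64, r64: REX.W 01 /r (reg field = source, rm field = destination)."""
    d, s = REGS[dst], REGS[src]
    return bytes([rex_w(reg=s, rm=d), 0x01, modrm_direct(s, d)])

def add_r64_imm(dst, imm, scratch="rax"):
    """Add a constant to a 64-bit register, falling back to mov+add if needed."""
    d = REGS[dst]
    if -(1 << 31) <= imm < (1 << 31):
        # Fits in imm32, which the CPU sign-extends to 64 bits.
        return bytes([rex_w(rm=d), 0x81, modrm_direct(0, d)]) + struct.pack("<i", imm)
    # Assumption: the caller accepts that the scratch register is clobbered.
    return mov_r64_imm64(scratch, imm) + add_r64_r64(dst, scratch)

print(add_r64_imm("rcx", 0x12345678).hex())
print(add_r64_imm("rcx", 0x123456789AB).hex())
```

The fallback costs a scratch register and 13 bytes instead of 7, which is why a real encoder would probably leave that decision to the caller rather than clobber rax silently.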

NickJohnson · Post by **NickJohnson** » Sun May 23, 2010 12:37 pm

I think they plan to extend the addressing from 48 bit to 64 bit eventually, whenever that limit is reached (which is a pretty big limit; it should take at least 24 years, assuming bloat follows Moore's Law).
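
The 24-year figure can be reproduced with a little arithmetic. The sketch below assumes that "following Moore's Law" means one doubling of memory demand every 18 months, so each doubling uses up one more address bit; the function name and the 18-month period are assumptions, not something stated in the post.

```python
def years_until_exhausted(current_bits=48, target_bits=64, years_per_doubling=1.5):
    """Years until `target_bits` of address space are needed.

    Each additional address bit doubles the addressable space, so the
    number of doublings equals the number of extra bits.
    """
    return (target_bits - current_bits) * years_per_doubling

# 2**48 bytes is 256 TiB of address space today.
print(2**48 // 2**40, "TiB")
print(years_until_exhausted(), "years")
```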

Owen · Post by **Owen** » Sun May 23, 2010 1:01 pm

Consider yourself lucky that you even get 32 bit literals in the instruction words. Power64 has to deal with 16 bit literals. On 32-bit, to load a 32-bit literal one has to do ldi rX, low_bits; orih rX, high_bits; I've forgotten the procedure for ppc64.

fronty · Post by **fronty** » Sun May 23, 2010 2:33 pm

Owen wrote:Consider yourself lucky that you even get 32 bit literals in the instruction words. Power64 has to deal with 16 bit literals. On 32-bit, to load a 32-bit literal one has to do ldi rX, low_bits; orih rX, high_bits; I've forgotten the procedure for ppc64.

This isn't very rare among architectures with fixed instruction length. E.g. SPARC has different immediate lengths, including the 13 bits of integer manipulation instructions and the 22 bits taken by sethi. If you want to move a 32-bit value to a register, you have to do sethi %hi(val), %rX; or %rX, %lo(val), %rX. %hi gets the high 22 bits of a value, %lo gets the low 10 bits.
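
The %hi/%lo split can be modelled in a few lines. This is only an illustrative sketch of the arithmetic, with hypothetical helper names; it models the register contents rather than encoding real SPARC instructions.

```python
def hi(value):
    """High 22 bits of a 32-bit value, as placed in sethi's 22-bit field."""
    return (value & 0xFFFFFFFF) >> 10

def lo(value):
    """Low 10 bits of a 32-bit value, used as the immediate of the following or."""
    return value & 0x3FF

def sethi_or(value):
    """Model the register contents after sethi followed by or."""
    reg = hi(value) << 10   # sethi writes bits 31..10 and clears bits 9..0
    reg |= lo(value)        # or fills in the low 10 bits
    return reg

print(hex(hi(0xDEADBEEF)), hex(lo(0xDEADBEEF)), hex(sethi_or(0xDEADBEEF)))
```

Because the low part is at most 0x3FF, it always fits in the 13-bit signed immediate of the or instruction without any sign-extension surprises.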

stephenj · Post by **stephenj** » Sun May 23, 2010 5:38 pm

How's ARM or MIPS?

At some point I'm going to want to learn another architecture. Right now, I'd probably pick ARM, but MIPS seems very tempting.

Owen · Post by **Owen** » Sun May 23, 2010 5:39 pm

Both ARM and MIPS have sub-word constant sizes. It's pretty much a given on a RISC processor.

nedbrek · Post by **nedbrek** » Mon May 24, 2010 5:55 am

AFAIK, they will never add 64 bit immediates to the ALU functions. Handling immediates is a big pain in hardware, and a 64-bit immediate adds 8 bytes to the instruction length (combined with [SIB+dword imm], that would push the length towards the max of 15). The performance and ease-of-encoding gains are negligible, so there is no motivation.

Owen · Post by **Owen** » Mon May 24, 2010 8:12 am

The maximum length already is 15 bytes. That would push it to 19, which is getting beyond ridiculous.

nedbrek · Post by **nedbrek** » Tue May 25, 2010 5:55 am

Owen wrote:The maximum length already is 15 bytes. That would push it to 19, which is getting beyond ridiculous.

Sorry, I meant, "It would push more instructions towards the 15B limit". The 15B limit is a pretty hard boundary. Going over it would require guarantees that such instructions are never seen on older machines (which might not even know enough to flag it as #UD). Going over 15B is also a pain in hardware, and has little performance or usability benefit.
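
To put rough numbers on the length argument, the sketch below simply adds up the byte counts of the encoding components. The helper name and its defaults are assumptions for illustration, and the imm64 case is hypothetical since no such ALU encoding exists.

```python
MAX_INSTRUCTION_LENGTH = 15  # architectural limit on x86 instruction length

def instruction_length(legacy_prefixes=0, rex=1, opcode=1, modrm=1, sib=0, disp=0, imm=0):
    """Sum the byte counts of the components of one x86-64 instruction."""
    return legacy_prefixes + rex + opcode + modrm + sib + disp + imm

# add qword [base + index*scale + disp32], imm32 (an existing form)
current = instruction_length(sib=1, disp=4, imm=4)

# The same addressing mode with a hypothetical 64-bit immediate
hypothetical = instruction_length(sib=1, disp=4, imm=8)

print(current, hypothetical, hypothetical > MAX_INSTRUCTION_LENGTH)
```

Under these assumptions the existing form is 12 bytes, while the hypothetical form would already be 16 bytes before any legacy prefix is added.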

Currently, 64 bit immediates are limited to "mov r = i64" (I hope I am right in my reading on this...). This is a severely restricted form, which could be uop-ed as "load r = [RIP+off]". Extending this to ALU ops adds a lot of extra pain - again for little benefit.
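
For an encoder, the restricted "mov r = i64" form is usually the last resort when loading a constant. The following sketch, with a hypothetical function name and taking the register as a raw number 0..15, picks the shortest of three standard encodings: `mov r32, imm32` (which zero-extends into the 64-bit register), `mov r/m64, imm32` (sign-extended), and `mov r64, imm64`.

```python
import struct

def mov_r64_const(reg, value):
    """Shortest encoding that loads `value` into 64-bit register number `reg`."""
    value &= 0xFFFFFFFFFFFFFFFF
    rex_b = reg >> 3      # bit 3 of the register number goes into REX.B
    low = reg & 7
    if value < (1 << 32):
        # mov r32, imm32 (B8+rd): a 32-bit write zero-extends to 64 bits.
        prefix = bytes([0x41]) if rex_b else b""
        return prefix + bytes([0xB8 + low]) + struct.pack("<I", value)
    if value >= 0xFFFFFFFF80000000:
        # mov r/m64, imm32 (REX.W C7 /0): imm32 is sign-extended to 64 bits.
        return bytes([0x48 | rex_b, 0xC7, 0xC0 | low]) + struct.pack("<I", value & 0xFFFFFFFF)
    # mov r64, imm64 (REX.W B8+rd): the only form with a full 64-bit immediate.
    return bytes([0x48 | rex_b, 0xB8 + low]) + struct.pack("<Q", value)

for v in (0x12345678, 0xFFFFFFFFFFFFFFFF, 0x123456789AB):
    print(hex(v), mov_r64_const(0, v).hex())
```

The three forms come out at 5, 7, and 10 bytes for rax, so the full imm64 form is only worth emitting when neither extension rule reproduces the constant.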

54616E6E6572 · Post by **54616E6E6572** » Tue May 25, 2010 7:48 am

While an "add reg/mem64, imm64" instruction would be incredibly useful, it would not necessarily provide the best performance. There are many examples of this on the x86 architecture: while these instructions are useful, and generally produce smaller code, internally they are nowhere near as efficient as some of their counterparts.

A perfect example of this is the LOOP instruction. The following code is readable and compact:

label:
    ...
    loop label

It performs the same operation as the following code (apart from the flags, which dec modifies and loop leaves untouched):

label:
    ...
    dec rcx
    jnz label

The first example uses a vector path instruction that takes at least 8 cycles to execute (assuming no-wait state).
The second example uses 2 direct path instructions taking 2 cycles to execute both instructions (assuming a predicted branch).

The second example, despite being larger, less readable, and more complex, ends up executing several times faster. This is because internally the second example is broken down into 2 macro-ops in hardware, while the first example is broken down into 1+ (usually 3+) macro-ops using the MROM (on-chip Microcode-engine ROM).
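
The code-size side of this trade-off is easy to check by encoding both loop tails. The sketch below uses hypothetical helper names and assumes the label sits at the start of a loop body of `body_len` bytes, close enough for a rel8 branch; it says nothing about cycle counts.

```python
import struct

def rel8(disp):
    """Encode a signed 8-bit branch displacement."""
    if not -128 <= disp <= 127:
        raise ValueError("target out of rel8 range")
    return struct.pack("<b", disp)

def loop_tail(body_len):
    """loop label (E2 cb); the displacement is relative to the end of the instruction."""
    return b"\xE2" + rel8(-(body_len + 2))

def dec_jnz_tail(body_len):
    """dec rcx (REX.W FF /1) followed by jnz label (75 cb)."""
    dec_rcx = b"\x48\xFF\xC9"
    return dec_rcx + b"\x75" + rel8(-(body_len + len(dec_rcx) + 2))

print(loop_tail(0).hex(), len(loop_tail(0)))
print(dec_jnz_tail(0).hex(), len(dec_jnz_tail(0)))
```

So the dec/jnz pair costs 5 bytes against 2 for loop, which is the size penalty being traded for the faster direct-path decode described above.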

JamesM · Post by **JamesM** » Tue May 25, 2010 5:04 pm

And what does this tell you? That Intel optimised the common case first. dec/jnz is used by compilers; loop is more often used by programmers writing asm by hand.

nedbrek · Post by **nedbrek** » Wed May 26, 2010 5:44 am

54616E6E6572 wrote:A perfect example of this is the LOOP instruction. While the following code, which provides readability, and a smaller code size:

The first example uses a vector path instruction that takes at least 8 cycles to execute (assuming no-wait state).
The second example uses 2 direct path instructions taking 2 cycles to execute both instructions (assuming a predicted branch).

The second example, despite being larger, less readable, and more complex, ends up executing several times faster. This is because internally the second example is broken down into 2 macro-ops in hardware, while the first example is broken down into 1+ (usually 3+) macro-ops using the MROM (on-chip Microcode-engine ROM).

You are correct (according to the latest optimization guide I could find). The front-end is only prepared to emit one uop for each instruction (Core 2 uses "uop fusion" to increase the number of instructions which generate "one" uop - which is later broken into two by the execution core).

That said, there is a fused uop for cmp-jcc. It seems feasible to emit a dec-jcc as well. I'd have to look at the instruction mix, probably it does not turn up often enough to be a consideration (make the common case fast).

OSDev.org
