How exactly does the CPU fetch instructions from RAM?

nikito
Member
Posts: 42
Joined: Thu Jul 15, 2010 7:16 pm

How exactly does the CPU fetch instructions from RAM?

Post by nikito »

Hi!

I coded a dummy test:

Code: Select all

mov RAX, 1000000
dd "BLNC"
mov EDX, 1
dd "BLNC"
dummy_code:
dec EAX
dd "BLNC"
add EAX, EDX
dd "BLNC"
dec RAX
dd "BLNC"
cmp EAX, 0
dd "BLNC"
jne dummy_code 
dd "BLNC"
jmp $
dd "BLNC"
Then I assembled it to a flat binary just to read it, not to execute it, and I saw that the instructions are of variable length.
So my question is: what operand size does the processor use when reading instructions from RAM? And how does it optimize performance when it sometimes has to read a word and other times more than a quadword to fetch a single instruction?

Thanks!
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: How exactly does the CPU fetch instructions from RAM?

Post by Candy »

I recommend downloading hiew (hacker's view), put it in disassembly mode and start typing arbitrary byte sequences. You'll figure out the logic behind it pretty soon.

Alternatively, download the instruction manuals and see what they say. The CPU will look at the first byte(s) and decide it's opcode "X", which has three arguments: one encoded inside the opcode, one in a single byte behind it, and one in a 4-byte field behind that.
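
That decode logic can be sketched as a toy length decoder. This covers only a handful of real one-byte x86 opcodes (an actual decoder must also handle prefixes, ModRM/SIB bytes and much more); the function names and the chosen subset are purely illustrative:

```python
# Toy length decoder for a tiny subset of 32-bit x86 encodings.
# Real decoding also involves prefixes, ModRM, SIB, displacements, etc.

def insn_length(code: bytes, off: int) -> int:
    """Return the length in bytes of the instruction starting at off."""
    op = code[off]
    if op in (0x90, 0xC3):        # nop, ret: single-byte instructions
        return 1
    if op == 0x74:                # jz rel8: opcode + 1-byte offset
        return 2
    if 0xB8 <= op <= 0xBF:        # mov r32, imm32: opcode + 4-byte immediate
        return 5
    if op == 0x05:                # add eax, imm32: opcode + 4-byte immediate
        return 5
    raise ValueError(f"opcode {op:#x} not in this toy subset")

def instruction_offsets(code: bytes) -> list[int]:
    """Walk the byte stream; each boundary depends on decoding the previous one."""
    offsets, off = [], 0
    while off < len(code):
        offsets.append(off)
        off += insn_length(code, off)
    return offsets

# mov eax, 3 ; add eax, 4 ; ret  -- three instructions of 5, 5 and 1 bytes
code = bytes([0xB8, 3, 0, 0, 0, 0x05, 4, 0, 0, 0, 0xC3])
print(instruction_offsets(code))   # → [0, 5, 10]
```

Note how the loop cannot know where instruction N+1 starts until instruction N has been (at least partially) decoded — the serial dependency discussed later in this thread.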



Incidentally, if you think that decoding this must be slow, it is. The K6-3 was slow in that it could only do two opcodes per cycle. The Athlon was the first with three opcodes per cycle. They may have accomplished 4 by now, but I doubt it. This instruction decoding is the main reason your x86 cpu is slow.
nikito
Member
Posts: 42
Joined: Thu Jul 15, 2010 7:16 pm

Re: How exactly does the CPU fetch instructions from RAM?

Post by nikito »

Candy wrote:Incidentally, if you think that decoding this must be slow, it is. The K6-3 was slow in that it could only do two opcodes per cycle. The Athlon was the first with three opcodes per cycle. They may have accomplished 4 by now, but I doubt it. This instruction decoding is the main reason your x86 cpu is slow.
This is exactly what I've been afraid of. I can also imagine that when the CPU reads the opcode for a "mov R**, imm64" (for example), it needs to go to RAM one more time to fetch the rest of an instruction that crosses the 64-bit boundary (I imagine an operand size of 64 bits for a 64-bit capable CPU when reading code). I was looking at the Intel documentation last night, but found no information about decoded instructions per cycle. I will read more.

I am interested in this from the perspective of SMC too. If I need to replace an instruction with a larger one, do I have to shift the rest of the program to the right to free up space?!
Candy wrote:I recommend downloading hiew (hacker's view), put it in disassembly mode and start typing arbitrary byte sequences. You'll figure out the logic behind it pretty soon.
Thanks for the recommendation, I will try it next time I clean my PC.
skyking
Member
Posts: 174
Joined: Sun Jan 06, 2008 8:41 am

Re: How exactly does the CPU fetch instructions from RAM?

Post by skyking »

Not necessarily. Reads from RAM can be done in larger chunks, e.g. for entire cache lines.
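
As a back-of-the-envelope sketch of that point (assuming 64-byte cache lines, as on modern x86): an instruction only needs an extra fetch when it straddles a line boundary, not every time it is longer than some bus width. The helper name here is just for illustration:

```python
# How many 64-byte cache lines an instruction occupies, given its start
# address and length.  Most instructions land entirely within one line.

LINE = 64  # bytes per cache line (typical for modern x86)

def lines_touched(addr: int, insn_len: int) -> int:
    """Count the cache lines spanned by [addr, addr + insn_len)."""
    first = addr // LINE
    last = (addr + insn_len - 1) // LINE
    return last - first + 1

print(lines_touched(0x100, 5))   # → 1  (fits inside one line)
print(lines_touched(0x13E, 5))   # → 2  (starts 2 bytes before a boundary)
```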

If you have to replace an instruction on a variable-instruction-length architecture, you will need to pad it with NOPs so that the instruction plus the NOPs occupies the maximum number of bytes that any of the possible replacement instructions may take.
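
That padding scheme can be sketched in a few lines. The byte values are real x86 encodings (0x90 is the one-byte NOP), but the `patch_slot` helper is illustrative, not a real patching API:

```python
# Reserve a fixed-size slot for a patchable instruction and fill the unused
# tail with 1-byte NOPs (0x90 on x86), so a shorter replacement never forces
# the rest of the code to move.

NOP = 0x90

def patch_slot(buf: bytearray, start: int, slot_len: int, new_insn: bytes) -> None:
    """Overwrite the slot [start, start+slot_len) with new_insn plus NOP padding."""
    if len(new_insn) > slot_len:
        raise ValueError("replacement larger than the reserved slot")
    buf[start:start + slot_len] = new_insn + bytes([NOP] * (slot_len - len(new_insn)))

# A 5-byte slot initially holding 'mov eax, 1' (B8 01 00 00 00) ...
buf = bytearray([0xB8, 1, 0, 0, 0])
# ... replaced in place by the 2-byte 'xor eax, eax' (31 C0) plus three NOPs.
patch_slot(buf, 0, 5, bytes([0x31, 0xC0]))
print(buf.hex())   # → 31c0909090
```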
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: How exactly does the CPU fetch instructions from RAM?

Post by Candy »

Your computer does everything it can to go faster, of course. There's decades of "let's make this **** architecture faster" behind it.

Your CPU loads cache lines in 64-byte chunks (that's bytes, not bits). It reads those into multiple cache levels so you can have both a bigger and a faster cache. The topmost I-cache level (on some CPUs at least) is not in x86 opcodes but in native opcodes of the CPU, which are a constant-size expansion of the x86 instructions. It uses these to run through the x86 opcodes quicker - basically, by not executing them directly but a faster, already-translated equivalent. In the places where this fails, the CPU slows down *incredibly*.

Try doing SMC (self-modifying code) on a Pentium 4. It works, but every instruction change in the same cache line causes a full D-cache line flush, a full I-cache flush for the relevant entries (or just all of them), a full pipeline flush and possibly more, and then reloads of the very same.
nikito
Member
Posts: 42
Joined: Thu Jul 15, 2010 7:16 pm

Re: How exactly does the CPU fetch instructions from RAM?

Post by nikito »

Thank you for the help, skyking!

@Candy:

This clarification helped me a lot. Intel's documentation talks about a significant performance penalty when using SMC, and now I understand why. It sounds tremendous. Now I will try to redesign my code to avoid SMC at all costs.

Thanks to all!

niki
skyking
Member
Posts: 174
Joined: Sun Jan 06, 2008 8:41 am

Re: How exactly does the CPU fetch instructions from RAM?

Post by skyking »

The performance penalty with SMC should only be present when the code is actually changed (or loaders would be in deep trouble). You might use this to improve performance by having a "modified code cache" where you keep the modified versions of your code and if one specific modification of the code is used again later you might avoid having to do the modification of the code (this is kind of what qemu does).
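
A minimal sketch of that "modified code cache" idea, with made-up generator and class names (qemu's translation-block cache is far more sophisticated, but rests on the same principle of paying the generation cost only once per variant):

```python
# Cache each generated code variant keyed by its parameters, so the SMC
# penalty (patch + cache/pipeline flush) is paid at most once per variant.

def generate_variant(template: bytes, imm: int) -> bytes:
    """Pretend code generator: splice a 32-bit immediate into a template."""
    return template[:1] + imm.to_bytes(4, "little")

class CodeCache:
    def __init__(self, template: bytes):
        self.template = template
        self.cache: dict[int, bytes] = {}
        self.misses = 0

    def get(self, imm: int) -> bytes:
        if imm not in self.cache:          # generate (and pay the cost) only once
            self.misses += 1
            self.cache[imm] = generate_variant(self.template, imm)
        return self.cache[imm]

cache = CodeCache(bytes([0xB8, 0, 0, 0, 0]))   # 'mov eax, imm32' template
a = cache.get(7)
b = cache.get(7)               # second request reuses the cached variant
print(a.hex(), cache.misses)   # → b807000000 1
```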
nikito
Member
Posts: 42
Joined: Thu Jul 15, 2010 7:16 pm

Re: How exactly does the CPU fetch instructions from RAM?

Post by nikito »

I understand that invalidating the cache is not the same as disabling it. But even so, this seems to me like a significant decrease in performance that I should try to avoid. Keeping the cache in mind, I can think of many ways to optimize SMC (better said: minimize its impact), but for now I will just try to avoid using it. After all, I am still a novice.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: How exactly does the CPU fetch instructions from RAM?

Post by Solar »

Self-modifying code can be a mighty tool in the hands of the skilled.

For all others it's an excellent way to shoot yourself in the foot. Not only performance-wise; it's also hell to debug and maintain.

Actually, the debug / maintenance thing is still true for the skilled.
Every good solution is obvious once you've found it.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: How exactly does the CPU fetch instructions from RAM?

Post by Brendan »

Hi,
Candy wrote:Incidentally, if you think that decoding this must be slow, it is. The K6-3 was slow in that it could only do two opcodes per cycle. The Athlon was the first with three opcodes per cycle. They may have accomplished 4 by now, but I doubt it. This instruction decoding is the main reason your x86 cpu is slow.
I'm not sure that's the whole story.

In theory, with fixed length instructions the decoder is much simpler and can decode instructions faster, but the code is less compact and you spend more time fetching instructions from cache and RAM; so you probably lose what you gain and end up with similar performance in the end.

Of course for fixed length instructions you could have a very fast (and therefore very small) "pre-fetch buffer" to help avoid the cost of fetching instructions from L1 cache; and for variable length instructions you could have a very fast (and therefore very small) "micro-code buffer" to help avoid the cost of decoding (and fetching instructions).

The other thing to consider is backward compatibility and future extensions - with fixed length instructions it's much harder to add features that weren't part of the original design while keeping backward compatibility. For example, if the original 8086 had used fixed length instructions it would've been almost impossible to extend it to 32-bit (without shifting to variable length instructions), and then even harder to extend it again to 64-bit. Backward compatibility means more sales, which means more $$$ for research and development, which means better performance. That's why (despite using variable length instructions) a CPU like Nehalem can still decode up to 5 instructions per cycle (and can turn its decoder off for tight loops when it's running from its "micro-code buffer"), and why 80x86 dominates the laptop/desktop/server market in terms of "bang per buck".


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: How exactly does the CPU fetch instructions from RAM?

Post by Candy »

Brendan wrote:
Candy wrote:Incidentally, if you think that decoding this must be slow, it is. The K6-3 was slow in that it could only do two opcodes per cycle. The Athlon was the first with three opcodes per cycle. They may have accomplished 4 by now, but I doubt it. This instruction decoding is the main reason your x86 cpu is slow.
I'm not sure that's the whole story.

In theory, with fixed length instructions the decoder is much simpler and can decode instructions faster, but the code is less compact and you spend more time fetching instructions from cache and RAM; so you probably lose what you gain and end up with similar performance in the end.
You can design a memory system around needing 16 bytes of new data per cycle from the I-cache. You can't design around not knowing where the second instruction starts until you've figured out how long the first one is. That's why it's such a major pain - each instruction you decode is very dependent on the previous instruction. I suspect that newer processors have a pre-opcode-selection phase where they figure out how you would decode an instruction at offset X from IP (i.e., how long it is), an opcode-start-selection phase where each of them decides where its opcode actually starts, and then reading & actual decoding. That way you minimize the latency.
Of course for fixed length instructions you could have a very fast (and therefore very small) "pre-fetch buffer" to help avoid the cost of fetching instructions from L1 cache; and for variable length instructions you could have a very fast (and therefore very small) "micro-code buffer" to help avoid the cost of decoding (and fetching instructions).
IIRC that latter one is the I-cache. The prefetch buffer is a lot easier to design and can be guaranteed not to underrun in complex ways.
The other thing to consider is backward compatibility and future extensions - with fixed length instructions it's much harder to add features that weren't part of the original design while keeping backward compatibility. For example, if the original 8086 had used fixed length instructions it would've been almost impossible to extend it to 32-bit (without shifting to variable length instructions), and then even harder to extend it again to 64-bit. Backward compatibility means more sales, which means more $$$ for research and development, which means better performance. That's why (despite using variable length instructions) a CPU like Nehalem can still decode up to 5 instructions per cycle (and can turn its decoder off for tight loops when it's running from its "micro-code buffer"), and why 80x86 dominates the laptop/desktop/server market in terms of "bang per buck".
And of course, there's a very good rationale why X86 is the stuff we're using right now. It's not just variable-length opcodes though.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: How exactly does the CPU fetch instructions from RAM?

Post by Owen »

Candy is correct, at least for AMD CPUs: the bits which in a data/unified cache would be used for ECC are used in the L1 instruction cache to encode start/stop, prefix and similar indicators; the instructions are protected only by parity rather than by the full ECC that data gets. After all, if your instruction gets corrupted, you can just fetch it again. While the K8 can do 3 instruction decodes per cycle, and the K10 can do 4, I believe this only holds if you have a primed instruction cache (or perhaps if your sequence of instructions is lucky, in that it's one of the layouts common enough that the processor was optimized to try it).

As for variable vs fixed length: Given a fixed length instruction set, extending it to be variable length is easy; granted, it may not have the highest coding efficiency, but then again, x86 doesn't either.

However, as an effective instruction set, x86 falls down in quite a few ways:
  1. Prefix bytes which can come in any order
  2. Prefix bytes in their entirety (though those which come in one place are much better!)
  3. Variable length opcodes
  4. Variable length postfixes
Now, I need to elaborate on the third point. Variable-length opcodes aren't, in and of themselves, a problem, as long as the first coding word establishes a length for the instruction up front.

And yes, that's the deal: if you want an efficient variable-length architecture, make how long the instruction is very easy to find. Perhaps consider using 16-bit coding words; it gives you much more room to play with, and you'll still be able to beat x86 for code density!

(Remember: ARM Thumb-2 averages 1.1x the size of x86 for the same source code. Thumb-2 isn't the cleanest of variable-length RISC instruction sets; for a start, in the 32-bit instructions the first 16 bits aren't pulling their weight due to the need for backwards compatibility.)
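
That length-up-front idea can be sketched with a made-up 16-bit encoding in which the top bits of the first halfword give the total length (Thumb-2's actual rule differs; the bit pattern and halfword values below are invented for illustration):

```python
# Hypothetical variable-length encoding built from 16-bit coding words:
# if the top three bits of the first halfword are 111, the instruction is
# two halfwords long, otherwise one.  Length is known from the first word
# alone, so boundaries can be found without decoding anything else.

def insn_length_halfwords(first_word: int) -> int:
    """Made-up rule: top three bits 0b111 mark a 2-halfword instruction."""
    return 2 if (first_word >> 13) == 0b111 else 1

stream = [0x2001, 0xE005, 0x1234, 0x4770, 0xF000]  # invented halfword stream
off, boundaries = 0, []
while off < len(stream):
    boundaries.append(off)
    off += insn_length_halfwords(stream[off])
print(boundaries)   # → [0, 1, 3, 4]
```

Because each length depends on one halfword only, a hardware pre-decoder could compute all the candidate lengths in parallel, unlike x86 where a byte's meaning depends on every byte before it.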
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: How exactly does the CPU fetch instructions from RAM?

Post by Brendan »

Hi,
Candy wrote:
Brendan wrote:In theory, with fixed length instructions the decoder is much simpler and can decode instructions faster, but the code is less compact and you spend more time fetching instructions from cache and RAM; so you probably lose what you gain and end up with similar performance in the end.
You can design a memory system around needing 16 bytes of new data per cycle from the I-cache. You can't design around not knowing where the second instruction starts until you've figured out how long the first one is. That's why it's such a major pain - each instruction you decode is very dependent on the previous instruction. I suspect that newer processors have a pre-opcode-selection phase where they figure out how you would decode an instruction at offset X from IP (i.e., how long it is), an opcode-start-selection phase where each of them decides where its opcode actually starts, and then reading & actual decoding. That way you minimize the latency.
That's how Nehalem and Core 2 do it - in both cases there's a 32 KiB I-cache, then a "pre-decode" stage, then an instruction queue. Branch prediction also happens in this stage. This means that the decoders themselves receive an "assumed contiguous" stream of instructions (where sometimes the assumption is wrong due to branch misprediction, etc and it speculatively decodes/executes instructions) where each instruction is 1 entry in the instruction queue (the decoders don't need to know/care about the original instruction's length).
Candy wrote:
Brendan wrote:Of course for fixed length instructions you could have a very fast (and therefore very small) "pre-fetch buffer" to help avoid the cost of fetching instructions from L1 cache; and for variable length instructions you could have a very fast (and therefore very small) "micro-code buffer" to help avoid the cost of decoding (and fetching instructions).
IIRC that latter one is the I-cache. The prefetch buffer is a lot easier to design and can be guaranteed not to underrun in complex ways.
The I-cache is before the decoder, not after it. For Nehalem, between the decoder and the instruction units there's a 28 micro-op buffer, a 128 entry re-order buffer, then a 36 entry reservation station. Core 2 is mostly the same (just smaller buffers). These buffers serve different purposes, but they also help to hide any delays in the decoders.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
nikito
Member
Posts: 42
Joined: Thu Jul 15, 2010 7:16 pm

Re: How exactly does the CPU fetch instructions from RAM?

Post by nikito »

All this is helpful to know, but I wonder if there is a way to observe the actions of the caches. I mean issuing a prefetch transaction (for example) from RAM, and somehow knowing whether the data has been cached and how long that took. Something like, let's say, debugging the caches. I don't know if I'm explaining myself well.

And yes, I'm talking about a real machine, not an emulated one.
User avatar
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: How exactly does the CPU fetch instructions from RAM?

Post by Solar »

nikito wrote:All this is helpful to know, but I wonder if there is a way to observe the actions of the caches. [...] Something like, let's say, debugging the caches.
Cachegrind and Callgrind from the Valgrind suite spring to mind. KCachegrind provides a visualisation tool for the output.
Every good solution is obvious once you've found it.