OSDev's dream CPU

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: OSDev's dream CPU

Post by rdos »

Owen wrote:When booted from EFI, one can't assume that the graphics card (assuming it is a VGA-compatible controller, which is likely today but I expect increasingly unlikely in the future) is in text mode, or is even in a legacy mode that can in any way be configured without a dedicated graphics driver
That assumes people actually buy these machines, and that sellers want to carry them when they are more or less locked to a predefined OS. The concept seems to work in the mobile-phone market, but that doesn't mean it also works for PCs. People usually expect more from a PC; they especially want to run their preferred apps.
Owen wrote:Most aren't. Expect Windows 8 to change this, as it mandates UEFI anyway. Some machines being sold, however, are already legacy-free, and you have been able to buy mostly legacy-free machines since 2005
OTOH, Windows 8 might become a big flop, especially if Microsoft ships it with a mobile-phone GUI and makes it impossible for users to install their OS of choice. After all, the Xbox, which used this concept, wasn't sold in large numbers to ordinary PC users.

Besides, it doesn't cost PC manufacturers much to support reasonable legacy functionality. The hardware part is implemented in chipsets, and the software part is simply included from the major BIOS vendors. As long as chipset manufacturers support legacy functions and BIOS vendors include the software support, I don't see legacy functionality disappearing rapidly.
Virtlink
Member
Posts: 34
Joined: Thu Jun 05, 2008 3:53 pm
Location: The Netherlands

Re: OSDev's dream CPU

Post by Virtlink »

My ideal CPU would be more high-level than most. The advantage of a high-level CPU is that it has more knowledge about the software it runs, and can use this information for run-time optimizations. For example, having a single possible calling convention and providing just one call and one ret instruction allows the CPU to do anything to make the call happen, without software caring about how it is done. Today the difference between non-load/store (CISC) and load/store (RISC) processors has become much smaller: transistors became smaller and cheaper, instructions are heavily pipelined and parallelized, and non-load/store processors are viable again.

Memory and registers
My CPU would have 64-bit virtual and physical addresses, but would not require the availability of the whole virtual and/or physical address space. For virtual addresses, fewer than 64 bits may actually be used (a la AMD64), resulting in a higher half and a lower half of the virtual memory. This is convenient for embedded variants of the CPU, which can settle for e.g. a 32-bit virtual address space.
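
For comparison, AMD64 does exactly this today: only the low 48 bits of a virtual address are significant, and bits 63..48 must be copies of bit 47, which is what splits the space into a lower half and a higher half. A minimal C sketch of the check (the 48-bit width is AMD64's current choice; an embedded variant with a smaller space would apply the same rule at a different width):

Code: Select all

    #include <stdbool.h>
    #include <stdint.h>

    /* Canonical iff sign-extending the low 48 bits reproduces the value,
     * i.e. bits 63..48 are all copies of bit 47. */
    static bool is_canonical48(uint64_t va)
    {
        return (uint64_t)(((int64_t)(va << 16)) >> 16) == va;
    }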

The CPU has between 32 and 256 64-bit general-purpose (integer) registers, between 8 and 256 256-bit SIMD/floating-point registers (or no SIMD support at all), and a special per-thread 64-bit BP register indicating the base of the code. Again, embedded versions may settle on the lower end of these ranges. You can write your software to work with the minimum number of registers, or you may want to run managed software with a just-in-time compiler that takes the CPU's characteristics into account.

The CPU also tracks the free physical pages, and it can do that very fast.

Stack
Stacks are 64-bit. In my ideal CPU they are managed solely by the CPU, which also manages the stack frames. There are instructions to allocate a stack, destroy it, allocate a chunk of stack memory, call procedures, return from procedures and, of course, push and pop values.

The CPU is also responsible for moving arguments on the stack to the callee, and for moving results back to the caller. The CPU can do this extremely fast, and might even use hidden caches or registers for it.

The CPU-managed stack opens up all kinds of new possibilities. For example, my CPU would also manage exception handling: no more magic numbers or error codes, but a pointer (which presumably points to some exception object, though the CPU does not care) that is propagated up the call chain. As the CPU knows the exact stack frames, it can unwind the stack and run the exception filters, handlers and finalization code it encounters on its way. If the stack ever runs out, the CPU runs out of memory, or there is a division by zero, the CPU throws the appropriate exception (instead of raising an interrupt).
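
As a rough software model of what the CPU would be doing during such an unwind (the frame layout and the names here are purely illustrative, not part of the design above): every frame the CPU built records an optional handler and finalizer, and a thrown pointer walks the frame chain until some handler accepts it.

Code: Select all

    #include <stddef.h>

    /* Illustrative model of CPU-managed unwinding. */
    struct frame {
        struct frame *caller;        /* call chain maintained by the CPU  */
        int (*handler)(void *exc);   /* returns nonzero if it handles exc */
        void (*finalizer)(void);     /* "finally" code for this frame     */
    };

    static void cpu_throw(struct frame *top, void *exc)
    {
        for (struct frame *f = top; f != NULL; f = f->caller) {
            if (f->finalizer)
                f->finalizer();                /* finalization on the way up   */
            if (f->handler && f->handler(exc))
                return;                        /* frame accepted the exception */
        }
        /* unhandled: a real CPU might stop the thread here */
    }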

Also, my CPU would manage context switches and threads. This follows naturally, since all stacks are managed by the CPU and the CPU already maintains the call stack for each thread. The CPU can optimize the saving of state as it sees fit.

Instruction set
The instruction set is a non-load/store, variable-length instruction set, again with the argument that higher-level instructions may give the CPU more freedom to optimize. Each instruction consists of a 2-byte opcode, a 1-byte addressing mode specifier, and up to 4 operands.

The first opcode byte selects the opcode map to use: there are 128 maps available for general use and extensions (such as SIMD extensions) and 128 for vendor-specific extensions. Each opcode map contains up to 255 instructions (encoded in the second opcode byte). This gives a theoretical maximum of 32640 general instructions and 32640 vendor-specific instructions.
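
A sketch of the arithmetic and one obvious encoding (the general/vendor split on the top bit of the map byte is my assumption, not part of the design above):

Code: Select all

    #include <stdint.h>

    /* First byte picks the map (0-127 general, 128-255 vendor-specific,
     * assuming a top-bit split); the second byte picks one of up to 255
     * entries, so each side offers 128 * 255 = 32640 opcodes. */
    static uint16_t make_opcode(uint8_t map, uint8_t entry)
    {
        return (uint16_t)((uint16_t)map << 8 | entry);
    }

    static int is_vendor_specific(uint16_t opcode)
    {
        return (opcode >> 8) >= 128;   /* map byte has its top bit set */
    }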

CPU functions
Instructions tend to be small, quick things that execute in the blink of an eye. However, much of the basic functionality in, for example, the C library is required on all computers, and people keep coming up with ever more sophisticated methods to squeeze that tiny extra bit of performance out of memcpy and memset. In my CPU there are special CPU functions that perform such operations. These are usually not finished within a single CPU cycle, but the CPU has the ability to do them as optimally as possible. When the CPU vendor wants to speed up 99% of programs, all it has to do is spend some more transistors in the memcpy area. On the other hand, an embedded version might include a 1 MiB on-die ROM with machine-code implementations of these functions.
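
One way to picture such "CPU functions" in software (the numbering and the dispatch table are entirely hypothetical): the architecture fixes only the function numbers, and each vendor backs them with whatever dedicated hardware or on-die ROM code it likes.

Code: Select all

    #include <stddef.h>
    #include <string.h>

    /* Architectural function numbers: stable across implementations. */
    enum cpu_func { CPUF_MEMCPY, CPUF_COUNT };

    /* One vendor's implementation; another could point this at microcode
     * or at a dedicated copy engine instead. */
    static void *impl_memcpy(void *d, const void *s, size_t n)
    {
        return memcpy(d, s, n);
    }

    static void *(*const cpu_func_table[CPUF_COUNT])(void *, const void *, size_t) = {
        [CPUF_MEMCPY] = impl_memcpy,
    };

    /* The "instruction": dispatch by architectural number only. */
    static void *cpu_call(enum cpu_func f, void *d, const void *s, size_t n)
    {
        return cpu_func_table[f](d, s, n);
    }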
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Virtlink wrote:Each instruction consists of a 2-byte opcode, a 1-byte addressing mode specifier, and up to 4 operands.
Every operand in a non-load/store ISA needs its own addressing mode. So it is better to include the addressing mode in the operand encoding.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: OSDev's dream CPU

Post by Owen »

16-bit opcodes are a gross waste of (valuable) memory bandwidth, as are 1-byte operands. In general, I find byte-based thinking to be a trait of people who have spent too much time dealing with software (which is used to thinking in byte quantities) and not enough with hardware (which can very easily work with other bit dimensions).

A "per thread code base pointer" register is generally useless (think shared libraries) - better off to provide good PC-relative addressing capabilities. 256 registers is too many - no procedure will have anywhere near that many live variables at a time, and so for that many on chip registers to be useful, you need some form of windowing (SPARC) or shifting (Itanium) system, in which case, better to provide a more compact number anyhow. I would say experience shows somewhere between 16 and 32 is ideal - ARM seem to be moving to 32, and my hunch would be that 32 allows a better distribution of registers to callee scratch.
Yoda wrote:Every operand in a non-load/store ISA needs its own addressing mode. So it is better to include the addressing mode in the operand encoding.
Not necessarily. Only the most extreme examples of CISC allow this (e.g. the VAX). Both x86 and S/360 have a more restricted, per-instruction addressing mode specifier.

If I were to design an architecture today, it would have these properties:
  • Simple "CISC": Most instructions have an addressing mode, which allows one field of memory addressing per instruction
  • Simple, consistent instruction formats: Instructions are represented in 16-bit increments, with a 16-bit minimal size. The most common instructions fit in 16 bits in 2-operand form and take 32 bits in 3-operand form, and an extra 16 bits are added for some addressing modes. More esoteric instructions (e.g. system ones) only exist in 32-bit form (plus a potential addressing extension). Importantly, the length of the instruction is trivially determinable from the first coding unit (a sketch of such a rule follows this list)
  • 30 general-purpose registers, plus a stack pointer, and a context-sensitive register (encoded as "r31"). R31, in most instructions, will provide a literal zero; for addressing modes, as a base it takes the value of the instruction pointer, otherwise it is zero; but in certain circumstances R31 encodes the "link register" (mainly load/store multiple instructions). 16-bit instructions address the first 8 of this set.
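
To make the "length is trivially determinable from the first coding unit" property concrete, here is one possible rule in C (the bit assignments are my invention, not a spec): two bits of the first 16-bit unit encode the total length, so the fetch stage can find the next instruction without decoding any opcode.

Code: Select all

    #include <stdint.h>

    /* Hypothetical rule: bits 15..14 of the first 16-bit coding unit give
     * the instruction length directly.
     *   00 -> 1 unit  (16-bit form)
     *   01 -> 2 units (32-bit form)
     *   10 -> 3 units (32-bit form plus 16-bit addressing extension)
     *   11 -> 2 units (esoteric/system instructions) */
    static unsigned insn_len_units(uint16_t first)
    {
        static const unsigned len[4] = { 1, 2, 3, 2 };
        return len[first >> 14];
    }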
linguofreak
Member
Posts: 510
Joined: Wed Mar 09, 2011 3:55 am

Re: OSDev's dream CPU

Post by linguofreak »

Owen wrote:Simple, consistent instruction formats: Instructions are represented in 16-bit increments, with a 16-bit minimal size.
What about common one-operand instructions (e.g., push/pop register)? I'm fiddling around with the design of a hypothetical architecture, and am finding it very tempting to represent these in one byte. I have 14 general-purpose registers (with a twist), a stack pointer, and a frame pointer, so a register can be specified in 4 bits. I'm representing push/pop with a 4-bit opcode and a 4-bit register specifier, so the whole instruction fits in one byte. The problem is that while this conserves memory bandwidth, it also uses up a significant amount of opcode space. I'm a bit torn as to whether it's a good tradeoff or not.
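
Concretely, the encoding described would look like this (the opcode values are placeholders), which also makes the cost visible: each one-byte, one-operand opcode consumes 16 of the 256 available first-byte values.

Code: Select all

    #include <stdint.h>

    /* High nibble = opcode, low nibble = register (placeholder values). */
    enum { OP_PUSH = 0x5, OP_POP = 0x6 };

    static uint8_t enc_push(unsigned reg)   /* e.g. enc_push(3) == 0x53 */
    {
        return (uint8_t)(OP_PUSH << 4 | (reg & 0xF));
    }

    static unsigned dec_reg(uint8_t insn)   /* recover the register field */
    {
        return insn & 0xF;
    }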
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: OSDev's dream CPU

Post by Owen »

linguofreak wrote:
Owen wrote:Simple, consistent instruction formats: Instructions are represented in 16-bit increments, with a 16-bit minimal size.
What about common one-operand instructions (e.g., push/pop register)? I'm fiddling around with the design of a hypothetical architecture, and am finding it very tempting to represent these in one byte. I have 14 general-purpose registers (with a twist), a stack pointer, and a frame pointer, so a register can be specified in 4 bits. I'm representing push/pop with a 4-bit opcode and a 4-bit register specifier, so the whole instruction fits in one byte. The problem is that while this conserves memory bandwidth, it also uses up a significant amount of opcode space. I'm a bit torn as to whether it's a good tradeoff or not.
Most of the time you want to push more than one register... so just do the logical thing and implement a "push multiple registers" instruction (encoded using a bitfield).
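
A sketch of the operand such an instruction takes (modelled on ARM's LDM/STM register lists; the 16-bit mask for a 16-register machine is my assumption): one bit per register, so any subset of the register file is pushed by a single instruction.

Code: Select all

    #include <stdint.h>

    #define REG(n) (1u << (n))

    /* PUSH {r4-r7, lr} as a bitfield; lr is r14 on this made-up machine. */
    static const uint16_t push_list = REG(4) | REG(5) | REG(6) | REG(7) | REG(14);

    /* What the hardware would iterate over: one store per set bit. */
    static unsigned popcount16(uint16_t mask)
    {
        unsigned n = 0;
        for (; mask; mask >>= 1)
            n += mask & 1;
        return n;   /* 5 stores for the list above (mask 0x40F0) */
    }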
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Owen wrote:16-bit opcodes are a gross waste of (valuable) memory bandwidth...
Agreed. In my ISA there are both one- and two-byte opcodes. If the most significant bit of the opcode is 0, it is a 1-byte opcode; otherwise it is 2 bytes. In total, 128 one-byte opcodes (encoding the most frequently used ops) and 32640 two-byte opcodes are possible.
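
Decoding that scheme is a single-bit test; a sketch of how it might look (the actual opcode assignments are of course Yoda's to make):

Code: Select all

    #include <stdint.h>

    /* MSB clear -> 1-byte opcode; MSB set -> opcode continues in byte 2. */
    static unsigned opcode_len(const uint8_t *p)
    {
        return (p[0] & 0x80) ? 2 : 1;
    }

    static uint16_t opcode_value(const uint8_t *p)
    {
        return (p[0] & 0x80) ? (uint16_t)(((p[0] & 0x7F) << 8) | p[1])
                             : p[0];
    }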
Owen wrote:...as are 1-byte operands.
I disagree. One byte is exactly the right size to pack 16-32 registers plus an addressing mode.
Owen wrote:I find byte-based thinking to be a trait of people who have spent too much time dealing with software (which is used to thinking in byte quantities) and not enough with hardware (which can very easily work with other bit dimensions).
Aha! That's why I spent a lot of time designing a bit-processor :D.
But I concluded that it would be inefficient. Jumping from 1-bit to 8-bit granularity drastically increases performance, but a further jump to 16- or 32-bit granularity doesn't bring more performance; it imposes unnatural restrictions that require additional circuits for special processing, and it wastes memory. I'm sure that wasting memory is tightly bound to losing performance: it requires more memory reads and wastes cache. The keystone of my ISA is simplicity of instruction decoding plus optimization for total code size.
Owen wrote:
Yoda wrote:Every operand in a non-load/store ISA needs its own addressing mode. So it is better to include the addressing mode in the operand encoding.
Not necessarily. Only the most extreme examples of CISC allow this (e.g. the VAX). Both x86 and S/360 have a more restricted, per-instruction addressing mode specifier.
I didn't say necessary, I said better :). So if you are familiar with the VAX architecture, you may compare the regularity of its ISA with the madness of x86.
Owen wrote:Simple "CISC": Most instructions have an addressing mode, which allows one field of memory addressing per instruction
Addition: ...which allows at least one field of memory addressing...
I would rather call ISAs restricted to only one memory operand a half-step to full-featured CISC. In that case, for example, to add memory to memory and store the result in memory, you need three instructions, two of which are loads/stores (RISC-like?), instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in the pipeline, and you also need to allocate precious registers for temporary values.
Owen wrote:Simple, consistent instruction formats: Instructions are represented in 16-bit increments, with a 16-bit minimal size.
What are the serious arguments for 16-bit granularity? Won't it be just a waste of memory, as you said about 16-bit opcodes vs 8-bit ones?
Owen wrote:30 general-purpose registers, plus a stack pointer, and a context-sensitive register (encoded as "r31"). R31, in most instructions, will provide a literal zero; for addressing modes, as a base it takes the value of the instruction pointer, otherwise it is zero; but in certain circumstances R31 encodes the "link register" (mainly load/store multiple instructions). 16-bit instructions address the first 8 of this set.
I see a great irregularity here. It is better to reserve commonly used literals in the addressing mode, as in the VAX-11 ISA, for example.
linguofreak wrote:What about common one-operand instructions (e.g., push/pop register)? I'm fiddling around with the design of a hypothetical architecture, and am finding it very tempting to represent these in one byte. I have 14 general-purpose registers (with a twist), a stack pointer, and a frame pointer, so a register can be specified in 4 bits. I'm representing push/pop with a 4-bit opcode and a 4-bit register specifier, so the whole instruction fits in one byte.
First, that kind of instruction may push/pop not only registers, but also constants and values from memory (parameters for function calls, for example). That makes these instructions exactly the same class as any other instruction.
Second, it adds irregularity, mixing the opcode and operand into a one-byte opcode.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Virtlink
Member
Posts: 34
Joined: Thu Jun 05, 2008 3:53 pm
Location: The Netherlands

Re: OSDev's dream CPU

Post by Virtlink »

Based on the comments, I made some changes.
Yoda wrote:Every operand in a non-load/store ISA needs its own addressing mode. So it is better to include the addressing mode in the operand encoding.
I decided to put the operand specification near the operand. However, technically it is not part of the operand, and in one case it is optional. My encoding allows for up to 6 operands, all of which can be immediate values, memory operands, registers, or a mix of these. Since Andrew Tanenbaum showed that 98% of the constants in a program fit in 13 bits, the smallest immediate size is 16 bits rather than 8. However, encoding a zero requires neither an immediate value nor a register.
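
As I understand this scheme, each operand is preceded by its own specifier, and a zero costs only the specifier itself. A sketch of one possible specifier byte (the field layout is my guess for illustration; the real encoding is on the wiki page linked below):

Code: Select all

    #include <stdint.h>

    /* Hypothetical per-operand specifier byte:
     *   bits 7..6  operand kind
     *   bits 5..0  register number / addressing details */
    enum kind { K_REG, K_MEM, K_IMM16, K_ZERO };   /* K_ZERO: no bytes follow */

    static uint8_t make_spec(enum kind k, unsigned detail)
    {
        return (uint8_t)((unsigned)k << 6 | (detail & 0x3F));
    }
    /* The constant 0 is a single specifier byte with no immediate,
     * matching the observation that zero needs neither field. */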
Owen wrote:A "per thread code base pointer" register is generally useless (think shared libraries) - better off to provide good PC-relative addressing capabilities. 256 registers is too many - no procedure will have anywhere near that many live variables at a time. I would say experience shows somewhere between 16 and 32 is ideal - ARM seem to be moving to 32, and my hunch would be that 32 allows a better distribution of registers to callee scratch.
I settled on exactly 32 registers, of which number 31 is the PC, for easy PC-relative addressing. There is no (externally visible) stack pointer, as the stack is managed by the CPU.
Owen wrote:16-bit opcodes are a gross waste of (valuable) memory bandwidth, as are 1-byte operands.
Most variable-length CPUs fetch, cache and interpret more bytes than the length of most instructions; any excess bytes are discarded. So for memory bandwidth it will not matter that much. I use byte-sized operands.
Yoda wrote:In my ISA there are both one- and two-byte opcodes. If the most significant bit of the opcode is 0, it is a 1-byte opcode; otherwise it is 2 bytes. In total, 128 one-byte opcodes (encoding the most frequently used ops) and 32640 two-byte opcodes are possible.
I decided to use an opcode system similar to yours, but with a variant of Dlugosz' encoding, which is also used by the CLR (.NET) to encode integers.
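
For reference, the CLR's compressed unsigned integers (ECMA-335) look roughly like this; I'm sketching the scheme from memory, so check the spec for the exact limits: a leading 0 bit means one byte, leading 10 means two bytes, leading 110 means four bytes.

Code: Select all

    #include <stddef.h>
    #include <stdint.h>

    /* ECMA-335-style compressed unsigned integer (v <= 0x1FFFFFFF):
     *   0xxxxxxx                -> up to 0x7F   in 1 byte
     *   10xxxxxx xxxxxxxx       -> up to 0x3FFF in 2 bytes
     *   110xxxxx + 3 more bytes -> 4 bytes otherwise */
    static size_t put_compressed(uint32_t v, uint8_t out[4])
    {
        if (v <= 0x7F) {
            out[0] = (uint8_t)v;
            return 1;
        }
        if (v <= 0x3FFF) {
            out[0] = (uint8_t)(0x80 | (v >> 8));
            out[1] = (uint8_t)v;
            return 2;
        }
        out[0] = (uint8_t)(0xC0 | (v >> 24));
        out[1] = (uint8_t)(v >> 16);
        out[2] = (uint8_t)(v >> 8);
        out[3] = (uint8_t)v;
        return 4;
    }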
Yoda wrote:Jumping from 1-bit to 8-bit granularity drastically increases performance, but a further jump to 16- or 32-bit granularity doesn't bring more performance; it imposes unnatural restrictions that require additional circuits for special processing, and it wastes memory.
I settled on 8-bit granularity.
Yoda wrote:I would rather call ISAs restricted to only one memory operand a half-step to full-featured CISC. In that case, for example, to add memory to memory and store the result in memory, you need three instructions, two of which are loads/stores (RISC-like?), instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in the pipeline, and you also need to allocate precious registers for temporary values.
So I made all operands uniform, meaning there can be up to six memory operands.

More about the encoding I had in mind is here on the wiki.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: OSDev's dream CPU

Post by Owen »

Yoda wrote:
Owen wrote:
Yoda wrote:Every operand in a non-load/store ISA needs its own addressing mode. So it is better to include the addressing mode in the operand encoding.
Not necessarily. Only the most extreme examples of CISC allow this (e.g. the VAX). Both x86 and S/360 have a more restricted, per-instruction addressing mode specifier.
I didn't say necessary, I said better :). So if you are familiar with the VAX architecture, you may compare the regularity of its ISA with the madness of x86.
I'm not familiar with it, but I can tell you that the per-operand addressing mode was a major issue when it came to pipelining the VAX. If you had Intel's muscle you could probably get it to fly... but nobody would design it like that today.
Yoda wrote:
Owen wrote:Simple "CISC": Most instructions have an addressing mode, which allows one field of memory addressing per instruction
Addition: ...which allows at least one field of memory addressing...
I would rather call ISAs restricted to only one memory operand a half-step to full-featured CISC. In that case, for example, to add memory to memory and store the result in memory, you need three instructions, two of which are loads/stores (RISC-like?), instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in the pipeline, and you also need to allocate precious registers for temporary values.

Code: Select all

    mov r0, [r1]
    add [r2], r0
The "addressing mode specifier" bits are closer to an instruction mode specifier; not only does it specify which operand the addressing mode is applied to, it also specifies which form the instruction takes (diadic/triadic). In particular, for a diadic instruction the addressing mode can be applied to the destination register, for read/modify/write behaviour.

Of course, some instructions don't have a valid 2-operand form; for example, multiply-accumulate. Therefore "MAC r0, r2" is invalid and its behaviour Unpredictable, though on a probable implementation it would execute as "MAC r0, r2, r0", i.e. "r0 += r2 * r0".
Yoda wrote:
Owen wrote:Simple, consistent instruction formats: Instructions are represented in 16-bit increments, with a 16-bit minimal size.
What are the serious arguments for 16-bit granularity? Won't it be just a waste of memory, as you said about 16-bit opcodes vs 8-bit ones?
Simplification of the instruction decoder. In particular, you remove half of the taps on the barrel shifter which aligns the instruction to be decoded by the instruction decoder, greatly reducing its size and noticeably reducing its latency.

Also, you can already get quite respectable instruction density at this point: ARM Thumb-2, for example, is not a particularly well-compacted instruction set (in particular, it is constrained by Thumb's design, which never anticipated expansion with the 32-bit instructions Thumb-2 added, so those instructions are quite wasteful of bits), yet it already trades places effectively with the byte-coded i386. i386 tends to be reasonably compact, except in some areas around memory addressing. In particular, it tends to strongly beat most "true CISCs".
Yoda wrote:
Owen wrote:30 general-purpose registers, plus a stack pointer, and a context-sensitive register (encoded as "r31"). R31, in most instructions, will provide a literal zero; for addressing modes, as a base it takes the value of the instruction pointer, otherwise it is zero; but in certain circumstances R31 encodes the "link register" (mainly load/store multiple instructions). 16-bit instructions address the first 8 of this set.
I see a great irregularity here. It is better to reserve commonly used literals in the addressing mode, as in the VAX-11 ISA, for example.
This is an "exploitation" of an irregularity built into the architecture: Indexing off of LR is so rare (it is overwritten with the return address at every function call, after all) and therefore it is a viable candidate for the coding of special values. For example, it allows base+scale*index+offset addressing to degenerate into base+offset; and base+offset addressing to be used for PC-relative addressing of data
Yoda wrote:
linguofreak wrote:What about common one-operand instructions (e.g., push/pop register)? I'm fiddling around with the design of a hypothetical architecture, and am finding it very tempting to represent these in one byte. I have 14 general-purpose registers (with a twist), a stack pointer, and a frame pointer, so a register can be specified in 4 bits. I'm representing push/pop with a 4-bit opcode and a 4-bit register specifier, so the whole instruction fits in one byte.
First, that kind of instruction may push/pop not only registers, but also constants and values from memory (parameters for function calls, for example). That makes these instructions exactly the same class as any other instruction.
Second, it adds irregularity, mixing the opcode and operand into a one-byte opcode.
Outside of IA-32 and its decrepit calling convention, I only very, very rarely see a "push memory_value" or "push literal" instruction used; on AMD64, for example, they are very rare. They are not worth reserving coding space for.

Far better to implement a "push multiple registers" instruction, which will get heavy use.
cb88
Member
Posts: 38
Joined: Sun Mar 07, 2010 4:06 pm

Re: OSDev's dream CPU

Post by cb88 »

Isn't "push multiple" essentially what transactional memory is doing, at least from the out-to-RAM perspective?

Or is it more like RAM I/O scheduling? I've read some on it, but am still not quite sure what it's all about. I know at least that it's not RISC- or CISC-specific, as Oracle was implementing it in, I want to say, the SPARC64/Rock.

That said, haven't register windows been proven to be better managed by the compiler? That's what they explained in computer architecture class; it seemed like rather old news as well, and sort of a blotch on the SPARC architecture.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: OSDev's dream CPU

Post by Owen »

cb88 wrote:Isn't "push multiple" essentially what transactional memory is doing, at least from the out-to-RAM perspective?

Or is it more like RAM I/O scheduling? I've read some on it, but am still not quite sure what it's all about. I know at least that it's not RISC- or CISC-specific, as Oracle was implementing it in, I want to say, the SPARC64/Rock.
A "load multiple"/"store multiple"/"push multiple"/"pop multiple" instruction is simple. Lets take the following ARM instruction:

Code: Select all

    STMFD sp!, {r4-r7, lr}
What's it doing? Quite simply, it could be expressed as the following combination of simpler instructions:

Code: Select all

    sub sp, sp, #20         @ make room for five 32-bit words (4*5 bytes)
    str r4, [sp]            @ lowest-numbered register at the lowest address
    str r5, [sp, #4]
    str r6, [sp, #8]
    str r7, [sp, #12]
    str lr, [sp, #16]
"STM" stands for "store multiple"; "FD" means "full descending", and describes the 'discipline' in play. For STM, its an alias for "DB" (Decrement before); for LDM, its an alias for "IA" (Increment after). sp! says use SP as the base register for the memory operation; the ! means update it afterards, and then a selection of which registers to store follows.

(One thing of note is that ARM explicitly allows STMs/LDMs to be aborted and restarted on an exception or interrupt, i.e. an address may be read/written twice. Therefore, specifically, "LDM sp, {r0, sp}" is invalid (sp may be updated, but an interrupt or asynchronous exception may cause the instruction to be restarted). Any STM in which the base register is both stored and updated is likewise defined to be unpredictable with regard to which value is saved.)
cb88 wrote:That said haven't register windows been proven to be better managed by the compiler... thats what they explained in computer arch class seemed like it was rather old news as well and sort of a blotch on the Sparc architecture.
Pretty much. IA-64-style register "scrolling" might have some merit (because of the finer granularity), but register windows are generally thought of as less efficient than explicit spills.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Virtlink wrote:Since Andrew Tanenbaum showed that 98% of the constants in a program fit in 13 bits, the smallest immediate size is 16 bits rather than 8.
Of course, this result is interesting. But it would be much more interesting to calculate the full distribution of usage frequencies, for example like this:
0 bits - 20% (encoding zero)
1 bit - 25%
2 bits - 27%
...
8 bits - 81%
...
13 bits - 98%
...
64 bits - ???%
...
128 bits - ???%

Did Tanenbaum gather large statistical data for such a distribution? Maybe somebody on this forum has such statistics?
Virtlink wrote:However, encoding a zero requires neither an immediate value nor a register.
Yes, you are right! I've done my own statistical investigation (not yet complete) and found that the zero value is used extremely frequently, so it deserves special encoding. Moreover, it is used only in one special context: clearing. In most other contexts it is quite useless - nobody will add/subtract zero, multiply by zero, divide by zero, or/xor/and with zero, and so on. So encoding zero as an operand is senseless. That's why I implemented a "clr" instruction with one destination operand in my ISA.
Virtlink wrote:There is no (externally visible) stack pointer, as the stack is managed by the CPU.
How do you manage function call parameters? How do you organize a function's local data?
Virtlink wrote:Most variable-length CPUs fetch, cache and interpret more bytes than the length of most instructions; any excess bytes are discarded. So for memory bandwidth it will not matter that much. I use byte-sized operands.
Nevertheless, some instructions are used much more frequently than others ("mov", for example). So variable-length instruction encoding is exactly the place for memory-bandwidth optimization, isn't it?
Owen wrote:the per-operand addressing mode was a major issue when it came to pipelining the VAX.
"Ducks fly better than dogs. That's why I need a powerful jet to make a dog fly better than a duck."
In other words, Intel needs a strong pipeline engine to force 2-3 sequential instructions to execute faster than one. :D
Owen wrote:
Yoda wrote:I would rather call ISAs restricted to only one memory operand a half-step to full-featured CISC. In that case, for example, to add memory to memory and store the result in memory, you need three instructions, two of which are loads/stores (RISC-like?), instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in the pipeline, and you also need to allocate precious registers for temporary values.

Code: Select all

    mov r0, [r1]
    add [r2], r0
No, this is the case:

Code: Select all

    add [src], [r1], [r2]
instead of

Code: Select all

    mov r0, [src]
    add r0, [r1]
    mov [r2], r0
This saves two instructions, the memory for encoding them, and one temporary register. It also reduces the clocks needed and the power used.
Owen wrote:
Yoda wrote:What are the serious arguments for 16-bit granularity? Won't it be just a waste of memory, as you said about 16-bit opcodes vs 8-bit ones?
Simplification of the instruction decoder. In particular, you remove half of the taps on the barrel shifter which aligns the instruction to be decoded by the instruction decoder, greatly reducing its size and noticeably reducing its latency.
So why not 32? It would drop another stage of the barrel shifter. And 16-byte alignment (VLIW?) would allow you to get rid of the shifter altogether.
Let's consider. The maximal instruction length for x86 (just as an example) is 15 bytes; round up to 16. For this size, every stage of the barrel shifter adds 16*8 = 128 two-input switches, or 128*3 = 384 logic gates. For full (byte-aligned) processing there are 4 stages (shift by 0/1, 0/2, 0/4, 0/8), i.e. 1536 gates and 8 gate-to-gate propagation delays. Having 3 stages instead of 4 saves you 25% of both the size and the delay of the shifter. Are 384 gates too much for a die with billions of gates? But OK, 2 additional gate-to-gate propagation delays may be too much. For a fast shifter we need to implement a full matrix of transmission gates (3-state buffers). This results in 128*16 = 2048 gates for a byte-aligned shifter and 128*8 = 1024 gates for a word-aligned shifter. So for a fast shifter, word alignment will NOT save time and will save only 1024 gates. With one transmission gate realized in 3 transistors, that saves 3072/2000000000 ≈ 0.00015% of the die :D.
Owen wrote:This is an "exploitation" of an irregularity built into the architecture: indexing off of LR is so rare (it is overwritten with the return address at every function call, after all)
I didn't get that; please explain it in more detail.
Owen wrote:Outside of IA-32 and its decrepit calling convention, I only very, very rarely see use of a "push memory_value" or "push literal" instruction. It's not worth reserving coding space for. For example, on AMD64, it is very rare.

Far better to implement a "push multiple registers" instruction, which will get heavy use.
I agree that with parameters being passed in registers, the need to push constants and values from memory becomes obsolete. But the same goes for pushing a single register. Yes, saving multiple registers is more useful, but I still haven't decided how to encode that operation.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: OSDev's dream CPU

Post by Combuster »

Yoda wrote:
Owen wrote:
Yoda wrote:What are the serious arguments for 16-bit granularity? Won't it be just a waste of memory, as you said about 16-bit opcodes vs 8-bit ones?
Simplification of the instruction decoder. In particular, you remove half of the taps on the barrel shifter which aligns the instruction to be decoded by the instruction decoder, greatly reducing its size and noticeably reducing its latency.
So why not 32? It would drop another stage of the barrel shifter. And 16-byte alignment (VLIW?) would allow you to get rid of the shifter altogether.
Memory efficiency. Many of the individual steps you want to take can perfectly well be encoded in two bytes: even on x86, the vast majority of instructions fit in two bytes if they don't use immediates. So why waste cache space on storing bits that aren't being used? In fact, ARM downgraded from its original 32-bit RISC encoding to the 16-bit Thumb instruction set because of that waste.

As for VLIW, so far it has been a commercial failure.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: OSDev's dream CPU

Post by Owen »

Yoda wrote:
Owen wrote:the per-operand addressing mode was a major issue when it came to pipelining the VAX.
"Ducks fly better than dogs. That's why I need a powerful jet to make a dog fly better than a duck."
In other words, Intel needs a strong pipeline engine to force 2-3 sequential instructions to execute faster than one. :D
For out-of-order, it requires the ability to issue at least N+1 operations per clock (where N is the number of memory operands). Fundamentally, it's going to decode into a series of loads, an operate, and a series of stores as micro-ops, because that's the efficient way of doing things. If your ISA has 3 potential memory operands, then you need to be able to issue 4 micro-ops to the OoO core. This requires big multiplexers, which are expensive in terms of propagation delay.
Yoda wrote:
Owen wrote:
Yoda wrote:I would rather call ISAs restricted to only one memory operand a half-step to full-featured CISC. In that case, for example, to add memory to memory and store the result in memory, you need three instructions, two of which are loads/stores (RISC-like?), instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in the pipeline, and you also need to allocate precious registers for temporary values.

Code: Select all

    mov r0, [r1]
    add [r2], r0
No, that is the case:

Code: Select all

    add [src], [r1], [r2]
instead of

Code: Select all

    mov r0, [src]
    add r0, [r1]
    mov [r2], r0
This saves two instructions, the memory for encoding them, and one temporary register. It also reduces the clocks needed and the power used.
No fewer clocks (OK: on an in-order implementation you'll avoid a pipeline stall, which clever scheduling can avoid anyway, but in-order implementations are barely relevant).
Yoda wrote:
Owen wrote:
Yoda wrote:What are the serious arguments for 16-bit granularity? Won't it be just a waste of memory, as you said about 16-bit opcodes vs 8-bit ones?
Simplification of the instruction decoder. In particular, you remove half of the taps on the barrel shifter which aligns the instruction to be decoded by the instruction decoder, greatly reducing its size and noticeably reducing its latency.
So why not 32? It would drop another stage of the barrel shifter. And 16-byte alignment (VLIW?) would allow you to get rid of the shifter altogether.
Let's consider. The maximal instruction length for x86 (just as an example) is 15 bytes; round up to 16. For this size, every stage of the barrel shifter adds 16*8 = 128 two-input switches, or 128*3 = 384 logic gates. For full (byte-aligned) processing there are 4 stages (shift by 0/1, 0/2, 0/4, 0/8), i.e. 1536 gates and 8 gate-to-gate propagation delays. Having 3 stages instead of 4 saves you 25% of both the size and the delay of the shifter. Are 384 gates too much for a die with billions of gates? But OK, 2 additional gate-to-gate propagation delays may be too much. For a fast shifter we need to implement a full matrix of transmission gates (3-state buffers). This results in 128*16 = 2048 gates for a byte-aligned shifter and 128*8 = 1024 gates for a word-aligned shifter. So for a fast shifter, word alignment will NOT save time and will save only 1024 gates. With one transmission gate realized in 3 transistors, that saves 3072/2000000000 ≈ 0.00015% of the die :D.
The AMD K8 design was done with 5 gate delays per pipeline stage. If your shifter uses 4 of those gate delays, you have one gate left to determine the length of the first instruction, and the best you can do is parallel decode of 2 instructions/cycle without horrid pipeline-depth growth.

With 16-bit granularity plus a simpler instruction encoding, you can quite easily reach 3 instructions decoded per clock.
Yoda wrote:
Owen wrote:This is an "exploitation" of an irregularity built into the architecture: indexing off of LR is so rare (it is overwritten with the return address at every function call, after all)
I didn't get that, please explain it in more details.
How often do you index an array by your return address? Yeah, never; so for indexing, LR can be reused as zero.
How often do you look up the code at your return address? Very rarely; so for base calculations it's more valuable as the program counter.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Combuster wrote:Memory efficiency.
Exactly! That's what I'm talking about.
Combuster wrote:Many of the individual steps you want to take can perfectly well be encoded in two bytes
But many won't. The finer the granularity, the more efficient the encoding you can implement. My opinion is that 1-byte granularity is the perfect balance between decoding speed and memory usage.
Combuster wrote:As for VLIW, so far it has been a commercial failure.
I hate them too.
Owen wrote:For out-of-order, it requires the ability to issue at least N+1 operations per clock (where N is the number of memory operands). Fundamentally, it's going to decode into a series of loads, an operate, and a series of stores as micro-ops, because that's the efficient way of doing things. If your ISA has 3 potential memory operands, then you need to be able to issue 4 micro-ops to the OoO core.
If I have 3 memory operands, I must execute the same 4 micro-ops (two memory reads, one write and the operation itself), no matter which architecture I use. But whereas 4 u-ops is the theoretical limit, some architectures need more than 4 u-ops! That applies to x86, of course. The difference is that my ISA hits exactly the theoretical limit for the whole action. Operations encode more efficiently, take less memory, allocate fewer registers, and are easier to decode and to analyze for dependencies between instructions. BTW, dependencies may stall performance or require more power in the "only one in-memory operand is allowed" case. An example:

Code: Select all

    add [op2], [op1]
    sub [op4], [op3]
vs

Code: Select all

    mov r0, [op1]
    add [op2], r0
    mov r0, [op3]
    sub [op4], r0
In the first case no register is allocated, and since there are no dependencies, the second instruction may be executed out of order (simultaneously) without any great intelligence. But the second case not only requires executing FOUR instructions for the same action, it also cannot be executed simultaneously without involving register renaming and additional speculation circuitry.
Owen wrote:This requires big multiplexers, which are expensive in terms of propagation delay.
No, the multiplexers aren't too big, as shown above. And the switching/shifting time of fast multiplexers/shifters doesn't depend on the number of inputs or the shift size. It just requires more gates, and that increase is negligible compared to the size of a modern die.
Owen wrote:
Yoda wrote:

Code: Select all

    add [src], [r1], [r2]
instead of

Code: Select all

    mov r0, [src]
    add r0, [r1]
    mov [r2], r0
This saves two instructions, the memory for encoding them, and one temporary register. It also reduces the clocks needed and the power used.
No fewer clocks (OK: on an in-order implementation you'll avoid a pipeline stall, which clever scheduling can avoid anyway, but in-order implementations are barely relevant).
I can't prove obvious things! I can just state that to achieve the same result, the second case needs either out-of-order execution (instead of ONE instruction that is singular by its nature and doesn't need any kind of out-of-order machinery) or more clocks to process.
Owen wrote:The AMD K8 design was done with 5 gate delays per pipeline stage. If your shifter uses 4 of those gate delays, you have one gate left to determine the length of the first instruction, and the best you can do is parallel decode of 2 instructions/cycle without horrid pipeline-depth growth.
Again, for FAST shifters the shifting speed doesn't depend on the shift size; it always takes two gate propagation delays (one for the row selector and one for the transmission gate). If AMD has a shifter (and I said above that only a VLIW design can do without a shifter at all), it will take the SAME time whether it shifts words or bytes.
Owen wrote:This is an "exploitation" of an irregularity built into the architecture: indexing off of LR is so rare (it is overwritten with the return address at every function call, after all)...
How often do you index an array by your return address? Yeah, never; so for indexing, LR can be reused as zero.
How often do you look up the code at your return address? Very rarely; so for base calculations it's more valuable as the program counter.
I still don't get it... what is LR? Why should I index something off of it?
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.