Virtlink wrote:As Andrew Tanenbaum showed that 98% of all the constants in a program would fit in 13 bits, the smallest immediate value is 16-bits and not 8-bits.
Of course, this result is interesting. But it would be much more interesting to compute the full distribution of usage frequencies, for example like this:
0 bits - 20% (encoding zero)
1 bit - 25%
2 bits - 27%
...
8 bits - 81%
...
13 bits - 98%
...
64 bits - ???%
...
128 bits - ???%
Did Tanenbaum gather large statistical data for such a distribution? Maybe somebody on this forum has such statistics?
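I'm not aware of a published full distribution, but computing one from your own corpus is straightforward: for each constant, find the minimum number of bits needed to encode it and accumulate a cumulative histogram. A minimal sketch (the sample constants are made up for illustration; real numbers would come from scanning compiled binaries):

```python
# Cumulative distribution of immediate-value bit widths.
# The `constants` list is a made-up sample; in practice you would
# extract immediates from real compiled code.

def bits_needed(value):
    """Minimum unsigned bits to encode `value` (zero needs 0 bits)."""
    return value.bit_length()

def cumulative_distribution(constants, max_bits=16):
    """Fraction of constants encodable in <= b bits, for each b."""
    total = len(constants)
    widths = [bits_needed(c) for c in constants]
    return {b: sum(1 for w in widths if w <= b) / total
            for b in range(max_bits + 1)}

constants = [0, 0, 1, 1, 2, 3, 4, 8, 100, 255, 4096, 70000]
dist = cumulative_distribution(constants)
print(dist[0])   # fraction encodable in 0 bits (i.e. the constant zero)
print(dist[8])   # fraction that fits in 8 bits
```

Signed immediates would need one extra bit for the sign, but the shape of the calculation is the same.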
Virtlink wrote:However, encoding a zero neither requires an immediate value, nor a register.
Yes, you are right! I've done my own statistical investigation (not yet complete) and found that the zero value is used extremely frequently, so it deserves a special encoding. Moreover, it is used almost only in one special context - clearing. In most other contexts it is quite useless: nobody adds/subtracts zero, multiplies by zero, divides by zero, ORs/XORs/ANDs with zero, and so on. So encoding zero as an operand is senseless. That's why I implemented a "clr" instruction with one destination operand in my ISA.
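Such a count is easy to automate on a disassembly listing by tallying the clearing idioms. A rough sketch (the listing and mnemonics below are a made-up example; real objdump output would need a more careful parser):

```python
# Count register-clearing idioms in a disassembly listing.
# The listing is a made-up illustration, not real compiler output.
listing = """
xor eax, eax
mov ebx, 0
add eax, 5
mov ecx, [esp+4]
xor edx, edx
"""

def count_clears(text):
    """Count 'xor r, r' (same operand twice) and 'mov r, 0' lines."""
    clears = 0
    for line in text.strip().splitlines():
        parts = line.replace(',', ' ').split()
        if not parts:
            continue
        op, args = parts[0], parts[1:]
        if op == 'xor' and len(args) == 2 and args[0] == args[1]:
            clears += 1
        elif op == 'mov' and len(args) == 2 and args[1] == '0':
            clears += 1
    return clears

print(count_clears(listing))  # 3 of the 5 instructions just clear a register
```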
Virtlink wrote:There is no (externally visible) stack pointer, as the stack is managed by the CPU.
How do you manage function call parameters? How do you organize local function data?
Virtlink wrote:Most variable-length CPUs get, cache and interpret more bytes than the length of most instructions. Any excess bytes are discarded. So, for memory bandwidth it will not matter that much. I use byte-sized operands.
Nevertheless, some instructions ("mov", for example) are used much more frequently than others. So variable-length instruction encoding is a point for memory bandwidth optimization, isn't it?
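That is the same argument that motivates Huffman-style opcode assignment: give the most frequent operations the shortest encodings and the expected instruction length drops. A toy calculation (the frequencies and lengths are made-up placeholders; real numbers would come from profiling compiled code):

```python
# Expected opcode length under a frequency-aware variable-length
# encoding vs. a fixed-length one. Frequencies are hypothetical.

freq = {'mov': 0.35, 'add': 0.10, 'cmp': 0.10, 'jcc': 0.15,
        'call': 0.05, 'other': 0.25}

# Hypothetical scheme: 1 byte for the hottest opcodes, 2 for the rest.
var_len = {'mov': 1, 'add': 1, 'cmp': 1, 'jcc': 1, 'call': 2, 'other': 2}
fixed_len = 2  # every opcode takes 2 bytes

expected_var = sum(freq[op] * var_len[op] for op in freq)
print(expected_var, fixed_len)  # 1.3 bytes vs. 2 bytes per opcode
```

With these (invented) numbers the variable-length scheme cuts the average opcode size by 35%, which translates directly into fetch bandwidth.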
Owen wrote:address mode per operand was a major issue when it came to pipelining it.
"Ducks fly better than dogs. That's why I need a powerful jet engine to make a dog fly better than a duck."
In other words, Intel needs a strong pipeline engine to make 2-3 sequential instructions execute faster than one.
Owen wrote:Yoda wrote:I better call ISAs, restricted to only one memory operand, as a half-step to full-featured CISC. In that case, for example, to add memory to memory and to store in memory you need three instructions, two of which will be load/store (RISC like?) instead of one instruction with three memory operands. In addition, load/store instructions need separate decoding in pipeline and you also need to allocate precious registers for temporary values.
No, that is the case:
Code: Select all
add [r2], [src], [r1]
instead of
Code: Select all
mov r0, [src]
add r0, [r1]
mov [r2], r0
This saves two instructions, the memory for encoding them, and one temporary register, and also reduces the clocks needed and the power usage.
Owen wrote:Yoda wrote:What are the serious arguments for 16-bit granularity? Won't it be just the waste of memory as you said about 16-bit opcodes vs 8-bit?
Simplification of the instruction decoder. In particular, you remove half of the taps on the barrel shifter which aligns the instruction to be decoded by the instruction decoder, greatly reducing its size and noticeably reducing its latency.
So why not 32? It would drop another stage of the barrel shifter. And 16-byte alignment (VLIW?) would allow you to get rid of the shifter altogether.
Let's consider. The maximal instruction length for x86 (just as an example) is 15 bytes; round up to 16. At this size, every stage of the barrel shifter adds 16*8 = 128 two-input switches, or 128*3 = 384 logic gates. For full byte-aligned processing there are 4 stages (shift by 0/1, 0/2, 0/4, 0/8 bytes), i.e. 1536 gates and 8 gate-to-gate propagation delays. Having 3 stages instead of 4 saves you 25% of both the size and the delay of the shifter. Are 384 gates too much for a die with billions of gates? But OK, 2 additional gate-to-gate propagation delays may be too much. A fast shifter needs a full matrix of transmission gates (3-state buffers) instead: 128*16 = 2048 gates for a byte-aligned shifter and 128*8 = 1024 gates for a word-aligned one. So for a fast shifter, word alignment will NOT save time and will save only 1024 gates. With one transmission gate realized in 3 transistors, that saves 3072/2000000000 ~= 0.00015% of the die size.
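The arithmetic above can be checked mechanically. A sketch that reproduces the gate counts (the constants mirror the ones in the post: a 16-byte window, 3 gates per 2-input switch, 3 transistors per transmission gate):

```python
# Reproduce the barrel-shifter gate arithmetic from the post.
WIDTH_BITS = 16 * 8        # 16-byte instruction window = 128 bits
GATES_PER_MUX = 3          # one 2-input switch ~ 3 logic gates

# Staged (log-depth) shifter:
stage_gates = WIDTH_BITS * GATES_PER_MUX     # gates per stage
byte_aligned = 4 * stage_gates               # shift by 0/1, 0/2, 0/4, 0/8 bytes
word_aligned = 3 * stage_gates               # one stage fewer
print(stage_gates, byte_aligned, word_aligned)  # 384 1536 1152

# Full-matrix (transmission-gate) shifter:
matrix_byte = WIDTH_BITS * 16    # 16 possible byte offsets
matrix_word = WIDTH_BITS * 8     # 8 possible word offsets
saved_transistors = (matrix_byte - matrix_word) * 3
print(100 * saved_transistors / 2e9)  # percent of a 2-billion-transistor die
```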
Owen wrote:This is an "exploitation" of an irregularity built into the architecture: Indexing off of LR is so rare (it is overwritten with the return address at every function call, after all)
I didn't get that, please explain it in more detail.
Owen wrote:Outside of IA-32 and its decrepit calling convention, I only very, very rarely see use of a "push memory_value" or "push literal" instruction. It's not worth reserving coding space for. For example, on AMD64, it is very rare.
Far better to implement a "push multiple registers" instruction, which will get heavy use.
I agree that with parameters being passed in registers, the need for pushing constants and values from memory becomes obsolete. But the same holds for pushing a single register. Yes, saving multiple registers will be more useful, but I still haven't decided how to encode that operation.