To weigh in on the memory-operands-in-instructions (essentially CISC) vs. load-store architecture (essentially RISC) business, let us compare the two on a simple read-modify-write:
Code:
CISC:
1. ADD [r0+r1], r2
RISC:
1. LD r3, [r0+r1]
2. ADD r3, r2
3. ST [r0+r1], r3
Now, the RISC version is always going to be larger: at the very least, it has to encode the address (r0+r1) twice, it has to encode three opcode fields instead of one, and it has more register operands to encode overall.
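To put rough numbers on that, here is a back-of-envelope count under assumed field widths: 8-bit opcodes and 5-bit register specifiers, which is purely illustrative and not any real encoding.
Code:
#include <stdio.h>

/* Back-of-envelope size comparison under assumed field widths:
   8-bit opcodes and 5-bit register specifiers (illustrative only). */
int main(void)
{
    const int op = 8, reg = 5;

    int cisc_add = op + 3 * reg;   /* ADD [r0+r1], r2  ->  23 bits */
    int risc_ld  = op + 3 * reg;   /* LD  r3, [r0+r1]  ->  23 bits */
    int risc_add = op + 2 * reg;   /* ADD r3, r2       ->  18 bits */
    int risc_st  = op + 3 * reg;   /* ST  [r0+r1], r3  ->  23 bits */

    printf("CISC: %d bits\n", cisc_add);
    printf("RISC: %d bits\n", risc_ld + risc_add + risc_st);  /* 64 bits */
    return 0;
}
The exact numbers obviously depend on the encoding, but the duplicated address operands and the two extra opcode fields are what the RISC sequence pays for regardless.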
Let's look at their superscalar decoding:
Code:
CISC:
1. Allocate(rn0) = Load(r0 + r1)
   Allocate(rn1) = rn0 + r2
   Store(r0 + r1, rn1)
RISC:
1. Allocate(rn0 is now r3) = Load(r0 + r1)
2. Allocate(rn1 is now r3) = rn0 + r2
3. Store(r0 + r1, rn1)
Note that both decode to the same µops, except that the RISC version requires more physical register tag traffic: it must map r3 to a new physical register twice, while the CISC version never needs to update the mapping of an architectural register at all. CISC wins here (less contention and less power expended, plus you don't leave a physical register tied up holding r3 afterwards).
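To make the tag-traffic point concrete, here is a minimal sketch of the rename step for both sequences, assuming a RAT-style architectural-to-physical map and a free list handed out as a simple counter; the structure and names are mine for illustration, not any particular core's.
Code:
#include <stdio.h>

#define NUM_ARCH 32

static int rat[NUM_ARCH];        /* architectural -> physical mapping   */
static int next_free = NUM_ARCH; /* trivial "free list": just a counter */

static int alloc_phys(void) { return next_free++; }

/* RISC: three instructions, two of which write r3, so the mapping for r3
   is updated twice. rn1 stays live as r3 until the next write to r3. */
static void rename_risc(void)
{
    int rn0 = alloc_phys(); rat[3] = rn0;   /* LD  r3, [r0+r1] */
    int rn1 = alloc_phys(); rat[3] = rn1;   /* ADD r3, r2      */
    /* ST [r0+r1], r3 only reads rat[3] (= rn1)                */
    printf("RISC: 2 physical registers allocated, 2 map updates\n");
}

/* CISC: one instruction cracked into the same three uops, but rn0 and rn1
   are internal temporaries; no architectural register is renamed, and both
   temporaries can be freed as soon as the store uop completes. */
static void rename_cisc(void)
{
    int rn0 = alloc_phys();                 /* Load(r0+r1) -> rn0 */
    int rn1 = alloc_phys();                 /* rn0 + r2    -> rn1 */
    (void)rn0; (void)rn1;                   /* Store(r0+r1, rn1)  */
    printf("CISC: 2 physical registers allocated, 0 map updates\n");
}

int main(void)
{
    rename_risc();
    rename_cisc();
    return 0;
}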
Now, on an in-order architecture, the distinction is less clear cut. The traditional integer RISC pipeline is Fetch - Decode - ALU - Memory. CISCy memory operands can be implemented by rearranging this pipeline into Fetch - Decode - AGU - Load - ALU - Store. Depending upon the latency of your AGU (almost certainly less than the ALU's; at most an AGU should be a couple of adders and a barrel shifter), you might be able to squeeze address generation into the Load stage. The amount of memory access logic (other than interstage flip-flops) need not increase much: at most you need some small logic to determine whether a load or a store gets priority for the memory port in a given cycle (a simple method would be to declare that loads always have priority), plus a store-to-load conflict check, forwarding or stalling when a store still in the pipeline overlaps a younger load.
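For concreteness, here is roughly what that AGU would compute; the field names, widths, and scale range are assumptions for illustration.
Code:
#include <stdio.h>
#include <stdint.h>

/* Inputs to the AGU stage of the Fetch - Decode - AGU - Load - ALU - Store
   pipeline sketched above (field names/widths are illustrative). */
typedef struct {
    uint64_t base;   /* value read from the base register, e.g. r0  */
    uint64_t index;  /* value read from the index register, e.g. r1 */
    unsigned scale;  /* barrel-shift applied to the index (0..3)    */
    int64_t  disp;   /* sign-extended immediate displacement        */
} agu_inputs;

/* Two adds and one shift: the "couple of adders and a barrel shifter",
   so its latency should sit comfortably below a full ALU's. */
static uint64_t agu(const agu_inputs *in)
{
    return in->base + (in->index << in->scale) + (uint64_t)in->disp;
}

int main(void)
{
    agu_inputs in = { 0x1000, 4, 3, 0 };   /* r0 = 0x1000, r1 = 4, scaled by 8 */
    printf("address = 0x%llx\n", (unsigned long long)agu(&in));  /* 0x1020 */
    return 0;
}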
In general, the extra logic of the CISC machine in this case should increase performance; you drop the 2-cycle load-to-use penalty by shuffling loads earlier in the pipeline and allowing integer operations to be issued along with them (in this case I'm assuming a 3-operand machine).
In general, I think it's going to be a bit swings-and-roundabouts: the CISC machine gets smaller instructions and more efficient memory access, while the load-store architecture probably gets a slight increase in non-memory instruction efficiency. Power consumption probably tips in the same direction as performance.
This all assumes that the decoder can scale well; here I'm assuming that it takes at most a few bits from the first coding unit of an instruction to determine how long it is, where the first "coding unit" is defined to be the minimum possible size of an instruction. That permits relatively easy parallel decoding. It's when things end up x86-like, with multiple prefixes, multi-byte opcodes, and variable-length postfixes, that parallel decoding gets really hard.
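As a sketch of the kind of scheme I mean, assume 16-bit coding units and a made-up rule where the low two bits of the first unit give the length in units; then finding instruction boundaries only needs a peek at each candidate first unit.
Code:
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Length in coding units, taken from a few bits of the first unit only
   (hypothetical encoding: low 2 bits + 1 = 1..4 units, i.e. 2..8 bytes). */
static unsigned insn_len(uint16_t first_unit)
{
    return (first_unit & 0x3u) + 1;
}

/* Mark instruction start positions within a fetch block. Because each
   length depends only on its own first unit, hardware can compute
   insn_len() for every unit position in parallel and then resolve the
   start markers with a short chain; the serial loop below is just the
   software equivalent. */
static void find_starts(const uint16_t *block, size_t nunits, uint8_t *is_start)
{
    size_t i = 0;
    while (i < nunits) {
        is_start[i] = 1;
        i += insn_len(block[i]);
    }
}

int main(void)
{
    uint16_t block[8]  = { 0x0001, 0, 0, 0x0002, 0, 0, 0, 0 };
    uint8_t  starts[8] = { 0 };

    find_starts(block, 8, starts);
    for (size_t i = 0; i < 8; i++)
        printf("unit %zu: %s\n", i, starts[i] ? "instruction start" : "-");
    return 0;
}
With x86, by contrast, the length depends on prefixes, the opcode map, and ModRM/SIB, so each parallel decoder effectively has to guess lengths or wait on its neighbour.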