To weigh in on the memory-operands-in-instructions (essentially CISC) vs. load-store architecture (essentially RISC) business, let us compare the two on a simple read-modify-write:
Code:
CISC:
1. ADD [r0+r1], r2
RISC:
1. LD r3, [r0+r1]
2. ADD r3, r2
3. ST [r0+r1], r3
Now, the RISC version is always going to be larger: at the very least, it has to encode the address (r0+r1) twice, it has to encode three opcode fields instead of one, and it has more register operands to encode overall.
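To put rough numbers on that, here is a back-of-envelope count under assumed field widths: 8-bit opcodes and 5-bit register specifiers, which is purely illustrative and not any real encoding.
Code:
#include <stdio.h>

/* Back-of-envelope size comparison under assumed field widths:
   8-bit opcodes and 5-bit register specifiers (illustrative only). */
int main(void)
{
    const int op = 8, reg = 5;

    int cisc_add = op + 3 * reg;   /* ADD [r0+r1], r2  ->  23 bits */
    int risc_ld  = op + 3 * reg;   /* LD  r3, [r0+r1]  ->  23 bits */
    int risc_add = op + 2 * reg;   /* ADD r3, r2       ->  18 bits */
    int risc_st  = op + 3 * reg;   /* ST  [r0+r1], r3  ->  23 bits */

    printf("CISC: %d bits\n", cisc_add);
    printf("RISC: %d bits\n", risc_ld + risc_add + risc_st);  /* 64 bits */
    return 0;
}
The exact numbers obviously depend on the encoding, but the duplicated address operands and the two extra opcode fields are what the RISC sequence pays for regardless.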
Let's look at their superscalar decoding:
Code:
CISC:
1. Allocate(rn0) = Load(r0 + r1)
   Allocate(rn1) = rn0 + r2
   Store(r0 + r1, rn1)
RISC:
1. Allocate(rn0 is now r3) = Load(r0 + r1)
2. Allocate(rn1 is now r3) = rn0 + r2
3. Store(r0 + r1, rn1)
Note that both decode to the same µops, except that the RISC version requires more physical register tag traffic: it must map r3 to a new physical register twice, while the CISC version never needs to update the mapping of an architectural register at all. CISC wins here (less contention and less power expended, plus you don't leave a physical register tied up holding r3 afterwards).
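To make the tag-traffic point concrete, here is a minimal sketch of the rename step for both sequences, assuming a RAT-style architectural-to-physical map and a free list handed out as a simple counter; the structure and names are mine for illustration, not any particular core's.
Code:
#include <stdio.h>

#define NUM_ARCH 32

static int rat[NUM_ARCH];        /* architectural -> physical mapping   */
static int next_free = NUM_ARCH; /* trivial "free list": just a counter */

static int alloc_phys(void) { return next_free++; }

/* RISC: three instructions, two of which write r3, so the mapping for r3
   is updated twice. rn1 stays live as r3 until the next write to r3. */
static void rename_risc(void)
{
    int rn0 = alloc_phys(); rat[3] = rn0;   /* LD  r3, [r0+r1] */
    int rn1 = alloc_phys(); rat[3] = rn1;   /* ADD r3, r2      */
    /* ST [r0+r1], r3 only reads rat[3] (= rn1)                */
    printf("RISC: 2 physical registers allocated, 2 map updates\n");
}

/* CISC: one instruction cracked into the same three uops, but rn0 and rn1
   are internal temporaries; no architectural register is renamed, and both
   temporaries can be freed as soon as the store uop completes. */
static void rename_cisc(void)
{
    int rn0 = alloc_phys();                 /* Load(r0+r1) -> rn0 */
    int rn1 = alloc_phys();                 /* rn0 + r2    -> rn1 */
    (void)rn0; (void)rn1;                   /* Store(r0+r1, rn1)  */
    printf("CISC: 2 physical registers allocated, 0 map updates\n");
}

int main(void)
{
    rename_risc();
    rename_cisc();
    return 0;
}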
Now, on an in-order architecture, the distinction is less clear cut. The traditional integer RISC pipeline is Fetch - Decode - ALU - Memory. CISCy memory operands can be implemented by rearranging this pipeline into Fetch - Decode - AGU - Load - ALU - Store. Depending upon the latency of your AGU (almost certainly less than the ALU's; at most an AGU should be a couple of adders and a barrel shifter), you might be able to squeeze address generation into the Load stage. The amount of memory access logic (other than interstage flip-flops) need not increase much: at most you need some small logic to determine whether a load or a store gets priority for the memory port in a given cycle (a simple method would be to declare that loads always have priority), plus a store-to-load conflict check, forwarding or stalling when a store still in the pipeline overlaps a younger load.
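For concreteness, here is roughly what that AGU would compute; the field names, widths, and scale range are assumptions for illustration.
Code:
#include <stdio.h>
#include <stdint.h>

/* Inputs to the AGU stage of the Fetch - Decode - AGU - Load - ALU - Store
   pipeline sketched above (field names/widths are illustrative). */
typedef struct {
    uint64_t base;   /* value read from the base register, e.g. r0  */
    uint64_t index;  /* value read from the index register, e.g. r1 */
    unsigned scale;  /* barrel-shift applied to the index (0..3)    */
    int64_t  disp;   /* sign-extended immediate displacement        */
} agu_inputs;

/* Two adds and one shift: the "couple of adders and a barrel shifter",
   so its latency should sit comfortably below a full ALU's. */
static uint64_t agu(const agu_inputs *in)
{
    return in->base + (in->index << in->scale) + (uint64_t)in->disp;
}

int main(void)
{
    agu_inputs in = { 0x1000, 4, 3, 0 };   /* r0 = 0x1000, r1 = 4, scaled by 8 */
    printf("address = 0x%llx\n", (unsigned long long)agu(&in));  /* 0x1020 */
    return 0;
}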
In general, the extra logic of the CISC machine in this case should increase performance; you drop the 2-cycle load-to-use penalty by shuffling loads earlier in the pipeline and allowing integer operations to be issued along with them (in this case I'm assuming a 3-operand machine).
In general, I think it's going to be a bit swings-and-roundabouts: the CISC machine gets smaller instructions and more efficient memory access, while the load-store architecture probably gets a slight increase in non-memory instruction efficiency. Power consumption probably tips in the same direction as performance.
This all assumes that the decoder can scale well; here I'm assuming that it takes at most a few bits from the first coding unit of an instruction to determine how long it is, where the first "coding unit" is defined to be the minimum possible size of an instruction. That permits relatively easy parallel decoding. It's when things end up x86-like, with multiple prefixes, multi-byte opcodes, and variable-length postfixes, that parallel decoding gets really hard.
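As a sketch of the kind of scheme I mean, assume 16-bit coding units and a made-up rule where the low two bits of the first unit give the length in units; then finding instruction boundaries only needs a peek at each candidate first unit.
Code:
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Length in coding units, taken from a few bits of the first unit only
   (hypothetical encoding: low 2 bits + 1 = 1..4 units, i.e. 2..8 bytes). */
static unsigned insn_len(uint16_t first_unit)
{
    return (first_unit & 0x3u) + 1;
}

/* Mark instruction start positions within a fetch block. Because each
   length depends only on its own first unit, hardware can compute
   insn_len() for every unit position in parallel and then resolve the
   start markers with a short chain; the serial loop below is just the
   software equivalent. */
static void find_starts(const uint16_t *block, size_t nunits, uint8_t *is_start)
{
    size_t i = 0;
    while (i < nunits) {
        is_start[i] = 1;
        i += insn_len(block[i]);
    }
}

int main(void)
{
    uint16_t block[8]  = { 0x0001, 0, 0, 0x0002, 0, 0, 0, 0 };
    uint8_t  starts[8] = { 0 };

    find_starts(block, 8, starts);
    for (size_t i = 0; i < 8; i++)
        printf("unit %zu: %s\n", i, starts[i] ? "instruction start" : "-");
    return 0;
}
With x86, by contrast, the length depends on prefixes, the opcode map, and ModRM/SIB, so each parallel decoder effectively has to guess lengths or wait on its neighbour.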