What features would you like in a CPU?

Owen · Post by **Owen** » Thu Nov 27, 2008 9:03 am

As someone who is experimenting with designing his own CPU (To be implemented in Verilog HDL), I'd like to ask you all a question:

As developers - both OS and application - what features would you most like to see in a new CPU? What unusual features from present CPUs do you find especially valuable?

Now, first things first, this design will still be mostly conventional. I'm aiming at a RISC processor, with some more complex opcodes thrown in where they seem valuable (e.g. a strcpy which does 4-byte copies but automatically performs masked memory accesses to handle misaligned starts and the terminating NUL).
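To make the strcpy idea concrete, here is a rough software model of what such an opcode might do. It is only a sketch: the name `strcpy_words` and the byte-wise handling of the head and tail are assumptions for illustration (real hardware would presumably use masked word accesses instead), and nothing here reflects the actual design.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Classic bit trick: nonzero iff at least one byte of w is 0x00. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical software model of the proposed STRCPY opcode. */
char *strcpy_words(char *dst, const char *src)
{
    char *d = dst;

    /* Misaligned start: copy single bytes until src is 4-byte aligned. */
    while (((uintptr_t)src & 3u) != 0) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }

    /* Aligned body: move 4 bytes at a time while no byte is NUL.
       The aligned load may touch up to 3 bytes past the terminator. It
       never crosses a 4-byte boundary, but strict ISO C does not permit
       it, so treat this as a model of the hardware, not portable code. */
    for (;;) {
        uint32_t w;
        memcpy(&w, src, sizeof w);
        if (has_zero_byte(w))
            break;
        memcpy(d, &w, sizeof w);
        src += 4;
        d += 4;
    }

    /* End of string: finish byte by byte, including the NUL. */
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

The point of doing this in one opcode would be that the three phases (head, body, tail) collapse into masked word accesses, so software never has to branch on alignment.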

Another important question is this: how would you like interrupts implemented? I'm especially asking this from a NUMA multiprocessor scenario, even though my design is likely to be initially single chip, single core. I particularly ask this from a position where the system's processors may be asymmetric; for example, there may be a 16-bit processor closely coupled to the network controller. The main CPU(s) obviously need to be able to interrupt it, and it obviously needs to be able to interrupt them, but it would be very silly to send this microcontroller graphics card interrupts.

Finally, how many registers do you think are appropriate? Note that a processor need not save all registers on a context switch: if they are saved to a special region of memory rather than the stack, an extra (hidden) bit can be added to each register to tell the processor it has not been modified and need not be saved.
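The hidden "modified" bit could be modelled in software roughly as follows. This is a minimal sketch under assumptions: 32 registers, a bitmask standing in for the per-register hidden bits, and invented names (`cpu_model`, `reg_write`, `context_save`).

```c
#include <stdint.h>

#define NUM_REGS 32  /* assumed register count, for illustration only */

struct cpu_model {
    uint32_t regs[NUM_REGS];
    uint32_t dirty;  /* bit i set: register i was written since the last save */
};

/* Every register write also sets that register's hidden bit. */
void reg_write(struct cpu_model *cpu, unsigned r, uint32_t value)
{
    cpu->regs[r] = value;
    cpu->dirty |= (uint32_t)1 << r;
}

/* Store only the modified registers to the task's save area, then clear
   the hidden bits. Returns the number of memory stores performed. */
unsigned context_save(struct cpu_model *cpu, uint32_t save_area[NUM_REGS])
{
    unsigned stores = 0;
    for (unsigned r = 0; r < NUM_REGS; r++) {
        if (cpu->dirty & ((uint32_t)1 << r)) {
            save_area[r] = cpu->regs[r];
            stores++;
        }
    }
    cpu->dirty = 0;
    return stores;
}
```

This only works because each task has a fixed save slot per register: an unmodified register's slot already holds the right value from the previous save, which is why a stack-based save would not do.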

(Just a note: I won't be implementing a 32-bit processor for a while yet; I'm currently designing that aforementioned 16-bit processor. It has 16:16 linear segment:offset addressing; thus it can access the full 32-bit address space. Its main purpose is, as I mentioned, high performance peripheral offloading. The processor would have two buses: one which accesses main system memory and another which accesses its private memory, which would include RAM for code and the I/O registers of any peripherals it was controlling. If the processor is sleeping, then the buses would just pass through.)
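One plausible reading of "16:16 linear" addressing is that the segment simply supplies the upper 16 bits of the address, unlike x86 real mode, where the address is (segment << 4) + offset. A sketch under that assumption (the function name is made up):

```c
#include <stdint.h>

/* Assumed interpretation: the segment is the high half of a flat 32-bit
   address and the offset is the low half, so segments never overlap. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 16) | offset;
}
```

Under this reading, `linear_address(0x1234, 0x5678)` gives `0x12345678`, and every 32-bit address has exactly one segment:offset form.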

Love4Boobies · Post by **Love4Boobies** » Thu Nov 27, 2008 10:28 am

Precise interrupts? More to come.

Owen · Post by **Owen** » Thu Nov 27, 2008 10:50 am

You have to throw deterministic interrupts out the window when you reach the point at which cache becomes a necessity. Which is in pretty much any system using DRAM, since that has quite high latencies, making it hate random and love burst accesses.

The other problem with deterministic interrupts is that you need multiple register files. With big ones, or with systems which can nest interrupts, you begin having problems here.

bewing · Post by **bewing** » Fri Nov 28, 2008 9:38 am

When I build my RISC CPU, here are a few opcode things that I intend to implement:

Addressing modes that include subtraction -- ie. [esi - 8*ebx]
A version of a MOV instruction that sets some eflags -- esp ZF.
An entire set of arithmetic operations that do not modify any eflags.
A version of INC and DEC instructions that add/sub 2, 4, and 8.
A version of an LEA instruction that sets all eflags based on the result.
Register indirection -- ie. accessing register numbers as an "array" from another register.
An opcode that will burst transfer a large number of registers to/from memory, and set a flag on completion.
A CALL instruction that stores a magic number on the stack, and a RET that autoverifies it, to prevent stack corruption.
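As an illustration of the last item, the checked CALL/RET pair might behave like this toy model. Everything here is assumed for the sketch: the `machine` struct, the upward-growing stack, the value of the magic word, and the lack of bounds checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 64
#define CALL_MAGIC  0xC0DEFACEu  /* arbitrary illustrative value */

struct machine {
    uint32_t stack[STACK_WORDS];
    size_t   sp;  /* index of the next free slot; grows upward in this model */
    uint32_t pc;
};

/* CALL: push the return address, then the magic word, then jump. */
void op_call(struct machine *m, uint32_t target)
{
    m->stack[m->sp++] = m->pc + 1;  /* return address: the next instruction */
    m->stack[m->sp++] = CALL_MAGIC;
    m->pc = target;
}

/* RET: pop and verify the magic word before trusting the return address.
   Returns false (where hardware would raise a fault) on a mismatch. */
bool op_ret(struct machine *m)
{
    if (m->stack[--m->sp] != CALL_MAGIC)
        return false;
    m->pc = m->stack[--m->sp];
    return true;
}
```

A fixed magic word only catches accidental corruption; anything that overwrites the frame with the same constant passes the check.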

Korona · Post by **Korona** » Fri Nov 28, 2008 9:42 am

bewing wrote:When I build my RISC CPU, here are a few opcode things that I intend to implement:

Addressing modes that include subtraction -- ie. [esi - 8*ebx]
A version of a MOV instruction that sets some eflags -- esp ZF.
An entire set of arithmetic operations that do not modify any eflags.
A version of INC and DEC instructions that add/sub 2, 4, and 8.
A version of an LEA instruction that sets all eflags based on the result.
Register indirection -- ie. accessing register numbers as an "array" from another register.
An opcode that will burst transfer a large number of registers to/from memory, and set a flag on completion.
A CALL instruction that stores a magic number on the stack, and a RET that autoverifies it, to prevent stack corruption.

That sounds like a CISC CPU.

cpumaster · Post by **cpumaster** » Fri Nov 28, 2008 10:03 am

What I think should be added as a feature is that software can send a kill signal to the CPU and have the system stay online in a suspended state, with the option of restarting the CPU. Also AMD's Cool'n'Quiet feature: it only goes fast when there is a bigger load, so it stays cool. Anyway, why not make a matching motherboard and have it run a homemade OS?

Owen · Post by **Owen** » Fri Nov 28, 2008 1:55 pm

bewing wrote:When I build my RISC CPU, here are a few opcode things that I intend to implement:

Addressing modes that include subtraction -- ie. [esi - 8*ebx]

I'm halfway there - all relative offsets are signed, though my most complex addressing mode is MOV r0, [r1 + r2 * 8 + 4]. Also note that it's pure load/store, excepting the complex instructions like STRCPY, MEMCPY, etc. (and that the * in the above is really a bitshift).
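For reference, the address arithmetic behind a mode like [r1 + r2 * 8 + 4] is small. A sketch, assuming 32-bit registers, a scale encoded as a shift count, and a signed displacement (the function name is hypothetical):

```c
#include <stdint.h>

/* base + (index << shift) + signed displacement. Unsigned wraparound
   makes a negative displacement subtract, as two's complement hardware
   would. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           unsigned shift, int32_t disp)
{
    return base + (index << shift) + (uint32_t)disp;
}
```

With r1 = 0x1000 and r2 = 2, `effective_address(0x1000, 2, 3, 4)` yields `0x1014`; a shift of 3 is the "* 8".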

A version of a MOV instruction that sets some eflags -- esp ZF.

Quite possible with MOV reg, reg; not so much with memory moves, since that kind of thing is a good way to gobble opcodes very quickly

An entire set of arithmetic operations that do not modify any eflags.

Why? Very rarely do I see a case where you want to do some form of comparison and keep the results around for a long time.

A version of INC and DEC instructions that add/sub 2, 4, and 8.

I've already designed that in as a form of the ADD.literal instruction

A version of an LEA instruction that sets all eflags based on the result.

Another way to gobble instructions.

I'm considering a LEA, but I'm not devoting 75% of my opcode space to addressing when it could be used for more useful instructions!

Register indirection -- ie. accessing register numbers as an "array" from another register.

Seriously - why? I've never come across a need to use an array of dynamically indexed registers.

An opcode that will burst transfer a large number of registers to/from memory, and set a flag on completion.

Loading from memory is easy enough, but storing to it consumes large quantities of silicon. I have a PFETCH instruction which tells the processor to fetch an address that I don't need right now, with the caveat that said fetch can blow up in the consuming instruction. That is, it's perfectly valid to get a page fault when doing ADD r0, r1 if r1 was set for a PFETCH and your address triggered one. Debugger developers will love this one, since I imagine it will add lots of heinous backtracking! Then again, I'm gonna love implementing the logic to insert MOV rX, [rX] instructions into the pipeline to handle the unloaded case.

A CALL instruction that stores a magic number on the stack, and a RET that autoverifies it, to prevent stack corruption.

Interesting idea - though one which would probably have to be a double cycle instruction in order to handle the two memory accesses.

cpumaster wrote: What I think should be added as a feature is that software can send a kill signal to the CPU and have the system stay online in a suspended state, with the option of restarting the CPU. Also AMD's Cool'n'Quiet feature: it only goes fast when there is a bigger load, so it stays cool. Anyway, why not make a matching motherboard and have it run a homemade OS?

I believe your "kill signal" is called the HLT instruction, which halts the CPU until an interrupt arrives. As for speed throttling, I'm unlikely to support it, though I will support completely stopping the clock to the execution unit and all its periphery when a halt instruction is executed.

As for a motherboard - I'd probably design some form of motherboard with the CPU in an FPGA on it, but I'd want to implement it on a development board first - getting one thing working at a time is better for my sanity!

bewing · Post by **bewing** » Fri Nov 28, 2008 2:46 pm

Korona wrote: That sounds like a CISC CPU.

On mine, I'm planning on having 2 million or more registers, and "reducing" the instruction set in other ways than eliminating basic integer opcodes. I don't like cramming my concepts into predefined boxes -- you can call it anything you want. And my suggestions are always meant to be filtered through what is "doable".

Owen wrote: Why? Very rarely do I see a case where you want to do some form of comparison and keep the results around for a long time.

A "long time" is rarely necessary, but I do run into cases where I need to do some quick calculation (usually just before a return) without messing up the state of my return flags.
This is especially true for fixing up the stack pointer (removing automatic storage) before a return, if you do not have an LEA instruction. You need to do an ADD ESP, 0x?? -- and that's going to screw up all of your return EFLAGS.

Seriously - why? I've never come across a need to use an array of dynamically indexed registers.

One of the biggest deficiencies in modern CPUs is that you cannot store a simple array in registers. Do you realize how many clock cycles you could save if you could store an 8 entry int array in registers, rather than having it be in cached (hopefully) main memory?

(In my design, the intent is that I am going to be eliminating cache completely -- so I need to be able to do ALL forms of addressing with registers only -- and there will be NO main memory addressing modes, except the one "load register block from main mem" and "store register block to main mem".)

Owen · Post by **Owen** » Fri Nov 28, 2008 3:34 pm

bewing wrote:On mine, I'm planning on having 2 million or more registers

Not to be picky - but at one clock cycle per instruction, you need at least 3 ports on your register file for a traditional two or three operand instruction set, and the size of a register tends to go up with the square of the number of ports. This is why processors tend to have smaller numbers of registers

The problem with operating without cache is that a processor accesses memory - in random places - far faster than the memory's clock rate. And when you throw in the randomness, you spend more memory cycles on latencies than on actual access cycles.

Ferrarius · Post by **Ferrarius** » Sun Dec 07, 2008 5:29 am

I'm planning on having 2 million or more registers, and "reducing" the instruction set in other ways than eliminating basic integer opcodes. I don't like cramming my concepts into predefined boxes -- you can call it anything you want.

Besides being problematic with opcodes, as Owen mentioned already, a processor with 2M+ registers would be quite expensive to make. With 2M+ registers you'd more or less be adding a register cache. And even though cache sizes have grown significantly lately, cache continues to be expensive.

A version of INC and DEC instructions that add/sub 2, 4, and 8

That would be heavenly.

What I think should be added as a feature is that software can send a kill signal to the CPU and have the system stay online in a suspended state, with the option of restarting the CPU. Also AMD's Cool'n'Quiet feature: it only goes fast when there is a bigger load, so it stays cool. Anyway, why not make a matching motherboard and have it run a homemade OS?

Intel SpeedStep? On-demand underclocking and, in the Core i7, also overclocking?

Troy Martin · Post by **Troy Martin** » Sun Dec 07, 2008 11:21 am

Free access to an 8-16 KB cache.

And maybe reverse DIV and MUL (Say, DIVR 3 divides DX by 3) for the sake of not using XCHG.

Owen · Post by **Owen** » Sun Dec 07, 2008 1:19 pm

Troy Martin wrote:Free access to an 8-16 KB cache.

Waste of cache. Either the OS ends up in the cache, in which case you have to decide which pieces of the OS to store there, or each application can load a bit of itself into the cache and you have to switch it out every task switch. In any case, the processor knows better

And maybe reverse DIV and MUL (Say, DIVR 3 divides DX by 3) for the sake of not using XCHG.

Does DIV rX, 3 not do that? Perhaps you mean DIV 3, rX, in which case I see little point as you'll end up needing big literals for it to be worthwhile. In any case, with 32 GPRs, the register pressure should be sufficiently low to cope with a limited number of instructions supporting literals

Venkatesh · Post by **Venkatesh** » Tue Dec 16, 2008 11:38 pm

OS: DCAS ; Applicationy-stuff: Multiply-accumulate, tagged arithmetic.

If the CPU is going to play microcontroller, one (or more) user-accessible shift registers, with selectable taps/feedback! Clock dividers, PRNGs, and convolutional encoders suddenly don't take lots of software bit shuffling.
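For a sense of the "software bit shuffling" such a register would replace, here is a sketch of a 16-bit Fibonacci LFSR with a selectable tap mask. The function names are invented for illustration; the tap mask 0x002D used below corresponds to the well-known maximal-length polynomial x^16 + x^14 + x^13 + x^11 + 1.

```c
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR. tap_mask selects which state bits
   are XORed together to form the bit shifted in at the top. */
uint16_t lfsr_step(uint16_t state, uint16_t tap_mask)
{
    uint16_t tapped = state & tap_mask;
    unsigned feedback = 0;

    while (tapped) {              /* parity of the tapped bits */
        feedback ^= tapped & 1u;
        tapped >>= 1;
    }
    return (uint16_t)((state >> 1) | (feedback << 15));
}

/* Count steps until the state returns to the seed (capped, in case a
   poorly chosen tap mask never revisits it). */
uint32_t lfsr_period(uint16_t seed, uint16_t tap_mask)
{
    uint16_t s = seed;
    uint32_t n = 0;

    do {
        s = lfsr_step(s, tap_mask);
        n++;
    } while (s != seed && n < 70000u);
    return n;
}
```

With a nonzero seed and tap mask 0x002D the sequence visits all 65535 nonzero states before repeating. In hardware the parity loop is a handful of XOR gates, which is the appeal of exposing it as a register.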

Owen · Post by **Owen** » Wed Dec 17, 2008 10:13 am

One thought I had was a fixed point math mode. An example of this would be
XQUOT 10 (Set flags to indicate 10-bit quotient)
XMUL rA, rB, rC (rC = rA * rB >> Flags.Quotient; 64-bit intermediates)
XDIV rA, rB, rC (rC = (rA << Flags.Quotient) / rB)

(Why the X? F is reserved for floating point, and it's fiXed point)
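In C terms, the two operations would compute something like the following. This is a sketch: the fractional bit count is passed as a parameter `q` rather than read from a flags field, and the function names just mirror the mnemonics.

```c
#include <stdint.h>

/* rC = (rA * rB) >> q, using a 64-bit intermediate so the product cannot
   overflow. Right-shifting a negative value is implementation-defined in
   C; mainstream compilers use an arithmetic shift. */
int32_t xmul(int32_t a, int32_t b, unsigned q)
{
    return (int32_t)(((int64_t)a * b) >> q);
}

/* rC = (rA << q) / rB. The pre-shift is written as a multiplication to
   avoid left-shifting a negative value. b must be nonzero. */
int32_t xdiv(int32_t a, int32_t b, unsigned q)
{
    return (int32_t)(((int64_t)a * ((int64_t)1 << q)) / b);
}
```

With 10 fractional bits, 1.5 is 1536 and 2.0 is 2048, so `xmul(1536, 2048, 10)` gives 3072 (3.0) and `xdiv(3072, 2048, 10)` gives 1536 (1.5).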

LoseThos · Post by **LoseThos** » Thu Dec 18, 2008 8:48 pm

The single instruction I would like that's not on an x86, as far as I know, is an instruction to fetch a noncached memory value, going around the value in cache. I'm pretty sure this would be handy.

Suppose you have two cores and a global memory variable which is cached. Both have accessed it recently, so each core has it in its local cache. Core #0 changes it. Core #0 can write-back-invalidate its cache, writing everything out of cache. (Just pushing one value from cache to memory would be good, but WbInvd does the job.) Now, you have core #1... how do you fetch the updated value sitting in memory when you have a different value in your local cache? There's no way to do it that I know of. If you WbInvd, it might write the stale value out and clobber the value in memory. If you invalidate the cache without writing back, you just lost changes to other things. Maybe WbInvd does not write unmodified values sitting in cache, in which case there is a way, since it might be acceptable if it works only in one direction, #0->#1 or #1->#0. If WbInvd writes everything to memory, it'll clobber the value you are trying to fetch.

Maybe, there's a reason they don't have it--lots of things are easier said than done.

You can of course use uncached memory pages, but, for example, I have task records and I don't want the whole record uncached -- often I just want one value to be fetched or stored. It's inconvenient to have two separate records with half in a cached area and half in an uncached one.

OSDev.org
