OSDev's dream CPU

Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: OSDev's dream CPU

Post by Rudster816 »

rdos wrote: CPU instruction scheduling is a dead-end. The only way to achieve more performance is to write applications with many threads.

Based on my tests, I can only conclude that Moore's law is no longer valid. We had 3GHz CPUs several years ago, and they are just as fast as the standard CPUs of today. So, in reality, the only way to achieve better performance is to write parallel software, which your typical C compiler cannot help you with.

Typical users couldn't care less about compilers. They want their software to work, and to run as fast as possible. Practically nobody sold native Itanium software, so the CPU ended up executing legacy software at a terribly low speed.

That is not the high-volume market. The high-volume market is desktop PCs and portable PCs.

That's not the way I understand it. AFAIK, Intel launched (and patented) Itanium in order to get rid of the competition once and for all. This was a big failure since practically nobody bought Itanium. Then it was AMD that extended x86 to 64 bits, and thereby made sure that they stayed on the market.
Moore's law states that the number of transistors in a CPU will double every ~18-24 months, and it has been remarkably accurate. People just tacked on the "CPUs double in speed" part themselves.

If you think that a 3.0GHz P4 is just as fast as a 3GHz Sandy Bridge chip, you're delusional. Even clock for clock (and core for core), an SB chip will absolutely roflstomp a P4. They've also quadrupled the number of cores, and cut the TDP in half. The reason you don't see 9GHz CPUs is because the TDP increases for high clock rates are monstrous. It's also because what electricity can accomplish in 300 picoseconds is very limited. We're at the point where large-scale clock increases are gone.

You're also dead wrong when you say the server market isn't the high-volume market. I don't have the numbers, so I won't say the server market is larger for sure, but I know there is a lot more money to be made in it than in the desktop\laptop one.


Intel wanted Itanium to be the 64-bit successor to x86. It failed in that sense, and on the business side of things. It also did the exact opposite of what Brendan said (removing competition), because while Intel was working on the doomed Itanium architecture, AMD was developing AMD64. If Intel had done the same thing as AMD and just extended x86 to 64 bits, AMD's architecture would have inevitably failed because Intel had (and still has) way more clout than AMD. But now Intel needs a license from AMD to make chips, when it used to be that AMD just needed a license from Intel. This benefited AMD in two ways: not only did they have the cash from the license agreement, they also had a huge lead in the performance department courtesy of the Athlon64. It wasn't until the Core 2 Duo (and the return of the P6 microarchitecture) that Intel regained the performance advantage.

As far as Alpha\SPARC\etc. go, they were already effectively where they are now in the server market. Itanium didn't really change their fate; they were never going to surpass x86 in any market.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: OSDev's dream CPU

Post by bluemoon »

I dream of a mini on-CPU stack that can push/pop registers and generate an exception on stack overflow.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Brendan described a good approach; it has good ideas, but they are not new. I'd treat his post as a description of some kind of counter-x86 architecture, and many things in it are obvious.

I have spent a lot of time over the last decade designing a new ISA. It mostly exists as a formal description and a not-yet-complete virtual machine. Here I'll try to discuss some points.
Rudster816 wrote:I don't know what your definition of CISC is, but I think any new architecture that isn't load\store would be poor. I don't think there is a microarch that doesn't turn a MUL RAX, [RBX] into two microcode ops anyways. It just makes instructions longer and a lot more complex.
Here I don't agree. I think that CISC and RISC may be joined in one mixed architecture that gets the benefits of both. Here is an idea (from my ISA).
For the regular command set, each instruction gets a common structure consisting of an opcode and from 0 to 4 operands. The encoding of each operand uses a regular, universal code that helps decode the whole instruction. Moreover, the length of the opcode, the size of the data, and the operands must be explicitly specified in their structure. For example, the instruction
XOR.W SRC1, SRC2, DST
has four distinct parts. The instruction prefetch block may then start processing the four parts of the instruction almost simultaneously. But where is the RISC, you may ask? The RISC is that the encoding of each operand, being separated from the opcode, may be treated as a load/store instruction! The only difference from the CISC approach is that the opcode precedes the load/store operations and provides the size of the data. What are the benefits in comparison with pure RISC?
1. Store/load opcodes don't need to be specified since they are implied.
2. Size of data is given simultaneously for all stores/loads, saving space and increasing pipeline throughput.
3. It has good potential for internal execution optimization which RISC lacks.
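To make this concrete, here is a rough C sketch of how a front end might view such an instruction. The byte layout (one opcode byte carrying the operation and data size, then one self-describing descriptor byte per operand) is only for illustration - it is not the real encoding of my ISA:

Code:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: opcode byte = operation (low 6 bits) + data size (top 2 bits),
 * followed by one descriptor byte per operand. Each descriptor can be handed to a
 * load/store unit on its own, which is the "RISC inside CISC" part of the idea. */
struct operand {
    int is_mem;        /* 0 = register operand, 1 = memory via base register */
    unsigned reg;      /* register number (or base register for memory) */
};

static void decode_operand(uint8_t desc, struct operand *out)
{
    out->is_mem = (desc >> 7) & 1;
    out->reg = desc & 0x1F;
}

int main(void)
{
    /* XOR.W r1, r2, [r3] under the made-up byte layout above */
    const uint8_t insn[4] = { 0x4A, 0x01, 0x02, 0x83 };
    unsigned operation = insn[0] & 0x3F;   /* which ALU op */
    unsigned size = insn[0] >> 6;          /* .B/.W/... style size field */
    struct operand op[3];

    for (int i = 0; i < 3; i++)
        decode_operand(insn[1 + i], &op[i]);   /* each one is a tiny implied load/store */

    printf("op=%u size=%u src1=r%u src2=r%u dst=r%u (mem=%d)\n",
           operation, size, op[0].reg, op[1].reg, op[2].reg, op[2].is_mem);
    return 0;
}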
Rudster816 wrote:Instructions always falling on 2 byte boundaries should make decoding variable length instructions a bit easier.
I think that this limitation is unnatural. The size of an instruction must be easy to calculate in any case, but for word alignment you have to break the structure with padding and increase memory bandwidth requirements.
Rudster816 wrote:Just 16 registers for both GP\FP is quite anemic.
IMHO 16 registers is a perfect balance. More registers will not noticeably improve performance on most algorithms (and will even degrade it when saving/restoring CPU state) but will make encoding them more space-consuming and irregular.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: OSDev's dream CPU

Post by Rudster816 »

Yoda wrote:Brendan described a good approach; it has good ideas, but they are not new. I'd treat his post as a description of some kind of counter-x86 architecture, and many things in it are obvious.

I have spent a lot of time over the last decade designing a new ISA. It mostly exists as a formal description and a not-yet-complete virtual machine. Here I'll try to discuss some points.
Rudster816 wrote:I don't know what your definition of CISC is, but I think any new architecture that isn't load\store would be poor. I don't think there is a microarch that doesn't turn a MUL RAX, [RBX] into two microcode ops anyways. It just makes instructions longer and a lot more complex.
Here I don't agree. I think that CISC and RISC may be joined in one mixed architecture that gets the benefits of both. Here is an idea (from my ISA).
For the regular command set, each instruction gets a common structure consisting of an opcode and from 0 to 4 operands. The encoding of each operand uses a regular, universal code that helps decode the whole instruction. Moreover, the length of the opcode, the size of the data, and the operands must be explicitly specified in their structure. For example, the instruction
XOR.W SRC1, SRC2, DST
has four distinct parts. The instruction prefetch block may then start processing the four parts of the instruction almost simultaneously. But where is the RISC, you may ask? The RISC is that the encoding of each operand, being separated from the opcode, may be treated as a load/store instruction! The only difference from the CISC approach is that the opcode precedes the load/store operations and provides the size of the data. What are the benefits in comparison with pure RISC?
1. Store/load opcodes don't need to be specified since they are implied.
2. Size of data is given simultaneously for all stores/loads, saving space and increasing pipeline throughput.
3. It has good potential for internal execution optimization which RISC lacks.
Rudster816 wrote:Instructions always falling on 2 byte boundaries should make decoding variable length instructions a bit easier.
I think that this limitation is unnatural. The size of an instruction must be easy to calculate in any case, but for word alignment you have to break the structure with padding and increase memory bandwidth requirements.
Rudster816 wrote:Just 16 registers for both GP\FP is quite anemic.
IMHO 16 registers is a perfect balance. More registers will not noticeably improve performance on most algorithms (and will even degrade it when saving/restoring CPU state) but will make encoding them more space-consuming and irregular.
Right now the full instruction size is encoded in the first two bytes of each instruction. I also fail to see how instruction alignment is unnatural at all. Address alignment in general is fairly common, and so are fixed-size instructions. I don't know for sure, but I'd venture to guess that non-Thumb ARM requires instructions to fall on 4-byte boundaries.

As for your implicit load\store scheme, it adds a huge layer of complexity to every instruction. Every instruction, regardless of whether it's an xor, add, sub, mul, etc., might contain an effective address, which takes lots of bits to encode. Even in the most common case of register-to-register operations, you still have to have a way to specify that, which requires more bits to encode. The more complex the instructions are, the more complex the decoding is. Complex instructions do allow for greater code density, but that isn't nearly as attractive as a faster CPU.

My previous example requires two uops.

Code:

mov temp, [rbx]
mul rax, temp
This effectively means that the execution time would be the same as in a pure load\store architecture (all else equal). You may increase the instruction density for that particular sequence, but you're shooting yourself in the foot because it means that most instructions have the possibility of requiring microcode, which increases the complexity of the front end of the pipeline, and may even require the addition of an entire stage. The increase in code density is negligible at best too, because while you might increase it for sequences like the one above, you decrease it for the most common case of register-to-register operations.

I also greatly dislike 3-operand architectures for the same reason. You're just wasting bits in the instruction encoding that could be used elsewhere. The vast majority of the time, I don't need to preserve both operands when writing assembly, and moving one to another register beforehand to save it when I do is extremely cheap. It is useful in plenty of situations, but IMO it doesn't justify the cost.

16 GP registers that aren't aliased with FP registers is fine in most cases. But when you combine them, things can get really cramped for programs that actually make heavy use of FP operations. It also comes with a penalty at the microarchitecture level, because both FP and ALU operations will be accessing the same register file. This requires more read\write ports for the register file, and additional stalls when you run out. It might not be bad if there were some sort of upside, but I see little to no upside to aliasing FP\GP registers.

Personally, I'm doing 26-28 GP registers (and of course, not aliased with FP) in my ISA. The other 4-6 (out of a 5-bit field) are for specific things like the frame\stack pointer, so they aren't GP in the purest sense. While you may not be able to make use of all the registers in one given algorithm, having that many greatly increases the flexibility of the calling conventions, especially for leaf functions. It wouldn't be restrictive to reserve 6+ registers for function calls, which results in very little stack spillage. This is especially helpful for complex loops (that require lots of registers) that make a function call, where the caller\callee would otherwise have to constantly push\pop a bunch of registers on the stack to save them before and after each call.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: OSDev's dream CPU

Post by Brendan »

Hi,
OSwhatever wrote:
Brendan wrote:It "failed" because there wasn't a good enough compiler, and because it severely restricts the range of optimisations that could be done in future versions of the CPU (without causing performance problems for code tuned for older Itanium CPUs).
The problem in general with Itanium was that it was too complex to optimize for.
So you're saying that it "failed" because there wasn't a good enough compiler (and that there wasn't a good enough compiler because it was too complex to optimize for)?
OSwhatever wrote:Compiler instruction scheduling is the way forward when you go massively multicore, as OOE in HW increases complexity a lot; multiply that by all the cores you have and you suddenly save a lot if you remove it.
Compiler instruction scheduling could work if there's only ever one implementation of the architecture; or if all software is provided as source code (so that the end user can recompile it with a compiler that tunes the result to suit their specific CPU). Otherwise, if you try to add more execution units, change the length of the pipeline, or make any other changes that affect the timing of instructions, you end up with performance problems because the CPU will be expecting code tuned differently and won't do anything to work around the "wrong instruction scheduling" problem.

As far as complexity goes, it's like RISC - sounds good in theory, but in practice the extra complexity is necessary for the best performance (rather than just "average" performance) and the space it consumes in silicon is mostly irrelevant compared to space consumed by cache.
rdos wrote:The only way to achieve more performance is to write applications with many threads.
Applications with many threads (or applications with at least one thread per CPU) can help for some things, but don't help for others. For an example, see if you can figure out how to use many threads to speed up Floyd–Steinberg dithering.

Basically you end up with Amdahl's law, where parts that can't be done in parallel limit the maximum performance (or, where single-threaded performance has a huge impact on software that isn't embarrassingly parallelizable).
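Here's a rough sketch of the Floyd–Steinberg inner loop (illustration only) - the error diffused to the pixel on the right has to exist before that pixel can be quantised, so the pixels form one long serial dependency chain that you can't just carve up across threads:

Code:

/* 1-bit Floyd-Steinberg dithering of a grayscale image held as ints (0..255), row-major. */
void fs_dither(int *img, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int old = img[y * w + x];
            int q = old < 128 ? 0 : 255;        /* quantise to black or white */
            int err = old - q;
            img[y * w + x] = q;

            /* diffuse the error onto pixels that haven't been visited yet */
            if (x + 1 < w)              img[y * w + (x + 1)]       += err * 7 / 16;
            if (y + 1 < h && x > 0)     img[(y + 1) * w + (x - 1)] += err * 3 / 16;
            if (y + 1 < h)              img[(y + 1) * w + x]       += err * 5 / 16;
            if (y + 1 < h && x + 1 < w) img[(y + 1) * w + (x + 1)] += err * 1 / 16;
        }
    }
}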
rdos wrote:Typical users couldn't care less about compilers. They want their software to work, and to run as fast as possible. Practically nobody sold native Itanium software, so the CPU ended up executing legacy software at a terribly low speed.
You're right in that there may not have been native versions of popular 3D games or MS Office or anything else that nobody purchasing a high-end/Itanium server would ever care about.

For high-end servers, "software" mostly means OSs, and things like database management systems, web servers and maybe some custom-designed software. All of the OSs that mattered were native (including HP-UX, Windows, Linux, FreeBSD, etc), all the database management systems that mattered were native (including Oracle's and Microsoft's), and all the web servers that mattered were native (IIS, Apache). Custom-designed software (e.g. where the company that owns the server also owns/develops/maintains the source code for their own "in-house" application/s) can be trivially recompiled.
rdos wrote:
Brendan wrote:For high-end servers, support for legacy software is virtually irrelevant (it's not like the desktop space where everyone wants to be able to run ancient Windows applications).
That is not the high-volume market. The high-volume market is desktop PCs and portable PCs.
No. The high volume market is small/embedded systems, where a CPU sells for less than $10 and the CPU manufacturer is lucky to make $1 profit on each unit sold. How many little MIPS CPUs do you think you'd need to sell just to come close to the profit Intel would make from one multi-socket (Itanium or Xeon) server?

It's like saying aircraft manufacturers like Boeing should just give up because people buy more bicycles and cars than airplanes.
rdos wrote:
Brendan wrote:Of course "failed" depends on your perspective. Once upon a time there were a variety of CPUs competing in the high-end server market (80x86, Alpha, Sparc, PA-RISC, etc). Itanium helped kill everything else, leaving Intel's Itanium, Intel's 80x86 and AMD's 80x86 (which can barely be called competition at all now). It's a massive success for Intel - even if they discontinue Itanium they're still laughing all the way to the bank from (almost) monopolising an entire (very lucrative) market.
That's not the way I understand it. AFAIK, Intel launched (and patented) Itanium in order to get rid of the competition once and for all. This was a big failure since practically nobody bought Itanium. Then it was AMD that extended x86 to 64 bits, and thereby made sure that they stayed on the market.
You're looking at the wrong market. Nobody used Itanium for embedded systems, smartphones, notebooks, laptops, desktops or low-end servers, because it was never intended for those in the first place. People did buy Itanium for high-end servers, especially where reliability/fault tolerance and/or scalability was needed. Extending a desktop CPU to 64-bit doesn't suddenly make reliability/fault tolerance and/or scalability features appear out of thin air; and while some of those features have found their way into chipsets for Xeon and Opteron, most of the competing high-end server CPUs (Alpha, Sparc, PA-RISC) were dead or dying before that happened.

Here's a timeline for you to think about:
  • 1986 - PA-RISC released
  • 1987 - Sparc released
  • 1992 - DEC Alpha released
  • 1998 - Compaq purchases most of DEC, decides to phase out Alpha in favour of Itanium (before any Itanium has even been released). Compaq sells intellectual property related to Alpha to Intel.
  • 2001 - original Itanium (mostly just a "proof of concept") released
  • 2002 - Itanium 2 released
  • 2003 - first 64-bit 80x86 released
  • 2004 - first 64-bit 80x86 released by Intel
  • 2006 - Sparc gives up any hope of competing and becomes open-source to stay alive
  • 2007 - HP discontinues PA-RISC in favour of Itanium
Rudster816 wrote:Intel wanted Itanium to be the 64-bit successor to x86. It failed in that sense, and on the business side of things. It also did the exact opposite of what Brendan said (removing competition), because while Intel was working on the doomed Itanium architecture, AMD was developing AMD64. If Intel had done the same thing as AMD and just extended x86 to 64 bits, AMD's architecture would have inevitably failed because Intel had (and still has) way more clout than AMD. But now Intel needs a license from AMD to make chips, when it used to be that AMD just needed a license from Intel. This benefited AMD in two ways: not only did they have the cash from the license agreement, they also had a huge lead in the performance department courtesy of the Athlon64. It wasn't until the Core 2 Duo (and the return of the P6 microarchitecture) that Intel regained the performance advantage.
The only reason AMD had any lead was because Intel's NetBurst microarchitecture sucked. AMD's performance advantage had nothing to do with 64-bit. Don't forget that when AMD first introduced 64-bit nobody really cared because Windows didn't support it anyway, and by the time Windows did support 64-bit (Vista, 2006) Intel was selling 64-bit "core" and "core2" CPUs.

Here's a timeline for you to think about:
  • 1998 - Alpha decides to hide like a little school-girl just because of rumours of Itanium
  • 2002 - Itanium 2 released
  • 2003 - AMD creates a secret weapon to counter the Itanium threat
  • 2004 - Intel walks in and takes AMD's new weapon; then continues to use Itanium in one hand and 64-bit 80x86 in the other hand to pound the daylights out of everyone including AMD
  • 2006 - Sparc sees the "dual wielding" Intel goliath, wets its pants and tries to hide
  • 2007 - PA-RISC commits suicide in fear of the "dual wielding" Intel goliath
:mrgreen:


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: OSDev's dream CPU

Post by rdos »

Brendan wrote:Here's a timeline for you to think about:
  • 1986 - PA-RISC released
  • 1987 - Sparc released
  • 1992 - DEC Alpha released
  • 1998 - Compaq purchases most of DEC, decides to phase out Alpha in favour of Itanium (before any Itanium has even been released). Compaq sells intellectual property related to Alpha to Intel.
  • 2001 - original Itanium (mostly just a "proof of concept") released
  • 2002 - Itanium 2 released
  • 2003 - first 64-bit 80x86 released
  • 2004 - first 64-bit 80x86 released by Intel
  • 2006 - Sparc gives up any hope of competing and becomes open-source to stay alive
  • 2007 - HP discontinues PA-RISC in favour of Itanium
OK, but I suppose this is just fine by me. After all, I suppose Itanium can still run 32-bit x86 code, even if it is slow, so that just gives me another platform I can support without my OS being written to be portable.

I just hope they can outcompete ARM as well. :twisted: :mrgreen:
bubach
Member
Posts: 1223
Joined: Sat Oct 23, 2004 11:00 pm
Location: Sweden

Re: OSDev's dream CPU

Post by bubach »

rdos wrote:OK, but I suppose this is just fine by me. After all, I suppose Itanium can still run 32-bit x86 code, even if it is slow, so that just gives me another platform I can support without my OS being written to be portable.
With what, an opcode simulator? What would be the point, to be slow as hell on a platform you shouldn't even support because it's going to die? :P
rdos wrote:I just hope they can outcompete ARM as well. :twisted: :mrgreen:
Not going to happen. Most server OSs have discontinued support for Itanium, citing Intel saying that they will focus on the x86-64 platform. Also, ARM is aimed at a completely different market - embedded/handheld, not high-end servers.
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: OSDev's dream CPU

Post by OSwhatever »

bubach wrote:Not going to happen. Most server OSs have discontinued support for Itanium, citing Intel saying that they will focus on the x86-64 platform. Also, ARM is aimed at a completely different market - embedded/handheld, not high-end servers.
ARM is now trying to get into the server market, targeting companies that are conscious of their power bill.
bubach
Member
Posts: 1223
Joined: Sat Oct 23, 2004 11:00 pm
Location: Sweden

Re: OSDev's dream CPU

Post by bubach »

Hehe, well OK. Still, they won't go under because of IA-64; it looks more like the other way around if what you say is true.
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: OSDev's dream CPU

Post by rdos »

bubach wrote:
rdos wrote:OK, but I suppose this is just fine by me. After all, I suppose Itanium can still run 32-bit x86 code, even if it is slow, so that just gives me another platform I can support without my OS being written to be portable.
With what, an opcode simulator? What would be the point, to be slow as hell on a platform you shouldn't even support because it's going to die? :P
Originally, Itanium provided x86 support in hardware. However, it seems they discontinued this in more recent processors. According to Wikipedia, their own emulator runs x86 OSes faster than the hardware support did, which is indicative of a pretty bad implementation.
bubach wrote:
rdos wrote:I just hope they can outcompete ARM as well. :twisted: :mrgreen:
Not going to happen. Most server OSs have discontinued support for Itanium, citing Intel saying that they will focus on the x86-64 platform. Also, ARM is aimed at a completely different market - embedded/handheld, not high-end servers.
x86-64 is much better since, apart from some segment register operations no longer being optimized, it runs my OS natively on hardware optimized for x86.

It was a pretty good choice back in 1988 to decide to target 32-bit x86 in the hope that it would be around for a while. Seems like the hardware will be with us many years more. :mrgreen:

The 32-bit x86 architecture actually is one of the most well-thought-out there has ever been, by far surpassing RISC and all the other designs.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: OSDev's dream CPU

Post by Yoda »

Rudster816 wrote:As for your implicit load\store scheme, it adds a huge layer of complexity to every instruction. Every instruction, regardless of whether it's an xor, add, sub, mul, etc., might contain an effective address, which takes lots of bits to encode.
No, that's not so. For the simplest case (register addressing) each operand takes one byte in my encoding scheme, and the structure of an operand is simple enough to decode efficiently.
Rudster816 wrote:Even in the most common case of register-to-register operations, you still have to have a way to specify that, which requires more bits to encode. The more complex the instructions are, the more complex the decoding is. Complex instructions do allow for greater code density, but that isn't nearly as attractive as a faster CPU.
Some RISC architectures use a fixed 32-bit instruction length. In my ISA this length would be enough to encode most instructions with up to 3 register operands. And in a RISC machine each instruction also needs to encode its register operands (with the only exception of stack machines like the Inmos T400/T800), so I don't see a fundamental difference in speed or ease of decoding.
Rudster816 wrote:This effectively means that the execution time would be the same as in a pure load\store architecture (all else equal). You may increase the instruction density for that particular sequence, but you're shooting yourself in the foot because it means that most instructions have the possibility of requiring microcode, which increases the complexity of the front end of the pipeline
That's not so either. Let's compare.
To execute a load/store, a RISC machine also requires microcode for decoding a full instruction. In my ISA the analogous microcode decodes operands as if they were instructions. As I said, decoding an operand in my ISA is very similar to executing a RISC load/store instruction. On the other hand, a RISC machine finds it difficult to forecast which instruction will follow your operation, and that may hurt pipelining. In my ISA the CPU always knows the type of the following instructions (load/store), the size of the data, and the number of instructions, and that may help organize the pipeline more effectively.
Rudster816 wrote:while you might increase it for sequences like the one above, you decrease it for the most common case of register-to-register operations.
See above - register-to-register operations in general don't differ much from RISC.
Rudster816 wrote:I also greatly dislike 3-operand architectures for the same reason. You're just wasting bits in the instruction encoding that could be used elsewhere. The vast majority of the time, I don't need to preserve both operands when writing assembly, and moving one to another register beforehand to save it when I do is extremely cheap. It is useful in plenty of situations, but IMO it doesn't justify the cost.
As is well known, the most frequently used instructions are different kinds of data moves. In my ISA there are both 2-operand and 3-operand versions of many ops like XOR, ADD, etc. The 3-operand form costs very little extra to encode, but it replaces two sequential instructions - the XOR and the MOV of the result - and increases execution speed on such sequences. You don't need to use the 3-op form if the 2-op form is enough.
Rudster816 wrote:16 GP registers that aren't aliased with FP registers is fine in most cases. But when you combine them, things can get really cramped for programs that actually make heavy use of FP operations. It also comes with a penalty at the microarchitecture level, because both FP and ALU operations will be accessing the same register file.
Actually, I don't yet see strong arguments against register aliasing. As for me, decoding in the aliased case is simpler because you don't need additional logic for data transfers. From the point of view of the CPU architecture, my ISA looks the same for every op - take operands here and there, route them to the ALU, and route the result to another location. Keeping the register files separate requires different kinds of instructions for different data flows, and that will increase the size of the microcode and the total number of gates.

As for register pushing/popping, loops, function calls and other stuff, I think a good approach would be to organize a special kind of cache for the stack. It is not implemented in my VM, because the VM reflects only the logic of how the CPU works, but if I write a Verilog description I would add an extremely fast, small stack cache which doesn't spill data out to memory and the second-level cache until a cache overflow occurs, and which doesn't write data to memory below the stack pointer. So, for example, if you push a value onto the stack and the callee function processes it and returns to the caller, the pushed value will just be discarded after the return without any actual memory I/O. The VM may help to estimate the performance gain from implementing such a cache.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
JuEeHa
Member
Posts: 30
Joined: Thu Mar 10, 2011 4:24 am

Re: OSDev's dream CPU

Post by JuEeHa »

bluemoon wrote:I dream of a mini on-CPU stack that can push/pop registers and generate an exception on stack overflow.
I'd really like to see that. Maybe something like 16/32 levels.
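Something like this, as a toy C model of the idea (the depth of 16 and the exact behaviour are just made up for illustration):

Code:

#include <stdio.h>
#include <stdlib.h>

#define DEPTH 16   /* 16 levels, as suggested above */

static unsigned long stk[DEPTH];
static int top;    /* number of live entries */

static void stack_exception(const char *why)
{
    fprintf(stderr, "stack exception: %s\n", why);
    exit(1);       /* a real CPU would vector to an exception handler instead */
}

static void push_reg(unsigned long reg)
{
    if (top == DEPTH)
        stack_exception("overflow");
    stk[top++] = reg;
}

static unsigned long pop_reg(void)
{
    if (top == 0)
        stack_exception("underflow");
    return stk[--top];
}

int main(void)
{
    push_reg(0x1234);
    printf("popped %#lx\n", pop_reg());
    return 0;
}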
Using 700MHz Pentium III machine with 64MB of RAM because I feel like it.
ed implementation in C: main(a){for(;;){read(0,&a,1);if(a=='\n')write(1,"?\n",2);}}
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: OSDev's dream CPU

Post by Brendan »

Hi,
Rudster816 wrote:
Brendan wrote:I'd want something like this:
I don't know what your definition of CISC is, but I think any new architecture that isn't load\store would be poor.
I disagree. Any CPU designer that isn't willing to throw a few extra transistors at instruction decode to improve things like instruction fetch and caching is just plain lazy. I don't want to use 3 instructions and an extra register just to do something simple like "add [somewhere],rax". In a similar way, I wouldn't want to turn an atomic operation like "lock add [somewhere],rax" into a messy/bloated "compare and swap until you're lucky" loop.
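For reference, here's roughly what that "compare and swap until you're lucky" loop looks like (sketched with C11 atomics rather than any particular ISA):

Code:

#include <stdatomic.h>

/* What "lock add [somewhere],rax" degenerates into on an architecture with no
 * atomic read-modify-write: keep retrying until no other CPU touched the value. */
void atomic_add(_Atomic unsigned long *somewhere, unsigned long value)
{
    unsigned long old = atomic_load(somewhere);
    while (!atomic_compare_exchange_weak(somewhere, &old, old + value))
        ;   /* on failure 'old' is reloaded, so just try again */
}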

A decoder for simple instructions would have to be twice as fast as a decoder for complex instructions just to get the same amount of work done; and it's easier to increase decoder complexity than it is to increase clock speed or reduce transistor switching times (or deal with the extra heat from faster clock or higher voltage).
Rudster816 wrote:Predicates for every instruction have been shown to be unnecessary. The latest ARM architecture decided to drop them, mainly because they were not used on most instructions, and branch prediction has gotten much better (92%-96% nowadays). Certain instructions can still have them though, not sure which, but I think just CMOV and CADD would be adequate.
If you remove things just because they aren't strictly necessary, you're going to end up sitting naked on the floor with a sharp stick scratching text into your skin (because paper wasn't strictly necessary). ;)
Rudster816 wrote:Just 16 registers for both GP\FP is quite anemic.
For CISC with register renaming 16 registers are probably overkill. If you're doing the load/store thing with simple instructions and a simple pipeline then you might want several hundred registers in the hope of failing less badly. :)
Rudster816 wrote:I also see no purpose in using the same registers for FP instructions and GP instructions. I'm willing to bet a simple set of MOV gpreg, fpreg and MOV fpreg, gpreg would prove to be more than adequate. I can hardly think of any useful ALU operations that one would want to perform on FP values. In the rare case, just moving it to a GP reg and back to an FP reg would be fine.
Most code (that can't be done with SIMD) is either integer only, mostly integer with a few registers for floating point, or mostly floating point with a few registers used for integers. The overhead of saving/restoring extra registers during context switches isn't worth it, so if you insist on separate registers for floating point then forget about floating point completely (and just use SIMD instead).
Rudster816 wrote:Interesting idea for non page size paging structures. I think 512KB is too large though, as that means a minimum of 1.5MB to map an arbitrary place in virtual memory, which is extremely high.
1.5 MiB was extremely high in 1982. In 1992 it was still very high. In 2002 it was just high. In 2012, I've got 12 GiB of RAM and couldn't care less. Will you be releasing the first version of my CPU before I have 64 GiB of RAM (and a hover-car)?
Rudster816 wrote:The idea of a full 64-bit virtual address space is nice, but unnecessary. The same goes for 64-bit physical addresses. It would also greatly decrease the number of entries in the TLB. Say we have 48-bit virtual\40-bit physical addresses (the original AMD64 implementation). With 4KB pages, that's at least 64 bits to store a virtual->physical mapping. In reality you need to store some additional things, so let's add 8 bits to make a TLB entry 72 bits; that would make a 256-entry TLB 18Kbit. Upping the virtual address to a full 64 bits (keeping the same physical address size) would require 16 additional bits and a 98-bit TLB entry. That makes that same 18Kbit TLB only able to hold 188 entries, or a full 256-entry TLB 24.5Kbit. A full 64-bit scheme would require 122 bits, which would make the TLB 151 entries or 30.5Kbit. I think the costs far outweigh the benefits, especially in the case of physical addresses, because you would also need a full 64-bit address for your memory controller. But even in the case of a 64-bit virtual address space, I don't see any benefit. What could you possibly do that would be worth the decreased TLB entries with a 16EB address space that you couldn't do with a 256TB one?
I didn't say that physical addresses would be 64-bit (I assumed an "implementation defined physical address size" like 80x86). It only costs 16 bits more per TLB entry than CPUs were using a decade ago.

The benefit is that I'll never need to make architectural changes when 48-bit (or whatever you think is today's lucky number) starts becoming a problem (e.g. probably within the next 10 years for large servers). Basically, it's future-proof.
Rudster816 wrote:Using cache as RAM is another interesting idea, but again IMO the costs outweigh the benefits. System initialization is typically done with only one CPU in an SMP system active, so disabling cache coherency just adds complexity for little reason. There would be a huge question of how addresses are used too, one to which I doubt an answer exists. Since you would still need to be able to access outside hardware (in order to initialize the RAM), the CPU would have to differentiate between an address that is RAM and an address that is memory-mapped I\O. Even if it automatically knew (or you told it), what would it do if you tried to read\write RAM? Most importantly, how would you map the cache? It just adds a huge mess of logic onto the chip that would be better used somewhere else. There are also many SoCs that integrate DRAM\SRAM onto the same chip as the CPU, which would be the same thing as using cache for RAM (in the case of integrated SRAM).
For system initialization: as far as I know most firmware uses "cache as RAM" already, so this is nothing new. Unfortunately none of it is architectural, which makes it difficult to use for something like a multi-kernel OS. The idea of disabling cache coherency is to ensure that each CPU's "cache RAM" is private, and can't be trashed by whatever other CPUs do with their "cache RAM" (or by whatever any bus-mastering devices are doing).

The "cache RAM" could just be mapped at 0x00000000 - any physical addresses below "cache size - 1" end up being a cache hit, and any addresses above that are "nothing" or memory mapped IO. Modern 80x86 CPUs already have built-in memory controller that knows when addresses correspond to IO (and should be forwarded to the PCI host controller or whatever) so that takes nothing extra.

You can't try to access external RAM unless you use the special "fetch/store page" instructions intended for that purpose. These instructions fetch or store "one page of cache lines" using the normal physical addressing that the CPU would've used if "cache as RAM mode" wasn't enabled. If you think this is too complicated, then I suggest that you should probably have a look at what Intel is planning to do with caches.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: OSDev's dream CPU

Post by Rudster816 »

berkus wrote:
rdos wrote:The 32-bit x86 architecture actually is one of the most well-thought-out there has ever been, by far surpassing RISC and all the other designs.
You're so full of biased bullsh i t, rdos.
+1. Protected mode was Intel's chance to remove a lot of stuff that no longer made any sense, or to just bag 8086 compatibility altogether. They chose the senseless route and opted for an approach that pretty much kept all the baggage from the 8086 and made it worse.
Yoda wrote: No, that's not so. For the simplest case (register addressing) each operand takes one byte in my encoding scheme, and the structure of an operand is simple enough to decode efficiently.
That means you'll never have a 2-byte instruction for register-to-register operations. My 2-byte instructions have a 1-bit escape, two 5-bit operands, and one 5-bit opcode. This basically means that the vast majority of (currently all) 1-operand (not, neg, etc.) and 2-operand opcodes can be encoded in a 2-byte instruction, and can use any of the 32 registers. If you dropped down to 16 registers, that would free up another two bits that could be used for a lot of things.
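In rough C terms the short form decodes like this (the field widths match what I described above; the exact bit positions here are just for illustration):

Code:

#include <stdint.h>

struct short_insn {
    unsigned opcode;   /* 5-bit opcode */
    unsigned ra, rb;   /* two 5-bit register operands, any of the 32 registers */
};

/* Returns 1 if the 16-bit word is a 2-byte instruction (escape bit clear),
 * 0 if a longer encoding follows. */
int decode_short(uint16_t word, struct short_insn *out)
{
    if (word & 0x8000)
        return 0;                      /* escape bit set: not the 2-byte form */
    out->opcode = (word >> 10) & 0x1F;
    out->ra     = (word >> 5) & 0x1F;
    out->rb     = word & 0x1F;
    return 1;
}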
Yoda wrote: Some RISC architectures use a fixed 32-bit instruction length. In my ISA this length would be enough to encode most instructions with up to 3 register operands. And in a RISC machine each instruction also needs to encode its register operands (with the only exception of stack machines like the Inmos T400/T800), so I don't see a fundamental difference in speed or ease of decoding.
As noted above, 1 byte for each operand is a lot, and what about operations like xor r0, r4, [r12 + r5]? Stuff like that would appear to be valid for a non-load\store 3-operand ISA (and very well may be), but it's an absolute NIGHTMARE to encode. A super-complex (e.g. x86) instruction encoding scheme isn't inherently wrong\broken, but I don't see the point. It's not terribly expensive to decode even x86 instructions on desktop chips, because a TDP of 100 watts is acceptable, so you can solve a lot of problems by just throwing more and more logic at it. But IMO it just increases the complexity of the design (which increases verification costs significantly) for the benefit of being able to use fewer instructions. The problem is that fewer instructions don't mean a faster program.
Yoda wrote:That's not so either. Let's compare.
To execute a load/store, a RISC machine also requires microcode for decoding a full instruction. In my ISA the analogous microcode decodes operands as if they were instructions. As I said, decoding an operand in my ISA is very similar to executing a RISC load/store instruction. On the other hand, a RISC machine finds it difficult to forecast which instruction will follow your operation, and that may hurt pipelining. In my ISA the CPU always knows the type of the following instructions (load/store), the size of the data, and the number of instructions, and that may help organize the pipeline more effectively.
I don't think you know exactly what microcode is. If an instruction takes two uops, that means there are two full-blown instructions in the backend of your pipeline. The only difference is that they both must be committed\retired together to handle an exception\branch\etc. and maintain program semantics. Most instructions shouldn't require microcode; they should just have a 1:1 translation to the microarch's "pipeline instruction format". If every single one of your operands required microcode, that means a 3-operand opcode would take at least 4 uops (3 uops for the operands, 1 for the actual operation). A typical instruction needs only 1 uop, so at the very least you've decreased your instruction throughput to 25% (because you can only commit so many uops per cycle).

There are also numerous other issues, but since I doubt that a proper microarch would ever require 1 uop per operand, I won't ramble.

Yoda wrote:As is well known, the most frequently used instructions are different kinds of data moves. In my ISA there are both 2-operand and 3-operand versions of many ops like XOR, ADD, etc. The 3-operand form costs very little extra to encode, but it replaces two sequential instructions - the XOR and the MOV of the result - and increases execution speed on such sequences. You don't need to use the 3-op form if the 2-op form is enough.
3 operands require at least a 33% increase in bits to encode. That's far from "very little".
Yoda wrote:As for register pushing/popping, loops, function calls and other stuff, I think a good approach would be to organize a special kind of cache for the stack. It is not implemented in my VM, because the VM reflects only the logic of how the CPU works, but if I write a Verilog description I would add an extremely fast, small stack cache which doesn't spill data out to memory and the second-level cache until a cache overflow occurs, and which doesn't write data to memory below the stack pointer. So, for example, if you push a value onto the stack and the callee function processes it and returns to the caller, the pushed value will just be discarded after the return without any actual memory I/O. The VM may help to estimate the performance gain from implementing such a cache.
I thought about doing the very same thing, but the problem is cache coherency. The stack lives in main memory, so you'd either have to throw away cache coherency for the stack, or you'd just end up with another L1 data cache.
In order to make this work:

Code:

push somereg
mov anotherreg, [stack pointer]
You'd have to check both the L1 data cache and your stack cache. If you try to get clever and detect that the code is dereferencing the stack pointer, the code could easily just move the stack pointer to another register before dereferencing it (e.g. via a stack frame). Creating a separate call stack that only works for push\pops would wreak havoc on compilers that create a stack frame and assume pushed parameters are on that same frame. Most OSs would balk at the idea of two stacks as well.

At the end of the day, you could probably just double the size of the entire L1 data cache for a bit more than the price of a dedicated cache for the stack. Since the stack is probably always in L1 anyway, and the cost of an L1 hit is as low as it can get, I bagged the idea of a dedicated stack cache.
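For what it's worth, the aliasing problem shows up in completely ordinary C - the compiler keeps locals in the stack frame and then reads them back through a plain pointer, so the same bytes would have to be visible through both the stack cache and the normal data cache:

Code:

int sum(const int *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += p[i];          /* ordinary loads through a pointer... */
    return total;
}

int caller(void)
{
    int locals[4] = { 1, 2, 3, 4 };   /* ...of data that lives in the caller's stack frame */
    return sum(locals, 4);
}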
Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: OSDev's dream CPU

Post by Rudster816 »

Brendan wrote:I disagree. Any CPU designer that isn't willing to throw a few extra transistors at instruction decode to improve things like instruction fetch and caching is just plain lazy. I don't want to use 3 instructions and an extra register just to do something simple like "add [somewhere],rax". In a similar way, I wouldn't want to turn an atomic operation like "lock add [somewhere],rax" into a messy/bloated "compare and swap until you're lucky" loop.
As a CPU designer, I don't care about how many instructions it takes. If those three instructions take just as long as one, who cares? The very rare case that you want to do some atomic operation like that isn't remotely worth a major design change.

There's nothing wrong with allowing memory operands in most ops, but again, I don't think the benefits outweigh the costs.
Brendan wrote:If you remove things just because they aren't strictly necessary, you're going to end up sitting naked on the floor with a sharp stick scratching text into your skin (because paper wasn't strictly necessary). ;)
You can't just come up with some clever oversimplification of a statement and think that adds to the discussion, can you?
Brendan wrote:For CISC with register renaming 16 registers are probably overkill. If you're doing the load/store thing with simple instructions and a simple pipeline then you might want several hundred registers in the hope of failing less badly. :)
Register renaming doesn't prevent stack spillages, and doesn't come free.
Brendan wrote:Most code (that can't be done with SIMD) is either integer only, mostly integer with a few registers for floating point, or mostly floating point with a few registers used for integers. The overhead of saving/restoring extra registers during context switches isn't worth it, so if you insist on separate registers for floating point then forget about floating point completely (and just use SIMD instead).
Context switches don't always mean that you have to save the state in memory. ARM has separate register banks for kernel\user modes, which are switched automatically at probably little to no extra cost (other than the cost of duplicating the registers).

When you overlap integer\FPU registers you run into significant microarchitectural challenges. With separate register files you don't need to connect the integer file to the FPU execution units. If they are the same register file, you do have to connect them, and with clock speeds for desktop chips topping out at 4GHz, floor planning is a significant concern. You'll want SIMD registers anyway, and aliasing all three types (Int\FP\SIMD) would be a waste (extra bits for SIMD to store Int\FP values), so why not just alias the SIMD\FPU registers?
Brendan wrote:1.5 MiB was extremely high in 1982. In 1992 it was still very high. In 2002 it was just high. In 2012, I've got 12 GiB of RAM and couldn't care less. Will you be releasing the first version of my CPU before I have 64 GiB of RAM (and a hover-car)?
Why make something cost more than it should? Just because I have 12GB of RAM doesn't mean I want to waste a bunch of it on paging structures. You're also cornering yourself into an environment where RAM is plentiful (PCs and servers). Even just one step down (tablets\smartphones), 1.5MB looks much, much bigger. It also means that each of those structures requires 512KB of contiguous physical memory, which might create some issues for OSs because of physical memory fragmentation.

I think something like 64KB would be more reasonable. You could implement something like a "Big paging structure" bit to make larger structures possible as well.
Brendan wrote:I didn't say that physical addresses would be 64-bit (I assumed an "implementation defined physical address size" like 80x86). It only costs 16 bits more per TLB entry than CPUs were using a decade ago.

The benefit is that I'll never need to make architectural changes when 48-bit (or whatever you think is today's lucky number) starts becoming a problem (e.g. probably within the next 10 years for large servers). Basically, it's future-proof.
x64 virtual addresses aren't capped at 48 bits at the architectural level, so your own argument for physical addresses satisfies mine for non-64-bit virtual addresses. Canonical-form addresses are a perfect solution to the future-proofing problem. There is no reason for current CPUs to support, at an extremely high cost, something that won't be needed for 10 years. VAs greater than 48 bits won't make sense on 99.99% of machines for quite some time, and the cost of raising the limit later is low. It's not as if it's 1980 and we have to decide whether to store the year as two BCD digits or use a more future-proof scheme.


IMO, you're looking more at the programming side of an ISA, and not at everything else. There is a lot more that goes into each design decision than just "does it make programming better?" or "is it easier to use?". You have to make sacrifices at all levels in order to meet your goals at each individual level. We can sit in our chairs and dream about our perfect ISAs, but the people who actually have to turn those dreams into reality have limitations that we aren't even aware of.