OSDev.org

Posted: **Sat May 05, 2012 7:59 pm**

Hi,

Rudster816 wrote:
Brendan wrote:If you remove things just because they aren't strictly necessary, you're going to end up sitting naked on the floor with a sharp stick scratching text into your skin (because paper wasn't strictly necessary).
You can't just come up with some clever oversimplification of a statement and think that adds to the discussion, can you?

I did, which is proof that I can.

If you remove everything that isn't strictly necessary, what are you really going to be left with? I've used CPUs with only 3 general purpose registers, no caches, no floating point, no SIMD and no MMU, so obviously none of that is necessary.

Are predicates for every instruction necessary? Probably not, but does that prevent me from wanting it (even if it's only because I like the idea and 80x86 doesn't support it)?

Rudster816 wrote:
Brendan wrote:For CISC with register renaming 16 registers are probably overkill. If you're doing the load/store thing with simple instructions and a simple pipeline then you might want several hundred registers in the hope of failing less badly.
Register renaming doesn't prevent stack spillages, and doesn't come free.

Not sure where I've claimed that register renaming prevents stack spillages or is free...

Rudster816 wrote:
Brendan wrote:Most code (that can't be done with SIMD) is either integer only, mostly integer with a few registers for floating point, or mostly floating point with a few registers used for integers. The overhead of saving/restoring extra registers during context switches isn't worth it, so if you insist on separate registers for floating point then forget about floating point completely (and just use SIMD instead).
Context switches don't always mean that you have to save the state in memory. ARM has separate register banks for Kernel\User modes, which are switched automatically at probably little to no extra cost (other than the cost of duplicating the registers).

That wouldn't help for task switches. It would get in the way for a kernel API (where I like passing values in registers). It might help a little for IRQs, but that depends on the kernel's design.

Rudster816 wrote:When you overlap Integer\FPU registers you run into significant microarchitectural challenges. With separate register files you don't need to connect the integer file to the FPU execution units. If they are the same register file, you do have to connect them, and with clock speeds for desktop chips topping out at 4 GHz, floor planning is a significant concern. You'll want SIMD registers anyway, and aliasing all three types (Int\FP\SIMD) would be a waste (extra bits for SIMD to store Int\FP values), so why not just alias SIMD\FPU registers?

You think there's separate integer and FPU execution units, and the FPU execution unit isn't used for things like integer multiply/divide? How about conversions between integer and floating point? Don't forget SIMD is both integer and floating point too.

I'm also not sure how aliasing SIMD and FPU registers would make sense (assuming SIMD can do scalar floating point anyway).

Rudster816 wrote:
Brendan wrote:1.5 MiB was extremely high in 1982. In 1992 it was still very high. In 2002 it was just high. In 2012, I've got 12 GiB of RAM and couldn't care less. Will you be releasing the first version of my CPU before I have 64 GiB of RAM (and a hover-car)?
Why make something cost more than it should?

You're right - to avoid making TLB miss more expensive than it should be, maybe we should have even larger pages, page tables and page directories, so that we can remove the whole page directory pointer table layer too. That'd reduce TLB miss costs by about 50%.

Rudster816 wrote:Just because I have 12GB of RAM, doesn't mean I want to waste a bunch of it on paging structures. You're also cornering yourself into an environment where RAM is plentiful (PC's and Servers). Even just one step down (Tablets\Smart Phones), 1.5MB is looking much much bigger. It also means that each of those structures requires 512KB of contiguous physical memory, which might create some issues with OS's because of physical memory fragmentation.

I really don't care about tablets/smart phones (they're just disposable toys to me), although most smartphones will have 1 GiB of RAM soon and 1.5 MiB would only be 0.15% of RAM anyway.

Maybe I could increase page size to reduce the size of other structures. How about 256 KiB pages, 256 KiB page tables, 256 KiB page directories and 256 KiB page directory pointer tables; with only 2 "CR3" registers?
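As a rough check of how many address bits such a layout would cover, here is a small sketch of the arithmetic. It assumes 8-byte table entries (as in 80x86 long mode; the post does not say) and treats the two "CR3" registers as one extra address bit that selects the root. The name `translated_bits` is made up for this example.

```python
def translated_bits(page_size, table_size, levels, entry_size=8, roots=1):
    """Virtual address bits covered by a radix-tree paging scheme."""
    offset_bits = (page_size - 1).bit_length()   # bits for the byte within a page
    entries = table_size // entry_size           # entries per table (assumed 8-byte entries)
    index_bits = (entries - 1).bit_length()      # bits consumed by each table level
    root_bits = (roots - 1).bit_length()         # bits choosing between root registers
    return offset_bits + levels * index_bits + root_bits

KIB = 1024

# 80x86 long mode: 4 KiB pages, four levels of 4 KiB tables, one CR3.
x86_64_bits = translated_bits(4 * KIB, 4 * KIB, levels=4)

# Proposal: 256 KiB pages, three levels of 256 KiB tables, two "CR3" registers.
proposal_bits = translated_bits(256 * KIB, 256 * KIB, levels=3, roots=2)

print(x86_64_bits, proposal_bits)   # 48 64
```

Under those assumptions the proposal covers a full 64-bit virtual address with a three-level walk, compared with four levels for 48 bits in long mode.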

Rudster816 wrote:
Brendan wrote:I didn't say that physical addresses would be 64-bit (I assumed an "implementation defined physical address size" like 80x86). It only costs 16-bits more per TLB entry than CPUs were using a decade ago.

The benefit is that I'll never need to make architectural changes when 48-bit (or whatever you think is today's lucky number) starts becoming a problem (e.g. probably within the next 10 years for large servers). Basically, it's future-proof.
x64 virtual addresses aren't capped at 48 bits at the architectural level, so your own argument for physical addresses satisfies mine for non-64-bit virtual addresses. Canonical form addresses are a perfect solution to the future-proofing problem.

When Intel decides to increase virtual address space size they're going to have to redesign paging. I just hope they don't slap a "PML5" on top of the existing mess to get 57-bit virtual addresses (and make TLB miss costs even worse).
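For readers unfamiliar with canonical form: on x86-64, bits 63 down to the top implemented bit must all be equal. A minimal sketch of that rule follows; `is_canonical` and its `va_bits` parameter are names invented for this example, and 57 is simply the width mentioned above.

```python
def is_canonical(addr, va_bits=48):
    """True if bit (va_bits - 1) and every bit above it in a 64-bit address are equal."""
    top = addr >> (va_bits - 1)                  # the sign bit and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1     # value of `top` when all those bits are set
    return top == 0 or top == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))               # True
print(is_canonical(0x0000800000000000))               # False
print(is_canonical(0xFFFF800000000000))               # True
print(is_canonical(0x0000800000000000, va_bits=57))   # True
```

The last line shows why widening the implemented range is compatible: addresses that were non-canonical at 48 bits simply become valid, and nothing that was valid changes meaning.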

Rudster816 wrote:There is no reason for current CPUs to support something that won't be needed for 10 years at an extremely high cost. VAs greater than 48 bits won't make sense on 99.99% of machines for quite some time, and the cost to up it is low. It's not like it's 1980 and we have to decide if we want to store the year as two BCD digits or use a more future-proof scheme.

There's no reason for a CPU that won't exist for 10 years to support something that won't be needed for 10 years? Somehow I suspect there's a flaw in your logic.

Cheers,

Brendan

Posted: **Sun May 06, 2012 2:07 am**

Rudster816 wrote:
berkus wrote:
rdos wrote:The 32-bit x86 architecture actually is one of the most well-thought out there has ever been. By far surpassing RISC and all the other designs.
You're so full of biased bullsh i t, rdos.
+1. Protected mode was the chance for Intel to remove a lot of stuff that no longer made any sense, or just bag 8086 compatibility. They chose the senseless route and opted for an approach that pretty much kept all the baggage from the 8086 and made it worse.

Protected mode is called that for a reason: it extended the existing 8086 architecture to implement segment-based protection and page-based protection. This was a really brilliant design decision that, apart from a few things (hardware task-switching isn't multicore safe), is still fully functional 2-3 decades later.

The bad thing is that portability has made segmentation obsolete because software from non-segmented environments could not easily be ported, and because most compilers didn't handle segmentation well. In fact, the big mistake was the 286 chip, which could only handle 16-bit protected mode, which was the environment where segmentation was really problematic. Apart from OS/2, there really was no OS that exploited 32-bit segmentation that has none of the drawbacks of 16-bit segmentation.

Posted: **Sun May 06, 2012 2:15 am**

Rudster816 wrote:
Brendan wrote:I disagree. Any CPU designer that isn't willing to throw a few extra transistors at instruction decode to improve things like instruction fetch and caching is just plain lazy. I don't want to use 3 instructions and an extra register just to do something simple like "add [somewhere],rax". In a similar way, I wouldn't want to turn an atomic operation like "lock add [somewhere],rax" into a messy/bloated "compare and swap until you're lucky" loop.
As a CPU designer, I don't care about how many instructions it takes. If those three instructions take just as long as one, who cares? The very rare case that you want to do some atomic operation like that isn't remotely worth a major design change.

There's nothing wrong with allowing memory operands in most ops, but again, I don't think the benefits outweigh the costs.

I think Brendan is right here. As a CPU designer you need to consider the upper limit of the CPU clock. The RISC architecture was fine as long as the clock frequency of CPUs gradually increased, but as we have hit the upper limit of the clock frequency, fast CPUs should be able to do as much as possible in a single clock cycle. More complex instructions do more in a single clock cycle, which means that RISC is out and CISC is in.

Additionally, as the best way to increase performance is multithreading and multicore, it is a given that efficient synchronization primitives and atomic operations are very important to provide. The more the CPU can help parallelize software, the faster it will execute.

To connect to my other post, this also means that if software validations can be performed by the CPU, rather than by software, this should make applications execute faster. That does mean that properly implemented segmentation, which removes lots of validations in software, should execute faster, not slower. I doubt this would happen, but if I had Intel's hardware designers under me I'd give them orders to first fix the broken 64-bit long mode to provide 32-bit descriptor tables and fully working segmentation. Then I'd give them orders to optimize segmentation with caches, just like they already did with paging. By caching a few descriptors you could handle most segment register loads with no memory accesses, which could be as fast as a register-register move. This is partly what the sysenter/syscall interface does: it uses MSRs that have fixed segment values. It wouldn't have cost much to implement a cache instead, which in a flat environment would only need 5 cached descriptors to never have to access the descriptor tables. The same could have been done with the IDT. In fact, the complete IDT could have been loaded into the CPU core, and then interrupt response times would have been a fraction of what they are today.

Posted: **Sun May 06, 2012 2:46 am**

rdos wrote:
Rudster816 wrote:
Brendan wrote:I disagree. Any CPU designer that isn't willing to throw a few extra transistors at instruction decode to improve things like instruction fetch and caching is just plain lazy. I don't want to use 3 instructions and an extra register just to do something simple like "add [somewhere],rax". In a similar way, I wouldn't want to turn an atomic operation like "lock add [somewhere],rax" into a messy/bloated "compare and swap until you're lucky" loop.
As a CPU designer, I don't care about how many instructions it takes. If those three instructions take just as long as one, who cares? The very rare case that you want to do some atomic operation like that isn't remotely worth a major design change.

There's nothing wrong with allowing memory operands in most ops, but again, I don't think the benefits outweigh the costs.
I think Brendan is right here. As a CPU designer you need to consider the upper limit of the CPU clock. The RISC architecture was fine as long as the clock frequency of CPUs gradually increased, but as we have hit the upper limit of the clock frequency, fast CPUs should be able to do as much as possible in a single clock cycle. More complex instructions do more in a single clock cycle, which means that RISC is out and CISC is in.

Additionally, as the best way to increase performance is multithreading and multicore, it is a given that efficient synchronization primitives and atomic operations are very important to provide. The more the CPU can help parallelize software, the faster it will execute.

That statement is just flat out wrong.

I also disagree that we've definitely hit the frequency barrier. It wasn't too long ago that CPUs were topping out at 2.67 GHz (Intel C2D E6700). Both AMD's and Intel's latest chips are flirting with (and in the case of AMD, at) the 4 GHz mark. That's over a 40% increase after we thought we hit the barrier with the P4s (which topped out at 3.6 GHz). Smaller fab processes will allow higher frequencies even past what we're at today IMO, but I wouldn't be shocked to not see large-scale increases (> 15-20%) in the next 3-5 years either.

Posted: **Sun May 06, 2012 2:54 am**

Rudster816 wrote:That statement is just flat out wrong.

I also disagree that we've definitely hit the frequency barrier. It wasn't too long ago that CPUs were topping out at 2.67 GHz (Intel C2D E6700). Both AMD's and Intel's latest chips are flirting with (and in the case of AMD, at) the 4 GHz mark. That's over a 40% increase after we thought we hit the barrier with the P4s (which topped out at 3.6 GHz). Smaller fab processes will allow higher frequencies even past what we're at today IMO, but I wouldn't be shocked to not see large-scale increases (> 15-20%) in the next 3-5 years either.

I agree that we might see small increases in clock frequency (probably at the expense of power consumption), but that is not enough for the RISC processors to live on. If most operations on a RISC processor take two instructions, while they take one on CISC, CISC will win big. When RISC was most popular they could use considerably higher clock frequencies than the CISC alternatives, which made them attractive. This is no longer the case.

Posted: **Sun May 06, 2012 4:28 am**

rdos wrote:
Rudster816 wrote:That statement is just flat out wrong.

I also disagree that we've definitely hit the frequency barrier. It wasn't too long ago that CPUs were topping out at 2.67 GHz (Intel C2D E6700). Both AMD's and Intel's latest chips are flirting with (and in the case of AMD, at) the 4 GHz mark. That's over a 40% increase after we thought we hit the barrier with the P4s (which topped out at 3.6 GHz). Smaller fab processes will allow higher frequencies even past what we're at today IMO, but I wouldn't be shocked to not see large-scale increases (> 15-20%) in the next 3-5 years either.
I agree that we might see small increases in clock frequency (probably at the expense of power consumption), but that is not enough for the RISC processors to live on. If most operations on a RISC processor take two instructions, while they take one on CISC, CISC will win big. When RISC was most popular they could use considerably higher clock frequencies than the CISC alternatives, which made them attractive. This is no longer the case.

Lol, you obviously fail to see how complex modern processors are. They don't execute one instruction every cycle, or anything remotely close to that. There isn't even a set time that any given instruction takes because there are dozens of factors that affect instruction latency\throughput.

The number of assembly instructions it takes to implement algorithm A on architecture X tells you absolutely nothing about how fast it will run compared to architecture Y, which takes twice as many instructions.

Posted: **Sun May 06, 2012 6:12 am**

Rudster816 wrote:
rdos wrote:
Rudster816 wrote:That statement is just flat out wrong.

I also disagree that we've definitely hit the frequency barrier. It wasn't too long ago that CPUs were topping out at 2.67 GHz (Intel C2D E6700). Both AMD's and Intel's latest chips are flirting with (and in the case of AMD, at) the 4 GHz mark. That's over a 40% increase after we thought we hit the barrier with the P4s (which topped out at 3.6 GHz). Smaller fab processes will allow higher frequencies even past what we're at today IMO, but I wouldn't be shocked to not see large-scale increases (> 15-20%) in the next 3-5 years either.
I agree that we might see small increases in clock frequency (probably at the expense of power consumption), but that is not enough for the RISC processors to live on. If most operations on a RISC processor take two instructions, while they take one on CISC, CISC will win big. When RISC was most popular they could use considerably higher clock frequencies than the CISC alternatives, which made them attractive. This is no longer the case.
Lol, you obviously fail to see how complex modern processors are. They don't execute one instruction every cycle, or anything remotely close to that. There isn't even a set time that any given instruction takes because there are dozens of factors that affect instruction latency\throughput.

The number of assembly instructions it takes to implement algorithm A on architecture X tells you absolutely nothing about how fast it will run compared to architecture Y, which takes twice as many instructions.

You are comparing CISC with CISC here. The original idea of RISC was to provide simple instructions that could reduce the complexity of the execution unit and thereby clock it at much higher frequencies than the more complex CISC unit. This idea is no longer valid, as the RISC unit cannot be clocked at much higher frequencies than the CISC.

And for complex single instructions versus simple multiple instructions, the execution unit is free to rearrange and perform the complex instruction in any way it finds suitable, whereas multiple simple instructions must do what they set out to do (for instance, do unnecessary updates to the register file for temporary registers).

Posted: **Sun May 06, 2012 6:50 am**

The CISC vs RISC debate isn't valid anymore; it belongs in the 90s. If you look at modern CPUs you will see that they have borrowed aspects of both designs. As for VLIW, it isn't really about complex instructions, as the whole idea is to bundle together instructions that can be executed in parallel.

Posted: **Sun May 06, 2012 12:35 pm**

Brendan wrote:If you remove things just because they aren't strictly necessary, you're going to end up sitting naked on the floor with a sharp stick scratching text into your skin (because paper wasn't strictly necessary).

Pfft! A sharp stick? You got fingernails, don't you?

Posted: **Mon May 07, 2012 3:47 am**

I wonder if modern x86 chips are actually two or more isolated CPUs, corresponding to ancient real mode and (IA32+32e).
The CPU boots with the "real mode" circuit; upon switching to pmode, execution never reaches the "real mode part" again (until switching back).

I believe there was the Transmeta Crusoe. Although it did not do what you are saying, your comment reminds me of it. Essentially they were 128-bit CPUs with a completely different architecture which JITed x86 instructions into their own code. They have been discontinued though. I doubt Intel/AMD/VIA/Cyrix suffer from so much brain damage as to put in 2 separate circuits.

Posted: **Tue May 08, 2012 12:52 am**

ACcurrent wrote:I believe there was the Transmeta Crusoe. Although it did not do what you are saying, your comment reminds me of it. Essentially they were 128-bit CPUs with a completely different architecture which JITed x86 instructions into their own code. They have been discontinued though. I doubt Intel/AMD/VIA/Cyrix suffer from so much brain damage as to put in 2 separate circuits.

What brain damage? AMD started recently with CPU+GPU on-die, and ARM licensees have been doing that for a looooong time now: multiple (mixed) cores, DSPs, GPU, and all the other circuitry including kitchen sink.

Posted: **Tue May 08, 2012 5:32 am**

I meant 2 separate circuits, one for 16 bit and one for 32 bit. Don't mind having more than one cpu with multiple functions that are accessible at all times.

Posted: **Thu May 10, 2012 9:41 am**

Brendan wrote:If you remove things just because they aren't strictly necessary, you're going to end up sitting naked on the floor with a sharp stick scratching text into your skin (because paper wasn't strictly necessary).

I can even cut out all instructions but one, for example, SUBLEQ.

Rudster816 wrote:That means you'll never have a 2 byte instruction for register to register operations.

Not a great loss. It will be won back on the next load/store sequence.

Rudster816 wrote:As noted above 1 byte for each operand is a lot, and what about operations like xor r0, r4, [r12 + r5]? Stuff like that would appear to be valid for a non load\store 3 operand ISA (and very well may be), but it's an absolute NIGHTMARE to encode.

Why?? Why do you think so?
Instructions like xor r0, r4, [r12 + r5] will encode VERY easily. They generate the following sequence of four independent code places:
- XOR3 opcode;
- Operand r0;
- Operand r4;
- Operand [r12+r5];
The leading "XOR3" knows nothing about how the following operands are encoded, and the operand encoding knows nothing about the operation the operands are used for.
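To make the "independent code places" idea concrete, here is a small decoder sketch. The byte values are entirely hypothetical (the thread never specifies an encoding): opcode numbers 0x01/0x02, one byte per register operand, and a marker byte 0x80 followed by two register bytes for a [base+index] memory operand.

```python
OPCODES = {0x01: ("XOR3", 3), 0x02: ("XOR2", 2)}   # hypothetical opcode numbers
MEM_BASE_INDEX = 0x80                               # hypothetical memory-operand marker

def decode_operand(code, pos):
    """Decode one operand; it knows nothing about the opcode that uses it."""
    first = code[pos]
    if first == MEM_BASE_INDEX:
        return f"[r{code[pos + 1]}+r{code[pos + 2]}]", pos + 3
    return f"r{first}", pos + 1

def decode(code, pos=0):
    """Decode one instruction; the opcode table only stores an operand count."""
    name, count = OPCODES[code[pos]]
    pos += 1
    operands = []
    for _ in range(count):
        text, pos = decode_operand(code, pos)
        operands.append(text)
    return f"{name} " + ", ".join(operands), pos

text, length = decode(bytes([0x01, 0, 4, 0x80, 12, 5]))
print(text, length)   # XOR3 r0, r4, [r12+r5] 6
```

The decoder is indeed modular, but note that in this made-up encoding the instruction comes out six bytes long, which is the code-density cost the other side of the argument is worried about.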

Rudster816 wrote:It's not relatively expensive to decode even x86 instructions on desktop chips because a TDP of 100 watts is acceptable, so you can solve a lot of problems by just throwing more and more logic at it. But IMO it just increases the complexity of the design (which increases verification costs significantly)

As you can see, even with a straightforward approach, the design of such a CISC ISA is no more complicated than the design of a RISC with roughly the same performance.

Rudster816 wrote:I don't think you know exactly what microcode is. If an instruction takes two uops, that means there are two full blown instructions in the backend of your pipeline. The only difference is that they both must be committed\retired to handle an exception\branch\etc to maintain program semantics.

I see the problem of comprehension is that you treat each CISC instruction as unbreakable into separate modular uniform parts. But in the sample above I showed that they have such modularity and that the execution of each part is much easier than the execution of the whole.

Rudster816 wrote:Most instructions shouldn't require microcode, they should just have a 1:1 translation to the microarch's "pipeline instruction format". If every single one of your operands required microcode, that means a 3 operand opcode would take at least 4 uops (3 uops for operands, 1 for the actual operation). A typical instruction needs only 1 uop, so at the very least, you've decreased your instruction throughput to 25% (because you can only commit X amount of uops per cycle).

The same is true for RISC on memory operations. For register operations in CISC it is a matter of not very complex prefetch and decoding optimization.

Rudster816 wrote:3 operands require at least a 33% increase in bits to encode. That's far from "very little".

That's just a change of one code place (XOR2) to another (XOR3). If the total amount of different opcodes is not greater than 256, it doesn't increase the amount of bits to encode at all.

Rudster816 wrote:I thought about doing the very same thing, but the problem is cache coherency. The stack is represented as main memory, so you'd either have to throw away cache coherency for the stack, or you'll just have another L1 data cache.

Actually there are no problems with cache coherency. Look at the following sequence:
- The data is pushed onto the stack. As in a common cache, the cache line is marked as dirty but the memory write operation is postponed. Cache coherency works fine: if another core tries to access the dirty line, it will be fed to that core from the cache entry.
- The data is popped from the stack. The cache line's dirty bit is simply cleared, without a memory writeback operation! Whatever junk is in memory at that time doesn't matter, because in a classic architecture there may be unpredictable junk there (below the stack pointer!) too.
The actual problems may arise in specially designed bad code, like taking a pointer to the pushed data and then accessing that memory after popping, but that's definitely wrong.
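A toy model can show when clearing the dirty bit on pop actually saves memory writes. Everything here is an assumption made for illustration: a 16-line fully associative write-back cache with LRU eviction, one stack slot per line, and a workload where unrelated reads push the stack lines out between bursts. The `ToyCache` and `run` names are invented.

```python
from collections import OrderedDict

class ToyCache:
    """Fully associative write-back cache with LRU eviction (toy model)."""

    def __init__(self, lines, discard_on_pop):
        self.lines = lines
        self.discard_on_pop = discard_on_pop
        self.dirty = OrderedDict()        # line address -> dirty flag, oldest first
        self.writebacks = 0

    def _touch(self, addr, dirty):
        was_dirty = self.dirty.pop(addr, False)
        if len(self.dirty) >= self.lines:
            _, evicted_dirty = self.dirty.popitem(last=False)
            self.writebacks += evicted_dirty      # only dirty lines cost a memory write
        self.dirty[addr] = was_dirty or dirty     # re-insert as most recently used

    def push(self, addr):
        self._touch(addr, dirty=True)

    def read(self, addr):
        self._touch(addr, dirty=False)

    def pop(self, addr):
        self._touch(addr, dirty=False)
        if self.discard_on_pop:
            self.dirty[addr] = False              # popped data is dead: drop the dirty bit

def run(discard_on_pop, rounds=100):
    cache = ToyCache(lines=16, discard_on_pop=discard_on_pop)
    for _ in range(rounds):
        for addr in range(8):                     # push 8 stack lines
            cache.push(addr)
        for addr in reversed(range(8)):           # pop them again
            cache.pop(addr)
        for addr in range(1000, 1016):            # unrelated reads evict the stack lines
            cache.read(addr)
    return cache.writebacks

print(run(discard_on_pop=False), run(discard_on_pop=True))   # 800 0
```

In this model the saving comes entirely from eviction pressure: if the unrelated reads are removed, the stack lines never leave the cache and both variants report zero write-backs.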

Rudster816 wrote:At the end of the day, you could probably just double the size of the entire L1 data cache for a bit more than the price of a dedicated cache for the stack.

The performance hit is not about the size of the cache. The problem is that stack data is written to memory even after being dropped and freed, and those writes will occur regardless of cache size.

Posted: **Thu May 10, 2012 11:29 am**

Memory writes only occur on every store if you're using an antiquated write through cache.

Your little stack cache scheme is just another cache, and since L1 caches (by their very design) are as fast as possible, it provides no improvement. It offers basically no benefits over a standard writeback L1 cache, adds another cache unit to check for every data access\snoop, and most importantly it is completely unnatural: just because you pop something off the stack, it's never written to memory. I've never seen (or want to see) an ISA enforce the principle that anything below the stack pointer can never be treated as anything but garbage. It just assumes that software is well behaved, and that is probably the most ill-advised\dangerous assumption that could possibly be made in computer science. As far as performance goes for clearing the dirty bit, it's pretty much a dud because it would only ever reduce the amount of memory transfers. Seeing as it has been very well documented that switching from single to dual channel memory (effectively doubling memory bandwidth) offers virtually no improvement for the vast majority of applications (1-2% last I checked), I can't imagine that you would actually gain any type of performance advantage.

The very nature of a standard machine instruction is the fact that you can't break it up. It's the lowest common denominator. You can take that instruction and move it anywhere, and its meaning won't change, etc. If you treat your operands as separate instructions, this very fundamental concept is broken. Just because you conceptualize that each operand is its own entity, that doesn't change the definition of an instruction.
E.g.

Code:

Address: 
0x10010: XOR3
0x10011: Operand1
0x10012: Operand2
0x10013: Operand3
0x10014: NEXT INSTRUCTION

In every sense, one would say that the instruction is 4 bytes in length, not four 1-byte instructions. Whether or not you think so is unimportant, because in order to maintain the exact same definition, you must look at all 4 bytes. If I were to change the byte at address 0x10011, the operation would still be XOR3, but the instruction would change.

Code density might not matter in the sense of how much space it takes up in main memory or on the HDD, but it still matters quite a lot. You can only execute as many instructions as you can fetch per cycle. This might not sound like a problem, but when you consider that high-end superscalars can begin to execute 4+ instructions every cycle, it matters a lot. It also determines how much code can fit in the caches, which is very important. The above example doesn't quite show it, but an ISA like you're describing is full of a lot of air unless you define 256 architectural registers. I define 32, and I can still do 32 reg->reg operations in two bytes (two 5-bit registers, one 5-bit opcode, one escape bit). This is a 33% improvement for a major portion (if not a majority) of instructions, which is huge. Condition codes will also introduce a huge problem. If you define them all in the first byte you'll suck up 12-16 opcodes for every conditional instruction you want to support. If you add another byte, you'll waste half of it on nothing.
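The two-byte figure can be checked with a quick packing sketch. The field order (escape bit on top, then opcode, destination, source) and the function names are assumptions for this example; the post only gives the field widths.

```python
def pack_reg_reg(opcode, dst, src):
    """Pack escape bit 0, a 5-bit opcode and two 5-bit registers into 16 bits."""
    assert 0 <= opcode < 32 and 0 <= dst < 32 and 0 <= src < 32
    word = (0 << 15) | (opcode << 10) | (dst << 5) | src   # assumed field order
    return word.to_bytes(2, "big")

def unpack_reg_reg(data):
    """Recover (opcode, dst, src) from a packed two-byte instruction."""
    word = int.from_bytes(data, "big")
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F

compact = pack_reg_reg(opcode=3, dst=0, src=4)
byte_per_operand = bytes([3, 0, 4])    # one byte each for the opcode and both operands

saving = 1 - len(compact) / len(byte_per_operand)
print(len(compact), len(byte_per_operand), f"{saving:.0%}")   # 2 3 33%
```

The 33% is simply two bytes versus three for a two-operand register instruction; a three-operand form with a byte per operand would widen the gap further.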

[insert the rest of my thoughts here, as they are tldt (too long didn't type)].

Conclusion: Operands that take at least a byte == Stupid ISA

Posted: **Thu May 10, 2012 3:43 pm**

I remember someone who did development on the Nintendo DS complained about there being two different processors. They had slightly different ISAs too. Essentially he built two binaries - the OS loaded the binaries into memory, and woke the faster CPU up. He then had to manually wake the other and build some sort of message handling system between the two CPUs.

from Wikipedia:
"32 bit ARM946E-S main CPU; 67 MHz clock speed. Processes gameplay mechanisms and video rendering.
32 bit ARM7TDMI coprocessor; 33 MHz clock speed. Processes sound output, Wi-Fi support and takes on second-processor duties in Game Boy Advance mode."

I think it would be interesting to try to build a multitasking OS for something more extreme - like tri-processor setup containing an ARM, a MIPS, and an x86_64 CPU.

OSDev's dream CPU
