Re: True cross-platform development
Posted: Thu Apr 13, 2017 4:18 am
Hi,
SpyderTL wrote:
And since the x86 and ARM processors, AFAIK, can't copy from one immediate memory address to another immediate memory address in a single instruction, and since that's pretty much the only thing an OICS processor can do, there are a few scenarios where an OISC processor may even be faster per clock cycle.

dozniak wrote:
This means _each_ OISC instruction performs _two_ memory accesses _each_ time it is run. This is _extremely_ slow and probably beats all other performance considerations.

At the lowest level I'd count it as 4 memory accesses (instruction fetch, reading 2 values, then writing the result of the subtraction), which will cripple performance due to cache bandwidth limitations. At a slightly higher level (e.g. MMU) I'd count it as 3 memory accesses, which will cripple performance if any kind of "secure multi-process" is attempted.
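To make that counting concrete, here's a minimal sketch of one SUBLEQ step in C (everything here is made up for illustration - the machine is assumed to be a flat array of 64-bit words, word-addressed, with "ip" counted in words):

Code:
#include <stdint.h>

/* One SUBLEQ step: "subleq a, b, c" means mem[b] -= mem[a], then jump
 * to c if the result is <= 0, otherwise fall through to the next
 * instruction (3 words further on). */
static uint64_t step(int64_t *mem, uint64_t ip)
{
    int64_t a = mem[ip];                /* instruction fetch:        */
    int64_t b = mem[ip + 1];            /* all three operand words   */
    int64_t c = mem[ip + 2];

    int64_t result = mem[b] - mem[a];   /* 2 data reads              */
    mem[b] = result;                    /* 1 data write              */

    /* Every instruction ends with this test, so every instruction is
     * a potential branch. */
    return (result <= 0) ? (uint64_t)c : ip + 3;
}

The three operand words are what gets lumped together as "instruction fetch" above; the subtraction itself adds the other two reads and the write.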
It also means that every instruction needs three addresses (address of each operand plus address to jump to), which makes the instructions huge (24 bytes with 64-bit addresses) and cripples performance due to instruction fetch consuming too much bandwidth from cache (that's already being pounded to oblivion by 4 memory accesses per instruction).
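Spelled out, a flat 64-bit encoding ends up looking something like this (a sketch; the struct and field names are made up, and there's no opcode field because there's only one instruction):

Code:
#include <assert.h>
#include <stdint.h>

/* Sketch of a flat 64-bit SUBLEQ instruction: three absolute addresses. */
struct subleq_insn {
    uint64_t a;   /* address of the first operand                    */
    uint64_t b;   /* address of the second operand (and destination) */
    uint64_t c;   /* branch target used when the result is <= 0      */
};

/* 24 bytes per instruction, versus 1 to 15 bytes for x86 and 4 bytes
 * for a typical fixed-width RISC encoding. */
static_assert(sizeof(struct subleq_insn) == 24, "3 x 8-byte addresses");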
The lack of things like "indexed addressing modes" means that you have to rely on self modifying code for extremely basic things (e.g. "x = array[y];"). This will cripple performance because you don't get much benefit from splitting cache into "L1 instruction cache" and "L1 data cache". It will also cripple performance by ruining speculative execution. It will also cripple performance by making higher-level things (e.g. memory mapped executables, multiple CPUs running the same code, etc) impossible or impractical.
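Here's a sketch of what that workaround looks like on the emulated machine from the earlier sketch (all the names are made up, and on real hardware each numbered step is itself a chain of SUBLEQ instructions):

Code:
#include <stdint.h>

/* "x = array[y]" with no indexed addressing: compute the element's
 * address, then patch it into the operand field of a later
 * instruction before that instruction executes. */
void patch_indexed_load(int64_t *mem, uint64_t array_base,
                        uint64_t y_addr, uint64_t load_insn)
{
    /* 1. Compute the element's address (array_base + y). Even this
     *    addition is several SUBLEQ instructions, because there is
     *    no "add", only subtract. */
    int64_t elem_addr = (int64_t)array_base + mem[y_addr];

    /* 2. Self-modify: write that address into the first operand slot
     *    of the upcoming "load" instruction (word index load_insn).
     *    This is a store into the instruction stream, which is what
     *    defeats split L1 caches and speculative execution. */
    mem[load_insn] = elem_addr;

    /* 3. When the patched instruction runs, it subtracts
     *    mem[elem_addr] from a zeroed "x", giving -array[y]; another
     *    subtraction fixes the sign. */
}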
"Lack of registers" cripples performance by making various tricks (register renaming) impractical. Combined with self modifying code, it also means that the CPU can't effectively determine what an instruction depends on, which cripples performance by ruining "out-of-order execution".
"Every instruction is a potential branch" means that the CPU has to update the instruction pointer and compare that against the branch target just to determine if it's a real branch or not; then (if it is a real branch) there's no way to do static branch prediction effectively (e.g. you can't have simple rules, like "backward branches predicted taken, forward branches predicted not taken"). This means a heavy reliance on branch target buffers for branch prediction, and will cripple performance due to branch mispredictions ruining instruction pre-fetch. A practical CPU will spend almost all of its time stalled (waiting for data from RAM or cache) and doing nothing.
On top of all of this; there's "one instruction". This means that things that are extremely trivial in digital electronics (e.g. bitwise operations like AND, OR and XOR, shifting, etc) become many slow instructions, and more complex things (multiplication, division, modulo) become a massive disaster; which all cripples performance. It also means that there's no SIMD, which cripples performance for anything that involves processing a lot of data (e.g. "pixel pounding").
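To give a rough idea of the cost, here's what "multiply" decomposes into when subtract-and-branch-if-<=0 is the only primitive (a C sketch that assumes b >= 0; every line of the loop body is at least one SUBLEQ instruction, and the loop runs b times):

Code:
#include <stdint.h>

/* Multiplication by repeated addition, where "addition" is itself two
 * subtractions (x + y == x - (0 - y)), and the only test available is
 * "is it <= 0?". */
static int64_t subleq_style_multiply(int64_t a, int64_t b)
{
    int64_t result = 0;

    while (!(b <= 0)) {               /* the only test SUBLEQ has */
        int64_t neg_a = 0 - a;        /* first subtraction        */
        result = result - neg_a;      /* second subtraction: += a */
        b = b - 1;                    /* third subtraction        */
    }
    return result;
}

A shift-and-add version avoids looping b times, but shifts don't exist either (see the next sketch), so it still costs a chain of subtractions and branches per bit.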
Finally; there's "only one data type" (e.g. no support for anything that isn't a 64-bit integer). For smaller (8-bit, 16-bit, 32-bit) integers programmers will mostly just use 64-bit integers where they can, which will cripple performance by wasting a huge amount of RAM and exacerbating the already severe "cache bandwidth limitations" problem. When programmers can't do this (file formats, network packets, etc) they'll have to resort to "shifting and masking", where both shifting and masking are extremely slow, which will cripple performance. For floating point; welcome to a realm of nightmares, please kiss any hope of acceptable "FLOPS" goodbye.
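To give a feel for how slow "shifting and masking" gets, here's what a plain logical right shift turns into when there's no shifter and no AND (a C sketch; the power-of-two values would come from a constant table in real SUBLEQ code, and each loop iteration is several SUBLEQ instructions):

Code:
#include <stdint.h>

/* Logical right shift built from subtract-and-compare only: peel bits
 * off from the top using powers of two, then rebuild the shifted
 * result. 64 iterations to extract one field from a 64-bit word. */
static uint64_t shift_right(uint64_t x, unsigned amount)
{
    uint64_t result = 0;

    for (int bit = 63; bit >= 0; bit--) {
        uint64_t power = 1ULL << bit;   /* constant table entry in SUBLEQ */

        if (x >= power) {               /* really "subtract and test"     */
            x -= power;                 /* clear that bit                 */
            if (bit >= (int)amount)
                result += 1ULL << (bit - amount);   /* another table entry */
        }
    }
    return result;
}

Masking is the same bit-peeling game: extracting an 8-bit field just means peeling all the bits and only keeping the ones you want.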
For all of these reasons; if someone like Intel spent billions of dollars trying to make the fastest SUBLEQ CPU possible, they wouldn't even be able to beat an 80486 (from almost 3 decades ago) with ten times the power consumption (and a hundred times the price).
All of the above should've been obvious to anyone actually interested in CPU design. None of this was obvious to Geri, but that's fine (being clueless is perfectly natural). However, people who aren't clueless pointed out almost all of these problems 4 years ago and Geri was too stupid to listen to anyone; and since then Geri would have had to discover half of these massive performance problems first-hand, many times over, during the last ~4 years (including "Oh, alpha blending is far too slow because it involves multiplication" recently), and now Geri is ignoring evidence that he created himself.
Mostly; multiple severe and unsolvable performance disasters are a massive problem, but they are nothing compared to Geri's progression from "clueless" to "ignorant" to "delusional". When you add reckless incompetence as a software developer on top of that, you get a recipe for decades of pure pointless failure.
Cheers,
Brendan