OSDev.org

Posted: **Wed Mar 15, 2017 1:35 am**

SpyderTL wrote:No registers, no arithmetic, just call methods.

If you only can call methods, and those methods can only call other methods, the end result will be a deep call chain doing exactly nothing. What is the point of this?

Posted: **Wed Mar 15, 2017 3:21 am**

dozniak wrote:
SpyderTL wrote:No registers, no arithmetic, just call methods.
If you only can call methods, and those methods can only call other methods, the end result will be a deep call chain doing exactly nothing. What is the point of this?

I'm guessing the lowest-level methods (e.g. "add(a, b)") are implemented in a runtime environment to use the machine's hardware.

Posted: **Wed Mar 15, 2017 10:28 am**

Hi,

SpyderTL wrote:The difference between a VM and a VCPU is that in the latter case, only the CPU is virtualized, but the rest of the system is directly accessed using a nearly direct memory map. So, you can still access things like the PCI bus and IDE controllers, but do it using a single code base that contains "drivers" for all devices regardless of the platform.

Typically; device drivers act as an abstraction that hide differences between different devices from the rest of the OS; then you have "upper layers" (file systems and VFS, sockets, etc) that hide the devices themselves from the rest of the OS. Applications are sitting on top of multiple layers of abstractions, where your differences between VM and VCPU no longer exist.

The real problem isn't hardware differences, it's OS differences. It's things like DirectX vs. OpenGL, BSD sockets vs. Winsocks, etc. These differences are either abstracted by languages (including standard libraries) or they aren't (and you end up with "non-portable" where the programmer has to use ugly hacks, like "#ifdef ...").

SpyderTL wrote:So, my first question is, how would performance compare between a VM solution and a VCPU solution?

Performance would be identical because VM and VCPU are effectively identical.

SpyderTL wrote:I assume that the CPU emulation loop would slow things down by 5 to 10x just due to the size of the loop itself, and then the simplicity of the VCPU using a single instruction would slow things down by another 10x, due to the complexity of the code required to do simple operations. So let's just call it 100x slower than pure assembler. This sounds bad, but the flip side is that no other hardware would need to be virtualized. A typical vm must execute hundreds of instructions every time a virtual device is accessed to emulate the behavior of a real device.

At some point you end up translative some kind of portable byte code (e.g. the VCPU's instructions) into native code. You can do this one instruction at a time (a pure interpreter, with a "CPU emulation loop" like you described) and it will suck (about 100 times slower). You can enhance this with JIT techniques, where you cache the "previously translated to native" code to avoid continually converting the same thing, and it will suck a lot less. You can cache "previously translated to native" code on disk so that it can be re-used next time the executable is executed. You can also "pre-translate to native" some or all of the code (e.g. compile some or all of it to native when the executable is installed). More importantly; you can do any mixture of all of these - compile what you can to native when the executable is installed, then translate anything that wasn't already translated while the executable is running (including caching none or more of the resulting native code).

Mostly; "when translation to native happens" is a relatively irrelevant implementation detail that only effects performance and the complexity of the implementation. What actually matters is the design of the portable byte code (e.g. the VCPU's instructions) and the design of the source language; because some features (e.g. allowing "self modifying byte-code") ruin an implementation's ability to cache the native code and therefore ruin "max. performance that an implementation can achieve".

onlyonemac wrote:
dozniak wrote:
SpyderTL wrote:No registers, no arithmetic, just call methods.
If you only can call methods, and those methods can only call other methods, the end result will be a deep call chain doing exactly nothing. What is the point of this?
I'm guessing the lowest-level methods (e.g. "add(a, b)") are implemented in a runtime environment to use the machine's hardware.

In that case, it will be a massive performance disaster because it will be impossible for any implementation to optimise - e.g. you can throw $123 million dollars at it (hiring extremely talented/experienced "JIT compiler developers", etc) and nobody will ever be able to do anything to make it better than "unusable due to poor performance".

Note that every single time someone has attempted anything like this (where they sacrifice performance for no sane/practical reason) they have eventually been forced to provide big fat libraries (so that things like arrays, hash tables, etc. can be shifted to native code to avoid the language's performance problem), and then still forced to provide some way of using native code to bypass the remainder of the performance problem (things like JNI in Java; C language bindings/interfaces in Python, Lua; etc).

Cheers,

Brendan

Posted: **Wed Mar 15, 2017 10:38 pm**

I agree that performance is a big problem in any situation where native code is not being executed by a physical machine. But I'm trying to compare a virtual machine, where the CPU and all hardware is emulated (or virtualized), versus only emulating the CPU, and accessing the physical hardware more-or-less directly.

The virtual CPU could be a "lowest common denominator" type processor that can only do a few simple things that are guaranteed to be available natively on 99% of devices. Or, taken to the extreme, it could be a single instruction set processor that would completely give up all performance for the ability to run on any device, even the simplest battery powered toy.

Just to advocate for a second, the single instruction set CPU does have a few other advantages that I think are worth mentioning. First off, although emulating a OISC processor using a typical PC would have dramatic performance problems, building a physical OISC processor could, in fact, perform at an extremely high clock speed. I'm basing this entirely on the emergence and evolution of ASIC processors designed specifically for bitcoin mining, which are essentially single instruction set processors that calculate hash codes at extremely fast speeds -- up to 1000x faster than a modern CPU for roughly the same price.

The other big advantage would be that true compile-once, run-anywhere would finally be a reality. You could literally run the same OS and same applications on your server, your laptop, your router, your TV, your phone, your watch, and your toothbrush. The only difference would be the drivers that would be loaded into memory at run time.

And I think the most important aspect of any new technology is the ability to try it out before you buy it, which is certainly possible in this case, since running a virtual machine that only executes one instruction is something that virtually everyone on this site could do in a few hours.

Maybe I'm wrong, but I definitely can see the potential for this type of "technology" to become the next "evolution" of computers in our lifetime.

I feel like I've gotten a bit off topic on my own post here...

Posted: **Thu Mar 16, 2017 12:45 am**

Hi,

SpyderTL wrote:I agree that performance is a big problem in any situation where native code is not being executed by a physical machine. But I'm trying to compare a virtual machine, where the CPU and all hardware is emulated (or virtualized), versus only emulating the CPU, and accessing the physical hardware more-or-less directly.

You have to have some kind of device driver (possibly including "fake" devices, like "/dev/loopback" or whatever), and you have to have "none or more" layers on top of that (to allow multiple applications to share the same device if/where necessary), and you have to have some kind of software interface that applications can use (to access the device via. its driver or the layers on top). If you want, you can pretend that the software interface is a "virtual emulated device", however this is just meaningless word games - it changes nothing, it's still some kind of software interface.

SpyderTL wrote:The virtual CPU could be a "lowest common denominator" type processor that can only do a few simple things that are guaranteed to be available natively on 99% of devices. Or, taken to the extreme, it could be a single instruction set processor that would completely give up all performance for the ability to run on any device, even the simplest battery powered toy.

Just to advocate for a second, the single instruction set CPU does have a few other advantages that I think are worth mentioning. First off, although emulating a OISC processor using a typical PC would have dramatic performance problems, building a physical OISC processor could, in fact, perform at an extremely high clock speed. I'm basing this entirely on the emergence and evolution of ASIC processors designed specifically for bitcoin mining, which are essentially single instruction set processors that calculate hash codes at extremely fast speeds -- up to 1000x faster than a modern CPU for roughly the same price.

It probably costs about $5000000+ (in design, prototypes, validation, fab setup costs, etc) to produce a chip that is even slightly competitive. That cost has to be amortised - if you sell 5000000 chips you add $1 to the price of each chip to recover the cost, and if you sell 10 chips you add $500000 to the price of each chip to recover the cost. For OISC, you will never find enough people willing to use it (even if you give the chips away for free) and will never be able to get the price down to anything close to commercially viable.

Yes, you might (as a hypothetical fantasy) be able to achieve "4 instructions per cycle at 10 giga-cycles per second" (or 40 billion instructions per second). This sounds "nice" until you realise that to get any work done software will need thousands of times more instructions and that it's still slower than an 50 MHz ARM CPU you could've bought for $1.

If you don't believe me, show me the "one instruction code" that does the equivalent of a normal CPU's floating point addition instruction (scalar, not SIMD). Before showing me this code, think about the number of branches you couldn't avoid and the performance problems that branch mis-predictions cause.

SpyderTL wrote:The other big advantage would be that true compile-once, run-anywhere would finally be a reality. You could literally run the same OS and same applications on your server, your laptop, your router, your TV, your phone, your watch, and your toothbrush. The only difference would be the drivers that would be loaded into memory at run time.

Something that isn't a massive performance disaster (CIL, LLVM, Java byte-code, ...) is also capable of true "compile-once, run-anywhere".

SpyderTL wrote:And I think the most important aspect of any new technology is the ability to try it out before you buy it, which is certainly possible in this case, since running a virtual machine that only executes one instruction is something that virtually everyone on this site could do in a few hours.

Maybe I'm wrong, but I definitely can see the potential for this type of "technology" to become the next "evolution" of computers in our lifetime.

A virtual machine using "pure interpreted" for OISC would be very easy to write; and a virtual machine for OISC that is capable of getting performance that is better than "10000 times slower than Java" would take decades of work (if and only if that's actually possible in practice).

Cheers,

Brendan

Posted: **Thu Mar 16, 2017 9:01 am**

SpyderTL wrote:First off, although emulating a OISC processor using a typical PC would have dramatic performance problems, building a physical OISC processor could, in fact, perform at an extremely high clock speed. I'm basing this entirely on the emergence and evolution of ASIC processors designed specifically for bitcoin mining, which are essentially single instruction set processors that calculate hash codes at extremely fast speeds -- up to 1000x faster than a modern CPU for roughly the same price.

There's a huge difference between a focused ASIC that's designed for exactly one thing, and a general purpose OISC that has to do everything. Not only is there the massively higher number of instructions required, but I also suspect you can't really do any optimization on them the way a modern CPU does- say byebye to useful pipelining, superscalar execution, etc.

Posted: **Thu Mar 16, 2017 1:55 pm**

Rusky wrote:There's a huge difference between a focused ASIC that's designed for exactly one thing, and a general purpose OISC that has to do everything. Not only is there the massively higher number of instructions required, but I also suspect you can't really do any optimization on them the way a modern CPU does- say byebye to useful pipelining, superscalar execution, etc.

That's true.

But, I'm sure that someone said the same thing about Parallel Ports vs. Serial Ports back in the 80's. Sure, parallel ports are faster than serial ports at the same clock speed.

But, ultimately, serial ports surpassed the performance and flexibility of parallel ports, because they were based on a simpler mechanism that could be easily scaled up in speed, which allowed moving more of the communication logic to the protocol layer. This is why we now have 10 Gbps USB ports, while parallel ports are all but forgotten.

Of course, predicting the future of technology is virtually impossible, so I'm not going to try too hard to convince anyone.

But I do want credit if it turns out that OISC processors are the next big thing in 5-10 years.

Posted: **Thu Mar 16, 2017 2:37 pm**

Hi,

SpyderTL wrote:
Rusky wrote:There's a huge difference between a focused ASIC that's designed for exactly one thing, and a general purpose OISC that has to do everything. Not only is there the massively higher number of instructions required, but I also suspect you can't really do any optimization on them the way a modern CPU does- say byebye to useful pipelining, superscalar execution, etc.
That's true.

But, I'm sure that someone said the same thing about Parallel Ports vs. Serial Ports back in the 80's. Sure, parallel ports are faster than serial ports at the same clock speed.

But, ultimately, serial ports surpassed the performance and flexibility of parallel ports, because they were based on a simpler mechanism that could be easily scaled up in speed, which allowed moving more of the communication logic to the protocol layer. This is why we now have 10 Gbps USB ports, while parallel ports are all but forgotten.

Of course, predicting the future of technology is virtually impossible, so I'm not going to try too hard to convince anyone.

But I do want credit if it turns out that OISC processors are the next big thing in 5-10 years.

To predict the future you extrapolate from "known facts" from the past. The facts are:

Clock frequency (Hz) on its own is an extremely poor indicator of performance because instruction per cycle (IPC) varies
"Hz * IPC" is an extremely poor indicator of performance because the number of instructions needed to do the same amount of work ("IPW") varies
"Hz * IPC / IPW" is a poor indicator of performance because it only tells you a theoretical maximum that is almost never achievable in practice, because CPUs tend to stall (dependencies between instructions, loads/stores that miss caches, branch mispredictions, etc) and utilisation varies
"Hz * IPC / IPW * utilisation" is a reasonable indicator of performance
For modern CPUs, the power consumption is a major limiting factor
For silicon/transistors, the relationship between power consumption and clock frequency is non-linear/exponential (e.g. doubling the frequency can require four times as much power consumption)
To improve performance (without melting the chip) you have to do more in parallel; which means increasing IPC, reducing IPW and/or increasing utilisation.
OISC decreases IPC, increases IPW and decreases utilisation for multiple reasons (IPW is the worst possible by definition, utilisation is significantly worse due to "many branches", IPC suffers severely due to additional dependencies and worse utilisation)
OISC is "everything you can possibly do to make performance worse"

Cheers,

Brendan

Posted: **Thu Mar 16, 2017 2:44 pm**

Brendan wrote:OISC decreases IPC, increases IPW and decreases utilisation for multiple reasons (IPW is the worst possible by definition, utilisation is significantly worse due to "many branches", IPC suffers severely due to additional dependencies and worse utilisation)
OISC is "everything you can possibly do to make performance worse"

What about the fact that one instruction takes between 2 and 20 clock cycles on a typical CPU (ignoring external bus factors), whereas theoretically a OISC processor could run between 1 and 2 clock cycles (again ignoring bus factors)? That might level the playing field a bit, given several years and processor generations.

That's kind of a rhetorical question, but I wanted to just throw that out there.

Posted: **Thu Mar 16, 2017 3:37 pm**

Hi,

SpyderTL wrote:
Brendan wrote:OISC decreases IPC, increases IPW and decreases utilisation for multiple reasons (IPW is the worst possible by definition, utilisation is significantly worse due to "many branches", IPC suffers severely due to additional dependencies and worse utilisation)
OISC is "everything you can possibly do to make performance worse"
What about the fact that one instruction takes between 2 and 20 clock cycles on a typical CPU (ignoring external bus factors), whereas theoretically a OISC processor could run between 1 and 2 clock cycles (again ignoring bus factors)? That might level the playing field a bit, given several years and processor generations.

That's kind of a rhetorical question, but I wanted to just throw that out there.

What do you think is faster; one instruction that takes 20 clock cycles on a 3 GHz CPU (where that instruction takes 6.666 nanoseconds); or 1000 instructions that do exactly the same thing and take 1 cycle each on a 10 GHz CPU (where all those instructions take 100 nanoseconds)?

If you think "15 times slower to get the exact same work done" is an improvement, then...

Cheers,

Brendan

Posted: **Thu Mar 16, 2017 10:49 pm**

1000 instructions may be a bit of an exaggeration. The better comparison would be the average number of OISC instructions per CISC instruction. FPU, MMX, SIMD, etc. instructions may take thousands of OISC instructions, but probably 60-80% of a typical application is probably MOV instructions. And since the x86 and ARM processors, AFAIK, can't copy from one immediate memory address to another immediate memory address in a single instruction, and since that's pretty much the only thing an OICS processor can do, there are a few scenarios where an OISC processor may even be faster per clock cycle.

Another thing that occurs to me is that if OISC processing becomes popular, and people start looking for ways to improve performance, you may start seeing things like pattern matching algorithms where the CPU will detect common patterns that make up commonly used instruction sequences, like adding two numbers together, and actually implement custom logic to perform that operation in 2-3 cycles instead of 10-20.

I'm going to do a little hacking and see what I can come up with. There is an online Subleq interpreter here: http://mazonka.com/subleq/online/hsqjs.cgi I'm going to try to add some logic to my custom XML compiler to output both text numeric instructions, and binary instructions and see if I can get a simple terminal program running on two different platforms using the same code. Then I may try running some benchmarks and see if can determine how much performance is lost due to emulation.

Another idea, while I'm at it, could be a common pattern for executing native code from within OISC code. Something like loading the target physical address in one location, writing parameter values to another location, and the return address to a third location, which would then exit out of the virtual CPU and call the native method.

It's interesting how, like the serial/parallel port analogy, removing complexity from a system actually, in a way, increases flexibility.

EDIT: I found this paper discussing performance between a multicore Subleq FPGA system vs. a modern CPU. https://arxiv.org/abs/1106.2593

Posted: **Wed Apr 12, 2017 1:19 pm**

Okay, sit-rep:

I finally got a chance to sit down and try to take some simple input/output code that I had written for the Subleq online page above, that compiles to a text file with ASCII numbers in it, to a new binary format that I was going to attempt to run on a physical machine with a simple virtual CPU loop.

I had thought about this issue, before, but now that I'm trying to implement it, now I need to really decide how this needs to work before I can proceed.

Normally, when you write subleq code, your environment consists of an array of numbers, each number taking up one address. Then, each instruction is made up of 3 addresses -- subtract address 1 from address 2 and branch to address 3 if the result is <= 0.

My idea was to try to run this code, natively, on an x86 machine by putting the CPU in a tight loop that just did the above instruction logic over and over. The problem now is the addresses.

My first attempt was going to be something simple, like incrementing video memory so that a character would change on the screen. The problem is that I have to pick, in advance, the "size" of the memory addresses.

I was going to start with 32-bit, but now my problem is a) how do I just increment the first byte, if every address is four bytes, and b) is video memory at 0xb8000, or at (0xb8000 >> 2)?

Do I pre-multiply the addresses to match the target machine, or do I divide all of the addresses at run-time? I can see problems with both approaches, but run-time seems to be a little cleaner, but probably noticeably slower.

What do you guys think? I know at least one of you have probably already dealt with similar issues in their OS, so any feedback would be helpful.

And I already realize how ridiculous this whole idea is, so if you were going to just point that out, don't bother.

Thanks, guys.

Posted: **Wed Apr 12, 2017 1:36 pm**

I don't see why you would want to write and executable for the c64.

Posted: **Wed Apr 12, 2017 1:57 pm**

I consider the c64 and the x86 to represent the two extremes of the "run on any platform" criteria. In short, if I can write a Hello World application, compile it, and have the compiled code run on both platforms, I would consider that a successful "run on any platform" achievement.

The closest that I've found so far is this: https://www.youtube.com/watch?v=wEQYSrAMM5c

https://www.mikekohn.net/micro/c64_java.php

I played around with it for a few days, and found that it's not quite what I would call "usable", yet. You have to tweak a few things every time you build to get it to work, and some Java methods just plain aren't supported.

So, I'm attempting to essentially do the same thing, just with a simpler approach. (Specifically, a virtual CPU with one instruction.)

If/when I get to the point that I've got a working shell console running on both platforms with the same compiled code, then I feel like I will be able to determine whether this idea is truly feasible, and is worth developing further, or whether the compile once and run anywhere idea is just a complete waste of time.

Posted: **Thu Apr 13, 2017 1:50 am**

SpyderTL wrote:And since the x86 and ARM processors, AFAIK, can't copy from one immediate memory address to another immediate memory address in a single instruction, and since that's pretty much the only thing an OICS processor can do, there are a few scenarios where an OISC processor may even be faster per clock cycle.

This means _each_ OISC instruction performs _two_ memory accesses _each_ time it is run. This is _extremely_ slow and probably beats all other performance considerations.

OSDev.org

True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development

Re: True cross-platform development