use of system-wide bytecode

Posted: Tue Jun 02, 2009 2:23 pm
by NickJohnson
I was thinking about making a bytecode virtual machine that would be used throughout my OS's base system. The original idea was to make something that could be used for platform-independent self-modifying code, mostly within a specific library. But I realized that it would be really useful if I could use it as a compiled extension language for a text editor. Then I also realized that if I made a compiler that went from the bytecode to machine code, I could also use that as a C compiler back end: the front end would compile directly to bytecode, satisfying both the "compiled extension" and "platform-independent machine code" roles. It would be easy to convert the front end's resulting parse tree to bytecode because the bytecode is completely stack-based.
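To illustrate why a stack-based target makes the parse tree conversion easy, here is a minimal sketch: a post-order walk emits both operands and then the operator. The opcode names, the `struct node` layout, and the `emit` function are all hypothetical, since no instruction set has been specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_HALT = 0, OP_PUSH, OP_ADD, OP_MUL };

/* A tiny expression tree. op == OP_PUSH marks a constant leaf;
   OP_ADD and OP_MUL mark inner nodes with two children. */
struct node {
    int op;
    int32_t value;                 /* used only by leaves */
    const struct node *lhs, *rhs;  /* used only by inner nodes */
};

/* Post-order walk: emit the left operand, the right operand, then the
   operator. Returns the new write offset into out. */
static size_t emit(const struct node *n, int32_t *out, size_t pos, size_t cap)
{
    if (n->op == OP_PUSH) {
        assert(pos + 2 <= cap);
        out[pos++] = OP_PUSH;
        out[pos++] = n->value;
        return pos;
    }
    pos = emit(n->lhs, out, pos, cap);
    pos = emit(n->rhs, out, pos, cap);
    assert(pos + 1 <= cap);
    out[pos++] = n->op;
    return pos;
}
```

For the tree of `2 + 3 * 4` this produces PUSH 2, PUSH 3, PUSH 4, MUL, ADD. No register allocation or temporaries are needed, which is the appeal of a stack machine as a front-end target.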

Does this sound like a good idea? Are there any issues with doing something like this that I haven't foreseen? I've written quite a few simple bytecode VMs, so the VM itself shouldn't be a problem. I also do not want to do any sort of JIT... everything would be straight interpreted or compiled to machine code before linking.
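For reference, the "straight interpreted" case can be sketched as a plain fetch-and-dispatch loop. This is only an illustration with a hypothetical four-opcode instruction set and a hypothetical `vm_run` function, not the actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_HALT = 0, OP_PUSH, OP_ADD, OP_MUL };

#define STACK_MAX 64

/* Interprets code until OP_HALT and returns the value on top of the stack. */
static int32_t vm_run(const int32_t *code)
{
    int32_t stack[STACK_MAX];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction word */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:                    /* operand follows the opcode */
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:                     /* pop two values, push the sum */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:                     /* pop two values, push the product */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

Running it on the program PUSH 2, PUSH 3, PUSH 4, MUL, ADD, HALT evaluates `2 + 3 * 4`.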

I also want to be able to link bytecode to machine code using a normal ELF linker. Could I define function symbols within the bytecode that would prompt the C compiler to use a different calling method (non-__cdecl) that actually invokes the interpreter on them? (I'm going to be writing the compiler eventually anyway, so maybe I can add support.) Can I just use the normal data segment for loading the bytecode as well?

Re: use of system-wide bytecode

Posted: Tue Jun 02, 2009 6:37 pm
by NickJohnson
Interestingly, this seems to be a design not too dissimilar to Windows' .NET CLI system. In comparison, it would be orders of magnitude smaller, require only a small amount of runtime support, and have complete interoperability with machine code, but the concept is the same. It doesn't create any sort of implied API though; the VM works directly with the libc. It's funny - I had no idea that's how .NET worked, and being an avid Linux user, I wouldn't have thought to look there for ideas.

Edit: perhaps an analogy: my bytecode is to MS CIL as assembly is to Java, if both were portable. I could probably get by writing the VM in < 250 SLOC.

Re: use of system-wide bytecode

Posted: Tue Jun 02, 2009 7:33 pm
by JohnnyTheDon
It sounds like you want to use bytecode only at compile and link time. In this case you may want to take a look at the LLVM project (not necessarily to use it, but as an example of a working system that uses this method). If you're not doing any JIT, the platform-independent part is going to suffer because all your final executables will be in native machine code. Your best bet is to compile and link to your bytecode format, and then have a final stage that compiles/links to native code on the user's computer (not necessarily JIT, it could be done at installation).

As for using a normal linker, you are much better off writing your own. This allows you to do link-time optimization (again, see LLVM). Also, when you're linking, the C compiler is totally out of the picture, and I can't see any real advantage to being able to mix bytecode and native code in the same link operation. You can just link all the bytecode with your own linker, turn this into a native binary, and then link it with any native code you need.

BTW, both .NET and Java are bytecode VMs that use JIT. They are significantly different, but the concept is the same.

Re: use of system-wide bytecode

Posted: Wed Jun 03, 2009 2:37 pm
by NickJohnson
JohnnyTheDon wrote: If you're not doing any JIT, the platform-independent part is going to suffer because all your final executables will be in native machine code. Your best bet is to compile and link to your bytecode format, and then have a final stage that compiles/links to native code on the user's computer (not necessarily JIT, it could be done at installation).
I'm actually planning to have sections of the executable still as bytecode once everything is linked, interpreted by a VM implemented in the shared library. This is more for dynamic loading purposes, so you can compile a bunch of extensions to bytecode-based dynamic libraries, which can be used on any system. Before final linking, the bytecode can be compiled to machine code, but it never has to be compiled at any time, even at runtime. I could even distribute the entire base system as pure bytecode, and let the user compile everything to machine code later at their option, while having all of it still work initially, albeit slowly.

I don't think it would be that hard to link things properly. I could have the linker produce a set of machine code stubs (when creating the executable), which are what the ELF symbols would actually refer to, and which simply invoke the VM on the specified section of bytecode. There would be a bit of latency on switches from machine code to bytecode and vice versa, but it would probably be insignificant.
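Roughly what one of those stubs would do, written out by hand in C rather than generated by the linker. Everything here is an assumption made for illustration: the opcodes, the `vm_call` entry point, and the `mul_plus_one` example function. The point is only that the exported symbol is an ordinary native function that packs its arguments and hands a block of bytecode to the interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_RET = 0, OP_PUSH, OP_ARG, OP_ADD, OP_MUL };

#define STACK_MAX 32

/* Hypothetical VM entry point: interprets one bytecode function.
   args holds the native caller's arguments. */
static int32_t vm_call(const int32_t *code, const int32_t *args, size_t nargs)
{
    int32_t stack[STACK_MAX];
    size_t sp = 0, pc = 0;

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:                       /* push an immediate */
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc++];
            break;
        case OP_ARG: {                      /* push argument number idx */
            int32_t idx = code[pc++];
            assert(idx >= 0 && (size_t)idx < nargs && sp < STACK_MAX);
            stack[sp++] = args[idx];
            break;
        }
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_RET:                        /* return the top of the stack */
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}

/* Bytecode for f(x, y) = x * y + 1. In the scheme described above this
   array would live in a section of the linked executable. */
static const int32_t mul_plus_one_code[] = {
    OP_ARG, 0, OP_ARG, 1, OP_MUL, OP_PUSH, 1, OP_ADD, OP_RET
};

/* The stub: a normal C function with a normal symbol that native code
   can call without knowing the body is interpreted. */
int32_t mul_plus_one(int32_t x, int32_t y)
{
    const int32_t args[2] = { x, y };
    return vm_call(mul_plus_one_code, args, 2);
}
```

Native callers just call `mul_plus_one(3, 4)` like any other function. The switching cost is the argument copy plus the interpreter setup on each call.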

I looked at the LLVM project too - it seems pretty similar to what I had in mind. However, it seems to lack the native linking features that are integral to my design. It's also written in C++, and since I intend to write my own compiler, I'm not planning on using anything C++ for a while, so it would be hard to bootstrap the standard implementation on my (planned) system. I'll look more closely at the actual LLVM specification though; maybe it will be adaptable to my purposes.