
Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 12:40 am
by Colonel Kernel
In this thread I argued that run-time environments such as the JVM and the CLI are the wave of the future. The discussion quickly turned towards the security-specific aspects of the argument, and I realized there was a lot more I wanted to say that wasn't part of that topic... so here it is.

The accusation was made that VMs are effectively redundant and exist because of laziness. I argue that there are a lot of benefits to running apps in a VM, and I will try to list here all the benefits I can think of, as I think of them. Comments and rebuttals are most welcome. :)

Why I like VMs #1: Memory protection
Part of the process of translating bytecode (or IL, or whatever each camp calls it) into machine code can include verifying it for "type safety". In a nutshell, this means no blind access to memory. This means no more memory corruption and all the subtle errors it can cause. Processes with different address spaces were created to deal with this problem, but it was quickly realized that crossing address spaces frequently was too expensive, and it made a lot more sense in many cases to implement certain things as shared libraries. The problem with shared libraries is that they share the address space with the process using them, and the quality of those written by 3rd parties cannot be guaranteed (sadly, software has bugs... we learn to live with it while trying to make things better).
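
Just to make the "no blind access to memory" point concrete, here's a minimal C# sketch (any safe language behaves similarly): the runtime bounds-checks every array access, so an out-of-range write raises an exception instead of silently trampling whatever lives next door.

Code:
using System;

class BoundsDemo
{
    static void Main()
    {
        int[] row = new int[5];
        try
        {
            // In C/C++ this could silently corrupt a neighbouring
            // object; verified code can't express a blind write.
            row[10] = 42;
        }
        catch (IndexOutOfRangeException e)
        {
            Console.WriteLine("Caught: " + e.Message);
        }
    }
}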

The other day I was debugging two database drivers side-by-side in the same test tool, and I noticed that the rows of the table returned by one driver contained lots of garbage. Among that garbage were symbols that I knew were unique to the other driver! Obviously some pretty weird memory issues going on there...

Anyway, the point is, not having to worry about memory smashes when you're writing an application is pretty neat. It eliminates a large class of common errors, and the result is more robust software.

...continued...

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 1:02 am
by Colonel Kernel
Why I like VMs #2: Garbage collection
Somewhat related to #1 is the fact that VMs make garbage collection a lot easier to implement. As someone who has written commercial-grade C++ code the hard way for over six years, I think garbage collection is great. I think smart pointers and deterministic destruction are great too, but I think it's redundant for them to manage plain old memory.

Garbage collection is not a panacea. It can be abused. GC should not be used to collect file handles, database connections, or anything else that really ought to be cleaned up right now. This is why tools such as deterministic destruction exist (cf. IDisposable/using in C#, and destructors in C++/CLI). If your app is time-sensitive (e.g., a graphics-intensive game), then GC can slow you down at an inconvenient moment. However, understanding and good design can usually overcome these problems. I believe that GC solves more problems than it creates -- it has certainly worked out that way for me in practice.
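
For anyone who hasn't seen it, this is what deterministic cleanup looks like in C# (a minimal sketch):

Code:
using System;
using System.IO;

class UsingDemo
{
    static void Main()
    {
        // 'using' guarantees Dispose() runs at the closing brace, so
        // the file handle is released right now; the GC is left to
        // reclaim only the plain memory, on its own schedule.
        using (StreamWriter w = new StreamWriter("log.txt"))
        {
            w.WriteLine("handle released deterministically");
        }
    }
}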

There's a lot to say about GC, but I'll try to keep it short. Some interesting things about GC:
  • If using a compacting GC, memory allocation from the heap is lightning fast (almost as fast as allocating from the stack), and because of this it is a lot less of a bottleneck on SMP systems with lots of threads.
  • GCs such as the one in the .NET CLR can usually run a collection cycle in the amount of time it would take to handle a page fault.
  • Many lock-free algorithms are much easier to implement with garbage collection than without (see the sketch below).
Sorry for the lack of linkage on the last point... a lot of this material is in print (C++ User's Journal, etc.) and isn't freely or widely available yet. Here are at least the point-form slides from one of Andrei Alexandrescu's recent presentations on lock-free algorithms.
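
To give a flavour of that last bullet, here's a minimal sketch (the standard Treiber-stack idea in C#, not Alexandrescu's code): with a tracing GC there's no ABA problem, because a node can never be recycled while another thread still holds a reference to it, so a plain compare-and-swap loop suffices. In C or C++ you'd need hazard pointers or some other reclamation scheme on top.

Code:
using System.Threading;

// Minimal lock-free (Treiber) stack. The GC does the memory
// reclamation that makes the naive CAS loops safe.
class LockFreeStack<T>
{
    class Node
    {
        public T Value;
        public Node Next;
    }

    Node head;

    public void Push(T value)
    {
        Node node = new Node();
        node.Value = value;
        Node old;
        do
        {
            old = head;
            node.Next = old;
        } while (Interlocked.CompareExchange(ref head, node, old) != old);
    }

    public bool TryPop(out T value)
    {
        Node old;
        do
        {
            old = head;
            if (old == null) { value = default(T); return false; }
        } while (Interlocked.CompareExchange(ref head, old.Next, old) != old);
        value = old.Value;
        return true;
    }
}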

More later...

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 2:57 am
by Pype.Clicker
There are indeed several cases where a 'virtual machine' can be of much use. Years before Java, LISP already had garbage collection and automagic memory safety, which made it *the* language of choice for a number of applications.

Yet, like you said, they're no panacea. Some garbage collectors may not discover cycles in unreferenced memory structures, and some of the functions in the VM code (or native libraries) may have flaws that can be exploited from within the machine (did I mention that nasty virus whose code exploited a flaw in a whole generation of virus scanners?).

IMHO, VMs don't help build *better* software... they help build acceptable software *faster* (which is a gift to anyone who has to work to a schedule :) )

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 4:26 am
by AR
Colonel Kernel wrote:Why I like VMs #1: Memory protection
...
I'll concede this point, but it is up to the individual programmer whether or not they want the imposed type safety; C/C++ is more 'free', for lack of a better term.
Colonel Kernel wrote:Why I like VMs #2: Garbage collection
...
Garbage collection is mixed. To put it in a different light: do you prefer a program that doesn't leak by design, or one that has a large safety net to occasionally clean up memory allocations that the programmer didn't bother handling himself, wasting memory until the GC decides to run? [In normal operation, wasting memory like this may not be too severe, but if it's a server, for instance, that suddenly comes under heavy load, then you've got to wait for the GC to cycle through everything and release the memory before the incoming tasks can be handled, and then keep cycling the GC every minute or two as finished tasks need to be collected to make room for new ones; the overall throughput is reduced.]
On a related note, why do Java and .NET feel they need to eat most of the computer's resources? I use Azureus for BitTorrent downloads and it consumes 110 MB; that is 10% of my 1 GB of RAM. I consider that to be extremely bloated, since I have full 3D games that operate with less memory than that. I also wrote an appointment tracker in .NET once; it was extremely small, about 300 lines, but it consumed 20 MB at any given point. Allocating may be fast, but if my system is going to be overloaded with only 10 programs running, then they had better start handing out RAM chips for free.

As Pype said, these tools let you write software faster, which is what RAD is for, but I doubt they are going to replace machine-code languages, at least not unless they can run the bytecode directly on the hardware (I've heard you can get a Java extension card, but that's like being forced to pay for someone else's mistake). I'm wondering how Java handles 3D stuff now; given Java's extreme abstraction, would you even be allowed to use machine-code pixel and vertex shaders...?

A VM adds additional layers to what we already have so you'll end up with: Hardware>Kernel>Drivers>Services>System Libraries>Virtual Machines>Virtual Machine Libraries>Applications
Every tool has a use, but I doubt VMs will be the killer "do everything" tool. Like Visual Basic, they are great for creating a quick solution to a small problem, but also like VB, you wouldn't use them in a large-scale project.

Something that interests me, though: since VMs impose heavy abstraction from the OS, why did Microsoft create .NET to begin with? The only major (double-edged) advantage of VMs that I know of is that what you're running on doesn't matter, so Microsoft has effectively weakened its own hold on the OS market by pushing its developers onto .NET, since .NET apps can run on Mono on Linux. You could even support .NET apps that use Windows APIs by combining Mono with Wine; it should be easier than supporting full machine-code Windows apps. They mentioned that it will enable games to run on consoles (read: Xbox 2) or PC (read: Windows), but their XNA thing should allow C++ code to be built for both targets anyway; the risk to their monopoly brought by allowing unmodified binaries to run on Linux shows that they may have gone insane.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 4:53 am
by Curufir
This is something I was toying with a few months back (as with so many projects, I haven't got around to it yet :)). The big selling points for me were removing the delay of ring transitions, simplifying some aspects of scheduling, and a unified address space. My design was inspired by MUD drivers, some of which have quite sophisticated GC. However, I'm still worried about speed, and about how much of the VM is simply replicating (in software) the available hardware functionality.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 8:04 am
by mystran
A virtual machine is most definitely NOT needed for memory protection!!

I repeat, just in case:

A virtual machine is most definitely NOT needed for memory protection!!

What you need is a safe language, which means you need a language whose semantics guarantee that all memory access goes through safe pointers and all bounds are checked.

Oh, and in case someone really didn't know, you don't need a virtual machine for GC either. In fact, you can use the Boehm GC with plain C too. Just replace malloc() calls with GC_MALLOC() calls, and remove all free() calls. (There are some limitations on how you can play with your pointers, but if your code conforms to a semi-strict interpretation of the ANSI C requirements, then you are more or less safe.)

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 9:36 am
by Colonel Kernel
AR wrote:Garbage collection is mixed. To put it in a different light: do you prefer a program that doesn't leak by design, or one that has a large safety net to occasionally clean up memory allocations that the programmer didn't bother handling himself, wasting memory until the GC decides to run? [In normal operation, wasting memory like this may not be too severe, but if it's a server, for instance, that suddenly comes under heavy load, then you've got to wait for the GC to cycle through everything and release the memory before the incoming tasks can be handled, and then keep cycling the GC every minute or two as finished tasks need to be collected to make room for new ones; the overall throughput is reduced.]
How is that memory wasted if nobody needs it for a while? I've actually implemented a server application in .NET, and it works just fine under heavy load. If you watch the memory usage graph in Task Manager, it looks a bit like a saw blade. Memory usage rises slowly, then drops suddenly when the GC runs.

In terms of new requests coming in, there may be a slight delay (completely imperceptible to end users) if the GC needs to run, but I haven't seen the impact on throughput that you predict. We did optimize our usage of memory somewhat, though -- re-using objects rather than creating new ones all the time in our inner loops. But that's because GC is not a cure for basic stupidity, and I never claimed it was. It has made my life a lot easier though, and I think I hardly qualify as a careless or lazy programmer (perhaps you'll have to take my word for it for now... ;) )
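
The kind of re-use I mean looks roughly like this (a hypothetical buffer pool, not our actual server code): inner loops rent and return the same objects instead of allocating fresh ones, which keeps collections rare under load.

Code:
using System.Collections.Generic;

// Hypothetical pool of fixed-size buffers for request handling.
class BufferPool
{
    readonly Stack<byte[]> free = new Stack<byte[]>();
    readonly int size;

    public BufferPool(int bufferSize) { size = bufferSize; }

    public byte[] Rent()
    {
        lock (free)
        {
            // Re-use a returned buffer if we have one; allocate otherwise.
            return free.Count > 0 ? free.Pop() : new byte[size];
        }
    }

    public void Return(byte[] buffer)
    {
        lock (free) { free.Push(buffer); }
    }
}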

In comparison, I've also maintained C++ code for server components that do very computationally expensive and memory-intensive work, and their use of new and delete can absolutely kill performance in the wrong circumstances. Heap fragmentation causes real pain. Unfortunately, it's one of those cases where there's a huge chunk of legacy code involved, and we can't afford to fix it... :-/
A VM adds additional layers to what we already have so you'll end up with: Hardware>Kernel>Drivers>Services>System Libraries>Virtual Machines>Virtual Machine Libraries>Applications
Not necessarily so. You can eliminate a lot of those layers if all the apps run in the VM. I know that probably offends your sensibilities :) but for some systems it makes a lot of sense.
Every tool has a use but I doubt VMs will be the killer "do everything" tool, they, like Visual Basic, are great for creating a quick solution to a small problem but also like VB, you wouldn't use it in a large scale project.
Nothing is a "do everything tool", but VMs are not just toys or RAD tools. Over the past few years I have completed more than one project to create production-quality software that runs on .NET. I'm sure I'm not alone (maybe on this forum I am, but in the industry at large, I highly doubt it). There are scores of applications out there happily running on J2EE servers...
Something that interests me, though: since VMs impose heavy abstraction from the OS, why did Microsoft create .NET to begin with? The only major (double-edged) advantage of VMs that I know of is that what you're running on doesn't matter, so Microsoft has effectively weakened its own hold on the OS market by pushing its developers onto .NET, since .NET apps can run on Mono on Linux. You could even support .NET apps that use Windows APIs by combining Mono with Wine; it should be easier than supporting full machine-code Windows apps. They mentioned that it will enable games to run on consoles (read: Xbox 2) or PC (read: Windows), but their XNA thing should allow C++ code to be built for both targets anyway; the risk to their monopoly brought by allowing unmodified binaries to run on Linux shows that they may have gone insane.
...much to the delight of those of us committed to supporting our products on Unix as well as Windows. ;D

If they've gone insane, then IMO it's a good kind of insane. :)

To answer your question though, I think it's so that they could port apps from x86 versions of Windows to Windows CE in its various incarnations. Which brings me to reason #3 Why I like VMs: portability. I think this is well understood though, so I won't expand on it for now...

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 9:40 am
by Colonel Kernel
Pype.Clicker wrote:Yet, like you said, they're no panacea. Some garbage collectors may not discover cycles in unreferenced memory structures
How could that be? One of the major goals of GC is to handle precisely that case. Otherwise, we'd all just use reference counting. FWIW, the GC algorithms I'm familiar with handle cycles without any problems.
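
It's easy to see for yourself in C# (any tracing collector should behave the same way):

Code:
using System;

class Node { public Node Other; }

class CycleDemo
{
    static void Main()
    {
        Node a = new Node();
        Node b = new Node();
        a.Other = b;    // a and b now reference each other --
        b.Other = a;    // the cycle a refcounting scheme would leak
        WeakReference w = new WeakReference(a);

        a = null;
        b = null;
        GC.Collect();

        // The collector traces from the roots; nothing reachable
        // points into the cycle any more, so both nodes are gone.
        Console.WriteLine("Cycle collected: " + !w.IsAlive);
    }
}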
IMHO, VMs don't help build *better* software... they help build acceptable software *faster* (which is a gift to anyone who has to work to a schedule :) )
True... reduced development effort is a big plus. :) However, I think that if an app fails gracefully with a useful error message, then it's already better than one that crashes randomly and spews garbage into the user's data.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 9:52 am
by Colonel Kernel
mystran wrote:A virtual machine is most definitely NOT needed for memory protection!!
I didn't claim that it was, just that it allows for more granular and robust memory protection, especially in the presence of shared libraries.
What you need is a safe language, which means you need a language whose semantics guarantee that all memory access goes through safe pointers and all bounds are checked.
Yes! I agree whole-heartedly. However, if you deploy applications compiled from such languages as bare machine code, how do you go about verifying that they haven't been tampered with? And how can you run them on different platforms without re-compiling...? If such languages must be interpreted, then they're not that useful for applications where performance is important.
Oh, and in case someone really didn't know, you don't need a virtual machine for GC either. In fact, you can use the Boehm GC with plain C too. Just replace malloc() calls with GC_MALLOC() calls, and remove all free() calls. (There are some limitations on how you can play with your pointers, but if your code conforms to a semi-strict interpretation of the ANSI C requirements, then you are more or less safe.)
There's a lot of qualifications to that last statement. :)

I didn't say that VMs were required for GC either:
Why I like VMs #2: Garbage collection
Somewhat related to #1 is the fact that VMs make garbage collection a lot easier to implement.
Perhaps we've encountered a problem with terminology. I'm thinking of VMs that include a certain set of features that make all these things easier (not possible, just easier):
  • Safe languages, like you say
  • Intermediate representation of instructions (bytecode, IL, whatever)
  • Metadata that describes all the types referenced by the code to facilitate code verification, GC, reflection, etc. (see the reflection sketch after this list)
  • A JIT compiler (or interpreter, if JIT compilation isn't feasible... e.g. on memory-constrained devices) to make everything run
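
As a small illustration of the metadata point (plain C#, nothing up my sleeve): the same type information that the verifier and the GC rely on is exposed to programs through reflection.

Code:
using System;
using System.Reflection;

class Point { public int X; public double Y; }

class MetadataDemo
{
    static void Main()
    {
        // The VM knows the exact shape of every type from its
        // metadata; reflection is that information made visible.
        foreach (FieldInfo f in typeof(Point).GetFields())
        {
            Console.WriteLine(f.FieldType.Name + " " + f.Name);
        }
    }
}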
Does this make it clearer?

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 10:57 am
by AR
Yes! I agree whole-heartedly. However, if you deploy applications compiled from such languages as bare machine code, how do you go about verifying that they haven't been tampered with? And how can you run them on different platforms without re-compiling...? If such languages must be interpreted, then they're not that useful for applications where performance is important.
You've never heard of code-signing...? Code signatures verify that the program hasn't been tampered with; if the signature is missing or invalid, then it isn't authentic. As for re-compiling, it shouldn't be that difficult to simply run GCC again with different target parameters; the only difference here is the size of the distribution, which is really only significant if it has to be downloaded, and provided it still fits on one CD.
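
For anyone who hasn't seen it, the idea behind a code signature is just a cryptographic signature over the binary. A rough C# sketch (a toy example, not a real signing infrastructure):

Code:
using System;
using System.Security.Cryptography;
using System.Text;

class SignDemo
{
    static void Main()
    {
        byte[] binary = Encoding.ASCII.GetBytes("pretend this is a program");

        // The publisher signs a hash of the binary with their private key.
        RSACryptoServiceProvider rsa = new RSACryptoServiceProvider();
        byte[] sig = rsa.SignData(binary, new SHA1CryptoServiceProvider());

        // Flip one byte and verification fails: tampering is detected.
        binary[0] ^= 1;
        bool ok = rsa.VerifyData(binary, new SHA1CryptoServiceProvider(), sig);
        Console.WriteLine("Authentic: " + ok); // False
    }
}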
I didn't claim that it was, just that it allows for more granular and robust memory protection, especially in the presence of shared libraries.
This depends. You could just remove the global data section and be done with it, or you could provide something similar using a special pipe (write a read command, write the variable's ID number, read the variable back; this way random memory corruption won't damage the library's state).
Intermediate representation of instructions (bytecode, IL, whatever)
I don't see how this changes anything. I could distribute .py files; sure, that's the full source, but it's still "intermediate" in that it isn't machine code. The only difference bytecode makes is the speed at which it is interpreted; other than that there is no real difference, as it is still translated into machine code anyway. By that logic you could also call a normal binary a "pre-interpreted program".
A JIT compiler (or interpreter, if JIT compilation isn't feasible... e.g. on memory-constrained devices) to make everything run
Again, I don't see the necessity. The performance gain from compiling at run time rather than installing an appropriate machine-code build should be minimal, apart from distribution size. Or just distribute the source (which is what you are doing with .NET/Java anyway, since they can be decompiled, let alone disassembled).
Metadata that describes all the types referenced by the code to facilitate code verification, GC, reflection, etc.
This doesn't verify anything other than the fact that it has a matching ABI (it can all be faked if you have no method of verifying who wrote it); .NET and Java both come back to code-signing for authenticity. Type safety is more granular than machine code, but machine code does have facilities for this stuff too (_Function@4: 4 bytes of parameters pushed on the stack; _ZN7Namespace6FunctionEPcz: takes a character pointer and a variable-length argument list). Going to the effort of making the name-mangling conventions lie is pointless, and code signatures can prevent any malicious actions anyway.


To clarify my point, using Java/.NET doesn't (necessarily) make you any worse a programmer, but virtual machines risk a lack-of-knowledge syndrome. If new programmers enter straight into VMs without ever having used machine code directly, then when the "elite" programmers who wrote the VMs begin to retire, who is going to maintain the VMs themselves? The programmers raised on the VM won't have a clue how the stuff beneath it works, and they'll be stuck.

Using the right tool is a good thing; the thing that started me off was the comment that 'running any code outside a VM is a bad idea'. If you can't run code outside a VM, then your OS design is flawed, not the system. Taking advantage of the hardware directly, if that is what the programmer deems appropriate for a given situation, should definitely be permitted.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 11:41 am
by Colonel Kernel
AR wrote:You've never heard of code-signing...? Code signatures verify that the program hasn't been tampered with, if the signature is missing or invalid then it isn't authentic.
True... I hate waking up early, BTW ;)

What I should have pointed out was that I don't think it's possible to statically verify the type-safety of machine code in the general case.

What mystran said about safe languages also applies to the intermediate representation. IL for example has instructions considered to be "unsafe", but it has a well-defined "safe" subset that the "safe" high-level languages can be compiled to. I think it's extremely difficult to do the same with machine code, especially when indirect addressing and custom heap allocators are involved.
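
As a concrete illustration (assuming the stock .NET SDK tools): compile the snippet below with csc /unsafe and run peverify over the result, and the verifier flags the method as unverifiable. That's the check I don't believe can be done on raw machine code in the general case.

Code:
// Compile: csc /unsafe PokeDemo.cs
// Check:   peverify PokeDemo.exe  (reports Poke as unverifiable)
class PokeDemo
{
    static unsafe void Poke(int* p)
    {
        // Raw pointer arithmetic: expressible in IL, but outside
        // the safe, statically verifiable subset.
        *(p + 1000) = 42;
    }

    static void Main() { }
}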
About re-compiling, it shouldn't be that difficult to simply run GCC again with different target parameters
<sigh> Tell that to Joe User. Better yet, tell that to my CEO who doesn't want to distribute our core IP to the entire world. I agree with him, BTW. The rest of this line of discussion would become political, and I don't want to go there. Suffice it to say that current economic realities make open source a very unattractive option for a lot of software companies, including the one I work for.
This depends, you could just remove the global data section and be done with it,
What would that solve? Shared libraries can still write all over each other's memory... they're all in the same address space.
or you could provide something similar using a special pipe (write a read command, write the variable's ID number, read the variable back; this way random memory corruption won't damage the library's state).
And what part of the system is going to enforce the use of this mechanism...?
Intermediate representation of instructions (bytecode, IL, whatever)

I don't see how this changes anything. I could distribute .py files; sure, that's the full source, but it's still "intermediate" in that it isn't machine code. The only difference bytecode makes is the speed at which it is interpreted; other than that there is no real difference, as it is still translated into machine code anyway. By that logic you could also call a normal binary a "pre-interpreted program".
See static verifiability above.
A JIT compiler (or interpreter, if JIT compilation isn't feasible... e.g. on memory-constrained devices) to make everything run

Again, I don't see the necessity. The performance gain from compiling at run time rather than installing an appropriate machine-code build should be minimal, apart from distribution size.
JIT compilers don't necessarily make things faster (I did point out interesting opportunities for optimization, although this is an area of ongoing research), but JIT-compiled code is certainly faster than interpreted code. Speed is not really the point anyway... if you have an intermediate representation, you have to run it somehow. This was, after all, an item from a VM feature list, not a list of benefits (which is currently: memory protection/type safety, GC, and portability).
Or just distribute the source (which is what you are doing with .NET/Java anyway, since they can be decompiled, let alone disassembled).
There are obfuscation tools to make this pretty much as difficult as reverse-engineering a machine code binary. See above for why distributing source is not an option for everyone.

...continued...

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 11:42 am
by Colonel Kernel
Metadata that describes all the types referenced by the code to facilitate code verification, GC, reflection, etc.

This doesn't verify anything other than the fact that it has a matching ABI (it can all be faked if you have no method of verifying who wrote it); .NET and Java both come back to code-signing for authenticity.
Static verification for type-safety is a separate issue from authenticity (I apologize for not making that clear in my last feverish post).

As for faking it, the rules of the intermediate language make that impossible. If some library has metadata describing a type with int, double, and char fields, then that is what the type will look like in memory, because the VM is in charge of instantiating it. Remember, the safe subset of the IL can't really mess with memory directly, so it can't for example try to stick a 4-character UTF-16 string into the double. There's simply no way to express such an operation without using the unsafe instructions, in which case the library wouldn't be verifiable.
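
In C# terms (a trivial sketch): the field types recorded in the metadata are the only way the fields can be used, and the safe subset gives you no instruction to write anything else over them.

Code:
class LayoutDemo
{
    struct Numbers { public int I; public double D; }

    static void Main()
    {
        Numbers n = new Numbers();
        n.D = 3.14;        // fine: the metadata says D is a double
        // n.D = "abcd";   // won't compile, and no verifiable IL
        //                 // sequence could write a string over
        //                 // those eight bytes either
    }
}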
Type safety is more granular than machine code, but machine code does have facilities for this stuff too (_Function@4: 4 bytes of parameters pushed on the stack; _ZN7Namespace6FunctionEPcz: takes a character pointer and a variable-length argument list). Going to the effort of making the name-mangling conventions lie is pointless, and code signatures can prevent any malicious actions anyway.
It's not maliciousness that's at issue (you're right about code signatures dealing with this), it's the presence of memory-smashing bugs.
To clarify my point, using Java/.NET doesn't (necessarily) make you any worse a programmer, but virtual machines risk a lack-of-knowledge syndrome.
That problem exists regardless of language. None of these tools are a substitute for basic intelligence. If it makes you feel any better, I'll yell at the next summer intern that abuses GC. ;)
If new programmers enter straight into VMs without ever having used machine code directly, then when the "elite" programmers who wrote the VMs begin to retire, who is going to maintain the VMs themselves? The programmers raised on the VM won't have a clue how the stuff beneath it works, and they'll be stuck.
This is why I believe it's good to understand as much as possible. In the end, I think there will always be people like us who live and breathe this stuff... our curiosity drives us to understand how the low-level things work. I am concerned about the overwhelming shift from C/C++ to Java in university programming courses though. I think at least understanding the world of "unmanaged memory" is a prerequisite for any programmer.
Using the right tool is a good thing; the thing that started me off was the comment that 'running any code outside a VM is a bad idea'.
I just said it was becoming obsolete. The software industry decided that, not me. :) But I do believe that for most applications, running outside a VM is pointless, now that VMs are becoming mainstream and very capable. Of course there will always be exceptions (real-time systems, for one), but I think they're becoming fewer and fewer. Even games can run as managed code with a minimal performance hit.
If you can't run code outside a VM, then your OS design is flawed, not the system. Taking advantage of the hardware directly, if that is what the programmer deems appropriate for a given situation, should definitely be permitted.
And that's why there will always be OSes that allow this. I just don't think every OS needs to allow this.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 4:02 pm
by mystran
Why not distribute code as intermediate code (like bytecode) and then JIT compile it on the fly when loading? Now, you say this is what a VM is about, but you don't really need the VM part, just the JIT as part of the linker-loader.

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 4:24 pm
by Colonel Kernel
Why not distribute code as intermediate code (like bytecode) and then JIT compile it on the fly when loading?
That's how JIT compilation works, although often the compilation can be done on a per-function basis as functions are called for the first time. That's why it's called JIT (just in time)...
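
You can even watch this happen from inside a program. Here's a small sketch using .NET 2.0's DynamicMethod (Reflection.Emit): we hand the runtime a few IL instructions and get back a delegate that has been JIT-compiled to native code by the time we call it.

Code:
using System;
using System.Reflection.Emit;

class JitDemo
{
    delegate int Adder(int a, int b);

    static void Main()
    {
        // Build a tiny add(a, b) method in raw IL at run time...
        DynamicMethod dm = new DynamicMethod(
            "Add", typeof(int), new Type[] { typeof(int), typeof(int) },
            typeof(JitDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ret);

        // ...and the JIT turns that IL into machine code when the
        // delegate is created and first invoked.
        Adder add = (Adder)dm.CreateDelegate(typeof(Adder));
        Console.WriteLine(add(2, 3)); // prints 5
    }
}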
Now, you say this is what a VM is about, but you don't really need the VM part, just the JIT as part of the linker-loader.
... and the run-time environment, right? If those two things don't constitute a VM, then maybe we need a better definition of VM first. :)

From dictionary.com:
virtual machine

1. An abstract machine for which an interpreter exists. Virtual machines are often used in the implementation of portable executors for high-level languages. The HLL is compiled into code for the virtual machine (an intermediate language) which is then executed by an interpreter written in assembly language or some other portable language like {C}.
This seems to suggest that if it doesn't have an interpreter, then it isn't a VM... not sure I agree with that. IMO a Java VM (for example) is still a VM even if it JIT-compiles the bytecode rather than interprets it...

Re:Why VMs are good, mmmmkay

Posted: Sat Apr 02, 2005 11:26 pm
by AR
<sigh> Tell that to Joe User. Better yet, tell that to my CEO who doesn't want to distribute our core IP to the entire world. I agree with him, BTW. The rest of this line of discussion would become political, and I don't want to go there. Suffice it to say that current economic realities make open source a very unattractive option for a lot of software companies, including the one I work for.
I am not an "everything should be open source" fanatic; what I meant is that you could pre-compile for different targets and include multiple binaries in the distribution.
or you could provide something similar using a special pipe (write a read command, write the variable's ID number, read the variable back; this way random memory corruption won't damage the library's state).
What would that solve? Shared libraries can still write all over each other's memory... they're all in the same address space.
I may not be aware of how libraries function, but I was under the impression that they had .text (read-only), local per-instance .data/.bss (read-write), and a global data section shared between all instances. If this isn't how it's done, then again, that's a flaw in the design.
And what part of the system is going to enforce the use of this mechanism...?
I imagine the kernel would. You provide a buffer in kernel space to the library, which the library can interact with through a pipe. If that is the only way the library can share data across instances, then the programmer is going to have to use it.
I just said it was becoming obsolete. The software industry decided that, not me. But I do believe that for most applications, running outside a VM is pointless, now that VMs are becoming mainstream and very capable. Of course there will always be exceptions (real-time systems, for one), but I think they're becoming fewer and fewer. Even games can run as managed code with a minimal performance hit.
It is impossible for machine code to be obsolete unless you're going to invent a system where the code is interpreted by something that doesn't require hardware. I could also add that the Quake port is still "unmanaged" with a few managed snap-ons (I also saw "The .NET Show" where they demonstrated it; the engine went from 60 FPS to 50 FPS just from adding /clr). The "software industry" didn't decide that, Microsoft did; Java has been around for ages, but its uptake has been less than Visual Basic 6's.