Page 1 of 1

Verification of machine code

Posted: Sat Aug 11, 2007 3:41 pm
by singularity
Hi there!

Well an idea popped up in my head last week. Would it be possible to scan the machine code of all and every binary loaded by the kernel (not assembly code and not some intermediate code, real machine code), to verify that it doesn't use any priviledged instructions, doesn't modify certain registers, doesn't jump to foreign program code, doesn't dereference pointers to other programs, etc...

So in the end it would be possible to guarantee the safety of a given binary. Then it would be possible to run all programs in one address space, in ring0 and it would certainly allow for very fast message passing.

Singularity uses this too, but with an intermediate language...I would like to implement this same behaviour with normal machine code, to allow safety-checked legacy programs to run (surely there will have to be certain runtime checks too, so the binaries need to be modified a bit...).

Hm....well, what do you think about this?

Posted: Sat Aug 11, 2007 4:24 pm
by jnc100
Interesting, but incredibly difficult, which is probably one of the main reasons Singularity attempts to guarantee type safety from an intermediate language rather than machine code. Some of the things you suggest (preventing use of privileged instructions, doesn't modify certain registers - not entirely sure what you mean here but I guess you mean things like cr0 and cr3) are provided for by the protection mechanism of the i386 processor. The rest can basically be summed up as: data accesses by a process can only be made to preset parts of the image labeled data and can also not overflow any particular section within that region that they are writing to, and instruction flow is limited to those parts defined as methods/functions, with strictly defined rules as to how to transition between different parts of code.

This is where the difficulty comes in. Certainly, you can limit data accesses to a .data segment, a .bss segment, a heap and a stack, but what if you want to allow self-modifying code? Also, how do you prevent overflowing an array, as it is difficult to determine, from machine code, what is an array and how long it is. Code is a little easier: don't allow jumps to anything outside a process' .text section. This is fine for jumps of the type 'jmp rel8/16/32' or 'jmp ptr...' but what happens when you do something like 'jmp [edx]'? This is quite common in C++ code using vtables. How do you know beforehand what the value of [edx] is going to be?

One of the main reasons for CIL (what Singularity uses as its intermediate code) was that it provided a code that could be verified and yet converted to machine code quickly, as the difficulties of doing it with machine code were all too apparent.

The other issue you have to consider with the "one address space for all processes" method is that it is unrealistic on 32bit machines (the designers of Singularity admit this) as you could conceivably end up sharing 4gb between all your processes and your kernel. Long mode provides much more space, so it is feasible, but you are immediately ruling out a large proportion of the computer systems out there.

Regards,
John.

Posted: Sat Aug 11, 2007 4:55 pm
by Colonel Kernel
Being able to verify code depends a lot more on metadata accompanying the code than on the format of the code itself. Bartok (the CIL-to-x86 compiler used by Singularity) verifies CIL, but they plan to have it emit "typed assembly language" in the future that can be verified by an even smaller and simpler verifier, thus removing Bartok from the trusted part of the system. Typed assembly is just machine code along with a proof that the verifier can cross-check against the code itself.

You won't be able to verify a legacy binary unless it comes with some metadata that constitutes a proof of its type safety. This sort of verification really has to be built into the entire system from the beginning. You'll need to use hardware isolation (CPU privilege levels & the MMU) for legacy code (unless your idea of legacy is something like .NET assemblies or Java .class files).

Posted: Sat Aug 11, 2007 5:41 pm
by singularity
Yeah, I know, it would be very difficult...
I've read a few papers about machine code verification and then this idea simply popped up ;)

Jep, I meant registers like cr0 and cr3...and instructions like lidt, ltr, etc.
The rest can basically be summed up as: data accesses by a process can only be made to preset parts of the image labeled data and can also not overflow any particular section within that region that they are writing to, and instruction flow is limited to those parts defined as methods/functions, with strictly defined rules as to how to transition between different parts of code.
Yes, that was the basic idea...
The other issue you have to consider with the "one address space for all processes" method is that it is unrealistic on 32bit machines (the designers of Singularity admit this) as you could conceivably end up sharing 4gb between all your processes and your kernel. Long mode provides much more space, so it is feasible, but you are immediately ruling out a large proportion of the computer systems out there.
Yes, older systems would not be supported... (in my opinion, if I would allow hardware protection domains, there would be not much point in verifying machine code...)
Typed assembly is just machine code along with a proof that the verifier can cross-check against the code itself.
This is proof carrying code, right?

Oh well, thank you very much for you sharing your opinions with me! I'd really love to discuss a bit more, but I'm very tired at the moment and my head's a real mess now ;)

cheers

Posted: Mon Aug 13, 2007 8:13 pm
by Crazed123
There was an operating system called Go! that did that.

Posted: Tue Aug 14, 2007 2:00 am
by AJ
Hi,

A quick look on sourceforge revealed that Go! Was superceded by http://sourceforge.net/projects/odin-os/. Looks like that has been taken down now, though.

Cheers,
Adam

Posted: Wed Aug 15, 2007 6:03 am
by singularity
Yeah...I know, but the source is still in cvs...maybe, I'll look into this to get some clue, if this thingy is feasible...