Announcing Options--An OS to reverse engineer other OSes
Posted: Sun Feb 20, 2011 10:30 pm
Options is my own 64bit operating system. In the future, it will be a hypervisor operating system. Well, so it's vaporware now, right?
Well I have some code done. It's released under GPL 3, here: http://code.google.com/p/infrared/downloads/list
Options is part of Project Infrared. That project aims to make an OS that does dynamic translation of another operating system, in order to take notes as it runs it. This is somewhat faster than BOCHS because while actually executing code, the code can run natively.
The 64bit OS runs well in VMWare with 256MB guest RAM. For serious use it will need to be run on native hardware.
After taking notes while running a guest OS inside of my own Options OS, the next thing to do is use those notes together with the original disk image, to decompile the guest operating system! That's part of VmDec, the virtual machine decompiler that I haven't made yet. I do have some old released of VmDec on Google Code at this address: http://code.google.com/p/vm64dec/
But VmDec isn't useful at present as it was a static analyzer. The new project is a dynamic analyzer. This means it doesn't generate code you have to read, but it generates parse trees based on guest OS code that's actually executed--not that may execute hypothetically, but that has executed! The parse tree contains everything relevant about what the guest OS code has done. It also takes notes about control flow, logging the conditions that lead to a branch being taken or not being taken and flagging indirect control transfers for further study--such as, was it a switch-case statement (indirect jump)? What was the value of the switch and why did it have that value?
Thus it helps in test case generation, which is part of reverse engineering--first you exercising the target system's functionality--all of it that you need--then you document what the guest did for each test case, with some parse trees (the parse trees represent data. We have parse trees also for facts that were noticed and that influenced control flow). Our parse trees are directed acyclic graphs, with operators and the terminals are operands.
Here is the gist of what I am working on achieving:
Normally when you execute code the instructions have to run in a particular order. For example if I write to memory then read from memory, the memory write has to happen before the read because we have no idea whether the read will access the data written to by the write, or not, in general. But sometimes instructions can be parallelized. My hypervisor will divide basic blocks of machine code (normally a basic block is a sequence of decoded instructions that ends with a control transfer, or another basic block) into STEPS that must execute in order. Then we decompile each step of code after reconciliation is done (sometimes we have overlapping basic blocks because of the order in which code has been executed).
Has anyone ever heard of a dynamic analysis decompiling operating system? Or am I the first to attempt it?
Now I will "log" all STEPs of code. A step of code basically is a part of a basic block, it's a directed acyclic graph with operators and operands. By "log" I mean anything written to will have its old value logged to a big buffer and eventually to disk. Thus if I do "MOV EAX,1" then the old value of EAX will be stored in a big undo buffer and eventually be written to disk.
If I do this:
MOV EAX,1
MOV EAX,2
then because both instructions are the part of the same step, EAX's old value will be logged only once. All memory writes will have the memory's old value logged too.
It's not easy to get this to work with guest OSes that use hardware I/O access or that use paging hardware, but it is possible and I intend to do it (in particular memory read order has to be preserved in case a memory-mapped I/O device was being accessed).
I will also track each byte of guest RAM that is capable of containing executable code. This requires 6 bytes of metadata for each byte of guest RAM, because I have to keep track of from where the byte of memory came, if it came from disk. Then when the code in memory executes, I know which file from disk (hopefully a FAT32 volume) is executing and what the disk offset in that file is, for the executing code.
I've done already most of the above, but it was a lot of separate projects. I made already a simple PC emulator and a simple disassember, see the vm64dec link above for more information. This project puts all the pieces together though, into one big framework!
Where is it now? Well I have old versions of all the pieces, but they are not put together yet. I want to do a good job here, which means adapting some code and rewriting the rest. Options boots and initializes memory, but doesn't do anything too interesting as of yet.
Well I have some code done. It's released under GPL 3, here: http://code.google.com/p/infrared/downloads/list
Options is part of Project Infrared. That project aims to make an OS that does dynamic translation of another operating system, in order to take notes as it runs it. This is somewhat faster than BOCHS because while actually executing code, the code can run natively.
The 64bit OS runs well in VMWare with 256MB guest RAM. For serious use it will need to be run on native hardware.
After taking notes while running a guest OS inside of my own Options OS, the next thing to do is use those notes together with the original disk image, to decompile the guest operating system! That's part of VmDec, the virtual machine decompiler that I haven't made yet. I do have some old released of VmDec on Google Code at this address: http://code.google.com/p/vm64dec/
But VmDec isn't useful at present as it was a static analyzer. The new project is a dynamic analyzer. This means it doesn't generate code you have to read, but it generates parse trees based on guest OS code that's actually executed--not that may execute hypothetically, but that has executed! The parse tree contains everything relevant about what the guest OS code has done. It also takes notes about control flow, logging the conditions that lead to a branch being taken or not being taken and flagging indirect control transfers for further study--such as, was it a switch-case statement (indirect jump)? What was the value of the switch and why did it have that value?
Thus it helps in test case generation, which is part of reverse engineering--first you exercising the target system's functionality--all of it that you need--then you document what the guest did for each test case, with some parse trees (the parse trees represent data. We have parse trees also for facts that were noticed and that influenced control flow). Our parse trees are directed acyclic graphs, with operators and the terminals are operands.
Here is the gist of what I am working on achieving:
Normally when you execute code the instructions have to run in a particular order. For example if I write to memory then read from memory, the memory write has to happen before the read because we have no idea whether the read will access the data written to by the write, or not, in general. But sometimes instructions can be parallelized. My hypervisor will divide basic blocks of machine code (normally a basic block is a sequence of decoded instructions that ends with a control transfer, or another basic block) into STEPS that must execute in order. Then we decompile each step of code after reconciliation is done (sometimes we have overlapping basic blocks because of the order in which code has been executed).
Has anyone ever heard of a dynamic analysis decompiling operating system? Or am I the first to attempt it?
Now I will "log" all STEPs of code. A step of code basically is a part of a basic block, it's a directed acyclic graph with operators and operands. By "log" I mean anything written to will have its old value logged to a big buffer and eventually to disk. Thus if I do "MOV EAX,1" then the old value of EAX will be stored in a big undo buffer and eventually be written to disk.
If I do this:
MOV EAX,1
MOV EAX,2
then because both instructions are the part of the same step, EAX's old value will be logged only once. All memory writes will have the memory's old value logged too.
It's not easy to get this to work with guest OSes that use hardware I/O access or that use paging hardware, but it is possible and I intend to do it (in particular memory read order has to be preserved in case a memory-mapped I/O device was being accessed).
I will also track each byte of guest RAM that is capable of containing executable code. This requires 6 bytes of metadata for each byte of guest RAM, because I have to keep track of from where the byte of memory came, if it came from disk. Then when the code in memory executes, I know which file from disk (hopefully a FAT32 volume) is executing and what the disk offset in that file is, for the executing code.
I've done already most of the above, but it was a lot of separate projects. I made already a simple PC emulator and a simple disassember, see the vm64dec link above for more information. This project puts all the pieces together though, into one big framework!
Where is it now? Well I have old versions of all the pieces, but they are not put together yet. I want to do a good job here, which means adapting some code and rewriting the rest. Options boots and initializes memory, but doesn't do anything too interesting as of yet.