JIT - stack machine into single operations?

AndrewAPrice · Post by **AndrewAPrice** » Sun Nov 01, 2009 7:17 am

I'm attempting to write my own JIT'er for the .Net runtime, starting with a very limited subset.

However, the .Net bytecode is built around a virtual stack-based machine, for example, a multiplication would be:

push variable onto stack
push second variable onto stack
multiple
pop answer

However, when trying to convert that to a register based system (x86) I don't want to translate each one of those individual pop's/push's, but rather convert it into as few instructions as possible.

I was wondering how I would go about mapping the instructions. I don't want to say 'if 2 pushes, a multiply, and 1 pop' then replace it with a multiple because that is potentially dangerous since a value could be kept in the stack from a previous operation. I also don't want to emitting code only when I reach a non-stack-modifying instruction, because I might reach code where a lot of pop/push-ing is done to rearrange values on the stack and I skip over that completely.

The alternative way is to try to build a tree in memory of values being passed in and out, but this seems bad because it would create a lot of overhead for a JITer that is suppose to be light weight.

So, where should I go from here?

Martijn · Post by **Martijn** » Sun Nov 01, 2009 8:11 am

I think this is one of the most basic methods for code generation. I haven't implemented it yet, so i'm not sure if it works.

For each instruction, you call a translation method and pass in a stack structure that contains the result of cil pushes and pops.

Code: Select all

cil_push(stack *s, ...
cil_add(stack *s, ...

Basics:
For every instruction that pushes to the cil evaluation stack, you push a 'descriptor' to your stack with information about the value being pushed.
For every instrucion that pops from the evaluation stack, you emit code.

Example pseudocode:

Code: Select all

c = a + b;

Cil:

Code: Select all

.locals (int32 V_0, int32 V_1, int32 V_2)

ldloc.0 (pushes 1)
 - push a descriptor to your stack: type=local;value=0
ldloc.1 (pushes 1)
 - push a descriptor to your stack: type=local;value=1
add (pops 2, pushes 1)
 - on your stack there are two value descriptors of type local
 - emit code to load these 2 locals into 2 registers (eg eax and ebx)
 - emit: add eax, ebx
 - push a descriptor to your stack with: type=register;value=eax
stloc.2 (pops 1)
 - on your stack there is a value descriptor of type register
 - emit code to copy the register (eax) to the local

NickJohnson · Post by **NickJohnson** » Sun Nov 01, 2009 8:42 am

At least one thing you could do is have one register represent the top of the stack, and push it when pushing another value. For example, the C expression:

Code: Select all

(3 + 4) * 5
(equivalent to RPN (3 4 + 5 *))

would translate to

Code: Select all

mov eax, 3
push eax

mov eax, 4

pop ebx
add eax, ebx
push eax

mov eax, 5

pop ebx
mul eax, ebx

instead of the longer naive translation

Code: Select all

mov eax, 3
push eax

mov eax, 4
push eax

pop ebx
pop eax
add eax, ebx
push eax

mov eax, 5
push eax

pop ebx
pop eax
mul eax, ebx
push eax

Much better, but could definitely have more optimizations. You may be able to use other registers as representatives of the top of the stack as well, but I don't know how to do that. If you could find an algorithm to juggle two registers, it would probably produce something like this, which is nearly optimal for most operators:

Code: Select all

(eax represents one below top of stack, ebx represents top of stack)
mov ebx, 3

push eax
mov eax, ebx
mov ebx, 4

add ebx, eax
pop eax

push eax
mov eax, ebx
mov ebx, 5

mul eax, ebx

The small issues with that could probably be corrected through some simple types of peephole optimization.

Owen · Post by **Owen** » Sun Nov 01, 2009 4:26 pm

Convert the bytecode into SSA (Single Static Assignment) form and work from there.

For example, the code
push 4
push 3
push 2
mul
add

Would evaluate to something like:

4 -> V1
3 -> V2
2 -> V3
mul V3, V2 -> V4
add V4, V1 -> V5

Once you've done that you're in a much better position to do stuff like register assignment

ru2aqare · Post by **ru2aqare** » Sun Nov 01, 2009 4:31 pm

MessiahAndrw wrote:I'm attempting to write my own JIT'er for the .Net runtime, starting with a very limited subset.

However, the .Net bytecode is built around a virtual stack-based machine, for example, a multiplication would be:
push variable onto stack

push second variable onto stack

multiple

pop answer
However, when trying to convert that to a register based system (x86) I don't want to translate each one of those individual pop's/push's, but rather convert it into as few instructions as possible.

You need to do register allocation. Google/wiki it. Alternatively there are some good books on compiler construction, and most of them cover this topic. Aho et al. - Compilers, principles, techniques and tools 2nd ed. (aka the Dragon Book) comes to mind.

In my very crude CIL -> assembly compiler, I also attempted to do register allocation - allocate the top N values on the stack to registers (for example, in the order rax rcx rdx rbx r8-r15 etc.), the rest to stack (ie. [rbp-X]) locations. This worked surprisingly well and was fairly trivial to implement. You need to take the needs of some instructions into account - at least one operand of mul has to be in register rax; similar constrains for division. - If you want my code, I can send it to you.

AndrewAPrice · Post by **AndrewAPrice** » Sun Nov 01, 2009 5:00 pm

Thank you for all your replies.

The problem with the method Martijn presented is that it doesn't seem very friendly with branching, that is, if certain values are pushed to the stack in only one branch. SSA eliminates this problem, but after reading on this topic seems a bit blurred on how to deal with objects and arrays (a loop could make a variable number of changes to an object not known at runtime).

ru2aqare wrote:If you want my code, I can send it to you.

That would be nice to see an example. The way you described seems simple, although without converting to SSA I loose some of the inherit optimisations, so instead it would be dependent on the compiler to emit efficient IL (which I think is a fair assumption since JITing is suppose to be fast).

Owen · Post by **Owen** » Sun Nov 01, 2009 5:33 pm

The way LLVM and libJIT deal with arrays and objects is simple - don't treat memory as SSA. They then have an optimization pass which detects things like

Store V1 -> A1
... some code

Load A1 -> V2

and eliminates it

rootnode · Post by **rootnode** » Mon Nov 02, 2009 12:30 pm

At MOSA we're using SSA and we're getting along pretty well.
The transformation doesn't need much time and we end up having a normalized representation for our optimization stages.

OSDev.org

JIT - stack machine into single operations?

JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?

Re: JIT - stack machine into single operations?