Page 1 of 1

advice on Intermediate Representation format

Posted: Fri May 27, 2016 3:10 pm
by cxzuk
Hi all,

Spending some time on the Intermediate Representation on my compiler and was wondering if anyone had any experience or advice to help guide me. A summary or link to pros and cons on each type would be great.

I'm currently considering a stack IR as it seems the simplist.

Thank you
Mikey

Re: advice on Intermediate Representation format

Posted: Sun May 29, 2016 5:48 am
by Brendan
Hi,
cxzuk wrote:Spending some time on the Intermediate Representation on my compiler and was wondering if anyone had any experience or advice to help guide me. A summary or link to pros and cons on each type would be great.

I'm currently considering a stack IR as it seems the simplist.
For which purpose?

For simplicity, the best would be leaving it as an abstract syntax tree because this is what your parser is likely to generate anyway. This is fine for initial optimisations, and is relatively space efficient to encode as sequential data (you only need 2 bits per entry, one to say if an entry has children and another to say if an entry is its parent's last child).

For making it easier to do more complex optimisations, the best would be static single assignment form. This is probably the worst option for space efficiency.

For space efficiency, the best would be something like machine code where operations are encoded as (e.g.) "dest opcode src1, src2, ..." and where there's explicit control flow opcodes (jumps, branches and calls); and where you pretend there's an "infinite" number of virtual registers (to avoid the bloat of all the pushes that stack machines cause). This is probably also the easiest form to convert to actual machine code (e.g. beginning with assigning real registers to virtual registers, etc).

Note that you'll probably use multiple representations at different places (e.g. convert source code into AST, do stuff with AST, then convert to SSA, do stuff with SSA, then convert to "machine-code-like", then convert to actual machine code).


Cheers,

Brendan

Re: advice on Intermediate Representation format

Posted: Sun May 29, 2016 12:49 pm
by dseller
Be sure to also look at Three-address code.

Re: advice on Intermediate Representation format

Posted: Mon May 30, 2016 7:21 am
by jnc100
The most common forms of IR either target a stack machine or a register machine. Typically they assume an infinite amount of stack space or registers. A stack based code is far easier to generate from an AST, but is more difficult to convert to machine code*. A register based IR on the other hand is essentially the opposite.

Most C-like compilers (I'm talking from experience with gcc and LLVM here) will convert the initial code into an AST with pseudoinstructions, which are then again converted into a register-based IR, which is typically a three address code version (one operation, one output operand and two input operands).

Once you have this, you essentially apply multiple 'passes' that will bring the IR closer and closer to direct machine code. For example, common passes include a conversion to SSA form, performing optimizations on that SSA form e.g. constant propagation and elimination of dead code, then convert back from the SSA form (essentially eliminating the phi instructions), identification of register semantics (e.g. bit length, integer vs float, does it need a memory location because its address is taken at any point), conversion to two-address form (e.g. to support x86 like instructions where the destination is also one of the source operands), selection of machine code instructions to fulfil one or many IR instructions, register coalescing and allocation, machine code peep-hole optimizations etc.

To make this easy, your IR should be expressive enough such that semantic information can be gleaned easily and cheaply in any pass. Particularly (if choosing a register based three address code), you'd want: operator, source, destination, bit length (e.g. 32 bit add or 64 bit or single float etc - this may be combined with the operator such that add.i4 and add.i8 are different operators), signedness for arithmetic operators (signed vs unsigned multiply), destination constraints (e.g. a set-if-equal instruction will only output either 0 or 1 - this can be used to optimize away later instructions that test if the value is 2), any 'special' semantics (e.g. is this a move instruction that can be optimized out by the register coalescing algorithm) and control-flow information. You do not necessarily know all these pieces of information when you form the IR, but they can be added by later passes.

Regards,
John.


* Actually a stack-based IR can be really easily converted to machine code by virtue of utilising the machine's stack: e.g. the AST: add(load_constant(13), load_constant(29)) can be expressed as the IR stack instuctions: load_constant(13); load_constant(29); add; which could then be expressed as x86 machine code: push 13; push 29; pop eax; pop ebx; add eax, ebx; push eax;. Obviously this is highly unoptimized but is really quick to both implement and actually compile. It is possibly a valid option if you have a machine with a large register file because then instead of using the actual stack (which required slow memory accesses) you use the register file itself, e.g. reserve r0 and r1 as 'scratch' registers, then the 'stack' is r2-r16 or however many you have, spilling over into memory if necessary. This is an approach I have previously taken in a JIT compiler on x86_64 and ARM when speed of compilation is also a significant factor to take into consideration.

Re: advice on Intermediate Representation format

Posted: Mon May 30, 2016 10:30 am
by cxzuk
Thank you all for your comments, I've decided on a Tuple based Intermediate Representation.

If anyones interested, a summary on my thoughts and reasons why. Feedback always welcome;

I don't have an AST, only a Concrete Syntax Tree (which is also used in the editor for rending), The useful AST features are handled in methods. But I can see it would make sense to use it if I had one.

I have a machine/target object that has AddMethodBody(Name, IntermediateRep) (CreateMethod(Name) is a separate method to aid/provide forward symbol referencing/lookup). The object will have a method Compile() that will return a blob or ASM etc.

My desired targets don't easily share concepts like stacks, registers etc. So conceptually all allocations are heap based, and stack allocation will be a transformation pass along with register allocation in the target object. I don't have aliasing, and references are either statically bound once, or handled outside of the translation unit and can't be allocated anyway, so I don't require Escape/Alias/Pointer analysis which should hopefully make for simple transformations (famous last words).

Draft IR
CREATE <object-name>

PUSH <object-name> Add an object name to the CONTEXT that will be used with a message.
QUERY <target-object-name>, <result-object-name>, <message-name>
COMMAND <target-object-name>, <message-name>

LABEL <label-name>
IFTRUEJUMP <cond-object-name>, <label-name>
JUMP <label-name>

Im feeling that the transformations im gonna need will change the IR, so id rather get coding and see what the code tells me about what challenges arise in the transformations. E.g. COMMAND/QUERY could be joined into MESSAGE, PUSH could be CONTEXT <obj1-name>, <obj2-name>,... <objn-name> or even part of the QUERY COMMAND instructions, if its easier to deal with.

Wish me luck,
Mikey