LazyTongue compiler
Posted: Sat Nov 05, 2016 4:58 pm
Hey all. For some time now, I've been working on and off on a compiler for a toy language and a toy processor. I do not seek help, but present this story for your amusement and entertainment.
PROLOGUE
My efforts to build a compiler began a decade ago in a fit of not-made-in-here syndrome. I had been playing a video game called Garry's Mod, a sandbox physics game that allows for construction of intricate mechanisms, among other things. That game had a plugin called "wiremod" that added digital logic and programming elements - buttons, logic gates, text terminals, physical actuators, CPUs emulated at register/machine code level, and memory-mapped IO devices like keyboards and text terminals.
So I wanted to make an OS for a wiremod CPU, just to prove that I can. There was a problem with that - the CPU was programmable using assembly (or by directly writing machine code into memory), but there was no C compiler, so I couldn't port over any existing code. My assembly skills are also lacking, and I doubt anyone would develop apps for my OS if they had to do use assembly to do it.
That meant I needed a compiler. So I started learning how they function and trying to make one. There were many false starts - the syntax analysis (parsing a token stream into an abstract syntax tree) had been the most difficult to make by hand, and by the time I got to semantic analysis I would be lost in my own spaghetti code. But recently, one of thse attempts had developed into a compiler that could actually produce code, that in turn ran in the toy CPU and caused lights to blink in a sequence. Granted, it is only a snall subset of the language using only the simplest forms of branching and arithmetic, but still, BLINKENLIGHTS!
I am now having a much easier time developing the compiler, now that there is some tangible result. A fun side effect of it's incompleteness is that every bug fixed seems to add language features! Small featured. Like "i++" instead of "i = i+1", or being able to initialize a variable on the same line as declaring it. I am currently working on support for having stack frames and accessing variables from earlier frames, for things like giving the "for-loop" it's own scope.
Now, onto how the compiler is built right now:
Compiler structure
The compiler is written in C. I chose C instead of C++ because I hope to one day make it self-compiling, and supporting all of C++ is such a gargantuan task I don't even consider it.
The tokenizer and parser are generated using FLEX (lex) and BISON (yacc). The tokenizer eats text and returns tokens, the parser turns them into an abstract syntax tree. Not much science here, just figuring out how to get them to preserve the locations of all the tokens and nodes. The parser also calls some functions to transform certain parts of the syntax tree into a more convenient form (i.e. turning a recursive node into a list), just so they are easier to read later.
Then there comes the semantic analyzer. I know of no tools to generate one, so I had to make it from scratch. Here's how it works:
The semantic analyzer performs two passes over the program code. In the first pass, it gathers any declarative information and builds a symbol table, recording the types of variables, the scopes in which they are defined, the names and signatures of functions, their associated symbol tables and code segments (I somewhat erronousely refer to chunks of code as code segments. A code segment can be moved anywhere relative to another segment, but must not be deformed or split. This allows a linker to optimize the binary layout later).
The second pass again reads the same AST from the beginning, but in the context of already parsed declarative information, and generates an Intermediate Representation, that resembles very high level assembly.
The decision to make two passes over the tree was so that forward-declaration is not necessary as it is in C. The decision to use an intermediate representation was so that the semantic analyzer does not have to concern itself with code generation, and so that understanding of the source code by the compiler is independent of the target, for which the code is generated.
The semantic analyzer works like this: It visits the root node of AST, and recursively calls itself on the child nodes. Depending on the type of node and the current context, it will do some book keeping, change the current context, or maybe emit some code in between the recursive calls to the child nodes. For instance, when a function declaration node is visited, a symbol has to be created to represent the function, and added to the current (parent) symbol table. a new symbol table has to be created, to store the variables declared inside the function. The function symbol has to be associated with the new symbol table and the new code segment, into which the code of the function will be written. Finally, emitting the code to enter and exit the function has to be done i the correct order with a recursive call to analyze the node representing the function body, if correct code is to be generated. A set of unique names have to be generated to become the labels and values used in the code, and if the node is an expression, then the name of the value that holds the result of that expression is put on a stack and then popped by the function analyzing the parent node. There is a lot of stuff to keep track of and honestly it has gotten pretty spaghettified in here. I should clean it sometime.
The intermediate representation is a series of commands that specify an operation, it's arguments as 'values' and the resulting 'value'. Think values as registers, if the CPU had an infinite number of them. This means that reusing a value is unecesssary, and the semantic analyzer can treat them as napkins, or single-use dishware. They can be reused, it's just unnecessary. The problem of using the finite amount of registers to store a much larger amount of values is solved by the code generator and it's register allocator.
Here's an example of my IR:
When I was dealing with just one scope, a single-pass code generator worked fine. The register allocator was also pretty simple: spill the oldest register and return it. If the value that was previously in it is needed, load it from the stack. Resulted in a memory leak though.
Now I'm rewriting the code generator to also use a pass to check all declarations and another pass to actually generate code, as it will allow to allocate memory correctly, and allocate registers based on whichever value is not going to be needed for the longest time.
The actual generaton of code works like this: the generator sees an IR command, which either modifies its state, or causes it to emit assembly code to load approproate values into registers and perform the operation on them, then store then back. The load and store assembly is different depending on what type of storage the variable uses and how big it is.
At this point the source code has been turned into assembly. The assembler for the toy CPU already exists, so I just paste the assembly there and it gets assembled into machine code and uploaded into the CPU. Then lights blink and a counter shows increasing digits.
I would talk about the language I am compiling, but if we judge just by the features already implemented, it is essentially C without structs. Some cool stuff is planned, but, you know, ideas aren't very impressive before they are implemented.
If you would like to know more, or for me to clarify the operation of the compiler and how I solved this-or-that problem, I will be happy to oblidge.
PROLOGUE
My efforts to build a compiler began a decade ago in a fit of not-made-in-here syndrome. I had been playing a video game called Garry's Mod, a sandbox physics game that allows for construction of intricate mechanisms, among other things. That game had a plugin called "wiremod" that added digital logic and programming elements - buttons, logic gates, text terminals, physical actuators, CPUs emulated at register/machine code level, and memory-mapped IO devices like keyboards and text terminals.
So I wanted to make an OS for a wiremod CPU, just to prove that I can. There was a problem with that - the CPU was programmable using assembly (or by directly writing machine code into memory), but there was no C compiler, so I couldn't port over any existing code. My assembly skills are also lacking, and I doubt anyone would develop apps for my OS if they had to do use assembly to do it.
That meant I needed a compiler. So I started learning how they function and trying to make one. There were many false starts - the syntax analysis (parsing a token stream into an abstract syntax tree) had been the most difficult to make by hand, and by the time I got to semantic analysis I would be lost in my own spaghetti code. But recently, one of thse attempts had developed into a compiler that could actually produce code, that in turn ran in the toy CPU and caused lights to blink in a sequence. Granted, it is only a snall subset of the language using only the simplest forms of branching and arithmetic, but still, BLINKENLIGHTS!
I am now having a much easier time developing the compiler, now that there is some tangible result. A fun side effect of it's incompleteness is that every bug fixed seems to add language features! Small featured. Like "i++" instead of "i = i+1", or being able to initialize a variable on the same line as declaring it. I am currently working on support for having stack frames and accessing variables from earlier frames, for things like giving the "for-loop" it's own scope.
Now, onto how the compiler is built right now:
Compiler structure
The compiler is written in C. I chose C instead of C++ because I hope to one day make it self-compiling, and supporting all of C++ is such a gargantuan task I don't even consider it.
The tokenizer and parser are generated using FLEX (lex) and BISON (yacc). The tokenizer eats text and returns tokens, the parser turns them into an abstract syntax tree. Not much science here, just figuring out how to get them to preserve the locations of all the tokens and nodes. The parser also calls some functions to transform certain parts of the syntax tree into a more convenient form (i.e. turning a recursive node into a list), just so they are easier to read later.
Then there comes the semantic analyzer. I know of no tools to generate one, so I had to make it from scratch. Here's how it works:
The semantic analyzer performs two passes over the program code. In the first pass, it gathers any declarative information and builds a symbol table, recording the types of variables, the scopes in which they are defined, the names and signatures of functions, their associated symbol tables and code segments (I somewhat erronousely refer to chunks of code as code segments. A code segment can be moved anywhere relative to another segment, but must not be deformed or split. This allows a linker to optimize the binary layout later).
The second pass again reads the same AST from the beginning, but in the context of already parsed declarative information, and generates an Intermediate Representation, that resembles very high level assembly.
The decision to make two passes over the tree was so that forward-declaration is not necessary as it is in C. The decision to use an intermediate representation was so that the semantic analyzer does not have to concern itself with code generation, and so that understanding of the source code by the compiler is independent of the target, for which the code is generated.
The semantic analyzer works like this: It visits the root node of AST, and recursively calls itself on the child nodes. Depending on the type of node and the current context, it will do some book keeping, change the current context, or maybe emit some code in between the recursive calls to the child nodes. For instance, when a function declaration node is visited, a symbol has to be created to represent the function, and added to the current (parent) symbol table. a new symbol table has to be created, to store the variables declared inside the function. The function symbol has to be associated with the new symbol table and the new code segment, into which the code of the function will be written. Finally, emitting the code to enter and exit the function has to be done i the correct order with a recursive call to analyze the node representing the function body, if correct code is to be generated. A set of unique names have to be generated to become the labels and values used in the code, and if the node is an expression, then the name of the value that holds the result of that expression is put on a stack and then popped by the function analyzing the parent node. There is a lot of stuff to keep track of and honestly it has gotten pretty spaghettified in here. I should clean it sometime.
The intermediate representation is a series of commands that specify an operation, it's arguments as 'values' and the resulting 'value'. Think values as registers, if the CPU had an infinite number of them. This means that reusing a value is unecesssary, and the semantic analyzer can treat them as napkins, or single-use dishware. They can be reused, it's just unnecessary. The problem of using the finite amount of registers to store a much larger amount of values is solved by the code generator and it's register allocator.
Here's an example of my IR:
Code: Select all
FRAME ENTER
MOV A_1
LABEL loop
ADD A_2 A1 1
MUL A_1 A_1 A_2 // A_1 = A_1 * A_2
JL A_1 100 loop // if(A_1 < 100) goto loop;
FRAME LEAVE
Now I'm rewriting the code generator to also use a pass to check all declarations and another pass to actually generate code, as it will allow to allocate memory correctly, and allocate registers based on whichever value is not going to be needed for the longest time.
The actual generaton of code works like this: the generator sees an IR command, which either modifies its state, or causes it to emit assembly code to load approproate values into registers and perform the operation on them, then store then back. The load and store assembly is different depending on what type of storage the variable uses and how big it is.
At this point the source code has been turned into assembly. The assembler for the toy CPU already exists, so I just paste the assembly there and it gets assembled into machine code and uploaded into the CPU. Then lights blink and a counter shows increasing digits.
I would talk about the language I am compiling, but if we judge just by the features already implemented, it is essentially C without structs. Some cool stuff is planned, but, you know, ideas aren't very impressive before they are implemented.
If you would like to know more, or for me to clarify the operation of the compiler and how I solved this-or-that problem, I will be happy to oblidge.