OSDev.org

Posted: **Sat Nov 05, 2016 4:58 pm**

Hey all. For some time now, I've been working on and off on a compiler for a toy language and a toy processor. I do not seek help, but present this story for your amusement and entertainment.

PROLOGUE
My efforts to build a compiler began a decade ago in a fit of not-made-in-here syndrome. I had been playing a video game called Garry's Mod, a sandbox physics game that allows for construction of intricate mechanisms, among other things. That game had a plugin called "wiremod" that added digital logic and programming elements - buttons, logic gates, text terminals, physical actuators, CPUs emulated at register/machine code level, and memory-mapped IO devices like keyboards and text terminals.

So I wanted to make an OS for a wiremod CPU, just to prove that I can. There was a problem with that - the CPU was programmable using assembly (or by directly writing machine code into memory), but there was no C compiler, so I couldn't port over any existing code. My assembly skills are also lacking, and I doubt anyone would develop apps for my OS if they had to do use assembly to do it.

That meant I needed a compiler. So I started learning how they function and trying to make one. There were many false starts - the syntax analysis (parsing a token stream into an abstract syntax tree) had been the most difficult to make by hand, and by the time I got to semantic analysis I would be lost in my own spaghetti code. But recently, one of thse attempts had developed into a compiler that could actually produce code, that in turn ran in the toy CPU and caused lights to blink in a sequence. Granted, it is only a snall subset of the language using only the simplest forms of branching and arithmetic, but still, BLINKENLIGHTS!

I am now having a much easier time developing the compiler, now that there is some tangible result. A fun side effect of it's incompleteness is that every bug fixed seems to add language features! Small featured. Like "i++" instead of "i = i+1", or being able to initialize a variable on the same line as declaring it. I am currently working on support for having stack frames and accessing variables from earlier frames, for things like giving the "for-loop" it's own scope.

Now, onto how the compiler is built right now:

Compiler structure

The compiler is written in C. I chose C instead of C++ because I hope to one day make it self-compiling, and supporting all of C++ is such a gargantuan task I don't even consider it.

The tokenizer and parser are generated using FLEX (lex) and BISON (yacc). The tokenizer eats text and returns tokens, the parser turns them into an abstract syntax tree. Not much science here, just figuring out how to get them to preserve the locations of all the tokens and nodes. The parser also calls some functions to transform certain parts of the syntax tree into a more convenient form (i.e. turning a recursive node into a list), just so they are easier to read later.

Then there comes the semantic analyzer. I know of no tools to generate one, so I had to make it from scratch. Here's how it works:

The semantic analyzer performs two passes over the program code. In the first pass, it gathers any declarative information and builds a symbol table, recording the types of variables, the scopes in which they are defined, the names and signatures of functions, their associated symbol tables and code segments (I somewhat erronousely refer to chunks of code as code segments. A code segment can be moved anywhere relative to another segment, but must not be deformed or split. This allows a linker to optimize the binary layout later).
The second pass again reads the same AST from the beginning, but in the context of already parsed declarative information, and generates an Intermediate Representation, that resembles very high level assembly.

The decision to make two passes over the tree was so that forward-declaration is not necessary as it is in C. The decision to use an intermediate representation was so that the semantic analyzer does not have to concern itself with code generation, and so that understanding of the source code by the compiler is independent of the target, for which the code is generated.

The semantic analyzer works like this: It visits the root node of AST, and recursively calls itself on the child nodes. Depending on the type of node and the current context, it will do some book keeping, change the current context, or maybe emit some code in between the recursive calls to the child nodes. For instance, when a function declaration node is visited, a symbol has to be created to represent the function, and added to the current (parent) symbol table. a new symbol table has to be created, to store the variables declared inside the function. The function symbol has to be associated with the new symbol table and the new code segment, into which the code of the function will be written. Finally, emitting the code to enter and exit the function has to be done i the correct order with a recursive call to analyze the node representing the function body, if correct code is to be generated. A set of unique names have to be generated to become the labels and values used in the code, and if the node is an expression, then the name of the value that holds the result of that expression is put on a stack and then popped by the function analyzing the parent node. There is a lot of stuff to keep track of and honestly it has gotten pretty spaghettified in here. I should clean it sometime.

The intermediate representation is a series of commands that specify an operation, it's arguments as 'values' and the resulting 'value'. Think values as registers, if the CPU had an infinite number of them. This means that reusing a value is unecesssary, and the semantic analyzer can treat them as napkins, or single-use dishware. They can be reused, it's just unnecessary. The problem of using the finite amount of registers to store a much larger amount of values is solved by the code generator and it's register allocator.

Here's an example of my IR:

Code: Select all

FRAME ENTER
MOV A_1 
LABEL loop
ADD A_2 A1 1
MUL A_1 A_1 A_2 // A_1 = A_1 * A_2
JL A_1 100 loop // if(A_1 < 100) goto loop;
FRAME LEAVE

When I was dealing with just one scope, a single-pass code generator worked fine. The register allocator was also pretty simple: spill the oldest register and return it. If the value that was previously in it is needed, load it from the stack. Resulted in a memory leak though.

Now I'm rewriting the code generator to also use a pass to check all declarations and another pass to actually generate code, as it will allow to allocate memory correctly, and allocate registers based on whichever value is not going to be needed for the longest time.

The actual generaton of code works like this: the generator sees an IR command, which either modifies its state, or causes it to emit assembly code to load approproate values into registers and perform the operation on them, then store then back. The load and store assembly is different depending on what type of storage the variable uses and how big it is.

At this point the source code has been turned into assembly. The assembler for the toy CPU already exists, so I just paste the assembly there and it gets assembled into machine code and uploaded into the CPU. Then lights blink and a counter shows increasing digits.

I would talk about the language I am compiling, but if we judge just by the features already implemented, it is essentially C without structs. Some cool stuff is planned, but, you know, ideas aren't very impressive before they are implemented.

If you would like to know more, or for me to clarify the operation of the compiler and how I solved this-or-that problem, I will be happy to oblidge.

Posted: **Sun Nov 06, 2016 2:25 pm**

Hi,

It would be good to post this on https://forum.compilerdev.org too.

Overall, it's an impressive story. It's also interesting how fixing bugs can add features. However, you could try to consider adding features that would make your programming language better than C or C++ or some other widely used programming language. If you are interested, I have some ideas about programming language design in this thread.

Regards,
glauxosdever

Posted: **Sun Nov 06, 2016 3:12 pm**

I had a lot of fun with Wiremod too, although I thought that the Wire E2 CPUs ran on LUA?

Posted: **Sun Nov 06, 2016 4:25 pm**

glauxosdever wrote:Hi,

It would be good to post this on https://forum.compilerdev.org too.

Overall, it's an impressive story. It's also interesting how fixing bugs can add features. However, you could try to consider adding features that would make your programming language better than C or C++ or some other widely used programming language. If you are interested, I have some ideas about programming language design in this thread.

Thanks. I'll think about it.

As far as language goes, I don't think I can make anything better than C or C++, just different. C has mountains of code written for it, because it runs on everything, because it is easy to write a compiler for it. C++ boasts phenomenal cosmic power and versatility almost without losing a tick of CPU performance. I can't compete, I can only make toys.

With that in mind, I do have some ideas. The language I am planning, called "LazyTongue", is designed for lazy programmers, in the sense that it should allow for multiple coding styles while offering optional type safety and a mix of high level and low level coding. In other words, it does not force the programmer to write sane, reliable code. The programmer can choose to do things the propper way, and a system of safety checking would be in place to assist with that, or he can choose to forego all that and lazily write some unsafe code in half as many lines.

zenzizenzicube wrote:I had a lot of fun with Wiremod too, although I thought that the Wire E2 CPUs ran on LUA?

In Wiremod, there are two advanced, programmable chips:
E2 (Expression Gate 2) is programmed using the E2 language, that has C-like syntax and has a library of built-in functions to manipulate the world (e.g. apply forces to objects). This code is compiled into Lua and goes through GMod's built-in Lua interpreter/JIT compiler, which is the driving force behind ALL Garry's Mod addons. E2 is an awkward language, because it started out as a handy way to write expressions, but was never meant for full fledged programs. It combines pure computation features with a limited subset of the game engine API (limited to facilitate fair competetive multiplayer).
It's the magic red-on-black gate that makes contraptions fly.

zCPU, on the other hand, is a CPU emulator that can do computations and nothing else. It has interrupts, memory mapping unit, 64 kB of memory, and it's native type is a 64-bit double-precision floating point number (an interesting quirk). It is fairly difficult to program, as one needs to know assembly (HL-ZASM, assembly with some "high level" features like functions, while-loops and short expressions). The code of the assembler itself is unstable and often fails silently, leading to mis-assembled binaries. This is because the developer of zCPU stopped working on it a long time ago, and nobody really understands how the CPU emulator works internally, due to heavy use of Lua-JIT optimisations.
It's a blue gate that few people know how to use.

Posted: **Mon Nov 07, 2016 4:02 am**

Zondartul wrote:zCPU, on the other hand, is a CPU emulator that can do computations and nothing else. It has interrupts, memory mapping unit, 64 kB of memory, and it's native type is a 64-bit double-precision floating point number (an interesting quirk). It is fairly difficult to program, as one needs to know assembly (HL-ZASM, assembly with some "high level" features like functions, while-loops and short expressions). The code of the assembler itself is unstable and often fails silently, leading to mis-assembled binaries. This is because the developer of zCPU stopped working on it a long time ago, and nobody really understands how the CPU emulator works internally, due to heavy use of Lua-JIT optimisations.
It's a blue gate that few people know how to use.

Wow, I never noticed the zCPU in game before.
Looks like the developer of zCPU ported SmallC to target his chip.

OSDev.org

LazyTongue compiler

LazyTongue compiler

Re: LazyTongue compiler

Re: LazyTongue compiler

Re: LazyTongue compiler

Re: LazyTongue compiler