A compiler tutorial in progress...
Some time ago, I promised to write a compiler tutorial for a simple Lisp dialect. A Lisp dialect, because they are quite easy to develop, but still have a few nasty things to take care of.
Anyway.. I've finally started writing such a tutorial. So far I've "finished" the parser (or rather, scanner & parser) part. There's also a quite lengthy description of the (small) language which is the subject of the tutorial.
The work-in-progress version of the tutorial, as it grows, can be found at http://www.cs.hut.fi/~tvoipio/compiling.html. I'd like to hear comments, additions, suggestions, corrections, critique, praise, questions, cursing, whatever...
In any case, it certainly isn't finished, and even the part that is written isn't really well proof-read yet, so there might be stupidities and errors in there.
Finally, I'll post a note here when I get the other parts (interpreter and compiler) written.
Re:A compiler tutorial in progress...
If you read the tutorial, PLEASE tell me if something should be explained better, or if there's something that really doesn't make sense. I will try to go over it again (and again, and again) and see if I can clarify things later.. but until then, suggestions are highly appreciated.
Oh, and sorry for the formatting. I'll split it to separate pages when it's ready.
Re:A compiler tutorial in progress...
i kinda read it. i don't do lisp, and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.
it seems to be the standard, and it dates from when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly-linked list of tokens.
i can't really comment, as i skipped most of it.
-- Stu --
Re:A compiler tutorial in progress...
df wrote: i kinda read it. i don't do lisp, and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.
it seems to be the standard, and it dates from when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly-linked list of tokens.
i can't really comment, as i skipped most of it.
My lexers read the file into a FIFO of tokens, to save memory and to allow a parser to remove them in parallel. It also gives multiprocessors more to do, btw, which appears to be the future.
Re:A compiler tutorial in progress...
df wrote: i kinda read it. i don't do lisp, and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.
it seems to be the standard, and it dates from when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly-linked list of tokens.
If you mean reading all the tokens and building a linked list of them, and only then moving on to parsing, that is of course fine. I don't know if it's any simpler than reading tokens on demand, but the "online" or "on-demand" method, where the parser calls the scanner, has one benefit: you can start parsing before you have read all the code.
For an offline compiler this is of course irrelevant, but for an interpreter it lets you structure the whole thing as an "eval loop": the loop asks the parser for the next expression, the parser reads just enough tokens to construct one expression (or to conclude that there's an error), and that expression is evaluated before any more input needs to be read. This makes it very easy to build an interactive "shell interpreter", like those in Ruby, Python, and (virtually every) Lisp.
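To make that concrete, here's a rough sketch of such an eval loop (just an illustration, not from the tutorial; read_expression() and evaluate() are made-up names):
Code:
# minimal sketch of an "on-demand" eval loop; read_expression() is assumed to
# call the scanner for just enough tokens to build one expression (or raise
# SyntaxError), and evaluate() is whatever the interpreter does with it
def repl(read_expression, evaluate):
    while True:
        try:
            expr = read_expression()   # pulls tokens only as needed
        except EOFError:
            break                      # input exhausted, leave the loop
        except SyntaxError as e:
            print("parse error:", e)
            continue
        print(evaluate(expr))          # evaluate before reading any more input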
Re:A compiler tutorial in progress...
df wrote: and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.
it seems to be the standard, and it dates from when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly-linked list of tokens.
what's so backwards about it, exactly?
mystran made good points. i, for one, don't see anything inherently awkward or backwards with getting tokens on demand. and i find the getChar/ungetChar method quite convenient, and not just for tokenizing, but for a variety of tasks where scanning a sequence of characters is involved.
Code:
case digit:
    number = int_from_char(char)
    repeat:
        getChar()
        if not digit:
            ungetChar()            <---- is that so much of a problem?
            return token
        if would overflow:
            optionally ungetChar() <----
            return error
        recompute number
mystran
(if you're interested in that kinda thing) fewer typos is better
<This text is divided to four parts.
>This text is divided into four parts.
didn't read the whole thing yet, maybe later.
Re:A compiler tutorial in progress...
zloba wrote: mystran made good points. i, for one, don't see anything inherently awkward or backwards with getting tokens on demand. and i find the getChar/ungetChar method quite convenient, and not just for tokenizing, but for a variety of tasks where scanning a sequence of characters is involved.
Whether you prefer to peek at the next char (and eat it if you use it), or use getch (and then ungetch if you don't use it), is quite irrelevant. I could mention this, though...
It's essentially the same thing whichever way you do it =)
edit: I should probably finish the tokenizer (and might even properly pythonize it so it can be run) I think.
also, imagine this: i am tokenizing a stream coming from a socket. now, unless there's a protocol that works in chunks which i can read as a whole, tokenize and process, tokenizing the "whole" stream would be tricky.
Yeah. Since the tutorial is about compilers, I could have read the full file and parsed it as a "whole", but in general you want to be able to process stuff with an "online" algorithm.
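For example, a tokenizer can be written so that it only ever asks for the next character; then it doesn't care whether those characters come from a file or a socket. A rough sketch (just an illustration, the token rules here are made up and not the tutorial's):
Code:
# sketch: an "online" tokenizer as a generator; next_char() can read from a
# file, a socket, or anything else that yields one character at a time
# (returning "" at end of input). Tokens here are just parens and words.
def tokens(next_char):
    word = ""
    while True:
        c = next_char()
        if word and (c == "" or c in "() \t\n"):
            yield word                 # a word ended, emit it
            word = ""
        if c == "":
            return                     # end of input
        elif c in "()":
            yield c                    # parens are tokens by themselves
        elif not c.isspace():
            word += c                  # keep accumulating the current word
The parser can then pull one token at a time from this generator whenever it needs one, without caring how much of the input has arrived yet.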
But in any case, I think I'll include some of these points in the text itself.
(if you're interested in that kinda thing) fewer typos is better
<This text is divided to four parts.
>This text is divided into four parts.
Sorry, English is not my first language, so in some cases I make errors like that without noticing them. I also haven't proof-read it too well yet, sorry about that. I'll try to fix those as soon as I notice them.
edit: I will try to make the text shorter and easier to follow in the future. Especially as it is now, not properly split into chapters, it can be a bit tedious to read. But it's the first version anyway.
Re:A compiler tutorial in progress...
mystran wrote: Whether you prefer to peek at the next char (and eat it if you use it), or use getch (and then ungetch if you don't use it), is quite irrelevant. I could mention this, though...
It's essentially the same thing whichever way you do it =)
yeah, that's just my way of doing things.
however, with peek/next you do 2 calls for each "normal" char, whereas with get/unget you do one, except for the first "abnormal" one (and sometimes there isn't any "abnormal" one, e.g. with 1-char tokens and self-delimited tokens such as strings in quotes). a minor point, but still a point, imho.
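to illustrate, here's a rough sketch of scanning a number both ways (is_digit() and int_from_char() are assumed helpers, not real library calls):
Code:
# with peek/next: two calls per accepted character
while is_digit(peekchar()):
    number = number * 10 + int_from_char(readchar())

# with get/unget: one call per character, plus a single unget at the end
while True:
    c = getchar()
    if not is_digit(c):
        ungetchar(c)
        break
    number = number * 10 + int_from_char(c)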
Re:A compiler tutorial in progress...
As for the getchar/ungetchar vs. readchar/peekchar design:
assuming we only have a primitive "getchar()" function, we can write either one on top of that primitive by simply writing:
Code:
# read/peek char, built on top of a primitive getchar()
lookahead = None

def readchar():
    global lookahead
    if lookahead is not None:
        temp = lookahead
        lookahead = None
        return temp
    else:
        return getchar()

def peekchar():
    global lookahead
    if lookahead is None:
        lookahead = getchar()
    return lookahead

# or get/unget(); to back up more than one
# character, one could use a stack for the
# holder instead.
holder = None

def getchar():
    global holder
    if holder is not None:
        temp = holder
        holder = None
        return temp
    else:
        return primitive_getchar()

def ungetchar(x):
    global holder
    holder = x
So the difference between the two is rather small.. =)
I have to admit that ungetchar() has the advantage that if you need to back up more than one character, it is easy to use a stack for the ungetchar/getchar holder..
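Something like this, for example (again just a sketch, still assuming the primitive_getchar() from above):
Code:
# stack variant: ungetchar() can now be called several times in a row,
# and getchar() hands the characters back in reverse (LIFO) order
pushback = []

def getchar():
    if pushback:
        return pushback.pop()
    return primitive_getchar()

def ungetchar(x):
    pushback.append(x)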
Re:A compiler tutorial in progress...
your parsers are going to get really big if you are trying to enforce type-safety directly in the grammar.
assuming you mean a context-free grammar, iirc type checking (as in, first declaring something and later, at some random point, comparing something to what was declared) can't be done with CFGs at all. that is a basic limitation of CFGs, just like regular expressions can't handle recursion.
am i wrong?
Re:A compiler tutorial in progress...
zloba wrote: assuming you mean a context-free grammar, iirc type checking (as in, first declaring something and later, at some random point, comparing something to what was declared) can't be done with CFGs at all. that is a basic limitation of CFGs, just like regular expressions can't handle recursion.
am i wrong?
Ah, you are right. While you can enforce that only certain types of expressions can be combined, you can't easily check what type a certain variable has. You can get away with this by using a BASIC-style typing system, where string variables are $variables and (say) float variables are %variables.
So yeah. I thought about this when writing the sentence you quoted, and even wrote a clarification, but then got confused and thought "I'll fix this later."
Maybe it's better to just say that it's not possible =)
Thanks for pointing this out.
Is it ok to say "You can't enforce static typing in a grammar like this"? Oh, and I should add somewhere that we are in fact talking about context-free grammars here... or should I add something like "unless your variable names can uniquely identify the types of the variables, and you never plan to allow user-defined types"?
Re:A compiler tutorial in progress...
Added some clarifications to the beginning of the parser section (after the lexer, that is) about what exactly a context-free grammar is, and fixed the "can't enforce static typing with a context-free grammar" part.
edit: I think what I'll do next is split the different parts onto separate pages. This would also allow me to use subsections without having tons of different levels of headers, so I can structure the text a bit better.
Re:A compiler tutorial in progress...
Is it ok to say ...
that depends on what your target audience is, what they are looking for, on what level you want to explain the material, theoretical-vs-practical, rigorous-vs-simple... and how much effort you are willing to put into creating such a document.
which is what i wanted to ask.
as for me, although i am somewhat familiar with the whole process, i wouldn't mind reading a comprehensive overview. a fresh look at old things sometimes leads to better understanding and new creative ideas.
the way i'd like it is so that an idiot like myself can understand it. but that would be a lot of work.
i am trying to write a compiler. the scanner and parser are pretty much complete, as well as a sketchy code generator. i am trying to come up with a good object model for contexts, type information and a code generator.
Re:A compiler tutorial in progress...
zloba wrote: as for me, although i am somewhat familiar with the whole process, i wouldn't mind reading a comprehensive overview. a fresh look at old things sometimes leads to better understanding and new creative ideas.
I think my target audience is a hobbyist programmer who can write C, Python (or Lisp or some other high-level language..) and knows enough assembler to understand how programs work at a low level, but doesn't quite see the big picture of taking source code and turning it into a binary.
zloba wrote: the way i'd like it is so that an idiot like myself can understand it. but that would be a lot of work.
So be it. I think I'll first try to write some kind of basic tutorial, which includes enough to get a moderately experienced programmer to understand what to do, and then dumb it down in whatever parts are hardest to understand =)
zloba wrote: i am trying to write a compiler. the scanner and parser are pretty much complete, as well as a sketchy code generator. i am trying to come up with a good object model for contexts, type information and a code generator.
Ok. I will go the route of first teaching how to write an interpreter, and then modifying that interpreter to actually generate assembler that you can throw at your stock assembler (like gas or NASM or whatever) and link with a small runtime (written in C, probably, to make it easy to provide glue between C code and our own). Or, if I feel like it later, I might even show how to compile "on the fly", so that in the true spirit of Lisp, a program can write new programs, call the compiler, and then call functions defined in the new code.
My plans would include explaining basic stack usage and how to pass variable numbers of arguments (there are several ways, actually), and talking about some of the choices involved.
I will also include full closures (that is, a function returned from another function still has access to the variables that existed in the local scope where it was defined), so as to enable a functional way of building object systems. Many high-level languages include closures, although at least Java requires one to declare variables "final" if they need to be captured.
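To show what I mean, here is a tiny sketch in Python (not the tutorial's Lisp, and make_counter is just a made-up example) of how closures can act like a small object:
Code:
# make_counter() returns two functions that both keep access to the local
# variable "count" even after make_counter() itself has returned
def make_counter():
    count = 0
    def increment():
        nonlocal count
        count += 1
        return count
    def value():
        return count
    return increment, value

inc, val = make_counter()
inc(); inc()
print(val())   # prints 2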
I have no idea whether I will go into optimizations; there are lots of them to do. Especially with a dynamically typed language, you can save a lot of checks (including many runtime type checks) if you do some reasoning at compile time, or lower the amount of garbage generated... Some of this stuff I've done, some of it I know how to do, and some of it would require research on my part, so the future will tell how much I will include.