A compiler tutorial in progress...

mystran

A compiler tutorial in progress...

Post by mystran »

Some time ago, I promised to write a compiler tutorial for a simple Lisp dialect. A Lisp dialect, because they are quite easy to develop, yet have a few nasty things to take care of..

Anyway.. I've finally started writing such a tutorial. So far I've "finished" the parser (or rather scanner & parser) part. There's also a quite lengthy description of the (small) language which is the subject of the tutorial.

The work-in-progress version of the tutorial as it grows can be found at http://www.cs.hut.fi/~tvoipio/compiling.html. I'd like to hear comments, additions, suggestions, corrections, critique, praise, questions, cursing, whatever...

In any case, it certainly isn't finished, and even the part written isn't really well proof-read yet, so there might be stupidities and errors in there.

Finally, I'll notify here when I get the other parts (interpreter and compiler) written.
mystran

Re:A compiler tutorial in progress...

Post by mystran »

If you read the tutorial, PLEASE tell me if something should be explained better, or if there's something that really doesn't make sense. I will try to go over it again (and again, and again) and see if I can clarify stuff later.. but until then, suggestions are highly appreciated.

Oh, and sorry for the formatting. I'll split it into separate pages when it's ready.
df

Re:A compiler tutorial in progress...

Post by df »

i kinda read it. i don't do lisp :) and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.

it seems to be the standard, and was written when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly linked list of tokens.

i can't really comment as i skipped most of it. :D
-- Stu --
Candy

Re:A compiler tutorial in progress...

Post by Candy »

df wrote: i kinda read it. i don't do lisp :) and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.

it seems to be the standard, and was written when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly linked list of tokens.

i can't really comment as i skipped most of it. :D
My lexers read the file into a FIFO of tokens, to save memory and to allow the parser to remove them in parallel. It also gives multiprocessors more to do, btw, which appears to be the future.
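
A minimal sketch of what such a token FIFO might look like in Python (an illustration only, not Candy's actual lexer code): a lexer thread pushes tokens into a queue while the parser drains it, so lexing and parsing can overlap on a multiprocessor. The whitespace "lexer" here is just a trivial stand-in for a real scanner.

Code: Select all

# A lexer thread fills a FIFO of tokens; the parser drains it in parallel.
import queue
import threading

tokens = queue.Queue()              # the FIFO of tokens
END = None                          # sentinel marking end of input

def lexer(source):
    # Trivial stand-in lexer: split on whitespace and push each token.
    for tok in source.split():
        tokens.put(tok)
    tokens.put(END)

def parser():
    # Consume tokens as they arrive; here we just print them.
    while True:
        tok = tokens.get()          # blocks until the lexer produces one
        if tok is END:
            break
        print("token:", tok)

threading.Thread(target=lexer, args=("(define x 42)",)).start()
parser()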
mystran

Re:A compiler tutorial in progress...

Post by mystran »

df wrote: i kinda read it. i don't do lisp :) and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.

it seems to be the standard, and was written when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly linked list of tokens.
If you mean read all the tokens first and build a linked list of them, and only then move on to parsing, that is of course fine. I don't know if it's any simpler than reading tokens on demand, but the "online" or "on-demand" method, where the parser calls the scanner, has one benefit: you can start parsing before you have read all the code.

For an offline compiler this is of course irrelevant, but for an interpreter it is useful to structure the whole thing so that there's an "eval loop" which asks the parser for the next expression; the parser reads just enough tokens to construct an expression (or conclude that there's an error), and that expression is evaluated before any more input needs to be read. This makes it very easy to build an interactive "shell" interpreter, like those in Ruby, Python, (virtually every) Lisp, etc.
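
To make the "eval loop" idea concrete, here's a minimal runnable sketch (an illustration only, not the tutorial's code): the loop asks for one expression at a time, and a trivial stand-in "parser" treats each input line as one expression, so evaluation starts before the rest of the input even exists.

Code: Select all

# Minimal interactive eval loop: parse one expression, evaluate it,
# repeat -- without ever needing the whole input up front.
import sys

def read_expression(stream):
    # Stand-in parser: one line = one expression.  A real parser would
    # read just enough tokens to build a complete syntax tree.
    line = stream.readline()
    if not line:                    # end of input
        return None
    return line.strip()

def evaluate(expr):
    return len(expr)                # stand-in for real evaluation

def eval_loop(stream):
    while True:
        expr = read_expression(stream)   # reads only as much as it needs
        if expr is None:
            break
        print(evaluate(expr))            # evaluated before reading more

eval_loop(sys.stdin)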
zloba

Re:A compiler tutorial in progress...

Post by zloba »

and i still don't understand such a backward approach to writing compilers, where people pull in 1 character at a time and then back it out if it's not what's expected, etc.

it seems to be the standard, and was written when machines had no memory... my compilers/lexers tokenise the entire file in memory as a doubly linked list of tokens.
what's so backwards about it, exactly?

mystran made good points. i, for one, don't see anything inherently awkward or backwards with getting tokens on demand. and i find the getChar/ungetChar method quite convenient, and not just for tokenizing, but for a variety of tasks where scanning a sequence of characters is involved.

Code: Select all

case digit:
  number=int_from_char(char)
  repeat
    getChar
    if not digit
      ungetChar  <---- is that so much of a problem?
      return token
    if would overflow
      optionally ungetChar <----
      return error
    recompute number
++also, imagine this: i am tokenizing a stream coming from a socket. now, unless there's a protocol that works in chunks which i can read as a whole, tokenize and process, tokenizing the "whole" stream would be tricky.
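
A small sketch of that situation (an illustration only, not zloba's code): the socket is wrapped in a file-like object and tokenized one character at a time, because there is no "whole" file to read first. socketpair() just stands in for a real network peer so the example runs on its own.

Code: Select all

# Tokenizing a character stream from a socket, one char at a time.
import socket

a, b = socket.socketpair()           # stand-in for a real network connection
a.sendall(b"(define x 42)\n")
a.close()

stream = b.makefile("r")             # file-like wrapper, so read(1) works

def next_token(stream):
    ch = stream.read(1)
    while ch.isspace():              # skip whitespace between tokens
        ch = stream.read(1)
    tok = ""
    while ch and not ch.isspace():   # a token ends at whitespace or EOF
        tok += ch
        ch = stream.read(1)
    return tok                       # "" means the stream ended

tok = next_token(stream)
while tok != "":
    print("token:", tok)
    tok = next_token(stream)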

mystran
(if you're interested in that kinda thing) fewer typos is better :)
<This text is divided to four parts.
>This text is divided into four parts.
didn't read the whole thing yet, maybe later.
mystran

Re:A compiler tutorial in progress...

Post by mystran »

zloba wrote: mystran made good points. i, for one, don't see anything inherently awkward or backwards with getting tokens on demand. and i find the getChar/ungetChar method quite convenient, and not just for tokenizing, but for a variety of tasks where scanning a sequence of characters is involved.

Code: Select all

case digit:
  number=int_from_char(char)
  repeat
    getChar
    if not digit
      ungetChar  <---- is that so much of a problem?
      return token
    if would overflow
      optionally ungetChar <----
      return error
    recompute number
Whether you prefer to peek at the next char (and eat it if you used it), or use getch (and then ungetch if you don't use it), is quite irrelevant. I could mention this though....

It's essentially the same thing whichever way you do it =)

edit: I should probably finish the tokenizer (and might even properly Pythonize it so it can be run), I think.
++also, imagine this: i am tokenizing a stream coming from a socket. now, unless there's a protocol that works in chunks which i can read as a whole, tokenize and process, tokenizing the "whole" stream would be tricky.
Yeah. Since the tutorial is about compilers, I could have read the full file and parsed it as a "whole", but in general you want to be able to process stuff with an "online" algorithm.

But in any case, I think I'll include some of these points in the text itself.
mystran
(if you're interested in that kinda thing) fewer typos is better :)
<This text is divided to four parts.
>This text is divided into four parts.
didn't read the whole thing yet, maybe later.
Sorry, English isn't my first language, so in some cases I make errors like that without noticing them. I also haven't proof-read it very well yet, sorry about that. I'll try to fix those as soon as I notice them.

edit: I will try to make the text shorter and easier to follow in the future. Especially as it is now, not properly split into chapters, it can be a bit tedious. But it's the first version anyway ;)
zloba

Re:A compiler tutorial in progress...

Post by zloba »

Whether you prefer to peek at the next char (and eat it if you used it), or use getch (and then ungetch if you don't use it), is quite irrelevant. I could mention this though....

It's essentially the same thing whichever way you do it =)
yeah, that's just my way of doing things.
however, with peek/next you do 2 calls for each "normal" char, whereas with get/unget you do one, except for the first "abnormal" char (and sometimes there isn't any "abnormal" one at all, such as with 1-char tokens and self-delimited tokens like strings in quotes). a minor point, but still a point, imho.
mystran

Re:A compiler tutorial in progress...

Post by mystran »

As for the getchar/ungetchar vs. readchar/peekchar design:

assuming we only have a primitive "getchar()" function, we can write either one on top of it by simply writing:

Code: Select all

# read/peek char
lookahead = None

def readchar():
   global lookahead
   if lookahead is not None:
      temp = lookahead
      lookahead = None
      return temp
   else:
      return getchar()

def peekchar():
   global lookahead
   if lookahead is None:
      lookahead = getchar()
   return lookahead

# or get/unget(), built on the primitive (here primitive_getchar());
# for more than one character of ungetchar() one
# could use a stack instead.
holder = None

def getchar():
   global holder
   if holder is not None:
      temp = holder
      holder = None
      return temp
   else:
      return primitive_getchar()

def ungetchar(x):
   global holder
   holder = x
So the difference between the two is rather small.. =)
I have to admit that ungetchar() has the advantage that if you need to back up more than one character, it is easy to use a stack for the ungetchar/getchar holder..
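
A quick sketch of that stack variant (an illustration only), with the primitive read defined on standard input so it runs on its own: ungetchar() pushes onto a list, and getchar() hands characters back in last-in, first-out order.

Code: Select all

# get/unget with a stack, so any number of characters can be backed up.
import sys

def primitive_getchar():
   return sys.stdin.read(1)          # the underlying one-character read

pushed_back = []                     # stack of characters given back

def getchar():
   if pushed_back:
      return pushed_back.pop()
   return primitive_getchar()

def ungetchar(x):
   pushed_back.append(x)

ungetchar("a")
ungetchar("b")
print(getchar(), getchar())          # prints: b a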
zloba

Re:A compiler tutorial in progress...

Post by zloba »

indeed :)
zloba

Re:A compiler tutorial in progress...

Post by zloba »

your parsers are going to get really big if you are trying to enforce type-safety directly in the grammar.
assuming you mean a context-free grammar, iirc type checking (as in, first declaring something and later, at some random point, comparing something to what was declared) can't be done with CFGs at all. that is a basic limitation of CFGs, just like regular expressions can't handle recursion.
am i wrong?
mystran

Re:A compiler tutorial in progress...

Post by mystran »

zloba wrote:
your parsers are going to get really big if you are trying to enforce type-safety directly in the grammar.
assuming you mean a context-free grammar, iirc type checking (as in, first declaring something and later, at some random point, comparing something to what was declared) can't be done with CFGs at all. that is a basic limitation of CFGs, just like regular expressions can't handle recursion.
am i wrong?
Ah, you are right. While you can enforce that only certain types of expressions can be combined, you can't easily check what type a certain variable has. You can get away with this by using a BASIC-style typing system where string variables are $variables and, say, float variables are %variables.

So yeah. I thought about this when writing the sentence you quoted, and even wrote a clarification, but then got confused and thought "I'll fix this later."

Maybe it's better to just say that it's not possible =)
Thanks for pointing this out.

Is it ok to say "You can't enforce static typing in a grammar like this"? Oh, and I should add somewhere that we are in fact talking about context-free grammars here... or should I add something like "unless your variable names can uniquely identify the types of the variables, and you never plan to allow user-defined types"?
mystran

Re:A compiler tutorial in progress...

Post by mystran »

Added some clarifications to the beginning of the parser part (after the lexer, that is) about what exactly a context-free grammar is, and fixed the "can't enforce static typing with a context-free grammar" bit.

edit: I think what I'll do next is split the different parts onto separate pages. This would also allow me to use subsections without having tons of different levels of headers, so I can structure the text a bit better ;)
zloba

Re:A compiler tutorial in progress...

Post by zloba »

Is it ok to say ...
that depends on what your target audience is, what they are looking for, on what level you want to explain the material, theoretical-vs-practical, rigorous-vs-simple... and how much effort you are willing to put into creating such a document.
which is what i wanted to ask.

as for me, although i am somewhat familiar with the whole process, i wouldn't mind reading a comprehensive overview. a fresh look at old things sometimes leads to better understanding and new creative ideas.

the way i'd like it is so that an idiot like myself can understand it. but that would be a lot of work.

i am trying to write a compiler. the scanner and parser are pretty much complete, as well as a sketchy code generator. i am trying to come up with a good object model for contexts, type information and a code generator.
mystran

Re:A compiler tutorial in progress...

Post by mystran »

zloba wrote: as for me, although i am somewhat familiar with the whole process, i wouldn't mind reading a comprehensive overview. a fresh look at old things sometimes leads to better understanding and new creative ideas.
I think my target audience is a hobbyist programmer who can write C, Python (or Lisp or some other high-level language..) and knows enough assembler to understand how programs work at a low level, but doesn't quite see the big picture of taking a source file and turning it into a binary.
the way i'd like it is so that an idiot like myself can understand it. but that would be a lot of work.
So be it. I think I'll first try to write some kind of basic tutorial, which includes enough to get a moderately experienced programmer to understand what to do, and then dumb it down on whatever parts are hardest to understand =)
i am trying to write a compiler. the scanner and parser are pretty much complete, as well as a sketchy code generator. i am trying to come up with a good object model for contexts, type information and a code generator.
Ok. I will go the route of first teaching how to write an interpreter, and then modifying that interpreter to actually generate assembler you can then throw at your stock assembler (like gas or NASM or whatever) and link with a small runtime (written in C, probably, to make it easy to provide glue between C code and our own). Or if I feel like it later, I might even show how to compile "on the fly", so that in the true spirit of Lisp a program can write new programs, call the compiler, and then call functions defined in the new code.

My plans include explaining basic stack usage, how to pass variable numbers of arguments (there are several ways, actually), and talking about some of the choices involved.

I will also include full closures (that is, a function returned from another function has access to the variables that existed in the local scope when it was defined) so as to enable a functional method for building object systems. Many high-level languages include closures, although at least Java requires one to declare variables "final" if they need to be saved.
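
A minimal Python sketch of that closure-based object idea (an illustration only, not from the planned tutorial): the returned functions keep access to the local variable balance, so together they behave like a small "object" with methods.

Code: Select all

# A closure captures 'balance'; the returned functions act as methods.
def make_account(balance):
    def deposit(amount):
        nonlocal balance
        balance += amount
        return balance
    def withdraw(amount):
        nonlocal balance
        balance -= amount
        return balance
    return {"deposit": deposit, "withdraw": withdraw}

acct = make_account(100)
print(acct["deposit"](50))     # 150
print(acct["withdraw"](30))    # 120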

I have no idea whether I will go into optimizations. There are lots of them to do. Especially with a dynamically typed language, you can save a lot of checks (including many runtime type checks) if you do some reasoning when compiling, or lower the amount of garbage generated... Some of this stuff I've done, some of it I know how to do, and some of it would require research on my part, so the future will tell how much I will include.