find simple codes of compiler

Programming, for all ages and all languages.
Post Reply
lokii

find simple codes of compiler

Post by lokii »

I am interest in compiler principles,and
want to read some simple codes of a compiler.
In order to know how to implement it.
how could i find simple codes of compiler.
The simpler ,the better. :)
Schol-R-LEA

Basic compiler information

Post by Schol-R-LEA »

Well, there really isn't any 'simple' way to make a compiler, but for what you're looking for, you want to look into one of two directions: either using a hand-coded recursive-descent parser with inline code generation like the classic Small C compiler, or else using a lexical-analyzer generator and a parser generator, and add you own semantic analysis and code generation to that. And if you don't understand what that means, give me a moment and I'll explain.

Compilers basically do two things, analyze code for correctness and meaning, and generate the target code based on that meaning. In modern compiler theory, these are usually broken down into seven stages: Lexical analysis, parsing, semantic analysis, intermediate representation generation, global optimization, code generation, and final (or 'peephole') optimization. ignoring IR generation and the optimization steps (which are optional), you have:

Lexical analysis (or tokenizing) - reading in the character stream and breaking it into a stream of tokens. For example, a lexical analyzer for a Pascal compiler (an easier language to use as an example than C) might read the following line of code

Code: Select all

foo:=bar DIV 12;
and return the tokens 'foo', ':=', 'bar', 'DIV', '12' and ';'.

This stream would then be used by the next stage. There are a number of lexical analyser generators, the most commonly used being FLEX.

Parsing - This means determining the syntactic structure of the statements, usually representing them as a tree structure. For the sequence above, it might parse it as:

Code: Select all

                  <statement>
                 /           \     
          <assignment>      <divider> 
        /      |       \           \
<variable>  <assign op>   \           ';'
   /           |              \
'foo'        ':='        <factor>
                       /    |     \
              <variable> <mul op>  <unsigned constant>
                /           |           \
               'bar'      'DIV'        '12'

The syntax itself is most often defined in terms of a context-free grammar, often using Backus-Naur Form. As a simplified example for a Pascal assignment statement:

Code: Select all

<statement> ::= [<assignment>|<function call>]<divider>

<divider> ::= ';'

<assignment> ::= <variable> <assign op> <expression> 

<assign op> ::= ':=' 

<expression> ::= <simple expression> [<rel op> <simple expression>]

<rel op> ::= '=' | '<' | '<=' | '<>'  | '>=' | '>'

<simple expression> ::= [<sign>]<term>
                        | <simple expression> <add op> <term>
<sign> ::= '+' | '-'

<add op> ::= '-' | '+' | 'OR'

<term> ::= <factor>
           | <term> <mul op> <factor>

<mul op> ::= '*' | '/' | 'DIV' | 'MOD' | 'AND'

<factor> ::= <variable> 
             | <unsigned constant>
             | <function call>
             | 'NOT' <factor>
             | '(' <expression> ')'
(example adapted from Interpreters and Compilers by Ronald Mak). The grammar symbols which end in actual tokens (i.e., '+') are called terminal symbols; those which are defined in terms of other gramar symbols (or themselves) are called non-terminal symbols.

There are a variety of different context-free parsing methods, such as LALR and LL(0). Different parsing algorithms require different grammars. The recursive descent method of parsing generally uses LL grammars (EDIT:not LALR as I originally wrote).

In addition to finding the structure of the program, parsing also serves to determine if the program is well-formed, that is, if it has any syntax errors in it. A parser should be able to determine that a given expression is ungrammatical, and return an error warning.

There exist standard tools for [url=http://Parser%20generator]generating parsers[/url], such as YACC, Bison, ANTLR, and Cup. It is also possible to write a parser by hand; while there are several methods that can be used, the simplest is one called 'recursive descent', in which the parser has a function for each non-terminal symbol in the grammar, which is called as the parser reaches a section of code that matches that symbol. Recursive descent parser are easy to write, but are often hard to modify, as they usually mix parsing together with semantic analysis and code generation (though they don't have to necessarily).
Schol-R-LEA

Basic compiler information, part 2

Post by Schol-R-LEA »

Semantic Analysis - the process of going through the parse tree creating in parsing and determining the actual meaning or content of the program. This is a considerably more complex task than tokenizing or parsing, and generally must be coded by hand.

Code Generation - The compiler must take the results of the semantic analysis and generate the equivalent program in the target language, usually the assembly language or object code format of a given processor and OS. Again, it is usually necessary to write this by hand, though if you generated an intermediate representation between the semantic analysis and the code generation, you can separate the two parts of the compiler so that you can use different code generators with the same parser, or vis versa.

For a simplfied example of a recursive descent parser, see Crenshaw's tutorial Let's Make A Compiler. Other tutorials can be found with an appropriate Google search. You may want to take a look at the Free Compiler and Interpreter List for example code as well.
Other tutorials or resources:
Programmer's Heaven Compiler Source Code Page
Compiler Tutorial
Another Compiler Tutorial with bibliography
a partial LALR grammar for C++
A LALR Grammar for Java
Schol-R-LEA

Re:find simple codes of compiler

Post by Schol-R-LEA »

I've also found this online book, Complier Design With Flex and Bison. I've also located a YACC grammar for C, as well. LL grammars suitable for RDP seem to be a lot less common tha LALR grammars, as LALR parser generators apparently were perfected earlier than LL ones; furthermore, C apparently cannot be completely described by an LL grammar, and using one to implement it requires a good deal of extra work in the semantic analysis phase. Some theoretical information on parsing can be found here.

Oh, and I found some interesting threads in the comp.compilers archives that you might want to look at... I make no guarantees about accuracy, however. Similarly, I found this class handout page which could be of some use as well.

Finally, I would recommend reading up on finite state machines, as most lexical analyzers (especially automatically generated ones) use Deterministic Finite Automata to recognize valid tokens.

By the way, I was wondering what language you were thinking of implementing, and what language you planned to implement it in. Just curious.
lokii

Re:find simple codes of compiler

Post by lokii »

(Because English is not my national language,what I write may take some mistake.)

In fact ,I have a Compiler Principle course this term. In the end of this course ,we must finish the course design .The assignment is to modify a compiler which called PL0. So i want to know more about the compiler implement.

I like C language .If i have enough time , I may try to implement it. Just for fun and enhance my programming skill.

Thank you very much . :)
Post Reply