Well, there really isn't any 'simple' way to make a compiler, but for what you're looking for, you want to look into one of two directions: either using a hand-coded recursive-descent parser with inline code generation like the classic
Small C compiler, or else using a lexical-analyzer generator and a parser generator, and add you own semantic analysis and code generation to that. And if you don't understand what that means, give me a moment and I'll explain.
Compilers basically do two things, analyze code for correctness and meaning, and generate the target code based on that meaning. In modern compiler theory, these are usually broken down into seven stages: Lexical analysis, parsing, semantic analysis, intermediate representation generation, global optimization, code generation, and final (or 'peephole') optimization. ignoring IR generation and the optimization steps (which are optional), you have:
Lexical analysis (or tokenizing) - reading in the character stream and breaking it into a stream of tokens. For example, a lexical analyzer for a Pascal compiler (an easier language to use as an example than C) might read the following line of code
and return the tokens 'foo', ':=', 'bar', 'DIV', '12' and ';'.
This stream would then be used by the next stage. There are a number of lexical analyser generators, the most commonly used being
FLEX.
Parsing - This means determining the syntactic structure of the statements, usually representing them as a tree structure. For the sequence above, it might parse it as:
Code: Select all
<statement>
/ \
<assignment> <divider>
/ | \ \
<variable> <assign op> \ ';'
/ | \
'foo' ':=' <factor>
/ | \
<variable> <mul op> <unsigned constant>
/ | \
'bar' 'DIV' '12'
The syntax itself is most often defined in terms of a context-free grammar, often using
Backus-Naur Form. As a simplified example for a Pascal assignment statement:
Code: Select all
<statement> ::= [<assignment>|<function call>]<divider>
<divider> ::= ';'
<assignment> ::= <variable> <assign op> <expression>
<assign op> ::= ':='
<expression> ::= <simple expression> [<rel op> <simple expression>]
<rel op> ::= '=' | '<' | '<=' | '<>' | '>=' | '>'
<simple expression> ::= [<sign>]<term>
| <simple expression> <add op> <term>
<sign> ::= '+' | '-'
<add op> ::= '-' | '+' | 'OR'
<term> ::= <factor>
| <term> <mul op> <factor>
<mul op> ::= '*' | '/' | 'DIV' | 'MOD' | 'AND'
<factor> ::= <variable>
| <unsigned constant>
| <function call>
| 'NOT' <factor>
| '(' <expression> ')'
(example adapted from
Interpreters and Compilers by Ronald Mak). The grammar symbols which end in actual tokens (i.e., '+') are called terminal symbols; those which are defined in terms of other gramar symbols (or themselves) are called non-terminal symbols.
There are a variety of different
context-free parsing methods, such as
LALR and
LL(0). Different parsing algorithms require different grammars. The
recursive descent method of parsing generally uses LL grammars (
EDIT:not LALR as I originally wrote).
In addition to finding the structure of the program, parsing also serves to determine if the program is well-formed, that is, if it has any syntax errors in it. A parser should be able to determine that a given expression is ungrammatical, and return an error warning.
There exist standard tools for [url=http://Parser%20generator]generating parsers[/url], such as
YACC,
Bison,
ANTLR, and
Cup. It is also possible to write a parser by hand; while there are several methods that can be used, the simplest is one called 'recursive descent', in which the parser has a function for each non-terminal symbol in the grammar, which is called as the parser reaches a section of code that matches that symbol. Recursive descent parser are easy to write, but are often hard to modify, as they usually mix parsing together with semantic analysis and code generation (though they don't have to necessarily).