Parser Generator Tools
Posted: Fri Aug 07, 2015 11:03 pm
Hey guys!
I've made a set of insanely simple LALR parser generator tools that you can use in your projects. I'll go ahead and explain how you can use it.
The idea is to eventually rewrite the tools in my own language and integrate these tools into my OS, so it's self-hosting and self-compiling. I have yet to get there, but if you want to use the tools to do the same, be my guest!
See my topic: Self hosting OS/Compiler from scratch
It took a while to create the parser generator. I've been wanting to write one for a long time, and I finally got my head wrapped around it. It's not just any parser, I wrote one years ago. This one should be able to handle most any programming language you throw at it. It only took 20 hours or so to complete, after I understood it for the most part.
Here's the URL: https://github.com/joshwyant/ParserGenerator
So without further ado...
First, you create your terminals and nonterminals:
Then, you define your grammar:
The syntax to define the grammar is known as combinator style, which allows you to write a grammar directly in code, rather than having to pass a grammar file to a tool.
You're passing in the 4 system terminals and nonterminals to the base constructor (like "Start" and "Eof"), so the framework can handle those correctly.
Finally, you create a lexer. This one demonstrates the the typical complexity, except for whitespace:
This can be placed in its own file, but it is nested under your `MyGrammar` class, so you can use your `Terminal` and `Nonterminal` enums. You're overriding `GetLexer()` in the grammar to return your new fancy lexer.
For a full lexer, see: https://github.com/joshwyant/ParserGene ... e/Lexer.cs
Now, you're ready to use your parser:
... and that's it!
Now, I don't believe there are many tools like this. You can even save, load, cache, or JIT compile the parse table to a function or save it to an assembly.
How would you generate code from the abstract syntax tree?
Well, semantic analysis comes first. You will have a lot of junk from your grammar in the tree that doesn't make sense in terms of semantics. So, you'll create elements for your syntax tree:
You can extract information from abstract syntax tree nodes by flattening them out. The framework makes this ridiculously easy, allowing you to read and peek tokens that make up the node:
When you finally have your tree based on your semantic analysis, you can generate your output code using the visitor pattern. To show you how painfully simple that is, I will give you a full working example in less than 1k lines of code here. This comes from a scripting language that I wrote using a simpler parser, but the code generation method is the same. The principle is to have a subroutine for each node type, and recursively call the subroutines on the node's children to generate your machine code.
Well, there you have it. If only I knew how to do this a long time ago, if only I either had the code in front of me, or someone explained it to me in plain english.
Have a nice day!
Josh
EDIT:
A fluent interface is also supported for the grammar. I actually like it better:
See https://github.com/joshwyant/ParserGene ... Grammar.cs for a full example language similar to c# or Java.
I've made a set of insanely simple LALR parser generator tools that you can use in your projects. I'll go ahead and explain how you can use it.
The idea is to eventually rewrite the tools in my own language and integrate these tools into my OS, so it's self-hosting and self-compiling. I have yet to get there, but if you want to use the tools to do the same, be my guest!
See my topic: Self hosting OS/Compiler from scratch
It took a while to create the parser generator. I've been wanting to write one for a long time, and I finally got my head wrapped around it. It's not just any parser, I wrote one years ago. This one should be able to handle most any programming language you throw at it. It only took 20 hours or so to complete, after I understood it for the most part.
Here's the URL: https://github.com/joshwyant/ParserGenerator
So without further ado...
First, you create your terminals and nonterminals:
Code: Select all
enum Terminal {
Unknown, Ident, Var, Equals, DoubleEquals, Eof
}
enum Nonterminal {
Init, Start, Expression, Factor
}
Code: Select all
partial class MyGrammar : LALRGrammar<Terminal, Nonterminal> {
public MyGrammar() : base(Unknown, Eof, Init, Start) {
var start = DefineProduction(Start);
var expression = DefineProduction(Expression);
var factor = DefineProduction(Factor);
ProductionRule t;
t = start % expression;
t = expression % Expression / Equals / Factor;
t = expression % Expression / DoubleEquals / Factor;
t = expression % Factor;
t = factor % Ident;
}
}
You're passing in the 4 system terminals and nonterminals to the base constructor (like "Start" and "Eof"), so the framework can handle those correctly.
Finally, you create a lexer. This one demonstrates the the typical complexity, except for whitespace:
Code: Select all
partial class MyGrammar {
public class Lexer : LexerBase {
Dictionary<string, Terminal> ReservedWords
= new Dictionary<string, Terminal> { "var", Var };
public Lexer(TextReader reader) : base(reader) { }
protected override IEnumerable<Token> Lex() {
while (Reader.HasNext()) {
var c = Reader.Peek();
if (char.IsLetter(c) || c == "_") {
while (char.IsLetterOrDigit(Reader.Peek()) || Reader.Peek() == "_")
Reader.Read();
if (ReservedWords.Contains(...))
yield return new Token(...);
else
yield return new Token(Ident);
}
else {
switch (Reader.Read()) {
case '=':
if (Reader.Peek() == '=') {
Reader.Read(); yield return new Token(DoubleEquals);
}
else yield return new Token(Equals);
break;
// ...
default:
yield return new Token(Terminal.Unknown);
break;
}
}
}
yield return new Token(Eof);
}
}
public override Lexer GetLexer(TextReader reader) {
return new Lexer(reader);
}
}
For a full lexer, see: https://github.com/joshwyant/ParserGene ... e/Lexer.cs
Now, you're ready to use your parser:
Code: Select all
var grammar = new MyGrammar();
var abstractSyntaxTree = grammar.Parse(".......");
Now, I don't believe there are many tools like this. You can even save, load, cache, or JIT compile the parse table to a function or save it to an assembly.
How would you generate code from the abstract syntax tree?
Well, semantic analysis comes first. You will have a lot of junk from your grammar in the tree that doesn't make sense in terms of semantics. So, you'll create elements for your syntax tree:
Code: Select all
class FullyQualifiedName {
public string[] NameParts;
}
class IfStatement : Statement {
public Expression Condition;
public Statement Body;
public Statement Else;
}
Code: Select all
var tokens = astNode.Flatten();
if (tokens.Peek() == If)
return scanIf(tokens);
...
Well, there you have it. If only I knew how to do this a long time ago, if only I either had the code in front of me, or someone explained it to me in plain english.
Have a nice day!
Josh
EDIT:
A fluent interface is also supported for the grammar. I actually like it better:
Code: Select all
partial class MyGrammar : LALRGrammar<Terminal, Nonterminal> {
public MyGrammar() : base(Unknown, Eof, Init, Start)
{
DefineProduction(Start)
.As(Expression);
DefineProduction(Expression)
.As(Expression, Equals, Factor)
.Or(Expression, DoubleEquals, Factor)
.Or(Factor);
DefineProduction(Factor)
.As(Ident);
}
}