Page 1 of 1

Parser Generator Tools

Posted: Fri Aug 07, 2015 11:03 pm
by Joshw
Hey guys!

I've made a set of insanely simple LALR parser generator tools that you can use in your projects. I'll go ahead and explain how you can use it.

The idea is to eventually rewrite the tools in my own language and integrate these tools into my OS, so it's self-hosting and self-compiling. I have yet to get there, but if you want to use the tools to do the same, be my guest!
See my topic: Self hosting OS/Compiler from scratch

It took a while to create the parser generator. I've been wanting to write one for a long time, and I finally got my head wrapped around it. It's not just any parser, I wrote one years ago. This one should be able to handle most any programming language you throw at it. It only took 20 hours or so to complete, after I understood it for the most part.

Here's the URL: https://github.com/joshwyant/ParserGenerator

So without further ado...

First, you create your terminals and nonterminals:

Code: Select all

enum Terminal {
  Unknown, Ident, Var, Equals, DoubleEquals, Eof
}
enum Nonterminal {
  Init, Start, Expression, Factor
}
Then, you define your grammar:

Code: Select all

partial class MyGrammar : LALRGrammar<Terminal, Nonterminal> {
  public MyGrammar() : base(Unknown, Eof, Init, Start) {
    var start = DefineProduction(Start);
    var expression = DefineProduction(Expression);
    var factor = DefineProduction(Factor);

    ProductionRule t;
    t = start % expression;
    t = expression % Expression / Equals / Factor;
    t = expression % Expression / DoubleEquals / Factor;
    t = expression % Factor;
    t = factor % Ident;
  }
}
The syntax to define the grammar is known as combinator style, which allows you to write a grammar directly in code, rather than having to pass a grammar file to a tool.
You're passing in the 4 system terminals and nonterminals to the base constructor (like "Start" and "Eof"), so the framework can handle those correctly.

Finally, you create a lexer. This one demonstrates the the typical complexity, except for whitespace:

Code: Select all

partial class MyGrammar {
  public class Lexer : LexerBase {
    Dictionary<string, Terminal> ReservedWords 
      = new Dictionary<string, Terminal> { "var", Var };

    public Lexer(TextReader reader) : base(reader) { }

    protected override IEnumerable<Token> Lex() {
      while (Reader.HasNext()) {
        var c = Reader.Peek();
        if (char.IsLetter(c) || c == "_") {
          while (char.IsLetterOrDigit(Reader.Peek()) || Reader.Peek() == "_")
            Reader.Read();
          if (ReservedWords.Contains(...))
            yield return new Token(...);
          else
            yield return new Token(Ident);
        }
        else {
          switch (Reader.Read()) {
          case '=':
            if (Reader.Peek() == '=') {
              Reader.Read(); yield return new Token(DoubleEquals);
            }
            else yield return new Token(Equals);
            break;
          // ...
          default:
            yield return new Token(Terminal.Unknown);
            break;
          }
        }
      }
      yield return new Token(Eof);
    }
  }

  public override Lexer GetLexer(TextReader reader) {
    return new Lexer(reader);
  }
}
This can be placed in its own file, but it is nested under your `MyGrammar` class, so you can use your `Terminal` and `Nonterminal` enums. You're overriding `GetLexer()` in the grammar to return your new fancy lexer.
For a full lexer, see: https://github.com/joshwyant/ParserGene ... e/Lexer.cs

Now, you're ready to use your parser:

Code: Select all

var grammar = new MyGrammar();
var abstractSyntaxTree = grammar.Parse(".......");
... and that's it!

Now, I don't believe there are many tools like this. You can even save, load, cache, or JIT compile the parse table to a function or save it to an assembly.

How would you generate code from the abstract syntax tree?
Well, semantic analysis comes first. You will have a lot of junk from your grammar in the tree that doesn't make sense in terms of semantics. So, you'll create elements for your syntax tree:

Code: Select all

class FullyQualifiedName {
  public string[] NameParts;
}
class IfStatement : Statement {
  public Expression Condition;
  public Statement Body;
  public Statement Else;
}
You can extract information from abstract syntax tree nodes by flattening them out. The framework makes this ridiculously easy, allowing you to read and peek tokens that make up the node:

Code: Select all

var tokens = astNode.Flatten();
if (tokens.Peek() == If)
  return scanIf(tokens);
...
When you finally have your tree based on your semantic analysis, you can generate your output code using the visitor pattern. To show you how painfully simple that is, I will give you a full working example in less than 1k lines of code here. This comes from a scripting language that I wrote using a simpler parser, but the code generation method is the same. The principle is to have a subroutine for each node type, and recursively call the subroutines on the node's children to generate your machine code.

Well, there you have it. If only I knew how to do this a long time ago, if only I either had the code in front of me, or someone explained it to me in plain english.

Have a nice day! :D

Josh

EDIT:
A fluent interface is also supported for the grammar. I actually like it better:

Code: Select all

partial class MyGrammar : LALRGrammar<Terminal, Nonterminal> {
  public MyGrammar() : base(Unknown, Eof, Init, Start) 
  {
    DefineProduction(Start)
      .As(Expression);

    DefineProduction(Expression)
      .As(Expression, Equals, Factor)
      .Or(Expression, DoubleEquals, Factor)
      .Or(Factor);

    DefineProduction(Factor)
      .As(Ident);
  }
}
See https://github.com/joshwyant/ParserGene ... Grammar.cs for a full example language similar to c# or Java.