Pet Project: A PEG Packrat Parser

2008-01-14

I finally finished a pet project that I started in 1999 in Haskell and then ported it to C# at Easter last year.

The original motivation was to create a internal DSL to specify a parser directly from a syntax specification. I always felt that there must be a much more versatile solution to parsing lexical and syntactic grammars. The solution provided by the parser generators available never felt aesthetically appealing enough to be used for simple projects.
I also never understood that the lexical scanning must be separated from the syntactical parsing.

Because of their simplicity, EBNF and relatives embrace a very powerful expressiveness, but why do I need to care for the implementation and performance? That's what tools are for! As soon you want to pass your beautiful simple grammar to the parser generator, the sanity ceases and acronyms like LALR or LR() and shift reduce / reduce reduce conflicts pop up and many more details you've never heard of. Suddenly, you feel that you missed something, and it turns out that the answer for all your problems may be found by reading the dragon book, which essentially praises you to follow the dots and makes you feel that parsing is by ways too complicated. Whow! I just wanted to create an INI-Parser so that it is somehow reusable. This time again, you'll implement it by hand in an hour or two, and so the world gets another trashy INI parser more.

May be it is about premature optimization. But when time passes by and Moore's Law helps us to do anything we want as long we stay in the O(n) range, age old ideas are hype again and we finally can waste a lot of computing power, to, ... yes, say parsing.

Now there are Packrat parsers, which essentially are pretty simple to implement as recursive descending parser with memoization. They support unlimited lookahead, but the effective performance is linear. They use PEG's as input, which are pretty simple grammar specifications with no ambiguities and much less usability constraints, meaning that the grammar and its interpretation is a intuitive process which makes spotting conflicts and fixing errors much more straighforward. So why don't give it a try.

Implementing a Packrat parser seems to be trivial at first, but to make such a parser generator actually useful, some puzzles need to be solved:

Left Recursion

Grammers used in specifications usually specify left recursive clauses like

Digits: 
  Digits Digit
  Digit

which effectively cannot be handled by a recursive decent parser. If so, it would loop forever and consume all stack space. I saw the following options to support left recursions

Disallowing left recursive grammars would imply that some grammars need to be heavily rewritten by hand, this was an option that would reduce the usability of a parser a lot and so I considered it to be the worst option.

Automatically rewriting the grammar is possible, but may be inconvenient and actually it may be difficult to result the same parse tree then. I considered this as an option that may have a significant impact when it comes to creating the actual parse tree.

So I tried the best option in terms of usability: To natively support left recursive grammars. After some research, I stumbled over Katahdin. This really interesting project made me confidence that this problem could be solved.

The final solution turned out to be pretty simple:

A left recursion can be detected by comparing the input element offset of a production "A" that is currently in evaluation with any input element offset of a dependency to production "A". If they are equal, a left recursion occurred.

When a left recursion happens the first time, the dependency is considered as failed, so the evaluation continues and returns. If the result is positive (i.e. input characters were matched), this match can be stored for all future references to that production. After that, if the left recursion is detected again, the previous match is reused and a new result can be constructed (using the example above, the first iteration would resolve to Digit and the second to Digit Digit). The evaluation of this production needs then to be repeated as long a better match is found.

The intermediate results may be stored in separate memoization table, otherwise these values may conflict when other recursions are resolved without using the CPU stack (see below).

Regular and right recursion

Right recursion and regular recursion (recursions that are not left nor right), are usually not implemented explicitly: The parser simply recurses and descents into the grammar, but depending on the input grammar and the input that is fed to the parser, this may lead to stack overflows, which I consider as a limitation of such a parser.

So, like for the left recursion implementation, I wanted no recursion at all. I finally defined one basic rule: A grammar production can never recurse. If actually does, it must be resolved by using heap space instead of stack space.

There were some ideas to use coroutines for implementing all the code that specifies the sub rules (i.e. like oneOf or lookaheadNotOf), but this would break the simplicity of the rule's implementation, especially when an imperative language is used. So the idea was to detect this kind of recursion and to handle it. I did the obvious:

Recursions are not allowed at all, if a left recursion is detected, it fails or returns a previous result of that production. If a regular recursion is detected, it shall fail completely.

But what to do on a recursion that fails, for Example:

Expression: 
  AssignmentExpression
  "(" Expression ")"

Here an input of "(x=2)" would force the production at the offset 1: "(" . Expression ")" to fail (usually a dot is used to indicate the current input offset inside the grammar).

I learned, that if a parser that does not support recursions at all, there can be only one failure caused by a recursion at a time, so it is sufficient to only set one global flag to indicate a recursion production failure. Now, all further rule evaluations are considered failed until the result is returned to the topmost "Expression" evaluation (well, there is only one). At this point, back at "Expression", we know the offset where the recursion actually occurred and can then side-step to reevaluate the sub "Expression". After that, we restart the original evaluation at the initial offset. Luckily restarting - though it seems first - is not costly, because memoization already stored all intermediate results until the recursion failed previously (here the match of "("), so the second iteration (after offset 1 has been matched), will jump directly to the point of failure and continues from there.

Performance

Though PEG parsers can be implemented so that they require linear parsing time, they are usually very slow when compared to handwritten or table parsers. My first measurements were in the range of ~300 to ~400 input elements per millisecond.

Flexibility

I wanted to be able to use the parser on every input stream, not for merely simple characters. Every ordered collection of elements should be able to be processed by an appropriate grammar, so the parser itself and all the rules are implemented as generic types.

Usability

To be able to use any grammar for the parser as input, I created a small internal DSL that matches the JavaScript grammar specification as closely as possible. See here for an example.

Real World

This pet parser is able to parse JavaScript with limitations (no regular expressions, no auto-semicolon insertions) with the lexical and syntactical grammar as specified in the ECMA-262 specification. Some clauses of the grammar had to be changed to be prioritized properly (PEG matches always the first clause), the required changes were always pretty obvious and simple to implement. Additionally, for performance and readability reasons, I changed some obvious left-recursions to the PEG star-operator.

The parser is not yet able to generate parse trees. Parse trees are required to make any use of the parser, ... instead of producing merely heat and lots of debugging output. In the future, I may try to create an extension framework that supports the specification of annotations alongside the grammar, so that the parser generator is able to generate the types and production methods that are required to create the final parse tree, automatically.

Further Reading

Packrat Parsing Bryan Ford, Master's Thesis: a Practical Linear-Time Algorithm with Backtracking. Katahdin A programming language where the syntax and semantics are mutable at runtime. Boost.Spirit An object oriented recursive descent parser framework. YARD: Yet another recursive descent parser.

yours armin

Update: I actually implemented the parsing algorithm by using coroutines supported in C# (yes it does!, using the yield keyword, I will post an article about this later). First results showed, that the performance dropped around 5 times. Sadly, I have no time (or say head-space :) to find out why. May be later this year.

Update 2: Below you can download the source code of the unfinished project.

Update 3: I've moved the sources over to Github.