Categories: parsing

I have been interested in various aspects of parsing since 2009. My research is particularly focused on parsing general context-free grammars, using combinator parsers. I have published several papers, formalized various proofs about parsers in the HOL4 theorem prover, and produced quite a few OCaml implementations to try out various ideas:

P1 is a combinator parsing library for all context-free grammars (CFGs). It resembles traditional combinator-parsing libraries (both in interface, and in performance) with the additional feature that it can handle arbitrary CFGs. The performance with highly ambiguous left-recursive grammars is not optimal, but can be surprisingly good. For example, on the grammar

`E -> E E E | "1" | eps`

, P1 outperforms the Haskell Happy parser generator.P3 is a combinator parsing library that can parse all CFGs, whilst offering

*O*(*n*^{3}) performance (which I consider optimal for real-world usage: there are approaches that can do better than*O*(*n*^{3}), but the real-world overhead is currently unacceptable, and the implementation complexity is high).E3 is an Earley parser which is used by P3.

- P4 is a combinator parsing library that improves on P3 in 2 main ways:
- The library user has tight control over the parsing, allowing better real-world performance than P3.
- Various extensions are supported that allow to go beyond context-free grammars. For example, indentation-sensitive parsing is possible, based on allowing infinitely-many nonterminals in the grammar.

The downside is that the interface to P4 is more complicated than that for P3.

P1, P3, E3 and P4 can all be obtained from github.

P1 is a combinator parsing library for all CFGs. The implementation is based on the verified parsing work (see below).

- P1 can be obtained from https://github.com/tomjridge/p1.
- Documentation
- ocamldoc

New versions of P3 are now available from github at https://github.com/tomjridge/p3. At the moment the documentation is on the github wiki for P3 at https://github.com/tomjridge/p3/wiki/P3.

The best documentation for P3 internals is the extended version of the following paper: Tom Ridge. Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle. In Proceedings of SLE 2014. ridge14sle.pdf extended version.

E3 is available from github at https://github.com/tomjridge/e3. The examples should give some guidance on usage.

P4 is available from github at https://github.com/tomjridge/p4, but there is no documentation at the moment.

`P5_scala`

P5 is a version of P4 for Scala, available at https://github.com/tomjridge/p5_scala. Currently not in a very polished state.

Some examples are at https://github.com/tomjridge/example_grammars/ ; see the blog post

This work produced a verified, sound and complete approach to combinator parsing. This introduced the idea of "good parse trees". The problem with this approach is that it is too inefficient with left-recursive grammars. There is a post describing verified parsing here.

Related articles: