Categories: parsing
Parsing
Introduction
I have been interested in various aspects of parsing since 2009. My research is particularly focused on parsing general context-free grammars, using combinator parsers. I have published several papers, formalized various proofs about parsers in the HOL4 theorem prover, and produced quite a few OCaml implementations to try out various ideas.
P0 is a very simple combinator library, a sort-of minimal basis for combinator parsing.
P1 is a combinator parsing library for all context-free grammars (CFGs). It resembles traditional combinator-parsing libraries (both in interface, and in performance) with the additional feature that it can handle arbitrary CFGs. The performance with highly ambiguous left-recursive grammars is not optimal, but can be surprisingly good. For example, on the grammar
E -> E E E | "1" | eps
, P1 outperforms the Haskell Happy parser generator. The implementation is based on the verified parsing work (see below).tjr_simple_earley is my current favourite Earley parser implementation (in OCaml). A Python version is also available from github.
Verified parsing
This work produced a verified, sound and complete approach to combinator parsing. This introduced the idea of “good parse trees”. The problem with this approach is that it is too inefficient with left-recursive grammars. There is a post describing verified parsing here.
Deprecated
These are projects which I have largely abandoned:
P3 is a combinator parsing library that can parse all CFGs, whilst offering O(n3) performance (which I consider optimal for real-world usage: there are approaches that can do better than O(n3), but the real-world overhead is currently unacceptable, and the implementation complexity is high).
E3 is an Earley parser which is used by P3.
P4 is a combinator parsing library that improves on P3 in 2 main ways:
- The library user has tight control over the parsing, allowing better real-world performance than P3.
- Various extensions are supported that allow to go beyond context-free grammars. For example, indentation-sensitive parsing is possible, based on allowing infinitely-many nonterminals in the grammar.
The downside is that the interface to P4 is more complicated than that for P3.
Older links are:
- P3: New versions of P3 are now available from github at https://github.com/tomjridge/p3. At the moment the documentation is on the github wiki for P3 at https://github.com/tomjridge/p3/wiki/P3.
The best documentation for P3 internals is the extended version of the following paper: Tom Ridge. Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle. In Proceedings of SLE 2014. ridge14sle.pdf extended version.
E3 is available from github at https://github.com/tomjridge/e3. The examples should give some guidance on usage.
P4 is available from github at https://github.com/tomjridge/p4, but there is no documentation at the moment. It was a version of P3 with support for dynamically generated grammars. For example, you could parse a grammar of the form
G_n -> <the number n> | <the number n> G_{n+1}
:
(* _G 1 parses "1", or "1" followed by _G 2; in general, _G n parses
the number n, optionally followed by _G (n+1); there are an infinite
number of nonterminals (_G n) generated lazily on demand *)
let _G = ref (fun n -> failwith "")
let _ = _G := fun n ->
let an = a (string_of_int n) in
let alts = lazy(alts[
(rhs an) >> (fun s -> c s);
(an >-- (!_G (n+1))) >> (fun (x,y) -> (c x)^y)])
in
mkntparser_lazy (mk_pre_parser ()) alts
P5_scala
is a version of P4 for Scala, available at https://github.com/tomjridge/p5_scala. Currently not in a very polished state. Mostly useful for me to try out Scala.
Related posts:
- 2018-06-14 A typed DSL for parsing
- 2018-05-22 First Python program: an Earley parser!
- 2017-11-14 New OCaml parsing algorithm: tjr_simple_earley
- 2017-09-17 Two new OCaml libraries: P0 and tjr-csv
- 2016-02-19 Tree-structured text
- 2016-02-09 Simple implementation of an Earley-like parsing algorithm
- 2015-06-26 P5 scala parsing library
- 2014-12-19 Parsing the IMAP protocol
- 2014-12-04 Parsing examples
- 2014-11-21 Talk on parsing at the University of Sussex
- 2014-09-26 P1 combinator parsing library for OCaml
- 2014-09-26 E3 earley parser library for OCaml
- 2014-09-18 SLE 2014 conference, and Parsing at SLE workshop, slides
- 2014-09-07 ICFP 2014, OCaml workshop, slides and video
- 2014-07-11 P3 paper accepted for SLE 2014
- 2014-04-15 New release of P3 code on github
- 2014-03-02 New release of P3 code on github
- 2013-12-16 New release of P3 code on github
- 2013-12-03 Implementing algorithms efficiently
- 2013-11-08 Talk on parsing and P3 given at Cambridge
- 2011-12-01 Verified parsing