Date: 2011-12-01
Categories: software; parsing
Verified parsing
The following post describes work in 2010/2011 on verified parsing. It includes a verification (in HOL4) of a sound and complete combinator parsing library.
Verified parsing
During 2010/2011 I developed a verified parser for all context free languages. The main features of my approach are:
-
The approach is verified correct i.e. no bugs
-
The parser generator accepts all context free languages i.e. no messing around with your grammars anymore. This also helps a lot when writing actions.
-
The approach is complete i.e. you get all the possible parse trees. There is a slight wrinkle here: some grammars e.g.
E -> E | "a"
give rise to infinite numbers of parse trees on certain inputs. In this case, you have the choice of either a finite subset of the parse trees which represents the equivalence classes of all trees under some tree reduction process, or alternatively you can opt for so-called "parse results" as used in e.g. Happy. Parse results also represent the infinite class of parse trees in some sense. -
The approach is based on recursive descent parsing and is very simple. This makes it easy to (re-)implement, easy to reason about, and easy to adapt. It should be very easy to implement this in other languages that support basic functional programming idioms.
The other thing you should consider is how fast you need your parser to be. For highly ambiguous grammars, this approach seems competitive with Happy etc., but for very simple grammars, more specialized technology will probably perform better (as you should expect).
Cool example
A cool-but-stupid example that I quite like is the following:
{{
(* A very stupid way to count the length of the input. *)
}}
START -> E ?ws? ?EOF? {{ fun (i,_) -> print_string ((string_of_int i)^"\n"); i }}
E -> E E E {{ fun (x,(y,z)) -> x+y+z }}
| "a" {{ fun _ -> 1 }}
| "" {{ fun _ -> 0 }}
This grammar is examples/actions/length.g
in the supporting material. You can use the parser generator to produce a parser length.native
. You can then run it as follows:
examples/actions$ ./length.native -f length.txt
Here, length.txt
is a file containing 30 consecutive 'a' characters. Note that this grammar is extremely ambiguous. Even so, on my relatively modest laptop this returns a single result 30
in half a second. I don't know any other parser generator that can do anything like this!
Paper
The paper on all this is
Tom Ridge. Simple, functional, sound and complete parsing for all context-free grammars. In Jean Pierre Jouannaud and Zhong Shao, editors, Certified Programs and Proofs - First International Conference, CPP 2011, Kenting, Taiwan, December 7-9, 2011. Proceedings, volume 7086 of Lecture Notes in Computer Science, pages 103--118. Springer, 2011.
A pdf version can be found here.
Guide to supporting material
There is quite a lot of supporting material.
-
The HOL formal verification. The code was verified using higher-order logic in the HOL4 theorem prover. There is a proof script that you can run through HOL4 to convince yourself that everything is OK. Let me know if anything stops working (I try to track HOL4 changes from Michael Norrish's git repo, but it should work with any reasonably recent release as well).
Actually, the proof script is not ideal, because HOL4 forces a certain ordering of proofs, which is rather unnatural for the average reader (well, it is certainly unnatural for me). Previous versions used
CHEAT_TAC
to reorder things, but people don't likeCHEAT_TAC
. So the current version has noCHEAT_TAC
but is rather convoluted. -
The OCaml code you can actually compile and run. This includes many different variations, and is still very much a work in progress. It needs a bit of polishing before I would be happy for others to try it out, but if you do try it out, val me know how you get on.
-
Lots of experiments, looking particularly at performance.
The supporting material is https://github.com/tomjridge/2012-11-29_parsing.
Related posts:
- 2020-05-22 A simple MCQ test program, in OCaml, compiled to JavaScript
- 2019-08-30 B-tree random write performance
- 2019-08-21 ML'19 Workshop at ICFP: A key-value store for OCaml
- 2018-07-09 Backup tools
- 2018-06-14 A typed DSL for parsing
- 2018-05-22 First Python program: an Earley parser!
- 2018-02-01 New OCaml library: path resolution
- 2017-11-14 New OCaml parsing algorithm: tjr_simple_earley
- 2017-09-17 Two new OCaml libraries: P0 and tjr-csv
- 2017-03-16 tjr-btree: a CoW B-tree library in OCaml
- 2016-11-17 OCaml string functions
- 2016-02-19 Tree-structured text
- 2016-02-09 Simple implementation of an Earley-like parsing algorithm
- 2015-06-26 P5 scala parsing library
- 2014-12-19 Parsing the IMAP protocol
- 2014-12-04 Parsing examples
- 2014-11-21 Talk on parsing at the University of Sussex
- 2014-09-26 P1 combinator parsing library for OCaml
- 2014-09-26 E3 earley parser library for OCaml
- 2014-09-18 SLE 2014 conference, and Parsing at SLE workshop, slides
- 2014-09-07 ICFP 2014, OCaml workshop, slides and video
- 2014-07-11 P3 paper accepted for SLE 2014
- 2014-04-15 New release of P3 code on github
- 2014-03-02 New release of P3 code on github
- 2013-12-16 New release of P3 code on github
- 2013-12-03 Implementing algorithms efficiently
- 2013-11-24 Experience of using Lem
- 2013-11-08 Talk on parsing and P3 given at Cambridge
- 2011-12-01 Verified parsing