The following post describes work in 2010/2011 on verified parsing. It includes a verification (in HOL4) of a sound and complete combinator parsing library.
During 2010/2011 I developed a verified parser for all context free languages. The main features of my approach are:
The approach is verified correct i.e. no bugs
The parser generator accepts all context free languages i.e. no messing around with your grammars anymore. This also helps a lot when writing actions.
The approach is complete i.e. you get all the possible parse trees. There is a slight wrinkle here: some grammars e.g. E -> E | "a"
give rise to infinite numbers of parse trees on certain inputs. In this case, you have the choice of either a finite subset of the parse trees which represents the equivalence classes of all trees under some tree reduction process, or alternatively you can opt for so-called "parse results" as used in e.g. Happy. Parse results also represent the infinite class of parse trees in some sense.
The approach is based on recursive descent parsing and is very simple. This makes it easy to (re-)implement, easy to reason about, and easy to adapt. It should be very easy to implement this in other languages that support basic functional programming idioms.
The other thing you should consider is how fast you need your parser to be. For highly ambiguous grammars, this approach seems competitive with Happy etc., but for very simple grammars, more specialized technology will probably perform better (as you should expect).
A cool-but-stupid example that I quite like is the following:
{{
(* A very stupid way to count the length of the input. *)
}}
START -> E ?ws? ?EOF? {{ fun (i,_) -> print_string ((string_of_int i)^"\n"); i }}
E -> E E E {{ fun (x,(y,z)) -> x+y+z }}
| "a" {{ fun _ -> 1 }}
| "" {{ fun _ -> 0 }}
This grammar is examples/actions/length.g
in the supporting material. You can use the parser generator to produce a parser length.native
. You can then run it as follows:
examples/actions$ ./length.native -f length.txt
Here, length.txt
is a file containing 30 consecutive 'a' characters. Note that this grammar is extremely ambiguous. Even so, on my relatively modest laptop this returns a single result 30
in half a second. I don't know any other parser generator that can do anything like this!
The paper on all this is
Tom Ridge. Simple, functional, sound and complete parsing for all context-free grammars. In Jean Pierre Jouannaud and Zhong Shao, editors, Certified Programs and Proofs - First International Conference, CPP 2011, Kenting, Taiwan, December 7-9, 2011. Proceedings, volume 7086 of Lecture Notes in Computer Science, pages 103--118. Springer, 2011.
A pdf version can be found here.
There is quite a lot of supporting material.
The HOL formal verification. The code was verified using higher-order logic in the HOL4 theorem prover. There is a proof script that you can run through HOL4 to convince yourself that everything is OK. Let me know if anything stops working (I try to track HOL4 changes from Michael Norrish's git repo, but it should work with any reasonably recent release as well).
Actually, the proof script is not ideal, because HOL4 forces a certain ordering of proofs, which is rather unnatural for the average reader (well, it is certainly unnatural for me). Previous versions used CHEAT_TAC
to reorder things, but people don't like CHEAT_TAC
. So the current version has no CHEAT_TAC
but is rather convoluted.
The OCaml code you can actually compile and run. This includes many different variations, and is still very much a work in progress. It needs a bit of polishing before I would be happy for others to try it out, but if you do try it out, val me know how you get on.
Lots of experiments, looking particularly at performance.
The supporting material is https://github.com/tomjridge/2012-11-29_parsing.