Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle
(Talk at SLE 2014)
Dr Tom Ridge
2014-09-15
Main features
An approach to parsing, embodied in the OCaml P3 parsing library. Main features:
Combinator parsing library (also has a parser generator component); interactive development in the OCaml interactive REPL
Can handle all CFGs (including left-recursive CFGs)
Scannerless (parse strings directly without lexing), or can use a lexer
Good theoretical basis e.g. sound and complete (see here for what these mean formally); supports equational reasoning about parsers
Good performance (for a general parser)
Can handle some fancy examples
This talk
- We look at some motivating examples, the design, and a few of the ideas
Example
- Consider the grammar
E -> E E E | "1" | eps
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
(parse_E ***> parse_E ***> parse_E)
|||| (a "1")
|||| (a "")))
i)
left-recursive, highly-ambiguous
(Ridge, CPP'11) - sound and complete combinator parsing library, mechanically verified in HOL4; with this we can write examples such as parse_E
above
complete? infinite number of parse trees even with the empty string as input
E E E
| | |
"" E E ... (and so on)
| |
"" E
|
""
What does complete mean?
Completeness naively means: all parse trees are returned. What if there are an infinite number of parse trees?
Define a class of "good" parse trees. Given a grammar, the good parse trees for that grammar are finite (but potentially exponentially large). "Good parse trees are all you need."
Completeness means: all good parse trees are returned.
There is a technique -- called parsing contexts in (Ridge, CPP'11) -- for avoiding exploring bad parse trees during a parse; this can be smoothly incorporated into combinator parsers.
The upshot is that we have carved out a finite set of parse trees and the combinator parsers always terminate, for any grammar (including left-recursion etc)
For details, see CPP'11 paper
- The definition: a bad parse tree is one where a proper subtree corresponds to the same part of the input as the root; a good parse tree has no bad tree anywhere inside it; a bad parse tree:
E
/ | \
E E E
/ | \
"" "1" ""
- A parsing context is a technique for eliminating bad parse trees during parsing; again, see CPP'11 for details
Back to the example
- Consider the grammar
E -> E E E | "1" | eps
. Before, the actions were implicit. Now we add explicit actions...
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
left-recursive, highly-ambiguous
a infinite finite number of parse trees with any input; in general, for any CFG, a combinator parser constructed as above will terminate on any input with a full set of parse trees
The title of the CPP'11 paper was: "Simple, functional, sound and complete parsing for all context-free grammars."
What is missing? efficiency
Efficiency problems
With this previous work, there were performance problems with left-recursive grammars (although for non-left-recursive grammars, the performance was as normal combinator parsers i.e. probably quite reasonable depending on your application)
but compared to O(n^3)
approaches, combinator parsing is inherently inefficient
moreover, it appears difficult to make combinator parsing efficient (ie O(n^3)
) - GLL is probably the best approach, but GLL is quite complex and uses programming language features that are not easy to verify in a theorem prover
the question is, what to do, if we want the flexibility of combinator parsers, but also good performance?
The main idea
After quite a few months experimenting with various approaches, the following seemed sensible
- allow the user to write combinator parsers using the same interface as before ie something like
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
- extract the grammar from the code somehow
let g = grammar_of parse_E
- somehow use some other approach (in fact, we use Earley parsing) to do the actual parse
let parse_results = run_earley_parser g s
- then use the parse results to guide the action phase (using the user-written combinator expression eg as above)
let final_results = run_actions parse_E parse_results
- package these three steps into a single function
let final_results = run_parser parse_E s
A very clean and simple idea! The basic idea I eventually learned (many thanks to the SLE reviewers!) was first suggested by Ljunglof (2002), although there are earlier papers which use related ideas
But lots of hurdles to overcome e.g. How to represent parse results? How to define the combinators?
Applying the actions
- There are some subtle aspects to applying actions, which I want to briefly discuss
Example: parse trees
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
Input size|Number of results (good parse trees)
0 |1
1 |1
2 |3
3 |19
4 |150
...
19 |441152315040444150
Sequence A120590 from the OEIS, see here
Exponential growth; computing results for inputs of size 19 or larger is not feasible
N.B.: exponential number of results means this must take (at least) an exponential amount of time
N.B.: this has nothing to do with parsing, it is due to the actions
Conceptual distinction: parsing is not applying actions
Parsing is different from applying the actions
Parsing can be done in polytime (e.g. using Earley). Results can be computed in polytime and stored in polyspace.
Applying actions takes how long? Well, it depends on the actions.
Applying actions over all good parse trees (where each action takes a constant time), takes an exponential amount of time
Example: counting
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
Input size|Number of results
0 |1
1 |1
2 |1
4 |1
...
19 |1
The semantics is as before: compute the actions over all the (good) parse trees
Very important point: I believe that the only sensible semantics is to apply the actions over all the (good) parse trees; this enables equational reasoning about parsers etc, and is what underpins the theoretical tractability of this approach
This example takes an exponential amount of time. Does it need to? Before it had to, but now it is not so clear...
In the next slide, something that you (probably) can't do with existing tools...
Example: memoized counting (*)
let rec parse_E =
let tbl_E = MyHashtbl.create 100 in
(fun i -> memo_p3 tbl_E (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result. But weren't inputs of length 19 supposed to be infeasible?
- The library has been designed to ensure "optimal" performance at each stage.
- Earley parsing is
O(n^3)
. If you want to write highly ambiguous grammars then parsing is still O(n^3)
.
- Representation of parse results via oracle is
O(n^3)
for arbitrary grammars
- The performance when applying actions can be optimized using memoization
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this doesn't have to take exponential time, providing the parse results are represented in a compact form (the oracle), and the actions are memoized
What this means in practice is that, providing your actions don't cause exponentially many results to be returned, performance is often pretty reasonable (ie polytime)
Example: disambiguation
- Arithmetic expressions are pretty basic, but parsing and disambiguating is somewhat complicated
E -> E "+" E | E "-" E | ...
- One approach: fiddle with the grammar (this uses CFG structure to encode associativity and precedence) link
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
- Another approach: add associativity and precedence directly to the parser link; the following expresses associativity and precedence (later declarations have higher precedence)
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
- Both these approaches can be used by P3 without any special support in the parser e.g. see here
Example: disambiguation (*)
A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-assoc + and *; we return an option (Some
if the parse is acceptable, None
if the parse is not acceptable) link
Suppose input is 1*2+3
; this should be parsed as (1*2)+3
, not 1*(2+3)
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
| (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
| (Some x,Some y) -> Some(Plus(x,y))
| _ -> None) }}
| E "*" E {{ fun (x,(_,y)) -> (match x,y with
| (Some (Times _),Some _) -> None (* not (x*y)*z ! *)
| (Some (Plus _),Some _) -> None (* not (x+y)*z ! *)
| (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
| (Some x,Some y) -> Some(Times(x,y))
| _ -> None) }}
| ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
Directly encodes which results are acceptable and not acceptable
Clear semantics: we imagine that we are applying the actions over all parse trees
- Traditionally not possible because there are exponentially many parse trees; the point of last few slides is: it is not the number of parse trees that matters
- eg
5+3*2*1+4+1+1+1+1+1+1+1+1+1+1
has an awful lot of parse trees (see here), but can be parsed in a fraction of a second to give a single result
Not only usable for arithmetic expressions...
- The point is that you can perform actions apparently over an exponential number of parse trees, but everything happens in polytime
Expressivity
- This approach also allows some fancy examples
Example: modular combination of parsers (*)
Consider grammars that are X (where X is LALR(1) or LR(1) or ...). You can't combine two X grammars to get an X grammar.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
We are in a functional, higher-order setting: combinator parsers can be parameterized over other parsers etc etc; this is extremely powerful; and we basically have no restrictions on the grammars we use, or the actions we apply...
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here
(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))
let rec parse_A h = (fun i -> mkntparser "arithexp" (
(((parse_A h) ***> (a "+") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
|||| (((parse_A h) ***> (a "-") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
|||| (parse_num
>>>> (fun s -> `Num(int_of_string (content s))))
|||| (((a "(") ***> (parse_A h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
Note the presence of a helper parser...; this parses "all the other types of expressions"
Define parse_B
for booleans, and parse_L
for lambda expressions
(* example input: \ x \ y x (y y) *)
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
(((a "\\") ***> parse_var ***> (parse_L h)) (* lam *)
>>>> (fun (_,(x,body)) -> `Lam(x,body)))
|||| (((parse_L h) ***> (parse_L h)) (* app *)
>>>> (fun (e1,e2) -> `App(e1,e2)))
|||| (parse_var >>>> (fun s -> `Var s)) (* var *)
|||| (((a "(") ***> (parse_L h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
- Can use parsers separately (with a "do nothing" helper), or can combine to form a parser for the union of the languages; in the following notice the definition of the helper function
let parse_U = (
let rec l i = parse_L h i
and a i = parse_A h i
and b i = parse_B h i
and h i = (l |||| a |||| b) i
in
l)
- Then define an evaluator in the usual way, and finally, combine the parser and the evaluator
let parse_and_eval txt = (remove_err (
p3_run_parser
((parse_memo_U ()) >>>> eval empty_env)
txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
- N.B. the example expression, for eg sigma, is a lambda that contains a boolean exp, which contains an arith exp, which contains a lambda exp, which contains an arith exp... the nesting of the languages is infinite
...using an oracle?
How to represent the parse results from the Earley stage that are supplied to the action stage? We must not use parse trees (otherwise the performance becomes exponential)
SPPFs are the traditional solution, but these are really an imperative datastructure, quite complicated, and they require a binarized grammar for O(n^3)
behaviour
An oracle is a functional way to represent the results of parsing. Takes O(n^3)
time to construct, even for non-binarized grammars.
Conclusion
- It works and is usable for prototyping and small applications. Various improvements are in the pipeline.
- I'm working to extend the grammars that can be handled beyond CFGs - eg grammars containing infinitely many nonterminals, generated lazily on demand; this is useful for parsing indentation-sensitive languages among other things
(* nonterminal E_i parses the number i, or the number i followed by nonterminal E_{i+1}
example input: 1234567891011 *)
let parse_E n = (a n)
|||| ((a n) ***> (parse_E (n+1)))