Main features

An approach to parsing, embodied in the OCaml P3 parsing library. Main features:

Combinator parsing library (also has a parser generator component); interactive development in the OCaml interactive REPL
Can handle all CFGs (including left-recursive CFGs)
Scannerless (parse strings directly without lexing), or can use a lexer
Good theoretical basis e.g. sound and complete (see here for what these mean formally); supports equational reasoning about parsers
Good performance (for a general parser)
Can handle some fancy examples

This talk

We look at some motivating examples, the design, and a few of the ideas

Example

Consider the grammar E -> E E E | "1" | eps

(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
  (parse_E ***> parse_E ***> parse_E)
  |||| (a "1")
  |||| (a "")))
  i)

left-recursive, highly-ambiguous
(Ridge, CPP'11) - sound and complete combinator parsing library, mechanically verified in HOL4; with this we can write examples such as parse_E above
complete? infinite number of parse trees even with the empty string as input

E    E    E 
|    |    |
""   E    E  ... (and so on)
     |    |
     ""   E
          |
          ""

What does complete mean?

Completeness naively means: all parse trees are returned. What if there are an infinite number of parse trees?
Define a class of "good" parse trees. Given a grammar, the good parse trees for that grammar are finite (but potentially exponentially large). "Good parse trees are all you need."
Completeness means: all good parse trees are returned.
There is a technique -- called parsing contexts in (Ridge, CPP'11) -- for avoiding exploring bad parse trees during a parse; this can be smoothly incorporated into combinator parsers.
The upshot is that we have carved out a finite set of parse trees and the combinator parsers always terminate, for any grammar (including left-recursion etc)
For details, see CPP'11 paper

The definition: a bad parse tree is one where a proper subtree corresponds to the same part of the input as the root; a good parse tree has no bad tree anywhere inside it; a bad parse tree:

     E
   / | \
  E  E  E 
 /   |   \
""  "1"  ""

A parsing context is a technique for eliminating bad parse trees during parsing; again, see CPP'11 for details

Back to the example

Consider the grammar E -> E E E | "1" | eps. Before, the actions were implicit. Now we add explicit actions...

let rec parse_E = (fun i -> mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
  |||| ((a "1") >>>> (fun _ -> `LF("1")))
  |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)

left-recursive, highly-ambiguous
a ~~infinite~~ finite number of parse trees with any input; in general, for any CFG, a combinator parser constructed as above will terminate on any input with a full set of parse trees
The title of the CPP'11 paper was: "Simple, functional, sound and complete parsing for all context-free grammars."
What is missing? efficiency

Efficiency problems

With this previous work, there were performance problems with left-recursive grammars (although for non-left-recursive grammars, the performance was as normal combinator parsers i.e. probably quite reasonable depending on your application)
but compared to O(n^3) approaches, combinator parsing is inherently inefficient
moreover, it appears difficult to make combinator parsing efficient (ie O(n^3)) - GLL is probably the best approach, but GLL is quite complex and uses programming language features that are not easy to verify in a theorem prover
the question is, what to do, if we want the flexibility of combinator parsers, but also good performance?

The main idea

After quite a few months experimenting with various approaches, the following seemed sensible

allow the user to write combinator parsers using the same interface as before ie something like

(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
  |||| ((a "1") >>>> (fun _ -> `LF("1")))
  |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)

extract the grammar from the code somehow

let g = grammar_of parse_E

somehow use some other approach (in fact, we use Earley parsing) to do the actual parse

let parse_results = run_earley_parser g s

then use the parse results to guide the action phase (using the user-written combinator expression eg as above)

let final_results = run_actions parse_E parse_results

package these three steps into a single function

let final_results = run_parser parse_E s

A very clean and simple idea! The basic idea I eventually learned (many thanks to the SLE reviewers!) was first suggested by Ljunglof (2002), although there are earlier papers which use related ideas
But lots of hurdles to overcome e.g. How to represent parse results? How to define the combinators?

Applying the actions

There are some subtle aspects to applying actions, which I want to briefly discuss

Example: parse trees

let rec parse_E = (fun i -> mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
  |||| ((a "1") >>>> (fun _ -> `LF("1")))
  |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)

Input size|Number of results (good parse trees)
0         |1                                   
1         |1                                   
2         |3  
3         |19
4         |150                                
...
19        |441152315040444150

Sequence A120590 from the OEIS, see here
Exponential growth; computing results for inputs of size 19 or larger is not feasible
N.B.: exponential number of results means this must take (at least) an exponential amount of time
N.B.: this has nothing to do with parsing, it is due to the actions

Conceptual distinction: parsing is not applying actions

Parsing is different from applying the actions
Parsing can be done in polytime (e.g. using Earley). Results can be computed in polytime and stored in polyspace.
Applying actions takes how long? Well, it depends on the actions.
Applying actions over all good parse trees (where each action takes a constant time), takes an exponential amount of time

Example: counting

(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)

Input size|Number of results
0         |1                                   
1         |1                                   
2         |1                                   
4         |1                                 
...       
19        |1

The semantics is as before: compute the actions over all the (good) parse trees
Very important point: I believe that the only sensible semantics is to apply the actions over all the (good) parse trees; this enables equational reasoning about parsers etc, and is what underpins the theoretical tractability of this approach
This example takes an exponential amount of time. Does it need to? Before it had to, but now it is not so clear...
In the next slide, something that you (probably) can't do with existing tools...

Example: memoized counting (*)

let rec parse_E = 
  let tbl_E = MyHashtbl.create 100 in
  (fun i -> memo_p3 tbl_E (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)

This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result. But weren't inputs of length 19 supposed to be infeasible?
The library has been designed to ensure "optimal" performance at each stage.
- Earley parsing is O(n^3). If you want to write highly ambiguous grammars then parsing is still O(n^3).
- Representation of parse results via oracle is O(n^3) for arbitrary grammars
- The performance when applying actions can be optimized using memoization
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this doesn't have to take exponential time, providing the parse results are represented in a compact form (the oracle), and the actions are memoized
What this means in practice is that, providing your actions don't cause exponentially many results to be returned, performance is often pretty reasonable (ie polytime)

Example: disambiguation

Arithmetic expressions are pretty basic, but parsing and disambiguating is somewhat complicated

E -> E "+" E | E "-" E | ...

One approach: fiddle with the grammar (this uses CFG structure to encode associativity and precedence) link

<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>

<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>

<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>

Another approach: add associativity and precedence directly to the parser link; the following expresses associativity and precedence (later declarations have higher precedence)

%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */

Both these approaches can be used by P3 without any special support in the parser e.g. see here

Example: disambiguation (*)

A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-assoc + and *; we return an option (Some if the parse is acceptable, None if the parse is not acceptable) link
Suppose input is 1*2+3; this should be parsed as (1*2)+3, not 1*(2+3)

E -> E "+" E           {{ fun (x,(_,y)) -> (match x,y with 
                          | (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
                          | (Some x,Some y) -> Some(Plus(x,y)) 
                          | _ -> None) }}

  | E "*" E            {{ fun (x,(_,y)) -> (match x,y with
                          | (Some (Times _),Some _) -> None (* not (x*y)*z ! *)
                          | (Some (Plus _),Some _) -> None (* not (x+y)*z ! *)
                          | (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
                          | (Some x,Some y) -> Some(Times(x,y)) 
                          | _ -> None) }}
  | ?num?              {{ fun s -> Some(Num(int_of_string (content s))) }}

Directly encodes which results are acceptable and not acceptable
Clear semantics: we imagine that we are applying the actions over all parse trees
Traditionally not possible because there are exponentially many parse trees; the point of last few slides is: it is not the number of parse trees that matters
- eg 5+3*2*1+4+1+1+1+1+1+1+1+1+1+1 has an awful lot of parse trees (see here), but can be parsed in a fraction of a second to give a single result
Not only usable for arithmetic expressions...

The point is that you can perform actions apparently over an exponential number of parse trees, but everything happens in polytime

Expressivity

This approach also allows some fancy examples

Example: modular combination of parsers (*)

Consider grammars that are X (where X is LALR(1) or LR(1) or ...). You can't combine two X grammars to get an X grammar.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
We are in a functional, higher-order setting: combinator parsers can be parameterized over other parsers etc etc; this is extremely powerful; and we basically have no restrictions on the grammars we use, or the actions we apply...
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here

(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y)) 

let rec parse_A h = (fun i -> mkntparser "arithexp" (
  (((parse_A h) ***> (a "+") ***> (parse_A h))
        >>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
  |||| (((parse_A h) ***> (a "-") ***> (parse_A h))
        >>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
  |||| (parse_num 
        >>>> (fun s -> `Num(int_of_string (content s))))
  |||| (((a "(") ***> (parse_A h) ***> (a ")"))  (* brackets *)
        >>>> (fun (_,(e,_)) -> e))
  |||| h)  (* helper parser *)
  i)

Note the presence of a helper parser...; this parses "all the other types of expressions"
Define parse_B for booleans, and parse_L for lambda expressions

(* example input: \ x \ y x (y y) *)
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
  (((a "\\") ***> parse_var ***> (parse_L h))  (* lam *)
        >>>> (fun (_,(x,body)) -> `Lam(x,body)))
  |||| (((parse_L h) ***> (parse_L h))  (* app *)
        >>>> (fun (e1,e2) -> `App(e1,e2)))
  |||| (parse_var >>>> (fun s -> `Var s))  (* var *)
  |||| (((a "(") ***> (parse_L h) ***> (a ")"))  (* brackets *)
        >>>> (fun (_,(e,_)) -> e))
  |||| h)  (* helper parser *)
  i)

Can use parsers separately (with a "do nothing" helper), or can combine to form a parser for the union of the languages; in the following notice the definition of the helper function

let parse_U = (
  let rec l i = parse_L h i 
  and a i = parse_A h i 
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)

Then define an evaluator in the usual way, and finally, combine the parser and the evaluator

let parse_and_eval txt = (remove_err (
  p3_run_parser 
    ((parse_memo_U ()) >>>> eval empty_env)
    txt))


let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt

N.B. the example expression, for eg sigma, is a lambda that contains a boolean exp, which contains an arith exp, which contains a lambda exp, which contains an arith exp... the nesting of the languages is infinite

Performance

Lots of different parts, but the back-end Earley parser is the workhorse of this approach
We want an Earley parser
- that is purely functional
- that has O(n^3) performance
- that allows rather general terminal parsers (arbitrary functions!)
- that would permit a mechanized proof of both correctness and performance behaviour
These requirements mean that I was forced to write an Earley parser from scratch
- functional correctness took some time
- parameterization took some time (the core algorithm is parameterized over everything - data structures and data representations)
- performance correctness took an immense amount of time

P3 back-end Earley parser performance

Compare to Happy on 5 sample small grammars
On 2, P3 is much much better

E_EEE: E -> E E E | "1" | ""

Grammar|Input size|Happy/s|P3/s
E_EEE  |020       |0.14   |0.10
E_EEE  |040       |6.87   |0.13
E_EEE  |100       |2535.88|0.50

On 2, P3 is better, but only by a factor of 10 or so. On one, Happy appeared to loop for non-empty inputs.
See here for comparison with Happy
I think these are very good results; getting these results took an immense amount of time

...using an oracle?

How to represent the parse results from the Earley stage that are supplied to the action stage? We must not use parse trees (otherwise the performance becomes exponential)
SPPFs are the traditional solution, but these are really an imperative datastructure, quite complicated, and they require a binarized grammar for O(n^3) behaviour
An oracle is a functional way to represent the results of parsing. Takes O(n^3) time to construct, even for non-binarized grammars.

Conclusion

It works and is usable for prototyping and small applications. Various improvements are in the pipeline.

I'm working to extend the grammars that can be handled beyond CFGs - eg grammars containing infinitely many nonterminals, generated lazily on demand; this is useful for parsing indentation-sensitive languages among other things

(* nonterminal E_i parses the number i, or the number i followed by nonterminal E_{i+1}
example input: 1234567891011 *)
let parse_E n = (a n) 
  |||| ((a n) ***> (parse_E (n+1)))

Long term aim: general parser, verified correctness and performance, usable in the real-world.
Some ideas:
- "good" parse trees and the parsing context;
- combinator front end, Earley (or other) back end;
- simple semantics, applying actions over all parse trees (which permits equational reasoning about parsers);
- interaction between memoization and actions - things that we intuitively think might be exponential time are polytime;
- using an oracle to represent parse results

https://github.com/tomjridge/p3

Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle (Talk at SLE 2014)

Main features

This talk

Example

What does complete mean?

Back to the example

Efficiency problems

The main idea

Applying the actions

Example: parse trees

Conceptual distinction: parsing is not applying actions

Example: counting

Example: memoized counting (*)

Example: disambiguation

Example: disambiguation (*)

Expressivity

Example: modular combination of parsers (*)

Performance

P3 back-end Earley parser performance

...using an oracle?

Conclusion

Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle
(Talk at SLE 2014)