Dr Tom Ridge
2014-11-21
Some things that I won't have time to talk about:
Simple arithmetic expressions:
E -> E "+" E
E -> ?num?
or
E -> E "+" E | ?num?
Terminology: terminal, non-terminal, symbol, rule, grammar
This is essentially a language for describing a set of (labelled, finite) trees:
Parse trees are trees from the previous slide, but matched against an input string.
Each terminal is associated with a set of strings e.g.
?num?
is associated with the set of strings {"0","1","1234",...}
"+"
is associated with the set of strings {"+"}
""
or eps
is associated with the set of strings {""}
; a rule might be E -> eps
indicating that E can be expanded to match the empty stringExample parse tree:
Grammar: E -> E "+" E | ?num?
Input string: "1+2"
N.B. With traditional defns of parse trees, terminals correspond to "characters" (the epsilon terminal is somehow special). But this is not modular: with our definitions, terminal parsers can be arbitrary user-supplied functions.
Given a grammar, and an input string, parsing is the process of determining (somehow, and in some form) what parse trees correspond to the input.
There may be exponentially many parse trees:
Grammar: E -> E E | "1" Input: "11...1" (of length n)
There may be infinitely many parse trees:
Grammar: E -> E | "" Input: ""
parsing
-> parse results (e.g. SPPFs, typically for internal use)
-> apply actions to get final semantic results which are returned to user
E -> E E {{ fun (x,y) -> x+y }}
| "1" {{ fun _ -> 1 }}
"" ~> []
"1" ~> [1]
"111" ~> [3]
E -> E E E {{ fun (x,(y,z)) -> `Node(x,y,z) }}
| "1" {{ fun _ -> `Leaf 1 }}
| "" {{ fun _ -> `Leaf 0 }}
"111" ~> nontermination
Worth investigating whether we can slightly relax the notion of completeness, so that we don't talk about "all parse trees" (which leads to nontermination)
Completeness (almost) means: every parse tree that is well-formed according to the grammar is returned.
Problem: there may be an infinite number of parse trees e.g. for grammar E -> E E E | "1" | ""
and input "1"
Note that the fringe (in grey) isn't consuming any of the input, so we can keep stacking extra fringes to make arbitrarily large parse trees.
For infinitely-many parse trees you need: a nonterminal (E say) and a sequence of productions such that E →+ αEβ, where α and β match the empty string
N.B. This is an "if-and-only-if" i.e. it characterizes the situations where there are infinitely-many parse trees
If an input can be parsed, it can be parsed to give a good parse tree
For any grammar, for any input, there are only a finite number of good parse trees (this is not obvious)
Completeness now means: every good parse tree is returned
[...] The latter defines LALR(1) grammar in terms of LALR(1) parser; a grammar is LALR(1) iff its LALR(1) parser is deterministic. It is desirable to have a definition of LALR(1) grammar that does not involve the parser, but we know of no reasonable way to do this. [DeRemer and Pennello, 1982]
// A : B c d | E c f ;
// B : x y ;
// E : x y ;
// Yacc output
5: reduce/reduce conflict (reduce 3, reduce 4) on c state 5
B : x y . (3)
E : x y . (4)
Represent rules for a nonterminal:
E -> "(" E "+" E ")"
| "1"
By a recursive function:
(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
||| (a "1"))
i
e.g. the following is used for (a "1"), or more generally (a s):
let a s = (fun i ->
if starts_with i s then
let (u,v) = split i (String.length s) in
[(u,v)]
else
[])
let _ = (a "1") "123"
(* [("1", "23")] *)
let _ = (a "1") "abc"
(* [] *)
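The helpers starts_with and split used by (a s) are not shown above; here are minimal sketches of the obvious string-based versions (assumptions, not the slides' own code):
(* assumed helpers, sketched here for completeness *)
let starts_with i s =
  String.length i >= String.length s
  && String.sub i 0 (String.length s) = s

(* split a string into its first n characters and the remainder *)
let split i n = (String.sub i 0 n, String.sub i n (String.length i - n))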
The combinators ||| and **> are defined in the obvious way:
let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)
let ( **> ) p1 p2 = (fun s0 ->
(* apply p2 to results (e1,s1) from p1 *)
...)
Note these combinators are written infix e.g. p1 **> p2
let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)
There is another function which we need, the "action" function:
let (>>) p f = (fun s ->
(* apply f to the values returned from (p s) *)
...)
Note this is written infix e.g. p >> f
let _ = ((a "1") ) "123"
(* [("1", "23")] *)
let _ = ((a "1") >> (fun _ -> 1)) "123"
(* [(1, "23")] *)
(* E -> "(" E "+" E ")" | "1" *)
let rec parse_E i =
let f1 =
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")")
>> (fun (_,(e1,(_,(e2,_)))) -> e1+e2))
in
let f2 = ((a "1") >> (fun _ -> 1)) in
(f1 ||| f2) i
let _ = parse_E "(1+(1+1))"
(* # - : (int * string) list = [(3, "")] *)
Arguably the best programmer interface to parsing, because the full power of the host functional language is available
Programmers in functional languages usually write top-down, recursive descent, combinator-based parsers.
E -> E "+" E | ...
Simple, functional, sound and complete parsing for all context-free grammars, Ridge, CPP'11
Idea: if we restrict to good parse trees, then combinator parsing terminates (for any grammar), and we can state and prove some nice theorems
See defn of e.g. soundness, grammar_to_parser and sound_grammar_to_parser
grammar_to_parser was quite difficult
grammar_to_parser terminates; a suitable measure that includes the context has to be found
The parse_E function is called repeatedly on successively decreasing parts of the input: (0,3) (0,2) (0,1) (0,0)
We can handle context-free grammars ☺
Takes a long time ☹
Thinks...
Conclusion: difficult to preserve simplicity of combinator parsing and also achieve good performance (but sophisticated implementation techniques such as trampolining may get there - see GLL)
Our implementation is based on Earley parsing; the implementation can be horribly complicated providing no details intrude on the user's mental model
The problem now is to reconcile an interface based on combinator parsing with an existing technique such as Earley parsing
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
let g = [
("E",[NT "E";NT "E";NT "E"]);
("E",[TM "1"]);
("E",[TM "eps"])]
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
let g = grammar_of parse_E
let parse_results = run_earley_parser g s
let final_results = run_actions parse_E parse_results
let final_results = run_parser parse_E s
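Presumably run_parser is just the composition of the two stages; a sketch of how the pieces might fit together (not the library's own code):
(* sketch: run_parser as grammar extraction, then Earley parsing,
   then applying the actions *)
let run_parser p s =
  let g = grammar_of p in
  let parse_results = run_earley_parser g s in
  run_actions p parse_results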
A very clean and simple idea! The basic idea, I eventually learned, was first suggested by Ljunglof (2002), although there are earlier papers which use related ideas
Definitions of combinators omitted...
We can extract the grammar from a combinator parser.
We feed this to an Earley parser, and we get back information about the parse.
An obvious question: suppose you have a parser for general CFGs (e.g. an Earley parser); when parsing an input, what should the parser return?
Certainly not parse trees...
The parse information needs to be in a compact form. The traditional solution is SPPFs. SPPFs are a bad choice in my opinion (see next slide).
Instead I use an oracle.
A top-down parser:
The oracle answers the following problem:
Given a rule S → A B in Γ and a substring s_{i,j}, what are the indexes k such that Γ ⊢ A →* s_{i,k} and Γ ⊢ B →* s_{k,j}?
In code:
type ty_oracle = (symbol * symbol) -> substring -> int list
A substring is s_{i,j}, written SS(s,i,j) in code
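A sketch of the assumed substring representation (the library's actual definition is not shown):
(* assumed representation: SS(s,i,j) is the part of string s between
   indexes i and j *)
type substring = SS of string * int * int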
This slide is saying: SPPFs are a terrible choice for representing parse results; an oracle is much better
What happens when executing (p1 ***> p2) (Inr i0)?
let seqr p1 p2 = (fun i0 ->
(* get symbol for each parser *)
let sym1 = sym_of_parser p1 in
let sym2 = sym_of_parser p2 in
let SS(s,i,j) = i0.ss4 in
(* call oracle on substring SS(s,i,j), get back the list of ks *)
let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
(* for each k we do the following *)
let f1 k = (
(* call p1 on SS(s,i,k) ... *)
let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
(* ... and p2 on SS(s,k,j) .. *)
let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
(* and combine the results *)
list_product rs1 rs2)
in
List.concat (List.map f1 ks))
let rec parse_MANY p = (fun i -> (mkntparser ... (
(((a "")) >>>> (fun _ -> []))
|||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
i)
let parse_LIST bra elt sep ket = (
let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
((bra ***> ket) >>>> (fun _ -> []))
|||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket)
>>>> (fun (_,(e,(es,_))) -> e::es)))
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"
let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)
let _ = p3_run_parser (parse_LIST (a "") (a "") (a "") (a "")) ""
parse_LIST (a "") p (a "") (a "") = parse_MANY p
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
Input size|Number of results (good parse trees)
0 |1
1 |1
2 |3
3 |19
4 |150
...
19 |441152315040444150
Sequence A120590 from the OEIS, see here
Exponential growth; computing results for inputs of size 19 or larger is not feasible
N.B.: exponential number of results means this must take (at least) an exponential amount of time
N.B.: this has nothing to do with parsing, it is due to the actions
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
Input size|Number of results
0 |1
1 |1
2 |1
4 |1
...
19 |1
The semantics is as before: compute the actions over all the (good) parse trees
Very important point: I believe that the only sensible semantics is to apply the actions over all the (good) parse trees; this enables equational reasoning about parsers etc, and is what underpins the theoretical tractability of this approach
This example takes an exponential amount of time. Does it need to? Before it had to, but now it is not so clear...
In the next slide, something that you (probably) can't do with existing tools...
let rec parse_E =
let tbl_E = MyHashtbl.create 100 in
(fun i -> memo_p3 tbl_E (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result. But weren't inputs of length 19 supposed to be infeasible?
Parsing is O(n^3) for arbitrary grammars; if you want to write highly ambiguous grammars then parsing is still O(n^3).
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this doesn't have to take exponential time, providing the parse results are represented in a compact form (the oracle), and the actions are memoized
What this means in practice is that, providing your actions don't cause exponentially many results to be returned, performance is often pretty reasonable (i.e. polytime)
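As an illustration of the memoization involved, here is a minimal generic memoization wrapper in the spirit of memo_p3; the real memo_p3 and MyHashtbl are not shown in the slides, so the standard Hashtbl module stands in for them.
(* sketch only: memoize a function using a hash table; the real
   memo_p3/MyHashtbl are not shown here *)
let memo tbl p = (fun i ->
  match Hashtbl.find_opt tbl i with
  | Some r -> r                  (* cached result *)
  | None ->
    let r = p i in               (* compute once ... *)
    Hashtbl.add tbl i r;         (* ... then cache *)
    r)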
E -> E "+" E | E "-" E | ...
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-assoc + and *; we return an option (Some if the parse is acceptable, None if the parse is not acceptable).
Suppose the input is 1*2+3; this should be parsed as (1*2)+3, not 1*(2+3).
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
| (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
| (Some x,Some y) -> Some(Plus(x,y))
| _ -> None) }}
| E "*" E {{ fun (x,(_,y)) -> (match x,y with
| (Some (Times _),Some _) -> None (* not (x*y)*z ! *)
| (Some (Plus _),Some _) -> None (* not (x+y)*z ! *)
| (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
| (Some x,Some y) -> Some(Times(x,y))
| _ -> None) }}
| ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
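The actions above build values of a result type along these lines; the actual type is not given in the slides, so this is an assumed sketch:
(* assumed result type, inferred from the constructors used in the
   actions above *)
type exp = Num of int | Plus of exp * exp | Times of exp * exp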
Directly encodes which results are acceptable and not acceptable
5+3*2*1+4+1+1+1+1+1+1+1+1+1+1 has an awful lot of parse trees (see here), but can be parsed in a fraction of a second to give a single result.
Not only usable for arithmetic expressions...
Consider grammars that are X (where X is LALR(1) or LR(1) or ...). In general you can't combine two X grammars and get an X grammar.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
We are in a functional, higher-order setting: combinator parsers can be parameterized over other parsers etc etc; this is extremely powerful; and we basically have no restrictions on the grammars we use, or the actions we apply...
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here
(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))
let rec parse_A h = (fun i -> mkntparser "arithexp" (
(((parse_A h) ***> (a "+") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
|||| (((parse_A h) ***> (a "-") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
|||| (parse_num
>>>> (fun s -> `Num(int_of_string (content s))))
|||| (((a "(") ***> (parse_A h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
Note the presence of a helper parser...; this parses "all the other types of expressions"
Define parse_B for booleans, and parse_L for lambda expressions:
(* example input: \ x \ y x (y y) *)
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
(((a "\\") ***> parse_var ***> (parse_L h)) (* lam *)
>>>> (fun (_,(x,body)) -> `Lam(x,body)))
|||| (((parse_L h) ***> (parse_L h)) (* app *)
>>>> (fun (e1,e2) -> `App(e1,e2)))
|||| (parse_var >>>> (fun s -> `Var s)) (* var *)
|||| (((a "(") ***> (parse_L h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
let parse_U = (
let rec l i = parse_L h i
and a i = parse_A h i
and b i = parse_B h i
and h i = (l |||| a |||| b) i
in
l)
let parse_and_eval txt = (remove_err (
p3_run_parser
((parse_memo_U ()) >>>> eval empty_env)
txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
N.B. the example expression, e.g. sigma, is a lambda that contains a boolean exp, which contains an arith exp, which contains a lambda exp, which contains an arith exp...; the nesting of the languages is unbounded
Why is this important? We need to stop reinventing the wheel every time, and instead reuse code, including parsers
Performance should be O(n^3) in the worst case. So t(n) ≤ c·n^3, ignoring some initial values of n.
For input size n, we record t(n), and calculate c, assuming t(n) = c·n^3.
If we plot the computed c against n, we expect c to tend towards a constant (assuming the implementation meets the bound!!!)
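As a small illustration of this check, compute c(n) = t(n)/n^3 from measured timings and see whether it stops growing; the sample points below are the P3/s figures for E_EEE from the comparison table later on.
(* sketch of the check: c(n) = t(n)/n^3 should not grow with n if the
   O(n^3) bound holds; sample points are the P3 timings for E_EEE from
   the table below *)
let c_of (n, t) = t /. float_of_int (n * n * n)
let _ = List.map c_of [(20, 0.10); (40, 0.13); (100, 0.50)]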
aho_sml: S -> S S "x" | ""
E_EEE: E -> E E E | "1" | ""
Real-world performance appears O(n^3) across all grammars and inputs, as far as we can observe.
This is the worst case. For many grammars, performance will be much better - e.g. linear when parsing xml
We want to compare the performance of my approach (P3) with existing approaches; none of the existing approaches have a combinator interface
First problem: most parsers cannot actually parse these grammars; if they can, they typically bug out with very small inputs
Haskell Happy parser (based on GLR parsing) is the only one I found that allowed me to run a respectable number of tests
Grammars aho_sml and S_xSx seem to be especially suited to Happy.
Even so, P3 outperforms Happy, e.g. for aho_sml:
aho_sml: S -> S S "x" | ""
There is (probably) a performance bug in Happy, even on these grammars that are well-suited to Happy
N.B. what happened for Happy with input size 800?
Grammars aho_s and E_EEE should favour neither P3 nor Happy.
E_EEE: E -> E E E | "1" | ""
Grammar|Input size|Happy/s|P3/s
E_EEE |020 |0.14 |0.10
E_EEE |040 |6.87 |0.13
E_EEE |100 |2535.88|0.50
Grammar aho_s is similar.
What about grammar brackets?
For grammar brackets, Happy bugged out. This is (probably) a functional correctness bug in Happy.
Grammar |Winner
aho_s |P3 by a mile
aho_sml |P3
brackets|P3 by a mile (competitor failed to turn up)
E_EEE |P3 by a mile
S_xSx |P3
Observed correct asymptotic performance (for S_xSx the input size needs to be quite large)
This is noteworthy - Happy has been under development for 10+ years, by many people, all of whom have reputations as excellent researchers and engineers.
Implementing algorithms correctly and efficiently is hard: this testing produced evidence that there are functional correctness bugs in Happy. It is also possible that there are time-complexity and space-complexity bugs in Happy, but it is hard to be sure because it isn't clear which variant of GLR is implemented, nor what time/space bounds that variant of GLR actually satisfies
By contrast, no bugs have been found in P3 so far. I believe that the current version is both functionally correct, and provably efficient (assuming array-based datastructures).
Combinator parsing with Earley-like real-world performance
Combinator parsing library; many fancy examples (that are impossible with other approaches) in the online resources
Good performance, competitive with best parsers currently around on highly-ambiguous grammars; oracle architecture means we can swap in faster backend parsers when they become available; we can even analyse the grammar and choose the fastest parser for the given grammar class
Main challenges: linking combinator parsing with Earley parsing; Earley implementation; correctness and efficiency; tying everything together
Overall approach is simple, both conceptually and in terms of the implementation code (although Earley parser is quite intricate)
Probably useful for: prototyping; general CFG parsing; modular specification and combination of parsers; equational reasoning about parsers;