Dr Tom Ridge

2013-11-08

- How many people have used combinator parsers?

Simple arithmetic expressions:

```
E -> E "+" E
E -> ?num?
```

or

`E -> E "+" E | ?num?`

Terminology: terminal, nonterminal, symbol, rule, grammar

N.B. RHS of rules cannot be the empty list

Each terminal is associated with a set of strings

Parse trees are trees which are well-formed according to the grammar, and where the strings associated with terminal nodes are those associated with the terminal at that node
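As a concrete illustration (not from the talk itself), parse trees for the arithmetic grammar above could be represented by an OCaml datatype; the type name and constructors here are mine:

```ocaml
(* a sketch of parse trees for E -> E "+" E | ?num? *)
type tree =
  | Num of int              (* a ?num? leaf *)
  | Plus of tree * tree     (* an internal node for E -> E "+" E *)

(* "1+2+3" has two parse trees, depending on how the "+"s associate *)
let t1 = Plus (Plus (Num 1, Num 2), Num 3)
let t2 = Plus (Num 1, Plus (Num 2, Num 3))
```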

Given a grammar, and an input string, parsing is the process of *determining* (somehow, and in some form) what parse trees correspond to the input.

There may be exponentially many parse trees:

`E -> E E | "1"`
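To see the exponential growth concretely: for this grammar, the parses of an input of n ones correspond to binary trees with n leaves, which are counted by the Catalan numbers. A small sketch (the function names are mine):

```ocaml
(* C(0)=1; C(n) = sum over i of C(i) * C(n-1-i) *)
let rec catalan n =
  if n = 0 then 1
  else
    List.init n (fun i -> i)
    |> List.fold_left (fun acc i -> acc + catalan i * catalan (n - 1 - i)) 0

(* number of parse trees for "1" repeated n times: the (n-1)th Catalan number *)
let num_parse_trees n = catalan (n - 1)
(* num_parse_trees 5 = 14; asymptotically this grows like 4^n *)
```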

There may be infinitely many parse trees:

`E -> E | ""`

Represent rules for a nonterminal:

```
E -> "(" E "+" E ")"
| "1"
```

By a recursive function:

```
(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
  ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
  ||| (a "1"))
  i
```

Given an input string `s`, a parser returns a list of things of the form `(v,s')`, where `v` is the result of parsing a prefix of `s`, and `s'` is the remainder of the input that wasn't parsed.

This is *prefix* parsing.

e.g. the following is used for `(a "1")`, or more generally `(a s)`:

```
let a s = (fun i ->
  if starts_with i s then
    let (u,v) = split i (String.length s) in
    [(u,v)]
  else
    [])
let _ = (a "1") "123"
(* [("1", "23")] *)
let _ = (a "1") "abc"
(* [] *)
```

The combinators `|||` and `**>` are defined in the obvious way:

```
let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)
let ( **> ) p1 p2 = (fun s0 ->
  (* apply p2 to results (e1,s1) from p1 *)
  let f (e1,s1) =
    p2 s1
    |> List.map (fun (e2,s2) -> ((e1,e2),s2))
  in
  s0
  |> p1
  |> List.map f
  |> List.concat)
```

Here, `x |> f` is just `f x`.

Note these combinators are written infix, e.g. `p1 **> p2`.

```
let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)
```

There is another function which we need, the "action" function:

```
let (>>) p f = (fun s ->
  s
  |> p
  |> List.map (fun (e,s) -> (f e, s)))
```

Note this is written infix e.g. `p >> f`

```
let _ = ((a "1") >> (fun _ -> 1)) "123"
(* [(1, "23")] *)
```

```
(* E -> "(" E "+" E ")" | "1" *)
let rec parse_E i =
  let f1 =
    ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")")
     >> (fun (_,(e1,(_,(e2,_)))) -> e1+e2))
    ||| ((a "1") >> (fun _ -> 1))
  in
  f1 i
let _ = parse_E "(1+(1+1))"
(* - : (int * string) list = [(3, "")] *)
```

- Arguably the best programmer interface to parsing: full power of host functional language available (parameterization, modules, types, interactive use and debugging etc)

- Cannot handle all grammars; e.g. no left recursion: the following leads to non-termination (or, more realistically, stack overflow)

`E -> E "+" E | ...`

- Inefficient; essentially because there is no attempt to reuse the results of parses of the same part of the input (memoization doesn't really help)

`E -> F G | F H`

and also because prefix parsing can lead to lots of work which is eventually thrown away

```
E -> F "X"
F -> A B
```

In this example, assuming there is no `"X"` in the input, there may be substantial work involved in combining the results of parsing an `A` followed by parsing a `B`, before the parser attempts (and fails) to parse an `"X"`.

Combinator parsing has a great programmer interface, but lousy performance, and can't handle all context-free grammars.

Consider a combinator parsing library with the following features:

- Handles all context-free grammars
- Sound
- Complete
- Conceptually simple (also, helps if implementation is simple)
- Efficient
- Real-world performance that matches or improves on the current-best parsers for any given subclass of context-free grammar

For context-free grammar parsing, this set of features, I would argue, is the best you could hope for.

In several recent publications academics have been trying to construct modular parsers, which can be easily combined in different ways.

Example: given a parser for lambda calculus, and a parser for arithmetic expressions, how to construct a parser for the combined language? e.g. parser accepts:

`((\ y. y + 1) 2) + 3`

In the following few slides: a modular presentation of parsers for 3 different languages, and their combination.

```
let rec parse_A h = (fun i -> mkntparser "arithexp" (
  (* plus expressions *)
  (((parse_A h) ***> w ***> (a "+") ***> w ***> (parse_A h))
   >>>> (fun (e1,(_,(_,(_,e2)))) -> `Plus(e1,e2)))
  |||| ...
  (* helper parser *)
  |||| h)
  i)
```

Note boilerplate.

The parser `w` parses whitespace. The parser `h` is a "helper parser".

Note that this example includes left recursion: you can't do this with traditional combinator parsing.

And we can add in subtraction, numbers and brackets in the obvious way.

```
let rec parse_A h = (fun i -> mkntparser "arithexp" (
  ...
  (* subtraction *)
  |||| (((parse_A h) ***> w ***> (a "-") ***> w ***> (parse_A h))
        >>>> (fun (e1,(_,(_,(_,e2)))) -> `Minus(e1,e2)))
  (* numbers *)
  |||| (parse_num
        >>>> (fun s -> `Num(int_of_string (content s))))
  (* brackets *)
  |||| (((a "(") ***> w ***> (parse_A h) ***> w ***> (a ")"))
        >>>> (fun (_,(_,(e,(_,_)))) -> e))
  |||| ...)
  i)
```

```
let txt = "1+2-3"
let _ = (p3_run_parser (parse_A never) txt)
(* [`Minus (`Plus (`Num 1, `Num 2), `Num 3);
`Plus (`Num 1, `Minus (`Num 2, `Num 3))] *)
```

`never`, the helper parser in this example, is a parser that never parses.

Note multiple results - the grammar is ambiguous as well as left-recursive.

Interesting because it goes beyond traditional combinator parsing, but there is more!

```
let rec parse_B h = (fun i -> mkntparser "boolexp" (
  (* true and false *)
  ((a "true") >>>> (fun _ -> `Bool true))
  |||| ((a "false") >>>> (fun _ -> `Bool false))
  (* if-then-else expressions *)
  |||| (((a "if") ***> w ***> (parse_B h) ***> w ***>
         (a "then") ***> w ***> (parse_B h) ***> w ***>
         (a "else") ***> w ***> (parse_B h))
        >>>> (fun (_,(_,(b,(_,(_,(_,(e1,...))))))) -> `Ite(b,e1,e2)))
  (* numeric comparison < *)
  |||| (((parse_B h) ***> w ***> (a "<") ***> w ***> (parse_B h))
        >>>> (fun (e1,(_,(_,(_,e2)))) -> `Lt(e1,e2)))
  |||| h)
  i)
```

Again, note left recursion.

This is just an example; you might choose e.g. to include `<` in the arithmetic parser.

```
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
  (* lambda: \ x body *)
  (((a "\\") ***> w ***> parse_var ***> w ***> (parse_L h))
   >>>> (fun (_,(_,(x,(_,body)))) -> `Lam(x,body)))
  (* application: u v *)
  |||| (((parse_L h) ***> w ***> (parse_L h))
        >>>> (fun (e1,(_,e2)) -> `App(e1,e2)))
  (* variables: x *)
  |||| (parse_var >>>> (fun s -> `Var s))
  (* brackets: (e) *)
  |||| (((a "(") ***> w ***> (parse_L h) ***> w ***> (a ")"))
        >>>> (fun (_,(_,(e,(_,_)))) -> e))
  |||| h)
  i)
```

This is a very succinct and direct parser for lambda expressions.

There are no tricks: the grammar exactly matches the datatype.
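For instance, the polymorphic-variant datatype this parser targets might look like the following (the type name `lam` is my own; the constructors are the ones used in the actions above):

```ocaml
(* each grammar production corresponds directly to one constructor *)
type lam = [ `Lam of string * lam | `App of lam * lam | `Var of string ]

(* e.g. the expression ((\ y. y) x) *)
let example : lam = `App (`Lam ("y", `Var "y"), `Var "x")
```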

We want a parser that can handle the language which includes all features of the previous 3 languages.

```
let parse_U = (
  let rec l i = parse_L h i
  and a i = parse_A h i
  and b i = parse_B h i
  and h i = (??? what goes here ???) i
  in
  l)
```

What is the missing definition of `h`?

*Hint: really only one possible answer for `h`*

We want a parser that can handle the language which includes all features of the previous 3 languages.

```
let parse_U = (
  let rec l i = parse_L h i
  and a i = parse_A h i
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)
```

Note the power of combinator parsing.

Note the power of "completeness" - all results are returned; so in particular all parsers return; so we don't need to worry about nontermination.

Note that the parsers `l`, `a`, `b` and `h` all parse the same language; indeed, they are the same parsers.

The version of `parse_U` on the previous slide is fine, but it can be made faster using memoization.

Memoization: `memo f i` is like `f i`, but saves the result and reuses it to avoid recomputation.
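A minimal sketch of such a memoizing wrapper, using a `Hashtbl` (the talk's `memo_p3` also has to account for the parsing context; this sketch ignores that detail):

```ocaml
(* [memo tbl f i] behaves like [f i], but caches the result in [tbl]
   so that repeated calls on the same input avoid recomputation *)
let memo tbl f i =
  match Hashtbl.find_opt tbl i with
  | Some r -> r
  | None ->
    let r = f i in
    Hashtbl.add tbl i r;
    r
```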

In the following, the memoization function `memo_p3` takes an explicit table argument, in which to store the results of computations.

```
let parse_memo_U () = (
  let tbl = MyHashtbl.create 100 in
  let rec l i = memo_p3 tbl (parse_L h) i
  and a i = memo_p3 tbl (parse_A h) i
  and b i = memo_p3 tbl (parse_B h) i
  and h i = memo_p3 tbl (l |||| a |||| b) i
  in
  l)
```

Note the power of combinator parsing.

```
let txt = "\\ x if (1+((\\ y y) 2)) < 4 then true else false"
let _ = (p3_run_parser parse_U txt)
(*
[`Lam
("x",
`Ite
(`Lt (`Plus (`Num 1, `App (`Lam ("y", `Var "y"), `Num 2)), `Num 4),
`Bool true, `Bool false))]
*)
```

Note nested recursion between grammars (lambda includes arith includes lambda etc)

```
let rec eval_A h x = (match x with
  | `Plus(e1,e2) -> (match (eval_A h e1, eval_A h e2) with
    | (`Num x, `Num y) -> `Num(x+y)
    | (x,y) -> `Err("eval_A 1",`Plus(x,y)))
  | `Minus(e1,e2) -> (match (eval_A h e1, eval_A h e2) with
    | (`Num x, `Num y) -> `Num(x-y)
    | (x,y) -> `Err("eval_A 2",`Minus(x,y)))
  | `Num x -> (`Num x)
  | _ -> h x)
```

Note we plan to evaluate e.g. `x+1`, but this evaluator doesn't (and shouldn't) know anything about `x`, environments etc.

```
let rec eval_L h env x = (match x with
  | `App(e1,e2) -> (
    let v1 = eval_L h env e1 in
    match v1 with
    | `Clos(env1,`Lam((x:string),e1)) -> (
      let env' = fupdate env1 (x,`Clos(env,e2)) in
      eval_L h env' e1)
    | _ -> `Err("eval_L 2",`App(v1,e2)))
  | `Lam(x,e) -> (`Clos(env,`Lam(x,e)))
  | `Var(x) -> (eval_L h env (env x))
  | `Clos(env,e) -> (eval_L h env e)
  | _ -> h env x)
```

Note the need for an `env`.

Note how the `env` is passed to the helper function: the evaluator for arithmetic expressions, e.g. `x + 2`, needs to know the value of `x` (but `eval_A` doesn't know anything about an `env`!)

The point is: the *helper* for `eval_A` needs to know the `env`, not `eval_A` itself.

```
let eval =
  let rec l env x = eval_L a env x
  and a env x = eval_A (b env) x
  and b env x = eval_B (eval_err env) x
  and eval_err env x = (match x with
    | `Err x -> `Err x
    | _ -> (l env x))
  in
  l

```
let parse_and_eval txt = (
  p3_run_parser
    ((parse_memo_U ()) >>>> eval empty_env)
    txt)
(* y is a lambda calculus fixpoint combinator *)
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma n = n + n-1 + ... + 1 *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
(* `Num 15 *)

It's about understanding what the behaviour is...

```
let rec parse_MANY p = (fun i -> (mkntparser ... (
  (((a "")) >>>> (fun _ -> []))
  |||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
  i)
```

```
let parse_LIST bra elt sep ket = (
  let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
  ((bra ***> ket) >>>> (fun _ -> []))
  |||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket)
        >>>> (fun (_,(e,(es,_))) -> e::es)))
```

```
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"
let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)
```

Support for equational reasoning about parsers: `(a "") ***> p =~ p` etc.

Distinguish between parsing time, and time to execute the actions

Efficient, practical parsers execute in time $O(n^3)$ (where $n$ is the length of the input)

Real-world performance also needs to be acceptable

For every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run. (Perlis)

- Semantics of parsing
- Key ideas, and a bit of implementation code
- Some measurements
- The state of play
- Conclusion

- What does sound and complete mean for a parser?

- Soundness means: every parse tree returned is well-formed according to the grammar.

- Completeness (almost) means: every parse tree that is well-formed according to the grammar is returned.

Problem: there may be an infinite number of parse trees

e.g.

```
E -> E E E | "1" | ""
E E E
| / | \ / | \
"1" E E E "" E ""
| | | / | \
"" "1" "" E E E
| | |
"" "1" ""
```

What causes the number of parse trees to be infinite?

```
E
/ | \
. . .
. . .
. . .
. . .
. . .
/ | \
""..."" E ""...""
```

In general you need: a nonterminal (E say) and a sequence of productions

`E ->+ "" ... "" E "" ... ""`

Bad parse trees are such as those in the previous slide

Good parse trees are those that do not contain a bad subtree

There are only a finite (but typically very large) number of good parse trees

They embody all parsing information

It is possible to motivate good parse trees in several other ways, but that is beyond the scope of this talk

- Completeness means: every good parse tree is returned.

If we restrict to good parse trees, it is possible to construct combinator parsers that are sound and complete (CPP'11).

At the implementation level we use a parsing context. Parsing contexts eliminate bad trees.

A parsing context is a set of pairs (X,s), where X is a nonterminal and s is a string

The meaning of the context is something like "A parse of nonterminal X on string s is already in process"

If you attempt to parse the same nonterminal on the same string, you abandon the parse.
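A sketch of this pruning rule, assuming the context is represented as a simple list of (nonterminal, substring) pairs (the names here are illustrative, not the CPP'11 implementation):

```ocaml
(* [with_context ctxt (x, s) parse]: if nonterminal [x] is already being
   parsed on string [s], abandon this branch (return no results);
   otherwise record (x, s) in the context and continue parsing *)
let with_context ctxt (x, s) parse =
  if List.mem (x, s) ctxt then []          (* X on s already in progress: prune *)
  else parse ((x, s) :: ctxt)              (* otherwise extend the context *)
```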

Somewhat more complicated than this because combinator parsing is prefix parsing

For details, see CPP'11 paper

Sound, complete, mechanized proof in HOL4

Executable, usable as a reference parser

- Mechanization interesting for other reasons
- wellformedness conditions on terminal parsers (lexing is a form of disambiguation)
- what is the maximum depth of a good parse tree? $|NT| \cdot (n+1)$

My current view on "semantics of parsing"

Efficiency: parsing worst case $O(n^5)$, and it often achieves this worst case ☹

We can handle context-free grammars ☺

Takes a long time ☹

Thinks... thinks...

Conclusion: difficult to preserve simplicity of combinator parsing and also achieve good performance (but sophisticated implementation techniques such as trampolining may get there - see GLL)

- Revisit goals, weaken...
- Abandon combinator parsing as a parsing technique, but retain it as the programmer interface
  - the user's conceptual model *must* remain simple
- Use any other technique to do the actual parsing
  - our implementation is based on Earley parsing; the implementation can be horribly complicated providing no details intrude on the user's mental model

The problem now is to reconcile an interface based on combinator parsing with an existing technique such as Earley parsing

Another problem is to implement Earley parsing correctly and efficiently (I won't discuss this further in this talk)

- Combinator parsing is well integrated with the host language

```
let rec parse_E = (fun i -> (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)
```

- Earley parsing uses concrete datastructures

```
let g = [
  ("E",[NT "E";NT "E";NT "E"]);
  ("E",[TM "1"]);
  ("E",[TM "eps"])]
```

Picture (SD) of proposed approach

If the programmer writes code such as `parse_E`, there is no obvious way to get a grammar such as `g`.

Somehow interpret the combinators in different ways.

In one way, we can write the function `grammar_of_parser` such that

```
grammar_of_parser parse_E = [
("E",[NT "E";NT "E";NT "E"]);
("E",[TM "1"]);
("E",[TM "eps"])]
```

- In another way, `parse_E` appears to behave as a standard combinator parser

- The input type is no longer a string, but a (3-way) sum type

`type ('a,'b,'c) sum3 = Inl of 'a | Inm of 'b | Inr of 'c`
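To illustrate the dispatch (this helper function is mine, not part of the library): a single function can behave in three different ways depending on which injection it receives:

```ocaml
type ('a,'b,'c) sum3 = Inl of 'a | Inm of 'b | Inr of 'c

(* a parser in this scheme inspects which injection it is given:
   left = symbol request, middle = grammar extraction, right = parsing *)
let describe = function
  | Inl _ -> "left: report my symbol"
  | Inm _ -> "middle: extend the grammar"
  | Inr _ -> "right: parse the input substring"
```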

Think of 3 components: left, middle and right

A parser is a function with the following type

```
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
```

- Given an argument of type `inl`, it gives a result of type `outl`, etc.

```
type inl = unit
type outl = symbol
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
```

Given an argument `Inl()`, a parser returns a value of form `Inl(sym)`:

`parse_E (Inl()) = Inl(NT("E"))`

Define the following function

`let sym_of_parser p = (dest_inl (p (Inl ())))`

```
type mid = {
  rules7: parse_rule list;
  tmparsers7: (term,raw_parser) fmap
}
type inm = mid
type outm = mid
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
```

Given an argument of type `Inm mid`, a parser (such as `parse_E`) returns a value of form `Inm mid'`, where `mid'` is `mid` extended with all grammar rules "reachable" from the relevant symbol (such as `E`). Must be careful about termination.

Also collects all the terminal parsers. Example...

`let grammar_of_parser p = (dest_inm (p (Inm empty_mid)))`

Then

```
grammar_of_parser parse_E = {
  rules7=[
    ("(E*E)", Seq (NT "E", NT "E"));
    ("(E*(E*E))", Seq (NT "E", NT "(E*E)"));
    ("((E*(E*E))+1)", Alt (NT "(E*(E*E))", TM "1"));
    ("(((E*(E*E))+1)+eps)", Alt (NT "((E*(E*E))+1)", TM "eps"));
    ("E", Atom (NT "(((E*(E*E))+1)+eps)"))];
  ...
}
```

We can extract the grammar from a parser.

We feed this to an Earley parser, and we get back information about the parse.

This information needs to be in a compact form (e.g. not parse trees!). I use an oracle.

A top-down parser:

The oracle answers the following problem:

Given a rule $S \rightarrow A B$ in $\Gamma$ and a substring $s_{i,j}$, what are the indexes $k$ such that $\Gamma \vdash A \rightarrow^* s_{i,k}$ and $\Gamma \vdash B \rightarrow^* s_{k,j}$ ?

In code:

`type ty_oracle = (symbol * symbol) -> substring -> int list`

A substring is $s_{i,j}$, or `SS(s,i,j)` in code.

```
type inr = {
  ss4: substring;
  lc4: local_context;
  oracle4: ty_oracle
}
type 'a outr = 'a list
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
```

The right component corresponds to the usual combinator parser. The `ss4` field is the input string, `lc4` is the parsing context (to eliminate bad trees) and `oracle4` is the parsing information returned by the Earley parser.

What happens when executing `(p1 ***> p2) (Inr i0)`?

```
let seqr p1 p2 = (fun i0 ->
  (* get the symbol for each parser *)
  let sym1 = sym_of_parser p1 in
  let sym2 = sym_of_parser p2 in
  let SS(s,i,j) = i0.ss4 in
  (* call the oracle on substring SS(s,i,j); get back a list of ks *)
  let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
  (* for each k we do the following *)
  let f1 k = (
    (* call p1 on SS(s,i,k) ... *)
    let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
    (* ... and p2 on SS(s,k,j) ... *)
    let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
    (* and combine the results *)
    list_product rs1 rs2)
  in
  List.concat (List.map f1 ks))
```

Complete definitions of combinators

Code for the Earley parser (surprisingly difficult to get right)

- Various grammars, various inputs

We want to compare the performance of my approach (P3) with existing approaches; none of the existing approaches have a combinator interface

First problem: most parsers cannot actually parse these grammars; if they can, they typically bug out with very small inputs

Haskell Happy parser is the only one I found that allowed me to run a respectable number of tests

- On grammars `aho_s` and `E_EEE`, P3 clearly outperforms Happy, e.g. for `aho_s`. There is (probably) a performance bug in Happy.

On `brackets`, Happy bugged out. This is (probably) a functional correctness bug in Happy.

- On `S_xSx` and `aho_sml`, P3 is worse than Happy by a small (2-4) factor. Following is `S_xSx`:

- However, Earley appears to be asymptotically better than Happy on these grammars, and so with big enough inputs, Earley would win; unfortunately Happy ran out of memory before this could be evidenced (probably another memory usage/gc bug)

There is also additional overhead after parsing: constructing the oracle and executing the actions

With the released version, constructing the oracle can take about as long as parsing

Executing the actions appears to behave as it should (i.e. best possible performance that could be expected)

The released version does not attain $O(n^3)$ bound (but Happy is even worse)

A new implementation of Earley combines parsing with oracle construction, cuts execution time by a significant factor, and appears to fit the expected bound (actually $O(n^3 \ln n)$).

The initial release is usable. No bugs found so far. Lacks features that would make it really usable (e.g. debugging features)

The new version can work with arbitrary input types, not just strings. So you can use a lexer.

The new version fixes a minor annoyance with memoization.

The new version has significant performance improvements.

The new version has some support for debugging.

I use this technology all the time e.g. to port Netsem spec to Lem

Combinator parsing library; fancy examples

Good performance, competitive with best parsers currently around; oracle architecture means we can swap in faster parsers when they become available; we can even analyse the grammar and choose the fastest parser for the given grammar class

- New version (to be released before Christmas hopefully) has even better performance
- I also plan to start using other backends for restricted grammar classes

Needs a bit more engineering e.g. to support debugging

Main challenges: good trees; oracle architecture; 3 interpretations of combinators; Earley implementation; correctness and efficiency; tying everything together

I should verify the implementation of Earley

I should test performance on large grammars from restricted classes, against existing parsers for these subclasses.