Simple, efficient, sound-and-complete combinator parsing for all context-free grammars, using an oracle.

Dr Tom Ridge

2013-11-08

Prelude: quick intro to parsing and combinator parsing

Quick question

Basics of context-free grammars

Simple arithmetic expressions:

E -> E "+" E 
E -> ?num?

or

E -> E "+" E | ?num?

Terminology: terminal, nonterminal, symbol, rule, grammar

N.B. RHS of rules cannot be the empty list

Basics of parse trees

Each terminal is associated with a set of strings

Parse trees are trees which are well-formed according to the grammar, and where the strings associated with terminal nodes are those associated to the terminal at that node

Parsing

Given a grammar, and an input string, parsing is the process of determining (somehow, and in some form) what parse trees correspond to the input.

There may be exponentially many parse trees:

E -> E E | "1"

There may be infinitely many parse trees:

E -> E | ""

Combinator parsing

Represent rules for a nonterminal:

E -> "(" E "+" E ")" 
  | "1"

By a recursive function:

(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
  ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
  ||| (a "1"))
  i 

Given an input string s, a parser returns a list of things of the form (v,s'), where v is the result of parsing a prefix of s, and s' is the remainder of the input that wasn't parsed.

This is prefix parsing.

Combinator parsing, terminal parsers

e.g. following used for (a "1") or more generally (a s)

let a s = (fun i -> 
  if starts_with i s then
    let (u,v) = split i (String.length s) in
    [(u,v)]
  else 
    [])

let _ = (a "1") "123"
(* [("1", "23")] *)

let _ = (a "1") "abc"
(* [] *)

Combinator parsing, combinators

The combinators ||| and **> are defined in the obvious way:

let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)

let ( **> ) p1 p2 = (fun s0 ->
  (* apply p2 to results (e1,s1) from p1 *)
  let f (e1,s1) =
    p2 s1
    |> List.map (fun (e2,s2) -> ((e1,e2),s2))
  in
  s0
  |> p1
  |> List.map f
  |> List.concat)

Here, x |> f is just f x

Note these combinators are written infix e.g. p1 **> p2

let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)

Combinator parsing, combinators

There is another function which we need, the "action" function:

let (>>) p f = (fun s ->
  s
  |> p
  |> List.map (fun (e,s) -> (f e, s)))

Note this is written infix e.g. p >> f

let _ = ((a "1") >> (fun _ -> 1)) 
(* "123" = [(1, "23")] *)

With the following result

(* E -> "(" E "+" E ")" *)
let rec parse_E i = 
  let f1 = 
    ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")") 
      >> (fun (_,(e1,(_,(e2,_)))) -> e1+e2))
    ||| ((a "1") >> (fun _ -> 1))
  in
  f1 i

let _ = parse_E "(1+(1+1))"
(* # - : (int * string) list = [(3, "")] *)

The glory of combinator parsing

Some problems with combinator parsing

E -> E "+" E | ...
E -> F G | F H

and also because prefix parsing can lead to lots of work which is eventually thrown away

E -> F "X"
F -> A B

In this example, assuming there is no "X" in the input, there may be substantial work involved in combining the results of parsing an A followed by parsing a B, before the parser attempts (and fails) to parse an "X".

Summary of problems

Combinator parsing has a great programmer interface, but lousy performance, and can't handle all context-free grammars.

The goal

Consider a combinator parsing library with the following features:

For context-free grammar parsing, this set of features, I would argue, is the best you could hope for.

Sneak peak

Motivating combinator parsing

In several recent publications academics have been trying to construct modular parsers, which can be easily combined in different ways.

Example: given a parser for lambda calculus, and a parser for arithmetic expressions, how to construct a parser for the combined language? e.g. parser accepts:

((\ y. y + 1) 2) + 3

In the following few slides: a modular presentation of parsers for 3 different languages, and their combination.

Modular parsing: arithmetic

let rec parse_A h = (fun i -> mkntparser "arithexp" (

  (* plus expressions *)
  (((parse_A h) ***> w ***> (a "+") ***> w ***> (parse_A h))
   >>>> (fun (e1,(_,(_,(_,e2)))) -> `Plus(e1,e2)))

  |||| ...

  (* helper parser *)
  |||| h)
  i)

Note boilerplate.

The parser w parses whitespace.

The parser h is a "helper parser".

Note that this example includes left recursion: you can't do this with traditional combinator parsing.

Modular parsing: arithmetic

And we can add in subtraction, numbers and brackets in the obvious way.

let rec parse_A h = (fun i -> mkntparser "arithexp" (
  ...
  (* subtraction *)
  |||| (((parse_A h) ***> w ***> (a "-") ***> w ***> (parse_A h))
        >>>> (fun (e1,(_,(_,(_,e2)))) -> `Minus(e1,e2)))

  (* numbers *)
  |||| (parse_num 
        >>>> (fun s -> `Num(int_of_string (content s))))

  (* brackets *)
  |||| (((a "(") ***> w ***> (parse_A h) ***> w ***> (a ")")) 
        >>>> (fun (_,(_,(e,(_,_)))) -> e))

  |||| ...)
  i)

Parsing arithmetic expressions

  let txt = "1+2-3"

  let _ = (p3_run_parser (parse_A never) txt)
  (* [`Minus (`Plus (`Num 1, `Num 2), `Num 3);
  `Plus (`Num 1, `Minus (`Num 2, `Num 3))] *)

never, the helper parser in this example, is a parser that never parses.

Note multiple results - the grammar is ambiguous as well as left-recursive.

Interesting because it goes beyond traditional combinator parsing, but there is more!

Parsing "boolean" expressions

let rec parse_B h = (fun i -> mkntparser "boolexp" (

  (* true and false *)
  ((a "true") >>>> (fun _ -> `Bool true))
  |||| ((a "false") >>>> (fun _ -> `Bool false))

  (* if-then-else expressions *)
  |||| (((a "if") ***> w ***> (parse_B h) ***> w ***> 
            (a "then") ***> w ***> (parse_B h) ***> w ***> 
            (a "else") ***> w ***> (parse_B h))
        >>>> (fun (_,(_,(b,(_,(_,(_,(e1,...))))))) -> `Ite(b,e1,e2)))

  (* numeric comparison < *)
  |||| (((parse_B h) ***> w ***> (a "<") ***> w ***> (parse_B h)) 
        >>>> (fun (e1,(_,(_,(_,e2)))) -> `Lt(e1,e2)))

  |||| h)
  i)

Again, note left recursion.

This is just an example; you might choose e.g. to include < in the arithmetic parser

Parsing lambda expressions

let rec parse_L h = (fun i -> mkntparser "lambdaexp" (

  (* lambda: \ x body *)
  (((a "\\") ***> w ***> parse_var ***> w ***> (parse_L h))  
   >>>> (fun (_,(_,(x,(_,body)))) -> `Lam(x,body)))

  (* application: u v *)
  |||| (((parse_L h) ***> w ***> (parse_L h)) 
        >>>> (fun (e1,(_,e2)) -> `App(e1,e2)))

  (* variables: x *)
  |||| (parse_var >>>> (fun s -> `Var s))  (* var *)

  (* brackets: (e) *)
  |||| (((a "(") ***> w ***> (parse_L h) ***> w ***> (a ")")) 
        >>>> (fun (_,(_,(e,(_,_)))) -> e))

  |||| h)
  i)

Magic

We want a parser that can handle the language which includes all features of the previous 3 languages.

Magic

We want a parser that can handle the language which includes all features of the previous 3 languages.

  let parse_U = (
    let rec l i = parse_L h i 
    and a i = parse_A h i 
    and b i = parse_B h i
    and h i = (??? what goes here???) i
    in
    l)

What is the missing definition of h?

Hint: really only one possible answer for h

Magic

We want a parser that can handle the language which includes all features of the previous 3 languages.

  let parse_U = (
    let rec l i = parse_L h i 
    and a i = parse_A h i 
    and b i = parse_B h i
    and h i = (l |||| a |||| b) i
    in
    l)

Note the power of combinator parsing.

Note the power of "completeness" - all results are returned; so in particular all parsers return; so we don't need to worry about nontermination.

Note that the parsers l, a, b and h all parse the same language, indeed, they are the same parsers.

Memoization

The version of parse_U on the previous slide is fine, but it can be made faster using memoization.

Memoization: memo f i is like f i, but saves the result and reuses it to avoid recomputation.

In the following, the memoization function memo_p3 takes an explicit table argument, in which to store the results of computations.

  let parse_memo_U () = (
    let tbl = MyHashtbl.create 100 in
    let rec l i = memo_p3 tbl (parse_L h) i 
    and a i = memo_p3 tbl (parse_A h) i 
    and b i = memo_p3 tbl (parse_B h) i
    and h i = memo_p3 tbl (l |||| a |||| b) i
    in
    l)

Note the power of combinator parsing.

Example

let txt = "\\ x if (1+((\\ y y) 2)) < 4 then true else false"

let _ = (p3_run_parser parse_U txt)

(* 
[`Lam
   ("x",
    `Ite
      (`Lt (`Plus (`Num 1, `App (`Lam ("y", `Var "y"), `Num 2)), `Num 4),
       `Bool true, `Bool false))]
*)

Note nested recursion between grammars (lambda includes arith includes lambda etc)

Modular evaluation

let rec eval_A h x = (match x with

| `Plus(e1,e2) -> (match (eval_A h e1,eval_A h e2) with
    | (`Num x, `Num y) -> `Num(x+y)
    | (x,y) -> `Err("eval_A 1",`Plus(x,y)))

  | `Minus(e1,e2) -> (match (eval_A h e1,eval_A h e2) with
    | (`Num x, `Num y) -> `Num(x-y)
    | (x,y) -> `Err("eval_A 2",`Minus(x,y)))

  | `Num x -> (`Num x)

  | _ -> h x)

Note we plan to evaluate e.g. "x+1", but this evaluator doesn't (and shouldn't) know anything about x, envs etc

Modular evaluation, lambda expressions

let rec eval_L h env x = (match x with

  | `App(e1,e2) -> (
    let v1 = eval_L h env e1 in
    match v1 with
    | `Clos(env1,`Lam((x:string),e1)) -> (
      let env' = fupdate env1 (x,`Clos(env,e2)) in
      eval_L h env' e1)
    | _ -> `Err("eval_L 2",`App(v1,e2)))

  | `Lam(x,e) -> (`Clos(env,`Lam(x,e)))

  | `Var(x) -> (eval_L h env (env x))

  | `Clos(env,e) -> (eval_L h env e) 

  | _ -> h env x)

Note need for an env.

Note how env is passed to the helper function: the evaluator for arithmetic expressions e.g. x + 2 needs to know the value of x (but eval_A doesn't know anything about an env!)

The point is: the helper for eval_A needs to know the env, not eval_A itself

Combining the evaluators

  let eval = 
    let rec l env x = eval_L a env x
    and a env x = eval_A (b env) x
    and b env x = eval_B (eval_err env) x
    and eval_err env x = (match x with
      | `Err x -> `Err x
      | _ -> (l env x))
    in
    l

Putting it all together

  let parse_and_eval txt = (
    p3_run_parser 
      ((parse_memo_U ()) >>>> eval empty_env)
      txt)
      
  (* y is a lambda calculus fixpoint combinator *)
  let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
  (* sigma n = n + n-1 + ... + 1 *)
  let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
  let txt = "("^y^" "^sigma^") 5"
  let [r] = parse_and_eval txt
  (* `Num 15 *)

Motivating soundness and completeness

It's about understanding what the behaviour is...

let rec parse_MANY p = (fun i -> (mkntparser ... (
  (((a "")) >>>> (fun _ -> []))
  |||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
  i)
let parse_LIST bra elt sep ket = (
  let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
  ((bra ***> ket) >>>> (fun _ -> []))
  |||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket) 
    >>>> (fun (_,(e,(es,_))) -> e::es)))  
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"

let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)

Support for equational reasoning about parsers: (a "") ***> p =~ p etc

What do I mean by efficient?

For every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run. (Perlis)

The talk proper

Outline

Semantics of parsing

Soundness

Completeness

e.g.

E -> E E E | "1" | ""

 E         E              E           
 |       / | \          / | \  
"1"     E  E  E       ""  E  ""
        |  |  |         / | \            
       "" "1" ""       E  E  E           
                       |  |  | 
                      "" "1" ""

A reasonable definition of completeness

What causes the number of parse trees to be infinite?

         E
       / | \
      .  .  .
     .   .   .
    .    .    .
   .     .     .
  .      .      .
 /       |       \
 ""..."" E  ""...""

In general you need: a nonterminal (E say) and a sequence of productions

E ->+ "" ... "" E "" ... ""

Bad and good parse trees

Completeness

Fixing combinator parsing to handle all context-free grammars

Semantics of parsing, summary

A new approach

Time to think again

Time to think again

Two different worlds

let rec parse_E = (fun i -> (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)
let g = [
  ("E",[NT "E";NT "E";NT "E"]);
  ("E",[TM "1"]);
  ("E",[TM "eps"])]

Idea, forced

grammar_of_parser parse_E = [
  ("E",[NT "E";NT "E";NT "E"]);
  ("E",[TM "1"]);
  ("E",[TM "eps"])]

Solution

type ('a,'b,'c) sum3 = Inl of 'a | Inm of 'b | Inr of 'c
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3

Left component

type inl = unit
type outl = symbol

type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)

Given an argument Inl(), a parser returns a value of form Inl(sym)

parse_E (Inl()) = Inl(NT("E"))

Define the following function

let sym_of_parser p = (dest_inl (p (Inl ())))

Middle component

type mid = { 
  rules7: parse_rule list; 
  tmparsers7: (term,raw_parser) fmap 
}
type inm = mid
type outm = mid

type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)

Given an argument of type Inm mid, a parser (such as parse_E) returns a value of form Inm mid', where mid' is mid extended with all grammar rules "reachable" from the relevant symbol (such as E). Must be careful about termination.

Also collects all the terminal parsers. Example...

Middle component

let grammar_of_parser p = (dest_inm (p (Inm empty_mid)))

Then

grammar_of_parser parse_E = {
  rules7=[("(E*E)", Seq (NT "E", NT "E")); 
    ("(E*(E*E))", Seq (NT "E", NT "(E*E)"));
    ("((E*(E*E))+1)", Alt (NT "(E*(E*E))", TM "1"));
    ("(((E*(E*E))+1)+eps)",Alt(NT "((E*(E*E))+1)",TM "eps"));
    ("E", Atom (NT "(((E*(E*E))+1)+eps)"))];
  ... 
}

Earley parsing

We can extract the grammar from a parser.

We feed this to an Earley parser, and we get back information about the parse.

This information needs to be in a compact form (e.g. not parse trees!). I use an oracle.

Oracle

A top-down parser:

The oracle answers the following problem:

Given a rule $S \rightarrow A B$ in $\Gamma$ and a substring $s_{i,j}$, what are the indexes $k$ such that $\Gamma \vdash A \rightarrow^* s_{i,k}$ and $\Gamma \vdash B \rightarrow^* S_{k,j}$ ?

In code:

type ty_oracle = (symbol * symbol) -> substring -> int list

A substring is $s_{i,j}$, or SS(s,i,j) in code

Right component

type inr = {
  ss4: substring;
  lc4: local_context;
  oracle4: ty_oracle
}
type 'a outr = 'a list

type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)

The right component corresponds to the usual combinator parser. The ss4 field is the input string, lc4 is the parsing context (to eliminate bad trees) and oracle4 is the parsing information returned by the Earley parser.

Definition of combinators

What happens when executing (p1 ***> p2) (Inr i0)?

let seqr p1 p2 = (fun i0 -> 
  (* get symbol for each parser *)
  let sym1 = sym_of_parser p1 in
  let sym2 = sym_of_parser p2 in
  let SS(s,i,j) = i0.ss4 in
  (* call oracle on substring SS(s,i,j), get back list of k s *)
  let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
  (* for each k we do the following *)
  let f1 k = (
    (* call p1 on SS(s,i,k) ... *)
    let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
    (* ... and p2 on SS(s,k,j) .. *)
    let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
    (* and combine the results *)
    list_product rs1 rs2)
  in
  List.concat (List.map f1 ks))

What's missing from these slides

Experiments (on version 2013-05-15)

Performance comparison with Happy

Performance comparison with Happy

Performance

State of play

Conclusion