Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle.
(Talk at University of Sussex, 50 mins)

Dr Tom Ridge

2014-11-21

Overview

Plan for the talk

Some things that I won't have time to talk about:

Intro

Basics of context-free grammars

Simple arithmetic expressions:

E -> E "+" E 
E -> ?num?

or

E -> E "+" E | ?num?

Terminology: terminal, non-terminal, symbol, rule, grammar

This is essentially a language for describing a set of (labelled, finite) trees:

Basics of parse trees

Parse trees are trees from the previous slide, but matched against an input string.

Each terminal is associated with a set of strings e.g.

Example parse tree:

Grammar: E -> E "+" E | ?num?
Input string: "1+2"

N.B. With traditional defns of parse trees, terminals correspond to "characters" (the epsilon terminal is somehow special). But this is not modular: with our definitions, terminal parsers can be arbitrary user-supplied functions.

Parsing

Given a grammar, and an input string, parsing is the process of determining (somehow, and in some form) what parse trees correspond to the input.

There may be exponentially many parse trees:

Grammar: E -> E E | "1"    Input: "11...1" (of length n)

There may be infinitely many parse trees:

Grammar: E -> E | ""       Input: ""

Actions

parsing 
-> parse results (e.g. SPPFs, typically for internal use) 
-> apply actions to get final semantic results which are returned to user
E -> E E     {{ fun (x,y) -> x+y }}
  | "1"      {{ fun _ -> 1 }}
"" ~> []
"1" ~> [1]
"111" ~> [3]
E -> E E E   {{ fun (x,(y,z)) -> `Node(x,y,z) }}
  | "1"      {{ fun _ -> `Leaf 1 }}
  | ""       {{ fun _ -> `Leaf 0 }}
"111" ~> nontermination

Good parse trees

Semantics of actions when there are infinitely-many parse trees

The problem of infinitely-many parse trees

Note that the fringe (in grey) isn't consuming any of the input, so we can keep stacking extra fringes to make arbitrarily large parse trees.

Bad and good parse trees

  • Bad parse trees have a root node and a descendant node with the same nonterminal and which matches the same string (here "1")

  • Good parse trees are those that do not contain a bad subtree

P1 combinator parsing library and verified parsing

P1 motivation

[...] The latter defines LALR(1) grammar in terms of LALR(1) parser; a grammar is LALR(1) iff its LALR(1) parser is deterministic. It is desirable to have a definition of LALR(1) grammar that does not involve the parser, but we know of no reasonable way to do this. [DeRemer and Pennello, 1982]

//   A : B c d | E c f ;
//   B : x y ;
//   E : x y ;


// Yacc output
5: reduce/reduce conflict (reduce 3, reduce 4) on c state 5
        B : x y .  (3)
        E : x y .  (4)

P1 features

Combinator parsing

Represent rules for a nonterminal:

E -> "(" E "+" E ")" 
  | "1"

By a recursive function:

(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
  ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
  ||| (a "1"))
  i 

Combinator parsing, terminal parsers

e.g. following used for (a "1") or more generally (a s)

let a s = (fun i -> 
  if starts_with i s then
    let (u,v) = split i (String.length s) in
    [(u,v)]
  else 
    [])

let _ = (a "1") "123"
(* [("1", "23")] *)

let _ = (a "1") "abc"
(* [] *)

Combinator parsing, combinators

The combinators ||| and **> are defined in the obvious way:

let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)

let ( **> ) p1 p2 = (fun s0 ->
  (* apply p2 to results (e1,s1) from p1 *)
  ...)

Note these combinators are written infix e.g. p1 **> p2

let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)

Combinator parsing, actions

There is another function which we need, the "action" function:

let (>>) p f = (fun s ->
  (* apply f to the values returned from (p s) *)
  ...)

Note this is written infix e.g. p >> f

let _ = ((a "1")                )  "123"
(* [("1", "23")] *)

let _ = ((a "1") >> (fun _ -> 1))  "123"
(* [(1, "23")] *)

Example

(* E -> "(" E "+" E ")" | "1" *)
let rec parse_E i = 
  let f1 = 
    ((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")") 
      >> (fun (_,(e1,(_,(e2,_)))) -> e1+e2)) 
  in
  let f2 = ((a "1") >> (fun _ -> 1)) in
  (f1 ||| f2) i

let _ = parse_E "(1+(1+1))"
(* # - : (int * string) list = [(3, "")] *)

The glory of combinator parsing

Arguably the best programmer interface to parsing, because the full power of host functional language is available

Programmers in functional languages usually write top-down, recursive descent, combinator-based parsers.

Some problems with combinator parsing

E -> E "+" E | ...

P1

HOL4 verification

Performance

P3

Time to think again

A simple idea

Two different worlds

let rec parse_E = (fun i -> (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)
let g = [
  ("E",[NT "E";NT "E";NT "E"]);
  ("E",[TM "1"]);
  ("E",[TM "eps"])]

The idea in slightly more detail

(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
  |||| ((a "1") >>>> (fun _ -> `LF("1")))
  |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)
let g = grammar_of parse_E
let parse_results = run_earley_parser g s
let final_results = run_actions parse_E parse_results
let final_results = run_parser parse_E s

An obvious question

Earley parsing and the Oracle

A top-down parser:

The oracle answers the following problem:

Given a rule S → AB in Γ and a substring si, j, what are the indexes k such that Γ ⊢ A →  * si, k and Γ ⊢ B →  * Sk, j ?

In code:

type ty_oracle = (symbol * symbol) -> substring -> int list

A substring is si, j, or SS(s,i,j) in code

Why an oracle? Why not SPPFs?

Using the oracle to guide the action phase

What happens when executing (p1 ***> p2) (Inr i0)?

let seqr p1 p2 = (fun i0 -> 
  (* get symbol for each parser *)
  let sym1 = sym_of_parser p1 in
  let sym2 = sym_of_parser p2 in
  let SS(s,i,j) = i0.ss4 in
  (* call oracle on substring SS(s,i,j), get back list of k s *)
  let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
  (* for each k we do the following *)
  let f1 k = (
    (* call p1 on SS(s,i,k) ... *)
    let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
    (* ... and p2 on SS(s,k,j) .. *)
    let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
    (* and combine the results *)
    list_product rs1 rs2)
  in
  List.concat (List.map f1 ks))

Recap

P3 examples

Equational reasoning

let rec parse_MANY p = (fun i -> (mkntparser ... (
  (((a "")) >>>> (fun _ -> []))
  |||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
  i)
let parse_LIST bra elt sep ket = (
  let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
  ((bra ***> ket) >>>> (fun _ -> []))
  |||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket) 
    >>>> (fun (_,(e,(es,_))) -> e::es)))  
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"

let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)

let _ = p3_run_parser (parse_LIST (a "") (a "") (a "") (a "")) ""
parse_LIST (a "") p (a "") (a "") = parse_MANY p

Example: parse trees

let rec parse_E = (fun i -> mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
  |||| ((a "1") >>>> (fun _ -> `LF("1")))
  |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)
Input size|Number of results (good parse trees)
0         |1                                   
1         |1                                   
2         |3  
3         |19
4         |150                                
...
19        |441152315040444150                  

Example: counting

(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)
Input size|Number of results
0         |1                                   
1         |1                                   
2         |1                                   
4         |1                                 
...       
19        |1

Example: memoized counting (*)

let rec parse_E = 
  let tbl_E = MyHashtbl.create 100 in
  (fun i -> memo_p3 tbl_E (mkntparser "E" (
  ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
  |||| ((a "1") >>>> (fun _ -> 1))
  |||| ((a "") >>>> (fun _ -> 0))))
  i)

Example: disambiguation

E -> E "+" E | E "-" E | ...
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>

<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>

<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */

Example: disambiguation (*)

E -> E "+" E           {{ fun (x,(_,y)) -> (match x,y with 
                          | (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
                          | (Some x,Some y) -> Some(Plus(x,y)) 
                          | _ -> None) }}

  | E "*" E            {{ fun (x,(_,y)) -> (match x,y with
                          | (Some (Times _),Some _) -> None (* not (x*y)*z ! *)
                          | (Some (Plus _),Some _) -> None (* not (x+y)*z ! *)
                          | (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
                          | (Some x,Some y) -> Some(Times(x,y)) 
                          | _ -> None) }}
  | ?num?              {{ fun s -> Some(Num(int_of_string (content s))) }}

Example: modular combination of parsers (*)

(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y)) 

let rec parse_A h = (fun i -> mkntparser "arithexp" (
  (((parse_A h) ***> (a "+") ***> (parse_A h))
        >>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
  |||| (((parse_A h) ***> (a "-") ***> (parse_A h))
        >>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
  |||| (parse_num 
        >>>> (fun s -> `Num(int_of_string (content s))))
  |||| (((a "(") ***> (parse_A h) ***> (a ")"))  (* brackets *)
        >>>> (fun (_,(e,_)) -> e))
  |||| h)  (* helper parser *)
  i)
(* example input: \ x \ y x (y y) *)
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
  (((a "\\") ***> parse_var ***> (parse_L h))  (* lam *)
        >>>> (fun (_,(x,body)) -> `Lam(x,body)))
  |||| (((parse_L h) ***> (parse_L h))  (* app *)
        >>>> (fun (e1,e2) -> `App(e1,e2)))
  |||| (parse_var >>>> (fun s -> `Var s))  (* var *)
  |||| (((a "(") ***> (parse_L h) ***> (a ")"))  (* brackets *)
        >>>> (fun (_,(e,_)) -> e))
  |||| h)  (* helper parser *)
  i)
let parse_U = (
  let rec l i = parse_L h i 
  and a i = parse_A h i 
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)
let parse_and_eval txt = (remove_err (
  p3_run_parser 
    ((parse_memo_U ()) >>>> eval empty_env)
    txt))


let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt

Performance evaluation

Some sample grammars

Asymptotic performance

Performance should be O(n3) in the worst case. So t(n) ≤ c.n3, ignoring some initial values of n.

For input size n, we record t(n), and calculate c, assuming t(n) = c.n3.

If we plot computed c against n, we expect c to tend towards a constant (assuming the implementation meets the bound!!!)

Asymptotic performance - aho_sml

aho_sml: S -> S S "x" | ""

Asymptotic performance - E_EEE

E_EEE: E -> E E E | "1" | ""

Asymptotic performance

Real-world performance

Performance comparison with Happy

aho_sml: S -> S S "x" | ""

Performance comparison with Happy

E_EEE: E -> E E E | "1" | ""

Grammar|Input size|Happy/s|P3/s
E_EEE  |020       |0.14   |0.10
E_EEE  |040       |6.87   |0.13
E_EEE  |100       |2535.88|0.50

Performance comparison with Happy

Summary of performance

Grammar |Winner
aho_s   |P3 by a mile
aho_sml |P3
brackets|P3 by a mile (competitor failed to turn up)
E_EEE   |P3 by a mile
S_xSx   |P3

Summary of performance

Conclusion