Dr Tom Ridge
2014-11-21
Some things that I won't have time to talk about:
Simple arithmetic expressions:
E -> E "+" E
E -> ?num?
or
E -> E "+" E | ?num?
Terminology: terminal, non-terminal, symbol, rule, grammar
This is essentially a language for describing a set of (labelled, finite) trees:
Parse trees are trees from the previous slide, but matched against an input string.
Each terminal is associated with a set of strings e.g.
?num?
is associated with the set of strings {"0","1","1234",...}
"+"
is associated with the set of strings {"+"}
""
or eps
is associated with the set of strings {""}
; a rule might be E -> eps
indicating that E can be expanded to match the empty stringExample parse tree:
Grammar: E -> E "+" E | ?num?
Input string: "1+2"
N.B. With traditional defns of parse trees, terminals correspond to "characters" (the epsilon terminal is somehow special). But this is not modular: with our definitions, terminal parsers can be arbitrary user-supplied functions.
Given a grammar, and an input string, parsing is the process of determining (somehow, and in some form) what parse trees correspond to the input.
There may be exponentially many parse trees:
Grammar: E -> E E | "1" Input: "11...1" (of length n)
There may be infinitely many parse trees:
Grammar: E -> E | "" Input: ""
parsing
-> parse results (e.g. SPPFs, typically for internal use)
-> apply actions to get final semantic results which are returned to user
E -> E E {{ fun (x,y) -> x+y }}
| "1" {{ fun _ -> 1 }}
"" ~> []
"1" ~> [1]
"111" ~> [3]
E -> E E E {{ fun (x,(y,z)) -> `Node(x,y,z) }}
| "1" {{ fun _ -> `Leaf 1 }}
| "" {{ fun _ -> `Leaf 0 }}
"111" ~> nontermination
Worth investigating whether we can slightly relax the notion of completeness, so that we don't talk about "all parse trees" (which leads to nontermination)
Completeness (almost) means: every parse tree that is well-formed according to the grammar is returned.
Problem: there may be an infinite number of parse trees e.g. for grammar E -> E E E | "1" | ""
and input "1"
Note that the fringe (in grey) isn't consuming any of the input, so we can keep stacking extra fringes to make arbitrarily large parse trees.
For infinitely-many parse trees you need: a nonterminal (E say) and a sequence of productions such that E →+ αEβ, where α and β match the empty string
N.B. This is an "if-and-only-if" i.e. it characterizes the situations where there are infinitely-many parse trees
If an input can be parsed, it can be parsed to give a good parse tree
For any grammar, for any input, there are only a finite number of good parse trees (this is not obvious)
Completeness now means: every good parse tree is returned
[...] The latter defines LALR(1) grammar in terms of LALR(1) parser; a grammar is LALR(1) iff its LALR(1) parser is deterministic. It is desirable to have a definition of LALR(1) grammar that does not involve the parser, but we know of no reasonable way to do this. [DeRemer and Pennello, 1982]
// A : B c d | E c f ;
// B : x y ;
// E : x y ;
// Yacc output
5: reduce/reduce conflict (reduce 3, reduce 4) on c state 5
B : x y . (3)
E : x y . (4)
Represent rules for a nonterminal:
E -> "(" E "+" E ")"
| "1"
By a recursive function:
(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
||| (a "1"))
i
e.g. the following is used for (a "1"), or more generally (a s):
let a s = (fun i ->
if starts_with i s then
let (u,v) = split i (String.length s) in
[(u,v)]
else
[])
let _ = (a "1") "123"
(* [("1", "23")] *)
let _ = (a "1") "abc"
(* [] *)
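The helpers starts_with and split used by (a s) are not shown above; here are minimal sketches of the obvious string-based versions (assumptions, not the slides' own code):
(* assumed helpers, sketched here for completeness *)
let starts_with i s =
  String.length i >= String.length s
  && String.sub i 0 (String.length s) = s

(* split a string into its first n characters and the remainder *)
let split i n = (String.sub i 0 n, String.sub i n (String.length i - n))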
The combinators ||| and **> are defined in the obvious way:
let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)
let ( **> ) p1 p2 = (fun s0 ->
(* apply p2 to results (e1,s1) from p1 *)
...)
Note these combinators are written infix e.g. p1 **> p2
let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)
There is another function which we need, the "action" function:
let (>>) p f = (fun s ->
(* apply f to the values returned from (p s) *)
...)
Note this is written infix e.g. p >> f
let _ = ((a "1") ) "123"
(* [("1", "23")] *)
let _ = ((a "1") >> (fun _ -> 1)) "123"
(* [(1, "23")] *)
(* E -> "(" E "+" E ")" | "1" *)
let rec parse_E i =
let f1 =
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")")
>> (fun (_,(e1,(_,(e2,_)))) -> e1+e2))
in
let f2 = ((a "1") >> (fun _ -> 1)) in
(f1 ||| f2) i
let _ = parse_E "(1+(1+1))"
(* # - : (int * string) list = [(3, "")] *)
Arguably the best programmer interface to parsing, because the full power of the host functional language is available
Programmers in functional languages usually write top-down, recursive descent, combinator-based parsers.
E -> E "+" E | ...
Simple, functional, sound and complete parsing for all context-free grammars, Ridge, CPP'11
Idea: if we restrict to good parse trees, then combinator parsing terminates (for any grammar), and we can state and prove some nice theorems
See defn of e.g. soundness, grammar_to_parser and sound_grammar_to_parser
grammar_to_parser was quite difficult
grammar_to_parser terminates; a suitable measure that includes the context has to be found
The parse_E function is called repeatedly on successively decreasing parts of the input: (0,3) (0,2) (0,1) (0,0)
We can handle context-free grammars ☺
Takes a long time ☹
Thinks...
Conclusion: difficult to preserve simplicity of combinator parsing and also achieve good performance (but sophisticated implementation techniques such as trampolining may get there - see GLL)
Our implementation is based on Earley parsing; the implementation can be horribly complicated providing no details intrude on the user's mental model
The problem now is to reconcile an interface based on combinator parsing with an existing technique such as Earley parsing
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
let g = [
("E",[NT "E";NT "E";NT "E"]);
("E",[TM "1"]);
("E",[TM "eps"])]
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
let g = grammar_of parse_E
let parse_results = run_earley_parser g s
let final_results = run_actions parse_E parse_results
let final_results = run_parser parse_E s
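Presumably run_parser is just the composition of the two stages; a sketch of how the pieces might fit together (not the library's own code):
(* sketch: run_parser as grammar extraction, then Earley parsing,
   then applying the actions *)
let run_parser p s =
  let g = grammar_of p in
  let parse_results = run_earley_parser g s in
  run_actions p parse_results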
A very clean and simple idea! The basic idea, I eventually learned, was first suggested by Ljunglof (2002), although there are earlier papers which use related ideas
Definitions of combinators omitted...
We can extract the grammar from a combinator parser.
We feed this to an Earley parser, and we get back information about the parse.
An obvious question: suppose you have a parser for general CFGs (e.g. an Earley parser); when parsing an input, what should the parser return?
Certainly not parse trees...
The parse information needs to be in a compact form. The traditional solution is SPPFs. SPPFs are a bad choice in my opinion (see next slide).
Instead I use an oracle.
A top-down parser:
The oracle answers the following problem:
Given a rule S → A B in Γ and a substring s_{i,j}, what are the indexes k such that Γ ⊢ A →* s_{i,k} and Γ ⊢ B →* s_{k,j}?
In code:
type ty_oracle = (symbol * symbol) -> substring -> int list
A substring is s_{i,j}, written SS(s,i,j) in code
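A sketch of the assumed substring representation (the library's actual definition is not shown):
(* assumed representation: SS(s,i,j) is the part of string s between
   indexes i and j *)
type substring = SS of string * int * int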
This slide is saying: SPPFs are a terrible choice for representing parse results; an oracle is much better
What happens when executing (p1 ***> p2) (Inr i0)?
let seqr p1 p2 = (fun i0 ->
(* get symbol for each parser *)
let sym1 = sym_of_parser p1 in
let sym2 = sym_of_parser p2 in
let SS(s,i,j) = i0.ss4 in
(* call oracle on substring SS(s,i,j), get back the list of ks *)
let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
(* for each k we do the following *)
let f1 k = (
(* call p1 on SS(s,i,k) ... *)
let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
(* ... and p2 on SS(s,k,j) .. *)
let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
(* and combine the results *)
list_product rs1 rs2)
in
List.concat (List.map f1 ks))
let rec parse_MANY p = (fun i -> (mkntparser ... (
(((a "")) >>>> (fun _ -> []))
|||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
i)
let parse_LIST bra elt sep ket = (
let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
((bra ***> ket) >>>> (fun _ -> []))
|||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket)
>>>> (fun (_,(e,(es,_))) -> e::es)))
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"
let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)
let _ = p3_run_parser (parse_LIST (a "") (a "") (a "") (a "")) ""
parse_LIST (a "") p (a "") (a "") = parse_MANY p
let rec parse_E = (fun i -> mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
|||| ((a "1") >>>> (fun _ -> `LF("1")))
|||| ((a "") >>>> (fun _ -> `LF(""))))
i)
Input size|Number of results (good parse trees)
0 |1
1 |1
2 |3
3 |19
4 |150
...
19 |441152315040444150
Sequence A120590 from the OEIS, see here
Exponential growth; computing results for inputs of size 19 or larger is not feasible
N.B.: exponential number of results means this must take (at least) an exponential amount of time
N.B.: this has nothing to do with parsing, it is due to the actions
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
Input size|Number of results
0 |1
1 |1
2 |1
4 |1
...
19 |1
The semantics is as before: compute the actions over all the (good) parse trees
Very important point: I believe that the only sensible semantics is to apply the actions over all the (good) parse trees; this enables equational reasoning about parsers etc, and is what underpins the theoretical tractability of this approach
This example takes an exponential amount of time. Does it need to? Before it had to, but now it is not so clear...
In the next slide, something that you (probably) can't do with existing tools...
let rec parse_E =
let tbl_E = MyHashtbl.create 100 in
(fun i -> memo_p3 tbl_E (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result. But weren't inputs of length 19 supposed to be infeasible?
Parsing is O(n^3) for arbitrary grammars; if you want to write highly ambiguous grammars then parsing is still O(n^3).
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this doesn't have to take exponential time, providing the parse results are represented in a compact form (the oracle), and the actions are memoized
What this means in practice is that, providing your actions don't cause exponentially many results to be returned, performance is often pretty reasonable (i.e. polytime)
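As an illustration of the memoization involved, here is a minimal generic memoization wrapper in the spirit of memo_p3; the real memo_p3 and MyHashtbl are not shown in the slides, so the standard Hashtbl module stands in for them.
(* sketch only: memoize a function using a hash table; the real
   memo_p3/MyHashtbl are not shown here *)
let memo tbl p = (fun i ->
  match Hashtbl.find_opt tbl i with
  | Some r -> r                  (* cached result *)
  | None ->
    let r = p i in               (* compute once ... *)
    Hashtbl.add tbl i r;         (* ... then cache *)
    r)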
E -> E "+" E | E "-" E | ...
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-assoc + and *; we return an option (Some if the parse is acceptable, None if the parse is not acceptable).
Suppose the input is 1*2+3; this should be parsed as (1*2)+3, not 1*(2+3).
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
| (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
| (Some x,Some y) -> Some(Plus(x,y))
| _ -> None) }}
| E "*" E {{ fun (x,(_,y)) -> (match x,y with
| (Some (Times _),Some _) -> None (* not (x*y)*z ! *)
| (Some (Plus _),Some _) -> None (* not (x+y)*z ! *)
| (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
| (Some x,Some y) -> Some(Times(x,y))
| _ -> None) }}
| ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
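The actions above build values of a result type along these lines; the actual type is not given in the slides, so this is an assumed sketch:
(* assumed result type, inferred from the constructors used in the
   actions above *)
type exp = Num of int | Plus of exp * exp | Times of exp * exp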
Directly encodes which results are acceptable and not acceptable
5+3*2*1+4+1+1+1+1+1+1+1+1+1+1 has an awful lot of parse trees (see here), but can be parsed in a fraction of a second to give a single result.
Not only usable for arithmetic expressions...
Consider grammars that are X (where X is LALR(1) or LR(1) or ...). In general you can't combine two X grammars and get an X grammar.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
We are in a functional, higher-order setting: combinator parsers can be parameterized over other parsers etc etc; this is extremely powerful; and we basically have no restrictions on the grammars we use, or the actions we apply...
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here
(* w is (possibly zero length) whitespace *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))
let rec parse_A h = (fun i -> mkntparser "arithexp" (
(((parse_A h) ***> (a "+") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
|||| (((parse_A h) ***> (a "-") ***> (parse_A h))
>>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
|||| (parse_num
>>>> (fun s -> `Num(int_of_string (content s))))
|||| (((a "(") ***> (parse_A h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
Note the presence of a helper parser...; this parses "all the other types of expressions"
Define parse_B for booleans, and parse_L for lambda expressions:
(* example input: \ x \ y x (y y) *)
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
(((a "\\") ***> parse_var ***> (parse_L h)) (* lam *)
>>>> (fun (_,(x,body)) -> `Lam(x,body)))
|||| (((parse_L h) ***> (parse_L h)) (* app *)
>>>> (fun (e1,e2) -> `App(e1,e2)))
|||| (parse_var >>>> (fun s -> `Var s)) (* var *)
|||| (((a "(") ***> (parse_L h) ***> (a ")")) (* brackets *)
>>>> (fun (_,(e,_)) -> e))
|||| h) (* helper parser *)
i)
let parse_U = (
let rec l i = parse_L h i
and a i = parse_A h i
and b i = parse_B h i
and h i = (l |||| a |||| b) i
in
l)
let parse_and_eval txt = (remove_err (
p3_run_parser
((parse_memo_U ()) >>>> eval empty_env)
txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
N.B. the example expression, e.g. sigma, is a lambda that contains a boolean exp, which contains an arith exp, which contains a lambda exp, which contains an arith exp...; the nesting of the languages is unbounded
Why is this important? We need to stop reinventing the wheel every time, and instead reuse code, including parsers
Performance should be O(n^3) in the worst case. So t(n) ≤ c·n^3, ignoring some initial values of n.
For input size n, we record t(n), and calculate c, assuming t(n) = c·n^3.
If we plot the computed c against n, we expect c to tend towards a constant (assuming the implementation meets the bound!!!)
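As a small illustration of this check, compute c(n) = t(n)/n^3 from measured timings and see whether it stops growing; the sample points below are the P3/s figures for E_EEE from the comparison table later on.
(* sketch of the check: c(n) = t(n)/n^3 should not grow with n if the
   O(n^3) bound holds; sample points are the P3 timings for E_EEE from
   the table below *)
let c_of (n, t) = t /. float_of_int (n * n * n)
let _ = List.map c_of [(20, 0.10); (40, 0.13); (100, 0.50)]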
aho_sml: S -> S S "x" | ""
E_EEE: E -> E E E | "1" | ""
Real-world performance appears O(n^3) across all grammars and inputs, as far as we can observe.
This is the worst case. For many grammars, performance will be much better - e.g. linear when parsing xml
We want to compare the performance of my approach (P3) with existing approaches; none of the existing approaches have a combinator interface
First problem: most parsers cannot actually parse these grammars; if they can, they typically bug out with very small inputs
Haskell Happy parser (based on GLR parsing) is the only one I found that allowed me to run a respectable number of tests
Grammars aho_sml and S_xSx seem to be especially suited to Happy.
Even so, P3 outperforms Happy, e.g. for aho_sml:
aho_sml: S -> S S "x" | ""
There is (probably) a performance bug in Happy, even on these grammars that are well-suited to Happy
N.B. what happened for Happy with input size 800?
Grammars aho_s and E_EEE should favour neither P3 nor Happy.
E_EEE: E -> E E E | "1" | ""
Grammar|Input size|Happy/s|P3/s
E_EEE |020 |0.14 |0.10
E_EEE |040 |6.87 |0.13
E_EEE |100 |2535.88|0.50
Grammar aho_s is similar.
What about grammar brackets?
For grammar brackets, Happy bugged out. This is (probably) a functional correctness bug in Happy.
Grammar |Winner
aho_s |P3 by a mile
aho_sml |P3
brackets|P3 by a mile (competitor failed to turn up)
E_EEE |P3 by a mile
S_xSx |P3
Observed correct asymptotic performance (for S_xSx the input size needs to be quite large)
This is noteworthy - Happy has been under development for 10+ years, by many people, all of whom have reputations as excellent researchers and engineers.
Implementing algorithms correctly and efficiently is hard: this testing produced evidence that there are functional correctness bugs in Happy. It is also possible that there are time-complexity and space-complexity bugs in Happy, but it is hard to be sure because it isn't clear which variant of GLR is implemented, nor what time/space bounds that variant of GLR actually satisfies
By contrast, no bugs have been found in P3 so far. I believe that the current version is both functionally correct, and provably efficient (assuming array-based datastructures).
Combinator parsing with Earley-like real-world performance
Combinator parsing library; many fancy examples (that are impossible with other approaches) in the online resources
Good performance, competitive with best parsers currently around on highly-ambiguous grammars; oracle architecture means we can swap in faster backend parsers when they become available; we can even analyse the grammar and choose the fastest parser for the given grammar class
Main challenges: linking combinator parsing with Earley parsing; Earley implementation; correctness and efficiency; tying everything together
Overall approach is simple, both conceptually and in terms of the implementation code (although Earley parser is quite intricate)
Probably useful for: prototyping; general CFG parsing; modular specification and combination of parsers; equational reasoning about parsers;