Dr Tom Ridge
2013-11-08
Simple arithmetic expressions:
E -> E "+" E
E -> ?num?
or
E -> E "+" E | ?num?
Terminology: terminal, nonterminal, symbol, rule, grammar
N.B. RHS of rules cannot be the empty list
Each terminal is associated with a set of strings
Parse trees are trees which are well-formed according to the grammar, and where the strings associated with terminal nodes are those associated to the terminal at that node
Given a grammar, and an input string, parsing is the process of determining (somehow, and in some form) what parse trees correspond to the input.
There may be exponentially many parse trees:
E -> E E | "1"
There may be infinitely many parse trees:
E -> E | ""
Represent rules for a nonterminal:
E -> "(" E "+" E ")"
| "1"
By a recursive function:
(* following code ignores results and returns a dummy value *)
let rec parse_E i = (
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")"))
||| (a "1"))
i
Given an input string s
, a parser returns a list of things of the form (v,s')
, where v is the result of parsing a prefix of s
, and s'
is the remainder of the input that wasn't parsed.
This is prefix parsing.
e.g. following used for (a "1")
or more generally (a s)
let a s = (fun i ->
if starts_with i s then
let (u,v) = split i (String.length s) in
[(u,v)]
else
[])
let _ = (a "1") "123"
(* [("1", "23")] *)
let _ = (a "1") "abc"
(* [] *)
The combinators |||
and **>
are defined in the obvious way:
let (|||) p1 p2 = fun s -> List.append (p1 s) (p2 s)
let ( **> ) p1 p2 = (fun s0 ->
(* apply p2 to results (e1,s1) from p1 *)
let f (e1,s1) =
p2 s1
|> List.map (fun (e2,s2) -> ((e1,e2),s2))
in
s0
|> p1
|> List.map f
|> List.concat)
Here, x |> f
is just f x
Note these combinators are written infix e.g. p1 **> p2
let _ = ((a "1") **> (a "2")) "123"
(* [(("1", "2"), "3")] *)
There is another function which we need, the "action" function:
let (>>) p f = (fun s ->
s
|> p
|> List.map (fun (e,s) -> (f e, s)))
Note this is written infix e.g. p >> f
let _ = ((a "1") >> (fun _ -> 1))
(* "123" = [(1, "23")] *)
(* E -> "(" E "+" E ")" *)
let rec parse_E i =
let f1 =
((a "(") **> parse_E **> (a "+") **> parse_E **> (a ")")
>> (fun (_,(e1,(_,(e2,_)))) -> e1+e2))
||| ((a "1") >> (fun _ -> 1))
in
f1 i
let _ = parse_E "(1+(1+1))"
(* # - : (int * string) list = [(3, "")] *)
E -> E "+" E | ...
E -> F G | F H
and also because prefix parsing can lead to lots of work which is eventually thrown away
E -> F "X"
F -> A B
In this example, assuming there is no "X"
in the input, there may be substantial work involved in combining the results of parsing an A
followed by parsing a B
, before the parser attempts (and fails) to parse an "X"
.
Combinator parsing has a great programmer interface, but lousy performance, and can't handle all context-free grammars.
Consider a combinator parsing library with the following features:
Handles all context-free grammars
Sound
Complete
Conceptually simple (also, helps if implementation is simple)
Efficient
Real-world performance that matches or improves on the current-best parsers for any given subclass of context-free grammar
For context-free grammar parsing, this set of features, I would argue, is the best you could hope for.
In several recent publications academics have been trying to construct modular parsers, which can be easily combined in different ways.
Example: given a parser for lambda calculus, and a parser for arithmetic expressions, how to construct a parser for the combined language? e.g. parser accepts:
((\ y. y + 1) 2) + 3
In the following few slides: a modular presentation of parsers for 3 different languages, and their combination.
let rec parse_A h = (fun i -> mkntparser "arithexp" (
(* plus expressions *)
(((parse_A h) ***> w ***> (a "+") ***> w ***> (parse_A h))
>>>> (fun (e1,(_,(_,(_,e2)))) -> `Plus(e1,e2)))
|||| ...
(* helper parser *)
|||| h)
i)
Note boilerplate.
The parser w
parses whitespace.
The parser h
is a "helper parser".
Note that this example includes left recursion: you can't do this with traditional combinator parsing.
And we can add in subtraction, numbers and brackets in the obvious way.
let rec parse_A h = (fun i -> mkntparser "arithexp" (
...
(* subtraction *)
|||| (((parse_A h) ***> w ***> (a "-") ***> w ***> (parse_A h))
>>>> (fun (e1,(_,(_,(_,e2)))) -> `Minus(e1,e2)))
(* numbers *)
|||| (parse_num
>>>> (fun s -> `Num(int_of_string (content s))))
(* brackets *)
|||| (((a "(") ***> w ***> (parse_A h) ***> w ***> (a ")"))
>>>> (fun (_,(_,(e,(_,_)))) -> e))
|||| ...)
i)
let txt = "1+2-3"
let _ = (p3_run_parser (parse_A never) txt)
(* [`Minus (`Plus (`Num 1, `Num 2), `Num 3);
`Plus (`Num 1, `Minus (`Num 2, `Num 3))] *)
never
, the helper parser in this example, is a parser that never parses.
Note multiple results - the grammar is ambiguous as well as left-recursive.
Interesting because it goes beyond traditional combinator parsing, but there is more!
let rec parse_B h = (fun i -> mkntparser "boolexp" (
(* true and false *)
((a "true") >>>> (fun _ -> `Bool true))
|||| ((a "false") >>>> (fun _ -> `Bool false))
(* if-then-else expressions *)
|||| (((a "if") ***> w ***> (parse_B h) ***> w ***>
(a "then") ***> w ***> (parse_B h) ***> w ***>
(a "else") ***> w ***> (parse_B h))
>>>> (fun (_,(_,(b,(_,(_,(_,(e1,...))))))) -> `Ite(b,e1,e2)))
(* numeric comparison < *)
|||| (((parse_B h) ***> w ***> (a "<") ***> w ***> (parse_B h))
>>>> (fun (e1,(_,(_,(_,e2)))) -> `Lt(e1,e2)))
|||| h)
i)
Again, note left recursion.
This is just an example; you might choose e.g. to include <
in the arithmetic parser
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
(* lambda: \ x body *)
(((a "\\") ***> w ***> parse_var ***> w ***> (parse_L h))
>>>> (fun (_,(_,(x,(_,body)))) -> `Lam(x,body)))
(* application: u v *)
|||| (((parse_L h) ***> w ***> (parse_L h))
>>>> (fun (e1,(_,e2)) -> `App(e1,e2)))
(* variables: x *)
|||| (parse_var >>>> (fun s -> `Var s)) (* var *)
(* brackets: (e) *)
|||| (((a "(") ***> w ***> (parse_L h) ***> w ***> (a ")"))
>>>> (fun (_,(_,(e,(_,_)))) -> e))
|||| h)
i)
this is a very succinct and direct parser for lambda expressions
there are no tricks: the grammar exactly matches the datatype
We want a parser that can handle the language which includes all features of the previous 3 languages.
We want a parser that can handle the language which includes all features of the previous 3 languages.
let parse_U = (
let rec l i = parse_L h i
and a i = parse_A h i
and b i = parse_B h i
and h i = (??? what goes here???) i
in
l)
What is the missing definition of h
?
Hint: really only one possible answer for h
We want a parser that can handle the language which includes all features of the previous 3 languages.
let parse_U = (
let rec l i = parse_L h i
and a i = parse_A h i
and b i = parse_B h i
and h i = (l |||| a |||| b) i
in
l)
Note the power of combinator parsing.
Note the power of "completeness" - all results are returned; so in particular all parsers return; so we don't need to worry about nontermination.
Note that the parsers l
, a
, b
and h
all parse the same language, indeed, they are the same parsers.
The version of parse_U
on the previous slide is fine, but it can be made faster using memoization.
Memoization: memo f i
is like f i
, but saves the result and reuses it to avoid recomputation.
In the following, the memoization function memo_p3
takes an explicit table argument, in which to store the results of computations.
let parse_memo_U () = (
let tbl = MyHashtbl.create 100 in
let rec l i = memo_p3 tbl (parse_L h) i
and a i = memo_p3 tbl (parse_A h) i
and b i = memo_p3 tbl (parse_B h) i
and h i = memo_p3 tbl (l |||| a |||| b) i
in
l)
Note the power of combinator parsing.
let txt = "\\ x if (1+((\\ y y) 2)) < 4 then true else false"
let _ = (p3_run_parser parse_U txt)
(*
[`Lam
("x",
`Ite
(`Lt (`Plus (`Num 1, `App (`Lam ("y", `Var "y"), `Num 2)), `Num 4),
`Bool true, `Bool false))]
*)
Note nested recursion between grammars (lambda includes arith includes lambda etc)
let rec eval_A h x = (match x with
| `Plus(e1,e2) -> (match (eval_A h e1,eval_A h e2) with
| (`Num x, `Num y) -> `Num(x+y)
| (x,y) -> `Err("eval_A 1",`Plus(x,y)))
| `Minus(e1,e2) -> (match (eval_A h e1,eval_A h e2) with
| (`Num x, `Num y) -> `Num(x-y)
| (x,y) -> `Err("eval_A 2",`Minus(x,y)))
| `Num x -> (`Num x)
| _ -> h x)
Note we plan to evaluate e.g. "x+1", but this evaluator doesn't (and shouldn't) know anything about x, envs etc
let rec eval_L h env x = (match x with
| `App(e1,e2) -> (
let v1 = eval_L h env e1 in
match v1 with
| `Clos(env1,`Lam((x:string),e1)) -> (
let env' = fupdate env1 (x,`Clos(env,e2)) in
eval_L h env' e1)
| _ -> `Err("eval_L 2",`App(v1,e2)))
| `Lam(x,e) -> (`Clos(env,`Lam(x,e)))
| `Var(x) -> (eval_L h env (env x))
| `Clos(env,e) -> (eval_L h env e)
| _ -> h env x)
Note need for an env
.
Note how env is passed to the helper function: the evaluator for arithmetic expressions e.g. x + 2
needs to know the value of x
(but eval_A
doesn't know anything about an env
!)
The point is: the helper for eval_A
needs to know the env
, not eval_A
itself
let eval =
let rec l env x = eval_L a env x
and a env x = eval_A (b env) x
and b env x = eval_B (eval_err env) x
and eval_err env x = (match x with
| `Err x -> `Err x
| _ -> (l env x))
in
l
let parse_and_eval txt = (
p3_run_parser
((parse_memo_U ()) >>>> eval empty_env)
txt)
(* y is a lambda calculus fixpoint combinator *)
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma n = n + n-1 + ... + 1 *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt
(* `Num 15 *)
It's about understanding what the behaviour is...
let rec parse_MANY p = (fun i -> (mkntparser ... (
(((a "")) >>>> (fun _ -> []))
|||| ((p ***> (parse_MANY p)) >>>> (fun (x,xs) -> x::xs))))
i)
let parse_LIST bra elt sep ket = (
let sep_elt = (sep ***> elt) >>>> (fun (_,e) -> e) in
((bra ***> ket) >>>> (fun _ -> []))
|||| ((bra ***> elt ***> parse_MANY(sep_elt) ***> ket)
>>>> (fun (_,(e,(es,_))) -> e::es)))
let _ = p3_run_parser (parse_LIST (a "[") (a "x") (a ";") (a "]")) "[x;x;x]"
let _ = p3_run_parser (parse_LIST (a "") (a "x") (a "") (a "")) "xxx"
(* i.e. ... = parse_MANY (a "x") *)
Support for equational reasoning about parsers: (a "") ***> p =~ p
etc
Distinguish between parsing time, and time to execute the actions
Efficient, practical parsers execute in time $O(n^3)$ (where n is the length of the input)
Real-world performance also needs to be acceptable
For every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run. (Perlis)
Semantics of parsing
Key ideas, and a bit of implementation code
Some measurements
The state of play
Conclusion
Completeness (almost) means: every parse tree that is well-formed according to the grammar is returned.
Problem: there may be an infinite number of parse trees
e.g.
E -> E E E | "1" | ""
E E E
| / | \ / | \
"1" E E E "" E ""
| | | / | \
"" "1" "" E E E
| | |
"" "1" ""
What causes the number of parse trees to be infinite?
E
/ | \
. . .
. . .
. . .
. . .
. . .
/ | \
""..."" E ""...""
In general you need: a nonterminal (E say) and a sequence of productions
E ->+ "" ... "" E "" ... ""
Bad parse trees are such as those in the previous slide
Good parse trees are those that do not contain a bad subtree
There are only a finite (but typically very large) number of good parse trees
They embody all parsing information
It is possible to motivate good parse trees in several other ways, but that is beyond the scope of this talk
If we restrict to good parse trees, it is possible to construct combinator parsers that are sound and complete (CPP'11).
At the implementation level we use a parsing context. Parsing contexts eliminate bad trees.
A parsing context is a set of pairs (X,s), where X is a nonterminal and s is a string
The meaning of the context is something like "A parse of nonterminal X on string s is already in process"
If you attempt to parse the same nonterminal on the same string, you abandon the parse.
Somewhat more complicated than this because combinator parsing is prefix parsing
For details, see CPP'11 paper
Sound, complete, mechanized proof in HOL4
Executable, usable as a reference parser
My current view on "semantics of parsing" ☜
Efficiency: parsing worst case $O(n^5)$, but it often achieves this worst case ☹ ☜
We can handle context-free grammars ☺
Takes a long time ☹
Thinks... thinks...
Conclusion: difficult to preserve simplicity of combinator parsing and also achieve good performance (but sophisticated implementation techniques such as trampolining may get there - see GLL)
The problem now is to reconcile an interface based on combinator parsing with an existing technique such as Earley parsing
Another problem is to implement Earley parsing correctly and efficiently (I won't discuss this further in this talk)
let rec parse_E = (fun i -> (mkntparser "E" (
((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
|||| ((a "1") >>>> (fun _ -> 1))
|||| ((a "") >>>> (fun _ -> 0))))
i)
let g = [
("E",[NT "E";NT "E";NT "E"]);
("E",[TM "1"]);
("E",[TM "eps"])]
Picture (SD) of proposed approach ☜
If the programmer writes code such as parse_E
there is no obvious way to get a grammar such as g
Somehow interpret the combinators in different ways
In one way, we can write the function grammar_of_parser
such that
grammar_of_parser parse_E = [
("E",[NT "E";NT "E";NT "E"]);
("E",[TM "1"]);
("E",[TM "eps"])]
parse_E
appears to behave as a standard combinator parsertype ('a,'b,'c) sum3 = Inl of 'a | Inm of 'b | Inr of 'c
Think of 3 components: left, middle and right
A parser is a function with the following type
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
inl
, it gives a result of type outl
, etctype inl = unit
type outl = symbol
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
Given an argument Inl()
, a parser returns a value of form Inl(sym)
parse_E (Inl()) = Inl(NT("E"))
Define the following function
let sym_of_parser p = (dest_inl (p (Inl ())))
type mid = {
rules7: parse_rule list;
tmparsers7: (term,raw_parser) fmap
}
type inm = mid
type outm = mid
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
Given an argument of type Inm mid
, a parser (such as parse_E
) returns a value of form Inm mid'
, where mid'
is mid
extended with all grammar rules "reachable" from the relevant symbol (such as E
). Must be careful about termination.
Also collects all the terminal parsers. Example...
let grammar_of_parser p = (dest_inm (p (Inm empty_mid)))
Then
grammar_of_parser parse_E = {
rules7=[("(E*E)", Seq (NT "E", NT "E"));
("(E*(E*E))", Seq (NT "E", NT "(E*E)"));
("((E*(E*E))+1)", Alt (NT "(E*(E*E))", TM "1"));
("(((E*(E*E))+1)+eps)",Alt(NT "((E*(E*E))+1)",TM "eps"));
("E", Atom (NT "(((E*(E*E))+1)+eps)"))];
...
}
We can extract the grammar from a parser.
We feed this to an Earley parser, and we get back information about the parse.
This information needs to be in a compact form (e.g. not parse trees!). I use an oracle.
A top-down parser:
The oracle answers the following problem:
Given a rule $S \rightarrow A B$ in $\Gamma$ and a substring $s_{i,j}$, what are the indexes $k$ such that $\Gamma \vdash A \rightarrow^* s_{i,k}$ and $\Gamma \vdash B \rightarrow^* S_{k,j}$ ?
In code:
type ty_oracle = (symbol * symbol) -> substring -> int list
A substring is $s_{i,j}$, or SS(s,i,j)
in code
type inr = {
ss4: substring;
lc4: local_context;
oracle4: ty_oracle
}
type 'a outr = 'a list
type input = (inl,inm,inr) sum3
type 'a output = (outl,outm,'a outr) sum3
type 'a parser3 = (input -> 'a output)
The right component corresponds to the usual combinator parser. The ss4
field is the input string, lc4
is the parsing context (to eliminate bad trees) and oracle4
is the parsing information returned by the Earley parser.
What happens when executing (p1 ***> p2) (Inr i0)
?
let seqr p1 p2 = (fun i0 ->
(* get symbol for each parser *)
let sym1 = sym_of_parser p1 in
let sym2 = sym_of_parser p2 in
let SS(s,i,j) = i0.ss4 in
(* call oracle on substring SS(s,i,j), get back list of k s *)
let ks = i0.oracle4 (sym1,sym2) i0.ss4 in
(* for each k we do the following *)
let f1 k = (
(* call p1 on SS(s,i,k) ... *)
let rs1 = dest_inr (p1 (Inr { i0 with ss4=(SS(s,i,k)) })) in
(* ... and p2 on SS(s,k,j) .. *)
let rs2 = dest_inr (p2 (Inr { i0 with ss4=(SS(s,k,j)) })) in
(* and combine the results *)
list_product rs1 rs2)
in
List.concat (List.map f1 ks))
Complete definitions of combinators
Code for the Earley parser (surprisingly difficult to get right)
We want to compare the performance of my approach (P3) with existing approaches; none of the existing approaches have a combinator interface
First problem: most parsers cannot actually parse these grammars; if they can, they typically bug out with very small inputs
Haskell Happy parser is the only one I found that allowed me to run a respectable number of tests
aho_s
and E_EEE
, P3 clearly outperforms Happy, e.g. for aho_s
There is (probably) a performance bug in Happy
On brackets
, Happy bugged out. This is (probably) a functional correctness bug in Happy.
S_xSx
and aho_sml
, P3 is worse than Happy by a small (2-4) factor. Following is S_xSx
:There is also additional overhead after parsing: constructing the oracle and executing the actions
With the released version, constructing the oracle can take about as long as parsing
Executing the actions appears to behave as it should (i.e. best possible performance that could be expected)
The released version does not attain $O(n^3)$ bound (but Happy is even worse)
A new implementation of Earley combines parsing with oracle construction, cuts execution time by a significant factor, and appears to fit the expected bound (actually $O(n^3.ln(n))$).
The initial release is usable. No bugs found so far. Lacks features that would make it really usable (e.g. debugging features)
The new version can work with arbitrary input types, not just strings. So you can use a lexer.
The new version fixes a minor annoyance with memoization.
The new version has significant performance improvements.
The new version has some support for debugging.
I use this technology all the time e.g. to port Netsem spec to Lem
Combinator parsing library; fancy examples
Good performance, competitive with best parsers currently around; oracle architecture means we can swap in faster parsers when they become available; we can even analyse the grammar and choose the fastest parser for the given grammar class
Needs a bit more engineering e.g. to support debugging
Main challenges: good trees; oracle architecture; 3 interpretations of combinators; Earley implementation; correctness and efficiency; tying everything together
I should verify the implementation of Earley
I should test performance on large grammars from restricted classes, against existing parsers for these subclasses.