I have been interested in various aspects of parsing since 2009. My research is particularly focused on parsing general context-free grammars, using combinator parsers. I have published several papers, formalized various proofs about parsers in the HOL4 theorem prover, and produced quite a few OCaml implementations to try out various ideas:
P1 is a combinator parsing library for all context-free grammars (CFGs). It resembles traditional combinator-parsing libraries (both in interface, and in performance) with the additional feature that it can handle arbitrary CFGs. The performance with highly ambiguous left-recursive grammars is not optimal, but can be surprisingly good. For example, on the grammar
E -> E E E | "1" | eps, P1 outperforms the Haskell Happy parser generator.
P3 is a combinator parsing library that can parse all CFGs, whilst offering O(n3) performance (which I consider optimal for real-world usage: there are approaches that can do better than O(n3), but the real-world overhead is currently unacceptable, and the implementation complexity is high).
E3 is an Earley parser which is used by P3.
The downside is that the interface to P4 is more complicated than that for P3.
P1, P3, E3 and P4 can all be obtained from github.
P1 is a combinator parsing library for all CFGs. The implementation is based on the verified parsing work (see below).
The best documentation for P3 internals is the extended version of the following paper: Tom Ridge. Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle. In Proceedings of SLE 2014. ridge14sle.pdf extended version.
P4 is available from github at https://github.com/tomjridge/p4, but there is no documentation at the moment.
P5 is a version of P4 for Scala, available at https://github.com/tomjridge/p5_scala. Currently not in a very polished state.
This work produced a verified, sound and complete approach to combinator parsing. This introduced the idea of "good parse trees". The problem with this approach is that it is too inefficient with left-recursive grammars. There is a post describing verified parsing here.