Date: 2013-12-03
Categories: news; parsing; p3

Implementing algorithms efficiently

I reimplemented the core Earley parser used by P3. The new version, earley3, now appears to meet the desired asymptotic worst-case time complexity of O(n^3 ln n). The following are some observations on implementing algorithms so that they achieve a given complexity bound.

The first point to make is that implementing algorithms to achieve a given bound is hard. It is interesting to observe that, as the code approaches the worst-case bound, even the slightest imperfections tend to show up as observably bad performance. It feels as though you are trying to squeeze a big balloon into a small box: whenever you push on one part of the balloon to get it to fit, another part escapes. My sense is that implementing algorithms so that they have the correct time complexity is harder than achieving functional correctness, and that formalizing a proof that an implementation meets a given bound is going to be extremely difficult.

With Earley, I spent about a day going through the existing code, identifying invariants, which in this setting include not only functional-correctness invariants but also size invariants and time invariants. For example, an invariant might be: the size of every set in the range of this finite map is O(n), and each set can be accessed (looked up in the map) in time O(ln n). I then thought about the code and decided that, from a complexity viewpoint, it was too complicated. This is interesting because I normally write code for functional correctness, but it is now clear that correctness in terms of execution time might also influence how the code is written. I then reorganized the code to make the time cost more transparent, and more closely aligned with an informal argument I had about why the time complexity should be O(n^3 ln n). The result appears to hit the desired time bound, but I can't help feeling slightly nervous, because it took so much effort to get to this point, and I am still not absolutely certain that the code has the correct time complexity.
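To give the flavour of such a size invariant, here is a minimal sketch in Python (the real code is OCaml, and the names `blocked_items` and `check_size_invariant` are hypothetical, not from P3): a runtime check that every set in the range of a finite map has size at most n.

```python
def check_size_invariant(blocked_items, n):
    """Check a size invariant of the kind described above: every set in
    the range of the map has size O(n).

    Here we check the concrete bound |s| <= n; in a real complexity
    argument the bound would be c*n for some fixed constant c.
    """
    return all(len(s) <= n for s in blocked_items.values())

n = 5
# each set here has size k <= n, so the invariant holds
blocked_items = {k: set(range(k)) for k in range(n)}
assert check_size_invariant(blocked_items, n)

# a set of size 2n violates the invariant
blocked_items[0] = set(range(2 * n))
assert not check_size_invariant(blocked_items, n)
```

In a complexity argument, checks like this only validate the size invariants on particular runs; the O(ln n) access-time invariant is a property of the map implementation (e.g. a balanced tree) and has to be argued separately.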

Of course, one way to be sure is to produce a formal proof that the implementation meets the desired time-complexity bound. One barrier to doing this is that informal proofs about time complexity leave out an awful lot of detail, and so tend to be hard to formalize. Reasoning about time complexity is qualitatively different from the reasoning typically found in functional-correctness proofs. I was reading Okasaki's "Purely Functional Data Structures" book recently, where he talks about amortized costs of algorithms, and the Banker's and Physicist's methods. These seem like plausible approaches that could be formalized to give formal guarantees, but my sense is that the task of giving a mechanized proof that an implementation meets a given time bound is still phenomenally difficult. Interesting though!
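A standard example of the kind of amortized reasoning Okasaki discusses is the batched queue, sketched here in Python (his presentation is in Standard ML/Haskell): enqueue is O(1), and dequeue is amortized O(1) even though an individual dequeue may be O(k), because each element is moved from the back list to the front list at most once. In Banker's-method terms, each enqueue deposits one credit, and the occasional reversal spends them.

```python
class BatchedQueue:
    """A queue as two lists; the classic example of amortized analysis."""

    def __init__(self):
        self.front = []  # head of the queue is at the END of this list
        self.back = []   # new elements are appended here

    def enqueue(self, x):
        # O(1); conceptually also deposits one credit to pay for the
        # eventual move of x from back to front
        self.back.append(x)

    def dequeue(self):
        if not self.front:
            # one O(k) reversal, paid for by the k credits deposited by
            # the k enqueues that built up the back list
            self.front = self.back[::-1]
            self.back = []
        return self.front.pop()  # O(1) pop from the end of the list
```

For example, after enqueuing 1, 2, 3, successive dequeues return 1, then 2, then 3, with the single reversal happening on the first dequeue.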

Update on 2013-12-04: it now appears that the reason previous versions seemed not to hit the time bound was due to GC issues (again!). The current version likewise appears not to hit the time bound until you fiddle about with various parameters. Fiddling about with a central datastructure seemed to help as well (although theoretically this was a no-op): I changed a map keyed by pairs of ints into a map from an int to (a map from an int to the value), i.e. a currying-like change. Anyway, the algorithm is now nicely within the time bound. Rewriting the algorithm allowed me to be absolutely clear about how I thought the algorithm should behave, which in turn made it possible to track down the reason it was not behaving as expected.
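The currying-like change can be sketched in Python as follows (the real datastructure is an OCaml map, and the variable names here are purely illustrative): a map keyed by pairs of ints versus the curried form, a map from an int to (a map from an int to the value). The two are semantically interchangeable, but can behave quite differently in practice (key comparison costs, allocation, locality).

```python
from collections import defaultdict

# uncurried form: keys are pairs of ints
uncurried = {}
uncurried[(1, 2)] = "a"
uncurried[(1, 3)] = "b"

# curried form: a map from an int to (a map from an int to the value)
curried = defaultdict(dict)
curried[1][2] = "a"
curried[1][3] = "b"

# semantically a no-op: both forms store the same bindings
assert uncurried[(1, 2)] == curried[1][2]

# but the curried form also gives all entries for a first component in
# one lookup, rather than one comparison of a pair per entry
assert curried[1] == {2: "a", 3: "b"}
```

In the uncurried form every lookup compares a pair of ints; in the curried form each level compares a single int, and all entries sharing a first component live in one inner map.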

Related articles:

  • 2018-07-02 Funded PhD places
  • 2018-06-14 A typed DSL for parsing
  • 2018-05-22 First Python program: an Earley parser!
  • 2017-11-14 New OCaml parsing algorithm: tjr_simple_earley
  • 2017-09-17 Two new OCaml libraries: P0 and tjr-csv
  • 2017-09-06 ICFP most influential paper from 10 years ago
  • 2017-03-16 tjr-btree: a CoW B-tree library in OCaml
  • 2016-06-05 Do not buy TP-Link products
  • 2016-05-27 Visit to Cambridge, for REMS workshop
  • 2016-04-22 Visit to Denmark, to see Joergen Villadsen
  • 2016-02-19 Tree-structured text
  • 2016-02-09 Simple implementation of an Earley-like parsing algorithm
  • 2016-01-27 Talk at York University
  • 2016-01-06 Stephanie Heintz PhD viva
  • 2015-10-05 SibylFS presentation at SOSP'15
  • 2015-09-28 SibylFS SOSP poster
  • 2015-09-21 SibylFS presentation to Industrial Advisory Board
  • 2015-06-28 SOSP 2015 paper acceptance for SibylFS
  • 2015-06-26 P5 scala parsing library
  • 2015-04-21 REMS Cambridge talk on SibylFS
  • 2014-12-19 Parsing the IMAP protocol
  • 2014-12-04 Parsing examples
  • 2014-11-21 Talk on parsing at the University of Sussex
  • 2014-09-26 P1 combinator parsing library for OCaml
  • 2014-09-26 E3 earley parser library for OCaml
  • 2014-09-18 SLE 2014 conference, and Parsing at SLE workshop, slides
  • 2014-09-07 ICFP 2014, OCaml workshop, slides and video
  • 2014-09-07 ICFP 2014, Lem talk
  • 2014-07-11 P3 paper accepted for SLE 2014
  • 2014-07-11 Lem paper accepted to ICFP 2014
  • 2014-04-15 New release of P3 code on github
  • 2014-03-02 New release of P3 code on github
  • 2013-12-16 New release of P3 code on github
  • 2013-12-03 Implementing algorithms efficiently
  • 2013-11-24 Experience of using Lem
  • 2013-11-13 PhD student course on logic, proof and mechanized theorem proving
  • 2013-11-13 New website design
  • 2013-11-13 Engineer position
  • 2013-11-12 Netsem TCP/IP specification on github
  • 2013-11-08 Talk on parsing and P3 given at Cambridge
  • 2013-11-01 Funded PhD positions
  • 2013-04-01 PhD position funded by Microsoft Research
  • 2013-02-01 EPSRC grant funded
  • 2011-12-01 Verified parsing