• Stars
    star
    302
  • Rank 138,030 (Top 3 %)
  • Language
    Scala
  • License
    Apache License 2.0
  • Created over 15 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A parser combinator library based on the GLL algorithm

README

https://travis-ci.org/djspiewak/gll-combinators.svg?branch=master

GLL Combinators is a framework designed to implement the GLL parsing algorithm (Scott and Johnstone, LDTA 2009) in a functional manner. More specifically, the framework makes use of atomic parser combinators to compose grammars which are then evaluated using the GLL algorithm. The framework provides a syntax for this task which is almost identical to that of the parser combinator framework built into Scala. For example, we can render the classic "parentheses grammar" using GLL combinators:

lazy val expr: Parser[Any] = (
    "(" ~ expr ~ ")"
  | ""
)

As the type annotation implies, expr will reference an instance of type Parser[Any]. That is, an atomic parser which consumes some input and returns a value of type Any. We can invoke this parser against a String input in the following way:

expr("((()))")

This will return a value of type Stream[Result[Any]]. The Result[A] ADT is defined as one of the following (for some type A):

  • Success[A] -- Represents a successful parse and contains the resulting value.
  • Failure -- Represents a failed parse and contains the relevant error message as well as the remainder of the parse stream (the characters which were not consumed).

If any result is successful (i.e. an instance of Success[A]), then no failures will be returned. Thus, the Stream returned will be completely homogeneous, containing either Success or Failure, but not both. A Stream is returned rather than a single value to allow for ambiguity in the grammar (see below).

It's worth mentioning that GLL is a form of recursive-descent parsing. It has all of the advantages of conventional recursive-descent, including an intuitive control flow, arbitrary positioning of semantic actions and superior error messages. In fact, it is the fact that GLL is recursive-descent which allows it to be implemented using atomic combinators. Other algorithms which share the same capabilities as GLL (such as GLR, Earley Parsing, etc) are fundamentally incompatible with the combinator model due to their highly-unintuitive control flow. In GLL parsing, the control flow follows that of the grammar, as it does in traditional parser combinators or any other form of recursive-descent.

Getting Started

Compiled artifacts are pushed to Maven Central with the groupId of com.codecommit and the artifactId of gll-combinators. The most recent stable version is 2.3. The artifacts are cross-published for Scala 2.10, 2.11 and 2.12. If you're using SBT, you can simply copy/paste the following artifact descriptor into your build.sbt:

libraryDependencies += "com.codecommit" %% "gll-combinators" % "2.3"

You should have the following packages available for use:

  • com.codecommit.gll -- The complete public interface for the framework.
  • com.codecommit.util -- A number of useful utility classes used internally. Should not be considered as part of the stable API.

Fundamentals

TODO

Examples

TODO

Advantages

Despite superficial syntactic similarities, the GLL combinators framework does improve somewhat upon the syntax provided by the default Scala framework. For example, we can modify the above grammar with semantic actions using the ^^ combinator:

lazy val expr: Parser[Any] = (
    "(" ~ expr ~ ")" ^^ { _ + _ + _ }
  | ""
)

The key difference here is that GLL combinators does not require semantic actions to be defined as functions of arity-1 (although this is supported). Instead, the semantic action is expected (and type-checked) to be of the same arity as the number of tokens returned by the relevant production. Unlike the default Scala framework, this is made possible by an extremely intricate series of implicit conversions, allowing Scala to automatically infer the argument types of the function. This is what enables use of the "underscore shorthand" as shown above.

GLL combinators are also capable of parsing grammars with certain useful properties, such as left-recursion and even ambiguity. In fact, the GLL algorithm is capable of parsing any context-free grammar, no matter how arcane. As an example, consider this very natural grammar for arithmetic expressions:

lazy val expr: Parser[Int] = (
    expr ~ "*" ~ expr     ^^ { (e1, _, e2) => e1 * e2 }
  | expr ~ "/" ~ expr     ^^ { (e1, _, e2) => e1 / e2 }
  | expr ~ "+" ~ expr     ^^ { (e1, _, e2) => e1 + e2 }
  | expr ~ "-" ~ expr     ^^ { (e1, _, e2) => e1 - e2 }
  | "(" ~> expr <~ ")"
  | "-" ~> expr           ^^ { -_ }
  | """\d+""".r           ^^ { _.toInt }
)

Unfortunately, while this grammar may be very natural, it is also well beyond the capabilities of a traditional parser combinator framework. Specifically, this grammar exhibits both left-recursion and a rather pesky form of ambiguity. For the uninitiated, left-recursion is when a non-terminal -- in this case, expr -- refers to itself as the first token in any one of its productions -- in this case, the productions for multiplication, division, addition and subtraction. Ambiguity is when it is possible for a parser to produce two different values from the same input while still following the rules of the grammar. The expr parser is ambiguous in two ways: first, it doesn't dictate operator associativity ("1 + 2 + 3" could parse as either "(1 + 2) + 3" or "1 + (2 + 3)"); second, it doesn't dictate operator precedence ("1 + 2 * 3" could parse as either "(1 + 2) * 3" or "1 + (2 * 3)").

Now, the updated parser combinator framework in Scala 2.8.0 will be able to handle the left-recursion aspect of this grammar (through the use of a modified form of the Packrat algorithm), but not the ambiguity. This is where the GLL algorithm really begins to shine. Let's imagine that we ran our parser in the following way:

val results = expr("-1 + 2 * 3")

The results value will contain the following Stream [1]:

Stream(Success(-7,), Success(5,), Success(-9,), Success(3,))

These results represent all of the different values which can be produced by following the grammar while parsing the input string "1 + 2 * -3 + 4". The different interpretations are as follows:

Value Interpretation
5 (-1) + (2 * 3)
-9 -(1 + 2) * 3
3 ((-1) + 2) * 3
-9 -((1 + 2) * 3)
-7 -(1 + (2 * 3))

If we were to feed this grammar into the 2.7.4 (or earlier) version of the Scala parser combinator framework, the result would be an immediate infinite loop as the expr parser attempted to consume an expr as the first step in consuming an expr (a well-known problem inherent to recursive-descent). As mentioned earlier, the Scala 2.8.0 version of the framework would do better, parsing the input completely and producing a result. However, this would produce only one of the four possible results (shown above). In other words, even Packrat parser combinators (as are used in Scala 2.8.0) must select a single unambiguous line to follow at the expense of the other possibilities. While this sounds like a good thing, it ultimately imposes some severe limits on the grammars which can be handled.

Ambiguity is essential in fields like natural-language processing, where the language to be parsed may even be inherantly ambiguous. However, it is also extremely useful in other, less escoteric applications. While it is always possible to create an unambiguous grammar for a language which does not have any inherant ambiguity, it is often easier to simply allow for local ambiguity which is resolved later on in the parse.

TODO: I suppose I should come up with an example here. Maybe Haskell?

Critically, GLL does not impose a significant cost when dealing with ambiguous grammars. One would expect that following all possible parse trees in a highly-ambiguous grammar would lead to exponentially long runtimes. However, GLL is able to effectively exploit the same data structure which allows generalized bottom-up parsing algorithms (such as GLR) to function efficiently: the graph-structured stack. Describing this data structure is beyond the scope of this README. Instead, I would refer you to this paper by Masaru Tomita, original creator of GLR and inventor of the graph-structured stack. Suffice it to say that the GSS makes it possible for the GLL combinators framework to parse any grammar in O(n^3) time. This is even better than GLR, which is O(n^4) in the worst case.

Note that Stream is used as a result type (rather than List) to allow users to retrieve only the results which they actually need. Technically, generalized parsing has an exponential lower-bound due to the fact that a parser may need to return an exponential number of results. The O(n^3) performance guarantee offered by GLL is only valid when GLL is being used as a recognizer with a single result value for all parse trails. To get around this problem, the parse process will run only until it reaches the first successful value (or runs out of possible parse trails to attempt). Once it hits this first success, it bundles up the Result[A] along with a thunk representing the remainder of the parse. If you only require a single result, then the remainder of the parse can be discarded, resulting in truly O(n^3) performance in the worst case (likely much faster). If you need all possible results, then you are free to enumerate the entire result Stream, forcing the parse to return all possible values.

Please note that Scala's Stream implementation is highly prone to memory leaks. For example, even if you have already traversed the entire Stream (and thus completed the parse), the data structure will continue to maintain a reference to the transient data structures used during the GLL parse process. It is recommended that you allow the result Stream to go out of scope as quickly as possible. If you need to retain a list of results for any amount of time, you should use the toList method to copy the Stream into a List, rather than simply saving a reference to the Stream.

[1]The "extra" comma in the Success constructors is not a typo, it indicates that the entire stream was consumed by the parse. Without some serious conniptions, this is the default. Any Success which does not consume the entire stream is converted into a Failure prior to return. This is to enforce greedy matching in repetitions (the default for PEGs).

Performance

At the moment, performance is basically non-existent. The GLL algorithm itself is O(n^3) even in the worst case, but there is a high constant factor which is introduced by the framework which makes this quite a bit slower than it sounds. This is significantly better than traditional parser combinators, which are O(k^n) in the worst case (where k is a constant representing the ambiguity of the grammar), but the constant overhead imposed by the framework does make parsing according to the average grammar a somewhat longer affair than the traditional parser combinators or even mainstream bottom-up parsers such as Bison. A good example of poor performance is the MiniML example in the examples/ directory. Another, somewhat pathological example is the following highly-ambiguous grammar:

lazy val s: Parser[String] = (
    "b"
  | s ~ s       ^^ { _ + _ }
  | s ~ s ~ s   ^^ { _ + _ + _ }
)

It takes roughly 18 seconds to run this grammar against an input consisting of the letter b repeated 100 times. If we increase that number to 300, the parser will actually exhaust the available heap space in the default JVM configuration.

The actual performance on the s grammar is demonstrated by the following graph (plotted on a cubic scale). The gray line is y = kx^3 (for some constant k). The blue line was determined emperically from progressively longer runs (starting at strings of length 10 and increasing to length 100) on the s parser shown above. The y axis represents time in milliseconds.

performance.jpg

With all this said, there are very few grammar/input combinations which push the framework to its limit. In fact, for grammars which are LL(1)_, the GLL Combinators framework should actually be faster than traditional parser combinators. For example:

val num = ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9") ^^ { _.toInt }

In order to parse this grammar, traditional parser combinators would require time proportional to the number of alternates (in this case, 10). GLL Combinators are capable of parsing this grammar in linear time (O(n)), which is equivalent to the best LL(k) parsers. This is because the GLL algorithm degrades gracefully to predictive recursive-descent when the grammar (or sub-grammar) is LL(1). Note that GLL also lacks any form of conventional backtracking, which is how it is able to avoid the exponential cases which make naive recursive-descent so problematic.

It is also worth noting that the GLL algorithm is inherantly parallelizable. This means that, given enough processors, GLL should be quite a bit faster (in terms of total parse time) than any conventional bottom-up or top-down parser. The framework does not currently exploit this design property, but the plan is to eventually do so. Essentially, the parse would seamlessly distribute across all available cores. The more ambiguous the grammar, the better the algorithm could parallelize the parse.

Theory

The theoretical underpinnings for GLL are quite interesting, but also beyond the scope of this readme. I would refer you to the original paper by doctors Elizabeth Scott and Adrian Johnstone of Royal Holloway, University of London.

In a nutshell, the algorithm is almost identical to conventional single-token predictive recursive-descent parsing with no backtracking. This technique (recursive-descent) is only capable of handling grammars which are LL(1), meaning no left-recursion, no ambiguity, and no alternates which begin with the same token. The key difference is that GLL uses a trampoline function to dispatch ambiguous alternates. The idea of using a trampoline function to implement mutual tail-recursion in constant stack space is a well-known technique in functional programming (it's at the heart of Scheme's dispatch system). However, GLL is the first (to my knowledge) to apply this idea to text parsing.

The trampoline contains a queue (or stack) of pending alternate productions and their corresponding position in the input stream. Any number of alternates may be pending at any given point in time. These alternates are considered individually and parsed using conventional recursive-descent. That is, until the parsing process hits another ambiguity, at which point the possible alternates are added to the trampoline and control flow is returned to the main loop. This process continues until no further alternates are available.

The entire proceding is saved from exponentially-long runtimes by the graph-structured stack (GSS), a well-known device used in many generalized parsing algorithms. GLL expands slightly upon the original concept of the GSS by allowing for full-blown cycles in the graph structure, symbolizing direct or indirect left-recursion. These cycles effectively take the place of the GOTO operation used by LR parser automata on grammars with non-hidden left-recursion (hidden left-recursion, where the left-recursive production has a nullable non-terminal (one which goes to the empty string) as its first token, is not supported by any of the mainstream LR variants, including the ever-popular LALR).

More Repositories

1

emm

A general monad for managing stacking effects
Scala
205
star
2

parseback

A Scala implementation of parsing with derivatives
Scala
197
star
3

sbt-github-packages

A simple sbt plugin for publishing to GitHub Packages, in the style of sbt-sonatype and sbt-bintray
Scala
174
star
4

shims

Seamless interop layer between cats and scalaz
Scala
174
star
5

anti-xml

The scala.xml library has some very annoying issues. Time for a clean-room replacement!
Scala
169
star
6

cccp

Common Colaborative Coding Protocol
Scala
119
star
7

extreme-cleverness

A set of functional collections created, ported and modified for my tak at NE Scala 2011
Scala
91
star
8

sparse

Generalized, incremental parser combinators for scalaz-stream
Scala
63
star
9

skolems

A microlibrary for Scala encodings of higher-rank quantifiers
Scala
60
star
10

derivative-combinators

An implementation of derivative parsing in the parser combinator framework
Scala
59
star
11

sbt-spiewak

A plugin which represents my personal SBT project baseline
Scala
53
star
12

kitteh-redis

A toy Redis server implemented using pure FP on top of Cats Effect, Fs2, and Scodec
Scala
42
star
13

jedit-modes

A collection of new and modified edit modes for jEdit
34
star
14

scala-bison

A recursive ascent/descent parser generator for Scala
Scala
33
star
15

smock

A utility harness for testing free programs (built on specs2)
Scala
29
star
16

scala-collections

A number of collection implementations for Scala (including a few ported from Clojure)
Scala
27
star
17

async-runtime-benchmarks

Comparative benchmarks between asynchronous runtimes
Scala
27
star
18

scala-stm-proto

An implementation of a software transactional memory framework in Scala
Scala
19
star
19

sillio

Scala
16
star
20

ensime-sidekick

An ENSIME SideKick plugin for jEdit
Scala
16
star
21

linguistic-programming

Sources for my presentation on crafting DSLs in Scala
Scala
10
star
22

activeobjects

A Git fork of the mainline ActiveObjects SVN
Java
10
star
23

base.g8

My cool template
Scala
8
star
24

jbencode

A very fast Java Bencode pull parser/generator
Java
7
star
25

sbt-evil-mode

๐Ÿ˜ˆ
Scala
7
star
26

shapely

Scala
7
star
27

cont-exp

Experiments with the continuation monad
Scala
7
star
28

dbpool

A fork of the excellent DBPool connection pool
Java
5
star
29

wronglisp

Scala
4
star
30

rst2twiki

A converter script from ReStructured Text to TWiki markup
4
star
31

iota-instances

Scalaz instances and things for iotaz coproducts
Scala
3
star
32

parser-workshop

Parser workshop at flatMap(Oslo) 2013
Scala
3
star
33

dist-mapper

A non-functional key/value store backend
Scala
2
star
34

schrodinger

Common communication protocol for IO / Task / Future data types
Scala
2
star
35

jedit-macros

1
star