• Stars
    star
    360
  • Rank 118,215 (Top 3 %)
  • Language
    Haskell
  • License
    BSD 3-Clause "New...
  • Created over 9 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Parsing all context-free grammars using Earley's algorithm in Haskell.

Earley Hackage

Go to the API documentation on Hackage.

This (Text.Earley) is a library consisting of a few main parts:

Text.Earley.Grammar

An embedded context-free grammar (CFG) domain-specific language (DSL) with semantic action specification in applicative style.

An example of a typical expression grammar working on an input tokenised into strings is the following:

   expr :: Grammar r (Prod r String String Expr)
   expr = mdo
     x1 <- rule $ Add <$> x1 <* namedToken "+" <*> x2
               <|> x2
               <?> "sum"
     x2 <- rule $ Mul <$> x2 <* namedToken "*" <*> x3
               <|> x3
               <?> "product"
     x3 <- rule $ Var <$> (satisfy ident <?> "identifier")
               <|> namedToken "(" *> x1 <* namedToken ")"
     return x1
     where
       ident (x:_) = isAlpha x
       ident _     = False

Text.Earley.Parser

An implementation of (a modification of) the Earley parsing algorithm.

To invoke the parser on the above grammar, run e.g. (here using words as a stupid tokeniser):

   fullParses (parser expr) $ words "a + b * ( c + d )"
   = ( [Add (Var "a") (Mul (Var "b") (Add (Var "c") (Var "d")))]
     , Report {...}
     )

Note that we get a list of all the possible parses (though in this case there is only one).

Another invocation, which shows the error reporting capabilities (giving the last position that the parser reached and what it expected at that point), is the following:

   fullParses (parser expr) $ words "a +"
   = ( []
     , Report { position   = 2
              , expected   = ["(","identifier","product"]
              , unconsumed = []
              }
     )

Text.Earley.Generator

Functionality to generate the members of the language that a grammar generates.

To get the language of a grammar given a list of allowed tokens, run e.g.:

   language (generator romanNumeralsGrammar "VIX")
   = [(0,""),(1,"I"),(5,"V"),(10,"X"),(20,"XX"),(11,"XI"),(15,"XV"),(6,"VI"),(9,"IX"),(4,"IV"),(2,"II"),(3,"III"),(19,"XIX"),(16,"XVI"),(14,"XIV"),(12,"XII"),(7,"VII"),(21,"XXI"),(25,"XXV"),(30,"XXX"),(31,"XXXI"),(35,"XXXV"),(8,"VIII"),(13,"XIII"),(17,"XVII"),(26,"XXVI"),(29,"XXIX"),(24,"XXIV"),(22,"XXII"),(18,"XVIII"),(36,"XXXVI"),(39,"XXXIX"),(34,"XXXIV"),(32,"XXXII"),(23,"XXIII"),(27,"XXVII"),(33,"XXXIII"),(28,"XXVIII"),(37,"XXXVII"),(38,"XXXVIII")]

The above example shows the language generated by a Roman numerals grammar limited to the tokens 'V', 'I', and 'X'.

Text.Earley.Mixfix

Helper functionality for creating parsers for expressions with mixfix identifiers in the style of Agda.

How do I write grammars?

As hinted at above, the grammars are written inside Grammar, which is a Monad and MonadFix. For the library to be able to tame the recursion in the grammars, we have to use the rule function whenever a production is recursive.

Whenever you would write e.g.

...
p = foo <|> bar <*> p
...

in a conventional combinator parser library, you instead write the following:

grammar = mdo
  ...
  p <- rule $ foo <|> bar <*> p
  ...

Apart from making it possible to do recursion (even left-recursion), rules have an additional benefit: they control where work is shared, by the rule that any rule is only ever expanded once per position in the input string. If a rule is encountered more than once at a position, the work is shared.

Compared to parser generators and combinator libraries

This library differs from the main methods that are used to write parsers in the Haskell ecosystem:

  • Compared to parser generators (YACC, Happy, etc.) it requires very little pre-processing of the grammar. It also allows you to stay in the host language for both grammar and parser, i.e. there is no use of a separate tool. This also means that you are free to use the abstraction facilities of Haskell when writing a grammar. Currently the library requires a linear traversal of the grammar's rules before use, which is usually fast enough to do at run time, but precludes infinite grammars.

  • The grammar language is similar to that of many parser combinators (Parsec, Attoparsec, parallel parsing processes, etc.), providing an applicative interface, but the parser gracefully handles all finite CFGs, including those with left-recursion. On the other hand, its productions are not monadic meaning that it does not support context-sensitive or infinite grammars, which are supported by many parser combinator libraries.

    Note: The Grammar type is a Monad (used to provide observable sharing) but it lives a layer above productions. It cannot be used to decide what production to use depending on the result of a previous production, i.e. it does not give us monadic parsing.

The parsing algorithm

The parsing algorithm that this library uses is based on Earley's parsing algorithm. The algorithm has been modified to produce online parse results, to give good error messages, and to allow garbage collection of the item sets. Essentially, instead of storing a sequence of sets of items like in the original algorithm, the modified algorithm just stores pointers back to sets of reachable items.

The worst-case run time performance of the Earley parsing algorithm is cubic in the length of the input, but for large classes of grammars it is linear. It should however be noted that this library will likely be slower than most parser generators and parser combinator libraries.

The parser implements an optimisation similar to that presented in Joop M.I.M Leo's paper A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead, which removes indirections in sequences of non-ambiguous backpointers between item sets.

For more in-depth information about the internals of the library, there are implementation notes currently being written.

Contact

Olle Fredriksson - https://github.com/ollef

More Repositories

1

sixten

Functional programming with fewer indirections
Haskell
757
star
2

sixty

Dependent type checker using normalisation by evaluation
Haskell
253
star
3

Bidirectional

Haskell implementation of Dunfield and Krishnaswami's "Complete and easy bidirectional typechecking for higher-rank polymorphism"
Haskell
123
star
4

rock

Build system
Haskell
114
star
5

braces-be-gone

Get those pesky braces out of your face
Haskell
49
star
6

Generate-C

Embedded C code generation DSL for Haskell.
Haskell
27
star
7

dependent-hashmap

Dependent hash maps
Haskell
14
star
8

rope-utf16-splay

Thick strings optimised for indexing and updating using UTF-16 code units and row/column pairs
Haskell
14
star
9

parsix

Adventures in parser combinators
Haskell
10
star
10

region

Adventures in region inference
Haskell
8
star
11

mirrored-keyboard-layouts

Mirrored XKB layouts for one-handed typing
8
star
12

Grempa

Embedded grammar DSL and LALR parser generator
Haskell
8
star
13

navm-hs

Not a virtual machine
Haskell
6
star
14

navm

Not A Virtual Machine
Rust
6
star
15

environment-bench

Benchmarking compiler representations of variable environments
Haskell
5
star
16

incrementalism

Haskell
4
star
17

Fuse.JSON

Convert various things (JavaScript values, Objective-C objects) to and from JSON in Uno
Uno
4
star
18

oslo-haskell

Exercises
Haskell
4
star
19

packrat

Adventures in packrat parsing
Haskell
3
star
20

rope-utf16

Haskell
3
star
21

ibuild

Indexed build systems a la carte
Haskell
2
star
22

FunSharp

Learning F#, nothing to see here
F#
2
star
23

vim-svorsk

Vim for Swedes in Norway
Vim Script
2
star
24

paws

Utility for automatically pausing and resuming music at appropriate times.
C
2
star
25

fenwick

Self-balancing Fenwick trees with logarithmic time monoidal prefix sums
Haskell
2
star
26

by-value

Polymorphism without pointer indirections in C++, allowing you to store and pass subclass objects by value
C++
1
star
27

llvm-irbuilder

IRBuilder for LLVM Haskell Bindings ( Experimental )
Haskell
1
star
28

system-rea

Dad made me do it
Haskell
1
star
29

deriving-gcompare

Derive instances for GEq and GCompare from the dependent-sum package
Haskell
1
star
30

ollef.github.io

HTML
1
star
31

postgres-woobat

Haskell
1
star
32

rope-bench

Haskell
1
star
33

satire

Satisfaction
Rust
1
star