• Stars
    star
    241
  • Rank 167,643 (Top 4 %)
  • Language
    Clojure
  • License
    Eclipse Public Li...
  • Created over 10 years ago
  • Updated about 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Regexp for sequences!

Seqexp

Seqexp: regular expressions for sequences (and other sequables)!

Seqexp salient features are:

  • linear runtime
  • lazy-friendly (doesn't consume the whole seq when possible)
  • named groups using as (in addition to default groups: :match and :rest)
  • greedy (*, +, ?, repeat) and reluctant quantifiers (*?, +?, ??, repeat?)
  • separated quantifiers (*', +' and repeat' -- also available in reluctant versions): they take an additional separator expression
  • short and sweet codebase (~220LOC)

Usage

Add this dependency to your project.

[net.cgrand/seqexp "0.6.2"]

Require it as:

(ns playground
  (:require [net.cgrand.seqexp :as se]))

And have fun:

;; match as many odd numbers as possible and ends the match with 7
=> (se/exec
     (se/cat (se/* odd?) 7)
     [1 3 3 7 7 9])

{:rest (9), :match (1 3 3 7 7)}

(exec returns either nil on failure or a map of group names to matched sub-sequences. They are two special groups: :match and :rest, corresponding to the matched sub sequence and the rest of the input sequence.)

Named groups are introduced with as:

=> (se/exec
        (se/cat (se/as :odds-before-7 (se/* odd?)) 7)
        [1 3 3 7 7 9])

{:rest (9), :odds-before-7 (1 3 3 7), :match (1 3 3 7 7)}

What about the difference between greedy and reluctant quantifiers?

; greedy *
=> (se/exec
        (se/* (se/cat (se/as :odds-before-7 (se/* odd?)) 7))
        [1 3 3 7 1 5 5 7])

{:rest (), :odds-before-7 (1 3 3 7 1 5 5), :match (1 3 3 7 1 5 5 7)}

; reluctant *?
=> (se/exec
        (se/* (se/cat (se/as :odds-before-7 (se/*? odd?)) 7))
        [1 3 3 7 1 5 5 7])

{:rest (), :odds-before-7 (1 5 5), :match (1 3 3 7 1 5 5 7)}

In the above example we see that in the first case the inner * matches as much as possible while in the second case *? matches as little as possible.

Greediness affects only submatches and never changes the whole match: flipping a quantifier from greedy to reluctant (or the other way round) is never going to change what is returned under :match and :rest, it only affects groups.

Now, a more complex example:

;; match a defn-like body (including name, optional docstring, optional metadata
;; and multiple arities)
=> (def args+body (se/cat vector? (se/* se/_)))

#'playground/args+body

=> (se/exec
     (se/cat
       (se/as :name symbol?)
       (se/? (se/as :docstring string?))
       (se/? (se/as :meta map?))
       (se/|
         (se/as :body args+body)
         (se/as :bodies
           (se/+ (partial se/exec args+body)))))
     '(fn-name "some-doc" {:meta :data}
        ([a] ...)
        ([a b] ...)))

{:rest (), :match (fn-name "some-doc" {:meta :data} ([a] ...) ([a b] ...)),
 :bodies (([a] ...) ([a b] ...)),
 :name (fn-name),
 :docstring ("some-doc"),
 :meta ({:meta :data})}

Lazy-friendliness:

;; proof that the whole infinite seq is not consumed
=> (:match (se/exec (se/* #(< % 10)) (range)))

(0 1 2 3 4 5 6 7 8 9)

Implementation details

Seqexp uses a stackless VM with 5 opcodes (PRED, JUMP, FORK>, FORK< and SAVE), all taking one argument.

The VM is multithreaded (these are not real threads) and all threads are run in lockstep.

The state of a thread consists only of its PC (program counter) and its set of registers (two for each named group). At any time there can't be more than one thread with the same PC.

This approach is used by Pike and Janson in the sam editor and is documented here.

PRED pred-fn

pred-fn is a predicate function which gets called on the current element.

If the predicate succeeds, the thread continues to the next instruction and to the next element of the input sequence. When the predicate fails the thread is terminated.

JUMP address

Performs a relative jump, address is the relative address.

FORK> address and FORK< address

Forks the current thread in two threads, one thread will continue to the next instruction while the other will performs a relative jump to the specified address.

The difference between FORK> and FORK< is the relative priority of the two resulting threads: FORK> gives priority to the continuing thread while FORK< gives priority to the jumping thread.

It should be noted that the only effect of this priority is submatch selection: when a match is found, the highest-priority thread amongst the successful threads gets to pick the submatches (groups). It follows that priority can't change the whole match.

SAVE register-name

Saves the current position to the specified register.

Example

For example, (se/cat (se/* odd?) 7) compiles down to:

FORK> 3
PRED  odd?
JUMP  -2
PRED  #(= 7 %)

Lookahead support

As of 0.6.0, the Pike and Janson's algorithm has been extended to support lookaheads.

Two versions of the VM now exists (one could do with only one): the top-level VM which is the grouping VM and the nested accepting VM which is simpler because it ignores registers (ie it doesn't track match or submatches: it just tells whether an accept state has been reached or not). Not having to track submatches mean than thread priority is not needed any more so the whole state of a nested VM is the set of its threads ids.

Previously threads were identified by their PC alone, now with lookahead support, each thread is identified by a pair: its PC and a nested VM state. Nested VMs support lookaheads too (so you can put lookaheads in your lookaheads) so nested thread ids are pairs of PC and a nested VM state.

thread-id = [pc, nested-vm-state]
nested-vm-state = set of thread-id
top-level-vm-state = (prioritized) map of thread-id to register-bank 

Despite the above definitions the state space is still bounded because the nesting depth is bounded by construction (pathological code exists but can't be produced by a regex/seqexp) so the complexity is still linear with the size of the input.

One new opcode is added: NLA.

NLA addr spawns thread [(inc pc) #{}] in the nested VM while the current VM jumps at addr.

The nested VM is used to perform negative lookahead: if it reaches an accept state (PC at the end of the program and empty nested VM state) then the current thread is killed.

A given thread only needs one nested VM even if it traverses several negative lookaheads because:

main AND NOT nla1 AND NOT nla2 = main AND NOT (nla1 OR nla2)

Surprisingly, positive lookahead comes for free from the decision to support nested lookaheads:

main AND pla = main AND NOT (NOT pla)
main AND pla = main AND NOT (true AND NOT pla)

So a positive lookahead is just a negative lookahead inside another negative lookahead!

Changes

0.6.0

  • Lookahead operators ?! and ?=.

0.5.1

  • Dynamic cycle detection:
;; before
=> (exec (* (*? _))
    [:c :a :b :b :a :c :a :b :b :b :a])
StackOverflowError   net.cgrand.seqexp/add-thread (seqexp.clj:254)
;; now
=> (exec (* (*? _))
    [:c :a :b :b :a :c :a :b :b :b :a])
{:rest nil, :match (:c :a :b :b :a :c :a :b :b :b :a)}

0.5.0

  • lift-tree renamed to exec-tree and the original return value is now nested under a :match key. Similarly to exec, there's now a :rest key along the :match key.

0.4.0

  • replacement of the Register protocol by the RegisterBank one, which allows to implement lift-tree
  • lift-tree which in conjuction with hierarchical group names allows to recreate a tree out of groups:
=> (se/lift-tree
     (se/* 
       (se/as [:sections]
         (se/cat :h1
           (se/as [:sections :ps]
             (se/+ :p)))))
     [:h1 :p :h1 :p :p :p])

[{:match (:h1 :p :h1 :p :p :p),
  :sections
  [{:match (:h1 :p), :ps [{:match (:p)}]}
   {:match (:h1 :p :p :p), :ps [{:match (:p :p :p)}]}]}]

0.2.0

  • (repeat n e) now matches e repeated exactly n times (and not upto n times)
  • separated variants of *, + and repeat: *', +' and repeat' (reluctant versions: *'?, +'? and repeat'?).

License

Copyright © 2014 Christophe Grand

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

More Repositories

1

enlive

a selector-based (à la CSS) templating and transformation system for Clojure
Clojure
1,619
star
2

xforms

Extra transducers and reducing fns for Clojure(script)
Clojure
573
star
3

moustache

a micro web framework/internal DSL to wire Ring handlers and middlewares
Clojure
261
star
4

enliven

Enlive next: faster, better, broader
Clojure
243
star
5

parsley

a DSL for creating total and truly incremental parsers in Clojure
Clojure
200
star
6

macrovich

A set of three macros to ease writing `*.cljc` supporting Clojure, Clojurescript and self-hosted Clojurescript.
Clojure
163
star
7

sjacket

Structural code transformations for the masses.
Clojure
114
star
8

megaref

STM ref types that allow for more concurrency on associative values.
Clojure
95
star
9

spreadmap

Evil project to turn excel spreadsheets in persistent reactive structures.
Clojure
89
star
10

regex

a regex DSL for those who prefer verbose composable regexes to terse ones
Clojure
88
star
11

poucet

trace as data for Clojure/JVM
Clojure
86
star
12

confluent-map

A persistent confluent map for Clojure
Java
39
star
13

packed-printer

Compact pretty printer
Clojure
37
star
14

sqrel

The SQL library that won't drive you nuts.
Clojure
37
star
15

utils

useful functions and extensible macros
Clojure
36
star
16

indexed-set

A set implementation which supports unicity constraints and maintains summaries (indexes).
Clojure
31
star
17

replay

Instant test suites from repl transcript.
Clojure
26
star
18

cljs-js-repl

Upgradable self hosted clojurescript repl
Clojure
21
star
19

parsnip

parsley is dead, long live parsnip!
Clojure
17
star
20

boring

A tunnel-boring library
Clojure
11
star
21

enlivez

Clojure
8
star
22

dynvars

dynamic bindings in callback hell
Clojure
5
star
23

sandbox

expeiments, works in progress etc.
Clojure
4
star
24

advent2017

Clojure
2
star
25

trainings.geneva.2012

Ressources issues des formations Clojure données du 13 au 15 mai 2012
Clojure
2
star
26

berlin-tron

Clojure
1
star