Pattern
Pattern lets you transform data structures in amazing ways.
The focus is on simplicity and expressivity. It works on immutable persistent data, greatly simplifying development and debugging of transformations.
pangloss/pattern {:git/url "https://github.com/pangloss/pattern"
:sha "<use the current commit hash>"}
Video
See my Nov 2022 talk for London Clojurians
How can this be used?
Here are a few examples of how it's been used already:
- Create simple and clear macros
- Create an infix math macro in about 30 LOC
- Define the simplification rules of a computer algebra system (see a very similar engine in use in SICMUtils)
- Create a Python to Clojure source-to-source converter in under 400 LOC
- Compile Scheme to X86 assembly in about 1500 LOC
The primary tools in Pattern are the rule and rule combinator. To understand how to use those, it's important to first introduce matcher patterns and substitution.
How does it work?
Pattern is an collection of tools for pattern matching and substitution. Those tools can be extended and combined in flexible ways.
The match pattern looks similar to the data structure it's matching on. That enables a very intuitive understanding of how the pattern will apply to the data being matched.
A transformation is a match pattern with a substitution pattern. Combined, that is called a rule.
The substitution pattern uses exactly the same syntax as the match pattern. Substitution supports all of core functionality of the matcher. That makes it much simpler to understand the behavior of a rule.
In the spirit of Clojure, pattern maintains a high degree of simplicity in the interface.
Pattern Matching
(require '[pattern :refer [compile-pattern]])
The core of Pattern is the pattern matcher, which can be used to match any kind of Clojure data structure.
To use the pattern matcher, call (compile-pattern my-pattern)
.
That returns a function that you can call with the data you want to extract matches from.
The return value is a map from match names to matching data.
Here is a simple example that matches the first and last elements of a vector if the central elements are 1 2 3
.
(def m (compile-pattern '[?before 1 2 3 ?after])
(m [0 1 2 3 4]) ; => {:before 0 :after 4}
Even a simple pattern encodes sophisticated matching logic. For instance you could replicate the above matcher with this code. As you can see, it's more complex, less clear in intention, and easier to make a mistake:
(def m #(when (vector? %)
(if (and (= 5 (count %))
(= [1 2 3] (subvec % 1 4)))
{:before (first %)
:after (last %)})))
(m [0 1 2 3 4]) ; => {:before 0 :after 4}
The advantages of these pattern matchers become even more apparent as the complexity of pattern increases.
Unification Within a Pattern
If multiple matchers in a pattern have the same name, the values they match must unify.
Unification increases the sophistication of patterns that can be defined.
(def m (compile-pattern '[?fn-name (fn ?fn-name [??args] ??body)]))
(m ['abc '(fn abc [x] x)]) ; => {:fn-name 'abc ...}
(m ['xyz '(fn abc [x] x)]) ; => nil
Unification works across different matcher types.
The pattern [?list [??list 3]]
, could match [[1 2] [1 2 3]]
or [[:x] [:x 3]]
, etc.
Pattern Matchers Available
Above I showed 2 matchers: match-list
and match-element
matching against a fixed-length vector.
Much more flexibility is available.
The pattern could be changed to [??before 1 2 3 ??after]
, for instance.
Now it would match a vector of any length as long as it contains the sequence 1 2 3
.
(def m* (compile-pattern '[??before 1 2 3 ??after]))
(m* [1 2 3]) ; => {:before [] :after []}
(m* [0 1 2 3 4 5]) ; => {:before [0] :after [4 5]}
Patterns also do not need to be flat. For instance, I can match against Clojure S-Expressions. Here I'll introduce a three more matcher types:
(def let-bindings (compile-pattern '(let [(?:* (? binding* symbol?) ?_)] ??_)))
(let-bindings '(let [datum (first data)
c (count datum)]
...))
;; => {:binding* [datum c]}
?_
and ??_
are special cases of unnamed matchers.
These matchers do not unify with each other, and the value matched by them is not returned in the match results.
They are useful for ignoring parts of the input data that are not relevant to what you are trying to do.
Anywhere you can provide a matcher name, using _
as the name has the same behavior.
The ?:*
matcher is match-many
.
It matches a repeated subsequence within the sequence it is placed in.
In this case, that allows it to handle the characteristic flattened Clojure binding pattern.
The (? ...)
matcher is the full form of match-element
that we saw earlier.
In this case, it is equivalent to ?binding*
, except that here I gave it a predicate function.
The matcher will only succeed if the predicate returns a truthy value when called with the matched value.
That means the following does not match:
(let-bindings '(let [42 (first data)
c (count datum)]
...))
;; => nil
There are a lot more matchers and the list is gradually expanding with ever more weird and wonderful behaviors.
Each matcher in the list has a matcher implementation function with detailed documentation in the pattern.matchers namespace.
(better docs coming eventually!)
Matcher | Implementation | Notes |
---|---|---|
? |
match-element | Match a single element |
?? |
match-segment | Match 0 or more elements |
??! |
Same as ?? , but greedy |
|
?:map |
match-map | Match a map with specific keys |
?:set |
match-*set | Match items in a set |
?:*map /?:map* |
match-*map | Match each key-value pair in a map |
?:+map /?:map+ |
match-+map | Like ?:*map , but require at least one match |
?:as |
match-as | Capture an entire sub pattern if the contained pattern matches |
?:? |
match-optional | Match 0 or 1 instance |
?:1 |
match-one | Match exactly 1 instance |
?:* |
match-many | Match any number of instances |
?:+ |
match-at-least-one | Match at least one instance |
?:n |
match-n-times | Match the given sequence exactly n times. |
?:chain |
match-chain | Alternate between patterns and functions on the matched data |
??:chain |
Same as ?:chain but capture sequence data |
|
| |
match-or | Match alternative patterns on the same data |
& |
match-and | Match all patterns on the same data |
?:= |
match-literal | Match a literal value (needed if the form looks like a matcher pattern) |
?:not |
match-not | Match only if the contained pattern does not |
?:when |
match-when | If the predicate, match the pattern |
?:if |
match-if | If the predicate, match the then pattern, otherwise the else pattern |
?:letrec |
match-letrec | Create named subpatterns that can be used later |
$ |
match-ref | Use a named subpattern defined in ?:letrec |
?:fresh |
match-fresh | Match a new copy of the given name |
?:all-fresh |
match-all-fresh | Don't unify with captures outside the block |
?:restartable |
match-restartable | If the pattern doesn't match, raise a condition that can provide a value |
?:re-matches |
match-regex | Match with a regex, binding captures |
?:re-seq |
match-regex | Multiple matches with a regex, binding captures |
?:filter |
match-filter | Match only the items in the list where the predicate returns true |
??:filter |
Inline sequence version of ?:filter | |
?:remove |
Match only the items in the list where the predicate returns false | |
??:remove |
Inline sequence version of ?:remove |
A little about regex matchers
The ?:re-matches
pattern takes a second form which is a pattern that matches the structure
returned by clojure.core/re-matches
:
(let [matcher (pattern/compile-pattern
'(?:re-matches #"(\d+) (.+)" ?re-matches))]
(matcher "123 [a b c]"))
That returned structure can itself be matched:
(let [matcher (pattern/compile-pattern
'(?:re-matches #"(\d+) (.+)"
[?full_match ?group1 ?group2]))]
(matcher "123 [a b c]"))
There is no way that I know of to discover group names, so here foo
will not
be included in the bound names matched by the compiled pattern.
Here, ?foo
will bind to 999 just fine, because the named group in the regex
is not exposed outside of the regex itself.
(let [matcher (pattern/compile-pattern
'[(?:re-matches #"(?<foo>\d+) (.+)"
[?full_match ?group1 ?group2])
?foo])]
(matcher ["123 [a b c]" "999"]))
Within the pattern, I can specify the name with ?foo, though. If I do then the two instances must unify. This now fails to match:
(let [matcher (pattern/compile-pattern
'[(?:re-matches #"(?<foo>\d+) (.+)"
[?full_match ?foo ?group2])
?foo])]
(matcher ["123 [a b c]" "999"]))
Named matches can still be useful in the regex and they together with all features are perfectly acceptable to use. Here the number at the beginning and the end of the string must match for the regex to match.
(let [matcher (pattern/compile-pattern
'(?:re-matches #"(?<foo>\d+) (.+) \k<foo>"
[?full_match ?group2 ?group1]))]
(matcher "123 [a b c] 123"))
The captured data may have further matchers applied to it. Here I use the ?:chain matcher to parse the second group with read-string, then match within the parsed form.
(let [matcher (pattern/compile-pattern
'(?:re-matches #"(?<foo>\d+) (.+) \k<foo>"
[?full_match
?group2
(?:chain ?group1 read-string [?x ?y ?z])]))]
(matcher "123 [a b c] 123"))
Adding new matchers
The ?:chain rule can often be used to create a custom matcher. This is the implementation of the ?:map one.
This matcher starts with an unnamed matcher that ensures the matched datum is either a map or nil: (? _ (some-fn nil? map?))
Then if that matches, calls seq
on that datum, turning the map into a list of [key value]
pairs.
The result may match either nil
if the map is empty, or each key value pair must match the given k
and v
pattern forms, which are spliced into the matcher here: (| nil ((?:* ~[k v])))
.
The generated matcher is then compiled with compile-pattern*
, passing along the compilation environment.
Finally, new matchers must be registered with their name and optional aliases.
(defn match-*map
"Create a ?:*map matcher than can match a key/value pair multiple times."
[[_ k v] comp-env]
(compile-pattern* `(~'?:chain
(~'? ~'_ ~(some-fn nil? map?))
seq
(~'| nil ((~'?:* ~[k v]))))
comp-env))
(register-matcher '?:*map #'match-*map {:aliases ['?:map*]})
Substitution
The core substitution function is sub
.
It is actually a macro that expands to be equivalent to just writing the substitution pattern as a literal value.
(let [x 2
y '[a b c]]
(sub (* (+ ??y) ?x)))
;; => (* (+ a b c) 2)
Macroexpanded, you can see that the substitution overhead is effectively zero:
(let [x 2
y '[a b c]]
(list '* (list* '+ y) x))
You may notice that instead of passing in a map of values to sub
, it effectively looks in the current evaluation context for replacement values.
This tends to be very convenient.
In practice, 99% of the time I use this method.
For that last 1%, though, a more traditional data-driven substitution method is required.
For that, we can use substitute
, which interprets and executes the pattern at runtime:
(substitute
'(* (+ ??y) ?x)
{'x 2 'y '[a b c]})
;; => (* (+ a b c) 2)
Rules
Rules are pattern matchers tied to function bodies. The captured data from the matcher becomes the function arguments.
Most frequently the function body is modifying data and thu sub
macro is the perfect tool, but any value can be returned by a matching rule.
(rule mult-id
'(* 1 ?x)
x)
(rule strength-reduce
'(* 2 ?x)
(sub (+ ?x ?x)))
Rules are implemented in terms of standard matchers, so any of the above matchers may be used.
A rule or rule combinator can be called like a function.
(def my-rule (rule '(* 2 ?x) (sub (+ ?x ?x))))
(my-rule '(* 2 5)) # => (+ 5 5)
If the rule body returns nil
or false
, the rule is treated the same as if
the pattern didn't match, which lets you do any arbitrary check against what you
matched. If you'd really like to to return nil
or false
, use (success nil)
or (success false)
.
Rule Combinators
Rule combinators allow multiple rules to be used together. Rule combinators can themselves be combined. A rule combinator is itself just a rule. Rules and rule combinators are interchangeable.
The following combinators are documented for now in the pattern.r3.combinators namespace:
Rule combinator | Note |
---|---|
rule-list | Search the rules from top to bottom. Stop after successfully matching. |
in-order | Search the rules from top to bottom. Continue with the result of each matching rule. |
on-subexpressions | Try the given rule on every element in the datastructure in post-walk order |
iterated | Keep applying the rule until a fixed point is reached |
simplifier | Like on-subexpressions which iterates at every level |
directed | Guide the traversal either using the pattern or from within the rule body |
scanner | Convert any rule combinator to scan through a list or vector |
Scanner rules
If you have rules that transform sequential data, especially long ones, their performance can benefit from using the scanner combinator.
Say you're working on data like that (which is the result of processing a string
like "Quantities: 5ml or 10gr"
using a natural language library):
(def in
[{:text "Quantities:", :pos :word}
{:text " ", :pos :ws}
{:text "5", :pos :number}
{:text "ml", :pos :unit}
{:text " ", :pos :ws}
{:text "or", :pos :word}
{:text " ", :pos :ws}
{:text "10", :pos :number}
{:text "gr", :pos :unit}])
And you'd like to produce group the number and unit tokens into a new map. You may write a rule that looks like so:
(def group-quantities
(iterated
(rule group-quantities
'[??before
(?:as scalar (?:map :pos :number))
(?:? (?:map :text ?ws :pos :ws))
(?:as unit (?:map :pos :unit))
??after]
(vec
(concat
before
[{:pos :quantity
:text (str (:text scalar) ws (:text unit))
:scalar scalar
:unit unit}]
after)))))
Running it would yield:
> (group-quantities in)
[{:text "Quantities:", :pos :word}
{:text " ", :pos :ws}
{:pos :quantity,
:text "5ml",
:scalar {:text "5", :pos :number},
:unit {:text "ml", :pos :unit}}
{:text " ", :pos :ws}
{:text "or", :pos :word}
{:text " ", :pos :ws}
{:pos :quantity,
:text "10gr",
:scalar {:text "10", :pos :number},
:unit {:text "gr", :pos :unit}}]
The iterated
combinator is necessary because without it the rule would match
and transform the first pair of number and unit and leave the second
untouched. But it also means that after the vector is transformed once, Pattern
will restart scanning the elements from the beginning, which is of course
wasteful.
For such use cases, the rule can be rewritten with scanner
:
(def group-quantities
(scanner
(rule group-quantities
'[(?:as scalar (?:map :pos :number))
(?:? (?:map :text ?ws :pos :ws))
(?:as unit (?:map :pos :unit))]
[{:pos :quantity
:text (str (:text scalar) ws (:text unit))
:scalar scalar
:unit unit}])))
Running this rule results in identical output, but each time the rule matches, Pattern continues on to the rest of the vector instead of restarting from the first element, resulting in better performance. Also, the rule is shorter because you only have to provide matchers for the just the "window" of the sequence you're looking to transform and you don't need to reconstruct the original vector.
Compilation Dialects and Tools
...
Acknowledgements
Pattern is based upon the core ideas described in the excellent book Software Design for Flexibilty by Chris Hanson and Gerald Jay Sussman.
The compilation tools aspect is heavily inspired by the Nanopass Framework which introduces the idea of dialects and was the inspiration for some of the more powerful rule combinators that this library introduces.
License
Copyright © 2022 Darrick Wiebe
Distributed under the Eclipse Public License version 1.0.