• This repository has been archived on 07/Jul/2023

WARNING: This repository is now archived. The regex-automata crate now resides at https://github.com/rust-lang/regex

regex-automata

A low level regular expression library that uses deterministic finite automata. It supports a rich syntax with Unicode support, has extensive options for configuring the best space vs time trade off for your use case and provides support for cheap deserialization of automata for use in no_std environments.

Minimum Supported Rust Version: 1.41

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/regex-automata

Usage

Add this to your Cargo.toml:

[dependencies]
regex-automata = "0.1"

WARNING: The master branch currently contains code for the 0.2 release, but this README still targets the 0.1 release. Namely, it is recommended to stick with the 0.1 release. The 0.2 release was made prematurely in order to unblock some folks.

Example: basic regex searching

This example shows how to compile a regex using the default configuration and then use it to find matches in a byte string:

use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);

For more examples and information about the various knobs that can be turned, please see the docs.

Support for no_std

This crate comes with a std feature that is enabled by default. When the std feature is enabled, the API of this crate will include the facilities necessary for compiling, serializing, deserializing and searching with regular expressions. When the std feature is disabled, the API of this crate will shrink such that it only includes the facilities necessary for deserializing and searching with regular expressions.

The intended workflow for no_std environments is thus as follows:

  • Write a program with the std feature that compiles and serializes a regular expression. Serialization should only happen after first converting the DFAs to use a fixed size state identifier instead of the default usize. You may also need to serialize both little and big endian versions of each DFA. (So that's 4 DFAs in total for each regex.)
  • In your no_std environment, follow the deserialization examples in the docs to turn your previously serialized DFAs back into regexes. You can then search with them as you would with any regex.

Deserialization can happen anywhere. For example, with bytes embedded into a binary or with a file memory mapped at runtime.

Note that the ucd-generate tool will do the first step for you with its dfa or regex sub-commands.
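
To make the "fixed size state identifier" and "little and big endian" points concrete, here is a minimal sketch that uses only the standard library. It is not this crate's serialization format: the `Endian` enum and the `serialize` and `deserialize` functions are hypothetical names invented for illustration, and the "DFA" is just a flat list of state identifiers. It also copies on deserialization, whereas this crate is described as searching directly on the serialized bytes.

```rust
use std::convert::TryFrom;

/// Byte order to use when writing state identifiers (hypothetical helper type).
#[derive(Clone, Copy)]
enum Endian {
    Little,
    Big,
}

/// Shrink each `usize` state identifier to a fixed-size `u16` and write it out
/// in the requested byte order. Returns `None` if an identifier does not fit.
fn serialize(table: &[usize], endian: Endian) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(table.len() * 2);
    for &id in table {
        let id = u16::try_from(id).ok()?;
        let bytes = match endian {
            Endian::Little => id.to_le_bytes(),
            Endian::Big => id.to_be_bytes(),
        };
        out.extend_from_slice(&bytes);
    }
    Some(out)
}

/// Read the identifiers back. The reader must use the same byte order as the
/// writer, which is why one serialized copy per endianness may be needed.
fn deserialize(bytes: &[u8], endian: Endian) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| {
            let pair = [pair[0], pair[1]];
            match endian {
                Endian::Little => u16::from_le_bytes(pair),
                Endian::Big => u16::from_be_bytes(pair),
            }
        })
        .collect()
}
```

Under this sketch, a regex made of a forward and a reverse DFA, each written once per byte order, gives the four serialized DFAs mentioned above.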

Cargo features

  • std - Enabled by default. This enables the ability to compile finite automata. This requires the regex-syntax dependency. Without this feature enabled, finite automata can only be used for searching (using the approach described above).
  • transducer - Disabled by default. This provides implementations of the Automaton trait found in the fst crate. This permits using finite automata generated by this crate to search finite state transducers. This requires the fst dependency.
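
As a sketch of how these features are typically selected, assuming standard Cargo feature syntax and the 0.1 release recommended above, a no_std consumer would turn off the default std feature:

```toml
[dependencies]
# Search-only build: disables the default `std` feature.
regex-automata = { version = "0.1", default-features = false }

# Alternatively, keep `std` and also enable the `fst` integration:
# regex-automata = { version = "0.1", features = ["transducer"] }
```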

Differences with the regex crate

The main goal of the regex crate is to serve as a general purpose regular expression engine. It aims to automatically balance low compile times, fast search times and low memory usage, while also providing a convenient API for users. In contrast, this crate provides a lower level regular expression interface that is a bit less convenient while providing more explicit control over memory usage and search times.

Here are some specific negative differences:

  • Compilation can take an exponential amount of time and space in the size of the regex pattern. While most patterns do not exhibit worst case exponential time, such patterns do exist. For example, [01]*1[01]{N} will build a DFA with 2^(N+1) states. For this reason, untrusted patterns should not be compiled with this library. (In the future, the API may expose an option to return an error if the DFA gets too big.)
  • This crate does not support sub-match extraction, which can be achieved with the regex crate's "captures" API. This may be added in the future, but is unlikely.
  • While the regex crate doesn't necessarily sport fast compilation times, the regexes in this crate are almost universally slow to compile, especially when they contain large Unicode character classes. For example, on my system, compiling \w{3} with byte classes enabled takes just over 1 second and almost 5MB of memory! (Compiling a sparse regex takes about the same time but only uses about 500KB of memory.) Conversely, compiling the same regex without Unicode support, e.g., (?-u)\w{3}, takes under 1 millisecond and less than 5KB of memory. For this reason, you should only use Unicode character classes if you absolutely need them!
  • This crate does not support regex sets.
  • This crate does not support zero-width assertions such as ^, $, \b or \B.
  • As a lower level crate, this library does not do literal optimizations. In exchange, you get predictable performance regardless of input. The philosophy here is that literal optimizations should be applied at a higher level, although there is no easy support for this in the ecosystem yet.
  • There is no &str API like in the regex crate. In this crate, all APIs operate on &[u8]. By default, match indices are guaranteed to fall on UTF-8 boundaries, unless RegexBuilder::allow_invalid_utf8 is enabled.
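
The exponential blow up claimed for [01]*1[01]{N} can be checked with a small, self-contained sketch of the textbook subset construction. This is not this crate's determinizer: `dfa_state_count` is a hypothetical helper, and it hard-codes a hand-built NFA for that one pattern, with NFA states 0 through N+1 stored as bits of a u64 (so it assumes N is at most 62).

```rust
use std::collections::{HashSet, VecDeque};

/// Count the DFA states that subset construction reaches for `[01]*1[01]{n}`.
///
/// NFA used here (an assumption of this sketch): state 0 loops on both
/// symbols and also moves to state 1 on `1`; each state 1..=n moves to the
/// next state on either symbol; state n+1 is accepting and has no transitions.
fn dfa_state_count(n: u32) -> usize {
    assert!(n <= 62, "NFA state sets are stored in a u64");

    // Compute the set of NFA states reachable from `set` on one input symbol.
    let step = |set: u64, symbol: u8| -> u64 {
        let mut next = 0u64;
        for s in 0..=(n + 1) {
            if set & (1u64 << s) == 0 {
                continue;
            }
            if s == 0 {
                next |= 1; // self loop on both symbols
                if symbol == 1 {
                    next |= 1u64 << 1; // guess that this `1` starts the tail
                }
            } else if s <= n {
                next |= 1u64 << (s + 1);
            }
        }
        next
    };

    // Breadth-first exploration of reachable state sets, starting from {0}.
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(1u64);
    queue.push_back(1u64);
    while let Some(set) = queue.pop_front() {
        for &symbol in &[0u8, 1u8] {
            let next = step(set, symbol);
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}
```

Each reachable set records which of the last N+1 symbols were 1, which is why the count doubles every time N grows by one.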

With some of the downsides out of the way, here are some positive differences:

  • Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply deserialized. Deserialization always takes constant time since searching can be performed directly on the raw serialized bytes of a DFA.
  • This crate was specifically designed so that the searching phase of a DFA has minimal runtime requirements, and can therefore be used in no_std environments. While no_std environments cannot compile regexes, they can deserialize pre-compiled regexes.
  • Since this crate builds DFAs ahead of time, it will generally out-perform the regex crate on equivalent tasks. The performance difference is likely not large. However, because of a complex set of optimizations in the regex crate (like literal optimizations), an accurate performance comparison may be difficult to do.
  • Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search performance a small amount, but uses much less storage space. Potentially even less than what the regex crate uses.
  • This crate exposes DFAs directly, such as DenseDFA and SparseDFA, which enables one to do less work in some cases. For example, if you only need the end of a match and not the start of a match, then you can use a DFA directly without building a Regex, which always requires a second DFA to find the start of a match.
  • Aside from choosing between dense and sparse DFAs, there are several options for configuring the space usage vs search time trade off. These include choosing a smaller state identifier representation, premultiplying state identifiers and splitting a DFA's alphabet into equivalence classes. Finally, DFA minimization is also provided, but can increase compilation times dramatically.
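
Two of the options in the last item, alphabet equivalence classes and premultiplied state identifiers, can be illustrated with a toy dense DFA. The sketch below does not use this crate's data structures. The table, the constants and the function names are all hypothetical, the pattern is fixed to [0-9]+-[0-9]+, and for simplicity it tests whether the whole input matches instead of searching for a match.

```rust
/// Number of byte equivalence classes in this toy alphabet.
const CLASSES: usize = 3;
/// State identifiers (before premultiplication).
const START: usize = 1;
const ACCEPT: usize = 4;

/// Map every byte to a class: 0 = anything else, 1 = ASCII digit, 2 = '-'.
/// Bytes that no transition distinguishes share a class, so each state needs
/// only 3 table entries instead of 256.
fn byte_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in b'0'..=b'9' {
        classes[b as usize] = 1;
    }
    classes[b'-' as usize] = 2;
    classes
}

/// Dense row-major transition table for `[0-9]+-[0-9]+`.
fn dense_table() -> Vec<usize> {
    vec![
        0, 0, 0, // state 0: dead
        0, 2, 0, // state 1: start
        0, 2, 3, // state 2: inside the first number
        0, 4, 0, // state 3: just saw '-'
        0, 4, 0, // state 4: inside the second number (accepting)
    ]
}

/// Store `id * CLASSES` instead of `id`, so lookups can skip a multiplication.
fn premultiply(table: &[usize]) -> Vec<usize> {
    table.iter().map(|&id| id * CLASSES).collect()
}

/// Plain lookup: one multiplication and one addition per input byte.
fn is_match(table: &[usize], input: &[u8]) -> bool {
    let classes = byte_classes();
    let mut state = START;
    for &b in input {
        state = table[state * CLASSES + classes[b as usize] as usize];
    }
    state == ACCEPT
}

/// Premultiplied lookup: only an addition per input byte.
fn is_match_premultiplied(table: &[usize], input: &[u8]) -> bool {
    let classes = byte_classes();
    let mut state = START * CLASSES;
    for &b in input {
        state = table[state + classes[b as usize] as usize];
    }
    state == ACCEPT * CLASSES
}
```

The trade off in this sketch is that premultiplied identifiers are larger numbers, so they may not fit in as small an integer type as the plain identifiers would.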

Future work

  • Look into being smarter about generating NFA states for large Unicode character classes. These can create a lot of additional work for both the determinizer and the minimizer, and I suspect this is the key thing we'll want to improve if we want to make DFA compile times faster. I believe it's possible to potentially build minimal or nearly minimal NFAs for the special case of Unicode character classes by leveraging Daciuk's algorithms for building minimal automata in linear time for sets of strings. See https://blog.burntsushi.net/transducers/#construction for more details. The key adaptation I think we need to make is to modify the algorithm to operate on byte ranges instead of enumerating every codepoint in the set. Otherwise, it might not be worth doing.
  • Add support for regex sets. It should be possible to do this by "simply" introducing more match states. I think we can also report the positions at each match, similar to how Aho-Corasick works. I think the long pole in the tent here is probably the API design work and arranging it so that we don't introduce extra overhead into the non-regex-set case without duplicating a lot of code. It seems doable.
  • Stretch goal: support capturing groups by implementing "tagged" DFA (transducers). Laurikari's paper is the usual reference here, but Trofimovich has a much more thorough treatment (https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf). I've only read the paper once. I suspect it will require at least a few more read throughs before I understand it. See also: https://re2c.org
  • Possibly less ambitious goal: can we select a portion of Trofimovich's work to make small fixed length look-around work? It would be really nice to support ^, $ and \b, especially the Unicode variant of \b and CRLF aware $.
  • Experiment with code generating Rust code. There is an early experiment in src/codegen.rs that is thoroughly bit-rotted. At the time, I was experimenting with whether or not codegen would significantly decrease the size of a DFA, since if you squint hard enough, it's kind of like a sparse representation. However, it didn't shrink as much as I thought it would, so I gave up. The other problem is that Rust doesn't support gotos, so I don't even know whether the "match on each state" in a loop thing will be fast enough. Either way, it's probably a good option to have. For one thing, it would be endian independent, whereas the serialization format of the DFAs in this crate is endian dependent (so you need two versions of every DFA, but you only need to compile one of them for any given arch).
  • Experiment with unrolling the match loops and fill out the benchmarks.
  • Add some kind of streaming API. I believe users of the library can already implement something for this outside of the crate, but it would be good to provide an official API. The key thing here is figuring out the API. I suspect we might want to support several variants.
  • Make a decision on whether or not there is room for literal optimizations in this crate. My original intent was to not let this crate sink down into that very very very deep rabbit hole. But instead, we might want to provide some way for literal optimizations to hook into the match routines. The right path forward here is to probably build something outside of the crate and then see about integrating it. After all, users can implement their own match routines just as efficiently as what the crate provides.
  • A key downside of DFAs is that they can take up a lot of memory and can be quite costly to build. Their worst case compilation time is O(2^n), where n is the number of NFA states. A paper by Yang and Prasanna (2011) actually seems to provide a way to characterize state blow up such that it is detectable. If we could know whether a regex will exhibit state explosion or not, then we could make an intelligent decision about whether to ahead-of-time compile a DFA. See: https://dl.acm.org/doi/10.1109/PACT.2011.73
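
For readers unfamiliar with the "match on each state" in a loop shape mentioned in the code generation item, here is a hand-written sketch of what such generated code might look like. It is not the output of src/codegen.rs: the `State` enum and `matches_ab_star_c` are hypothetical, and the pattern is fixed to ab*c matched against the whole input.

```rust
/// One variant per DFA state (hypothetical stand-in for generated code).
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start,
    SeenA,
    Accept,
    Dead,
}

/// Returns true if `input`, taken as a whole, matches `ab*c`.
fn matches_ab_star_c(input: &[u8]) -> bool {
    let mut state = State::Start;
    for &byte in input {
        // The transition function lives in code instead of a table, so there
        // is nothing to serialize and no byte order to worry about.
        state = match (state, byte) {
            (State::Start, b'a') => State::SeenA,
            (State::SeenA, b'b') => State::SeenA,
            (State::SeenA, b'c') => State::Accept,
            _ => State::Dead,
        };
        if state == State::Dead {
            return false;
        }
    }
    state == State::Accept
}
```

Whether the compiler turns a loop like this into code that competes with a table lookup is exactly the open question raised in that item.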
