

rebar

A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.

Links

  • METHODOLOGY describes the motivation, design, benchmark selection and evaluation protocol used by rebar.
  • BUILD describes how to build rebar and the regex engines it measures.
  • TUTORIAL provides a guided exploration of some of the most useful rebar sub-commands.
  • CONTRIBUTING describes how to add new benchmarks and how to add a new regex engine to benchmark.
  • MODELS describes the different types of workloads measured.
  • FORMAT describes the directory hierarchy and TOML format for how benchmarks are defined.
  • KLV describes the format of data given to regex engine runner programs for how to execute a benchmark.
  • BIAS is a work-in-progress document describing the bias of this barometer.
  • WANTED provides some ideas for other regex engines to add to rebar.
  • BYOB discusses how to "bring your own benchmarks." That is, anyone can use rebar with their own engine and benchmark definitions.

Results

This section shows the results of a curated and biased set of benchmarks. These reflect only a small subset of the benchmarks defined in this repository, but were carefully crafted to attempt to represent a broad range of use cases and annotated where possible with analysis to aid in the interpretation of results.

The results begin with a summary, followed by a list of links to each benchmark group, and finally the results for each group. Results are shown one benchmark group at a time, where a single group combines related regexes or workloads for which it is useful to see how results change across regex engines. Analysis is provided, at minimum, for every group. The analysis is heavily biased towards Rust's regex crate, as it is what this author knows best. However, contributions that discuss other regex engines are very welcome.

Below each group of results are the parameters for each individual benchmark within that group. An individual benchmark may contain some analysis specific to it, but it will at least contain a summary of the benchmark details. Some parameters, such as the haystack, are usually too big to show in this README. One can use rebar to look at the haystack directly. Just take the full name of the benchmark and give it to the rebar haystack command. For example:

$ rebar haystack unicode/compile/fifty-letters
ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή

Similarly, the full benchmark execution details (including the haystack) can be seen with the rebar klv command:

$ rebar klv unicode/compile/fifty-letters
name:29:unicode/compile/fifty-letters
model:7:compile
pattern:7:\pL{50}
case-insensitive:5:false
unicode:4:true
haystack:106:ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή
max-iters:1:0
max-warmup-iters:1:0
max-time:1:0
max-warmup-time:1:0
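The KLV format is length-delimited: judging from the output above, each item is `key:length:value`, where the length is the byte length of the value (note how `haystack:106` counts UTF-8 bytes, not the 53 codepoints). A minimal parser sketch in Python, assuming each value is terminated by a newline:

```python
def parse_klv(data: bytes) -> dict:
    # Each item is "<key>:<byte-length>:<value>\n". The explicit length
    # lets values (like haystacks) contain arbitrary bytes, including
    # newlines and colons.
    items = {}
    pos = 0
    while pos < len(data):
        colon1 = data.index(b":", pos)
        colon2 = data.index(b":", colon1 + 1)
        key = data[pos:colon1].decode()
        n = int(data[colon1 + 1:colon2])
        items[key] = data[colon2 + 1:colon2 + 1 + n]
        pos = colon2 + 1 + n + 1  # skip the value and its trailing newline
    return items
```

See the KLV document linked above for the authoritative description of the format.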

Finally, you can run the benchmark yourself and look at results on the command line:

$ rebar measure -f '^unicode/compile/fifty-letters$' | tee results.csv
$ rebar cmp results.csv

Summary

Below are two tables summarizing the results of regex engines benchmarked. Each regex engine includes its version at the time measurements were captured, a summary score that ranks it relative to other regex engines across all benchmarks and the total number of measurements collected.

The first table ranks regex engines based on search time. The second table ranks regex engines based on compile time.

The summary statistic used is the geometric mean of the speed ratios for each regex engine across all benchmarks that include it. The ratios within each benchmark are computed by taking the median of all timing samples for each engine and dividing it by the best median among the regex engines that participated in that benchmark. For example, given two regex engines A and B with results 35 ns and 25 ns on a single benchmark, A has a speed ratio of 1.4 and B has a speed ratio of 1.0. The geometric mean reported here is then the "average" speed ratio for that regex engine across all benchmarks.
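The arithmetic can be sketched with Python's statistics module (the engine names and timings are the example from the text; the extra ratios fed to the geometric mean are made up for illustration):

```python
from statistics import geometric_mean

def speed_ratios(medians_ns):
    # Each engine's ratio is its median time divided by the best
    # (smallest) median among all engines in the same benchmark.
    best = min(medians_ns.values())
    return {engine: t / best for engine, t in medians_ns.items()}

# The example from the text: A at 35ns and B at 25ns.
ratios = speed_ratios({"A": 35.0, "B": 25.0})  # {'A': 1.4, 'B': 1.0}

# An engine's summary score is then the geometric mean of its
# ratios across every benchmark it participates in.
score = geometric_mean([1.4, 1.0, 2.5])
```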

If you're looking to compare two regex engines specifically, then it is better to do so based only on the benchmarks that they both participate in. For example, to compare based on the results recorded on 2023-05-04, one can do:

$ rebar rank record/all/2023-05-04/*.csv -f '^curated/' -e '^(rust/regex|hyperscan)$' --intersection -M compile
Engine      Version           Geometric mean of speed ratios  Benchmark count
------      -------           ------------------------------  ---------------
hyperscan   5.4.1 2023-02-22  2.03                            25
rust/regex  1.8.1             2.13                            25

Caution: Using a single number to describe the overall performance of a regex engine is a fraught endeavor, and it is debatable whether it should be included here at all. It is included primarily because the number of benchmarks is quite large and overwhelming. It can be quite difficult to get a general sense of things without a summary statistic. In particular, a summary statistic is also useful to observe how the overall picture itself changes as changes are made to the barometer. (Whether it be by adding new regex engines or adding/removing/changing existing benchmarks.) One particular word of caution is that while geometric mean is more robust with respect to outliers than arithmetic mean, it is not unaffected by them. Therefore, it is still critical to examine individual benchmarks if one wants to better understand the performance profile of any specific regex engine or workload.

Summary of search-time benchmarks

Engine              Version           Geometric mean of speed ratios  Benchmark count
------              -------           ------------------------------  ---------------
hyperscan           5.4.2 2023-04-22  2.04                            28
rust/regex          1.9.3             2.68                            38
pcre2/jit           10.42 2022-12-11  5.29                            34
dotnet/compiled     7.0.7             6.32                            34
re2                 2023-08-01        9.15                            31
dotnet/nobacktrack  7.0.7             10.22                           29
javascript/v8       20.3.0            12.37                           30
regress             0.6.0             30.60                           30
python/regex        2023.6.3          36.54                           34
python/re           3.11.3            37.29                           33
java/hotspot        20.0.1+9-29       37.33                           34
perl                5.36.1            44.34                           33
icu                 72.1.0            45.59                           34
go/regexp           1.20.5            69.83                           31
pcre2               10.42 2022-12-11  102.09                          33
rust/regex/lite     0.1.0             131.31                          28

Summary of compile-time benchmarks

Engine              Version           Geometric mean of speed ratios  Benchmark count
------              -------           ------------------------------  ---------------
pcre2               10.42 2022-12-11  1.40                            10
rust/regex/lite     0.1.0             2.12                            10
icu                 72.1.0            2.92                            11
regress             0.6.0             3.06                            8
go/regexp           1.20.5            5.06                            10
pcre2/jit           10.42 2022-12-11  5.45                            11
re2                 2023-08-01        10.88                           10
rust/regex          1.9.3             10.95                           14
dotnet/compiled     7.0.7             19.49                           10
python/re           3.11.3            32.44                           11
python/regex        2023.6.3          90.38                           11
dotnet/nobacktrack  7.0.7             129.46                          6
hyperscan           5.4.2 2023-04-22  441.34                          7

Benchmark Groups

Below is a list of links to each benchmark group in this particular barometer. Each benchmark group contains 1 or more related benchmarks. The idea of each group is to tell some kind of story about related workloads, and to give a sense of how performance changes based on the variations between each benchmark.

This report was generated by rebar 0.0.1 (rev 8c6edf6843).

literal

This group of benchmarks measures regex patterns that are simple literals. When possible, we also measure case insensitive versions of the same pattern. We do this across three languages: English, Russian and Chinese. For English, Unicode mode is disabled while it is enabled for Russian and Chinese. (Which mostly only matters for the case insensitive benchmarks.)

This group is mainly meant to demonstrate two things. Firstly is whether the regex engine does some of the most basic forms of optimization by recognizing that a pattern is just a literal, and that a full blown regex engine is probably not needed. Indeed, naively using a regex engine for this case is likely to produce measurements much worse than most regex engines. Secondly is how the performance of simple literal searches changes with respect to both case insensitivity and Unicode. Namely, substring search algorithms that work well on ASCII text don't necessarily also work well on UTF-8 that contains many non-ASCII codepoints. This is especially true for case insensitive searches.
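Both properties can be illustrated with Python's re module (the haystacks here are made-up stand-ins for the benchmark inputs):

```python
import re

haystack = "The Adventures of Sherlock Holmes\n" * 3

# A pattern that is just a literal must agree with a plain substring
# search; engines that detect this case can skip the regex machinery
# and use a fast substring algorithm instead.
assert len(re.findall("Sherlock Holmes", haystack)) == haystack.count("Sherlock Holmes")

# Unicode-aware case insensitivity must fold characters outside ASCII,
# e.g. matching a lowercase Russian query against mixed-case text.
assert re.search("шерлок холмс", "Шерлок Холмс жив", re.IGNORECASE) is not None
```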

Notice, for example, how RE2 seems to be faster in the sherlock-casei-ru benchmark than in the sherlock-ru benchmark, even though the latter is "just" a simple substring search whereas the former is a multiple substring search. In the case of sherlock-ru, RE2 actually attempts a literal optimization that likely gets caught up in dealing with a high false positive rate of candidates, whereas in the case of sherlock-casei-ru, no literal optimization is attempted and instead its lazy DFA is used. The high false positive rate in the simpler literal case winds up making it overall slower than it likely would be if it just used the DFA.

This is not in any way to pick on RE2. Every regex engine that does literal optimizations (and most do) will suffer from this kind of setback in one way or another.

Engine              sherlock-en  sherlock-casei-en  sherlock-ru  sherlock-casei-ru  sherlock-zh
------              -----------  -----------------  -----------  -----------------  -----------
dotnet/compiled     12.2 GB/s    6.1 GB/s           20.3 GB/s    5.1 GB/s           39.2 GB/s
dotnet/nobacktrack  8.5 GB/s     4.1 GB/s           7.3 GB/s     2.5 GB/s           29.5 GB/s
go/regexp           4.0 GB/s     47.7 MB/s          2.1 GB/s     34.2 MB/s          2.0 GB/s
hyperscan           32.7 GB/s    29.2 GB/s          4.3 GB/s     7.5 GB/s           50.6 GB/s
icu                 1596.2 MB/s  451.4 MB/s         3.0 GB/s     281.5 MB/s         4.2 GB/s
java/hotspot        2.4 GB/s     282.1 MB/s         3.9 GB/s     228.0 MB/s         5.2 GB/s
javascript/v8       6.1 GB/s     3.0 GB/s           41.2 GB/s    3.3 GB/s           11.0 GB/s
pcre2               7.0 GB/s     866.2 MB/s         2.0 MB/s     1984.9 KB/s        55.2 MB/s
pcre2/jit           26.2 GB/s    16.7 GB/s          31.7 GB/s    19.0 GB/s          37.2 GB/s
perl                2.9 GB/s     560.5 MB/s         3.4 GB/s     103.2 MB/s         7.8 GB/s
python/re           3.7 GB/s     347.2 MB/s         6.4 GB/s     507.7 MB/s         9.1 GB/s
python/regex        3.4 GB/s     3.0 GB/s           4.5 GB/s     3.9 GB/s           6.7 GB/s
re2                 11.9 GB/s    2.5 GB/s           745.2 MB/s   948.0 MB/s         2.7 GB/s
regress             3.5 GB/s     1149.5 MB/s        3.5 GB/s     318.7 MB/s         3.6 GB/s
rust/regex          32.3 GB/s    11.4 GB/s          33.0 GB/s    9.0 GB/s           40.7 GB/s
rust/regex/lite     76.0 MB/s    56.5 MB/s          121.1 MB/s   -                  166.8 MB/s
rust/regexold       33.0 GB/s    7.9 GB/s           33.1 GB/s    6.6 GB/s           39.8 GB/s
Individual benchmark parameters:

sherlock-en

Parameter Value
full name curated/01-literal/sherlock-en
model count
regex Sherlock Holmes
case-insensitive false
unicode false
haystack-path opensubtitles/en-sampled.txt
count(.*) 513

sherlock-casei-en

Parameter Value
full name curated/01-literal/sherlock-casei-en
model count
regex Sherlock Holmes
case-insensitive true
unicode false
haystack-path opensubtitles/en-sampled.txt
count(.*) 522

sherlock-ru

Parameter Value
full name curated/01-literal/sherlock-ru
model count
regex Шерлок Холмс
case-insensitive false
unicode true
haystack-path opensubtitles/ru-sampled.txt
count(.*) 724

sherlock-casei-ru

Parameter Value
full name curated/01-literal/sherlock-casei-ru
model count
regex Шерлок Холмс
case-insensitive true
unicode true
haystack-path opensubtitles/ru-sampled.txt
count(.*) 746

rust/regex/lite is not included because it doesn't support Unicode-aware case insensitive matching.

sherlock-zh

Parameter Value
full name curated/01-literal/sherlock-zh
model count
regex 夏洛克·福尔摩斯
case-insensitive false
unicode true
haystack-path opensubtitles/zh-sampled.txt
count(.*) 30

literal-alternate

This group is like literal, but expands the complexity from a simple literal to a small alternation of simple literals, including case insensitive variants where applicable. Once again, we do this across three languages: English, Russian and Chinese. We disable Unicode mode for English but enable it for Russian and Chinese. Enabling Unicode here generally only means that case insensitivity takes Unicode case folding rules into account.

This benchmark ups the ante when it comes to literal optimizations. Namely, for a regex engine to optimize this case, it generally needs to be capable of reasoning about literal optimizations that require one or more literals from a set to match. Many regex engines don't deal with this case well, or at all. For example, after a quick scan comparing the sherlock-en benchmark here with the one in the previous literal group, one thing that should stand out is the proportion of regex engines that now measure throughput in MB/s instead of GB/s.

One of the difficulties in optimizing for this case is that multiple substring search is difficult to do in a way that is fast. In particular, this benchmark carefully selected each alternation literal to start with a different character than the other alternation literals. This, for example, inhibits clever regex engines from noticing that all literals begin with the same byte (or small number of bytes). Consider an alternation like foo|far|fight. It is not hard to see that a regex engine could just scan for the letter f as a prefilter optimization. Here, we pick our regex such that this sort of shortcut isn't available. For the regex engine to optimize this case, it really needs to deal with the problem of multiple substring search.
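To make the inhibited shortcut concrete, here is a sketch of a single-byte prefilter (a hypothetical helper, not rebar code): scan for the shared first byte, then verify each candidate position against the full literals. The literals in this group are chosen precisely so that no such shared byte exists:

```python
def prefilter_find(haystack, literals):
    # This shortcut only applies when every literal in the alternation
    # starts with the same character.
    first = {lit[0] for lit in literals}
    assert len(first) == 1, "shortcut requires a shared first character"
    b = first.pop()
    hits = []
    i = haystack.find(b)
    while i != -1:
        # Candidate found by the cheap scan; verify the full literals.
        for lit in literals:
            if haystack.startswith(lit, i):
                hits.append((i, lit))
        i = haystack.find(b, i + 1)
    return hits
```

For example, `prefilter_find("a fight at the foo bar", ["foo", "far", "fight"])` only does full literal comparisons at the positions of `f`.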

Multiple substring search can be implemented via a DFA, and perhaps in some cases, quite quickly via a shift DFA. Beyond that though, multiple substring search can be implemented by other various algorithms such as Aho-Corasick or Rabin-Karp. (The standard Aho-Corasick formulation is an NFA, but it can also be converted to a DFA by pre-computing all failure transitions. This winds up with a similar result as using Thompson's construction to produce an NFA and then powerset construction to get a DFA, but the Aho-Corasick construction algorithm is usually quite a bit faster because it doesn't need to deal with a full NFA.)
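A sketch of the classic Aho-Corasick formulation mentioned above, with failure transitions computed by breadth-first traversal (pre-computing them into the transition function is what yields the DFA form):

```python
from collections import deque

def build_ac(patterns):
    # Trie states: per-state transition map, failure link, output set.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    # Compute failure links breadth-first; depth-1 states fail to the root.
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]  # inherit matches ending at the suffix state
    return goto, fail, out

def find_all(text, patterns):
    # Report (start, pattern) for every occurrence of every pattern.
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```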

The problem here is that DFA speeds may or may not help you. For example, RE2 and Rust's regex engine already get DFA speeds by virtue of their lazy DFAs. Indeed, in this group, RE2 performs roughly the same across all benchmarks. So even if you, say, build an Aho-Corasick DFA, it's not going to help much, if at all, so it makes sense to avoid it.

But Rust's regex crate achieves quite a bit higher throughput than RE2 on most of the benchmarks in this group. So how is it done? Currently, this is done via the Teddy algorithm, which was ported out of Hyperscan. It is an algorithm that makes use of SIMD to accelerate searching for a somewhat small set of literals. Most regex engines don't have this sort of optimization, and indeed, it seems like Teddy is not particularly well known. Alas, regex engines that want to move past typical DFA speeds for multiple substring search likely need some kind of vectorized algorithm to do so. (Teddy is also used by Rust's regex crate in the previous literal group of benchmarks for accelerating case insensitive searches. Namely, it enumerates some finite set of prefixes like she, SHE, ShE and so on, and then looks for matches of those as a prefilter.)

Engine              sherlock-en  sherlock-casei-en  sherlock-ru  sherlock-casei-ru  sherlock-zh
------              -----------  -----------------  -----------  -----------------  -----------
dotnet/compiled     3.7 GB/s     456.2 MB/s         2.2 GB/s     780.1 MB/s         16.9 GB/s
dotnet/nobacktrack  2.6 GB/s     374.5 MB/s         1018.9 MB/s  296.6 MB/s         10.3 GB/s
go/regexp           25.2 MB/s    15.5 MB/s          32.5 MB/s    9.1 MB/s           46.5 MB/s
hyperscan           14.0 GB/s    13.5 GB/s          4.5 GB/s     4.0 GB/s           19.8 GB/s
icu                 630.6 MB/s   113.1 MB/s         168.1 MB/s   106.5 MB/s         337.3 MB/s
java/hotspot        69.9 MB/s    63.8 MB/s          118.9 MB/s   55.8 MB/s          175.9 MB/s
javascript/v8       691.6 MB/s   675.3 MB/s         942.0 MB/s   596.7 MB/s         6.5 GB/s
pcre2               916.1 MB/s   166.8 MB/s         1690.8 KB/s  1581.2 KB/s        8.3 MB/s
pcre2/jit           1470.9 MB/s  654.6 MB/s         1188.7 MB/s  298.4 MB/s         2.5 GB/s
perl                1090.3 MB/s  114.5 MB/s         121.0 MB/s   82.6 MB/s          244.0 MB/s
python/re           428.8 MB/s   42.2 MB/s          317.3 MB/s   57.6 MB/s          635.9 MB/s
python/regex        303.0 MB/s   67.8 MB/s          304.4 MB/s   91.5 MB/s          873.9 MB/s
re2                 923.9 MB/s   922.2 MB/s         936.1 MB/s   930.3 MB/s         966.7 MB/s
regress             1558.9 MB/s  282.1 MB/s         257.4 MB/s   105.9 MB/s         263.9 MB/s
rust/regex          12.8 GB/s    3.0 GB/s           6.3 GB/s     1544.1 MB/s        15.1 GB/s
rust/regex/lite     33.7 MB/s    23.5 MB/s          53.4 MB/s    -                  73.3 MB/s
rust/regexold       13.8 GB/s    2.7 GB/s           2.9 GB/s     443.1 MB/s         19.5 GB/s
Individual benchmark parameters:

sherlock-en

Parameter Value
full name curated/02-literal-alternate/sherlock-en
model count
regex Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty
case-insensitive false
unicode false
haystack-path opensubtitles/en-sampled.txt
count(.*) 714

sherlock-casei-en

Parameter Value
full name curated/02-literal-alternate/sherlock-casei-en
model count
regex Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty
case-insensitive true
unicode false
haystack-path opensubtitles/en-sampled.txt
count(.*) 725

sherlock-ru

Parameter Value
full name curated/02-literal-alternate/sherlock-ru
model count
regex Шерлок Холмс|Джон Уотсон|Ирен Адлер|инспектор Лестрейд|профессор Мориарти
case-insensitive false
unicode true
haystack-path opensubtitles/ru-sampled.txt
count(.*) 899

sherlock-casei-ru

Parameter Value
full name curated/02-literal-alternate/sherlock-casei-ru
model count
regex Шерлок Холмс|Джон Уотсон|Ирен Адлер|инспектор Лестрейд|профессор Мориарти
case-insensitive true
unicode true
haystack-path opensubtitles/ru-sampled.txt
count(.*) 971

rust/regex/lite is not included because it doesn't support Unicode-aware case insensitive matching.

sherlock-zh

Parameter Value
full name curated/02-literal-alternate/sherlock-zh
model count
regex 夏洛克·福尔摩斯|约翰华生|阿德勒|雷斯垂德|莫里亚蒂教授
case-insensitive false
unicode true
haystack-path opensubtitles/zh-sampled.txt
count(.*) 207

date

This is a monster regex for extracting dates from unstructured text from the datefinder project written in Python. The regex itself was taken from printing the DATES_PATTERN variable in the datefinder project. I then removed all names from the capture groups, removed unnecessary escapes, and collapsed it to a single line (because not all regex engines support verbose mode).

The regex is more akin to a tokenizer, and the datefinder library attempts to combine these tokens into timestamps.

We measure an ASCII only version of it and a Unicode-aware version of it. Unicode is relevant here because of case insensitivity, and because the regex makes use of the character classes \s and \d, which are bigger when they're Unicode aware. We also measure the compilation time of each.

The results here can be a little tricky to interpret. Namely, it looks like backtrackers tend to do worse than automata oriented regex engines, but go/regexp uses automata and is itself quite slow here. Notice, though, that hyperscan, re2 and rust/regex do well here. While I'm less familiar with hyperscan, the explanation for re2 and rust/regex is obvious once you look at a profile: it's the lazy DFA. Both have implementations of a regex engine that build a DFA during search time, with at most one new transition (and one new state) being created per byte of haystack. In practice, most transitions get reused, which means that it tends to act like a real DFA most of the time for most regexes on most haystacks.

Compilation time of this monster regex is also all over the place. PCRE2 does the best, and Hyperscan winds up being quite slow. Once you enable Unicode mode, compilation time generally gets worse, and especially so for re2 and rust/regex. In particular, both compile byte oriented automata, which means the transitions are defined over bytes and not codepoints. That means large Unicode classes like \d tend to balloon in size, because they get converted into UTF-8 automata.

Engine           ascii        unicode      compile-ascii  compile-unicode
------           -----        -------      -------------  ---------------
dotnet/compiled  1100.5 KB/s  1080.6 KB/s  -              1.66ms
go/regexp        254.9 KB/s   -            1.31ms         -
hyperscan        104.2 MB/s   -            656.86ms       -
icu              321.0 KB/s   320.1 KB/s   434.21us       434.36us
java/hotspot     2.1 MB/s     1620.4 KB/s  -              -
javascript/v8    32.9 MB/s    29.4 MB/s    -              -
pcre2            1253.6 KB/s  170.5 KB/s   114.40us       133.25us
pcre2/jit        21.0 MB/s    13.0 MB/s    695.29us       0.99ms
perl             2.9 MB/s     -            -              -
python/re        1077.0 KB/s  899.1 KB/s   3.63ms         3.84ms
python/regex     1133.3 KB/s  1025.0 KB/s  10.17ms        29.54ms
re2              80.4 MB/s    -            1.16ms         -
regress          1931.5 KB/s  1954.3 KB/s  1.04ms         1.05ms
rust/regex       164.3 MB/s   163.3 MB/s   1.49ms         5.38ms
rust/regex/lite  950.8 KB/s   -            309.78us       -
rust/regexold    143.2 MB/s   408.9 KB/s   1.67ms         6.08ms
Individual benchmark parameters:

ascii

Parameter Value
full name curated/03-date/ascii
model count-spans
regex-path wild/date.txt
case-insensitive true
unicode false
haystack-path rust-src-tools-3b0d4813.txt
count(dotnet.*) 111825
count(hyperscan) 547662
count(icu) 111825
count(javascript/v8) 111825
count(regress) 111841
count(.*) 111817

As with many other benchmarks, Hyperscan reports all matches, even ones that are overlapping. This particular regex is too big to analyze closely, but it seems plausible one could still use it (possibly with a slightly tweaked regex) for this task.

unicode

Parameter Value
full name curated/03-date/unicode
model count-spans
regex-path wild/date.txt
case-insensitive true
unicode true
haystack-path rust-src-tools-3b0d4813.txt
count(`dotnet/compiled icu
count(.*) 111841

regress is included here despite its \d not being Unicode-aware (ECMAScript requires \d to be ASCII-only). Notably, its \s is Unicode-aware. (\w is too, but it's not used in this regex.) In this particular haystack, \d being ASCII-only doesn't impact the match count.

However, neither re2 nor go/regexp are included here because neither \d nor \s are Unicode-aware, and the \s being ASCII-only does impact the match count.

hyperscan is excluded here because the pattern results in a "too large" compilation error. As far as I know, Hyperscan doesn't expose any knobs for increasing this limit.

dotnet/compiled gets a different count here, but it's not clear why.

perl is left out of this benchmark because it times out.

rust/regex/lite is excluded because it doesn't support Unicode-aware \w, \d or \s.

compile-ascii

Parameter Value
full name curated/03-date/compile-ascii
model compile
regex-path wild/date.txt
case-insensitive true
unicode false
haystack 2010-03-14
count(hyperscan) 10
count(.*) 5

Notice that regress is now included in the ASCII benchmark, because in compile-unicode we specifically test that the \d used in this regex is Unicode-aware. regress does not make \d Unicode-aware, so it gets thrown into the ASCII group. But do note that it does appear to have some Unicode awareness.

compile-unicode

Parameter Value
full name curated/03-date/compile-unicode
model compile
regex-path wild/date.txt
case-insensitive true
unicode true
haystack ۲۰۱۰-۰۳-۱۴
count(`javascript/v8 regress`)
count(.*) 5

We use "extended arabic-indic digits" to represent the same date, 2010-03-14, that we use for verification in compile-ascii. These digits are part of \d when it is Unicode aware.
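Python's re module exhibits exactly this split, since its \d is Unicode-aware by default and can be restricted to ASCII with re.ASCII, which makes it easy to see why the two compile benchmarks verify against different haystacks:

```python
import re

# The same date as compile-ascii's 2010-03-14, written with
# extended arabic-indic digits (U+06F0..U+06F9).
date = "\u06f2\u06f0\u06f1\u06f0-\u06f0\u06f3-\u06f1\u06f4"

# Unicode-aware \d matches any codepoint in category Nd.
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None

# ASCII-only \d rejects the same haystack.
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", date, re.ASCII) is None
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2010-03-14", re.ASCII) is not None
```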

ruff-noqa

The regex benchmarked here comes from the Ruff project, which is a Python linter written in Rust. The project uses many regexes, but we pluck one out in particular that is likely to be run more frequently than the others:

(\s*)((?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)

This is a regex that looks for # noqa annotations on each line. The noqa annotation generally causes the linter to ignore those lines with respect to warnings it emits. The regex also tries to extract annotations following the noqa that permit ignoring only specific rules in the linter.

This benchmark has a few interesting characteristics worth pointing out:

  • It is line oriented, which means the haystacks it searches are likely to be small. This in turn means that the overhead of the regex engine is likely to matter more than in throughput oriented benchmarks.
  • On this particular haystack (the CPython source code), the number of matches is quite small. Therefore, it is quite beneficial here to be able to have a fast path to say "there is no match" without doing any extra work. While the number of matches here is perhaps uncharacteristically small for a Python project, you would generally expect most lines to not have # noqa in them, and so the presumption of a fast rejection is probably a decent assumption for this particular regex.
  • Ruff uses capturing groups to pick out parts of the match, so when a match is found, the regex engine needs to report additional information beyond just the overall match spans. The spans of each matching capture group also need to be reported.
  • There are no prefix (or suffix) literals in the regex to enable any straight-forward prefilter optimizations.

With respect to the point about no prefix or suffix literals, we also include a tweaked version of the regex that removes the leading (\s*):

(?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?

In this case, the regex now starts with a literal, albeit one that is asked to match case insensitively. We can actually see pretty clearly the impact the tweaked version has on the speed for each regex engine. pcre2/jit, for example, improves its throughput from around 500 MB/s to 1.5 GB/s. go/regexp has an even more dramatic (relatively speaking) improvement.

rust/regex is a little different in that it's quite fast in both cases. The key optimization that applies for rust/regex is the "reverse inner" optimization. Even in the original regex, rust/regex will pluck out the # noqa literal and search for it case insensitively. When a candidate is found, it then searches for (\s*) in reverse to find the start position, and then finally does a standard forward search from that point to find the end position.
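As a rough sketch of what the regex extracts, here it is in Python's re. Note that we hoist the scoped (?i:...) flag to a pattern-wide re.IGNORECASE for portability, which also makes the [A-Z] code class case insensitive, a slight deviation from the original:

```python
import re

# The Ruff regex, with the scoped (?i:...) flag hoisted to a
# pattern-wide flag.
NOQA = re.compile(
    r"(\s*)((?:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)",
    re.IGNORECASE,
)

m = NOQA.search("import os  # noqa: E501, F401")
codes = m.group(3)  # the comma/space separated rule codes: "E501, F401"
```

Most lines in a real source file contain no `# noqa` at all, which is why a fast rejection path matters so much for this workload.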

Engine              real         tweaked      compile-real
------              ----         -------      ------------
dotnet/compiled     154.7 MB/s   503.1 MB/s   46.84us
dotnet/nobacktrack  241.6 MB/s   450.3 MB/s   389.45us
go/regexp           33.6 MB/s    716.3 MB/s   2.75us
icu                 20.4 MB/s    330.8 MB/s   5.50us
java/hotspot        38.0 MB/s    218.4 MB/s   -
pcre2               127.0 MB/s   1423.7 MB/s  1.13us
pcre2/jit           575.6 MB/s   1495.1 MB/s  6.81us
perl                105.4 MB/s   136.3 MB/s   -
python/re           30.4 MB/s    122.5 MB/s   60.70us
python/regex        99.9 MB/s    98.9 MB/s    128.09us
re2                 549.6 MB/s   747.0 MB/s   6.98us
rust/regex          1669.8 MB/s  1555.9 MB/s  55.88us
rust/regex/lite     30.1 MB/s    47.3 MB/s    2.03us
rust/regexold       172.0 MB/s   1177.7 MB/s  41.20us
Individual benchmark parameters:

real

Parameter Value
full name curated/04-ruff-noqa/real
model grep-captures
regex (\s*)((?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)
case-insensitive false
unicode false
haystack-path wild/cpython-226484e4.py
count(.*) 84

tweaked

Parameter Value
full name curated/04-ruff-noqa/tweaked
model grep-captures
regex (?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?
case-insensitive false
unicode false
haystack-path wild/cpython-226484e4.py
count(.*) 44

compile-real

Parameter Value
full name curated/04-ruff-noqa/compile-real
model compile
regex (\s*)((?i:# noqa)(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)
case-insensitive false
unicode false
haystack # noqa
count(.*) 1

lexer-veryl

This group benchmarks a "lexer" where it combines a whole bunch of different patterns that identify tokens in a language into a single regex. It then uses capture groups to determine which branch of the alternation actually matched, and thus, which token matched. We also benchmark a variant of this that asks the regex engine to search for each pattern individually (most regex engines don't support this mode).

This is used by the Veryl project by way of the Parol parser generator. The regex was extracted by the Parol maintainers upon my request.

We use this regex to represent the "lexing" use case, where sometimes folks will build a pretty big regex with a bunch of small regexes for identifying tokens. Usually the idea is that the lexer matches literally everything in the haystack (indeed, the last branch in this regex is a . and the first is any newline), and thus these sorts of regexes tend to be quite latency sensitive. Namely, it really matters just how much overhead is involved in reporting matches. This is likely one of the reasons why most regex engines are overall pretty slow here.

The other aspect of this that's quite difficult is the sheer number of capturing groups. There's several dozen of them, which means regex engines have to keep track of a fair bit of state to handle it.
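The single-regex technique can be sketched with Python's re, using named groups and Match.lastgroup to recover which branch matched (a toy token set, not the actual Veryl lexer):

```python
import re

# A toy four-branch lexer; the real Parol/Veryl regex has dozens of
# branches, one capture group per token kind.
TOKEN = re.compile(
    r"(?P<WS>[\t ]+)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<PUNCT>.)"
)

def tokenize(src):
    # lastgroup names the branch that matched, which is how the
    # single-regex lexer distinguishes token kinds.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(src)]

tokens = tokenize("let x = 42;")
```

Because the final branch matches any character, every byte of the haystack produces a token, which is what makes this workload so latency sensitive.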

You might think this would be bad for backtrackers and good for automata engines, since automata engines are supposed to handle large alternations better than backtrackers. But that's not the case here. Even Python's regex engine (a backtracker) beats RE2 (automata). My hypothesis for why this is the case is latency. Automata engines tend to have multiple engines internally, and therefore tend to have higher latency; sometimes multiple engines run to service one search. Backtrackers tend to have one engine that handles everything.

But still, shouldn't the huge alternation be disastrous for the backtracker? Perhaps, unless many of the matches occur in an early branch, which is likely the case here. Namely, the second branch of the alternation matches a single ASCII space, which is probably the most frequently occurring byte in the haystack. An automata engine that doesn't use a DFA (which might be the case here, because the regex is so big) will wind up spending a lot of time keeping track of all branches of the alternation, even if it doesn't need to explore all of them. In contrast, a backtracker will try one branch after the other, and if most cases match an early branch, the backtracker is likely to take less overall time.

Most regex engines are stuck in the 1 MB/s (or less) range. The regex crate and PCRE2's JIT get up to about 10 MB/s, with PCRE2 edging out the regex crate.

Note that the regex was lightly modified from the original to increase portability across different regex engines. For example, the [\s--\r\n] class was changed to [\t\v\f ].

As for the second benchmark, multiple, it uses the same patterns from each alternation in the single benchmark, but treats each one as a distinct pattern. Doing this requires explicit support for searching multiple regex patterns. (RE2's and Rust's regex crate "regex set" functionality is not enough for this, as it only reports which patterns match a haystack, and not where they match. That's partially why the rust/regex engine in this barometer actually just uses the lower level meta::Regex APIs from the regex-automata crate.)

In the multiple case, rust/regex does very well, and the key reason is that capture groups are no longer needed to determine which token matched. Namely, we can simply use the pattern ID of the match to determine which "branch" in the original regex was taken; we no longer need to ask for or inspect capture groups. This gives a critical benefit to automata engines that support searching for multiple patterns, because it no longer requires them to use slower engines for resolving capture groups.

Engine           single       compile-single  multi
------           ------       --------------  -----
dotnet/compiled  208.0 KB/s   275.13us        -
go/regexp        320.9 KB/s   62.93us         -
hyperscan        -            -               17.8 MB/s
icu              1002.2 KB/s  56.71us         -
java/hotspot     6.1 MB/s     -               -
javascript/v8    6.9 MB/s     -               -
pcre2            2.8 MB/s     24.53us         -
pcre2/jit        12.5 MB/s    128.44us        -
perl             1146.4 KB/s  -               -
python/re        1821.8 KB/s  889.55us        -
python/regex     1667.3 KB/s  2.38ms          -
re2              1146.5 KB/s  149.71us        -
regress          6.3 MB/s     -               -
rust/regex       9.2 MB/s     271.83us        95.7 MB/s
rust/regex/lite  539.8 KB/s   38.15us         -
rust/regexold    225.5 KB/s   247.70us        -
Individual benchmark parameters:

single

Parameter Value
full name curated/05-lexer-veryl/single
model count-captures
regex-path wild/parol-veryl.txt
case-insensitive false
unicode false
haystack-path wild/parol-veryl.vl
count(.*) 124800

Note that we don't include Hyperscan here because it doesn't support the count-captures benchmark model. It is included in the multiple benchmark below, which doesn't require capture groups.

Also, I tried to use dotnet/nobacktrack here, but it failed because it was too big and it wasn't obvious to me how to raise the limit.

compile-single

| Parameter | Value |
| --- | --- |
| full name | curated/05-lexer-veryl/compile-single |
| model | compile |
| regex-path | wild/parol-veryl.txt |
| case-insensitive | false |
| unicode | false |
| haystack | abcdefg_foobar |
| count(.*) | 1 |

This measures how long it takes to compile a moderately large lexer.

multi

| Parameter | Value |
| --- | --- |
| full name | curated/05-lexer-veryl/multi |
| model | count-spans |
| regex-path | wild/parol-veryl.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | wild/parol-veryl.vl |
| count(hyperscan) | 669500 |
| count(.*) | 150600 |

Hyperscan reports everything that matches, including overlapping matches, and that's why its count is higher. It is likely still serviceable for this use case, but might in practice require changing the regex to suit Hyperscan's match semantics. Still, it's a decent barometer to include it here, particularly because of its multi-regex support.

Most regex engines do not support searching for multiple patterns and finding the corresponding match offsets, which is why this benchmark has very few entries.

cloud-flare-redos

This benchmark uses a regex that helped cause an outage at Cloudflare. This class of vulnerability is typically called a "regular expression denial of service," or "ReDoS" for short. It doesn't always require a malicious actor to trigger. Since it can be difficult to reason about the worst case performance of a regex when using an unbounded backtracking implementation, it might happen entirely accidentally on valid inputs.

The particular regex that contributed to the outage was:

(?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|\-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))

As discussed in Cloudflare's post mortem, the specific problematic portion of the regex is:

.*(?:.*=.*)

Or more simply:

.*.*=.*

We benchmark the original regex along with the simplified variant. We also split the simplified variant into one with a short haystack (about 100 bytes) and one with a long haystack (about 10,000 bytes). The benchmark results for the original and simplified short variant should be roughly similar, but the difference between the short and long variant is where things get interesting. The automata based engines generally maintain a similar throughput for both the short and long benchmarks, but the backtrackers slow way down. This is because the backtracking algorithm for this specific regex and haystack doesn't scale linearly with increases in the size of the haystack.
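The superlinear blowup is easy to reproduce with any unbounded backtracker, such as Python's re module. Below is a hypothetical measurement harness (exact timings vary by machine, but the growth rate does not):

```python
import re
import time

# The simplified Cloudflare regex. On a haystack of all 'x' and no '=',
# a backtracker must try every split point between the two '.*'s at
# every starting position before concluding there is no match.
pat = re.compile(r".*.*=.*")

def worst_case_seconds(n):
    haystack = "x" * n
    start = time.perf_counter()
    assert pat.search(haystack) is None  # no '=' anywhere, so no match
    return time.perf_counter() - start

# Doubling the haystack should far more than double the search time;
# the work here grows roughly cubically for a naive backtracker.
t1 = worst_case_seconds(250)
t2 = worst_case_seconds(500)
```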

The purpose of this benchmark is to show a real world scenario where the use of a backtracking engine can bite you in production if you aren't careful.

We include Hyperscan in this benchmark, although it is questionable to do so. Hyperscan reports many overlapping matches from the regex used by Cloudflare because of the trailing .*, so it is probably not a great comparison. In particular, this regex was originally used in a firewall, so it seems likely that it would be used in a "is a match" or "not a match" scenario. Our benchmark here, however, reproduces the analysis in the appendix of Cloudflare's post mortem. The real utility in including Hyperscan is that it demonstrates that it is not a backtracking engine. While its throughput is not as high as some other engines, it remains roughly invariant with respect to haystack length, just like the other automata oriented engines.

Note that rust/regex has very high throughput here because the regex is small enough to get compiled into a full DFA. The compilation process also "accelerates" some states, particularly the final .*. This acceleration works by noticing that almost all of the state's transitions loop back on itself, and only a small number transition to another state. The final .* for example only leaves its state if it sees the end of the haystack or a \n. So the DFA will actually run memchr on \n and skip right to the end of the haystack.

| Engine | original | simplified-short | simplified-long |
| --- | --- | --- | --- |
| dotnet/compiled | 131.5 MB/s | 860.8 MB/s | 13.5 GB/s |
| dotnet/nobacktrack | 12.7 MB/s | 188.5 MB/s | 288.4 MB/s |
| go/regexp | 41.7 MB/s | 44.2 MB/s | 48.6 MB/s |
| hyperscan | 85.0 MB/s | 81.7 MB/s | 84.7 MB/s |
| icu | 3.4 MB/s | 3.5 MB/s | 42.7 KB/s |
| java/hotspot | 5.8 MB/s | 6.3 MB/s | 95.3 KB/s |
| javascript/v8 | 19.5 MB/s | 18.9 MB/s | 335.4 KB/s |
| pcre2 | 2.9 MB/s | 2.7 MB/s | 30.4 KB/s |
| pcre2/jit | 50.0 MB/s | 42.1 MB/s | 670.8 KB/s |
| perl | 10.3 MB/s | 10.0 MB/s | 176.5 KB/s |
| python/re | 22.6 MB/s | 22.3 MB/s | 384.7 KB/s |
| python/regex | 6.4 MB/s | 6.2 MB/s | 91.9 KB/s |
| re2 | 348.3 MB/s | 333.1 MB/s | 493.7 MB/s |
| regress | 8.3 MB/s | 8.0 MB/s | 104.6 KB/s |
| rust/regex | 583.1 MB/s | 2.0 GB/s | 88.7 GB/s |
| rust/regex/lite | 19.1 MB/s | 22.8 MB/s | 23.5 MB/s |
| rust/regexold | 436.1 MB/s | 472.2 MB/s | 600.2 MB/s |

original

Parameter Value
full name curated/06-cloud-flare-redos/original
model count-spans
regex (?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))
case-insensitive false
unicode false
haystack math x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..]
count(hyperscan) 5757
count(.*) 107

simplified-short

| Parameter | Value |
| --- | --- |
| full name | curated/06-cloud-flare-redos/simplified-short |
| model | count-spans |
| regex | .*.*=.* |
| case-insensitive | false |
| unicode | false |
| haystack | x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..] |
| count(hyperscan) | 5252 |
| count(.*) | 102 |

simplified-long

| Parameter | Value |
| --- | --- |
| full name | curated/06-cloud-flare-redos/simplified-long |
| model | count-spans |
| regex | .*.*=.* |
| case-insensitive | false |
| unicode | false |
| haystack-path | cloud-flare-redos.txt |
| count(hyperscan) | 50004999 |
| count(.*) | 10000 |

unicode-character-data

This regex parses data from UnicodeData.txt, which is part of the Unicode Character Database. This regex was extracted from the ucd-parse crate, which is part of the ucd-generate project.

This benchmark works by iterating over every line in the haystack and then running the regex on each line. Every line matches the regex, so regex engines that attempt to do some extra work to reject non-matches quickly will get penalized. For example, rust/regex looks for a semi-colon first via its "reverse inner" optimization, since a semi-colon is a required part of the regex. But this optimization is just extra work here. Indeed, disabling it will improve the throughput of rust/regex on this benchmark.
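The per-line workload can be illustrated in Python. The regex below is a simplified stand-in for the real pattern (which lives in wild/ucd-parse.txt); UnicodeData.txt lines are semicolon-delimited fields, and the required ';' is exactly the kind of inner literal that a "reverse inner" optimization keys on:

```python
import re

# Simplified stand-in for the ucd-parse regex: a hex codepoint, a name
# field, and a two-letter general category, separated by semicolons.
LINE_RE = re.compile(r"^([0-9A-F]+);([^;]*);(\w+);")

line = "249D;PARENTHESIZED LATIN SMALL LETTER B;So;0;L;<compat> 0028"
m = LINE_RE.match(line)
print(m.group(1), m.group(3))  # 249D So
```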

pcre2/jit does remarkably well here, and these types of regexes are one of the many things that pcre2/jit does quickly compared to most other regex engines.

We also include compilation time for this regex, where PCRE2 again does quite well.

| Engine | parse-line | compile |
| --- | --- | --- |
| dotnet/compiled | 89.9 MB/s | 41.50us |
| dotnet/nobacktrack | 17.2 MB/s | 152.39us |
| go/regexp | 79.4 MB/s | 11.85us |
| icu | 130.7 MB/s | 13.87us |
| java/hotspot | 205.8 MB/s | - |
| javascript/v8 | 229.6 MB/s | - |
| pcre2 | 225.3 MB/s | 2.13us |
| pcre2/jit | 710.1 MB/s | 12.10us |
| perl | 23.2 MB/s | - |
| python/re | 51.9 MB/s | 100.92us |
| python/regex | 36.4 MB/s | 272.53us |
| re2 | 102.1 MB/s | 14.43us |
| regress | 203.9 MB/s | 6.19us |
| rust/regex | 356.5 MB/s | 28.79us |
| rust/regex/lite | 30.1 MB/s | 3.77us |
| rust/regexold | 94.1 MB/s | 18.67us |

parse-line

| Parameter | Value |
| --- | --- |
| full name | curated/07-unicode-character-data/parse-line |
| model | grep-captures |
| regex-path | wild/ucd-parse.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | wild/UnicodeData-15.0.0.txt |
| count(.*) | 558784 |

compile

| Parameter | Value |
| --- | --- |
| full name | curated/07-unicode-character-data/compile |
| model | compile |
| regex-path | wild/ucd-parse.txt |
| case-insensitive | false |
| unicode | false |
| haystack | 249D;PARENTHESIZED LATIN SMALL LETTER B;So;0;L;<compat> 0028 [.. snip ..] |
| count(.*) | 1 |

words

This benchmark measures how long it takes for a regex engine to find words in a haystack. We compare one regex that finds all words, \b\w+\b, and another regex that only looks for longer words, \b\w{12,}\b. We also compare ASCII regexes on English text with Unicode regexes on Russian text.

The split between finding all words and finding only long words tends to highlight the overhead of matching in each regex engine. Regex engines that are quicker to get in and out of their match routines do better at finding all words than regex engines with higher overhead. For example, regress is faster than rust/regex on all-english, but substantially slower than rust/regex on long-english. This is likely because rust/regex is doing more work per search call than regress, which is in part rooted in the optimizations it performs to gain higher throughput.

Otherwise, pcre2/jit does quite well here across the board, but especially on the Unicode variants. When comparing it against rust/regex for example, it is substantially faster. In the case of rust/regex, its faster DFA oriented engines cannot handle the Unicode aware \b on non-ASCII haystacks, and this causes rust/regex to use a slower internal engine. It's so slow in fact that python/re and python/regex are both faster than rust/regex for the Unicode benchmarks. For the ASCII long-english benchmark, rust/regex and re2 both do well because most of the time is spent in their lazy DFAs, which have pretty good throughput performance when compared to a pure backtracker.

Note that several regex engines can't be used in the Unicode variants because either they don't support a Unicode aware \w or because they don't support a Unicode aware \b (or both).
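The ASCII-versus-Unicode \w distinction is easy to observe in Python, whose re module is Unicode-aware by default but can be restricted with the re.ASCII flag:

```python
import re

text = "Привет мир"  # Russian: "hello world"

unicode_words = re.findall(r"\b\w+\b", text)
ascii_words = re.findall(r"\b\w+\b", text, re.ASCII)

# With Unicode semantics, Cyrillic letters count as word characters;
# with ASCII semantics ([0-9A-Za-z_]) they do not, so nothing matches.
assert unicode_words == ["Привет", "мир"]
assert ascii_words == []
```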

| Engine | all-english | all-russian | long-english | long-russian |
| --- | --- | --- | --- | --- |
| dotnet/compiled | 56.9 MB/s | 97.7 MB/s | 178.3 MB/s | 113.7 MB/s |
| dotnet/nobacktrack | 29.5 MB/s | 39.8 MB/s | 144.5 MB/s | 123.1 MB/s |
| go/regexp | 20.3 MB/s | - | 45.3 MB/s | - |
| hyperscan | 158.0 MB/s | - | 431.1 MB/s | - |
| icu | 77.8 MB/s | 105.5 MB/s | 41.4 MB/s | 55.5 MB/s |
| java/hotspot | 74.3 MB/s | 141.1 MB/s | 68.7 MB/s | 111.5 MB/s |
| javascript/v8 | 169.3 MB/s | - | 197.5 MB/s | - |
| pcre2 | 99.2 MB/s | 127.9 KB/s | 69.4 MB/s | 6.1 MB/s |
| pcre2/jit | 192.0 MB/s | 229.3 MB/s | 244.7 MB/s | 196.6 MB/s |
| perl | 13.6 MB/s | 35.1 KB/s | 107.7 MB/s | 1854.1 KB/s |
| python/re | 38.6 MB/s | 46.4 MB/s | 125.8 MB/s | 124.4 MB/s |
| python/regex | 22.9 MB/s | 43.8 MB/s | 37.2 MB/s | 104.6 MB/s |
| re2 | 68.1 MB/s | - | 924.3 MB/s | - |
| regress | 164.3 MB/s | - | 153.8 MB/s | - |
| rust/regex | 118.4 MB/s | 19.4 MB/s | 802.4 MB/s | 35.0 MB/s |
| rust/regex/lite | 33.9 MB/s | - | 47.0 MB/s | - |
| rust/regexold | 122.1 MB/s | 7.1 MB/s | 806.3 MB/s | 34.2 MB/s |

all-english

| Parameter | Value |
| --- | --- |
| full name | curated/08-words/all-english |
| model | count-spans |
| regex | \b[0-9A-Za-z_]+\b |
| case-insensitive | false |
| unicode | false |
| haystack-path | opensubtitles/en-sampled.txt |
| count(dotnet/compiled) | 56601 |
| count(dotnet/nobacktrack) | 56601 |
| count(icu) | 56601 |
| count(.*) | 56691 |

We specifically write out [0-9A-Za-z_] instead of using \w because some regex engines, such as the one found in .NET, make \w Unicode aware and there doesn't appear to be any easy way of disabling it.

Also, the .NET engine makes \b Unicode-aware, which also appears impossible to disable. To account for that, we permit a different count.

all-russian

| Parameter | Value |
| --- | --- |
| full name | curated/08-words/all-russian |
| model | count-spans |
| regex | \b\w+\b |
| case-insensitive | false |
| unicode | true |
| haystack-path | opensubtitles/ru-sampled.txt |
| count(dotnet.*) | 53960 |
| count(icu) | 53960 |
| count(java.*) | 53960 |
| count(perl) | 53960 |
| count(.*) | 107391 |

rust/regex/lite, regress, re2 and go/regexp are excluded because their \w is not Unicode aware. hyperscan is excluded because it doesn't support a Unicode aware \b.

For dotnet/compiled, since the length of matching spans is in the number of UTF-16 code units, its expected count is smaller.

For perl, it has the same count as dotnet/compiled, but only because it counts total encoded codepoints. Since every match span in this benchmark seemingly corresponds to codepoints in the basic multi-lingual plane, it follows that the number of UTF-16 code units is equivalent to the number of codepoints.

long-english

| Parameter | Value |
| --- | --- |
| full name | curated/08-words/long-english |
| model | count-spans |
| regex | \b[0-9A-Za-z_]{12,}\b |
| case-insensitive | false |
| unicode | false |
| haystack-path | opensubtitles/en-sampled.txt |
| count(.*) | 839 |

We specifically write out [0-9A-Za-z_] instead of using \w because some regex engines, such as the one found in .NET, make \w Unicode aware and there doesn't appear to be any easy way of disabling it.

Also, the fact that \b is Unicode-aware in .NET does not seem to impact the match counts in this benchmark.

long-russian

| Parameter | Value |
| --- | --- |
| full name | curated/08-words/long-russian |
| model | count-spans |
| regex | \b\w{12,}\b |
| case-insensitive | false |
| unicode | true |
| haystack-path | opensubtitles/ru-sampled.txt |
| count(dotnet.*) | 2747 |
| count(icu) | 2747 |
| count(java.*) | 2747 |
| count(perl) | 2747 |
| count(.*) | 5481 |

rust/regex/lite, regress, re2 and go/regexp are excluded because their \w is not Unicode aware. hyperscan is excluded because it doesn't support a Unicode aware \b.

For dotnet/compiled, since the length of matching spans is in the number of UTF-16 code units, its expected count is smaller.

For perl, it has the same count as dotnet/compiled, but only because it counts total encoded codepoints. Since every match span in this benchmark seemingly corresponds to codepoints in the basic multi-lingual plane, it follows that the number of UTF-16 code units is equivalent to the number of codepoints.

aws-keys

This measures a regex for detecting AWS keys in source code[aws-key-blog]. In particular, to reduce false positives, it looks for both an access key and a secret key within a few lines of one another.

We also measure a "quick" version of the regex that is used to find possible candidates by searching for things that look like an AWS access key.

The measurements here demonstrate why the pypi-aws-secrets project splits this task into two pieces. First it uses the "quick" version to identify candidates, and then it uses the "full" version to lower the false positive rate of the "quick" version. The "quick" version of the regex runs around an order of magnitude faster than the "full" version across the board. To understand why, let's look at the "quick" regex:

((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))

Given this regex, every match starts with one of ASIA, AKIA, AROA or AIDA. This makes it quite amenable to prefilter optimizations where a regex engine can look for matches of one of those 4 literals, and only then use the regex engine to confirm whether there is a match at that position. Some regex engines will also notice that every match starts with an A and use memchr to look for occurrences of A as a fast prefilter.
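The candidate-then-confirm strategy can be sketched in a few lines of Python. The quick pattern below is the benchmark's actual "quick" regex; the confirmation regex is a much-simplified stand-in for the real "full" regex, used here only to illustrate the two-phase shape:

```python
import re

QUICK = re.compile(r"((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))")
# Stand-in for the "full" regex: confirm the candidate is quoted.
FULL = re.compile(r"['\"]((?:ASIA|AKIA|AROA|AIDA)[A-Z0-7]{16})['\"]")

def find_keys(haystack):
    # Phase 1: cheap scan for anything shaped like an access key.
    # Phase 2: run the expensive regex only around each candidate.
    hits = []
    for m in QUICK.finditer(haystack):
        window = haystack[max(0, m.start() - 1):m.end() + 1]
        if FULL.search(window):
            hits.append(m.group(1))
    return hits

assert find_keys('key = "AKIAABCDEFGH23456701"') == ["AKIAABCDEFGH23456701"]
assert find_keys("not_AKIAABCDEFGH23456701_quoted") == []
```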

We also include compilation times to give an idea of how long it takes to compile a moderately complex regex, and how that might vary with the compilation time of a much simpler version of the regex.

Note that in all of the measurements for this group, we search the CPython source code (concatenated into one file). We also lossily convert it to UTF-8 so that regex engines like regress can participate in this benchmark. (The CPython source code contains a very small amount of invalid UTF-8.)

| Engine | full | quick | compile-full | compile-quick |
| --- | --- | --- | --- | --- |
| dotnet/compiled | 513.0 MB/s | 795.3 MB/s | 102.85us | 41.70us |
| dotnet/nobacktrack | - | 658.1 MB/s | - | 218.32us |
| go/regexp | 111.0 MB/s | 873.7 MB/s | 19.68us | 2.91us |
| hyperscan | - | 1350.5 MB/s | - | 6.76ms |
| icu | 191.3 MB/s | 332.9 MB/s | 11.36us | 3.02us |
| java/hotspot | 40.7 MB/s | 114.9 MB/s | - | - |
| javascript/v8 | 320.1 MB/s | 308.8 MB/s | - | - |
| pcre2 | 972.7 MB/s | 1501.6 MB/s | 3.66us | 835.00ns |
| pcre2/jit | 1232.9 MB/s | 1020.3 MB/s | 19.86us | 4.78us |
| perl | 107.6 MB/s | 143.8 MB/s | - | - |
| python/re | 103.4 MB/s | 185.3 MB/s | 167.32us | 39.10us |
| python/regex | 104.5 MB/s | 125.1 MB/s | 473.97us | 95.27us |
| re2 | 550.1 MB/s | 1038.1 MB/s | 69.08us | 9.10us |
| regress | 265.3 MB/s | 740.2 MB/s | 8.68us | 2.06us |
| rust/regex | 1780.0 MB/s | 1872.5 MB/s | 85.28us | 15.24us |
| rust/regex/lite | 21.8 MB/s | 37.8 MB/s | 8.46us | 1.33us |
| rust/regexold | 627.7 MB/s | 1395.5 MB/s | 74.65us | 17.97us |

full

Parameter Value
full name curated/09-aws-keys/full
model grep-captures
regex (('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|").*?(\n^.*?){0,4}(('|")[a-zA-Z0-9+/]{40}('|"))+|('|")[a-zA-Z0-9+/]{40}('|").*?(\n^.*?){0,3}('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|"))+
case-insensitive false
unicode false
haystack-path wild/cpython-226484e4.py
count(.*) 0

quick

Parameter Value
full name curated/09-aws-keys/quick
model grep
regex ((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))
case-insensitive false
unicode false
haystack-path wild/cpython-226484e4.py
count(.*) 0

compile-full

Parameter Value
full name curated/09-aws-keys/compile-full
model compile
regex (('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|").*?(\n^.*?){0,4}(('|")[a-zA-Z0-9+/]{40}('|"))+|('|")[a-zA-Z0-9+/]{40}('|").*?(\n^.*?){0,3}('|")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|"))+
case-insensitive false
unicode false
haystack "AIDAABCDEFGHIJKLMNOP""aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [.. snip ..]
count(.*) 1

compile-quick

Parameter Value
full name curated/09-aws-keys/compile-quick
model compile
regex ((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))
case-insensitive false
unicode false
haystack AIDAABCDEFGHIJKLMNOP
count(.*) 1

bounded-repeat

This group of benchmarks measures how well regex engines do with bounded repeats. Bounded repeats are sub-expressions that are permitted to match up to some fixed number of times. For example, a{3,5} matches 3, 4 or 5 consecutive a characters. Unlike unbounded repetition operators, the regex engine needs some way to track when the bound has reached its limit. For this reason, many regex engines will translate a{3,5} to aaaa?a?. Given that the bounds may be much higher than 5 and that the sub-expression may be much more complicated than a single character, bounded repeats can quickly cause the underlying matcher to balloon in size.
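The rewrite of a{3,5} into aaaa?a? is easy to check directly (a Python sketch; both forms accept exactly the same strings):

```python
import re

bounded = re.compile(r"^a{3,5}$")
expanded = re.compile(r"^aaaa?a?$")

# The two forms are equivalent: 3, 4 or 5 'a's match, nothing else.
for n in range(10):
    s = "a" * n
    assert bool(bounded.match(s)) == bool(expanded.match(s))
    assert bool(bounded.match(s)) == (3 <= n <= 5)
```

Note how the expanded form grows linearly with the upper bound; a bound like {3,500} over a complicated sub-expression is what makes the underlying matcher balloon in size.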

We measure three different types of bounded repeats:

  • A search for a number of consecutive letters, both ASCII only and Unicode aware.
  • A search for certain types of words surrounding a Result type in Rust source code.
  • A search for consecutive words, all beginning with a capital letter.

We also include measurements for the compilation time of the last two.

Hyperscan does unusually well here, particularly for an automata oriented engine. It's plausible that it has some specific optimizations in place for bounded repeats.

rust/regex slows down quite a bit on the context regex. Namely, the context regex is quite gnarly and its (?s:.) sub-expression coupled with the bounded repeat causes a large portion of its transition table to get filled out. This in turn results in more time than usual being spent actually building the lazy DFA's transition table during a search. Typically, the lazy DFA's transition table is built pretty quickly and then mostly reused on subsequent searches. But in this case, the transition table exceeds the lazy DFA's cache capacity and results in the cache getting cleared. However, the rate at which new transitions are created is still low enough that the lazy DFA is used instead of falling back to a slower engine.

| Engine | letters-en | letters-ru | context | capitals | compile-context | compile-capitals |
| --- | --- | --- | --- | --- | --- | --- |
| dotnet/compiled | 263.5 MB/s | 191.4 MB/s | 333.1 MB/s | 864.1 MB/s | 33.47us | 29.28us |
| dotnet/nobacktrack | 146.0 MB/s | 103.7 MB/s | 54.4 MB/s | 555.0 MB/s | 202.09us | 49.31us |
| go/regexp | 29.7 MB/s | 27.8 MB/s | 29.6 MB/s | 54.1 MB/s | 22.65us | 17.41us |
| hyperscan | 730.6 MB/s | 267.9 MB/s | 497.3 MB/s | 2.7 GB/s | 24.63ms | 660.66us |
| icu | 51.4 MB/s | 70.4 MB/s | 70.6 MB/s | 264.9 MB/s | 4.03us | 2.80us |
| java/hotspot | 83.5 MB/s | 134.1 MB/s | 74.5 MB/s | 127.0 MB/s | - | - |
| javascript/v8 | 157.0 MB/s | 60.1 MB/s | 153.6 MB/s | 742.9 MB/s | - | - |
| pcre2 | 57.6 MB/s | 404.4 KB/s | 76.1 MB/s | 600.9 MB/s | 4.84us | 29.40us |
| pcre2/jit | 334.6 MB/s | 287.1 MB/s | 373.8 MB/s | 1572.0 MB/s | 13.03us | 37.10us |
| perl | 70.1 MB/s | 50.5 MB/s | 90.6 MB/s | 226.3 MB/s | - | - |
| python/re | 81.2 MB/s | - | 78.0 MB/s | 63.9 MB/s | 47.41us | 26.23us |
| python/regex | 34.7 MB/s | 78.3 MB/s | 33.1 MB/s | 266.7 MB/s | 105.64us | 55.77us |
| re2 | 507.1 MB/s | 6.8 MB/s | 87.8 MB/s | 982.2 MB/s | 92.25us | 121.01us |
| regress | 165.8 MB/s | 30.2 MB/s | 164.3 MB/s | 379.9 MB/s | - | 1.23us |
| rust/regex | 723.0 MB/s | 662.0 MB/s | 97.0 MB/s | 825.6 MB/s | 59.10us | 58.75us |
| rust/regex/lite | 28.7 MB/s | - | 29.8 MB/s | 57.4 MB/s | 8.68us | 12.32us |
| rust/regexold | 608.7 MB/s | 532.5 MB/s | 20.6 MB/s | 825.6 MB/s | 41.42us | 69.83us |

letters-en

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/letters-en |
| model | count |
| regex | [A-Za-z]{8,13} |
| case-insensitive | false |
| unicode | false |
| haystack-path | opensubtitles/en-sampled.txt |
| count(hyperscan) | 3724 |
| count(.*) | 1833 |

letters-ru

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/letters-ru |
| model | count |
| regex | \p{L}{8,13} |
| case-insensitive | false |
| unicode | true |
| haystack-path | opensubtitles/ru-sampled.txt |
| count(hyperscan) | 8570 |
| count(.*) | 3475 |

context

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/context |
| model | count |
| regex | [A-Za-z]{10}\s+[\s\S]{0,100}Result[\s\S]{0,100}\s+[A-Za-z]{10} |
| case-insensitive | false |
| unicode | false |
| haystack-path | rust-src-tools-3b0d4813.txt |
| count(hyperscan) | 109 |
| count(.*) | 53 |

capitals

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/capitals |
| model | count |
| regex | (?:[A-Z][a-z]+\s*){10,100} |
| case-insensitive | false |
| unicode | false |
| haystack-path | rust-src-tools-3b0d4813.txt |
| count(hyperscan) | 237 |
| count(.*) | 11 |

compile-context

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/compile-context |
| model | compile |
| regex | [A-Za-z]{10}\s+(?s:.){0,100}Result(?s:.){0,100}\s+[A-Za-z]{10} |
| case-insensitive | false |
| unicode | false |
| haystack | abcdefghij blah blah blah Result blib blab klmnopqrst |
| count(.*) | 1 |

compile-capitals

| Parameter | Value |
| --- | --- |
| full name | curated/10-bounded-repeat/compile-capitals |
| model | compile |
| regex | (?:[A-Z][a-z]+\s*){10,100} |
| case-insensitive | false |
| unicode | false |
| haystack | Crazy Janey Mission Man Wild Billy Greasy Lake Hazy Davy Kil [.. snip ..] |
| count(hyperscan) | 12 |
| count(.*) | 1 |

unstructured-to-json

These benchmarks come from a task that converts unstructured log data to structured JSON data. It works by iterating over every line in the log file and parsing various parts of each line into different sections using capture groups. The regex matches every line, so any fast logic designed to reject non-matches will generally penalize regex engines here.

The original regex looks like this:

(?x)
^
(?P<timestamp>[^\ ]+\ [^\ ]+)

[\ ](?P<level>[DIWEF])[1234]:[\ ]

(?P<header>
    (?:
        (?:
            \[ [^\]]*? \] | \( [^\)]*? \)
        ):[\ ]
    )*
)

(?P<body>.*?)

[\ ]\{(?P<location>[^\}]*)\}
$

(The actual regex is flattened since not all engines support verbose mode. We also remove the names from each capture group.)

pcre2/jit does really well here. I'm not personally familiar with how PCRE2's JIT works, but if I had to guess, I'd say there are some clever optimizations with respect to the [^ ]+ (and similar) sub-expressions in this regex.

Otherwise, the backtracking engines generally outperform the automata engines in this benchmark. Interestingly, all of re2, go/regexp and rust/regex principally use their own bounded backtracking algorithms. But it looks like "proper" backtrackers tend to be better optimized than the ones found in RE2 and its descendants. (Bounded backtracking does have to pay for checking that no combination of haystack position and NFA state is visited more than once, but even removing that check does not bring, e.g., rust/regex up to speeds similar to other backtrackers.)

| Engine | extract | compile |
| --- | --- | --- |
| dotnet/compiled | 549.1 MB/s | 44.63us |
| dotnet/nobacktrack | 17.4 MB/s | 493.98us |
| go/regexp | 83.8 MB/s | 6.26us |
| icu | 93.5 MB/s | 7.75us |
| java/hotspot | 217.2 MB/s | - |
| javascript/v8 | 969.1 MB/s | - |
| pcre2 | 209.9 MB/s | 1.36us |
| pcre2/jit | 1495.9 MB/s | 7.00us |
| perl | 149.2 MB/s | - |
| python/re | 119.4 MB/s | 71.69us |
| python/regex | 125.4 MB/s | 191.86us |
| re2 | 111.9 MB/s | 9.09us |
| regress | 286.6 MB/s | 4.12us |
| rust/regex | 112.7 MB/s | 19.86us |
| rust/regex/lite | 25.2 MB/s | 2.29us |
| rust/regexold | 89.7 MB/s | 13.92us |

extract

| Parameter | Value |
| --- | --- |
| full name | curated/11-unstructured-to-json/extract |
| model | grep-captures |
| regex-path | wild/unstructured-to-json.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | wild/unstructured-to-json.log |
| count(.*) | 600 |

compile

| Parameter | Value |
| --- | --- |
| full name | curated/11-unstructured-to-json/compile |
| model | compile |
| regex-path | wild/unstructured-to-json.txt |
| case-insensitive | false |
| unicode | false |
| haystack | 2022/06/17 06:25:22 I4: [17936:140245395805952:(17998)]: (8f [.. snip ..] |
| count(.*) | 1 |

dictionary

This benchmark highlights how well each regex engine does searching for a small dictionary of words. The dictionary is made up of about 2,500 words, where every word is at least 15 bytes in length. The number of words was chosen to be small enough that most regex engines can execute a search in reasonable time. The bigger minimum length of each word was chosen in order to make this a throughput benchmark. That is, there is only one match found here, so this benchmark is measuring the raw speed with which an engine can handle a big alternation of plain literals.

Most regex engines run quite slowly here. perl, re2 and rust/regex lead the pack with throughput measured in MB/s, while the rest are measured in KB/s. One might think that this is a benchmark that would manifest as a bright dividing line between finite automata engines and backtracking engines. Namely, finite automata engines should handle all of the alternations in "parallel," whereas backtrackers will essentially try to match each alternate at each position in the haystack (owch). Indeed, this seems mostly true, but perl (a backtracker) does quite well while go/regexp (a finite automata engine) does quite poorly. Moreover, what explains the differences between perl, re2 and rust/regex?

There are several knots to untangle here.

First, we'll tackle the reason why go/regexp has a poor showing here. The answer lies in how the Thompson NFA construction works. A Thompson NFA can be built in worst case linear time (in the size of the pattern), but in exchange, it has epsilon transitions in its state graph. Epsilon transitions are transitions in a finite state machine that are followed without consuming any input. In a case like foo|bar|quux, you can think of the corresponding Thompson NFA (very loosely) as creating a single starting state with three epsilon transitions to each of foo, bar and quux. In a Thompson NFA simulation (i.e., a regex search using a Thompson NFA), all of these epsilon transitions have to be continually followed at every position in the haystack. With a large number of alternations, the amount of time spent shuffling through these epsilon transitions can be quite enormous. While the search time remains linear with respect to the size of the haystack, the "constant" factor here (i.e., the size of the regex pattern) can become quite large. In other words, a Thompson NFA scales poorly with respect to the size of the pattern. In this particular case, a Thompson NFA just doesn't do any better than a backtracker.
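The "follow every branch at once" cost can be seen in a deliberately naive Python sketch of an NFA-style simulation over an alternation of literal words. Real engines precompute epsilon closures over a proper NFA graph; the point here is only that the per-position state set starts with one entry per alternate, so work scales with the number of branches:

```python
# Toy Thompson-style simulation for an alternation of literal words.
# Each active state is a pair (word_index, offset_into_word).
def nfa_find(words, haystack):
    for start in range(len(haystack)):
        # Epsilon fan-out: every alternate becomes active at once.
        states = {(i, 0) for i in range(len(words))}
        pos = start
        while states:
            done = next((i for (i, off) in states
                         if off == len(words[i])), None)
            if done is not None:
                return (start, start + len(words[done]))
            if pos >= len(haystack):
                break
            # Advance every surviving state on the next character.
            states = {(i, off + 1) for (i, off) in states
                      if words[i][off] == haystack[pos]}
            pos += 1
    return None

assert nfa_find(["foo", "bar", "quux"], "xxbarxx") == (2, 5)
```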

The second knot to untangle here is why perl does so well despite being a backtracker. While I'm not an expert on Perl internals, it appears to do well here because of something called a trie optimization. That is, Perl's regex engine will transform large alternations like this into an equivalent but much more efficient structure by essentially building a trie and encoding it into the regex itself. It turns out that rust/regex does the same thing, because the exact same optimization helps a backtracker in the same way it helps a Thompson NFA simulation. The optimization exploits the fact that the branches in the alternation are not truly independent and actually share a lot of overlap. Without the optimization, the branches are treated as completely independent and one must brute force their way through each one.

So what does this trie optimization look like? Consider a regex like zapper|z|zap. There is quite a bit of redundant structure. With some care, and making sure to preserve leftmost-first match semantics, it can be translated to the equivalent pattern z(apper||ap). Notice how in the pattern we started with, the alternation needs to be dealt with for every byte in the haystack, because you never know which branch is going to match, if any. But in the latter case, you now don't even need to consider the alternation until the byte z matches, which is likely to be quite rare.

Indeed, the algorithm for constructing such a pattern effectively proceeds by building a trie from the original alternation, and then converting the trie back to whatever intermediate representation the regex engine uses.
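The leftmost-first equivalence of zapper|z|zap and its factored form can be checked directly with Python's re, which is a backtracker with leftmost-first semantics (a non-capturing group is used below so only the span is compared):

```python
import re

original = re.compile(r"zapper|z|zap")
factored = re.compile(r"z(?:apper||ap)")

# Both must agree on what matches and, crucially, on *which* span wins
# under leftmost-first semantics: on "zap", 'z' beats 'zap' because the
# empty middle branch is tried before 'ap'.
for haystack in ["zapper", "zap", "z", "zapp", "xxzapperxx", "nothing"]:
    a, b = original.search(haystack), factored.search(haystack)
    assert (a is None) == (b is None)
    if a is not None:
        assert a.span() == b.span()
```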

The last knot to untangle is to explain the differences between perl, re2 and rust/regex. Perl still uses a backtracking strategy, but with the trie optimization described above, it can try far fewer things for each position in the haystack. But what's going on with re2 and rust/regex? In this case, re2 uses the Thompson NFA simulation, but re2 does not use the trie optimization described above, so it gets stuck in a lot of epsilon transition shuffling. Finally, rust/regex does the trie optimization and uses its lazy DFA internally for this case. re2 probably could too, but both libraries use various heuristics for deciding which engine to use. In this case, the regex might be too big for re2 to use its lazy DFA.

OK, that wraps up discussion of the single benchmark. But what is the multi benchmark? Where single represents combining all words in the dictionary into a single pattern, multi represents a strategy where each word is treated as its own distinct pattern. In the single case, Hyperscan actually rejects the pattern for being too large, but is happy to deal with it if each word is treated as its own pattern. The main semantic difference between these strategies is that the multi approach permits not only identifying where a match occurred, but which word in the dictionary matched. And this is done without using capture groups.

Hyperscan does really well here. While its source code is difficult to penetrate, my understanding is that Hyperscan uses its "FDR" algorithm here, which is essentially a SIMD-ified variant of multi-substring Shift-Or. This benchmark represents Hyperscan's bread and butter: multi-pattern search.

rust/regex actually does worse in the multi case versus the single case. rust/regex's support for multi-pattern search is still young, and in particular, the multi-pattern case currently inhibits the trie optimization discussed above.

Finally, we also include compile-time benchmarks for each of the above cases in order to give an idea of how long it takes to build a regex from a dictionary like this. I don't have much to say here other than to call out the fact that the trie optimization does have a meaningful impact on regex compile times in the rust/regex case at least.

| Engine | single | multi | compile-single | compile-multi |
| --- | --- | --- | --- | --- |
| dotnet/compiled | 1292.7 KB/s | - | 11.16ms | - |
| go/regexp | 568.9 KB/s | - | 5.86ms | - |
| hyperscan | - | 8.2 GB/s | - | 20.72ms |
| icu | 146.5 KB/s | - | 1.45ms | - |
| java/hotspot | 108.2 KB/s | - | - | - |
| javascript/v8 | 28.3 KB/s | - | - | - |
| perl | 126.9 MB/s | - | - | - |
| python/re | 176.4 KB/s | - | 25.41ms | - |
| python/regex | 155.9 KB/s | - | 69.37ms | - |
| re2 | 5.4 MB/s | - | 4.20ms | - |
| regress | 87.2 KB/s | - | 3.02ms | - |
| rust/regex | 714.9 MB/s | 172.6 MB/s | 7.69ms | 16.25ms |
| rust/regex/lite | 51.3 KB/s | - | 1.50ms | - |
| rust/regexold | 29.8 KB/s | - | 8.01ms | - |

single

| Parameter | Value |
| --- | --- |
| full name | curated/12-dictionary/single |
| model | count |
| regex-path | dictionary/english/length-15.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | opensubtitles/en-medium.txt |
| count(.*) | 1 |

dotnet/nobacktrack is omitted because the regex is too large.

hyperscan is omitted because the regex is too large.

pcre2/* are omitted because the regex is too large.

multi

| Parameter | Value |
| --- | --- |
| full name | curated/12-dictionary/multi |
| model | count |
| regex-path | dictionary/english/length-15.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | opensubtitles/en-medium.txt |
| count(.*) | 1 |

Only hyperscan and rust/regex are included because they are the only regex engines to support multi-pattern regexes. (Note that the regex crate API does not support this. You need to drop down to the meta::Regex API in the regex-automata crate.)

compile-single

| Parameter | Value |
| --- | --- |
| full name | curated/12-dictionary/compile-single |
| model | compile |
| regex-path | dictionary/english/length-15.txt |
| case-insensitive | false |
| unicode | false |
| haystack | Zubeneschamali's |
| count(.*) | 1 |

dotnet/nobacktrack is omitted because the regex is too large.

hyperscan is omitted because the regex is too large.

java/hotspot is omitted because we currently don't benchmark Java regex compilation.

javascript/v8 is omitted because we currently don't benchmark JavaScript regex compilation.

pcre2/* are omitted because the regex is too large.

perl is omitted because we currently don't benchmark Perl regex compilation.

compile-multi

| Parameter | Value |
| --- | --- |
| full name | curated/12-dictionary/compile-multi |
| model | compile |
| regex-path | dictionary/english/length-15.txt |
| case-insensitive | false |
| unicode | false |
| haystack | Zubeneschamali's |
| count(.*) | 1 |

Only hyperscan and rust/regex are included because they are the only regex engines to support multi-pattern regexes. (Note that the regex crate API does not support this. You need to drop down to the meta::Regex API in the regex-automata crate.)

noseyparker

This benchmark measures how well regex engines do when asked to look for matches for many different patterns. The patterns come from the Nosey Parker project, which finds secrets and sensitive information in textual data and source repositories. Nosey Parker operates principally by defining a number of rules for detecting secrets (for example, AWS API keys), and then looking for matches of those rules in various corpora. The rules are, as you might have guessed, defined as regular expressions.

I went through each of its rules and extracted a total of 96 regular expressions, as of commit be8c26e8. These 96 regexes make up the single and multi benchmarks below, with single corresponding to joining all of the patterns into one big alternation and multi corresponding to treating each pattern as its own regex. In the latter case, only the rust/regex and hyperscan engines are measured, since they are the only ones to support multi-regex matching.
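The "single" construction above can be sketched in a few lines of Python with the stdlib `re` module. The two rules here are hypothetical stand-ins for Nosey Parker-style patterns (the real 96 live in wild/noseyparker.txt); the key detail is wrapping each pattern in a non-capturing group so the alternation doesn't change any rule's meaning:

```python
import re

# Hypothetical stand-ins for two Nosey Parker-style rules.
rules = [
    r"AKIA[0-9A-Z]{16}",               # AWS-style access key id
    r"TWITTER_API_KEY\s*=\s*'[^']+'",  # assignment-style secret
]

# The "single" model: join every rule into one big alternation, with
# each rule wrapped in (?:...) to preserve its meaning.
single = re.compile("|".join(f"(?:{p})" for p in rules))

text = "TWITTER_API_KEY = 'UZYoBAfBzNace3mBwPOGYw'"
count = sum(1 for _ in single.finditer(text))
print(count)  # 1
```

One caveat of this construction is that a match no longer tells you which rule fired, which is one reason true multi-pattern APIs (as in hyperscan and regex-automata) exist.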

This is a particularly brutal benchmark. Most regex engines can't deal with it at all, and will either reject it at compilation time for being too big or simply take longer than we're willing to wait. (rebar imposes a reasonable timeout for all benchmarks, and if the timeout is exceeded, no measurements are collected.)

Hyperscan is in its own class here. Hyperscan was purpose built to deal with the multi-pattern use case, and it deals with it very well here. The specific patterns also put this in its wheelhouse because they all have some kind of literal string in them. Hyperscan uses a literal searching and finite automata decomposition strategy to quickly identify candidate matches and avoids doing redundant work. However, how it all fits together while avoiding pitfalls such as worst case quadratic search time doesn't appear to be written down anywhere.

rust/regex performs serviceably here, but only barely. It uses its lazy DFA to handle this regex, but with the default cache sizes, profiling suggests that it is spending a lot of its time building the DFA. It's plausible that increasing the cache size for such a big regex would let it execute searches faster.

pcre2/jit doesn't do as well here, but one might expect that because it is a backtracking engine. With that said, no other backtracking engine could deal with this regex at all, so pcre2/jit is doing quite well relative to other backtracking engines.

Finally, we also include compile time benchmarks for each of the single and multi cases to give a general sense of how long this monster regex takes to build.

| Engine | single | multi | compile-single | compile-multi |
| --- | --- | --- | --- | --- |
| hyperscan | 4.3 GB/s | 4.3 GB/s | 214.59ms | 134.09ms |
| pcre2/jit | 12.9 MB/s | - | 613.09us | - |
| rust/regex | 126.7 MB/s | 100.7 MB/s | 2.42ms | 2.80ms |
| rust/regexold | 9.5 MB/s | - | 4.34ms | - |
Show individual benchmark parameters.

single

| Parameter | Value |
| --- | --- |
| full name | curated/13-noseyparker/single |
| model | count |
| regex-path | wild/noseyparker.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | wild/cpython-226484e4.py |
| count(hyperscan) | 241 |
| count(.*) | 55 |
  • dotnet/compiled is omitted because it times out.
  • dotnet/nobacktrack is omitted because the regex is too big.
  • go/regexp is omitted because there are bounded repeats that exceed its limit.
  • icu is omitted because it times out.
  • java/hotspot is omitted because it times out.
  • javascript/v8 is omitted because it doesn't support inline flags.
  • pcre2 is omitted because it times out.
  • perl is omitted because it times out.
  • python/* are omitted because they time out.
  • re2 is omitted because it seems to fail and reports a count of 0.
  • regress is omitted because it doesn't support inline flags.
  • rust/regex/lite is omitted because it times out.

multi

| Parameter | Value |
| --- | --- |
| full name | curated/13-noseyparker/multi |
| model | count |
| regex-path | wild/noseyparker.txt |
| case-insensitive | false |
| unicode | false |
| haystack-path | wild/cpython-226484e4.py |
| count(hyperscan) | 241 |
| count(.*) | 55 |

Only hyperscan and rust/regex are included because they are the only regex engines to support multi-pattern regexes. (Note that the regex crate API does not support this. You need to drop down to the meta::Regex API in the regex-automata crate.)

compile-single

| Parameter | Value |
| --- | --- |
| full name | curated/13-noseyparker/compile-single |
| model | compile |
| regex-path | wild/noseyparker.txt |
| case-insensitive | false |
| unicode | false |
| haystack | TWITTER_API_KEY = 'UZYoBAfBzNace3mBwPOGYw' |
| count(.*) | 1 |

We only include the engines that are measured in the single benchmark.

compile-multi

| Parameter | Value |
| --- | --- |
| full name | curated/13-noseyparker/compile-multi |
| model | compile |
| regex-path | wild/noseyparker.txt |
| case-insensitive | false |
| unicode | false |
| haystack | TWITTER_API_KEY = 'UZYoBAfBzNace3mBwPOGYw' |
| count(.*) | 1 |

We only include the engines that are measured in the multi benchmark.

quadratic

This set of benchmarks is meant to convince you that, even if you use a regex engine that purports to guarantee worst case linear time searches, it is likely possible to use it in a way that results in worst case quadratic time!

The regex we use here is .*[^A-Z]|[A-Z] and the haystack we search is the letter A repeated 100, 200 and 1000 times. There are two key insights to understanding how this results in quadratic behavior:

  1. It requires one to iterate over all matches in a haystack. Some regex engines (e.g., rust/regex and go/regexp) provide first class APIs for such an operation. They typically handle the pathological case of an empty match for you, which would result in an infinite loop in naively written code. Some regex engines (e.g., pcre2 and re2) do not provide any APIs for iterating over all matches. Callers have to write that code themselves. The point here is that a regex search is executed many times for a haystack.
  2. Because of how leftmost-first match semantics work, a regex engine might scan all the way to the end of a haystack before reporting a match that starts and ends at the beginning of the haystack. The reason for this is that most regex engines will, by default, greedily consume as much as possible.
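To illustrate the first point, here is what a hand-rolled "iterate over all matches" loop looks like for a user of a single-search API like re2's or pcre2's. The `iter_matches` helper is hypothetical and written against the stdlib `re` module purely for illustration; note the special case needed to avoid an infinite loop on an empty match:

```python
import re

def iter_matches(pattern, haystack):
    # Hand-rolled "iterate over all matches" loop, as a caller of an
    # API like re2's or pcre2's would have to write it themselves.
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(haystack):
        m = regex.search(haystack, pos)
        if m is None:
            break
        yield m
        # Advance past the match, bumping by one on an empty match so
        # the loop doesn't spin forever at the same position.
        pos = m.end() + 1 if m.end() == m.start() else m.end()

# 'a*' matches the empty string, so naive advancement would never move.
matches = list(iter_matches(r"a*", "bab"))
print(len(matches))  # 4: '', 'a', '', ''
```

The point stands regardless of the helper's exact empty-match policy: iteration means the regex search itself runs many times over one haystack.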

Quadratic behavior occurs by exploiting both of the insights above: by crafting a regex and a haystack where every search scans to the end of the haystack, but also that every search reports a match at the beginning of the search that is exactly one character long.

Indeed, this is exactly what the regex .*[^A-Z]|[A-Z] does on a haystack like AAAAA. Leftmost-first match semantics says that if there are multiple matches that occur at the same position, then the match generated "first" by the pattern should be preferred. In this case, .*[^A-Z] is preferred over [A-Z]. But since .* matches as much as possible, it is not actually known whether that branch matches until the regex engine reaches the end of the haystack and realizes that it cannot match. At that point, the match from the second branch, [A-Z] corresponding to the first A, is reported. Since we're iterating over every match, the search advances to immediately after the first A and repeats the same behavior: scanning all the way to the end of the haystack, only to discover there is no match, and then reporting the second A as the next match. This repeats itself, scanning the haystack once for every match reported, so the total work done is proportional to n^2, where n is the length of the haystack.
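This behavior is easy to reproduce with the stdlib `re` module. A minimal sketch (timings omitted; the match counts confirm that every position yields a one-character match, each of which costs a scan to the end of the haystack):

```python
import re

pat = re.compile(r".*[^A-Z]|[A-Z]")

counts = []
for n in (100, 200, 1000):
    haystack = "A" * n
    # Each search scans to the end of the haystack before reporting a
    # match of exactly one 'A', so iterating does O(n^2) total work.
    counts.append(sum(1 for _ in pat.finditer(haystack)))
print(counts)  # [100, 200, 1000]
```

Wrapping the loop body in a timer shows the per-byte throughput dropping as the haystack grows, which is exactly what the table below measures.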

It is important to note that in a finite automata oriented regex engine, the fact that [A-Z] matches at the beginning of the haystack is known after the regex engine scans that part of the haystack. That is, its internal state is aware of the fact that a match exists. It specifically continues searching because leftmost-first semantics demand it. Once it reaches the end of the haystack (or a point at which no other match could occur), then it stops and returns the most recent match that it found. Unlike a backtracker, it does not need to go back to the beginning of the haystack and start looking for a match of the second branch.

Given the semantics of leftmost-first matching, there is no way to avoid this. It is, unfortunately, just how the cookie crumbles.

With all of that said, hyperscan is the one regex engine that manages to maintain the same throughput for each of the 1x, 2x and 10x benchmarks. That is, it does not exhibit worst case quadratic behavior here; it retains its linear search time. How does it do it? The secret lies in the fact that Hyperscan doesn't implement leftmost-first match semantics. (Indeed, this is why some of its match counts differ throughout the benchmarks in rebar.) Instead, Hyperscan reports a match as soon as it is seen. Once a match is found, it doesn't continue on to try and greedily match the regex. For example, the regex \w+ will report 5 matches in the haystack aaaaa, whereas for most other regex engines, only one match will be reported. This means Hyperscan can zip through this benchmark in one pass of the haystack.
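Python's `re` module cannot switch to earliest-match semantics, but the difference in match counts can be approximated with a non-greedy repetition. This is only a sketch of the semantics, not of Hyperscan's actual mechanism (which reports matches through callbacks during a single scan):

```python
import re

hay = "aaaaa"

# Leftmost-first greedy semantics: one match spanning the whole haystack.
greedy = re.findall(r"\w+", hay)
print(len(greedy))  # 1

# Approximating "report a match as soon as it is seen": a non-greedy
# repetition stops at the shortest match, here a single character.
earliest = re.findall(r"\w+?", hay)
print(len(earliest))  # 5
```

The five-versus-one match count is the same discrepancy you'll see between Hyperscan and most other engines elsewhere in rebar.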

The rust/regex engine can also do this, but requires dropping down to the regex-automata crate and using Input::new(haystack).earliest(true) when running a search. This instructs the regex engine to report matches as they're seen, just like Hyperscan. Indeed, if the rust/regex runner program uses this approach, then its throughput remains constant for the 1x, 2x and 10x benchmarks, just like for Hyperscan.

Credit goes to this bug filed against the go/regexp engine for making me aware of this issue.

Note: We use [A-Z] in this example instead of A in an attempt to subvert any sort of literal optimizations done by the regex engine.

| Engine | 1x | 2x | 10x |
| --- | --- | --- | --- |
| dotnet/compiled | 20.3 MB/s | 13.3 MB/s | 3.5 MB/s |
| dotnet/nobacktrack | 7.8 MB/s | 4.8 MB/s | 1187.7 KB/s |
| go/regexp | 1658.8 KB/s | 895.8 KB/s | 193.4 KB/s |
| hyperscan | 173.7 MB/s | 178.3 MB/s | 184.5 MB/s |
| icu | 3.4 MB/s | 1925.2 KB/s | 380.0 KB/s |
| java/hotspot | 10.6 MB/s | 5.9 MB/s | 1006.8 KB/s |
| javascript/v8 | 16.3 MB/s | 10.6 MB/s | 2.9 MB/s |
| pcre2 | 2.1 MB/s | 1167.4 KB/s | 244.8 KB/s |
| pcre2/jit | 18.7 MB/s | 11.5 MB/s | 3.0 MB/s |
| perl | 2.4 MB/s | 1658.3 KB/s | 456.3 KB/s |
| python/re | 3.3 MB/s | 1997.1 KB/s | 460.6 KB/s |
| python/regex | 3.7 MB/s | 2.5 MB/s | 707.7 KB/s |
| re2 | 9.4 MB/s | 6.5 MB/s | 1838.0 KB/s |
| regress | 5.6 MB/s | 3.1 MB/s | 682.9 KB/s |
| rust/regex | 18.0 MB/s | 8.6 MB/s | 1708.3 KB/s |
| rust/regex/lite | 1173.3 KB/s | 620.7 KB/s | 129.2 KB/s |
| rust/regexold | 13.9 MB/s | 7.6 MB/s | 1640.4 KB/s |
Show individual benchmark parameters.

1x

| Parameter | Value |
| --- | --- |
| full name | curated/14-quadratic/1x |
| model | count |
| regex | .*[^A-Z]\|[A-Z] |
| case-insensitive | false |
| unicode | false |
| haystack | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..] |
| count(.*) | 100 |

This is our baseline benchmark that searches a haystack with the letter A repeated 100 times.

2x

| Parameter | Value |
| --- | --- |
| full name | curated/14-quadratic/2x |
| model | count |
| regex | .*[^A-Z]\|[A-Z] |
| case-insensitive | false |
| unicode | false |
| haystack | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..] |
| count(.*) | 200 |

This is like 1x, but doubles the haystack length. This should provide a way to show the quadratic nature of this particular benchmark.

The throughputs reported should remain roughly the same if the time complexity is linear, but in fact, the throughputs decrease by about a factor of 2. That demonstrates a superlinear relationship between the inputs and the time taken.

10x

| Parameter | Value |
| --- | --- |
| full name | curated/14-quadratic/10x |
| model | count |
| regex | .*[^A-Z]\|[A-Z] |
| case-insensitive | false |
| unicode | false |
| haystack | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..] |
| count(.*) | 1000 |

This is like 1x, but increases the haystack length by a factor of 10. This should provide more evidence that the relationship is quadratic in the same way that the 2x benchmark does.
