• Stars
    star
    353
  • Rank 120,358 (Top 3 %)
  • Language
    JavaScript
  • License
    MIT License
  • Created over 11 years ago
  • Updated almost 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Generate JavaScript-compatible regular expressions based on a given set of Unicode symbols or code points.

Regenerate Build status Code coverage status

Regenerate is a Unicode-aware regex generator for JavaScript. It allows you to easily generate ES5-compatible regular expressions based on a given set of Unicode symbols or code points. (This is trickier than you might think, because of how JavaScript deals with astral symbols.)

Installation

Via npm:

npm install regenerate

Via Bower:

bower install regenerate

In a browser:

<script src="regenerate.js"></script>

In Node.js, io.js, and RingoJS ≥ v0.8.0:

var regenerate = require('regenerate');

In Narwhal and RingoJS ≤ v0.7.0:

var regenerate = require('regenerate').regenerate;

In Rhino:

load('regenerate.js');

Using an AMD loader like RequireJS:

require(
  {
    'paths': {
      'regenerate': 'path/to/regenerate'
    }
  },
  ['regenerate'],
  function(regenerate) {
    console.log(regenerate);
  }
);

API

regenerate(value1, value2, value3, ...)

The main Regenerate function. Calling this function creates a new set that gets a chainable API.

var set = regenerate()
  .addRange(0x60, 0x69) // add U+0060 to U+0069
  .remove(0x62, 0x64) // remove U+0062 and U+0064
  .add(0x1D306); // add U+1D306
set.valueOf();
// → [0x60, 0x61, 0x63, 0x65, 0x66, 0x67, 0x68, 0x69, 0x1D306]
set.toString();
// → '[`ace-i]|\\uD834\\uDF06'
set.toRegExp();
// → /[`ace-i]|\uD834\uDF06/

Any arguments passed to regenerate() will be added to the set right away. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

regenerate.prototype.add(value1, value2, value3, ...)

Any arguments passed to add() are added to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

regenerate().add(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate().add(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

It’s also possible to pass in a Regenerate instance. Doing so adds all code points in that instance to the current set.

var set = regenerate(0x1D306, 'A');
regenerate().add('©', 0x2603).add(set).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

Note that the initial call to regenerate() acts like add(). This allows you to create a new Regenerate instance and add some code points to it in one go:

regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

regenerate.prototype.remove(value1, value2, value3, ...)

Any arguments passed to remove() are removed from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

regenerate(0x1D306, 'A', '©', 0x2603).remove('☃').toString();
// → '[A\\xA9]|\\uD834\\uDF06'

It’s also possible to pass in a Regenerate instance. Doing so removes all code points in that instance from the current set.

var set = regenerate('☃');
regenerate(0x1D306, 'A', '©', 0x2603).remove(set).toString();
// → '[A\\xA9]|\\uD834\\uDF06'

regenerate.prototype.addRange(start, end)

Adds a range of code points from start to end (inclusive) to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

regenerate(0x1D306).addRange(0x00, 0xFF).toString(16);
// → '[\\0-\\xFF]|\\uD834\\uDF06'

regenerate().addRange('A', 'z').toString();
// → '[A-z]'

regenerate.prototype.removeRange(start, end)

Removes a range of code points from start to end (inclusive) from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

regenerate()
  .addRange(0x000000, 0x10FFFF) // add all Unicode code points
  .removeRange('A', 'z') // remove all symbols from `A` to `z`
  .toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

regenerate()
  .addRange(0x000000, 0x10FFFF) // add all Unicode code points
  .removeRange(0x0041, 0x007A) // remove all code points from U+0041 to U+007A
  .toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

regenerate.prototype.intersection(codePoints)

Removes any code points from the set that are not present in both the set and the given codePoints array. codePoints must be an array of numeric code point values, i.e. numbers.

regenerate()
  .addRange(0x00, 0xFF) // add extended ASCII code points
  .intersection([0x61, 0x69]) // remove all code points from the set except for these
  .toString();
// → '[ai]'

Instead of the codePoints array, it’s also possible to pass in a Regenerate instance.

var whitelist = regenerate(0x61, 0x69);

regenerate()
  .addRange(0x00, 0xFF) // add extended ASCII code points
  .intersection(whitelist) // remove all code points from the set except for those in the `whitelist` set
  .toString();
// → '[ai]'

regenerate.prototype.contains(value)

Returns true if the given value is part of the set, and false otherwise. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

var set = regenerate().addRange(0x00, 0xFF);
set.contains('A');
// → true
set.contains(0x1D306);
// → false

regenerate.prototype.clone()

Returns a clone of the current code point set. Any actions performed on the clone won’t mutate the original set.

var setA = regenerate(0x1D306);
var setB = setA.clone().add(0x1F4A9);
setA.toArray();
// → [0x1D306]
setB.toArray();
// → [0x1D306, 0x1F4A9]

regenerate.prototype.toString(options)

Returns a string representing (part of) a regular expression that matches all the symbols mapped to the code points within the set.

regenerate(0x1D306, 0x1F4A9).toString();
// → '\\uD834\\uDF06|\\uD83D\\uDCA9'

If the bmpOnly property of the optional options object is set to true, the output matches surrogates individually, regardless of whether they’re lone surrogates or just part of a surrogate pair. This simplifies the output, but it can only be used in case you’re certain the strings it will be used on don’t contain any astral symbols.

var highSurrogates = regenerate().addRange(0xD800, 0xDBFF);
highSurrogates.toString();
// → '[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])'
highSurrogates.toString({ 'bmpOnly': true });
// → '[\\uD800-\\uDBFF]'

var lowSurrogates = regenerate().addRange(0xDC00, 0xDFFF);
lowSurrogates.toString();
// → '(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
lowSurrogates.toString({ 'bmpOnly': true });
// → '[\\uDC00-\\uDFFF]'

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript without the use of lookbehind assertions, which aren't yet widely supported. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.

If the hasUnicodeFlag property of the optional options object is set to true, the output makes use of Unicode code point escapes (\u{…}) where applicable. This simplifies the output at the cost of compatibility and portability, since it means the output can only be used as a pattern in a regular expression with the ES6 u flag enabled.

var set = regenerate().addRange(0x0, 0x10FFFF);

set.toString();
// → '[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]''

set.toString({ 'hasUnicodeFlag': true });
// → '[\\0-\\u{10FFFF}]'

regenerate.prototype.toRegExp(flags = '')

Returns a regular expression that matches all the symbols mapped to the code points within the set. Optionally, you can pass flags to be added to the regular expression.

var regex = regenerate(0x1D306, 0x1F4A9).toRegExp();
// → /\uD834\uDF06|\uD83D\uDCA9/
regex.test('𝌆');
// → true
regex.test('A');
// → false

// With flags:
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp('g');
// → /\uD834\uDF06|\uD83D\uDCA9/g

Note: This probably shouldn’t be used. Regenerate is intended as a tool that is used as part of a build process, not at runtime.

regenerate.prototype.valueOf() or regenerate.prototype.toArray()

Returns a sorted array of unique code points in the set.

regenerate(0x1D306)
  .addRange(0x60, 0x65)
  .add(0x59, 0x60) // note: 0x59 is added after 0x65, and 0x60 is a duplicate
  .valueOf();
// → [0x59, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x1D306]

regenerate.version

A string representing the semantic version number.

Combine Regenerate with other libraries

Regenerate gets even better when combined with other libraries such as Punycode.js. Here’s an example where Punycode.js is used to convert a string into an array of code points, that is then passed on to Regenerate:

var regenerate = require('regenerate');
var punycode = require('punycode');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all code points used in the string:
var codePoints = punycode.ucs2.decode(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(codePoints).toString();
// → '[ \\.Ladeilmopr-u]'

In ES6 you can do something similar with Array.from which uses the string’s iterator to split the given string into an array of strings that each contain a single symbol. regenerate() accepts both strings and code points, remember?

var regenerate = require('regenerate');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all symbols used in the string:
var symbols = Array.from(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(symbols).toString();
// → '[ \\.Ladeilmopr-u]'

Support

Regenerate supports at least Chrome 27+, Firefox 3+, Safari 4+, Opera 10+, IE 6+, Node.js v0.10.0+, io.js v1.0.0+, Narwhal 0.3.2+, RingoJS 0.8+, PhantomJS 1.9.0+, and Rhino 1.7RC4+.

Unit tests & code coverage

After cloning this repository, run npm install to install the dependencies needed for Regenerate development and testing. You may want to install Istanbul globally using npm install istanbul -g.

Once that’s done, you can run the unit tests in Node using npm test or node tests/tests.js. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use grunt test.

To generate the code coverage report, use grunt cover.

Author

twitter/mathias
Mathias Bynens

License

Regenerate is available under the MIT license.

More Repositories

1

dotfiles

🔧 .files, including ~/.macos — sensible hacker defaults for macOS
Shell
29,301
star
2

jquery-placeholder

A jQuery plugin that enables HTML5 placeholder behavior for browsers that aren’t trying hard enough yet
JavaScript
3,983
star
3

he

A robust HTML entity encoder/decoder written in JavaScript.
JavaScript
3,289
star
4

evil.sh

🙊 Subtle and not-so-subtle shell tweaks that will slowly drive people insane.
Shell
2,159
star
5

small

Smallest possible syntactically valid files of different types
HTML
1,900
star
6

emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
JavaScript
1,641
star
7

punycode.js

A robust Punycode converter that fully complies to RFC 3492 and RFC 5891.
JavaScript
1,479
star
8

mothereff.in

Web developer tools
JavaScript
1,024
star
9

esrever

A Unicode-aware string reverser written in JavaScript.
JavaScript
878
star
10

jsesc

Given some data, jsesc returns the shortest possible stringified & ASCII-safe representation of that data.
JavaScript
683
star
11

utf8.js

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.
JavaScript
539
star
12

base64

A robust base64 encoder/decoder that is fully compatible with `atob()` and btoa()`, written in JavaScript.
JavaScript
491
star
13

CSS.escape

A robust polyfill for the CSS.escape utility method as defined in CSSOM.
JavaScript
486
star
14

jsperf.com

jsPerf.com source code
JavaScript
473
star
15

php-url-shortener

Simple PHP URL shortener, as used on mths.be
PHP
334
star
16

regexpu

A source code transpiler that enables the use of ES2015 Unicode regular expressions in ES5.
JavaScript
226
star
17

tpyo

A small script that enables you to make typos in JavaScript property names. Powered by ES2015 proxies + Levenshtein string distance.
JavaScript
205
star
18

luamin

A Lua minifier written in JavaScript
JavaScript
188
star
19

cssesc

A JavaScript library for escaping CSS strings and identifiers while generating the shortest possible ASCII-only output.
HTML
150
star
20

String.prototype.startsWith

A robust & optimized ES3-compatible polyfill for the `String.prototype.startsWith` method in ECMAScript 6.
JavaScript
143
star
21

grunt-template

This Grunt plugin interpolates template files with any data you provide and saves the result to another file.
JavaScript
136
star
22

document.scrollingElement

A polyfill for document.scrollingElement as defined in the CSSOM specification.
JavaScript
131
star
23

jquery-visibility

Page Visibility shim for jQuery
JavaScript
129
star
24

jquery-details

World’s first <details>/<summary> polyfill™
HTML
121
star
25

rel-noopener

Quick demonstration of why `<a rel=noopener>` is needed.
HTML
114
star
26

quoted-printable

A robust & character encoding–agnostic JavaScript implementation of the `Quoted-Printable` content transfer encoding as defined by RFC 2045.
JavaScript
88
star
27

grunt-zopfli

A Grunt plugin for compressing files using Zopfli.
JavaScript
87
star
28

emoji-test-regex-pattern

A regular expression pattern for Java/JavaScript to match all emoji in the emoji-test.txt file provided by UTS#51.
JavaScript
79
star
29

jquery-smooth-scrolling

Smooth anchor scrolling plugin for jQuery.
JavaScript
73
star
30

String.prototype.includes

A robust & optimized ES3-compatible polyfill for the `String.prototype.contains` method in ECMAScript 6.
JavaScript
69
star
31

Array.from

A robust & optimized ES3-compatible polyfill for the `Array.from` method in ECMAScript 6.
JavaScript
66
star
32

regexpu-core

regexpu’s core functionality, i.e. `rewritePattern(pattern, flag, options)`, which enables rewriting regular expressions that make use of the ES6 `u` flag into equivalent ES5-compatible regular expression patterns.
JavaScript
63
star
33

jquery-slideshow

The simplest jQuery slideshow plugin. Evar.
JavaScript
61
star
34

String.fromCodePoint

A robust & optimized `String.fromCodePoint` polyfill, based on the ECMAScript 6 specification.
JavaScript
61
star
35

hashtag-regex

A regular expression to match hashtag identifiers as per the Unicode Standard.
JavaScript
60
star
36

custom.keylayout

Custom QWERTY/AZERTY .keylayout files for use with Apple keyboards
59
star
37

unicode-data

Python scripts that generate JavaScript-compatible Unicode data
JavaScript
59
star
38

grunt-yui-compressor

A Grunt plugin for compressing JavaScript and CSS files using YUI Compressor.
JavaScript
59
star
39

String.prototype.at

A robust & optimized ES3-compatible polyfill for the `String.prototype.at` proposal for ECMAScript 6/7.
JavaScript
55
star
40

String.prototype.codePointAt

A robust & optimized `String.prototype.codePointAt` polyfill, based on the ECMAScript 6 specification.
JavaScript
55
star
41

covid-19-vaccinations-germany

Historical data on COVID-19 vaccination doses administered in Germany, per state.
HTML
54
star
42

windows-1252

A robust JavaScript implementation of the windows-1252 character encoding as defined by the Encoding Standard.
JavaScript
44
star
43

flag-emoji-replacements

'🇩🇰🇲🇬'.replace('🇰🇲', '🇪🇨'); // → '🇩🇪🇨🇬'
JavaScript
38
star
44

unicode-tr51

Emoji data extracted from Unicode Technical Report #51.
JavaScript
38
star
45

String.prototype.endsWith

A robust & optimized ES3-compatible polyfill for the `String.prototype.endsWith` method in ECMAScript 6.
JavaScript
35
star
46

wtf-8

A well-tested WTF-8 encoder/decoder written in JavaScript.
JavaScript
34
star
47

caniunicode

Unicode version support across JavaScript features & engines
JavaScript
32
star
48

math-tex

A web component for mathematical typesetting using TeX notation.
HTML
27
star
49

jquery-noselect

A jQuery plugin which disables text selection on any element. Useful for UI elements; evil for pretty much everything else.
JavaScript
27
star
50

String.prototype.repeat

A robust & optimized ES3-compatible polyfill for the `String.prototype.repeat` method in ECMAScript 6.
JavaScript
27
star
51

windows-1251

A robust JavaScript implementation of the windows-1251 character encoding as defined by the Encoding Standard.
JavaScript
26
star
52

kali-linux-docker

Kali Linux Docker
Shell
26
star
53

bacon-cipher

A robust JavaScript implementation of Bacon’s cipher, a.k.a. the Baconian cipher.
JavaScript
24
star
54

jquery-custom-data-attributes

An easy setter/getter for HTML5 data-* attributes
JavaScript
21
star
55

rgi-emoji-regex-pattern

A JavaScript-compatible regular expression pattern to match all RGI emoji symbols and sequences as per the Unicode Standard and UTS#51.
JavaScript
21
star
56

q-encoding

A robust & character encoding–agnostic JavaScript implementation of the `Q` encoding as defined by RFC 2047.
JavaScript
20
star
57

babel-plugin-transform-unicode-property-regex

Compile Unicode property escapes in Unicode regular expressions to ES5 or ES6 that works in today’s environments.
JavaScript
19
star
58

regenerate-unicode-properties

A collection of Regenerate sets for Unicode various properties.
JavaScript
17
star
59

rot

Perform simple rotational letter substitution (such as ROT-13) in JavaScript.
JavaScript
17
star
60

regex-trie-cli

Create regular expression patterns based on a list of strings to be matched.
JavaScript
17
star
61

strip-combining-marks

Easily remove Unicode combining marks from strings.
JavaScript
16
star
62

Array.of

A robust & optimized ES3-compatible polyfill for the `Array.of` method in ECMAScript 6.
JavaScript
15
star
63

jquery-oninput

My `oninput` polyfill as a jQuery plugin
JavaScript
14
star
64

homebrew-ecmascript

Homebrew formulae for ECMAScript engines
Ruby
13
star
65

tibia.com-extension

User script that enhances the character info pages on Tibia.com.
HTML
13
star
66

is-ascii-safe

is-ascii-safe determines whether a given string is ASCII-safe, i.e. if it consists of ASCII characters (U+0000 to U+007F) only.
JavaScript
13
star
67

es-regexp-unicode-character-class-escapes

Proposal to improve the character class escape tokens `\d`, `\D`, `\w`, `\W`, and the word boundary assertions `\b` and `\B` in ES6 Unicode regular expressions (with the `u` flag).
12
star
68

unicode-canonical-property-names-ecmascript

The set of canonical Unicode property names supported in ECMAScript RegExp property escapes.
JavaScript
11
star
69

node-unshorten

URL unshortener for Node.js
JavaScript
11
star
70

RegExp.prototype.match

A robust & optimized ES3-compatible polyfill for the `RegExp.prototype.match` method in ECMAScript 6.
JavaScript
10
star
71

is-potential-custom-element-name

Check whether a given string matches the `PotentialCustomElementName` production as defined in the HTML Standard.
JavaScript
10
star
72

atom-blackboard

TextMate’s Blackboard theme, ported to Atom.
CSS
10
star
73

unicode-emoji-modifier-base

The set of Unicode symbols that can serve as a base for emoji modifiers, i.e. those with the `Emoji_Modifier_Base` property set to `Yes`.
JavaScript
9
star
74

strip-variation-selectors

Remove Unicode variation selectors from strings.
JavaScript
9
star
75

nginx-zopfli-test

This repository contains some files that make it easy to test whether Nginx is correctly serving Zopfli-pre-compressed files.
JavaScript
9
star
76

unicode-property-escapes-tests

Tests for RegExp Unicode property escapes
JavaScript
8
star
77

unicode-match-property-value-ecmascript

Match a Unicode property or property alias to its canonical property name per the algorithm used for RegExp Unicode property escapes in ECMAScript.
JavaScript
8
star
78

grunt-esmangle

A Grunt plugin for mangling or minifying JavaScript files using Esmangle.
JavaScript
8
star
79

unicode-property-value-aliases

Unicode property value alias mappings in JavaScript format.
JavaScript
7
star
80

css-dbg-stories

HTML
7
star
81

unicode-property-aliases-ecmascript

Unicode property alias mappings in JavaScript format for property names that are supported in ECMAScript RegExp property escapes.
JavaScript
7
star
82

unicode-property-aliases

Unicode property alias mappings in JavaScript format.
JavaScript
7
star
83

unicode-match-property-ecmascript

Match a given Unicode property or property alias to its canonical property name per the algorithm used for RegExp Unicode property escapes in ECMAScript.
JavaScript
7
star
84

iso-8859-2

A robust JavaScript implementation of the iso-8859-2 character encoding as defined by the Encoding Standard.
JavaScript
6
star
85

string-prototype-replace-regexp-benchmark

Generated JavaScript benchmarks for String.prototype.{replace,replaceAll} with global regular expressions based on emoji-test-regex-pattern.
JavaScript
6
star
86

idn-allowed-code-points-regex

A regular expression that matches any of the code points that Verisign allows by default in IDN.
JavaScript
6
star
87

pogotransfercalc

Easily calculate how many Pokémon you should transfer before kicking off an evolution spree in Pokémon GO.
Python
6
star
88

macintosh

A robust JavaScript implementation of the macintosh character encoding as defined by the Encoding Standard.
JavaScript
6
star
89

windows-874

A robust JavaScript implementation of the windows-874 character encoding as defined by the Encoding Standard.
JavaScript
5
star
90

swapcase

A letter case swapper with full Unicode support, i.e. based on the official Unicode case folding mappings.
JavaScript
5
star
91

RegExp.prototype.search

A robust & optimized ES3-compatible polyfill for the `RegExp.prototype.search` method in ECMAScript 6.
JavaScript
5
star
92

netlify-test

HTML
4
star
93

pogocpm2level

Easily calculate the level of a given Pokémon in Pokémon GO based on its total CP multiplier value.
Python
4
star
94

tibia-bosses

JavaScript
4
star
95

covid-19-vaccinations-munich

Archive of historical coronavirus data for Munich, Germany
HTML
4
star
96

windows-1250

A robust JavaScript implementation of the windows-1250 character encoding as defined by the Encoding Standard.
JavaScript
4
star
97

stack-exchange-logos

Stack Exchange logos in SVG format.
HTML
4
star
98

gulp-regexpu

Gulp plugin to transpile ES6 Unicode regular expressions to ES5 with regexpu.
JavaScript
4
star
99

is-ascii-safe-cli

is-ascii-safe-cli checks whether a given file (or list of files) is ASCII-safe, i.e. consisting of ASCII characters (U+0000 to U+007F) only.
JavaScript
4
star
100

windows-1257

A robust JavaScript implementation of the windows-1257 character encoding as defined by the Encoding Standard.
JavaScript
4
star