  • Stars: 407
  • Rank 102,510 (Top 3 %)
  • Language: JavaScript
  • License: MIT License
  • Created about 10 years ago
  • Updated 8 months ago

Repository Details

Tiny JavaScript tokenizer.

js-tokens

The tiny, regex powered, lenient, almost spec-compliant JavaScript tokenizer that never fails.

const jsTokens = require("js-tokens");

const jsString = 'JSON.stringify({k:3.14**2}, null /*replacer*/, "\\t")';

Array.from(jsTokens(jsString), (token) => token.value).join("|");
// JSON|.|stringify|(|{|k|:|3.14|**|2|}|,| |null| |/*replacer*/|,| |"\t"|)

Installation

npm install js-tokens

import jsTokens from "js-tokens";
// or:
const jsTokens = require("js-tokens");

Usage

jsTokens(string, options?)

Option  Type     Default  Description
jsx     boolean  false    Enable JSX support.

This package exports a generator function, jsTokens, that turns a string of JavaScript code into token objects.

For the empty string, the function yields nothing (which can be turned into an empty list). For any other input, the function always yields something, even for invalid JavaScript, and never throws. Concatenating the token values reproduces the input.

The package is very close to being fully spec compliant (it passes all but 3 of test262-parser-tests), but has taken a couple of shortcuts. See the following sections for limitations of some tokens.

// Loop over tokens:
for (const token of jsTokens("hello, !world")) {
  console.log(token);
}

// Get all tokens as an array:
const tokens = Array.from(jsTokens("hello, !world"));
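
Because concatenating the token values reproduces the input, tokens can be inspected or filtered and then joined back into source text. The following is a minimal sketch of that idea. The helper names joinTokens and countByType are hypothetical, and the tokens array is written by hand in the documented token shape (standing in for Array.from(jsTokens("hello, !world"))) so that the snippet runs without installing anything.

```javascript
// Hand-written tokens in the documented shape, standing in for
// Array.from(jsTokens("hello, !world")).
const tokens = [
  { type: "IdentifierName", value: "hello" },
  { type: "Punctuator", value: "," },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "!" },
  { type: "IdentifierName", value: "world" },
];

// Hypothetical helper: rebuild source text from any iterable of tokens.
function joinTokens(tokenIterable) {
  let source = "";
  for (const token of tokenIterable) {
    source += token.value;
  }
  return source;
}

// Hypothetical helper: count how many tokens of each type were seen.
function countByType(tokenIterable) {
  const counts = {};
  for (const token of tokenIterable) {
    counts[token.type] = (counts[token.type] || 0) + 1;
  }
  return counts;
}

console.log(joinTokens(tokens)); // hello, !world
console.log(countByType(tokens));
```

Both helpers accept any iterable, so the generator returned by jsTokens could be passed in directly instead of an array.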

Tokens

Spec: ECMAScript Language: Lexical Grammar + Additional Syntax

export default function jsTokens(input: string): Iterable<Token>;

type Token =
  | { type: "StringLiteral"; value: string; closed: boolean }
  | { type: "NoSubstitutionTemplate"; value: string; closed: boolean }
  | { type: "TemplateHead"; value: string }
  | { type: "TemplateMiddle"; value: string }
  | { type: "TemplateTail"; value: string; closed: boolean }
  | { type: "RegularExpressionLiteral"; value: string; closed: boolean }
  | { type: "MultiLineComment"; value: string; closed: boolean }
  | { type: "SingleLineComment"; value: string }
  | { type: "IdentifierName"; value: string }
  | { type: "PrivateIdentifier"; value: string }
  | { type: "NumericLiteral"; value: string }
  | { type: "Punctuator"; value: string }
  | { type: "WhiteSpace"; value: string }
  | { type: "LineTerminatorSequence"; value: string }
  | { type: "Invalid"; value: string };

StringLiteral

Spec: StringLiteral

If the ending " or ' is missing, the token has closed: false. JavaScript strings cannot contain (unescaped) newlines, so unclosed strings simply end at the end of the line.

Escape sequences are supported, but may be invalid. For example, "\u" is matched as a StringLiteral even though it contains an invalid escape.

Examples:

"string"
'string'
""
''
"\""
'\''
"valid: \u00a0, invalid: \u"
'valid: \u00a0, invalid: \u'
"multi-\
line"
'multi-\
line'
" unclosed
' unclosed
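
Tokens carry no position information, but since the values concatenate back to the input, offsets can be recovered by summing value lengths. The sketch below uses that to report tokens with closed: false. The helper name findUnclosed is hypothetical, and the tokens are written by hand for the input x = " unclosed rather than produced by the library.

```javascript
// Hypothetical helper: list tokens that have closed: false, together with
// the offset (in UTF-16 code units) where each one starts.
function findUnclosed(tokenIterable) {
  const unclosed = [];
  let offset = 0;
  for (const token of tokenIterable) {
    // Token types without a "closed" property are skipped, since
    // undefined === false is false.
    if (token.closed === false) {
      unclosed.push({ type: token.type, offset, value: token.value });
    }
    offset += token.value.length;
  }
  return unclosed;
}

// Hand-written tokens for the input: x = " unclosed
const tokens = [
  { type: "IdentifierName", value: "x" },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "=" },
  { type: "WhiteSpace", value: " " },
  { type: "StringLiteral", value: '" unclosed', closed: false },
];

console.log(findUnclosed(tokens));
```

The same check works for every token type that has a closed property, not only StringLiteral.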

NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

Spec: NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

A template without interpolations is matched as is. For example:

  • `abc`: NoSubstitutionTemplate
  • `abc: NoSubstitutionTemplate with closed: false

A template with interpolations is matched as many tokens. For example, `head${1}middle${2}tail` is matched as follows (apart from the two NumericLiterals):

  • `head${: TemplateHead
  • }middle${: TemplateMiddle
  • }tail`: TemplateTail

TemplateMiddle is optional, and TemplateTail can be unclosed. For example, `head${1}tail (note the missing ending `):

  • `head${: TemplateHead
  • }tail: TemplateTail with closed: false

Templates can contain unescaped newlines, so unclosed templates go on to the end of input.

Just like for StringLiteral, templates can also contain invalid escapes. `\u` is matched as a NoSubstitutionTemplate even though it contains an invalid escape. Also note that in tagged templates, invalid escapes are not syntax errors: x`\u` is syntactically valid JavaScript.
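
As an illustration of how the four template token types fit together, the sketch below strips the delimiters from each template token and returns the raw text in between. The helper name templateRawText is hypothetical, the tokens are written by hand for `head${1}middle${2}tail`, and escapes are left untouched (no cooking is attempted).

```javascript
// Hypothetical helper: return the raw text of a template token without its
// delimiters, or undefined for tokens that are not part of a template.
function templateRawText(token) {
  switch (token.type) {
    case "NoSubstitutionTemplate":
      // Starts with ` and ends with ` only when closed.
      return token.value.slice(1, token.closed ? -1 : undefined);
    case "TemplateHead":
      // Starts with ` and ends with ${
      return token.value.slice(1, -2);
    case "TemplateMiddle":
      // Starts with } and ends with ${
      return token.value.slice(1, -2);
    case "TemplateTail":
      // Starts with } and ends with ` only when closed.
      return token.value.slice(1, token.closed ? -1 : undefined);
    default:
      return undefined;
  }
}

// Hand-written tokens for: `head${1}middle${2}tail`
const tokens = [
  { type: "TemplateHead", value: "`head${" },
  { type: "NumericLiteral", value: "1" },
  { type: "TemplateMiddle", value: "}middle${" },
  { type: "NumericLiteral", value: "2" },
  { type: "TemplateTail", value: "}tail`", closed: true },
];

const chunks = tokens
  .map(templateRawText)
  .filter((text) => text !== undefined);

console.log(chunks); // [ 'head', 'middle', 'tail' ]
```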

RegularExpressionLiteral

Spec: RegularExpressionLiteral

Regex literals may contain invalid regex syntax. They are still matched as regex literals.

If the ending / is missing, the token has closed: false. JavaScript regex literals cannot contain newlines (not even escaped ones), so unclosed regex literals simply end at the end of the line.

According to the specification, the flags of regular expressions are IdentifierParts (unknown and repeated regex flags become errors at a later stage).

Differentiating between regex and division in JavaScript is really tricky. js-tokens looks at the previous token to tell them apart. As long as the previous tokens are valid, it should do the right thing. For invalid code, js-tokens might be confused and start matching division as regex or vice versa.

Examples:

/a/
/a/gimsuy
/a/Inva1id
/+/
/[/]\//
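
To make the "look at the previous token" idea concrete, here is a deliberately naive sketch. It is not js-tokens' actual rule set: the function name slashStartsRegex and both word lists are assumptions chosen for illustration, and the sketch always treats ) and } as ending an expression, which is exactly where real code gets ambiguous.

```javascript
// Assumed list of keywords after which an expression (and so a regex) may start.
const KEYWORDS_BEFORE_EXPRESSION = new Set([
  "return", "typeof", "instanceof", "in", "of", "new", "delete", "void",
  "throw", "case", "do", "else", "yield", "await",
]);

// Assumed list of punctuators that end an expression, so that a following
// slash is division. Treating ++ and -- this way assumes postfix usage.
const EXPRESSION_ENDERS = new Set([")", "]", "}", "++", "--"]);

// prev is the previous token that is not whitespace, a line terminator or a
// comment, or undefined at the start of the input.
function slashStartsRegex(prev) {
  if (prev === undefined) return true;
  switch (prev.type) {
    case "Punctuator":
      return !EXPRESSION_ENDERS.has(prev.value);
    case "IdentifierName":
      return KEYWORDS_BEFORE_EXPRESSION.has(prev.value);
    case "TemplateHead":
    case "TemplateMiddle":
      // These end with ${, which opens a new expression.
      return true;
    default:
      // Numbers, strings, closed templates and regexes end an expression.
      return false;
  }
}

console.log(slashStartsRegex({ type: "Punctuator", value: "=" })); // true
console.log(slashStartsRegex({ type: "NumericLiteral", value: "1" })); // false
```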

MultiLineComment

Spec: MultiLineComment

If the ending */ is missing, the token has closed: false. Unclosed multi-line comments go on to the end of the input.

Examples:

/* comment */
/* console.log(
    "commented", out + code);
    */
/**/
/* unclosed

SingleLineComment

Spec: SingleLineComment

Examples:

// comment
// console.log("commented", out + code);
//

IdentifierName

Spec: IdentifierName

Keywords, reserved words, null, true, false, variable names and property names.

Examples:

if
for
var
instanceof
package
null
true
false
Infinity
undefined
NaN
$variab1e_name
π
℮
ಠ_ಠ
\u006C\u006F\u006C\u0077\u0061\u0074
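
Since keywords and ordinary names share the IdentifierName type, a consumer that needs to tell them apart (a syntax highlighter, for example) has to check the value itself. A minimal sketch, assuming a hypothetical helper named isReservedWord and using the ReservedWord list from the ECMAScript specification:

```javascript
// The ReservedWord production from the ECMAScript specification.
// Words that are only reserved in strict mode or in certain contexts
// (let, static, package, async, ...) are intentionally left out.
const RESERVED_WORDS = new Set([
  "await", "break", "case", "catch", "class", "const", "continue",
  "debugger", "default", "delete", "do", "else", "enum", "export",
  "extends", "false", "finally", "for", "function", "if", "import", "in",
  "instanceof", "new", "null", "return", "super", "switch", "this", "throw",
  "true", "try", "typeof", "var", "void", "while", "with", "yield",
]);

// Hypothetical helper: true for IdentifierName tokens whose value is a
// reserved word. The raw value is compared, so names written with unicode
// escapes are not recognized.
function isReservedWord(token) {
  return token.type === "IdentifierName" && RESERVED_WORDS.has(token.value);
}

console.log(isReservedWord({ type: "IdentifierName", value: "if" })); // true
console.log(isReservedWord({ type: "IdentifierName", value: "Infinity" })); // false
```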

PrivateIdentifier

Spec: PrivateIdentifier

Any IdentifierName preceded by a #.

Examples:

#if
#for
#var
#instanceof
#package
#null
#true
#false
#Infinity
#undefined
#NaN
#$variab1e_name
#π
#℮
#ಠ_ಠ
#\u006C\u006F\u006C\u0077\u0061\u0074

NumericLiteral

Spec: NumericLiteral

Examples:

0
1.5
1
1_000
12e9
0.123e-32
0xDead_beef
0b110
12n
07
09.5
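
NumericLiteral values are the raw source text, so separators, BigInt suffixes and legacy octal literals need some care when converting to a number. The following is one possible sketch. The helper name numericValue is hypothetical, and because the tokenizer is lenient, malformed input can still produce NaN or make BigInt throw.

```javascript
// Hypothetical helper: convert the raw value of a NumericLiteral token
// into a number or a bigint.
function numericValue(raw) {
  if (raw.endsWith("n")) {
    // BigInt literal: drop the suffix and any numeric separators.
    return BigInt(raw.slice(0, -1).replace(/_/g, ""));
  }
  if (/^0[0-7]+$/.test(raw)) {
    // Legacy octal such as 07 or 010. Number("010") would give 10.
    return parseInt(raw, 8);
  }
  // Decimal, hex (0x), octal (0o), binary (0b) and octal-like decimals
  // such as 09.5 are all understood by Number once separators are removed.
  return Number(raw.replace(/_/g, ""));
}

console.log(numericValue("1_000")); // 1000
console.log(numericValue("0b110")); // 6
console.log(numericValue("010")); // 8
console.log(numericValue("12n")); // 12n
```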

Punctuator

Spec: Punctuator + DivPunctuator + RightBracePunctuator

All possible values:

&&  ||  ??
--  ++
.   ?.
<   <=   >   >=
!=  !==  ==  ===
   +   -   %   &   |   ^   /   *   **   <<   >>   >>>
=  +=  -=  %=  &=  |=  ^=  /=  *=  **=  <<=  >>=  >>>=
(  )  [  ]  {  }
!  ?  :  ;  ,  ~  ...  =>

WhiteSpace

Spec: WhiteSpace

Unlike the specification, multiple whitespace characters in a row are matched as one token, not one token per character.

LineTerminatorSequence

Spec: LineTerminatorSequence

CR, LF and CRLF, plus \u2028 and \u2029.

Invalid

Spec: n/a

Single code points not matched in another token.

Examples:

#
@
💩

JSX Tokens

Spec: JSX Specification

export default function jsTokens(
  input: string,
  options: { jsx: true }
): Iterable<Token | JSXToken>;

export declare type JSXToken =
  | { type: "JSXString"; value: string; closed: boolean }
  | { type: "JSXText"; value: string }
  | { type: "JSXIdentifier"; value: string }
  | { type: "JSXPunctuator"; value: string }
  | { type: "JSXInvalid"; value: string };

  • The tokenizer switches between outputting runs of Token and runs of JSXToken.
  • Runs of JSXToken can also contain WhiteSpace, LineTerminatorSequence, MultiLineComment and SingleLineComment.
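
As a small illustration of working with JSX tokens, the sketch below collects the text content and the opening tag names from a token list. The helper names jsxText and openingTagNames are hypothetical, and the tokens are written by hand to match the documented shapes for <p>Hello, world!</p> rather than produced by the library.

```javascript
// Hand-written tokens for: <p>Hello, world!</p>
const tokens = [
  { type: "JSXPunctuator", value: "<" },
  { type: "JSXIdentifier", value: "p" },
  { type: "JSXPunctuator", value: ">" },
  { type: "JSXText", value: "Hello, world!" },
  { type: "JSXPunctuator", value: "<" },
  { type: "JSXPunctuator", value: "/" },
  { type: "JSXIdentifier", value: "p" },
  { type: "JSXPunctuator", value: ">" },
];

// Hypothetical helper: concatenate all JSXText values.
function jsxText(tokenIterable) {
  let text = "";
  for (const token of tokenIterable) {
    if (token.type === "JSXText") text += token.value;
  }
  return text;
}

// Hypothetical helper: names that directly follow a "<" punctuator.
// Whitespace or comments between "<" and the name are not handled.
function openingTagNames(tokenIterable) {
  const names = [];
  let prev;
  for (const token of tokenIterable) {
    if (
      token.type === "JSXIdentifier" &&
      prev !== undefined &&
      prev.type === "JSXPunctuator" &&
      prev.value === "<"
    ) {
      names.push(token.value);
    }
    prev = token;
  }
  return names;
}

console.log(jsxText(tokens)); // Hello, world!
console.log(openingTagNames(tokens)); // [ 'p' ]
```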

JSXString

Spec: " JSXDoubleStringCharacters " + ' JSXSingleStringCharacters '

If the ending " or ' is missing, the token has closed: false. JSX strings can contain unescaped newlines, so unclosed JSX strings go on to the end of input.

Note that JSX doesn’t support escape sequences as part of the token grammar. A " or ' always closes the string, even with a backslash before it.

Examples:

"string"
'string'
""
''
"\"
'\'
"multi-
line"
'multi-
line'
" unclosed
' unclosed

JSXText

Spec: JSXText

Anything but <, >, { and }.

JSXIdentifier

Spec: JSXIdentifier

Examples:

div
class
xml
x-element
x------
$htm1_element
ಠ_ಠ

JSXPunctuator

Spec: n/a

All possible values:

<
>
/
.
:
=
{
}

JSXInvalid

Spec: n/a

Single code points not matched in another token.

Examples in JSX tags:

1
`
+
,
#
@
💩

All possible values in JSX children:

>
}

Compatibility

ECMAScript

The intention is to always support the latest ECMAScript version whose feature set has been finalized.

Currently, ECMAScript 2022 is supported.

Annex B

Annex B: Additional ECMAScript Features for Web Browsers of the spec is optional if the ECMAScript host is not a web browser, and specifies some additional syntax.

  • Numeric literals: js-tokens supports legacy octal and octal-like numeric literals. It was easy enough, so why not.
  • String literals: js-tokens supports legacy octal escapes, since it allows any invalid escapes.
  • HTML-like comments: Not supported. js-tokens prefers treating 5<!--x as 5 < !(--x) rather than as 5 //x.
  • Regular expression patterns: js-tokens doesn’t care what’s between the starting / and ending /, so this is supported.

TypeScript

Supporting TypeScript is not an explicit goal, but js-tokens and Babel both tokenize this TypeScript fixture and this TSX fixture the same way, with one edge case:

type A = Array<Array<string>>
type B = Array<Array<Array<string>>>

Both lines above should end with a couple of > tokens, but js-tokens instead matches the >> and >>> operators.
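
A consumer that already knows it is inside a TypeScript type could work around this with a post-processing step. The sketch below is one way to do that. The helper name splitClosingAngles is hypothetical, the tokens are written by hand for Array<Array<string>>, and deciding where a type context begins and ends is left to the caller.

```javascript
// Hypothetical post-processing step: split >> and >>> punctuators into
// single > tokens. Every other token is returned unchanged.
function splitClosingAngles(token) {
  if (
    token.type === "Punctuator" &&
    (token.value === ">>" || token.value === ">>>")
  ) {
    // One new token per ">" character in the original value.
    return Array.from(token.value, () => ({ type: "Punctuator", value: ">" }));
  }
  return [token];
}

// Hand-written tokens for: Array<Array<string>>
const tokens = [
  { type: "IdentifierName", value: "Array" },
  { type: "Punctuator", value: "<" },
  { type: "IdentifierName", value: "Array" },
  { type: "Punctuator", value: "<" },
  { type: "IdentifierName", value: "string" },
  { type: "Punctuator", value: ">>" },
];

const fixed = tokens.flatMap(splitClosingAngles);
console.log(fixed.map((token) => token.value).join("|"));
// Array|<|Array|<|string|>|>
```

Splitting does not change the concatenated source text, so the round-trip property is preserved.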

JSX

JSX is supported: jsTokens("<p>Hello, world!</p>", { jsx: true }).

JavaScript runtimes

js-tokens should work in any JavaScript runtime that supports Unicode property escapes.

Known errors

Here are a couple of tricky cases:

// Case 1:
switch (x) {
  case x: {}/a/g;
  case x: {}<div>x</div>/g;
}

// Case 2:
label: {}/a/g;
label: {}<div>x</div>/g;

// Case 3:
(function f() {}/a/g);
(function f() {}<div>x</div>/g);

This is what they mean:

// Case 1:
switch (x) {
  case x:
    {
    }
    /a/g;
  case x:
    {
    }
    <div>x</div> / g;
}

// Case 2:
label: {
}
/a/g;
label: {
}
<div>x</div> / g;

// Case 3:
(function f() {}) / a / g;
(function f() {}) < div > x < /div>/g;

But js-tokens thinks they mean:

// Case 1:
switch (x) {
  case x:
    ({}) / a / g;
  case x:
    ({}) < div > x < /div>/g;
}

// Case 2:
label: ({}) / a / g;
label: ({}) < div > x < /div>/g;

// Case 3:
function f() {}
/a/g;
function f() {}
<div>x</div> / g;

In other words, js-tokens:

  • Mis-identifies regex as division and JSX as comparison in cases 1 and 2.
  • Mis-identifies division as regex and comparison as JSX in case 3.

This happens because js-tokens looks at the previous token when deciding between regex and division or JSX and comparison. In these cases, the previous token is }, which either means “end of block” (→ regex/JSX) or “end of object literal” (→ division/comparison). How does js-tokens determine if the } belongs to a block or an object literal? By looking at the token before the matching {.

In cases 1 and 2, that’s a :. A : usually means that we have an object literal or ternary:

let some = weird ? { value: {}/a/g } : {}/a/g;

But : is also used for case and labeled statements.

One idea is to look for case before the : as an exception to the rule, but it’s not so easy:

switch (x) {
  case weird ? true : {}/a/g: {}/a/g
}

The first {}/a/g is a division, while the second {}/a/g is an empty block followed by a regex. Both are preceded by a colon with a case on the same line, and it does not seem like you can distinguish between the two without implementing a parser.

Labeled statements are similarly difficult, since they are so similar to object literals:

{
  label: {}/a/g
}

({
  key: {}/a/g
})

Finally, case 3 ((function f() {}/a/g);) is also difficult, because a ) before a { means that the { is part of a block, and blocks are usually statements:

if (x) {
}
/a/g;

function f() {}
/a/g;

But function expressions are of course not statements. It’s difficult to tell a function expression from a function statement without parsing.

Luckily, none of these edge cases are likely to occur in real code.

Performance

Benchmarked with @babel/parser for comparison, using Node.js 18.13.0 on a MacBook Pro M1 (Ventura).

Lines of code  Size      js-tokens  @babel/parser
~100           ~4.1 KiB  ~2 ms      ~10 ms
~1 000         ~39 KiB   ~5 ms      ~29 ms
~10 000        ~353 KiB  ~37 ms     ~119 ms
~100 000       ~5.1 MiB  ~317 ms    ~2.2 s
~2 400 000     ~138 MiB  ~8 s       ~8 m 32 s (*)

(*) Required increasing the Node.js memory limit (I set it to 8 GiB).

See benchmark.js if you want to run benchmarks yourself.
