• Stars
    star
    961
  • Rank 47,587 (Top 1.0 %)
  • Language
  • Created over 7 years ago
  • Updated almost 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Binary AST proposal for ECMAScript

Binary AST Proposal Overview

Shu-yu Guo (Bloomberg)

David Teller, Kannan Vijayan (Mozilla)

Vladan Djeric (Facebook)

Ingvar Stepanyan (Cloudflare)

This is the explainer document for a proposed new binary AST format for JS.

Motivation

Performance of applications on the web platform is becoming increasingly bottlenecked by startup (load) time. Larger amounts of JS code are transferred over the wire by more sophisticated web properties. While caching helps, these properties regularly release new code, and cold load times are very important.

A brief survey from July 2017 of the uncompressed JS payload sizes for some popular web applications on desktop and mobile:

Web App (Desktop) Uncompressed JS Size
LinkedIn 7.2 MB
Facebook 7.1 MB
Google Sheets 5.8 MB
Gmail 3.9 MB
Yahoo 3.4 MB
Web App (Mobile) Uncompressed JS Size
LinkedIn 6.2 MB
YouTube 1.9 MB
Twitter 1.8 MB
Facebook 1.8 MB
Reddit 1.3 MB

Startup performance degrades with larger JS payloads, even if only a fraction of the code is actually executed. Parsing time is a significant component, taking more CPU time than bytecode / initial JIT code generation. For example, on a powerful laptop, Chrome spends 10% to 15% of CPU time parsing JS while loading facebook.com.

We propose a new over-the-wire format for JS that is a binary encoding of an AST. We believe this new format would allow for drastically faster parsing. Moreover, web developers are well positioned to adopt a new format as they have embraced build tooling.

We have also implemented a prototype encoder and decoder that demonstrates encouraging performance improvements without bloating file size.

Design Philosophy

The overriding design philosophy is to be conservative, so as to be realistic both in implementation and committee. This proposal is intended to be simply an alternative encoding of the surface syntax with the smallest possible delta to enable high performance parsing. By design, this proposal does not attempt any semantics-level encoding (e.g., bytecode, encoding variables instead of identifiers).

That said, this proposal is highly ambitious.

Current Parsing Bottlenecks

1. Information Not Available Where Needed

One fundamental issue making parsing slow is that information needed to make decisions during parsing is oftentimes not yet available at the point in the input stream at which it is needed. This happens when the parser needs information from code it either hasn't parsed yet (like variable hoisting) or code it doesn't want to parse (like inner functions).

One concrete example is the problem of efficiently representing bindings. In order to make decisions about how to represent or allocate a variable, the parser needs to know whether the variable is closed over, so it becomes necessary to parse any nested functions. The specification only states where a declaration like var x needs to be reachable, not how to allocate it -- the information needed for the allocation decision is not encoded where the variable is declared, nor at the spot where it needs to be allocated.

In the following example, due to variable hoisting, use_x closes over x, which is not known to be declared in f at the point of parsing the inner function. An engine cannot emit optimized accesses to x inside use_x until the entirety of f is parsed.

var x;
function f() {
  function use_x() {
    use(x); // Closes over hoisted var x.
  }

  var x;
}

In the following example, an engine's frontend would like to know how to efficiently allocate space for the var declarations as they are parsed. That is not possible without parsing the rest of the function.

function g() {
  var x; // Not closed over, should go on function frame.
  var y; // Closed over, should go on activation object.

  return (function() { use(y); });
}

In the following example, the presence of eval at the bottom of the function is unknown at the time when the engine needs to decide if var x may be accessed dynamically.

function h(input) {
  var x;

  (function() {
    eval(input);
  })();
}

2. Early Error Semantics

JavaScriptโ€™s early error semantics requires the entirety of every file be parsed. Engines employ a lazy parsing (also known as pre-parsing) optimization that avoids building a full AST by skipping initial code generation for inner functions until time of first invocation. This is only 50% faster on average, and parsing effort is still proportional to file size. What's worse, this optimization can backfire. If an inner function whose code generation was skipped happens to be called during application startup, then the engine must in effect re-parse the entire function. This happens often enough that developers have resorted to using immediately-invoked function expressions to altogether bypass lazy parsing.

Consider the following. Currently, since innerWithSyntaxError has an early error, outer also must throw the early error.

function outer() {
  function innerWithSyntaxError() {
    var;
  }
}

The proposed behavior change is analogous to wrapping function bodies with eval, as below.

function outer() {
  function innerWithSyntaxError() {
    eval("var");
  }
}

3. Inefficiencies in Using Characters

When parsing, there can be ambiguity at the character-level about what type of expression a JavaScript syntax is encoding. For example, to distinguish between list expressions and the parameter list for an arrow function, parsers need to do additional bookkeeping to account for all valid interpretations until the ambiguity is resolved, or they need to have the ability to backtrack while parsing.

Similarly, lexing is slow due to reasons such as Unicode encoding and having to handle Unicode escape codes. Engines need to check if some text is single-byte or double-byte encoded, for instance.

Proposal

We propose a binary encoding based on an efficient abstract syntax tree representation of JavaScript syntax. This is a new, alternate encoding of the surface syntax, with a fairly close bidirectional mapping to the text representation. We seek to be as conservative as possible in introducing new semantics, which are currently limited to:

  • deferring early errors
  • changing Function.prototype.toString behavior
  • requiring UTF-8

To further speed up parsing, we propose to encode static semantics as implementation hints, and verify them as deferred assertions.

For this AST format, we currently do not propose to automatically derive from the ECMA-262 grammar since it is too verbose to be an efficient tree representation, and as such it would require a lot of flattening and inlining. Additionally, this might restrict the evolution of the JavaScript specification by effectively making the grammar a normative specification of the binary format.

Instead, we propose a separate tree grammar, with annotations on AST nodes that only express information about the syntax. There are several existing tree grammars which we may take as a starting point, such as Babylon or Shift AST.

(Two grammars is likely to be a contentious issue; this decision is not set in stone and feedback is welcome.)

Borrowing from the WebAssembly approach, the binary encoding would be split into 3 layers:

  1. A simple binary encoding of the AST nodes using basic primitives (e.g., strings, numbers, tuples)
  2. Additional structural compression on top of the previous layer, leveraging knowledge about the nature of the format of the file and the AST (e.g., constant tables)
  3. A generic compression algorithm like gzip or Brotli.

We expect the format to be output by existing compilers such as Babel and TypeScript, and by bundlers such as WebPack.

Grammar

The tree grammar consists of the primitives (booleans, strings, numbers), lists, and disjoint unions. Disjoint unions are tagged and have a fixed, expected structure (ordered properties) for a given file. Certain nodes, such as strings and lists, are encoded together with their encoded byte length, allowing parsers to skip past the encoded representation of the node.

Language Evolution and Versioning

The grammar provides the set of all possible node kinds and their ordered properties, which we expect to be monotonically growing. However, not all vendors will support all nodes at any given time. To this end, each file contains a header containing a list of nodes and their properties that the file expects to be used. For example, suppose the grammar currently provides a FunctionExpression node that includes an isAsync property. If a file expects async functions to be supported, it would include a FunctionExpression entry that includes isAsync in the list of properties. If a node or property in the header is not supported by the implementation, an error is thrown.

The header solves the versioning problem, forwards compatibility, and backwards compatibility. It also maintains the expected behavior of JavaScript parsers with respect to new features, insofar as parsing a correctly formatted file fails if a parser encounters a feature that the engine does not implement, rather than relying upon a monolithic version number. Finally, it acts as structural compression: node kinds are to be referred to by their index in the header instead of their name.

To save file space, presets will be supported (e.g., ES2015).

Static Semantics as Annotations

To address concern 1 Information Not Available Where Needed above, the insight is that all the currently known cases of "information not available where needed" are binding-related. Scope nodes, like function or lexical scope nodes, are able to encode additional annotations about the properties of the AST.

A non-exhaustive list of annotations currently considered:

  1. VarDeclaredNames
  2. LexicallyDeclaredNames
  3. ClosedOverNames (new)
  4. AssignedToOutsideOfDeclarationNames (new)
  5. Has CallExpression where the callee expression is IdentifierReference of string literal eval (new)

These annotations, if present, behave as lazy assertions. By the time the scope node is finished being traversed (which may be arbitrarily delayed in case a skipped inner function is never invoked), if the encoded annotation is contrary to the body of the node, a runtime error is thrown. For example, if a function body is annotated with "has VarDeclaredNames x", but does not in fact contain a var declaration with name x, an error would be thrown.

In order to prevent divergence across implementations with regards to timing, such errors are required to be thrown when the nearest enclosing function is invoked or the nearest enclosing global script is evaluated.

It is important to note that these annotations must be checked and are not, strictly speaking, hints that may be ignored. They are assertions on the structure of the syntax tree, and must be verified whether or not an engine chooses to use them as parsing hints.

These annotations would be specified as existing or new static semantics.

Ultimately, this list is ad-hoc and the union of properties that various engines currently derive during parsing. With these annotations, engines may generate code for functions in a single forward pass without parsing the functions in entirety. If the needed hints are not present for an engine to do the single-pass fast path, the fallback is the existing analysis that engines already implement. Should the need for new annotations arise, we expect them to be standardized as new static semantics.

Annotations and scopes are embedded in the AST just as any other node properties and use the same versioning mechanism.

Deferred Early Errors

To address concern 2 Early Error Semantics, Early Errors, like annotation verification, will be deferred to until when the nearest enclosing function is invoked or the nearest enclosing global script is evaluated.

This is a breaking change from textual JS.

Since this format is expected to be compiler output, early errors would be caught at compile time in practice.

UTF-8 with no escape codes

To address concern 3 Inefficiencies in Using Characters, strings and identifiers will be required to be in UTF-8 without support for escape codes. Again, we expect this to be feasible as the format will be compiler output.

Verification

Verification of the header, annotations, and the tree structure itself (e.g., its length-encoded nodes, the allowed kinds for child nodes) is intended to be possible in a single forward pass as constant work per node, as the AST is being decoded.

Function.prototype.toString()

This method would return something like "[sourceless code]".

Exception Offsets

Thrown exceptions, instead of having a line and column number, will have a byte offset into the binary AST file.

Further Improvements

It is straightforward to layer additional improvements on this format in order to further improve parsing speed or binary file size. For example, the format could allow for a string table, or allow function bodies to be stored separately from function declarations, and so on.

Prototype

We implemented an early prototype in Mozillaโ€™s SpiderMonkey engine, by using a grammar based on internal AST format. This was done for speed of implementation, and our next prototype will be based on the Babylon AST.

For the facebook.com static newsfeed benchmark, the binary AST representation was slightly smaller than the original JavaScript. This held true even after both representations were passed through gzip for compression. The size reduction mainly came from the use of a string table and variable-length identifiers for entries in the table. It also used variable-length encodings for representing numbers. Additional size wins are possible by leveraging domain-specific information, such as factoring out common subtrees.

The time required to create a full AST (without verifying annotations) was reduced by ~70-90%, which is a considerable reduction since SpiderMonkey's AST construction time for the plain JavaScript was 500-800 ms for the benchmark.

FAQ

Why not ship bytecode instead?

On the Web, no vendor would agree to ship bytecode.

  1. Engines donโ€™t want to be tied to a bytecode version.
  2. Itโ€™s harder to verify bytecode, as bytecode is more expressive than syntax.
  3. It runs the risk of bifurcating the language, as bytecode may be more expressive than structured JavaScript source.
  4. Designing a new bytecode is an even more ambitious undertaking.

Why not use WebAssembly?

There are massive existing untyped JS codebases, and there is no easy way to convert an untyped, garbage collected language like JS to WebAssembly. Given that currently using WebAssembly to ship JavaScript most likely involves bundling a JS runtime as well, it is unclear such a setup would net any load-time performance improvements.

Why not a semantic graph? Or why not go further? Why not types?

We do not want to invent a new language by adding new front-end semantics, nor do we want to encode analysis result and require more sophisticated verification that this information is correct.

Won't this be a massive maintenance burden?

Parsing a binary AST format will not be as complicated as parsing existing JavaScript. It will undoubtedly increase complexity of implementation, but we believe is worth the cost given the trajectory of the complexity and size of web applications.

Won't an AST format bloat file size?

In our prototype described above, we found the AST format to be slightly smaller, even when comparing compressed sizes.

How much of a performance win is there?

For parsing desktop facebook.com, 10-15% of client-side CPU time is spent parsing JavaScript. The prototype we implemented reduced time to build the AST by 70-90%.

Additionally, many complex sites today fetch JavaScript on demand when a user interacts with some secondary functionality (e.g. when a user opens the Notifications panel on Facebook). Long parsing times on these interactive paths reduce the responsiveness of the site.

Would it be possible to write a tool to convert serialized AST back to JS?

Yes, it would be possible to automatically generate human-readable JavaScript from the AST format. By virtue of the syntax tree being abstract and not concrete, the pretty printer will not be a perfect recreation of the input JavaScript but it will be semantically equivalent.

Such serializers may be crucial to providing a compelling devtooling story.

Is this made obsolete by programmable caching APIs?

No. Programmable caching APIs drastically improve warm load times. This proposal improves cold load times. It composes nicely with caching APIs.

Will this work alongside text JS source?

Yes. HTML integration is forthcoming, but an application is free to mix loading text source and binary AST files.

Will this be a new surface for attacks?

Yes, but we believe this is a much smaller and much more controlled and testable surface than alternatives, such as bytecode.

More Repositories

1

proposals

Tracking ECMAScript Proposals
17,177
star
2

ecma262

Status, process, and documents for ECMA-262
HTML
14,437
star
3

proposal-pipeline-operator

A proposal for adding a useful pipe operator to JavaScript.
HTML
7,534
star
4

proposal-pattern-matching

Pattern matching syntax for ECMAScript
HTML
5,498
star
5

proposal-optional-chaining

HTML
4,942
star
6

proposal-type-annotations

ECMAScript proposal for type syntax that is erased - Stage 1
JavaScript
4,252
star
7

proposal-signals

A proposal to add signals to JavaScript.
3,387
star
8

proposal-temporal

Provides standard objects and functions for working with dates and times.
HTML
3,321
star
9

proposal-observable

Observables for ECMAScript
JavaScript
3,058
star
10

proposal-decorators

Decorators for ES6 classes
2,640
star
11

proposal-record-tuple

ECMAScript proposal for the Record and Tuple value types. | Stage 2: it will change!
HTML
2,496
star
12

test262

Official ECMAScript Conformance Test Suite
JavaScript
2,073
star
13

proposal-dynamic-import

import() proposal for JavaScript
HTML
1,863
star
14

proposal-bind-operator

This-Binding Syntax for ECMAScript
1,742
star
15

proposal-class-fields

Orthogonally-informed combination of public and private fields proposals
HTML
1,722
star
16

proposal-async-await

Async/await for ECMAScript
HTML
1,578
star
17

proposal-object-rest-spread

Rest/Spread Properties for ECMAScript
HTML
1,493
star
18

proposal-shadowrealm

ECMAScript Proposal, specs, and reference implementation for Realms
HTML
1,429
star
19

proposal-iterator-helpers

Methods for working with iterators in ECMAScript
HTML
1,307
star
20

proposal-nullish-coalescing

Nullish coalescing proposal x ?? y
HTML
1,232
star
21

proposal-top-level-await

top-level `await` proposal for ECMAScript (stage 4)
HTML
1,083
star
22

proposal-partial-application

Proposal to add partial application to ECMAScript
HTML
1,002
star
23

proposal-do-expressions

Proposal for `do` expressions
HTML
990
star
24

agendas

TC39 meeting agendas
JavaScript
952
star
25

proposal-built-in-modules

HTML
891
star
26

proposal-async-iteration

Asynchronous iteration for JavaScript
HTML
857
star
27

proposal-explicit-resource-management

ECMAScript Explicit Resource Management
JavaScript
746
star
28

proposal-set-methods

Proposal for new Set methods in JS
HTML
655
star
29

proposal-string-dedent

TC39 Proposal to remove common leading indentation from multiline template strings
HTML
614
star
30

proposal-operator-overloading

JavaScript
610
star
31

proposal-import-attributes

Proposal for syntax to import ES modules with assertions
HTML
591
star
32

proposal-async-context

Async Context for JavaScript
HTML
587
star
33

proposal-bigint

Arbitrary precision integers in JavaScript
HTML
561
star
34

ecmascript_simd

SIMD numeric type for EcmaScript
JavaScript
540
star
35

ecma402

Status, process, and documents for ECMA 402
HTML
529
star
36

proposal-slice-notation

HTML
523
star
37

proposal-change-array-by-copy

Provides additional methods on Array.prototype and TypedArray.prototype to enable changes on the array by returning a new copy of it with the change.
HTML
511
star
38

notes

TC39 meeting notes
JavaScript
496
star
39

proposal-class-public-fields

Stage 2 proposal for public class fields in ECMAScript
HTML
489
star
40

proposal-iterator.range

A proposal for ECMAScript to add a built-in Iterator.range()
HTML
483
star
41

proposal-decimal

Built-in exact decimal numbers for JavaScript
HTML
477
star
42

proposal-uuid

UUID proposal for ECMAScript (Stage 1)
JavaScript
463
star
43

proposal-module-expressions

HTML
433
star
44

proposal-throw-expressions

Proposal for ECMAScript 'throw' expressions
JavaScript
425
star
45

proposal-UnambiguousJavaScriptGrammar

413
star
46

proposal-weakrefs

WeakRefs
HTML
409
star
47

proposal-array-grouping

A proposal to make grouping of array items easier
HTML
407
star
48

proposal-error-cause

TC39 proposal for accumulating errors
HTML
380
star
49

proposal-cancelable-promises

Former home of the now-withdrawn cancelable promises proposal for JavaScript
Shell
376
star
50

proposal-ecmascript-sharedmem

Shared memory and atomics for ECMAscript
HTML
374
star
51

proposal-module-declarations

JavaScript Module Declarations
HTML
369
star
52

proposal-first-class-protocols

a proposal to bring protocol-based interfaces to ECMAScript users
352
star
53

proposal-relative-indexing-method

A TC39 proposal to add an .at() method to all the basic indexable classes (Array, String, TypedArray)
HTML
351
star
54

proposal-global

ECMAScript Proposal, specs, and reference implementation for `global`
HTML
346
star
55

proposal-private-methods

Private methods and getter/setters for ES6 classes
HTML
345
star
56

proposal-numeric-separator

A proposal to add numeric literal separators in JavaScript.
HTML
330
star
57

proposal-private-fields

A Private Fields Proposal for ECMAScript
HTML
319
star
58

tc39.github.io

Get involved in specifying JavaScript
HTML
318
star
59

proposal-object-from-entries

TC39 proposal for Object.fromEntries
HTML
318
star
60

proposal-promise-allSettled

ECMAScript Proposal, specs, and reference implementation for Promise.allSettled
HTML
314
star
61

proposal-await.ops

Introduce await.all / await.race / await.allSettled / await.any to simplify the usage of Promises
HTML
310
star
62

proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
JavaScript
309
star
63

proposal-export-default-from

Proposal to add `export v from "mod";` to ECMAScript.
HTML
306
star
64

proposal-logical-assignment

A proposal to combine Logical Operators and Assignment Expressions
HTML
302
star
65

proposal-promise-finally

ECMAScript Proposal, specs, and reference implementation for Promise.prototype.finally
HTML
279
star
66

proposal-json-modules

Proposal to import JSON files as modules
HTML
272
star
67

proposal-asset-references

Proposal to ECMAScript to add first-class location references relative to a module
270
star
68

proposal-cancellation

Proposal for a Cancellation API for ECMAScript
HTML
267
star
69

proposal-promise-with-resolvers

HTML
255
star
70

proposal-string-replaceall

ECMAScript proposal: String.prototype.replaceAll
HTML
253
star
71

proposal-export-ns-from

Proposal to add `export * as ns from "mod";` to ECMAScript.
HTML
242
star
72

proposal-structs

JavaScript Structs: Fixed Layout Objects
230
star
73

proposal-ses

Draft proposal for SES (Secure EcmaScript)
HTML
223
star
74

proposal-intl-relative-time

`Intl.RelativeTimeFormat` specification [draft]
HTML
215
star
75

proposal-json-parse-with-source

Proposal for extending JSON.parse to expose input source text.
HTML
214
star
76

proposal-flatMap

proposal for flatten and flatMap on arrays
HTML
214
star
77

proposal-defer-import-eval

A proposal for introducing a way to defer evaluate of a module
HTML
208
star
78

ecmarkup

An HTML superset/Markdown subset source format for ECMAScript and related specifications
TypeScript
201
star
79

proposal-promise-any

ECMAScript proposal: Promise.any
HTML
200
star
80

proposal-optional-chaining-assignment

`a?.b = c` proposal
186
star
81

proposal-decorators-previous

Decorators for ECMAScript
HTML
184
star
82

proposal-smart-pipelines

Old archived draft proposal for smart pipelines. Go to the new Hack-pipes proposal at js-choi/proposal-hack-pipes.
HTML
181
star
83

proposal-array-from-async

Draft specification for a proposed Array.fromAsync method in JavaScript.
HTML
178
star
84

proposal-upsert

ECMAScript Proposal, specs, and reference implementation for Map.prototype.upsert
HTML
176
star
85

proposal-collection-methods

HTML
171
star
86

proposal-array-filtering

A proposal to make filtering arrays easier
HTML
171
star
87

proposal-ptc-syntax

Discussion and specification for an explicit syntactic opt-in for Tail Calls.
HTML
169
star
88

proposal-extractors

Extractors for ECMAScript
JavaScript
166
star
89

proposal-error-stacks

ECMAScript Proposal, specs, and reference implementation for Error.prototype.stack / System.getStack
HTML
166
star
90

proposal-intl-duration-format

164
star
91

how-we-work

Documentation of how TC39 operates and how to participate
161
star
92

proposal-Array.prototype.includes

Spec, tests, reference implementation, and docs for ESnext-track Array.prototype.includes
HTML
157
star
93

proposal-promise-try

ECMAScript Proposal, specs, and reference implementation for Promise.try
HTML
154
star
94

proposal-extensions

Extensions proposal for ECMAScript
HTML
150
star
95

proposal-hashbang

#! for JS
HTML
148
star
96

proposal-import-meta

import.meta proposal for JavaScript
HTML
146
star
97

proposal-intl-segmenter

Unicode text segmentation for ECMAScript
HTML
146
star
98

proposal-resizablearraybuffer

Proposal for resizable array buffers
HTML
145
star
99

proposal-seeded-random

Proposal for an options argument to be added to JS's Math.random() function, and some options to start it with.
HTML
143
star
100

eshost

A uniform wrapper around a multitude of ECMAScript hosts. CLI: https://github.com/bterlson/eshost-cli
JavaScript
142
star