Unicode text segmentation for ECMAScript

Intl.Segmenter: Unicode segmentation in JavaScript

Stage 4 proposal, champion Richard Gibson

Motivation

A code point is not a "letter" or a displayed unit on the screen. That designation goes to the grapheme, which can consist of multiple code points (e.g., including accent marks, conjoining Korean characters). Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. This may be useful in implementing advanced editors/input methods, or other forms of text processing.

Unicode also defines an algorithm for finding boundaries between words and sentences, which CLDR tailors per locale. These boundaries may be useful, for example, in implementing a text editor which has commands for jumping or highlighting words and sentences.

Grapheme, word and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation to function, and shipping it to JavaScript saves memory and network bandwidth as compared to expecting developers to implement it themselves in JavaScript.

Chrome has been shipping its own nonstandard segmentation API called Intl.v8BreakIterator for a few years. However, for a few reasons, this API does not seem suitable for standardization. This explainer outlines a new API which attempts to be more in accordance with modern, post-ES2015 JavaScript API design.

Examples

Segment iteration

Objects returned by the segment method of an Intl.Segmenter instance find boundaries and expose segments between them via the Iterable interface.

// Create a locale-specific word segmenter
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get an iterator for a string
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»

For flexibility and advanced use cases, they also support direct random access.

// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined

API

A polyfill is available for a historical snapshot of this proposal.

new Intl.Segmenter(locale, options)

Creates a new locale-dependent Segmenter. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence", defaulting to "grapheme").
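
As a minimal sketch of the default described above: the sample strings and the countGraphemes helper are illustrative, and resolvedOptions() belongs to the shipped Intl.Segmenter API but is not covered in this explainer. Omitting options should yield a grapheme segmenter, which counts displayed units rather than code units.

```javascript
// No options bag: granularity falls back to "grapheme".
const defaultSegmenter = new Intl.Segmenter("en");
const defaultGranularity = defaultSegmenter.resolvedOptions().granularity;

// Hypothetical helper: count graphemes instead of UTF-16 code units.
function countGraphemes(text) {
  let count = 0;
  for (const _ of defaultSegmenter.segment(text)) count++;
  return count;
}

const accented = "e\u0301";          // "e" followed by a combining acute accent
const hangul = "\u1112\u1161\u11AB"; // three conjoining jamo forming one syllable block

console.log(defaultGranularity);
console.log(accented.length, countGraphemes(accented));
console.log(hangul.length, countGraphemes(hangul));
```

Here String.prototype.length reports code units, while the segmenter reports what a reader would perceive as single characters.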

Intl.Segmenter.prototype.segment(string)

Creates a new Iterable %Segments% instance for the input string using the Segmenter's locale and granularity.
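
Because the returned object is iterable, it can be handed to anything that consumes iterables. The sketch below assumes a hypothetical splitInto helper and an English sample sentence; neither comes from the proposal.

```javascript
// Hypothetical helper: collect the segment strings for a given granularity.
function splitInto(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  // The object returned by segment() is iterable, so Array.from can consume it.
  return Array.from(segmenter.segment(text), (data) => data.segment);
}

const sample = "Hello, world! How are you?";
const sentences = splitInto(sample, "sentence");
const words = splitInto(sample, "word");

console.log(sentences);
console.log(words);
```

Joining either array back together reproduces the input, since the segments cover the string without gaps.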

Segment data

Segments are described by plain objects with the following data properties:

  • segment is the string segment.
  • index is the code unit index in the string at which the segment begins.
  • input is the string being segmented.
  • isWordLike is true when granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not "word".
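
The properties above can be inspected directly. This is a small sketch with an illustrative input string, comparing "word" and "grapheme" granularity:

```javascript
const text = "Hi there";

// Word granularity: isWordLike is a boolean on every segment data object.
const wordData = [
  ...new Intl.Segmenter("en", { granularity: "word" }).segment(text),
];

// Grapheme granularity: isWordLike is undefined.
const graphemeData = [
  ...new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text),
];

for (const { segment, index, input, isWordLike } of wordData) {
  // input is the full string that was segmented, not just this segment.
  console.log(JSON.stringify(segment), index, input === text, isWordLike);
}
console.log(graphemeData[0].isWordLike);
```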

Methods of %Segments%.prototype:

%Segments%.prototype.containing(index)

Returns a segment data object describing the segment in the string including the code unit at the specified index, or undefined if the index is out of bounds.
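
One use for containing is finding the word under a cursor, as an editor might on double-click. The wordAt helper below is hypothetical and only sketches the idea:

```javascript
// Hypothetical helper: return the word-like segment at a code unit position,
// or undefined if the position is out of bounds or sits on punctuation/whitespace.
function wordAt(text, position, locale = "en") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  const data = segments.containing(position);
  return data && data.isWordLike ? data : undefined;
}

const hit = wordAt("Hello, world", 9);      // inside "world"
const miss = wordAt("Hello, world", 5);     // on the comma
const outside = wordAt("Hello, world", 12); // one past the last code unit

console.log(hit && hit.segment, miss, outside);
```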

%Segments%.prototype[Symbol.iterator]

Creates a new %SegmentIterator% instance which will lazily find segments in the input string using the Segmenter's locale and granularity, keeping track of its current position within the string.

Methods of %SegmentIterator%.prototype:

%SegmentIterator%.prototype.next()

The next method implements the Iterator interface, finding the next segment and returning a corresponding IteratorResult object whose value property is a segment data object as described above.
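
A for...of loop normally drives the iterator, but it can also be stepped by hand. The following sketch (with an illustrative input) shows that each call to [Symbol.iterator]() produces an independent iterator that starts at the beginning of the string:

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("one two");

// Two iterators over the same %Segments% object keep separate positions.
const first = segments[Symbol.iterator]();
const second = segments[Symbol.iterator]();

const a = first.next();  // first segment
const b = first.next();  // second segment
const c = second.next(); // unaffected by the calls on `first`

// Drain the remainder of the first iterator manually.
const rest = [];
for (let step = first.next(); !step.done; step = first.next()) {
  rest.push(step.value.segment);
}
const finished = first.next(); // stays done once exhausted

console.log(a.value.segment, JSON.stringify(b.value.segment), c.value.segment, rest, finished.done);
```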

FAQ

Why should we pass a locale and options bag for grapheme boundaries? Isn't there just one way to do it?

The situation is a little more complicated, e.g., for Indic scripts. Work is ongoing to better support grapheme boundary options for these scripts; see this bug, and in particular this CLDR wiki page. It seems that CLDR/ICU don't support this yet, but it's planned.

Shouldn't we be putting new APIs in built-in modules?

If built-in modules had come out before this proposal reached Stage 3, that would have been a good option. However, so far the approach in TC39 has been not to block either effort on the other. Built-in modules still have some big questions to resolve, e.g., how/whether polyfills should interact with them.

Why is line breaking not included?

Line breaking was provided in an earlier version of this API, but it was excluded because a line breaking API on its own would be incomplete: line breaking is typically used when laying out text, and text layout requires a larger set of APIs, e.g., determining the width of a rendered string of text. For this reason, we suggest continued development of a line breaking API as part of the CSS Houdini effort.

Why is hyphenation not included?

Hyphenation is expected to have a different sort of API shape for various reasons:

  • Adding a hyphenation break may change the spelling of the affected text
  • There may be hyphenation breaks of different priorities
  • Hyphenation plays into line layout and font rendering in a more complex way, and we might want to expose it at that level (e.g., in the Web Platform rather than ECMAScript)
  • Hyphenation is just a less well-developed thing in the internationalization world. CLDR and ICU don't support it yet; certain web browsers are only getting support for it now in CSS. It's often not done perfectly. It could use some more time to bake. By contrast, word, grapheme, sentence and line breaks have been in the Unicode specification for a long time; this is a shovel-ready project.

Why is random-access stateless?

It would be possible to expose methods on %SegmentIterator%.prototype that mutate internal state (e.g., seek([inclusiveStartIndex = thisIterator.index + 1]) and seekBefore([exclusiveLastIndex = thisIterator.index])), and in fact these were part of earlier designs. They were dropped for consistency with other ECMA-262 iterators (whose movement is always forward and without gaps). If real-world use reveals that their absence is an ergonomic and/or performance flaw, they can be added in a followup proposal.
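
In the meantime, forward and backward stepping can be approximated on top of containing. The nextSegment and previousSegment helpers below are hypothetical names used for illustration, not part of the proposal:

```javascript
// Hypothetical helpers emulating stateful seeking with stateless lookups.
function nextSegment(segments, current) {
  // The code unit just past the current segment starts the next one.
  return segments.containing(current.index + current.segment.length);
}

function previousSegment(segments, current) {
  // index - 1 is the last code unit of the preceding segment; for the
  // first segment this is -1, which is out of bounds and yields undefined.
  return segments.containing(current.index - 1);
}

const navSegments = new Intl.Segmenter("en", { granularity: "word" }).segment("Hello, world");
const world = navSegments.containing(7);
const space = previousSegment(navSegments, world);
const beforeStart = previousSegment(navSegments, navSegments.containing(0));
const afterEnd = nextSegment(navSegments, world);

console.log(world.segment, JSON.stringify(space.segment), beforeStart, afterEnd);
```

The caller holds the current position in a plain variable, so nothing inside the %Segments% object changes between calls.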

Why is this an Intl API instead of String methods?

All of these boundary types are actually locale-dependent, and some allow complex options. The result of the segment method is an iterable %Segments% object rather than a plain value. For many non-trivial cases like this, analogous APIs are put in ECMA-402's Intl object. This allows the work that happens on each instantiation to be shared, improving performance. We could add a convenience method on String as a follow-on proposal.
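
A sketch of the sharing this enables: construct the segmenter once and reuse it across many strings. The countWords helper and the sample inputs are illustrative only.

```javascript
// Construct once; locale and option resolution happen here.
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical helper that reuses the shared instance for every call.
function countWords(text) {
  let count = 0;
  for (const { isWordLike } of wordSegmenter.segment(text)) {
    if (isWordLike) count++;
  }
  return count;
}

const counts = ["Hello, world!", "One more line.", ""].map(countWords);
console.log(counts);
```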

What exactly does the index refer to?

An index n refers to the code unit index within a string that is potentially the start of a segment. For example, when iterating over the string "Hello, world💙" by words in English, segments will start at indexes 0, 5, 6, 7, and 12 (i.e., the string gets segmented like ┃Hello┃,┃ ┃world┃💙┃, with the final segment consisting of a surrogate pair of two code units encoding a single code point). The definition of these boundary indexes does not depend on whether forwards or backwards iteration is used.
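
The example above can be checked with a short sketch. Variable names are illustrative, and the exact boundaries come from the engine's locale data.

```javascript
const heartInput = "Hello, world💙";
const heartSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const heartSegments = [...heartSegmenter.segment(heartInput)];

// Start indexes are measured in UTF-16 code units.
const startIndexes = heartSegments.map(({ index }) => index);

const last = heartSegments[heartSegments.length - 1];
const lastCodeUnits = last.segment.length;       // code units in the final segment
const lastCodePoints = [...last.segment].length; // code points in the final segment

// An index in the middle of the surrogate pair still resolves to that segment.
const midPair = heartSegmenter.segment(heartInput).containing(13);

console.log(startIndexes, lastCodeUnits, lastCodePoints, midPair.index);
```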

What happens when segmenting an empty string?

No segments will be found, and iterators will complete immediately upon first next() access.
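
A brief sketch of that behavior, using illustrative variable names:

```javascript
const emptySegments = new Intl.Segmenter("en", { granularity: "word" }).segment("");

// The very first next() call reports completion.
const firstResult = emptySegments[Symbol.iterator]().next();

// Spreading therefore produces an empty array.
const collected = [...emptySegments];

// Every index, including 0, is out of bounds for an empty string.
const lookup = emptySegments.containing(0);

console.log(firstResult.done, collected.length, lookup);
```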

What happens when I try to use random access with non-Number values?

Someone's in QA. 😉 The containing argument is processed into an integer Number—null, undefined, and NaN become 0, Booleans become 0 or 1, Strings are parsed as string numeric literals, Objects are cast to primitives, and Symbols and BigInts fail with a TypeError exception. Fractional components are truncated, but infinite Numbers are accepted as-is (although they are always out of bounds and will therefore never find a segment).
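
The coercion rules can be exercised with a small sketch. The input "ab cd" and the helper names are illustrative; with word granularity it has segments starting at code unit indexes 0, 2, and 3.

```javascript
const coerceSegments = new Intl.Segmenter("en", { granularity: "word" }).segment("ab cd");

// Hypothetical helper: start index of the segment found for an argument.
const startOf = (arg) => coerceSegments.containing(arg)?.index;

const results = {
  fromNull: startOf(null),                      // treated as 0
  fromUndefined: startOf(undefined),            // treated as 0
  fromNaN: startOf(NaN),                        // treated as 0
  fromTrue: startOf(true),                      // treated as 1
  fromString: startOf("3"),                     // parsed as 3
  fromObject: startOf({ valueOf: () => 3 }),    // cast to the primitive 3
  fromFraction: startOf(3.9),                   // truncated to 3
  fromInfinity: startOf(Infinity),              // always out of bounds
};

// Hypothetical helper: report whether containing() throws a TypeError.
function throwsTypeError(arg) {
  try {
    coerceSegments.containing(arg);
    return false;
  } catch (error) {
    return error instanceof TypeError;
  }
}

const symbolThrows = throwsTypeError(Symbol("x"));
const bigintThrows = throwsTypeError(1n);

console.log(results, symbolThrows, bigintThrows);
```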

Implementations
