
SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

  • Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
  • Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
  • Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo.] are supported.
  • Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout, which help to improve the robustness and accuracy of NMT models.
  • Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
  • Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
  • Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as software or as an algorithm, a gentle introduction is available here.

Comparisons with other implementations

Feature                                 | SentencePiece            | subword-nmt | WordPiece
----------------------------------------|--------------------------|-------------|----------
Supported algorithm                     | BPE, unigram, char, word | BPE         | BPE*
OSS?                                    | Yes                      | Yes         | Google internal
Subword regularization                  | Yes                      | No          | No
Python Library (pip)                    | Yes                      | No          | N/A
C++ Library                             | Yes                      | No          | N/A
Pre-segmentation required?              | No                       | Yes         | Yes
Customizable normalization (e.g., NFKC) | Yes                      | No          | N/A
Direct id generation                    | Yes                      | No          | N/A

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high-level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated, as we have to run language-dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese, where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello World." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between "World" and "." is dropped from the tokenized sequence, since, e.g., Tokenize("World.") == Tokenize("World .").

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Helloโ–World.

Then, this text is segmented into small pieces, for example:

[Hello] [โ–Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.
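
As a concrete sketch, the round trip above can be reproduced in a few lines of plain Python, with no model file needed (the pieces are copied from the segmented example):

pieces = ['Hello', '▁Wor', 'ld', '.']

# Concatenate the pieces and restore the meta symbol '▁' to a plain space.
detokenized = ''.join(pieces).replace('▁', ' ')
assert detokenized == 'Hello World.'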

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

  • (en) Hello world. → [Hello] [World] [.] (A space between Hello and World)
  • (ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library (C++/Python) into the NMT system to sample one segmentation for each parameter update, which differs from the standard off-line data preparation. Here is an example with the Python library. You can see that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['โ–', 'N', 'e', 'w', 'โ–York']
['โ–', 'New', 'โ–York']
['โ–', 'New', 'โ–Y', 'o', 'r', 'k']
['โ–', 'New', 'โ–York']
['โ–', 'New', 'โ–York']

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

pip install sentencepiece

For more details, see Python module.

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

  • cmake
  • C++11 compiler
  • gperftools library (optional; a 10-40% performance improvement can be obtained)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache.

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Download and install SentencePiece from signed released wheels

You can download the wheel from the GitHub releases page. We generate SLSA3 signatures using the OpenSSF's slsa-framework/slsa-github-generator during the release process. To verify a release binary:

  1. Install the verification tool from slsa-framework/slsa-verifier#installation.
  2. Download the provenance file attestation.intoto.jsonl from the GitHub releases page.
  3. Run the verifier:
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>

  4. If the verification passes, install the wheel:

pip install wheel_file.whl

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: amount of characters covered by the model; good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

Use the --help flag to display all parameters for training, or see here for an overview.
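
The same model can be trained through the Python wrapper. The snippet below is a sketch using the same parameters as the command above; corpus.txt is a placeholder for your own one-sentence-per-line file:

import sentencepiece as spm

# Writes m.model and m.vocab, mirroring the spm_train invocation above.
spm.SentencePieceTrainer.train(
    input='corpus.txt',          # placeholder corpus file
    model_prefix='m',
    vocab_size=8000,
    character_coverage=1.0,
    model_type='unigram',        # or 'bpe', 'char', 'word'
)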

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
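
The Python wrapper exposes the same encoding options. The following sketch assumes a model m.model trained as above and uses the wrapper's snake_case method aliases:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
text = 'I saw a girl with a telescope.'

pieces = sp.encode(text, out_type=str)   # like --output_format=piece
ids = sp.encode(text, out_type=int)      # like --output_format=id

# Like --extra_options=bos:eos.
ids_with_markers = sp.encode(text, out_type=int, add_bos=True, add_eos=True)

# Like --output_format=nbest_piece (alias of NBestEncodeAsPieces; unigram only).
nbest = sp.nbest_encode_as_pieces(text, 10)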

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output
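
From Python, decode() accepts either pieces or ids; a sketch, again assuming m.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
text = 'I saw a girl with a telescope.'

# decode() restores the original text from pieces or from ids alike.
assert sp.decode(sp.encode(text, out_type=str)) == text
assert sp.decode(sp.encode(text, out_type=int)) == text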

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
โ–I โ–saw โ–a โ–girl โ–with โ–a โ– te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can see that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary items and their emission log probabilities. The vocabulary id corresponds to the line number in this file.
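
The same mapping can also be inspected programmatically; a sketch assuming m.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Print the first few entries: id, piece, and emission log probability.
for i in range(5):
    print(i, sp.id_to_piece(i), sp.get_score(i))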

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens, which have ids 0, 1, and 2, respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g., --bos_id=-1, that special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) as --pad_id=3.

If you want to assign other special tokens, please see Use custom symbols.
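
The reserved ids of a trained model can be queried from Python; a sketch:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Ids as configured at training time; a disabled token reports -1.
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())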

Vocabulary restriction

spm_encode accepts --vocabulary and --vocabulary_threshold options so that spm_encode will only produce symbols that also appear in the vocabulary (with at least some frequency). The background of this feature is described in the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model and get the resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used as a precaution, because spm_train loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
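
In the Python wrapper, the same restriction can be applied with set_vocabulary (the snake_case alias of SetVocabulary). The sketch below assumes the --generate_vocabulary output consists of tab-separated piece/frequency lines, and uses vocab.L1 as a placeholder filename:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')

# Keep only pieces whose corpus frequency reaches the threshold.
threshold = 50
with open('vocab.L1') as f:                      # placeholder vocabulary file
    valid = [line.split('\t')[0] for line in f
             if float(line.rstrip('\n').split('\t')[1]) >= threshold]

sp.set_vocabulary(valid)   # encode() now produces only pieces in `valid`
print(sp.encode('I saw a girl with a telescope.', out_type=str))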

Advanced topics
