Tokenizer
Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.
Overview
By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:
- Reversible tokenization
  Marking joints or spaces by annotating tokens or injecting modifier characters.
- Subword tokenization
  Support for training and using BPE and SentencePiece models.
- Advanced text segmentation
  Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
- Case management
  Lowercase text and return case information as a separate feature or inject case modifier tokens.
- Protected sequences
  Sequences can be protected against tokenization with the special characters ｟ and ｠.
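To make the ideas of reversible tokenization and protected sequences concrete, here is a minimal pure-Python sketch. It does not use the library and is not its algorithm: `toy_tokenize` and `toy_detokenize` are hypothetical helpers, and the splitting rule (runs of word characters versus single punctuation characters) is a simplifying assumption. The sketch only shows how a joiner marker records where a token was attached to the previous one, so that the original text can be rebuilt.

```python
import re

JOINER = "\uFFED"         # joiner marker, as shown in the tokenized output examples
PROTECT_OPEN = "\uFF5F"   # opening marker of a protected sequence
PROTECT_CLOSE = "\uFF60"  # closing marker of a protected sequence

# Match, in priority order: a whole protected span, a run of word
# characters, or a single non-space, non-word character.
_PIECE = re.compile(
    re.escape(PROTECT_OPEN) + r".*?" + re.escape(PROTECT_CLOSE) + r"|\w+|[^\w\s]"
)


def toy_tokenize(text):
    """Split text into tokens; prefix JOINER when no space preceded the token."""
    tokens = []
    prev_end = None
    for match in _PIECE.finditer(text):
        piece = match.group()
        # The token starts exactly where the previous one ended, so the
        # two were joined in the input: record that with the marker.
        if prev_end is not None and match.start() == prev_end:
            piece = JOINER + piece
        tokens.append(piece)
        prev_end = match.end()
    return tokens


def toy_detokenize(tokens):
    """Rebuild the text: joiner-marked tokens attach without a space."""
    parts = []
    for index, token in enumerate(tokens):
        if token.startswith(JOINER):
            parts.append(token[len(JOINER):])
        else:
            if index > 0:
                parts.append(" ")
            parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    tokens = toy_tokenize("Hello World!")
    print(tokens)
    print(toy_detokenize(tokens))
```

The round trip in this sketch is exact only for input that uses single spaces between words, has no leading or trailing whitespace, and does not itself contain the joiner character. The library handles many more cases through its options.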
See the available options for an overview of supported features.
Using
The Tokenizer can be used from Python, C++, or the command line. Each interface exposes the same set of options.
Python API
pip install pyonmttok
>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '￭!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
See the Python API description for more details.
C++ API
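#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode with joiner annotation, matching the Python example.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;  // receives the output tokens
  tokenizer.tokenize("Hello World!", tokens);
}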
See the Tokenizer class for more details.
Command line clients
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ￭!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!
See the -h flag to list the available options.
Development
Dependencies
Compiling
CMake and a compiler that supports the C++11 standard are required to compile the project.
git submodule update --init
mkdir build
cd build
cmake ..
make
This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.
- To compile only the library, use the -DLIB_ONLY=ON flag.
Testing
The tests use Google Test, which is included as a Git submodule. Run the tests with:
mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data