• This repository has been archived on 23/Apr/2024
  • Stars: 951
  • Rank: 48,061 (Top 1.0 %)
  • Language: C++
  • License: MIT License
  • Created over 5 years ago
  • Updated 8 months ago

Repository Details

Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. Check out our benchmark results.

Key advantages:

  • Multithreading for training and tokenization
  • The algorithm has O(N) complexity, where N is the length of training data
  • Highly efficient implementation in C++
  • Python wrapper and command-line interface
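For readers new to BPE, the merge-learning idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not the O(N) algorithm or the C++ implementation that YouTokenToMe uses; the function name `learn_bpe` and the tie-breaking rule are assumptions made for the example. Each word is handled separately, so no learned token crosses a word boundary.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Naive BPE trainer: repeatedly merge the most frequent adjacent pair.

    Illustration only. Every iteration rescans the whole corpus, so this is
    far slower than the linear-time algorithm mentioned above.
    """
    # Represent each word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs inside words only.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["\u2581aa", "\u2581aa", "\u2581ab"], 2))
```

The returned list is ordered by merge priority: earlier merges were more frequent at the time they were chosen.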

Extra features:

As in the algorithm from the original paper, our implementation does not consider tokens that cross word boundaries. As in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows a sequence of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
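The role of the meta symbol can be illustrated in plain Python. The helpers below are hypothetical and are not part of the youtokentome API; they assume that words are separated by whitespace and that each word receives a leading "▁", as in the example above.

```python
META = "\u2581"  # the meta symbol "▁" that stands in for a space


def mark_word_boundaries(text):
    """Split text on whitespace and prefix every word with the meta symbol."""
    return [META + word for word in text.split()]


def restore_text(tokens):
    """Concatenate tokens and turn meta symbols back into spaces."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


tokens = ["\u2581Bl", "az", "ingly", "\u2581fast", "\u2581token", "ization", "!"]
print(mark_word_boundaries("Blazingly fast tokenization!"))
print(restore_text(tokens))
```

Because every token keeps its boundary marker, the original spacing between words can be recovered without any extra bookkeeping.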

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

 

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage=1.0, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains BPE model and saves to file.

Args:

  • data: string, path to file with training data
  • model: string, path to where the trained model will be saved
  • vocab_size: int, number of tokens in the final vocabulary
  • coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
  • n_threads: int, number of parallel threads used to run. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see benchmark).
  • pad_id: int, reserved id for padding
  • unk_id: int, reserved id for unknown symbols
  • bos_id: int, reserved id for begin of sentence token
  • eos_id: int, reserved id for end of sentence token

Returns: an instance of youtokentome.BPE with the loaded model.
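A possible way to call train with every argument spelled out is sketched below. The helper names `checked_train_kwargs` and `train_bpe` are hypothetical, and the range checks are sanity checks added for this example rather than documented library behavior. The default coverage of 0.9999 follows the suggestion in the argument list above.

```python
def checked_train_kwargs(data, model, vocab_size, coverage=0.9999,
                         n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3):
    """Collect arguments for youtokentome.BPE.train after basic checks."""
    # coverage is documented as a fraction in the range [0, 1].
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in the range [0, 1]")
    # Assumption of this sketch: a vocabulary needs a positive size.
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return {
        "data": data, "model": model, "vocab_size": vocab_size,
        "coverage": coverage, "n_threads": n_threads,
        "pad_id": pad_id, "unk_id": unk_id, "bos_id": bos_id, "eos_id": eos_id,
    }


def train_bpe(**kwargs):
    """Train a model; youtokentome is imported only when this is called."""
    import youtokentome as yttm
    return yttm.BPE.train(**checked_train_kwargs(**kwargs))


print(checked_train_kwargs("train_data.txt", "example.model", 5000))
```

With the package installed, `train_bpe(data="train_data.txt", model="example.model", vocab_size=5000)` would train and return the loaded model.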

 

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

  • model: string, path to the trained model
  • n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

 

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

  • sentences: list of strings, sentences for tokenization.
  • output_type: enum, sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
  • bos: bool, if True, a “beginning of sentence” token will be added
  • eos: bool, if True, an “end of sentence” token will be added
  • reverse: bool, if True the output sequence of tokens will be reversed
  • dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: a list of lists of integers if output_type is youtokentome.OutputType.ID, or a list of lists of strings if output_type is youtokentome.OutputType.SUBWORD.
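The effect of dropout_prob can be pictured with a small pure-Python sketch of the general BPE-dropout idea: at each step, every applicable merge is skipped with probability dropout_prob, and the highest-priority surviving merge is applied. The function `bpe_encode_word` and the merge table are illustrative assumptions; this is not the library's internal implementation.

```python
import random


def bpe_encode_word(word, merges, dropout_prob=0.0, rng=None):
    """Segment one word with an ordered merge list and optional dropout."""
    rng = rng or random.Random()
    # Lower rank means the merge was learned earlier and has higher priority.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Applicable merges that survive dropout at this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout_prob
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


merges = [("\u2581", "f"), ("a", "s"), ("\u2581f", "as"), ("\u2581fas", "t")]
print(bpe_encode_word("\u2581fast", merges))                    # no dropout
print(bpe_encode_word("\u2581fast", merges, dropout_prob=1.0))  # all merges dropped
```

With dropout_prob=0 the segmentation is deterministic; values between 0 and 1 yield varying segmentations of the same word, which is the regularization effect BPE-dropout aims for.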

 

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

 

vocab_size

vocab_size(self)

Returns: int. Size of vocabulary.

 

subword_to_id

subword_to_id(self, subword)

Args:

  • subword: string.

Returns: Integer from the range [0, vocab_size-1]. Id of subword or, if there is no such subword in the vocabulary, unk_id will be returned.

 

id_to_subword

id_to_subword(self, id)

Args:

  • id: int, must be in the range [0, vocab_size-1]

Returns: string. Subword from vocabulary by id.
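The contracts of vocab, vocab_size, subword_to_id and id_to_subword can be mimicked with a toy class. `ToyVocab` is a hypothetical stand-in written for illustration, and the special-token strings are placeholders rather than the names the library uses.

```python
class ToyVocab:
    """Toy stand-in mirroring the lookup methods described above."""

    def __init__(self, subwords, unk_id=1):
        self._subwords = list(subwords)
        self._ids = {s: i for i, s in enumerate(self._subwords)}
        self._unk_id = unk_id

    def vocab(self):
        # The i-th string corresponds to the i-th subword.
        return list(self._subwords)

    def vocab_size(self):
        return len(self._subwords)

    def subword_to_id(self, subword):
        # Unknown subwords fall back to unk_id.
        return self._ids.get(subword, self._unk_id)

    def id_to_subword(self, id):
        # Ids must lie in the range [0, vocab_size - 1].
        if not 0 <= id < len(self._subwords):
            raise IndexError("id out of range")
        return self._subwords[id]


toy = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "\u2581", "a", "\u2581a"])
print(toy.subword_to_id("\u2581a"), toy.subword_to_id("zzz"), toy.id_to_subword(5))
```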

 

decode

decode(self, ids, ignore_ids=None)

Converts each id to a subword and concatenates the subwords with the space symbol.

Args:

  • ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
  • ignore_ids: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.
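A typical use of ignore_ids is to drop special tokens such as the begin-of-sentence and end-of-sentence ids before text is rebuilt. The sketch below imitates that behavior on a toy vocabulary; `decode_ids`, the vocabulary contents and the placeholder special-token strings are assumptions, not the library's code.

```python
META = "\u2581"  # meta symbol that marks a preceding space


def decode_ids(ids, vocab, ignore_ids=None):
    """Map each list of ids to text, skipping any id listed in ignore_ids."""
    ignored = set(ignore_ids or ())
    texts = []
    for sentence in ids:
        pieces = [vocab[i] for i in sentence if i not in ignored]
        texts.append("".join(pieces).replace(META, " ").lstrip(" "))
    return texts


toy_vocab = ["<PAD>", "<UNK>", "<BOS>", "<EOS>",
             "\u2581fast", "\u2581token", "ization"]
# Ids 2 and 3 play the role of bos_id and eos_id in this toy vocabulary.
print(decode_ids([[2, 4, 5, 6, 3]], toy_vocab, ignore_ids=[2, 3]))
```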

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA 

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command trains a Byte Pair Encoding model on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

The encode command applies BPE encoding to a corpus of sentences. It reads input from stdin and writes output to stdout.

By default, encoding runs in parallel using n_threads threads. The number of threads is limited to 8 (see benchmark).

With the --stream option, --n_threads is ignored and sentences are processed one at a time: each sentence is tokenized and written to stdout before the next one is read.

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.
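When the command-line tool has to be driven from a script, the options listed above can be assembled programmatically. The following is a minimal sketch: `build_encode_cmd` and `encode_file` are hypothetical helpers, only flags shown in the help text are used, and the `yttm` executable is assumed to be on PATH.

```python
import subprocess


def build_encode_cmd(model, output_type="subword", n_threads=-1, bos=False,
                     eos=False, reverse=False, stream=False, dropout_prob=0.0):
    """Build the argument list for `yttm encode` from keyword options."""
    if output_type not in ("id", "subword"):
        raise ValueError("output_type must be 'id' or 'subword'")
    cmd = ["yttm", "encode", "--model", model, "--output_type", output_type]
    # Options left at their defaults are simply omitted.
    if n_threads != -1:
        cmd += ["--n_threads", str(n_threads)]
    for flag, enabled in (("--bos", bos), ("--eos", eos),
                          ("--reverse", reverse), ("--stream", stream)):
        if enabled:
            cmd.append(flag)
    if dropout_prob:
        cmd += ["--dropout_prob", str(dropout_prob)]
    return cmd


def encode_file(src_path, dst_path, **options):
    """Run `yttm encode`, feeding src_path on stdin and writing stdout to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        subprocess.run(build_encode_cmd(**options), stdin=src, stdout=dst, check=True)


print(build_encode_cmd("OUTPUT_MODEL_FILE", bos=True, eos=True))
```

Passing the command as a list avoids shell quoting issues, and check=True raises an error if the tool exits with a non-zero status.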

The vocab command prints the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

The decode command converts ids back to text. It reads input from stdin and writes output to stdout.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
