
Making text a first-class citizen in TensorFlow.




TensorFlow Text - Text processing in TensorFlow

IMPORTANT: When installing TF Text with pip install, note the version of TensorFlow you are running, as you should specify the corresponding minor version of TF Text (e.g., for tensorflow==2.3.x, use tensorflow_text==2.3.x).
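
For example, if your environment runs TensorFlow 2.3.x, you would pin the matching minor release of TF Text (the version numbers here are illustrative):

pip install -U "tensorflow-text==2.3.*"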


Introduction

TensorFlow Text provides a collection of text related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

The benefit of using these ops in your text preprocessing is that they are done in the TensorFlow graph. You do not need to worry about tokenization in training being different than the tokenization at inference, or managing preprocessing scripts.
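
As a minimal sketch of this guarantee (assuming tensorflow and tensorflow_text are installed), the tokenizer below runs as an op inside a traced graph, so the same tokenization executes at training time and in the served model:

import tensorflow as tf
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess(docs):
    # Tokenization is an op in the graph, not a separate Python script,
    # so training and inference cannot drift apart.
    return tokenizer.tokenize(docs)

print(preprocess(tf.constant(['Everything not saved will be lost.'])).to_list())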

Documentation

Please visit http://tensorflow.org/text for all documentation. This site includes API docs, guides for working with TensorFlow Text, as well as tutorials for building specific models.

Unicode

Most ops expect that the strings are in UTF-8. If you're using a different encoding, you can use the core TensorFlow op tf.strings.unicode_transcode to transcode into UTF-8. You can also use the same op to coerce your string to structurally valid UTF-8 if your input could be invalid.

docs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),
                    u'Sad☹'.encode('UTF-16-BE')])
utf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',
                                         output_encoding='UTF-8')
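
If your input could be malformed, the same op can coerce it to structurally valid UTF-8 by transcoding UTF-8 to UTF-8 with replacement; a small sketch (errors='replace' substitutes the Unicode replacement character for invalid sequences):

maybe_invalid = tf.constant([b'valid text', b'broken \xff byte'])
clean = tf.strings.unicode_transcode(maybe_invalid,
                                     input_encoding='UTF-8',
                                     output_encoding='UTF-8',
                                     errors='replace')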

Normalization

When dealing with different sources of text, it's important that the same words are recognized as identical. A common technique for case-insensitive matching in Unicode is case folding (similar to lower-casing). (Note that case folding internally applies NFKC normalization.)

We also provide Unicode normalization ops for transforming strings into a canonical representation of characters, with Normalization Form KC being the default (NFKC).

print(text.case_fold_utf8(['Everything not saved will be lost.']))
print(text.normalize_utf8(['Äffin']))
print(text.normalize_utf8(['Äffin'], 'nfkd'))
tf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)
tf.Tensor(['\xc3\x84ffin'], shape=(1,), dtype=string)
tf.Tensor(['A\xcc\x88ffin'], shape=(1,), dtype=string)

Tokenization

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

The main interfaces are Tokenizer and TokenizerWithOffsets, which each have a single method: tokenize and tokenize_with_offsets respectively. There are multiple tokenizer implementations available now. Each of these implements TokenizerWithOffsets (which extends Tokenizer), which includes an option for getting byte offsets into the original string. This allows the caller to know the bytes in the original string the token was created from.

All of the tokenizers return RaggedTensors with the innermost dimension of tokens mapping to the original individual strings. As a result, the resulting shape's rank is increased by one. Please review the ragged tensor guide if you are unfamiliar with them: https://www.tensorflow.org/guide/ragged_tensor
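
For example, tokenizing a rank-1 batch of strings produces a rank-2 RaggedTensor, with one ragged row of tokens per input string (a small sketch):

tokenizer = text.WhitespaceTokenizer()
docs = tf.constant(['a b c', 'd e'])  # shape (2,), rank 1
tokens = tokenizer.tokenize(docs)     # shape (2, None), rank 2
print(tokens.shape)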

WhitespaceTokenizer

This is a basic tokenizer that splits UTF-8 strings on ICU defined whitespace characters (e.g., space, tab, newline).

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\xe2\x98\xb9']]

UnicodeScriptTokenizer

This tokenizer splits UTF-8 strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer, with the most apparent difference being that it will split punctuation (USCRIPT_COMMON) from language texts (e.g., USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other.

tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]

Unicode split

When tokenizing languages without whitespace to segment words, it is common to just split by character, which can be accomplished using the unicode_split op found in core.

tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())
[['\xe4\xbb\x85', '\xe4\xbb\x8a', '\xe5\xb9\xb4', '\xe5\x89\x8d']]

Offsets

When tokenizing strings, it is often desired to know where in the original string the token originated. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that returns the byte offsets along with the tokens. start_offsets lists the byte in the original string at which each token starts (inclusive), and end_offsets lists the byte at which each token ends (exclusive, i.e., the first byte after the token).

tokenizer = text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(
    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]
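
Since the offsets are byte positions into the original string, they can be used to slice the tokens back out; a Python-side sketch using the results above:

doc = b'everything not saved will be lost.'
for start, end in zip(start_offsets.to_list()[0], end_offsets.to_list()[0]):
    print(doc[start:end])  # b'everything', b'not', ..., b'.'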

TF.Data Example

Tokenizers work as expected with the tf.data API. A simple example is provided below.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[['Never', 'tell', 'me', 'the', 'odds.']]
[["It's", 'a', 'trap!']]

Keras API

When you use different tokenizers and ops to preprocess your data, the resulting outputs are RaggedTensors. The Keras API now makes it easy to train a model on ragged tensors without having to worry about padding or masking the data: either use the ToDense layer, which handles all of this for you, or rely on Keras's built-in support for working natively with ragged data in many layers.

model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),
  text.keras.layers.ToDense(pad_value=0, mask=True),
  tf.keras.layers.Embedding(100, 16),
  tf.keras.layers.LSTM(32),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])
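
A hypothetical training call can then take ragged token ids directly, with no manual padding (the ids and labels below are made up for illustration):

ids = tf.ragged.constant([[1, 2, 3], [4, 5]])  # ragged int32 token ids
labels = tf.constant([[1.0], [0.0]])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(ids, labels, epochs=1)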

Other Text Ops

TF.Text packages other useful preprocessing ops. We will review a couple below.

Wordshape

A common feature used in some natural language understanding models is to see if the text string has a certain property. For example, a sentence breaking model might contain features which check for word capitalization or if a punctuation character is at the end of a string.

Wordshape defines a variety of useful regular expression based helper functions for matching various relevant patterns in your input text. Here are a few examples.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Is capitalized?
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
# Are all letters uppercased?
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
# Does the token contain punctuation?
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
# Is the token a number?
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)

print(f1.to_list())
print(f2.to_list())
print(f3.to_list())
print(f4.to_list())
[[True, False, False, False, False, False], [True]]
[[False, False, False, False, False, False], [False]]
[[False, False, False, False, False, True], [True]]
[[False, False, False, False, False, False], [False]]

N-grams & Sliding Window

N-grams are sequences of n consecutive tokens produced by a sliding window of size n. When combining the tokens, there are three reduction mechanisms supported. For text, you would want to use Reduction.STRING_JOIN, which appends the strings to each other. The default separator character is a space, but this can be changed with the string_separator argument.

The other two reduction methods are most often used with numerical values, and these are Reduction.SUM and Reduction.MEAN.

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])

# Ngrams, in this case bi-gram (n = 2)
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)

print(bigrams.to_list())
[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]
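
For numeric inputs, Reduction.SUM and Reduction.MEAN combine the values in each window instead of joining strings; a small sketch with made-up values:

values = tf.ragged.constant([[1.0, 2.0, 3.0], [4.0, 5.0]])
sums = text.ngrams(values, 2, reduction_type=text.Reduction.SUM)
print(sums.to_list())  # [[3.0, 5.0], [9.0]]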

Installation

Install using PIP

When installing TF Text with pip install, please note the version of TensorFlow you are running, as you should specify the corresponding version of TF Text. For example, if you're using TF 2.0, install the 2.0 version of TF Text, and if you're using TF 1.15, install the 1.15 version of TF Text.

pip install -U tensorflow-text==<version>

A note about different operating system packages

After version 2.10, we will only be providing pip packages for Linux x86_64 and Intel-based Macs. TensorFlow Text has always leveraged the release infrastructure of the core TensorFlow package to more easily maintain compatible releases with minimal maintenance, allowing the team to focus on TF Text itself and contributions to other parts of the TensorFlow ecosystem.

For other systems like Windows, Aarch64, and Apple Silicon Macs, TensorFlow relies on build collaborators, and so we will not be providing packages for them. However, we will continue to accept PRs to make building for these OSs easy for users, and will try to point to community efforts related to them.

Build from source steps:

Note that TF Text needs to be built in the same environment as TensorFlow. Thus, if you manually build TF Text, it is highly recommended that you also build TensorFlow.

If building on macOS, you must have coreutils installed. It is probably easiest to install with Homebrew.

  1. Build and install TensorFlow.
  2. Clone the TF Text repo:
    git clone https://github.com/tensorflow/text.git
    cd text
  3. Run the build script to create a pip package:
    ./oss_scripts/run_build.sh
    After this step, there should be a *.whl file in the current directory, with a file name similar to tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl.
  4. Install the package into your environment:
    pip install ./tensorflow_text-*-*-*-os_platform.whl

Build or test using TensorFlow's SIG docker image:

  1. Pull an image from the TensorFlow SIG Docker builds.

  2. Run a container based on the pulled image and create a bash session. This can be done by running docker run -it {image_name} bash.
    {image_name} can be any name in the {tf_version}-python{python_version} format. An example for Python 3.10 and TF version 2.10: 2.10-python3.10.

  3. Clone the TF-Text GitHub repository inside the container: git clone https://github.com/tensorflow/text.git.
    Once cloned, change to the working directory using cd text/.

  4. Run the configuration scripts: ./oss_scripts/configure.sh and ./oss_scripts/prepare_tf_dep.sh.
    This updates Bazel and the TF dependencies to match the TensorFlow installation in the container.

  5. To run the tests, use the bazel command: bazel test --test_output=errors tensorflow_text:all. This will run all the tests declared in the BUILD file.
    To run a specific test, modify the above command replacing :all with the test name (for example :fast_bert_normalizer).

  6. Build the pip package/wheel:
    bazel build --config=release_cpu_linux oss_scripts/pip_package:build_pip_package
    ./bazel-bin/oss_scripts/pip_package/build_pip_package /{wheel_dir}

    Once the build is complete, you should see the wheel available under the {wheel_dir} directory.
