TensorFlowASR
Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2
TensorFlowASR implements several automatic speech recognition architectures, such as DeepSpeech2, Jasper, RNN Transducer, ContextNet, and Conformer. These models can be converted to TFLite to reduce memory and computation for deployment.
What's New?
- (04/17/2021) Refactored the repository for the new version 1.x
- (02/16/2021) Added support for TPU training
- (12/27/2020) Added naive token-level timestamps; see the demo with the flag --timestamp
- (12/17/2020) Added support for ContextNet http://arxiv.org/abs/2005.03191
- (12/12/2020) Added support for masking
- (11/14/2020) Added gradient accumulation for training with larger batch sizes
Table of Contents
- What's New?
- Table of Contents
- 😋 Supported Models
- Installation
- Setup training and testing
- TFLite Conversion
- Features Extraction
- Augmentations
- Training & Testing Tutorial
- Corpus Sources and Pretrained Models
- References & Credits
- Contact
😋 Supported Models
Baselines
- Transducer Models (end-to-end models trained with RNNT Loss; currently supports Conformer, ContextNet, Streaming Transducer)
- CTCModel (end-to-end models trained with CTC Loss; currently supports DeepSpeech2, Jasper)
Publications
- Conformer Transducer (Reference: https://arxiv.org/abs/2005.08100) See examples/conformer
- Streaming Transducer (Reference: https://arxiv.org/abs/1811.06621) See examples/streaming_transducer
- ContextNet (Reference: http://arxiv.org/abs/2005.03191) See examples/contextnet
- Deep Speech 2 (Reference: https://arxiv.org/abs/1512.02595) See examples/deepspeech2
- Jasper (Reference: https://arxiv.org/abs/1904.03288) See examples/jasper
Installation
For training and testing, you should use git clone so that the necessary packages from other authors (ctc_decoders, rnnt_loss, etc.) can be installed.
Installing from source (recommended)
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
# Tensorflow 2.x (with 2.x.x >= 2.5.1)
pip3 install -e ".[tf2.x]" # or ".[tf2.x-gpu]"
For anaconda3:
conda create -y -n tfasr tensorflow-gpu python=3.8 # use tensorflow instead if on CPU; this makes sure conda installs all dependencies for tensorflow
conda activate tfasr
pip install -U tensorflow-gpu # upgrade to latest version of tensorflow
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
# Tensorflow 2.x (with 2.x.x >= 2.5.1)
pip3 install -e ".[tf2.x]" # or ".[tf2.x-gpu]"
Installing via PyPI
# Tensorflow 2.x (with 2.x >= 2.3)
pip3 install -U "TensorFlowASR[tf2.x]" # or pip3 install -U "TensorFlowASR[tf2.x-gpu]"
Running in a container
docker-compose up -d
Setup training and testing
- For datasets, see datasets
- For training, testing and using CTC Models, run ./scripts/install_ctc_decoders.sh
- For training Transducer Models with RNNT Loss in TF, make sure that warp-transducer is not installed (simply run pip3 uninstall warprnnt-tensorflow) (Recommended)
- For training Transducer Models with RNNT Loss from warp-transducer, run export CUDA_HOME=/usr/local/cuda && ./scripts/install_rnnt_loss.sh (Note: only export CUDA_HOME when you have CUDA)
- For mixed precision training, use the flag --mxp when running Python scripts from examples
- For enabling XLA, run TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 $path_to_py_script
- For hiding warnings, run export TF_CPP_MIN_LOG_LEVEL=2 before running any examples
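The XLA and log-level settings above are plain environment variables, so they can also be applied from a small Python launcher instead of the shell. The sketch below is only an illustration: build_tf_env and run_script are hypothetical helper names that are not part of TensorFlowASR, and the script path in the usage comment is an assumption. The variables must be in the environment before TensorFlow is imported, which is why the script is started in a child process.

```python
import os
import subprocess
import sys


def build_tf_env(enable_xla=True, hide_warnings=True, base_env=None):
    """Return a copy of the environment with the TensorFlow variables set.

    Hypothetical helper (not part of TensorFlowASR).
    """
    env = dict(os.environ if base_env is None else base_env)
    if enable_xla:
        # Same value as the shell form: TF_XLA_FLAGS=--tf_xla_auto_jit=2
        env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    if hide_warnings:
        # Same value as the shell form: export TF_CPP_MIN_LOG_LEVEL=2
        env["TF_CPP_MIN_LOG_LEVEL"] = "2"
    return env


def run_script(script_path, *args, **env_options):
    """Run a Python script in a child process with the prepared environment."""
    cmd = [sys.executable, script_path, *args]
    return subprocess.run(cmd, env=build_tf_env(**env_options), check=True)


# Usage (the script path is an assumption; --mxp is the mixed precision flag):
# run_script("examples/conformer/train.py", "--mxp")
```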
TFLite Conversion
After conversion, the TFLite model behaves like a function that maps an audio signal directly to Unicode code points, which can then be converted to a string.
- Install tf-nightly using pip install tf-nightly
- Build a model with the same architecture as the trained model (if the model has a tflite argument, you must set it to True), then load the weights from the trained model into the built model
- Load TFSpeechFeaturizer and TextFeaturizer into the model using the function add_featurizers
- Convert the model's function to tflite as follows:
import tensorflow as tf

# model is the model built in the previous steps, with weights loaded and featurizers added
func = model.make_tflite_function(**options)  # options are the arguments of the function
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Allow selected TensorFlow ops in addition to the TFLite builtin ops
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
- Save the converted tflite model as follows:
import os

# Create the output directory if it does not exist, then write the serialized model
os.makedirs(os.path.dirname(tflite_path), exist_ok=True)
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
- Then the .tflite model is ready to be deployed
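As described at the start of this section, the converted model returns Unicode code points rather than text. A minimal sketch of the final decoding step is shown below. The function name code_points_to_text is hypothetical, and treating the value 0 as padding to be skipped is an assumption; check the output of your own model before relying on it.

```python
from typing import Iterable


def code_points_to_text(code_points: Iterable[int], pad_value: int = 0) -> str:
    """Join Unicode code points into a string.

    Assumption: entries equal to pad_value are padding and are skipped.
    """
    return "".join(chr(int(cp)) for cp in code_points if int(cp) != pad_value)


# Works for any language, since each value is a plain Unicode code point.
english = code_points_to_text([104, 101, 108, 108, 111, 0, 0])
vietnamese = code_points_to_text([120, 105, 110, 32, 99, 104, 224, 111])
```

Here english is "hello" and vietnamese is "xin chào".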
Features Extraction
Augmentations
See augmentations
Training & Testing Tutorial
- Define a config YAML file; see the config.yml files in the example folder for reference (you can copy and modify values such as parameters, paths, etc. to match your local machine configuration)
- Download your corpus (a.k.a. datasets) and create a script to generate transcripts.tsv files from your corpus (this is the general format used in this project because each dataset has a different format). For more detail, see datasets. Note: Make sure your data contains only characters in your language; for example, English has a to z and '. Do not use cache if your dataset does not fit in RAM.
- [Optional] Generate TFRecords to use tf.data.TFRecordDataset for better performance by using the script create_tfrecords.py
- Create a vocabulary file (characters or subwords/wordpieces) by defining language.characters, using the scripts generate_vocab_subwords.py or generate_vocab_sentencepiece.py. There are predefined ones in vocabularies
- [Optional] Generate a metadata file for your dataset by using the script generate_metadata.py. This metadata file contains the maximum lengths calculated with your config.yml and the total number of elements in each dataset, for static shape training and precalculated steps per epoch.
- For training, see the train.py files in the example folder for the available options
- For testing, see the test.py files in the example folder for the available options. Note: Testing is currently not supported for TPUs. It prints nothing other than the progress bar in the console, but it stores the predicted transcripts in the file output.tsv and then calculates the metrics from that file.
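The corpus preparation step above can be sketched as follows. This is a rough illustration only: the column layout (PATH, DURATION, TRANSCRIPT, tab-separated) is an assumption that should be checked against the datasets documentation, the helper names are hypothetical, and the character filter implements the English example from the note (a to z, apostrophe, and space). It reads WAV files with the standard library only.

```python
import csv
import re
import wave
from pathlib import Path

# Assumption: English character set from the note above (a-z, apostrophe, space).
_DISALLOWED = re.compile(r"[^a-z' ]+")


def clean_transcript(text: str) -> str:
    """Lowercase, replace disallowed characters with spaces, collapse whitespace."""
    text = _DISALLOWED.sub(" ", text.lower())
    return " ".join(text.split())


def wav_duration_seconds(path) -> float:
    """Duration of a WAV file in seconds, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def write_transcripts_tsv(entries, tsv_path) -> int:
    """Write (wav_path, raw_transcript) pairs to a TSV file.

    Returns the number of data rows written; entries whose transcript is
    empty after cleaning are skipped.
    """
    rows = 0
    with open(tsv_path, "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["PATH", "DURATION", "TRANSCRIPT"])  # assumed header
        for wav_path, raw in entries:
            transcript = clean_transcript(raw)
            if not transcript:
                continue
            duration = f"{wav_duration_seconds(wav_path):.2f}"
            writer.writerow([str(Path(wav_path).resolve()), duration, transcript])
            rows += 1
    return rows
```

For other languages, replace the character pattern with the character set of that language.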
FYI: Keras built-in training uses an infinite dataset, which avoids a potential partial last batch.
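Because an infinite dataset has no natural end, the number of steps per epoch has to be supplied explicitly, which is where the total element count from the metadata file is useful. The sketch below shows one plausible way to derive it. The function name and the drop_remainder switch are assumptions for illustration; the project's own calculation may differ.

```python
import math


def steps_per_epoch(total_elements: int, batch_size: int, drop_remainder: bool = True) -> int:
    """Number of batches that make up one pass over the dataset.

    With drop_remainder=True the partial last batch is not counted;
    otherwise it counts as one extra step.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        return total_elements // batch_size
    return math.ceil(total_elements / batch_size)


# Example: 1000 utterances with a batch size of 32
full_batches_only = steps_per_epoch(1000, 32)                    # 31
with_partial = steps_per_epoch(1000, 32, drop_remainder=False)   # 32
```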
See examples for some predefined ASR models and results
Corpus Sources and Pretrained Models
For pretrained models, go to drive
English
Name | Source | Hours |
---|---|---|
LibriSpeech | LibriSpeech | 970h |
Common Voice | https://commonvoice.mozilla.org | 1932h |
Vietnamese
Name | Source | Hours |
---|---|---|
Vivos | https://ailab.hcmus.edu.vn/vivos | 15h |
InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h |
InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h |
German
Name | Source | Hours |
---|---|---|
Common Voice | https://commonvoice.mozilla.org/ | 750h |
References & Credits
- NVIDIA OpenSeq2Seq Toolkit
- https://github.com/noahchalifour/warp-transducer
- Sequence Transduction with Recurrent Neural Network
- End-to-End Speech Processing Toolkit in PyTorch
- https://github.com/iankur/ContextNet
Contact
Huy Le Nguyen
Email: [email protected]