asr-study: a study of all-neural speech recognition models
This repository contains my work on developing an end-to-end ASR system using Keras and TensorFlow.
Training a character-based all-neural Brazilian Portuguese speech recognition model
Our model was trained on four datasets: CSLU Spoltech (LDC2006S16), Sid, VoxForge, and LapsBM1.4. Of these, only the CSLU dataset is paid.
Set up the (partial) Brazilian Portuguese Speech Dataset (BRSD)
You can download the freely available datasets with the provided script (it may take a while):
$ cd data; sh download_datasets.sh
Next, you can preprocess the downloaded data into a single HDF5 file (see extras.make_dataset for details):
$ python -m extras.make_dataset --parser brsd --input_parser mfcc
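If you want to sanity-check the generated file, here is a minimal sketch using h5py; it only walks the HDF5 tree, since the exact schema written by extras.make_dataset is not described here:

from __future__ import print_function
import h5py

# walk every group/dataset stored in the generated file and print its path
with h5py.File('.datasets/brsd/data.h5', 'r') as f:
    f.visititems(lambda name, obj: print(name, obj))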
Train the network
You can train the network with the train.py script. To train with the default parameters:
$ python train.py --dataset .datasets/brsd/data.h5
Pre-trained model
You may download a pre-trained brsm v1.0 model, trained on the full brsd dataset (including the CSLU dataset):
$ mkdir models; sh download_brsmv1.sh
You can also evaluate the model against the brsd test set:
$ python eval.py --model models/brsmv1.h5 --dataset .datasets/brsd/data.h5
(training curves for brsmv1.h5: figure omitted)
Test set: LER 25.13% (using a beam search decoder with beam width 100)
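For reference, LER here means label error rate: the Levenshtein (edit) distance between the decoded sequence and the reference labels, normalized by the reference length. A minimal sketch of the metric (not the repository's exact implementation):

from __future__ import print_function

def ler(ref, hyp):
    """Label error rate: edit distance between ref and hyp over len(ref)."""
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / float(len(ref))

print(ler('ola mundo', 'ola mundu'))  # 1 substitution / 9 labels ~= 0.111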
Predicting the outputs
To predict the outputs of a trained model on a dataset:
$ python predict.py --model MODEL --dataset DATASET
Available dataset parsers
You can find all available dataset parsers in datasets/.
Creating a custom dataset parser
You may create your own dataset parser. Here is an example:
class CustomParser(DatasetParser):

    def __init__(self, dataset_dir, name='default name', **kwargs):
        super(CustomParser, self).__init__(dataset_dir, name, **kwargs)

    def _iter(self):
        # yield one dictionary per utterance; `dataset` stands in for
        # whatever storage backs your corpus
        for line in dataset:
            yield {'duration': line['duration'],
                   'input': line['input'],
                   'label': line['label'],
                   'non-optional-field': line['non-optional-field']}

    def _report(self, dl):
        # summarize the dataset statistics in a human-readable string
        args = extract_statistics(dl)
        report = '''General information
Number of utterances: %d
Total size (in seconds) of utterances: %.f
Number of speakers: %d''' % args
        return report
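As a rough usage sketch (the constructor arguments come from the example above; how the base class drives _iter and _report internally is an assumption):

# hypothetical usage; DatasetParser normally calls these hooks itself
parser = CustomParser('path/to/my/dataset', name='custom')
for utterance in parser._iter():
    print(utterance['label'])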
Available models
You can see all the available models in core/models.py.
Creating a custom model
You may create your own custom model. Here is an example of a CTC-based model:
from keras.layers import (Input, Dense, LSTM, Bidirectional,
                          TimeDistributed)
from core.models import ctc_model  # assumed location of ctc_model

def custom_model(num_features=26, num_hiddens=100, num_classes=28):
    x = Input(name='inputs', shape=(None, num_features))
    o = x
    o = Bidirectional(LSTM(num_hiddens,
                           return_sequences=True,
                           consume_less='gpu'))(o)
    o = TimeDistributed(Dense(num_classes))(o)
    return ctc_model(x, o)
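A quick smoke test for the model above, assuming ctc_model returns a standard Keras Model (training itself is handled by train.py):

# hypothetical check before wiring the model into train.py
model = custom_model(num_features=26, num_hiddens=100, num_classes=28)
model.summary()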
Contributing
There is plenty of work to be done. All contributions are welcome :).
asr-related work
- Add new layers
- Reproduce topologies and results
  - EESEN
  - Deep Speech 2
  - ConvNet-based architectures
- Add language model
- Encoder-decoder models with attention mechanism
- ASR from raw speech
- Real-time ASR
brsp-related work
- Investigate the brsdv1 model
- Increase the number of datasets (ideally with free datasets)
- Improve the LER
- Train a language model
code-related work
- Test coverage
- Examples
- Better documentation
- Improve the API
- More feature extractors (see audio and text)
- More dataset parsers
  - LibriSpeech
  - TED-LIUM
  - WSJ
  - Switchboard
  - TIMIT
  - VCTK
- Implement a nice wrapper for Kaldi in order to use its feature extractors
- A better way of storing the entire preprocessed dataset
Known bugs
- High memory and CPU consumption
- Predicting with batch size greater than 1 (a Keras bug)
- warp-ctc does not seem to speed up training
- zoneout implementation
Requirements
basic requirements
- Python 2.7
- NumPy
- SciPy
- PyYAML
- HDF5
- Unidecode
- Librosa
- TensorFlow
- Keras
recommended
- warp-ctc (for fast CTC loss calculation)
optional
- SpeechRecognition (to use the eval apis)
- openpyxl (to save the results in an Excel file)
Acknowledgements
- python_speech_features for the audio preprocessing
- Google Magenta for the hparams
- @robertomest for helping me with everything
License
See LICENSE.md for more information