Update:
Update to pytorch1.2 and python3.
CTC-based Automatic Speech Recogniton
This is a CTC-based speech recognition system with pytorch.
At present, the system only supports phoneme recognition.
You can also do it at word-level and may get a high error rate.
Another way is to decode with a lexcion and word-level language model using WFST which is not included in this system.
Data
English Corpus: Timit
- Training set: 3696 sentences(exclude SA utterance)
- Dev set: 400 sentences
- Test set: 192 sentences
Chinese Corpus: 863 Corpus
- Training set:
Speaker | UtterId | Utterances |
---|---|---|
M50, F50 | A1-A521, AW1-AW129 | 650 sentences |
M54, F54 | B522-B1040,BW130-BW259 | 649 sentences |
M60, F60 | C1041-C1560 CW260-CW388 | 649 sentences |
M64, F64 | D1-D625 | 625 sentences |
All | 5146 sentences |
- Test set:
Speaker | UtterId | Utterances |
---|---|---|
M51, F51 | A1-A100 | 100 sentences |
M55, F55 | B522-B521 | 100 sentences |
M61, F61 | C1041-C1140 | 100 sentences |
M63, F63 | D1-D100 | 100 sentences |
All | 800 sentences |
Install
- Install Pytorch
Install warp-ctc and bind it to pytorch.
Notice: If use python2, reinstall the pytorch with source code instead of pip.Use pytorch1.2 built-in CTC function(nn.CTCLoss) Now.- Install Kaldi. We use kaldi to extract mfcc and fbank.
- Install pytorch torchaudio(This is needed when using waveform as input).
Install KenLM. Training n-gram Languange Model if needed. Use Irstlm in kaldi tools instead.- Install and start visdom
pip3 install visdom
python -m visdom.server
- Install other python packages
pip install -r requirements.txt
Usage
- Install all the packages according to the Install part.
- Revise the top script run.sh.
- Open the config file to revise the super-parameters about everything.
- Run the top script with four conditions
bash run.sh data_prepare + AM training + LM training + testing
bash run.sh 1 AM training + LM training + testing
bash run.sh 2 LM training + testing
bash run.sh 3 testing
RNN LM training is not implemented yet. They are added to the todo-list.
Data Prepare
- Extract 39dim mfcc and 40dim fbank feature from kaldi.
- Use compute-cmvn-stats and apply-cmvn with training data to get the global mean and variance and normalize the feature.
- Rewrite Dataset and dataLoader in torch.nn.dataset to prepare data for training. You can find them in the steps/dataloader.py.
Model
- RNN + DNN + CTC RNN here can be replaced by nn.LSTM and nn.GRU
- CNN + RNN + DNN + CTC
CNN is use to reduce the variety of spectrum which can be caused by the speaker and environment difference. - How to choose
Use add_cnn to choose one of two models. If add_cnn is True, then CNN+RNN+DNN+CTC will be chosen.
Training:
- initial-lr = 0.001
- decay = 0.5
- wight-decay = 0.005
Adjust the learning rate if the dev loss is around a specific loss for ten times.
Times of adjusting learning rate is 8 which can be alter in steps/train_ctc.py(line367).
Optimizer is nn.optimizer.Adam with weigth decay 0.005
Decoder
Greedy decoder:
Take the max prob of outputs as the result and get the path.
Calculate the WER and CER by used the function of the class.
Beam decoder:
Implemented with python. Original Code
I fix it to support phoneme for batch decode.
Beamsearch can improve about 0.2% of phonome accuracy.
Phoneme-level language model is inserted to beam search decoder now.
ToDo
- Combine with RNN-LM
- Beam search with RNN-LM
- The code in 863_corpus is a mess. Need arranged.