

Open-Source Toolkit for End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra.

An Apache 2.0 ASR research library, built on PyTorch, for developing end-to-end speech recognition models.


Introduction • Roadmap • Docs • Codefactor • License • Gitter • Paper


This repository has been archived. Depending on why you found this repo, I recommend a different repository:

  • I want to train my own voice recognition model or study the internal code! → OpenSpeech
  • I want to test a trained Korean speech recognition model right away! → Pororo ASR or Whisper

What's New

  • May 2021: Fix LayerNorm Error, Subword Error
  • February 2021: Update Documentation
  • February 2021: Add RNN-Transducer model
  • January 2021: Release v1.3
  • January 2021: Add Conformer model
  • January 2021: Add Jasper model
  • January 2021: Add Joint CTC-Attention Transformer model
  • January 2021: Add Speech Transformer model
  • January 2021: Apply Hydra: framework for elegantly configuring complex applications

Note

  • I recently modified a lot of the code, but I have been busy and could not test every case. If you find an error, please feel free to give me feedback.
  • Subword and grapheme units are currently not tested.

KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [Paper]

KoSpeech is an open-source, modular, and extensible end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch. Several open-source ASR toolkits have been released, but all of them deal with non-Korean languages, such as English (e.g. ESPnet, Espresso). Although AI Hub has released a 1,000-hour Korean speech corpus known as KsponSpeech, there is no established preprocessing method or baseline model against which to compare model performance. We therefore propose preprocessing methods for the KsponSpeech corpus and several baseline models (Deep Speech 2, LAS, Transformer, Jasper, Conformer). We hope KoSpeech can serve as a guideline for those who research Korean speech recognition.

Supported Models

| Acoustic Model | Notes | Citation |
|---|---|---|
| Deep Speech 2 | 2D-invariant convolution & RNN & CTC | Dario Amodei et al., 2015 |
| Listen Attend Spell (LAS) | Attention based RNN sequence to sequence | William Chan et al., 2016 |
| Joint CTC-Attention LAS | Joint CTC-Attention LAS | Suyoun Kim et al., 2017 |
| RNN-Transducer | RNN Transducer | Alex Graves, 2012 |
| Speech Transformer | Convolutional extractor & transformer | Linhao Dong et al., 2018 |
| Jasper | Fully convolutional & dense residual connection & CTC | Jason Li et al., 2019 |
| Conformer | Convolution-augmented-Transformer | Anmol Gulati et al., 2020 |
  • Note
    The implementations are based on the papers above, but some parts of a model may differ from its paper.

Introduction

End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.

For example, training an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline and models that operate at low audio frame rates. This reduces training and decoding time and allows joint optimization with downstream processing such as natural language understanding.

Roadmap

So far, several models are implemented: Deep Speech 2, Listen, Attend and Spell (LAS), RNN-Transducer, Speech Transformer, Joint CTC-Attention, Jasper, and Conformer.

  • Deep Speech 2

Deep Speech 2 is trained with the Connectionist Temporal Classification (CTC) loss and showed faster and more accurate performance on ASR tasks. The model is noted for significantly improving performance over previous end-to-end models.
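To make the role of CTC more concrete, here is a minimal sketch of the collapse rule used when decoding CTC outputs greedily: merge consecutive repeated labels, then drop the blank symbol. The blank index `0` and the function name are assumptions for illustration and are not taken from the KoSpeech code.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical)


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    return [label for label, _ in groupby(frame_ids) if label != blank]


# Per-frame argmax labels from a hypothetical model output.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(frames))
```

The blank between the two runs of `3` is what lets CTC emit the same label twice in a row; without it the two runs would merge into one.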

  • Listen, Attend and Spell (LAS)

We follow the architecture proposed in "Listen, Attend and Spell", with some modifications to improve performance. We provide four attention mechanisms: scaled dot-product attention, additive attention, location-aware attention, and multi-head attention. The choice of attention mechanism strongly affects model performance.
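As an illustration of the simplest of these mechanisms, the sketch below computes scaled dot-product attention for a single query in plain Python. It is a didactic example under stated assumptions (one query vector, list-based tensors, a hypothetical function name), not the batched PyTorch module used in the toolkit.

```python
import math


def scaled_dot_product_attention(query, keys, values):
    """Return (context, weights) for one query over a sequence of keys/values."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the values.
    dim = len(values[0])
    context = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return context, weights


context, weights = scaled_dot_product_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, context)
```

The query matches the first key more closely, so the first value receives the larger weight.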

  • RNN-Transducer

RNN-Transducers are a form of sequence-to-sequence model that does not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet.

  • Speech Transformer

The Transformer is a powerful architecture in the Natural Language Processing (NLP) field. It has also shown good performance on ASR tasks. In addition, because research on this model continues in the NLP field, it has high potential for further development.

  • Joint CTC-Attention

This architecture takes advantage of both CTC-based and attention-based models. Adding a CTC objective to the encoder makes the model more robust. Joint CTC-Attention can be trained in combination with LAS and Speech Transformer.
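In the joint CTC-attention formulation of Kim et al. (2017), the two objectives are combined by linear interpolation. The sketch below shows that combination on scalar loss values; the function name and the default weight of 0.3 are assumptions for illustration, not values taken from this repository's configuration.

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the CTC loss and the attention (cross-entropy) loss."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical per-batch loss values.
print(joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0))
```

Setting `ctc_weight` to 0 recovers a purely attention-based objective, and setting it to 1 recovers pure CTC training.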

  • Jasper

Jasper (Just Another SPEech Recognizer) is an end-to-end convolutional neural acoustic model. Jasper achieved strong performance using only blocks of CNN → BatchNorm → ReLU → Dropout with residual connections.

  • Conformer

Conformer combines convolutional neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. In the original paper, Conformer significantly outperformed previous Transformer- and CNN-based models, achieving state-of-the-art accuracy at the time.

Installation

This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtualenv or conda).

Prerequisites

  • NumPy: pip install numpy (refer here if you have problems installing NumPy)
  • PyTorch: refer to the PyTorch website to install the version that matches your environment
  • Pandas: pip install pandas (refer here if you have problems installing Pandas)
  • Matplotlib: pip install matplotlib (refer here if you have problems installing Matplotlib)
  • librosa: conda install -c conda-forge librosa (refer here if you have problems installing librosa)
  • torchaudio: pip install torchaudio==0.6.0 (refer here if you have problems installing torchaudio)
  • tqdm: pip install tqdm (refer here if you have problems installing tqdm)
  • sentencepiece: pip install sentencepiece (refer here if you have problems installing sentencepiece)
  • warp-rnnt: pip install warp_rnnt (refer here if you have problems installing warp-rnnt)
  • hydra: pip install hydra-core --upgrade (refer here if you have problems installing hydra)

Install from source

Currently we only support installation from source code using setuptools. Check out the source code and run the following command:

pip install -e .

Get Started

We use Hydra to control all the training configurations. If you are not familiar with Hydra, we recommend visiting the Hydra website. In short, Hydra is an open-source framework that simplifies the development of research applications by providing the ability to create a hierarchical configuration dynamically.
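To give a feel for what "hierarchical configuration" means for command-line arguments such as `train.dataset_path=...`, here is a toy sketch that applies dotted `key=value` overrides to a nested dictionary. It is not Hydra's implementation and does not use Hydra's API: the function name is hypothetical, values stay plain strings, and in real Hydra an argument like `model=las` selects a config group file rather than setting a string.

```python
def apply_overrides(config, overrides):
    """Apply dotted key=value overrides to a nested dict (toy illustration)."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the hierarchy, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


config = {"model": "ds2", "train": {"dataset_path": None}}
apply_overrides(config, ["model=las", "train.dataset_path=/data/kspon"])
print(config)
```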

Preparing KsponSpeech Dataset (LibriSpeech is also supported)

Download from here or refer to the following to preprocess.

Training KsponSpeech Dataset

You can choose from several models and training options. There are many other training options, so look carefully and execute the following command:

  • Deep Speech 2 Training
python ./bin/main.py model=ds2 train=ds2_train train.dataset_path=$DATASET_PATH
  • Listen, Attend and Spell Training
python ./bin/main.py model=las train=las_train train.dataset_path=$DATASET_PATH
  • Joint CTC-Attention Listen, Attend and Spell Training
python ./bin/main.py model=joint-ctc-attention-las train=las_train train.dataset_path=$DATASET_PATH
  • RNN Transducer Training
python ./bin/main.py model=rnnt train=rnnt_train train.dataset_path=$DATASET_PATH
  • Speech Transformer Training
python ./bin/main.py model=transformer train=transformer_train train.dataset_path=$DATASET_PATH
  • Joint CTC-Attention Speech Transformer Training
python ./bin/main.py model=joint-ctc-attention-transformer train=transformer_train train.dataset_path=$DATASET_PATH
  • Jasper Training
python ./bin/main.py model=jasper train=jasper_train train.dataset_path=$DATASET_PATH
  • Conformer Training
python ./bin/main.py model=conformer-large train=conformer_large_train train.dataset_path=$DATASET_PATH

You can also train the conformer-medium and conformer-small models.

Evaluate for KsponSpeech

python ./bin/eval.py eval.dataset_path=$DATASET_PATH eval.transcripts_path=$TRANSCRIPTS_PATH eval.model_path=$MODEL_PATH

Now you have a model which you can use to predict on new data. We do this by running greedy search or beam search.
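The difference between the two search strategies can be sketched with a toy next-token model. Greedy search is beam search with a beam size of 1: it keeps only the single best hypothesis at every step, while a wider beam keeps several and can recover a better overall sequence. The probability table and function names below are hypothetical and unrelated to the toolkit's decoder.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.1, "b": 0.9},
}


def next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]


def beam_search(step_fn, start, steps, beam_size):
    """Return the best (tokens, log_prob) hypothesis after a fixed number of steps."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        # Expand every kept hypothesis by every possible next token.
        candidates = [
            (tokens + [token], score + math.log(prob))
            for tokens, score in beams
            for token, prob in step_fn(tokens).items()
        ]
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


greedy_tokens, greedy_score = beam_search(next_token_probs, "<s>", steps=2, beam_size=1)
beam_tokens, beam_score = beam_search(next_token_probs, "<s>", steps=2, beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With this table, greedy search commits to "a" at the first step and ends with a sequence of probability 0.33, while a beam of size 2 keeps "b" alive and finds the sequence "b b" with probability 0.36.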

Inference on a Single Audio File with Pre-trained Models

  • Command
$ python3 ./bin/inference.py --model_path $MODEL_PATH --audio_path $AUDIO_PATH --device $DEVICE
  • Output
음성인식 결과 문장이 나옵니다 (the recognized sentence is printed here)

You can quickly check a pre-trained model's inference on a single audio file.

Checkpoints

Checkpoints are organized by experiments and timestamps as shown in the following file structure.

outputs
+-- YYYY_mm_dd
|  +-- HH_MM_SS
|  |  +-- trainer_states.pt
|  |  +-- model.pt

You can resume and load from checkpoints.
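Because the date and time directory names are zero-padded, sorting them lexicographically also sorts them chronologically. The sketch below uses that property to locate the most recent run directory that contains a `model.pt` file. The helper name is hypothetical and the code only finds the directory; it does not load the checkpoint.

```python
from pathlib import Path


def latest_checkpoint_dir(outputs_root):
    """Return the newest outputs/YYYY_mm_dd/HH_MM_SS directory with a model.pt, or None."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # Zero-padded date/time names sort chronologically when sorted as paths.
    runs = sorted(run for run in root.glob("*/*") if (run / "model.pt").is_file())
    return runs[-1] if runs else None


print(latest_checkpoint_dir("outputs"))
```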

Troubleshooting and Contributing

If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please go to our Gitter or contact [email protected].

We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes, documentation improvement. For major contributions and new features, please discuss with the collaborators in corresponding issues.

Code Style

We follow PEP-8 for code style. Docstring style is especially important because the documentation is generated from the docstrings.

Paper References

Ilya Sutskever et al. Sequence to Sequence Learning with Neural Networks arXiv: 1409.3215

Dzmitry Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate arXiv: 1409.0473

Jan Chorowski et al. Attention Based Models for Speech Recognition arXiv: 1506.07503

William Chan et al. Listen, Attend and Spell arXiv: 1508.01211

Dario Amodei et al. Deep Speech2: End-to-End Speech Recognition in English and Mandarin arXiv: 1512.02595

Takaaki Hori et al. Advances in Joint CTC-Attention based E2E Automatic Speech Recognition with a Deep CNN Encoder and RNN-LM arXiv: 1706.02737

Ashish Vaswani et al. Attention Is All You Need arXiv: 1706.03762

Chung-Cheng Chiu et al. State-of-the-art Speech Recognition with Sequence-to-Sequence Models arXiv: 1712.01769

Anjuli Kannan et al. An Analysis Of Incorporating An External LM Into A Sequence-to-Sequence Model arXiv: 1712.01996

Daniel S. Park et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition arXiv: 1904.08779

Rafael Muller et al. When Does Label Smoothing Help? arXiv: 1906.02629

Daniel S. Park et al. SpecAugment on large scale datasets arXiv: 1912.05533

Jung-Woo Ha et al. ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers arXiv: 2004.09367

Jason Li et al. Jasper: An End-to-End Convolutional Neural Acoustic Model arXiv: 1902.03288

Anmol Gulati et al. Conformer: Convolution-augmented Transformer for Speech Recognition arXiv: 2005.08100

Github References

IBM/Pytorch-seq2seq

SeanNaren/deepspeech.pytorch

kaituoxu/Speech-Transformer

OpenNMT/OpenNMT-py

clovaai/ClovaCall

LiyuanLucasLiu/RAdam

NVIDIA/DeepLearningExamples

espnet/espnet

License

This project is licensed under the Apache-2.0 LICENSE - see the LICENSE.md file for details

Citation

A paper on KoSpeech is available. If you use the system for academic work, please cite:

@ARTICLE{2021-kospeech,
  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},
  title     = {KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition},
  url       = {https://www.sciencedirect.com/science/article/pii/S2665963821000026},
  month     = {February},
  year      = {2021},
  publisher = {ELSEVIER},
  journal   = {SIMPAC},
  pages     = {Volume 7, 100054}
}

A technical report on KoSpeech is also available.

@TECHREPORT{2020-kospeech,
  author    = {Kim, Soohwan and Bae, Seyoung and Won, Cheolhwang},
  title     = {KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition},
  month     = {September},
  year      = {2020},
  url       = {https://arxiv.org/abs/2009.03092},
  journal   = {ArXiv e-prints},
  eprint    = {2009.03092}
}
