  • Stars: 338
  • Rank: 124,931 (Top 3 %)
  • Language: Python
  • License: Apache License 2.0
  • Created almost 4 years ago
  • Updated almost 4 years ago

PIKA: a lightweight speech processing toolkit based on PyTorch and (Py)Kaldi

PIKA is a lightweight speech processing toolkit based on PyTorch and (Py)Kaldi. The first release focuses on end-to-end speech recognition. We use PyTorch as the deep learning engine and Kaldi for data formatting and feature extraction.

Key Features

  • On-the-fly data augmentation and feature extraction loader

  • TDNN-Transformer encoder with convolution- and Transformer-based decoder model structures

  • RNNT training and batch decoding

  • RNNT decoding with external n-gram FSTs (on-the-fly rescoring, a.k.a. shallow fusion)

  • RNNT Minimum Bayes Risk (MBR) training

  • LAS forward and backward rescorer for RNNT

  • Efficient BMUF (Blockwise Model-Update Filtering) based distributed training
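
To give a feel for the BMUF feature listed above, here is a minimal single-process sketch of one block-level synchronization step with classical block momentum. It is an illustration under stated assumptions, not PIKA's distributed implementation: the function name `bmuf_update`, the parameters `block_momentum` and `block_lr`, and their default values are hypothetical, and models are represented as plain lists of floats.

```python
from typing import List, Tuple


def bmuf_update(
    global_model: List[float],
    worker_models: List[List[float]],
    prev_delta: List[float],
    block_momentum: float = 0.875,  # hypothetical default
    block_lr: float = 1.0,          # hypothetical default
) -> Tuple[List[float], List[float]]:
    """One block-level synchronization step (classical block momentum).

    global_model: parameters broadcast to the workers at the start of the block.
    worker_models: parameters of each worker after training on its data split.
    prev_delta: filtered update from the previous block (zeros for the first).
    Returns the new global model and the new filtered update.
    """
    num_workers = len(worker_models)
    new_model, new_delta = [], []
    for i, w_prev in enumerate(global_model):
        # Average the workers' parameters (what plain model averaging would use).
        avg = sum(m[i] for m in worker_models) / num_workers
        # Block-level "gradient": how far the average moved during this block.
        g = avg - w_prev
        # Filter the update with momentum instead of applying it directly.
        d = block_momentum * prev_delta[i] + block_lr * g
        new_delta.append(d)
        new_model.append(w_prev + d)
    return new_model, new_delta
```

With `block_momentum=0.0` and `block_lr=1.0` this reduces to plain model averaging; the momentum term is what lets the filtered update carry information across blocks.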

Installation and Dependencies

In general, we recommend Anaconda since it comes with most dependencies. The other major dependencies are listed below.

PyTorch

Please go to https://pytorch.org/ for PyTorch installation. The code and scripts should run against PyTorch 0.4.0 and above, but we recommend 1.0.0 or above for compatibility with the RNNT loss module (see below).

PyKaldi and Kaldi

We use Kaldi (https://github.com/kaldi-asr/kaldi) and PyKaldi (a Python wrapper for Kaldi) for data processing, feature extraction and FST manipulations. Please go to the PyKaldi website https://github.com/pykaldi/pykaldi for installation, and make sure to build PyKaldi with ninja for efficiency. After following the PyKaldi installation process, you should have both the Kaldi and PyKaldi dependencies ready.

CUDA-Warp RNN-Transducer

For the RNNT loss module, we adopt the PyTorch binding at https://github.com/1ytic/warp-rnnt.

Others

Check requirements.txt for other dependencies.
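
Before running any script, it can help to confirm that the major dependencies are importable in the active environment. The following is a small sketch, not part of PIKA; the helper `check_modules` is hypothetical, and the import names `torch`, `kaldi` and `warp_rnnt` are assumed to be the top-level modules installed by PyTorch, PyKaldi and warp-rnnt respectively.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Return, for each top-level module name, whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names: "torch" (PyTorch), "kaldi" (PyKaldi), "warp_rnnt" (warp-rnnt).
    for name, found in check_modules(["torch", "kaldi", "warp_rnnt"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only looks the modules up without importing them, so it stays fast even when the libraries are large.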

Get Started

To get started, check the training and decoding scripts located in the egs directory.

I. Data preparation and RNNT training

egs/train_transducer_bmuf_otfaug.sh contains data preparation and RNNT training. One needs to prepare the training data and specify the training data directory:

# The training data dir must contain wav.scp and label.txt files.
# wav.scp: standard Kaldi wav.scp file, see https://kaldi-asr.org/doc/data_prep.html
# label.txt: label text file in the format "uttid sequence-of-integers", where each
#            integer is a one-based label index (zero is reserved for blank),
#            e.g., utt_id_1 3 5 7 10 23
train_data_dir=
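
The label.txt format described above can be produced with a few lines of Python. This is a minimal sketch under assumptions: the helpers `build_vocab` and `format_label_line` and the toy token set are hypothetical and not part of PIKA; the only facts taken from the description are the "uttid followed by integers" layout and that index 0 is reserved for blank.

```python
from typing import Dict, List


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign one-based indices to the sorted unique tokens (0 is left for blank)."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)), start=1)}


def format_label_line(utt_id: str, tokens: List[str], vocab: Dict[str, int]) -> str:
    """Map tokens to one-based integer labels and format one label.txt line."""
    ids = [vocab[t] for t in tokens]
    if any(i < 1 for i in ids):
        raise ValueError("labels must be >= 1; 0 is reserved for blank")
    return " ".join([utt_id] + [str(i) for i in ids])


vocab = build_vocab(["a", "b", "c"])
line = format_label_line("utt_id_1", ["c", "a", "b"], vocab)
print(line)  # utt_id_1 3 1 2
```

Starting the enumeration at 1 is the design choice that keeps label 0 free for the blank symbol.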

II. Continue with MBR training

With a trained RNNT model, one can continue with MBR training using egs/train_transducer_mbr_bmuf_otfaug.sh (assuming the same training data is used, so data preparation is omitted). Make sure to specify the initial model:

--verbose \
--optim sgd \
--init_model $exp_dir/init.model \
--rnnt_scale 1.0 \
--sm_scale 0.8 \
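
For readers unfamiliar with MBR training, the quantity being minimized is an expected risk over an n-best list. The sketch below shows a generic textbook formulation (expected edit distance under a renormalized n-best posterior); it is not claimed to match the loss implemented in the script. The function names and the `scale` parameter are hypothetical and are not asserted to correspond to any of the flags shown above.

```python
import math
from typing import List, Sequence


def edit_distance(hyp: Sequence[int], ref: Sequence[int]) -> int:
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # drop a hypothesis label
                            curr[j - 1] + 1,          # add a reference label
                            prev[j - 1] + (h != r)))  # substitution or match
        prev = curr
    return prev[-1]


def expected_risk(nbest: List[Sequence[int]], log_probs: List[float],
                  ref: Sequence[int], scale: float = 1.0) -> float:
    """Expected edit distance under the n-best posterior (softmax of scaled scores)."""
    m = max(scale * lp for lp in log_probs)  # subtract the max for numerical stability
    weights = [math.exp(scale * lp - m) for lp in log_probs]
    total = sum(weights)
    return sum(w / total * edit_distance(h, ref) for w, h in zip(weights, nbest))
```

In actual training the hypothesis scores come from the model, so gradients of this risk flow back through the renormalized posterior.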

III. Training LAS forward and backward rescorer

One can train forward and backward LAS rescorers for an RNNT model using egs/train_las_rescorer_bmuf_otfaug.sh. The LAS rescorer shares the encoder with the RNNT model and adds a two-layer LSTM as an additional encoder. Make sure to specify the shared encoder model:

--num_batches_per_epoch 526264 \
--shared_encoder_model $exp_dir/final.model \
--num_epochs 5 \

We support bi-directional LAS rescoring, i.e., forward and backward rescoring. Backward (right-to-left) rescoring is achieved by reversing the label sequences during LAS model training. One can train a backward LAS rescorer by specifying:

--reverse_labels
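
The effect of label reversal on a single training example can be illustrated as follows. This is only a sketch of the data transformation described above; the helper `reverse_label_line` is hypothetical, and how the flag is handled inside the training code is not shown here.

```python
def reverse_label_line(line: str) -> str:
    """Reverse the labels of one "uttid id1 id2 ..." line, keeping the uttid first."""
    utt_id, *labels = line.split()
    return " ".join([utt_id] + labels[::-1])


print(reverse_label_line("utt_id_1 3 5 7 10 23"))  # utt_id_1 23 10 7 5 3
```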

IV. Decoding

egs/eval_transducer.sh is the main evaluation script, which contains the decoding pipeline. Forward and backward LAS rescoring can be enabled by specifying these two models:

##########configs#############
#rnn transducer model
rnnt_model=
#forward and backward las rescorer model
lasrescorer_fw=
lasrescorer_bw=
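
Conceptually, n-best rescoring with forward and backward models amounts to a weighted combination of scores per hypothesis. The sketch below is a generic illustration under assumptions, not the pipeline in the evaluation script: the function `rescore_nbest`, the scorer callables, and the interpolation weights `fw_weight` and `bw_weight` are all hypothetical. The one detail taken from the text is that the backward rescorer works on reversed label sequences.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[Sequence[int]], float]


def rescore_nbest(
    nbest: List[Tuple[List[int], float]],
    fw_scorer: Scorer,
    bw_scorer: Scorer,
    fw_weight: float = 0.5,  # hypothetical interpolation weight
    bw_weight: float = 0.5,  # hypothetical interpolation weight
) -> List[int]:
    """Pick the hypothesis with the best combined log-score."""
    def combined(item: Tuple[List[int], float]) -> float:
        hyp, rnnt_score = item
        # The backward rescorer is trained on reversed labels, so it scores hyp[::-1].
        return rnnt_score + fw_weight * fw_scorer(hyp) + bw_weight * bw_scorer(hyp[::-1])
    return max(nbest, key=combined)[0]
```

Usage would pass the RNNT n-best list as `(labels, log_score)` pairs together with two callables that return log-scores from the forward and backward rescorers.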

Caveats

All the training and decoding hyper-parameters were chosen based on large-scale (e.g., 60k hours) training and internal evaluation data. One might need to re-tune them to achieve optimal performance. Also, the WER (CER) scoring script is based on a Mandarin task; we recommend that those who work on other languages rewrite the scoring scripts.

References

[1] Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition, Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, Dong Yu, Interspeech 2018

[2] Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition, Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu, Interspeech 2020

Citations

@inproceedings{Weng2020,
  author={Chao Weng and Chengzhu Yu and Jia Cui and Chunlei Zhang and Dong Yu},
  title={{Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={966--970},
  doi={10.21437/Interspeech.2020-1221},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1221}
}

@inproceedings{Weng2018,
  author={Chao Weng and Jia Cui and Guangsen Wang and Jun Wang and Chengzhu Yu and Dan Su and Dong Yu},
  title={Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={761--765},
  doi={10.21437/Interspeech.2018-1030},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1030}
}

Disclaimer

This is not an officially supported Tencent product.
