SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis
By Bohan Zhai *, Tianren Gao *, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph Gonzalez, and Kurt Keutzer (UC Berkeley)
Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs.
Link to the paper: https://arxiv.org/abs/2001.05685. If you find this work useful, please consider citing:
```
@article{squeezewave,
  Author = {Bohan Zhai and Tianren Gao and Flora Xue and Daniel Rothchild and Bichen Wu and Joseph Gonzalez and Kurt Keutzer},
  Title = {SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis},
  Journal = {arXiv:2001.05685},
  Year = {2020}
}
```
Audio samples generated by SqueezeWave
Audio samples of SqueezeWave are here: https://tianrengao.github.io/SqueezeWaveDemo/
Results
We introduce 4 variants of SqueezeWave in our paper. See the table below.
| Model | length | n_channels | MACs (G) | Reduction (vs. WaveGlow) | MOS |
|---|---|---|---|---|---|
| WaveGlow | 2048 | 8 | 228.9 | 1x | 4.57 ± 0.04 |
| SqueezeWave-128L | 128 | 256 | 3.78 | 60x | 4.07 ± 0.06 |
| SqueezeWave-64L | 64 | 256 | 2.16 | 106x | 3.77 ± 0.05 |
| SqueezeWave-128S | 128 | 128 | 1.06 | 214x | 3.79 ± 0.05 |
| SqueezeWave-64S | 64 | 128 | 0.68 | 332x | 2.74 ± 0.04 |
Model Complexity
A detailed MAC calculation can be found here; the sketch below gives a rough illustration of how MACs scale with temporal length and channel count.
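For intuition only, here is a hypothetical sketch of the MACs of a single 1D convolution, using the standard approximation MACs ≈ L_out × (C_in / groups) × C_out × k. The layer sizes below are made up purely for illustration and are not the paper's exact accounting.

```python
def conv1d_macs(length, in_channels, out_channels, kernel_size, groups=1):
    """Approximate multiply-accumulates for one 1D convolution layer."""
    return length * (in_channels // groups) * out_channels * kernel_size

# Hypothetical layer sizes, chosen only to show the scaling: shrinking the
# temporal length and the channel width both reduce MACs linearly, which is
# the main lever SqueezeWave pulls relative to WaveGlow.
wide_long = conv1d_macs(length=2048, in_channels=512, out_channels=512, kernel_size=3)
narrow_short = conv1d_macs(length=128, in_channels=256, out_channels=256, kernel_size=3)
print(f"{wide_long:,} vs {narrow_short:,} MACs ({wide_long / narrow_short:.0f}x fewer)")
```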
Setup
- (Optional) Create a virtual environment:

  ```
  virtualenv env
  source env/bin/activate
  ```

- Clone our repo and initialize the submodule:

  ```
  git clone https://github.com/tianrengao/SqueezeWave.git
  cd SqueezeWave
  git submodule init
  git submodule update
  ```

- Install requirements:

  ```
  pip3 install -r requirements.txt
  ```

- Install Apex:

  ```
  cd ../
  git clone https://www.github.com/nvidia/apex
  cd apex
  python setup.py install
  ```

  A quick environment sanity check follows this list.
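Optionally, sanity-check the environment before moving on. This is only a convenience snippet assuming a standard PyTorch install; Apex is needed just for mixed-precision training.

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # only required for mixed-precision ("fp16_run") training
    print("Apex amp import: OK")
except ImportError:
    print("Apex not found; install it if you plan to enable fp16_run")
```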
Generate audio with our pretrained model
- Download our pretrained models. We provide 4 pretrained models, as described in the paper.

- Download the mel-spectrograms.

- Generate audio. Replace `SqueezeWave.pt` with the name of the specific pretrained model (a Python sketch for calling the model directly follows this list):

  ```
  python3 inference.py -f <(ls mel_spectrograms/*.pt) -w SqueezeWave.pt -o . --is_fp16 -s 0.6
  ```
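If you prefer to call the model from Python rather than through `inference.py`, the minimal sketch below shows the general idea. It assumes the checkpoint follows the WaveGlow-style convention used by `inference.py` (model stored under a `'model'` key, a `remove_weightnorm` helper, and an `infer(mel, sigma)` method); the file names are placeholders, so check `inference.py` for the exact API before relying on this.

```python
import torch
from scipy.io.wavfile import write

# Placeholder paths: substitute a downloaded checkpoint and mel-spectrogram file.
checkpoint_path = "SqueezeWave.pt"
mel_path = "mel_spectrograms/example.pt"

model = torch.load(checkpoint_path)["model"]
model = model.remove_weightnorm(model)  # WaveGlow-style cleanup before inference
model.cuda().eval()

mel = torch.load(mel_path).cuda().unsqueeze(0)  # add a batch dimension
with torch.no_grad():
    audio = model.infer(mel, sigma=0.6)

# Scale to 16-bit PCM and save; LJ Speech audio is sampled at 22,050 Hz.
audio = (audio * 32768.0).squeeze().cpu().numpy().astype("int16")
write("example_synthesis.wav", 22050, audio)
```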
Train your own model
- Download the LJ Speech dataset. We assume all the waveforms are stored in the `data/` directory.

- Make a list of the file names to use for training/testing:

  ```
  ls data/*.wav | tail -n+10 > train_files.txt
  ls data/*.wav | head -n10 > test_files.txt
  ```
- We provide 4 model configurations, with the audio length and number of channels of each variant listed in the Results table above. The configuration files are under the `configs/` directory. To choose the model you want to train, select the corresponding configuration file.

- Train your SqueezeWave model:

  ```
  mkdir checkpoints
  python train.py -c configs/config_a256_c128.json
  ```

  For multi-GPU training, replace `train.py` with `distributed.py`. This has only been tested with a single node and NCCL. For mixed-precision training, set `"fp16_run": true` in `config.json`.
- Make the test-set mel-spectrograms:

  ```
  mkdir -p eval/mels
  python3 mel2samp.py -f test_files.txt -o eval/mels -c configs/config_a128_c256.json
  ```
- Run inference on the test data (a short sketch for spot-checking the outputs follows this list):

  ```
  ls eval/mels > eval/mel_files.txt
  sed -i -e 's_.*_eval/mels/&_' eval/mel_files.txt
  mkdir -p eval/output
  python3 inference.py -f eval/mel_files.txt -w checkpoints/SqueezeWave_10000 -o eval/output --is_fp16 -s 0.6
  ```

  Replace `SqueezeWave_10000` with the checkpoint you want to test.
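To spot-check the synthesized files programmatically, something like the snippet below works. It only assumes the outputs are ordinary 16-bit WAV files; the path is a placeholder for whichever file `inference.py` wrote to `eval/output`.

```python
from scipy.io import wavfile

# Placeholder path: point this at any file produced in eval/output.
rate, audio = wavfile.read("eval/output/example_synthesis.wav")
print(f"sample rate: {rate} Hz, duration: {len(audio) / rate:.2f} s, dtype: {audio.dtype}")
```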
Credits
The implementation of this work is based on WaveGlow: https://github.com/NVIDIA/waveglow