ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech
Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren
PyTorch implementation of ProDiff (ACM Multimedia'22): a conditional diffusion probabilistic model capable of generating high-fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.
Visit our demo page for audio samples.
News
- April 2022: Our previous work FastDiff (IJCAI 2022) was released on GitHub.
- September 2022: ProDiff (ACM Multimedia 2022) was released on GitHub.
Key Features
- Extremely fast diffusion text-to-speech synthesis pipeline for potential industrial deployment.
- Tutorial and code base for speech diffusion models.
- Support for more diffusion mechanisms (e.g., guided diffusion) will be added.
Quick Start
We provide an example of how you can generate high-fidelity samples using ProDiff.
To try it on your own dataset, clone this repo onto a local machine with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.
Supported Datasets and Pretrained Models
Run the following Python snippet to download the weights:
from huggingface_hub import snapshot_download
downloaded_path = snapshot_download(repo_id="Rongjiehuang/ProDiff")
Then move the downloaded checkpoints to checkpoints/$Model/model_ckpt_steps_*.ckpt:
mv ${downloaded_path}/checkpoints/ checkpoints/
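The download-and-move step above can also be done entirely from Python. The following is a minimal sketch, not part of the repository: download_weights and copy_checkpoints are hypothetical helper names, and the sketch copies instead of moving, on the assumption that the Hugging Face cache may store snapshot files as symlinks that a plain move could leave dangling.

```python
import shutil
from pathlib import Path


def download_weights(repo_id="Rongjiehuang/ProDiff"):
    """Download the checkpoint snapshot and return its local path."""
    # Imported lazily so copy_checkpoints works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id))


def copy_checkpoints(downloaded_path, dest="checkpoints"):
    """Copy <downloaded_path>/checkpoints into dest and list the .ckpt files found."""
    src = Path(downloaded_path) / "checkpoints"
    if not src.is_dir():
        raise FileNotFoundError(f"no checkpoints directory under {downloaded_path}")
    # copytree follows symlinks by default, so real files end up in dest.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return sorted(Path(dest).rglob("*.ckpt"))


if __name__ == "__main__":
    for ckpt in copy_checkpoints(download_weights()):
        print(ckpt)
```

Running the script should print the checkpoint files now available under checkpoints/.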
Details of each folder are as follows:
Model | Dataset | Config |
---|---|---|
ProDiff Teacher | LJSpeech | modules/ProDiff/config/prodiff_teacher.yaml |
ProDiff | LJSpeech | modules/ProDiff/config/prodiff.yaml |
More supported datasets are coming soon.
Dependencies
See requirements in requirement.txt.
Multi-GPU
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count().
You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
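As a rough illustration of how a launcher script might read the GPU selection, here is a small sketch. It assumes the variable is CUDA_VISIBLE_DEVICES, as used in the commands throughout this README, and that it holds comma-separated integer indices (GPU UUIDs are not handled). visible_gpu_ids is a hypothetical helper, not the repository's own code.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: no restriction was requested
    # An empty string yields an empty list, i.e. no GPU selected.
    return [int(tok) for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    print(visible_gpu_ids())
```

For example, launching with CUDA_VISIBLE_DEVICES=0,2 would restrict the run to the first and third GPU.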
Extremely-Fast Text-to-Speech with diffusion probabilistic models
Here we provide a speech synthesis pipeline using diffusion probabilistic models: ProDiff (acoustic model) + FastDiff (neural vocoder).
- Prepare the acoustic model (ProDiff or ProDiff Teacher): download the LJSpeech checkpoint and put it in checkpoints/ProDiff or checkpoints/ProDiff_Teacher.
- Prepare the neural vocoder (FastDiff): download the LJSpeech checkpoint and put it in checkpoints/FastDiff.
- Specify the input text ($txt in the commands below), and set N, the number of reverse sampling steps in the neural vocoder, which trades off quality against speed.
- Run the following command for extremely fast synthesis (2-iter ProDiff + 4-iter FastDiff):
CUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --hparams="N=4,text='$txt'" --reset
Generated wav files are saved in infer_out by default.
Note: For better quality, it's recommended to finetune the FastDiff neural vocoder here.
- For a different speed-quality trade-off, run (4-iter ProDiff Teacher + 6-iter FastDiff):
CUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff_teacher.py --config modules/ProDiff/config/prodiff_teacher.yaml --exp_name ProDiff_Teacher --hparams="N=6,text='$txt'" --reset
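To call either pipeline from Python instead of the shell, the two commands above can be assembled as argument lists. This is a sketch under stated assumptions: build_inference_cmd and run_inference are hypothetical helpers, the script paths, config paths, and flags are copied from the commands above, and text containing a single quote is rejected because it would close the quoted text value early.

```python
import os
import subprocess

# Script and config paths exactly as they appear in this README.
PIPELINES = {
    "ProDiff": ("inference/ProDiff.py", "modules/ProDiff/config/prodiff.yaml"),
    "ProDiff_Teacher": (
        "inference/ProDiff_teacher.py",
        "modules/ProDiff/config/prodiff_teacher.yaml",
    ),
}


def build_inference_cmd(text, n_steps, exp_name="ProDiff"):
    """Build the argv list for one inference run; n_steps is the vocoder's N."""
    script, config = PIPELINES[exp_name]
    if "'" in text:
        raise ValueError("text must not contain a single quote")
    return [
        "python", script,
        "--config", config,
        "--exp_name", exp_name,
        f"--hparams=N={n_steps},text='{text}'",
        "--reset",
    ]


def run_inference(text, n_steps, exp_name="ProDiff", gpu="0"):
    """Run inference on the chosen GPU; raises if the script exits with an error."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    subprocess.run(build_inference_cmd(text, n_steps, exp_name), env=env, check=True)
```

Passing an argument list avoids a layer of shell quoting; how the repository parses commas inside the text value is not covered here.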
Train your own model
Data Preparation and Configuration
- Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file.
- Download the dataset to raw_data_dir. Note: the dataset structure needs to follow egs/datasets/audio/*/pre_align.py; alternatively, rewrite pre_align.py to match your dataset.
- Preprocess the dataset:
# Preprocess step: unify the file structure.
python data_gen/tts/bin/pre_align.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/runs/train_mfa_align.py --config $CONFIG_NAME
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
You could also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed LJSpeech dataset here.
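The three preprocessing steps can be chained so that a failure in one stops the rest. The sketch below assumes the script paths shown above; preprocessing_commands and run_preprocessing are hypothetical helpers, and config and align_config stand in for the $path/to/config and $CONFIG_NAME placeholders, whose exact values depend on your setup.

```python
import os
import subprocess


def preprocessing_commands(config, align_config):
    """Return the pre-align, MFA-align, and binarize commands in order."""
    return [
        ["python", "data_gen/tts/bin/pre_align.py", "--config", config],
        ["python", "data_gen/tts/runs/train_mfa_align.py", "--config", align_config],
        ["python", "data_gen/tts/bin/binarize.py", "--config", config],
    ]


def run_preprocessing(config, align_config, gpu="0"):
    """Run each step in turn, stopping at the first one that fails."""
    # The README sets CUDA_VISIBLE_DEVICES only for binarization;
    # setting it for all three steps here is a simplification.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    for cmd in preprocessing_commands(config, align_config):
        subprocess.run(cmd, env=env, check=True)
```

With check=True, a non-zero exit from any step raises subprocess.CalledProcessError, so binarization never runs on a partially aligned dataset.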
Training Teacher of ProDiff
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml --exp_name ProDiff_Teacher --reset
Training ProDiff
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --reset
Inference using ProDiff Teacher
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml --exp_name ProDiff_Teacher --infer
Inference using ProDiff
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --infer
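The four tasks/run.py commands above differ only in the config, the experiment name, and the final flag, so they can be generated from a small lookup. This is an illustrative sketch: run_py_cmd is a hypothetical helper, and the values are copied from the commands in this section.

```python
# Config paths as listed in this README.
CONFIGS = {
    "ProDiff_Teacher": "modules/ProDiff/config/prodiff_teacher.yaml",
    "ProDiff": "modules/ProDiff/config/prodiff.yaml",
}

# Final flag used by the README for each mode.
MODE_FLAGS = {"train": "--reset", "infer": "--infer"}


def run_py_cmd(exp_name, mode):
    """Build the tasks/run.py argv list for one experiment and mode."""
    return [
        "python", "tasks/run.py",
        "--config", CONFIGS[exp_name],
        "--exp_name", exp_name,
        MODE_FLAGS[mode],
    ]


if __name__ == "__main__":
    # Same order as the section above: teacher first, then ProDiff.
    for exp_name in ("ProDiff_Teacher", "ProDiff"):
        print(" ".join(run_py_cmd(exp_name, "train")))
```

Each list can be passed to subprocess.run together with an environment that sets CUDA_VISIBLE_DEVICES.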
Acknowledgements
This implementation uses parts of the code from the following GitHub repositories, as noted in our code: FastDiff, DiffSinger, and NATSpeech.
Citations
If you find this code useful in your research, please cite our work:
@inproceedings{huang2022prodiff,
title={ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech},
author={Huang, Rongjie and Zhao, Zhou and Liu, Huadai and Liu, Jinglin and Cui, Chenye and Ren, Yi},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
@inproceedings{huang2022fastdiff,
title={FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis},
author={Huang, Rongjie and Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong and Ren, Yi and Zhao, Zhou},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI-22}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
year={2022}
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.