GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao | Zhejiang University, Sea AI Lab
PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
News
- December 2022: GenerSpeech (NeurIPS 2022) released on GitHub.
Key Features
- Multi-level Style Transfer for expressive text-to-speech.
- Enhanced model generalization to out-of-distribution (OOD) style reference.
Quick Start
We provide an example of how you can generate high-fidelity samples using GenerSpeech.
To try it on your own dataset, simply clone this repo to a local machine equipped with an NVIDIA GPU and CUDA/cuDNN, and follow the instructions below.
Supported Datasets and Pretrained Models
You can use the pretrained models we provide here. Details of each folder are as follows:
Model | Dataset (16 kHz) | Description
---|---|---
GenerSpeech | LibriTTS, ESD | Acoustic model (config)
HIFI-GAN | LibriTTS, ESD | Neural vocoder
Encoder | / | Emotion encoder
More supported datasets are coming soon.
Dependencies
A suitable conda environment named `generspeech` can be created and activated with:
```
conda env create -f environment.yaml
conda activate generspeech
```
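As an optional sanity check (not part of the original setup steps), you can confirm that the installed PyTorch build can see your GPU before moving on:

```
# Should print "True" if CUDA and the GPU drivers are set up correctly
python -c "import torch; print(torch.cuda.is_available())"
```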
Multi-GPU
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. You can restrict which GPUs are used by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.
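For example, to run training on GPUs 0 and 1 only (the training command itself is the one given in the Training section below):

```
# Restrict training to the first two GPUs
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```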
Inference towards style transfer of custom voice
Here we provide a speech synthesis pipeline using GenerSpeech.
- Prepare GenerSpeech (acoustic model): download and put the checkpoint at `checkpoints/GenerSpeech`
- Prepare HIFI-GAN (neural vocoder): download and put the checkpoint at `checkpoints/trainset_hifigan`
- Prepare Emotion Encoder: download and put the checkpoint at `checkpoints/Emotion_encoder.pt`
- Prepare dataset: download and put the statistical files at `data/binary/training_set`
- Prepare `path/to/reference_audio` (16 kHz): by default, GenerSpeech uses ASR + MFA to obtain the text-speech alignment from the reference audio.
```
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"
```
Generated wav files are saved in `infer_out` by default.
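If you want to synthesize the same text with several different reference styles, you can simply loop over reference audio files with the same command; the second path below is a placeholder for your own 16 kHz reference clip.

```
# Loop over several reference clips (the second path is a placeholder)
for ref in assets/0011_001570.wav path/to/your_reference.wav; do
  CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py \
    --config modules/GenerSpeech/config/generspeech.yaml \
    --exp_name GenerSpeech \
    --hparams="text='here we go',ref_audio='$ref'"
done
```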
Train your own model
Data Preparation and Configuration
- Set `raw_data_dir`, `processed_data_dir`, and `binary_data_dir` in the config file, and download the dataset to `raw_data_dir` (see the illustrative config excerpt after the preprocessing commands below).
- Check `preprocess_cls` in the config file. The dataset structure needs to follow the processor `preprocess_cls`, or you can rewrite the processor to fit your dataset. We provide a LibriTTS processor as an example in `modules/GenerSpeech/config/generspeech.yaml`.
- Download the global emotion encoder to `emotion_encoder_path`. For more details, please refer to this branch.
- Preprocess the dataset:
```
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
```
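As an orientation, the data-related keys mentioned above could be set as in the following illustrative excerpt. The key names come from the steps above, but the paths and the `preprocess_cls` value are placeholders rather than the actual entries, so check `modules/GenerSpeech/config/generspeech.yaml` for the real values.

```
# Illustrative config excerpt; the paths and the preprocess_cls value below are
# placeholders, not the actual entries in modules/GenerSpeech/config/generspeech.yaml.
raw_data_dir: data/raw/LibriTTS            # where the downloaded dataset lives
processed_data_dir: data/processed/training_set
binary_data_dir: data/binary/training_set
preprocess_cls: path.to.YourPreprocessor   # processor class matching your dataset layout
emotion_encoder_path: checkpoints/Emotion_encoder.pt
```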
You could also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).
Training GenerSpeech
```
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```
Inference using GenerSpeech
```
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --infer
```
Acknowledgements
This implementation uses parts of the code from the following GitHub repos: FastDiff, NATSpeech, as described in our code.
Citations
If you find this code useful in your research, please cite our work:
```
@inproceedings{huanggenerspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this provision may put you in violation of copyright laws.