# StyleSpeech - PyTorch Implementation
PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation.
# Branch
- StyleSpeech (`naive` branch)
- Meta-StyleSpeech (`main` branch)
# Quickstart
## Dependencies
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```
## Inference
You have to download the pretrained models and put them in `output/ckpt/LibriTTS_meta_learner/`.
For English multi-speaker TTS, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --ref_audio path/to/reference_audio.wav --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
```
The generated utterances will be put in `output/result/`. The synthesized speech will follow the style of `ref_audio`.
## Batch Inference
Batch inference is also supported; try
```
python3 synthesize.py --source preprocessed_data/LibriTTS/val.txt --restore_step 200000 --mode batch -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
```
to synthesize all utterances in `preprocessed_data/LibriTTS/val.txt`. This can be viewed as a reconstruction of the validation set, where each utterance uses itself as the style reference.
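Each line of the source file is expected to follow the pipe-separated format of ming024's FastSpeech2 preprocessing (basename, speaker, phoneme sequence, raw text); this is an assumption based on the referenced FastSpeech2 implementation, and the line below is purely hypothetical. Check your own `preprocessed_data/LibriTTS/val.txt` for the exact entries:
```
1088_134315_000040_000001|1088|{HH AH0 L OW1 W ER1 L D}|hello world
```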
## Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml --duration_control 0.8 --energy_control 0.8
```
Note that the controllability is inherited from FastSpeech2 and is not a primary focus of StyleSpeech. Please refer to STYLER [demo, code] for the controllability of each style factor.
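For intuition, here is a minimal sketch of how such ratios are typically applied in a FastSpeech2-style variance adaptor. The function and variable names are illustrative assumptions, not this repository's exact API:
```python
import torch

def apply_controls(durations, pitch, energy,
                   duration_control=1.0, pitch_control=1.0, energy_control=1.0):
    """Scale the predicted prosody values by user-specified ratios (illustrative sketch)."""
    # Fewer frames per phoneme means faster speech, so duration_control=0.8
    # corresponds to roughly a 20% higher speaking rate.
    durations = torch.clamp(torch.round(durations * duration_control), min=0)
    pitch = pitch * pitch_control      # scales the predicted pitch contour
    energy = energy * energy_control   # scales the predicted energy (volume)
    return durations, pitch, energy

d, p, e = apply_controls(torch.tensor([4.0, 6.0, 2.0]),   # frames per phoneme
                         torch.tensor([0.1, -0.2, 0.3]),  # normalized pitch
                         torch.tensor([0.5, 0.7, 0.6]),   # normalized energy
                         duration_control=0.8, energy_control=0.8)
print(d, e)  # shorter durations and lower energy than predicted
```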
# Training
## Datasets
The supported datasets are
- LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2,456 speakers.
- (more to be added)
## Preprocessing
Run
```
python3 prepare_align.py config/LibriTTS/preprocess.yaml
```
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files into `preprocessed_data/LibriTTS/TextGrid/`. Alternatively, you can run the aligner yourself.
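If you do run the aligner yourself, a typical invocation with MFA 2.x looks like the following; the corpus, lexicon, and acoustic model paths are placeholders, not files shipped with this repository:
```
mfa align path/to/LibriTTS_corpus path/to/lexicon.txt path/to/english_acoustic_model preprocessed_data/LibriTTS/TextGrid
```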
After that, run the preprocessing script by
```
python3 preprocess.py config/LibriTTS/preprocess.yaml
```
## Training
Train your model with
```
python3 train.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
```
As described in the paper, the script first pre-trains the naive model for `meta_learning_warmup` steps and then meta-trains the model for additional steps via episodic training.
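As a rough illustration of that schedule, the objective simply switches once the warmup step count is passed. The threshold value and function name below are placeholders for the sketch, not values taken from `train.yaml`:
```python
# Minimal sketch of the two-phase schedule: naive pre-training, then episodic
# meta-training. The warmup value is a placeholder; the real one is read from
# the meta_learning_warmup entry in the training config.
META_LEARNING_WARMUP = 60_000

def training_phase(step: int) -> str:
    """Return which objective is active at a given global step."""
    if step <= META_LEARNING_WARMUP:
        return "naive"     # reconstruction-only training of StyleSpeech
    return "episodic"      # Meta-StyleSpeech training with query/support speakers

print(training_phase(10_000))   # naive
print(training_phase(250_000))  # episodic
```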
## TensorBoard
Use
```
tensorboard --logdir output/log/LibriTTS
```
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
# Implementation Issues
- Use a 22050Hz sampling rate instead of 16kHz.
- Add one fully connected layer at the beginning of the Mel-Style Encoder to upsample the input mel-spectrogram from 80 to 128 (see the sketch after this list).
- The model size including the meta-learner is 28.197M parameters.
- Use a maximum batch size of 16 during training instead of 48 or 20, mainly due to the limited memory capacity of a single 24GiB TITAN-RTX. This can be achieved by filtering out data longer than `max_seq_len` with `python3 filelist_filtering.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml`, which will generate `train_filtered.txt` in the same location as `train.txt`.
- Since the total batch size is decreased, the number of training steps is doubled compared to the original paper.
- Use HiFi-GAN instead of MelGAN for vocoding.
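The fully connected layer mentioned in the second bullet can be pictured as a small input projection in front of the style encoder. Below is a minimal PyTorch sketch, assuming an 80-bin input mel-spectrogram and a 128-dimensional encoder width; the class and attribute names are illustrative, not this repository's exact modules:
```python
import torch
import torch.nn as nn

class MelStyleEncoderInput(nn.Module):
    """Illustrative prenet: project 80 mel bins up to the 128-dim encoder width."""
    def __init__(self, n_mel_channels=80, hidden_dim=128):
        super().__init__()
        self.prenet = nn.Linear(n_mel_channels, hidden_dim)  # the added FC layer

    def forward(self, mel):
        # mel: (batch, time, 80) -> (batch, time, 128), then fed to the encoder body
        return self.prenet(mel)

x = torch.randn(2, 200, 80)             # dummy batch of two mel-spectrograms
print(MelStyleEncoderInput()(x).shape)  # torch.Size([2, 200, 128])
```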
# Citation
```
@misc{lee2021stylespeech,
  author = {Lee, Keon},
  title = {StyleSpeech},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/StyleSpeech}}
}
```
# References
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
- A Style-Based Generator Architecture for Generative Adversarial Networks
- Matching Networks for One Shot Learning
- Prototypical Networks for Few-shot Learning
- TADAM: Task dependent adaptive metric for improved few-shot learning
- ming024's FastSpeech2