PortaSpeech - PyTorch Implementation
PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech.
Audio Samples
Audio samples are available at /demo.
Model Size
| Module | Normal | Small | Normal (paper) | Small (paper) |
| --- | --- | --- | --- | --- |
| Total | 24M | 7.6M | 21.8M | 6.7M |
| LinguisticEncoder | 3.7M | 1.4M | - | - |
| VariationalGenerator | 11M | 2.8M | - | - |
| FlowPostNet | 9.3M | 3.4M | - | - |
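The per-module sizes above can be reproduced by counting trainable parameters. A minimal sketch, assuming standard PyTorch modules (the attribute names in the comments are hypothetical and should be adjusted to the actual model definition):

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical usage; replace the attribute names with those in the actual model:
# model = get_model(args, configs, device)
# print(f"Total: {count_parameters(model) / 1e6:.1f}M")
# print(f"LinguisticEncoder: {count_parameters(model.linguistic_encoder) / 1e6:.1f}M")
```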
Quickstart
In the following documents, DATASET refers to the name of a dataset such as LJSpeech.
Dependencies
You can install the Python dependencies with
pip3 install -r requirements.txt
Also, a Dockerfile is provided for Docker users.
Inference
You have to download the pretrained models and put them in output/ckpt/DATASET/.
For a single-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
The generated utterances will be put in output/result/.
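For example, with an LJSpeech checkpoint (the restore step below is only a placeholder; use the step of the checkpoint you downloaded):
python3 synthesize.py --text "Portable and high quality text to speech." --restore_step 125000 --mode single --dataset LJSpeech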
Batch Inference
Batch inference is also supported; try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in preprocessed_data/DATASET/val.txt.
Controllability
The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratio. For example, one can increase the speaking rate by 20% by
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8
Please note that this controllability originates from FastSpeech2 and is not a core feature of PortaSpeech.
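Under the hood, --duration_control follows the FastSpeech2-style length regulation: the predicted phoneme durations are scaled by the given ratio before the frames are expanded. A minimal sketch of the idea, not the repository's exact code:

```python
import torch

def scale_durations(log_duration_pred: torch.Tensor, d_control: float = 1.0) -> torch.Tensor:
    """Scale predicted durations by a control ratio; d_control < 1.0 gives faster speech."""
    # Assumes the predictor outputs durations in log space, log(duration + 1).
    durations = torch.exp(log_duration_pred) - 1.0
    return torch.clamp(torch.round(durations * d_control), min=0).long()

# With --duration_control 0.8, each phoneme is shortened to ~80% of its predicted length.
print(scale_durations(torch.log(torch.tensor([5.0, 9.0, 13.0]) + 1.0), d_control=0.8))
# tensor([ 4,  7, 10])
```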
Training
Datasets
The supported datasets are
- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
Preprocessing
Run
python3 prepare_align.py --dataset DATASET
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
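For reference, the per-phoneme durations (in mel frames) are typically derived from the MFA TextGrids by mapping each interval of the "phones" tier onto frame indices. A minimal sketch using the tgt library; the tier name, sampling rate, and hop length are assumptions, and the repository's preprocessing may differ in details:

```python
import tgt

def textgrid_to_durations(tg_path: str, sampling_rate: int = 22050, hop_length: int = 256):
    """Read an MFA TextGrid and return (phones, per-phone durations in mel frames)."""
    textgrid = tgt.io.read_textgrid(tg_path)
    tier = textgrid.get_tier_by_name("phones")  # assumed MFA tier name
    phones, durations = [], []
    for interval in tier.intervals:
        start_frame = int(round(interval.start_time * sampling_rate / hop_length))
        end_frame = int(round(interval.end_time * sampling_rate / hop_length))
        phones.append(interval.text)
        durations.append(end_frame - start_frame)
    return phones, durations
```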
After that, run the preprocessing script by
python3 preprocess.py --dataset DATASET
Training
Train your model with
python3 train.py --dataset DATASET
Useful options:
- To use Automatic Mixed Precision, append the --use_amp argument to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
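For example, to train on LJSpeech with mixed precision on GPUs 0 and 1 (the GPU IDs are illustrative):
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset LJSpeech --use_amp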
TensorBoard
Use
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
Loss curves are provided for both the normal model and the small model.
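Internally, this kind of logging is usually done with torch.utils.tensorboard; the snippet below only illustrates the kind of entries you will see (the tags, values, and log subdirectory are made up, not the repository's actual logging code):

```python
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter

# Dummy values standing in for real training outputs.
step = 1000
total_loss = 1.23
wav = torch.from_numpy(np.random.uniform(-1.0, 1.0, 22050).astype(np.float32))

writer = SummaryWriter("output/log/demo")  # illustrative subdirectory under output/log
writer.add_scalar("Loss/total", total_loss, step)                                 # loss curve
writer.add_audio("Audio/synthesized", wav.unsqueeze(0), step, sample_rate=22050)  # audio sample
writer.close()
```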
Notes
- For the vocoder, HiFi-GAN and MelGAN are supported.
- ReLU activation and LayerNorm are omitted in VariationalGenerator to avoid mashed output.
- To speed up the convergence of word-to-phoneme alignment in LinguisticEncoder, long words are divided into subwords and the dataset is sorted by mel-spectrogram frame length.
- There are two kinds of helper loss to improve word-to-phoneme alignment: "ctc" and "dga". You can toggle them as follows:
# In the train.yaml
aligner:
  helper_type: "dga"  # ["dga", "ctc", "none"]
- "dga": Diagonal Guided Attention (DGA) Loss (see the sketch after this list)
- "ctc": Connectionist Temporal Classification (CTC) Loss with forward-sum algorithm
- If you set "none", no helper loss will be applied during training.
- The alignment comparison of the three methods ("dga", "ctc", and "none", from top to bottom):
- The default setting is "dga". Although "ctc" produces the strongest alignment, its output quality and accuracy are worse than those of "dga".
- Still, there is room to improve the output quality; audio quality and alignment (accuracy) seem to be a trade-off.
- The model will be extended to multi-speaker TTS.
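As referenced in the notes above, here is a minimal sketch of the diagonal guided attention idea: the soft alignment is penalized wherever it strays from the diagonal. This is illustrative only and may differ from the repository's implementation:

```python
import torch

def guided_attention_loss(attn: torch.Tensor, sigma: float = 0.2) -> torch.Tensor:
    """Penalize attention mass far from the diagonal.

    attn: (T_out, T_in) soft alignment matrix whose rows sum to 1.
    """
    T_out, T_in = attn.shape
    pos_out = torch.arange(T_out).unsqueeze(1) / T_out  # normalized output positions
    pos_in = torch.arange(T_in).unsqueeze(0) / T_in     # normalized input positions
    weight = 1.0 - torch.exp(-((pos_out - pos_in) ** 2) / (2.0 * sigma ** 2))
    return (attn * weight).mean()

# Example: a random soft alignment of 80 mel frames over 20 phonemes.
attn = torch.softmax(torch.randn(80, 20), dim=-1)
print(guided_attention_loss(attn))
```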
Citation
Please cite this repository via the "Cite this repository" button in the About section (top right of the main page).