• Stars: 220
  • Rank: 173,885 (top 4%)
  • Language: Python
  • License: MIT License
  • Created: almost 3 years ago
  • Updated: about 2 years ago


DiffSinger - PyTorch Implementation

PyTorch implementation of DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (focused on DiffSpeech).

Repository Status

  • Naive Version of DiffSpeech (not DiffSinger)
  • Auxiliary Decoder (from FastSpeech2)
  • An Easier Trick for Boundary Prediction of K
  • Shallow Version of DiffSpeech (Shallow Diffusion Mechanism): leveraging the pre-trained auxiliary decoder and training the denoiser with K as the maximum time step (a minimal sketch of this idea follows this list)
  • Multi-Speaker Training
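
The shallow diffusion mechanism can be summarized as follows: instead of denoising from pure Gaussian noise over the full number of timesteps, the auxiliary decoder's mel-spectrogram is forward-diffused to the boundary step K, and only K reverse steps are run. Below is a minimal PyTorch sketch of that sampling loop, assuming a trained denoiser(x_t, t, cond) callable, an auxiliary-decoder mel mel_aux, and a plain DDPM noise schedule; these names and the update rule are illustrative assumptions, not this repository's actual API.

import torch

def q_sample(x0, t, alphas_cumprod, noise):
    # Forward-diffuse a clean mel x0 to timestep t (standard DDPM q(x_t | x_0)).
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def shallow_sample(denoiser, mel_aux, cond, K, betas):
    # Start the reverse process from the aux-decoder mel diffused to step K - 1,
    # instead of from pure noise at the full number of timesteps.
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = q_sample(mel_aux, K - 1, alphas_cumprod, torch.randn_like(mel_aux))
    # Run only the first K reverse steps.
    for t in reversed(range(K)):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = denoiser(x, t_batch, cond)
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        mean = (x - (1.0 - a_t) / (1.0 - a_bar).sqrt() * eps) / a_t.sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x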

Quickstart

In the following sections, DATASET refers to the name of a dataset, such as LJSpeech.

MODEL refers to the type of model (one of 'naive', 'aux', or 'shallow').

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in

  • output/ckpt/LJSpeech_naive/ for 'naive' model.
  • output/ckpt/LJSpeech_shallow/ for 'shallow' model. Please note that the checkpoint of the 'shallow' model contains both 'shallow' and 'aux' models, and these two models will share all directories except results throughout the whole process.

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Please note that this controllability originates from FastSpeech2 and is not a primary focus of DiffSpeech.

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Preprocessing

First, run

python3 prepare_align.py --dataset DATASET

to prepare the audio and transcripts for alignment.

For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
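
After unzipping, it can be worth sanity-checking one of the alignment files. The snippet below is a quick check using the tgt (TextGridTools) package; the package, the example file path, and the tier name 'phones' are assumptions about the alignment format rather than guarantees from this repository, so adjust them to your setup.

import tgt  # assumed to be installed; e.g. pip3 install tgt

# Read one TextGrid and print the first few phone intervals (the path and
# tier name are illustrative; adjust to your dataset and speaker).
textgrid = tgt.io.read_textgrid("preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ001-0001.TextGrid")
phones = textgrid.get_tier_by_name("phones")
for interval in phones.intervals[:5]:
    print(interval.start_time, interval.end_time, interval.text)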

After that, run the preprocessing script by

python3 preprocess.py --dataset DATASET

Training

You can train three types of model: 'naive', 'aux', and 'shallow'.

  • Training Naive Version ('naive'):

    Train the naive version with

    python3 train.py --model naive --dataset DATASET
    
  • Training Auxiliary Decoder for Shallow Version ('aux'):

    To train the shallow version, we first need a pre-trained FastSpeech2. The command below trains the FastSpeech2 modules, including the auxiliary decoder.

    python3 train.py --model aux --dataset DATASET
    
  • An Easier Trick for Boundary Prediction:

    To get the boundary K from the validation dataset, you can run the boundary predictor using the pre-trained auxiliary FastSpeech2 with the following command.

    python3 boundary_predictor.py --restore_step RESTORE_STEP --dataset DATASET
    

    It will print out the predicted value (say, K_STEP) in the command log.

    Then, set the config with the predicted value as follows

    # In the model.yaml
    denoiser:
        K_step: K_STEP

    Please note that this is based on the trick introduced in Appendix B of the DiffSinger paper; an illustrative sketch of the idea appears after this list.

  • Training Shallow Version ('shallow'):

    To leverage the pre-trained FastSpeech2, including the auxiliary decoder, you must set restore_step to the final step of the auxiliary FastSpeech2 training, as in the following command.

    python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
    

    For example, if the last checkpoint of the auxiliary training is saved at 160000 steps, set restore_step to 160000. The aux model will then be loaded, and training will continue under the shallow diffusion mechanism.
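
As a rough illustration of the Appendix B boundary idea mentioned above (and not necessarily what boundary_predictor.py computes), one can forward-diffuse both the ground-truth mel and the auxiliary decoder's mel and pick the smallest step at which the two become indistinguishable relative to the injected noise. The function name, arguments, and threshold below are assumptions for illustration only.

import torch

def predict_boundary_K(mel_gt, mel_aux, betas, tol=0.05):
    # Find the smallest t at which q(x_t | mel_gt) and q(x_t | mel_aux) nearly
    # coincide, i.e. the gap between their means is negligible next to the noise std.
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    for t in range(len(betas)):
        a_bar = alphas_cumprod[t]
        gap = (a_bar.sqrt() * (mel_gt - mel_aux)).pow(2).mean().sqrt()
        noise_std = (1.0 - a_bar).sqrt()
        if gap < tol * noise_std:
            return t + 1  # number of reverse steps needed from this boundary
    return len(betas)  # fall back to the full (naive) diffusion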

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

(Figures omitted: Naive Diffusion, Shallow Diffusion, and Loss Comparison plots.)

Notes

  1. (Naive version of DiffSpeech) The number of learnable parameters is 27.767M, which is close to that of the original paper (27.722M).
  2. Unfortunately, the predicted boundary (for LJSpeech) in the current implementation is 100, which equals the full number of timesteps of the naive diffusion, so the shallow diffusion gains no advantage in the number of diffusion steps.
  3. HiFi-GAN is used instead of Parallel WaveGAN (PWG) for vocoding.

Citation

@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}

More Repositories

  1. PortaSpeech: PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech (Python, 330 stars)
  2. Comprehensive-Transformer-TTS: A Non-Autoregressive Transformer based Text-to-Speech, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS. (Python, 308 stars)
  3. DiffGAN-TTS: PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (Python, 293 stars)
  4. Expressive-FastSpeech2: PyTorch Implementation of Non-autoregressive Expressive (emotional, conversational) TTS based on FastSpeech2, supporting English, Korean, and your own languages. (Python, 256 stars)
  5. Parallel-Tacotron2: PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling (Python, 186 stars)
  6. StyleSpeech: PyTorch Implementation of Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation (Python, 177 stars)
  7. DailyTalk: Official repository of DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech, ICASSP 2023 (Oral) (Python, 175 stars)
  8. Cross-Speaker-Emotion-Transfer: PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech (Python, 169 stars)
  9. STYLER: Official repository of STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech, INTERSPEECH 2021 (Python, 150 stars)
  10. Comprehensive-E2E-TTS: A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate E2E-TTS. (Python, 140 stars)
  11. Soft-DTW-Loss: PyTorch implementation of Soft-DTW: a Differentiable Loss Function for Time-Series, in CUDA (Python, 113 stars)
  12. FastPitchFormant: PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis (Python, 70 stars)
  13. VAENAR-TTS: PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis (Python, 69 stars)
  14. WaveGrad2: PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis (Python, 66 stars)
  15. Daft-Exprt: PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis (Python, 54 stars)
  16. Comprehensive-Tacotron2: PyTorch Implementation of Google's Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This implementation supports both single- and multi-speaker TTS and several techniques to enforce the robustness and efficiency of the model. (Python, 42 stars)
  17. Robust_Fine_Grained_Prosody_Control: PyTorch Implementation of Robust and Fine-Grained Prosody Control of End-to-End Speech Synthesis (Python, 39 stars)
  18. Stepwise_Monotonic_Multihead_Attention: PyTorch Implementation of Stepwise Monotonic Multihead Attention, similar to Enhancing Monotonicity for Robust Autoregressive Transformer TTS (Python, 27 stars)
  19. Deep-Learning-TTS-Template: A template for non-autoregressive deep learning-based TTS models (in PyTorch) (Python, 14 stars)
  20. tacotron2_MMI: Another PyTorch implementation of Tacotron2 MMI (with WaveGlow), which supports n_frames_per_step>1 mode (reduction windows) and diagonal guided attention for robust alignments (Jupyter Notebook, 5 stars)
  21. Fully_Hierarchical_Fine_Grained_TTS: PyTorch Implementation of Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis (unofficial) (2 stars)
  22. cs231n: CS231n 2020 Spring assignment implementations (Jupyter Notebook, 2 stars)
  23. pintos: KAIST CS330 OS Pintos project (HTML, 1 star)