WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch + Lightning implementation of WaveGrad 2 by Chen et al. (JHU, Google Brain).


Update: Enjoy our pre-trained model with the Google Colab notebook!

TODO

  • More training for WaveGrad-Base setup
  • Checkpoint release for Base
  • WaveGrad-Large Decoder
  • Checkpoint release for Large
  • Inference by reduced sampling steps

Requirements

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
  • etc.

We take LJSpeech as an example hereafter.

Preprocessing

  • Adjust preprocess.yaml, especially the path section.
path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'
  • Run prepare_align.py for some preparation steps.
python prepare_align.py -c preprocess.yaml
  • Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. You have to unzip the files into preprocessed_data/LJSpeech/TextGrid/, as sketched below.
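A minimal extraction sketch in Python, assuming the downloaded alignment archive is a zip file named alignments.zip (the file name and its internal layout are assumptions; adjust the target path if the archive already contains a TextGrid/ folder):
import zipfile

# Extract the provided TextGrid alignments into the preprocessed data directory.
# 'alignments.zip' is a placeholder name for the downloaded archive.
with zipfile.ZipFile('alignments.zip') as archive:
    archive.extractall('preprocessed_data/LJSpeech/TextGrid/')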

  • After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
  • Alternatively, you can align the corpus yourself.
  • Download the official MFA package and run it to align the corpus.
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
  • And then run preprocess.py.
python preprocess.py -c preprocess.yaml

Training

  • Adjust hparameter.yaml, especially the train section; a sketch of how the adam values map onto an optimizer follows the block below.
train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0
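For reference, a minimal sketch of how the adam values above would typically be mapped onto a PyTorch optimizer (the placeholder model and the PyYAML loading step are illustrations, not the repository's actual training code):
import torch
import yaml  # PyYAML

# Read the train section of hparameter.yaml (structure as shown above).
with open('hparameter.yaml') as f:
    hparams = yaml.safe_load(f)

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the real model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=float(hparams['train']['adam']['lr']),                      # 3e-4
    weight_decay=float(hparams['train']['adam']['weight_decay']),  # 1e-6
)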
  • If you want to train with another dataset, adjust the data section in hparameter.yaml.
data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  • Run trainer.py.
python trainer.py
  • If you want to resume training from a checkpoint, check the argument parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--resume_from', type=int,
                    required=False, help="Resume checkpoint epoch number")
parser.add_argument('-s', '--restart', action="store_true",
                    required=False, help="Significant change occurred, use this")
parser.add_argument('-e', '--ema', action="store_true",
                    required=False, help="Start from EMA checkpoint")
args = parser.parse_args()
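Based on the options above, a command such as python trainer.py -r 100 -e should resume from the epoch-100 EMA checkpoint (100 is only an illustrative epoch number), while -s additionally signals that a significant change was made since that checkpoint.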
  • During training, the TensorBoard logger logs the loss, spectrograms, and audio.
tensorboard --logdir=./tensorboard --bind_all

Inference

  • Run inference.py.
python inference.py -c <checkpoint_path> --text <'text'>

We provide a Jupyter Notebook that contains the inference code and shows some visualizations together with the resulting audio.

  • Colab notebook: this notebook provides pre-trained weights for WaveGrad 2, which you can download via the URL inside (checkpoints for both the WaveGrad-Base and WaveGrad-Large decoders).

Large Decoder

We implemented the WaveGrad-Large decoder for higher-MOS output.
Note: it may differ from Google's implementation, since our number of parameters differs from the value reported in the paper.

  • To train with the Large model, you need to modify hparameter.yaml.
wavegrad:
  is_large: True  # if False, Base
  ...
  dilations: [[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,4,8]]  # dilations for Large
  # dilations for Base: [[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,1,2],[1,2,1,2]]
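As a quick illustration of the difference (arithmetic only; treating each sub-list as one decoder block and assuming a fixed kernel size, so larger dilations mean a wider receptive field):
# Dilation settings copied from hparameter.yaml above.
large_dilations = [[1, 2, 4, 8]] * 5
base_dilations = [[1, 2, 4, 8]] * 3 + [[1, 2, 1, 2]] * 2

# The Large decoder keeps the [1, 2, 4, 8] pattern in the last two blocks as well.
print(sum(sum(block) for block in large_dilations))  # 75
print(sum(sum(block) for block in base_dilations))   # 57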

Note

Since this repo is an unofficial implementation and the WaveGrad 2 paper does not provide several details, slight differences from the paper may exist.
We list the modifications and arbitrary setups below:

  • A normal LSTM without ZoneOut is used for the encoder.
  • g2p_en is used instead of Google's unspecified G2P.
  • Trained with the LJSpeech dataset instead of Google's proprietary dataset.
    • Due to the dataset replacement, the output audio's sampling rate is 22.05 kHz instead of 24 kHz.
  • MT + SpecAug are not implemented.
  • The WaveGrad decoder shares the same issues as ivanvovk's WaveGrad implementation.
  • The WaveGrad-Large decoder's architecture may differ from Google's implementation.
  • hyperparameters (a short sketch follows this list)
    • train.batch_size: 12 for Base and 6 for Large, trained with 2 V100 (32GB) GPUs
    • train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
    • train.decay: learning rate decay applied during training
    • train.loss_rate: 1, as total_loss = 1 * L1_loss + 1 * duration_loss
    • ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
    • encoder.channel is reduced to 512 from 1024 or 2048
  • Items listed in the TODO section above.
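For illustration, here is a minimal sketch of how the noise-schedule and loss-weight values above fit together; the max_step value, the placeholder losses, and the standard DDPM cumulative-product bookkeeping are assumptions, not code from this repository:
import torch

max_step = 1000  # placeholder for hparams.ddpm.max_step

# ddpm.ddpm_noise_schedule as listed above: linearly spaced betas.
betas = torch.linspace(1e-6, 0.01, max_step)

# Standard DDPM bookkeeping derived from the betas (assumed, not repo code).
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# train.loss_rate as listed above: total_loss = 1 * L1_loss + 1 * duration_loss.
l1_loss = torch.tensor(0.1)         # placeholder value
duration_loss = torch.tensor(0.05)  # placeholder value
total_loss = 1.0 * l1_loss + 1.0 * duration_loss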

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Special thanks to

References

This implementation uses code from the following repositories:

The webpage for the audio samples uses a template from:

The audio samples on our webpage are partially derived from:

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • WaveGrad2 Official Github.io