GAN TTS
PyTorch implementation of Generative Adversarial Network (GAN) based text-to-speech (TTS) and voice conversion (VC).
- Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
- Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dongyan Huang, and Haizhou Li, "Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under a Multi-task Learning Framework," arXiv:1707.01670, Jul. 2017.
Generated audio samples
Audio samples are available in the Jupyter notebooks at the links below:
- Voice conversion (en, MLP)
- Voice conversion (en, RNN)
- Text-to-speech synthesis (en, MLP)
- Text-to-speech synthesis (ja, MLP)
Notes on hyper parameters
- `adversarial_streams`, which specifies the streams (mgc, lf0, vuv, bap) used to compute the adversarial loss, is a parameter that strongly affects speech quality. Computing the adversarial loss on mgc features (excluding the first few dimensions) seems to work well (see the configuration sketch after these notes).
- If `mask_nth_mgc_for_adv_loss` > 0, the first `mask_nth_mgc_for_adv_loss` mgc dimensions are ignored when computing the adversarial loss. As described in saito2017asja, I confirmed that using the 0th (and 1st) mgc coefficients in the adversarial loss negatively affects speech quality. In my experience, `mask_nth_mgc_for_adv_loss` = 1 for mgc order 25 and `mask_nth_mgc_for_adv_loss` = 2 for mgc order 59 work well.
- F0 trajectories extracted by WORLD are spline-interpolated. Set `f0_interpolation_kind` to "slinear" if you want first-order spline interpolation, which is the same as Merlin's default.
- Set `use_harvest` to True if you want to use the Harvest F0 estimation algorithm. If False, Dio and StoneMask are used to estimate/refine F0.
- If you see `cuda runtime error (2) : out of memory`, try a smaller batch size (#3).
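The notes above correspond to keys in hparams.py. Below is a minimal, hedged sketch of how they might be set together; the key names come from this README, but the value types (e.g., a per-stream boolean list for `adversarial_streams`) and the exact values are assumptions, so treat hparams.py as the authoritative source.

```python
# Hedged sketch: speech-quality-sensitive settings from the notes above.
# Key names are from the README; value types (e.g. a boolean per stream
# for adversarial_streams) are assumptions -- check hparams.py.
from tensorflow.contrib.training import HParams

hparams = HParams(
    # Streams (mgc, lf0, vuv, bap) used for the adversarial loss;
    # here only mgc, which the notes report works well.
    adversarial_streams=[True, False, False, False],
    # Ignore the first N mgc dimensions in the adversarial loss
    # (1 for mgc order 25, 2 for mgc order 59, per the notes).
    mask_nth_mgc_for_adv_loss=2,
    # First-order spline F0 interpolation (same as Merlin's default).
    f0_interpolation_kind="slinear",
    # Harvest F0 estimation; if False, Dio + StoneMask are used.
    use_harvest=True,
)
```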
Notes on [2]
Though I haven't obtained improvements over Saito's approach [1] yet, the GAN-based models described in [2] should be reproducible with the following configuration:
- Set `generator_add_noise` to True. This enables the generator to take Gaussian noise as an additional input; linguistic features are concatenated with the noise vector.
- Set `discriminator_linguistic_condition` to True. The discriminator then uses linguistic features as a condition (see the sketch after this list).
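As a hedged illustration, toggling these two flags might look as follows. The defaults shown are assumptions, and `set_hparam` is part of `tf.contrib.training.HParams` itself, not something specific to this repo.

```python
from tensorflow.contrib.training import HParams

# Assumed defaults; hparams.py is the authoritative source.
hparams = HParams(generator_add_noise=False,
                  discriminator_linguistic_condition=False)

# Enable the [2]-style configuration: a noise-driven generator and a
# linguistically conditioned discriminator.
hparams.set_hparam("generator_add_noise", True)
hparams.set_hparam("discriminator_linguistic_condition", True)
```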
Requirements
- PyTorch >= v0.2.0
- TensorFlow (just for `tf.contrib.training.HParams`; see the sketch after this list)
- nnmnkwii
- PyWorld
- https://github.com/taolei87/sru (if you want to try SRU-based models)
- Python
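TensorFlow is needed only for the `HParams` container. For reference, a minimal sketch of how that API is typically used (standard `tf.contrib.training.HParams` behavior in TensorFlow 1.x, not code from this repo; the keys here are made up):

```python
from tensorflow.contrib.training import HParams

# Define defaults, then override from a comma-separated string,
# e.g. one passed on the command line.
hp = HParams(batch_size=32, lr=0.01)  # hypothetical keys
hp.parse("batch_size=16,lr=0.001")
print(hp.batch_size, hp.lr)  # -> 16 0.001
```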
Installation
Please install PyTorch, TensorFlow, and SRU (if needed) first. Once you have those, the following should install all other dependencies:

```
git clone --recursive https://github.com/r9y9/gantts && cd gantts
pip install -e ".[train]"
```
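After installation, a quick sanity check is to confirm that the core dependencies import cleanly; `torch`, `nnmnkwii`, and `pyworld` are the standard import names for the packages listed above.

```python
# Quick sanity check that the main dependencies are importable.
import torch
import nnmnkwii
import pyworld

print("ok")
```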
Repository structure
- gantts/: Network definitions and utilities for sequence-loss optimization.
- prepare_features_vc.py: Acoustic feature extraction script for voice conversion.
- prepare_features_tts.py: Linguistic/duration/acoustic feature extraction script for TTS.
- train.py: GAN-based training script. It is written generically so that it can be used to train voice conversion models as well as text-to-speech models (duration/acoustic).
- train_gan.sh: Adversarial training wrapper script for `train.py`.
- hparams.py: Hyper parameters for the VC and TTS experiments.
- evaluation_vc.py: Evaluation script for VC.
- evaluation_tts.py: Evaluation script for TTS.
Feature extraction scripts are written for the CMU ARCTIC dataset, but can be easily adapted to other datasets (a hedged sketch of the underlying WORLD analysis follows).
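To give a feel for what the extraction scripts compute, here is a minimal, hedged sketch of WORLD analysis with PyWorld, illustrating the Dio + StoneMask vs. Harvest choice from the notes above. The wav file name is a placeholder, and the actual scripts do more than this single-utterance example.

```python
# Hedged sketch of WORLD analysis for one utterance with PyWorld.
import numpy as np
import pyworld
from scipy.io import wavfile

fs, x = wavfile.read("arctic_a0001.wav")  # placeholder file name
x = x.astype(np.float64)

use_harvest = False
if use_harvest:
    f0, timeaxis = pyworld.harvest(x, fs)
else:
    f0, timeaxis = pyworld.dio(x, fs)            # raw F0 trajectory
    f0 = pyworld.stonemask(x, f0, timeaxis, fs)  # F0 refinement

spectrogram = pyworld.cheaptrick(x, f0, timeaxis, fs)  # smoothed spectrum
aperiodicity = pyworld.d4c(x, f0, timeaxis, fs)        # band aperiodicity
```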
Run demos
Voice conversion (en)
`vc_demo.sh` is a `clb` to `slt` voice conversion demo script. Before running it, please download the wav files for `clb` and `slt` from CMU ARCTIC and check that you have all the data in a directory as follows:
```
> tree ~/data/cmu_arctic/ -d -L 1
/home/ryuichi/data/cmu_arctic/
├── cmu_us_awb_arctic
├── cmu_us_bdl_arctic
├── cmu_us_clb_arctic
├── cmu_us_jmk_arctic
├── cmu_us_ksp_arctic
├── cmu_us_rms_arctic
└── cmu_us_slt_arctic
```
Once you have downloaded the dataset, run:

```
./vc_demo.sh ${experimental_id} ${your_cmu_arctic_data_root}
```

e.g.,

```
./vc_demo.sh vc_gan_test ~/data/cmu_arctic/
```
Model checkpoints will be saved to `./checkpoints/${experimental_id}` and audio samples to `./generated/${experimental_id}`.
Text-to-speech synthesis (en)
`tts_demo.sh` is a self-contained TTS demo script. The usage is:

```
./tts_demo.sh ${experimental_id}
```

This will download the `slt_arctic_full_data` used in Merlin's demo, perform feature extraction, train models, and synthesize audio samples for the eval/test sets. `${experimental_id}` can be an arbitrary string, for example:

```
./tts_demo.sh tts_test
```
Model checkpoints will be saved to `./checkpoints/${experimental_id}` and audio samples to `./generated/${experimental_id}`.
Hyper parameters
See `hparams.py`.
Monitoring training progress
```
tensorboard --logdir=log
```
References
- Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks," arXiv:1709.08041 [cs.SD], Sep. 2017.
- Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Training algorithm to deceive anti-spoofing verification for DNN-based text-to-speech synthesis," IPSJ SIG Technical Report, 2017-SLP-115, no. 1, pp. 1-6, Feb. 2017 (in Japanese).
- Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Voice conversion using input-to-output highway networks," IEICE Transactions on Information and Systems, vol. E100-D, no. 8, pp. 1925-1928, Aug. 2017.
- https://www.slideshare.net/ShinnosukeTakamichi/dnnantispoofing
- https://www.slideshare.net/YukiSaito8/Saito2017icassp
Notice
This repository doesn't try to reproduce the same results reported in the papers because 1) the data is not publicly available and 2) hyper parameters depend heavily on the data. Instead, I tried the same ideas on different data with different hyper parameters.