ZeroSpeech 2019: TTS without T - Pytorch
- This is the original source code for the paper "Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion", which was accepted at Interspeech 2019.
- Furthermore, we used this implementation to compete in the ZeroSpeech 2019 challenge. On the Surprise dataset leaderboard, the proposed method ranks 2nd in terms of low bitrate, while achieving a higher Mean Opinion Score (MOS) and a lower character error rate (CER) than the 1st-place team.
- Feel free to use or modify the code; bug reports and suggestions for improvement are appreciated. If you have any questions, please contact [email protected]. If you find this project helpful for your research, please consider citing this paper.
Quick Start
Setup
- Clone this repo:
git clone git@github.com:andi611/ZeroSpeech-TTS-without-T.git
- CD into this repo:
cd ZeroSpeech-TTS-without-T
Installing dependencies
-
Install Python 3.
-
Install the latest version of PyTorch for your platform. For better performance, install with GPU support (CUDA) if available. This code works with PyTorch 0.4 and later.
Prepare data
-
Download the ZeroSpeech dataset.
- The English dataset:
wget https://download.zerospeech.com/2019/english.tgz
tar xvfz english.tgz -C data
rm -f english.tgz
- The Surprise dataset:
wget https://download.zerospeech.com/2019/surprise.zip
# Go to https://download.zerospeech.com and accept the licence agreement
# to get the password protecting the archive
unzip surprise.zip -d data
rm -f surprise.zip
-
After unpacking the dataset into ~/ZeroSpeech-TTS-without-T/data, the data tree should look like this:
|- ZeroSpeech-TTS-without-T
    |- data
        |- english
            |- train
                |- unit
                |- voice
            |- test
        |- surprise
            |- train
                |- unit
                |- voice
            |- test
-
Preprocess the dataset and sample model-ready index files:
python3 main.py --preprocess --remake
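Before running the preprocessing step, it can help to confirm that the data tree described above is in place. The following is a minimal sketch and not part of this repository: the helper name `missing_dirs` is hypothetical, and it only checks the directories named in the tree above (`train/unit`, `train/voice`, `test`), assuming it is run from the repository root.

```python
from pathlib import Path

# Sub-directories listed in the data tree above, relative to each dataset folder.
EXPECTED_SUBDIRS = ("train/unit", "train/voice", "test")


def missing_dirs(data_root, datasets=("english", "surprise")):
    """Return the expected directories that do not exist under data_root.

    Hypothetical helper for illustration; main.py does not provide it.
    """
    root = Path(data_root)
    missing = []
    for dataset in datasets:
        for sub in EXPECTED_SUBDIRS:
            path = root / dataset / sub
            if not path.is_dir():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    problems = missing_dirs("data")
    if problems:
        print("Missing directories:")
        for p in problems:
            print("  " + p)
    else:
        print("Data tree looks complete.")
```

If only one dataset was downloaded, pass it explicitly, for example `missing_dirs("data", datasets=("english",))`.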
Usage
Training
-
Train ASR-TTS autoencoder model for discrete linguistic units discovery:
python3 main.py --train_ae
Tunable hyperparameters can be found in hps/zerospeech.json. You can adjust these parameters and settings by editing the file; the default hyperparameters are recommended for this project.
-
Train TTS patcher for voice conversion performance boosting:
python3 main.py --train_p --load_model --load_train_model_name=model.pth-ae-400000
-
Train TTS patcher with target guided adversarial training:
python3 main.py --train_tgat --load_model --load_train_model_name=model.pth-ae-400000
-
Monitor with Tensorboard (OPTIONAL)
tensorboard --logdir='path to log dir'
or
python3 -m tensorboard.main --logdir='path to log dir'
Testing
-
Test on a single speech sample:
python3 main.py --test_single --load_test_model_name=model.pth-ae-200000
-
Test on 'synthesis.txt' and generate resynthesized audio files:
python3 main.py --test --load_test_model_name=model.pth-ae-200000
-
Test on all the test speech under test/ and generate encoding files:
python3 main.py --test_encode --load_test_model_name=model.pth-ae-200000
-
Add --enc_only if testing with the ASR-TTS autoencoder only:
python3 main.py --test_single --load_test_model_name=model.pth-ae-200000 --enc_only
python3 main.py --test --load_test_model_name=model.pth-ae-200000 --enc_only
python3 main.py --test_encode --load_test_model_name=model.pth-ae-200000 --enc_only
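The testing commands in this section differ only in the test mode and in whether --enc_only is appended. As a rough sketch, they could be assembled from a small Python helper. The function name `build_test_command` is hypothetical and not part of this repository; it uses only the flags shown in this section.

```python
# The three test modes documented in this section.
TEST_MODES = ("--test_single", "--test", "--test_encode")


def build_test_command(mode, model_name, enc_only=False):
    """Return the argument list for one test run of main.py.

    Hypothetical helper for illustration; it only combines flags that
    appear in this README.
    """
    if mode not in TEST_MODES:
        raise ValueError("unknown test mode: " + mode)
    args = ["python3", "main.py", mode, "--load_test_model_name=" + model_name]
    if enc_only:
        # Needed when testing with the ASR-TTS autoencoder only.
        args.append("--enc_only")
    return args


if __name__ == "__main__":
    for mode in TEST_MODES:
        print(" ".join(build_test_command(mode, "model.pth-ae-200000", enc_only=True)))
```

The returned list can be passed directly to `subprocess.run` from the repository root if the commands should be executed rather than printed.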
Switching between datasets
- Simply use --dataset=surprise to switch to the default alternative set; all paths are handled automatically if the data tree is laid out as suggested. For example:
python3 main.py --train_ae --dataset=surprise
Trained-Models
- We provide trained models as ckpt files. Download link: bit.ly/ZeroSpeech2019-Liu
- Reload a model for training:
--load_train_model_name=model.pth-ae-400000-128-multi-1024-english
(--ckpt_dir=./ckpt_english or --ckpt_dir=./ckpt_surprise by default).
- Two ways to load a model for testing:
--load_test_model_name=model.pth-ae-400000-128-multi-1024-english (by name)
--ckpt_pth=ckpt/model.pth-ae-400000-128-multi-1024-english (direct path)
- Note that hps/zerospeech.json must be set to match the model you are loading. If a 128-multi-1024 model is being loaded, seg_len and enc_size should be set to 128 and 1024, respectively. If an ae model is being loaded, the argument --enc_only must be used when running main.py (see item 4 in the Testing section).
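To make the link between a checkpoint name and the hyperparameter file concrete, here is a small sketch that reads the seg_len and enc_size values out of a name such as model.pth-ae-400000-128-multi-1024-english. The tag format `<seg_len>-multi-<enc_size>` is inferred from that one example, the function names are hypothetical, and the `hps` argument is assumed to be a plain dict holding `seg_len` and `enc_size` keys; the actual layout of hps/zerospeech.json may differ.

```python
import re

# Assumed tag format "<seg_len>-multi-<enc_size>", inferred from the example
# checkpoint name in this README; other naming schemes are not handled.
TAG_RE = re.compile(r"-(\d+)-multi-(\d+)(?:-|$)")


def expected_settings(ckpt_name):
    """Return the seg_len/enc_size implied by a checkpoint name, or None."""
    match = TAG_RE.search(ckpt_name)
    if match is None:
        return None
    return {"seg_len": int(match.group(1)), "enc_size": int(match.group(2))}


def mismatched_keys(ckpt_name, hps):
    """List the keys whose value in hps differs from the checkpoint name.

    hps is assumed to be a flat dict with "seg_len" and "enc_size" keys.
    """
    expected = expected_settings(ckpt_name)
    if expected is None:
        return []
    return [key for key, value in expected.items() if hps.get(key) != value]


if __name__ == "__main__":
    name = "model.pth-ae-400000-128-multi-1024-english"
    print(expected_settings(name))
    print(mismatched_keys(name, {"seg_len": 64, "enc_size": 1024}))
```

Names without the tag, such as model.pth-ae-400000, return None from `expected_settings`, so nothing is checked for them.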
Notes
- This code includes all the settings and methods we tested for this challenge; some of them did not succeed, but we did not remove them from the code. However, the instructions above and the default settings correspond to the method we proposed. By running them, one can reproduce our results.
- TODO: upload pre-trained models
Citation
@article{Liu_2019,
title={Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion},
url={http://dx.doi.org/10.21437/interspeech.2019-2048},
DOI={10.21437/interspeech.2019-2048},
journal={Interspeech 2019},
publisher={ISCA},
author={Liu, Andy T. and Hsu, Po-chun and Lee, Hung-Yi},
year={2019},
month={Sep}
}