MQTTS
- Official implementation for the paper: A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech.
- Audio samples (40 each system) can be accessed at here.
- Quick demo can be accessed here (Some are still TODO).
- Paper appendix is here.
Setup the environment
- Setup conda environment:
conda create --name mqtts python=3.9
conda activate mqtts
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
(Update) You may need to create an access token to use the speaker embedding of pyannote as they updated their policy.
If that's the case follow the pyannote repo and change every Inference("pyannote/embedding", window="whole")
accordingly.
- Download the pretrained phonemizer checkpoint
wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt
Preprocess the dataset
- Get the GigaSpeech dataset from the official repo
- Install FFmpeg, then
conda install ffmpeg=4.3=hf484d3e_0
conda update ffmpeg
- Run python script
python preprocess.py --giga_speech_dir GIGASPEECH --outputdir datasets
Train the quantizer and inference
- Train
cd quantizer/
python train.py --input_wavs_dir ../datasets/audios \
--input_training_file ../datasets/training.txt \
--input_validation_file ../datasets/validation.txt \
--checkpoint_path ./checkpoints \
--config config.json
- Inference to get codes for training the second stage
python get_labels.py --input_json ../datasets/train.json \
--input_wav_dir ../datasets/audios \
--output_json ../datasets/train_q.json \
--checkpoint_file ./checkpoints/g_{training_steps}
python get_labels.py --input_json ../datasets/dev.json \
--input_wav_dir ../datasets/audios \
--output_json ../datasets/dev_q.json \
--checkpoint_file ./checkpoints/g_{training_steps}
Train the transformer (below an example for the 100M version)
cd ..
mkdir ckpt
python train.py \
--distributed \
--saving_path ckpt/ \
--sampledir logs/ \
--vocoder_config_path quantizer/checkpoints/config.json \
--vocoder_ckpt_path quantizer/checkpoints/g_{training_steps} \
--datadir datasets/audios \
--metapath datasets/train_q.json \
--val_metapath datasets/dev_q.json \
--use_repetition_token \
--ar_layer 4 \
--ar_ffd_size 1024 \
--ar_hidden_size 256 \
--ar_nheads 4 \
--speaker_embed_dropout 0.05 \
--enc_nlayers 6 \
--dec_nlayers 6 \
--ffd_size 3072 \
--hidden_size 768 \
--nheads 12 \
--batch_size 200 \
--precision bf16 \
--training_step 800000 \
--layer_norm_eps 1e-05
You can view the progress using:
tensorboard --logdir logs/
Run batched inference
You'll have to change speaker_to_text.json
, it's just an example.
mkdir infer_samples
CUDA_VISIBLE_DEVICES=0 python infer.py \
--phonemizer_dict_path en_us_cmudict_forward.pt \
--model_path ckpt/last.ckpt \
--config_path ckpt/config.json \
--input_path speaker_to_text.json \
--outputdir infer_samples \
--batch_size {batch_size} \
--top_p 0.8 \
--min_top_k 2 \
--max_output_length {Maximum Output Frames to prevent infinite loop} \
--phone_context_window 3 \
--clean_speech_prior