All three branches provided here perform prosody transfer.
You can generate speech with a desired style, sentence, and voice.
The speaker of the reference audio can be anyone; that person need not be included in the training data.
The target speaker (the voice of the synthesized audio), however, must be included in the training data.
Using hard and soft pitchtron, you can synthesize speech in the 'Kyongsang' dialect, the 'Cheolla' dialect, and emotional styles even if the model is trained only on plain, neutral speech.
Global style tokens, in contrast, require a DB of the desired style at training time.
I proposed pitchtron in order to synthesize the Korean Kyongsang and Cheolla dialects.
The DBs for these dialects are very limited, and the pitch contour is key to referencing them naturally. The same holds for many other pitch-accented languages (e.g. Japanese), tonal languages (e.g. Chinese), and emotional speaking styles.
| | GST | Soft pitchtron | Hard pitchtron |
|---|---|---|---|
| Temporal resolution | X | O | O |
| Linear control | X | * | O |
| Vocal range adjustment | X | O | O |
| Non-parallel referencing | O | O | ** |
| Unseen style support | X | O | O |
| Dimension analysis requirement | O | X | X |
*: Soft pitchtron lets you control the pitch as long as the result sounds natural. If the requested pitch is outside the target speaker's vocal range, it is clipped to keep the sound natural.
**: Hard pitchtron allows only limited non-parallel referencing.
Limited non-parallel: the text can differ, but the sentence structure must match.
Temporal resolution: Can the style be controlled separately at each timestep?
Linear control: Can I specify exactly how much the pitch (note) is scaled, without having to explore the embedding space to figure out how the embedding dimensions respond to input changes?
Vocal range adjustment: If the vocal ranges of the reference speaker and the target speaker are drastically different, can referencing still sound natural within the target speaker's vocal range?
Non-parallel referencing: If the reference sentence and the target sentence differ, can synthesis still sound natural?
Unseen style support: If the reference audio is in a style never seen during training, can it still be transferred naturally?
Dimension analysis requirement: Do I have to analyze which token/dimension controls which attribute before I can control the model?
1. Soft pitchtron
This branch provides unsupervised prosody transfer for parallel, limited non-parallel, and non-parallel sentences.
Parallel: The reference audio sentence and the target synthesis sentence match.
Limited non-parallel: As defined above.
Non-parallel: The reference audio sentence and the target synthesis sentence need not match.
It is similar to global style tokens, but has several advantages.
It is much more robust to styles that are unseen during training.
It is much easier to control: you don't have to analyze tokens or dimensions to see what each one does.
The pitch range of the reference audio is scaled to fit that of the target speaker, so inter-gender transfer is more natural.
You can also control the pitch for every phoneme input.
Your control over pitch is not strict: the pitch is only scaled as far as the result still sounds natural.
2. Hard pitchtron
This branch provides unsupervised prosody transfer for parallel and limited non-parallel sentences.
In exchange, the rhythm and pitch follow the reference audio exactly.
The pitch range of the reference audio is scaled to fit that of the target speaker, so inter-gender transfer is more natural.
You have strict control over the pitch range: it will scale by the requested amount even if the result sounds unnatural.
Regularize the sampling rate to 22050 Hz (this DB has an irregular sampling rate).
Trim silence with a 25 dB threshold (top_db=25).
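The trimming step can be sketched roughly as follows. This is a minimal, pure-NumPy version of top-dB silence trimming; the repository itself presumably uses a library routine such as `librosa.effects.trim`, and the frame/hop sizes here are illustrative assumptions:

```python
import numpy as np

def trim_silence(y, top_db=25, frame_length=2048, hop_length=512):
    """Trim leading/trailing frames quieter than `top_db` below the loudest frame."""
    # Frame-wise RMS energy
    n_frames = max(1, 1 + (len(y) - frame_length) // hop_length)
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Convert to dB relative to the loudest frame
    db = 20 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    keep = np.where(db > -top_db)[0]
    if len(keep) == 0:
        return y[:0]
    start = keep[0] * hop_length
    end = min(len(y), keep[-1] * hop_length + frame_length)
    return y[start:end]
```

For example, a one-second tone padded with a second of silence on each side is cut back to roughly the one-second voiced region.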
source:
wav_16000/{speaker}/*.wav
pron/{speaker}/t**.txt
Excluded from the scripts:
The script for unzipping the archives and moving the wavs into wav_16000 is not included; you need to arrange the files into this layout yourself.
The text files are identical for all speakers in this DB, so I split this shared script by literature manually (it contains missing-newline errors, so it had to be done by hand).
This can be generalized to multilingual TTS, where there are multiple DBs of different languages.
Thus, the language code corresponding to each DB is appended to the integrated meta text file created in this step.
How to
Modify the source file lists ('train_file_lists', 'eval_file_lists', 'test_file_lists') and target file lists ('target_train_file_list', 'target_eval_file_list', 'target_test_file_list')
in preprocess.preprocess.integrate_dataset(args).
You might want to modify the _integrate() method to designate the language code for each DB. Sorry, it is hard-coded for now.
Run preprocess.py:
python preprocess.py --dataset=integrate_dataset
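The integration step might look roughly like the sketch below. The 'path|text' filelist format and the helper name are assumptions for illustration; the actual _integrate() method may use a different layout:

```python
def integrate_filelists(filelists_by_lang, out_path):
    """Merge per-DB filelists into one meta file, tagging each line
    with its DB's language code (hypothetical 'path|text' line format)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for lang_code, filelists in filelists_by_lang.items():
            for filelist in filelists:
                with open(filelist, encoding="utf-8") as f:
                    for line in f:
                        line = line.rstrip("\n")
                        if line:
                            # Append the language code as a new field
                            out.write(f"{line}|{lang_code}\n")
```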
4. check_file_integrity
This step generates a meta file listing the wav paths that could not be read.
You may want to remove them from your final filelists, or investigate them; that is up to you. This step does not remove the detected files from the filelists.
out: problematic_merge_korean_pron_{}.txt
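The integrity check can be sketched with the standard-library wave module (a simplified stand-in; the repository's check may use a different audio reader):

```python
import wave

def find_unreadable_wavs(paths):
    """Return the subset of `paths` that cannot be opened as valid WAV files."""
    bad = []
    for path in paths:
        try:
            with wave.open(path, "rb") as w:
                # Treat empty files as problematic too
                if w.getnframes() == 0:
                    bad.append(path)
        except (wave.Error, EOFError, OSError):
            bad.append(path)
    return bad
```

The returned list can then be written out as the problematic-files meta file.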
5. generate_mel_f0 (optional)
This optional step extracts the training features (mel spectrograms and F0 contours) and saves them to files.
src: wav_22050/*.wav
dst: mel/*.pt and f0/*.pt
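The F0 part of this step can be illustrated with a simple autocorrelation pitch estimator for a single frame. This is only a sketch of the idea; the repository likely uses a dedicated F0 extractor, and the frequency bounds here are illustrative assumptions:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    frame = frame - frame.mean()
    # Full autocorrelation, keep only non-negative lags
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search for the strongest peak within the plausible pitch period range
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return sr / lag
```

Running this frame-by-frame over a wav (with unvoiced frames set to 0) yields the kind of F0 contour saved by this step.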
6. Initialize the first few epochs with a single-speaker DB
Prepare separate train and validation filelists for the single speaker.
The files for single-speaker training and validation are also included in the multi-speaker filelists.
In my experiments, training the initial 30 epochs on a single-speaker DB helped the encoder-decoder alignment a lot.
How to train?
1. Commands
python train.py {program arguments}
2. Program arguments
| Option | Mandatory | Purpose |
|--------|-----------|---------|
| -o | O | Directory path to save checkpoints. |
| -c | X | Path of a pretrained checkpoint to load. |
| -l | O | Log directory for TensorBoard logs. |
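A minimal sketch of how these flags might be parsed inside train.py (the long option names and argparse usage here are assumptions, not the repository's actual code):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train pitchtron")
    parser.add_argument("-o", "--output_directory", required=True,
                        help="Directory path to save checkpoints.")
    parser.add_argument("-c", "--checkpoint_path", default=None,
                        help="Path of a pretrained checkpoint to load.")
    parser.add_argument("-l", "--log_directory", required=True,
                        help="Log directory for TensorBoard logs.")
    return parser
```

With this, a typical invocation is `python train.py -o checkpoints -l logs`, optionally adding `-c` to resume from a checkpoint.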
3. Pretrained models
*The pretrained models are trained on phonemes; they expect phoneme input when you give them text to synthesize.
To prevent crackling sounds, the vocal range of the reference audio needs to be scaled to the target speaker's vocal range.
That part is implemented in our code, but the target speaker's vocal range is estimated coarsely: just 10 audios are sampled, and their max-min span is taken as the range.
You will get much better sound if you use more accurate statistics for the target speaker's vocal range.
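The range scaling described above can be sketched as a linear remapping of the reference F0 contour into the target speaker's span. This is an illustrative version, assuming unvoiced frames are marked with F0 = 0; the repository's implementation may differ:

```python
import numpy as np

def scale_pitch_range(ref_f0, target_min, target_max):
    """Linearly map a reference F0 contour into the target speaker's range.
    Unvoiced frames (F0 == 0) are left untouched."""
    voiced = ref_f0 > 0
    if not voiced.any():
        return ref_f0.copy()
    ref_min, ref_max = ref_f0[voiced].min(), ref_f0[voiced].max()
    span = max(ref_max - ref_min, 1e-6)
    out = ref_f0.copy()
    out[voiced] = target_min + (ref_f0[voiced] - ref_min) / span * (target_max - target_min)
    return out
```

Using more audios to estimate target_min and target_max gives a more reliable range, which is exactly the improvement suggested above.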
Acknowledgements
This material is based upon work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No. 10080667, Development of conversational speech synthesis technology to express emotion and personality of robots through sound source diversification).
I got help regarding grapheme-to-phoneme conversion from this awesome guy => Jeongpil_Lee