Making a TTS model with 1 minute of speech samples within 10 minutes
Seeing my implementations of Tacotron and DCTTS, many people have asked me "How large a speech dataset is needed for neural TTS?" or "Can you make a TTS model with X hour(s)/minute(s) of training data?" I'm fully aware of the importance of those questions. When you plan a service using TTS, you cannot always count on getting lots of speech samples. I would like to give an answer. I really do. But unfortunately I have no answer. The only thing I know is that I could train a model successfully with five hours of speech samples I extracted from Kate Winslet's audiobook. I haven't tried less data than that. I could try it, but I actually have a better idea. Since I have a decent model trained on the LJ Speech Dataset for several days, why don't I use it? After all, we all have different voices, but the way we speak English is not totally different.
In the above two repos, I trained TTS models from scratch using all the speech samples of my two favorite celebrities, Nick Offerman and Kate Winslet. This time, I use only one minute of their speech samples. The following are the samples synthesized after 10 minutes of fine-tuning. Do you think they sound like them?
- Check Nick Samples
- Check Kate Samples
Additionally, I collected 10 speech samples of Modern Family celebrities from YouTube and generated their voices by training on those samples.
- Ed O'Neill
- Sofía Vergara
- Julie Bowen
- Ty Burrell
- Jesse Tyler Ferguson
- Eric Stonestreet
- Sarah Hyland
- Ariel Winter
- Rico Rodriguez
Check here to see the model details, the source code, and the pretrained model that served as the seed.