Multi-Tacotron Voice Cloning
This repository is a phonemic multilingual (Russian-English) implementation based on Real-Time-Voice-Cloning. It is a four-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model. If you only need the English version, please use the original implementation.
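The four stages can be pictured as a data-flow: text is converted to phonemes, a speaker encoder maps a few seconds of reference audio to a fixed-size embedding, a synthesizer conditioned on that embedding predicts a mel spectrogram, and a vocoder renders the waveform. The sketch below only illustrates the shapes flowing between stages; every function, dimension, and constant in it is a hypothetical placeholder, not the repository's actual API.

```python
# Conceptual sketch of the four-stage voice-cloning pipeline.
# All functions below are hypothetical placeholders showing the data flow,
# not the trained networks from this repository.
import numpy as np

def to_phonemes(text: str) -> list:
    """Stage 1 (placeholder): grapheme-to-phoneme conversion via a dictionary."""
    return list(text)  # a real system would look up a phoneme dictionary

def encode_speaker(reference_audio: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): map reference audio to a fixed-size embedding."""
    rng = np.random.default_rng(abs(hash(reference_audio.tobytes())) % 2**32)
    return rng.standard_normal(256)  # 256-d embedding is an assumption

def synthesize_mel(phonemes: list, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stage 3 (placeholder): synthesizer predicts a mel spectrogram."""
    frames = 10 * len(phonemes)      # rough frames-per-phoneme placeholder
    return np.zeros((80, frames))    # 80 mel channels is a common choice

def vocode(mel: np.ndarray) -> np.ndarray:
    """Stage 4 (placeholder): vocoder turns the mel spectrogram into audio."""
    hop = 200                        # placeholder hop length in samples
    return np.zeros(mel.shape[1] * hop)

reference = np.zeros(16000 * 5)      # 5 seconds of reference audio at 16 kHz
embedding = encode_speaker(reference)
mel = synthesize_mel(to_phonemes("privet"), embedding)
wav = vocode(mel)
print(embedding.shape, mel.shape, wav.shape)  # (256,) (80, 60) (12000,)
```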
Example
Quick start
Use the Colab online demo.
Requirements
You will need the following whether you plan to use the toolbox only or to retrain the models.
Python 3.6 or newer.
PyTorch (>=1.0.1).
Run `pip install -r requirements.txt` to install the necessary packages.
A GPU is mandatory, but you don't necessarily need a high-tier GPU if you only want to use the toolbox.
Pretrained models
Download the latest here.
Datasets
Name | Language | Link | Comments | My link | My comments |
---|---|---|---|---|---|
Phoneme dictionary | En, Ru | En,Ru | Phoneme dictionary | link | Merged the Russian and English phoneme dictionaries |
LibriSpeech | En | link | 300 speakers, 360h clean speech | ||
VoxCeleb | En | link | 7,000 speakers, many hours of noisy speech | ||
M-AILABS | Ru | link | 3 speakers, 46h clean speech | ||
open_tts, open_stt | Ru | open_tts, open_stt | Many speakers, many hours of noisy speech | link | Cleaned 4 hours of speech from one speaker; fixed the annotations and split the audio into segments of up to 7 seconds |
Voxforge+audiobook | Ru | link | Many speakers, 25h various quality | link | Selected the good files, split them into segments, and added audiobooks from the internet, yielding 200 speakers with a couple of minutes each |
RUSLAN | Ru | link | One speaker, 40h good speech | link | Re-encoded to 16 kHz |
Mozilla | Ru | link | 50 speakers, 30h good speech | link | Re-encoded to 16 kHz and sorted the different users into folders |
Russian Single | Ru | link | One speaker, 9h good speech | link | Re-encoded to 16 kHz |
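Several of the prepared datasets above were split into segments of at most 7 seconds at a 16 kHz sample rate. A minimal sketch of that segmenting step, assuming the audio is already a plain PCM sample array (the repository's actual preprocessing scripts may differ):

```python
# Minimal sketch: split one long recording into chunks of at most 7 seconds.
# Assumes audio is a plain 16 kHz PCM sample array; constants are assumptions.
import numpy as np

SR = 16000          # target sample rate used in the prepared datasets
MAX_SECONDS = 7     # segment length cap, as in the dataset comments above

def split_utterance(wav: np.ndarray, sr: int = SR, max_s: int = MAX_SECONDS):
    """Return consecutive chunks of wav, each at most max_s seconds long."""
    step = sr * max_s
    return [wav[i:i + step] for i in range(0, len(wav), step)]

audio = np.zeros(SR * 20)                   # a fake 20-second recording
chunks = split_utterance(audio)
print([len(c) / SR for c in chunks])        # [7.0, 7.0, 6.0]
```

A real pipeline would also resample the source files to 16 kHz first and preferably cut at silences rather than at fixed offsets.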
Toolbox
You can then try the toolbox:
`python demo_toolbox.py -d <datasets_root>`
or
`python demo_toolbox.py`
Wiki
Training (and for other languages)
Contribution
For any questions, please email me.
Papers implemented
URL | Designation | Title | Implementation source |
---|---|---|---|
1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | CorentinJ |
1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
1712.05884 | Tacotron 2 (synthesizer) | Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions | Rayhane-mamah/Tacotron-2 |
1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | CorentinJ |