Arabic Tacotron TTS
An implementation of Tacotron speech synthesis in TensorFlow for Arabic.
Audio Samples
Check Audio Samples from models trained using this repo on Nawar Halabi's speech corpus
Background
In April 2017, Google published a paper, Tacotron: Towards End-to-End Speech Synthesis, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs. However, they didn't release their source code or training data. This is an attempt to provide an open-source implementation of the model described in their paper.
This implementation is pretty much the same as Keithito's implemetation. Here are changes I made.
Check this article to know more about this project Arabic-Tacotron-TTS
Quick Start
Installing dependencies
-
Install Python 3. Use version 3.5 instead of newer python versions for tensorflow support. You could use anaconda to create a new environment by
conda create -n myenv python=3.5
activate myenv
-
Install requirements:
pip install -r requirements.txt
-
Install tensorflow
pip install tensorflow
orpip install tensorflow-gpu
Using a pre-trained model
-
Download and unpack the pretrained model
-
Extract the model files into a folder in a destination of your choice
-
Run the demo server:
python demo_server.py --checkpoint .\{folder_in_a_destination_of_your_choice}\model.ckpt-200000
-
Point your browser at localhost:9200
- Type what you want to synthesize. Use only diacritised Arabic text.
Training
Note: you need 40GB (more or less) of free disk space to train a model.
-
Download a speech dataset. The following are supported out of the box:
- Nawar Halabi You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.
-
Preprocess
- Unpack the dataset`
- Add a folder called nawar_without_hag9 in
~/tacotron
- Download temp_filtered.csv and add it there.
- Add a folder called wavs there in which all wav files are there Your tree should look like this
tacotron |- nawar_without_hag9 |- temp_filtered.csv |- wavs
- Run
python .\preprocess.py --dataset nawar
- Update max_iters to 400 if not already set to that number
-
Train
python .\train.py
-
Monitor with Tensorboard (optional) The trainer dumps audio and alignments every 1000 steps. You can find these in
~/tacotron/logs-tacotron
. You could use tensorboard to make sense out of these data using the following command.tensorboard --logdir ~/tacotron/logs-tacotron
Changes from the original repo
- Added Arabic speech corpus preprocessing code and created temp_filtered.csv
- Hosted Arabic trained model based on Nawar Halabi's Speech Corpus
- Updated hparams to work with the Arabic speech corpus
- Added an instructional explanation on how to reproduce
- Added Arabic specific tests
- Removed some of the unused code
- Updated symbols to match the Arabic phonetic language
- Adjusted data-feeder so that all input text are phonetised by arabic_pronounce
Areas of Improvement
- Add cleaners
- Add embedded diacritiser
Summary of important commands
- python .\preprocess.py --dataset nawar
- python .\train.py --restore_step 201000
- python .\demo_server.py --checkpoint C:\Users\User\tacotron\logs-tacotron\model.ckpt-70000
- python -m pytest
Notes
Remember to delete the files in the training folder then preprocess again if you changed the config
Thanks to
Suhail Kwailat, Dr. Taha Zerrouki, Dr. Motaz Saad, Dr. Nawar Halabi, Keith Ito, Dr. Basem Ahmed, and Leo Ma for their detailed feedback and recommendations.