Open Text to Speech Server
Unifies access to multiple open source text to speech systems and voices for many languages.
Supports a subset of SSML that can use multiple voices, text to speech systems, and languages!
<speak>
The 1st thing to remember is that 27 languages are supported in Open TTS as of 10/13/2021 at 3pm.
<voice name="glow-speak:en-us_mary_ann">
<s>
The current voice can be changed, even to a different text to speech system!
</s>
</voice>
<voice name="coqui-tts:en_vctk#p228">
<s>Breaks are possible</s>
<break time="0.5s" />
<s>between sentences.</s>
</voice>
<s lang="en">
One language is never enough
</s>
<s lang="de">
Eine Sprache ist niemals genug
</s>
<s lang="ja">
言語を一つは決して足りない
</s>
<s lang="sw">
Lugha moja haitoshi
</s>
</speak>
See the full SSML example (use the synesthesiam/opentts:all Docker image with all voices included)
Voices
- Larynx
- Glow-Speak
- Coqui-TTS
  - English (110), Japanese (1), Chinese (1)
  - Patched embedded version of Coqui-TTS 0.3.1
- nanoTTS
  - English (2), German (1), French (1), Italian (1), Spanish (1)
- MaryTTS
  - English (7), German (3), French (4), Italian (1), Russian (1), Swedish (1), Telugu (1), Turkish (1)
  - Includes embedded MaryTTS
- flite
  - English (19), Hindi (1), Bengali (1), Gujarati (3), Kannada (1), Marathi (2), Punjabi (1), Tamil (1), Telugu (3)
- Festival
  - English (9), Spanish (1), Catalan (1), Czech (4), Russian (1), Finnish (2), Marathi (1), Telugu (1), Hindi (1), Italian (2), Arabic (2)
  - Spanish/Catalan/Finnish use ISO-8859-15 encoding
  - Czech uses ISO-8859-2 encoding
  - Russian is transliterated from Cyrillic to Latin script automatically
  - Arabic uses UTF-8 and is diacritized with mishkal
- eSpeak
  - Supports a huge number of languages/locales, but sounds robotic
Running
Basic OpenTTS server:
$ docker run -it -p 5500:5500 synesthesiam/opentts:<LANGUAGE>
where <LANGUAGE> is one of:
- all (All languages)
- ar (Arabic)
- bn (Bengali)
- ca (Catalan)
- cs (Czech)
- de (German)
- el (Greek)
- en (English)
- es (Spanish)
- fi (Finnish)
- fr (French)
- gu (Gujarati)
- hi (Hindi)
- hu (Hungarian)
- it (Italian)
- ja (Japanese)
- kn (Kannada)
- ko (Korean)
- mr (Marathi)
- nl (Dutch)
- pa (Punjabi)
- ru (Russian)
- sv (Swedish)
- sw (Swahili)
- ta (Tamil)
- te (Telugu)
- tr (Turkish)
- zh (Chinese)
Visit http://localhost:5500
For the HTTP API test page, visit http://localhost:5500/openapi/
Exclude eSpeak (robotic voices):
$ docker run -it -p 5500:5500 synesthesiam/opentts:<LANGUAGE> --no-espeak
WAV Cache
You can have the OpenTTS server cache WAV files with --cache:
$ docker run -it -p 5500:5500 synesthesiam/opentts:<LANGUAGE> --cache
This will store WAV files in a temporary directory (inside the Docker container). A specific directory can also be used:
$ docker run -it -v /path/to/cache:/cache -p 5500:5500 synesthesiam/opentts:<LANGUAGE> --cache /cache
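If you start the container from a script, the docker run variants above can be assembled programmatically. The following is a minimal sketch, not part of OpenTTS: the function name opentts_docker_args and its parameters are made up for illustration, and it assumes that --no-espeak and --cache can be combined in one command (the examples above only show them separately).

```python
from typing import List, Optional


def opentts_docker_args(
    language: str,
    host_port: int = 5500,
    no_espeak: bool = False,
    cache: bool = False,
    cache_dir: Optional[str] = None,
) -> List[str]:
    """Build an argument list equivalent to the docker run commands above.

    Hypothetical helper: only the flags shown in this README are used.
    """
    args = ["docker", "run", "-it"]
    if cache_dir is not None:
        # Mount the host directory at /cache inside the container
        args += ["-v", f"{cache_dir}:/cache"]
    # The server listens on port 5500 inside the container
    args += ["-p", f"{host_port}:5500", f"synesthesiam/opentts:{language}"]
    if no_espeak:
        args.append("--no-espeak")
    if cache_dir is not None:
        # Cache WAV files in the mounted directory
        args += ["--cache", "/cache"]
    elif cache:
        # Cache WAV files in a temporary directory inside the container
        args.append("--cache")
    return args
```

The resulting list can be passed to subprocess.run(args) or printed with shlex.join(args) to inspect the command before running it.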
HTTP API Endpoints
See swagger.yaml
GET /api/tts
- ?voice - voice in the form tts:voice (e.g., espeak:en)
- ?text - text to speak
- ?cache - disable WAV cache with false
- Returns audio/wav bytes
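As a rough illustration of calling this endpoint from Python, the sketch below builds the request URL from the three query parameters listed above and fetches the WAV bytes with the standard library. The helper names build_tts_url and synthesize are made up for this example, and the base URL assumes the server from the Running section is listening on localhost:5500.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(base_url: str, voice: str, text: str, use_cache: bool = True) -> str:
    """Return a GET /api/tts URL for the given voice and text."""
    params = {"voice": voice, "text": text}
    if not use_cache:
        # ?cache=false disables the WAV cache for this request
        params["cache"] = "false"
    return f"{base_url.rstrip('/')}/api/tts?{urlencode(params)}"


def synthesize(base_url: str, voice: str, text: str, use_cache: bool = True) -> bytes:
    """Fetch audio/wav bytes from a running OpenTTS server."""
    with urlopen(build_tts_url(base_url, voice, text, use_cache)) as response:
        return response.read()
```

With a server running, something like open("hello.wav", "wb").write(synthesize("http://localhost:5500", "espeak:en", "Hello world")) should save the audio to disk.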
GET /api/voices
- Returns JSON object
  - Keys are voice ids in the form tts:voice
  - Values are objects with:
    - id - voice identifier for TTS system (string)
    - name - friendly name of voice (string)
    - gender - M or F (string)
    - language - 2-character language code (e.g., "en")
    - locale - lower-case locale code (e.g., "en-gb")
    - tts_name - name of text to speech system
- Filter voices using query parameters:
  - ?tts_name - only text to speech system(s)
  - ?language - only language(s)
  - ?locale - only locale(s)
  - ?gender - only gender(s)
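The response shape described above can be handled with a few lines of Python. This is a sketch under assumptions: build_voices_url and voices_by_language are hypothetical helper names, each filter is given a single value (how multiple values are passed is not covered here), and SAMPLE_VOICES is invented data that only mimics the documented fields, not real server output.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlencode


def build_voices_url(base_url: str, **filters: str) -> str:
    """Return a GET /api/voices URL, e.g. with language="en" or gender="F"."""
    query = urlencode({key: value for key, value in filters.items() if value})
    url = f"{base_url.rstrip('/')}/api/voices"
    return f"{url}?{query}" if query else url


def voices_by_language(voices: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group voice ids (the keys of the JSON object) by their language code."""
    grouped = defaultdict(list)
    for voice_id, info in voices.items():
        grouped[info["language"]].append(voice_id)
    return {language: sorted(ids) for language, ids in grouped.items()}


# Invented payload for illustration only; field values are assumptions
SAMPLE_VOICES = {
    "espeak:en": {"id": "en", "name": "en", "gender": "M",
                  "language": "en", "locale": "en-us", "tts_name": "espeak"},
    "espeak:de": {"id": "de", "name": "de", "gender": "M",
                  "language": "de", "locale": "de-de", "tts_name": "espeak"},
}
```

In practice the dictionary would come from json.load(urlopen(build_voices_url(...))) against a running server rather than from a hard-coded sample.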
GET /api/languages
- Returns JSON list of supported languages
- Filter languages using query parameters:
  - ?tts_name - only text to speech system(s)
SSML
A subset of SSML is supported:
- <speak> - wrap around SSML text
  - lang - set language for document
- <s> - sentence (disables automatic sentence breaking)
  - lang - set language for sentence
- <w> / <token> - word (disables automatic tokenization)
- <voice name="..."> - set voice of inner text
  - name - name or language of voice
    - Name format is tts:voice (e.g., "glow-speak:en-us_mary_ann") or tts:voice#speaker_id (e.g., "coqui-tts:en_vctk#p228")
    - If the value is one of the supported languages, a preferred voice is used (override with --preferred-voice <lang> <voice>)
- <say-as interpret-as=""> - force interpretation of inner text
  - interpret-as - one of "spell-out", "date", "number", "time", or "currency"
  - format - way to format text depending on interpret-as
    - number - one of "cardinal", "ordinal", "digits", "year"
    - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- <break time=""> - pause for given amount of time
  - time - seconds ("123s") or milliseconds ("123ms")
- <sub alias=""> - substitute alias for inner text
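When SSML is generated by a program, building it with an XML library avoids escaping mistakes. The sketch below uses Python's xml.etree.ElementTree to assemble a small document from the elements listed above. The function name build_ssml and the sentence text are made up for illustration, and the snippet only produces the SSML string; how it is submitted to the server is not shown here.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Assemble a small SSML document using the supported subset."""
    speak = ET.Element("speak")

    # <voice> switches the voice (and possibly the TTS system) for its content
    voice = ET.SubElement(speak, "voice", name="glow-speak:en-us_mary_ann")
    first = ET.SubElement(voice, "s")
    first.text = "Breaks are possible"
    ET.SubElement(voice, "break", time="0.5s")
    second = ET.SubElement(voice, "s")
    second.text = "between sentences."

    # <s lang="..."> sets the language for a single sentence
    german = ET.SubElement(speak, "s", lang="de")
    german.text = "Eine Sprache ist niemals genug"

    # <say-as> forces an interpretation; <sub> substitutes spoken text
    mixed = ET.SubElement(speak, "s")
    say_as = ET.SubElement(mixed, "say-as", {"interpret-as": "number", "format": "ordinal"})
    say_as.text = "1"
    say_as.tail = " place goes to "
    sub = ET.SubElement(mixed, "sub", alias="Open Text to Speech")
    sub.text = "OpenTTS"

    return ET.tostring(speak, encoding="unicode")
```

Because ElementTree escapes special characters in text and attribute values, user-supplied text containing "<" or "&" will not break the document.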
MaryTTS Compatible Endpoint
Use OpenTTS as a drop-in replacement for MaryTTS.
The voice format is <TTS_SYSTEM>:<VOICE_NAME>. Visit the OpenTTS web UI and copy/paste the "voice id" of your favorite voice here.
You may need to change the port in your docker run command to -p 59125:5500 for compatibility with existing software.
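For clients you write yourself, a MaryTTS-style request might look like the sketch below. It is an assumption-heavy example: the /process path and the INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, LOCALE, and VOICE parameter names come from MaryTTS's own HTTP interface, and which of them OpenTTS actually honours is not stated in this README. The helper name build_marytts_url is made up for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def build_marytts_url(base_url: str, voice: str, text: str,
                      locale: Optional[str] = None) -> str:
    """Return a MaryTTS-style /process URL (parameter names assumed from MaryTTS)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        # OpenTTS voice id in the form <TTS_SYSTEM>:<VOICE_NAME>
        "VOICE": voice,
    }
    if locale is not None:
        # MaryTTS clients usually send a locale such as "en_US"
        params["LOCALE"] = locale
    return f"{base_url.rstrip('/')}/process?{urlencode(params)}"
```

The base URL would typically use port 59125 when the container is started with -p 59125:5500.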
Larynx Voice Quality
On the Raspberry Pi, you may need to lower the quality of Larynx voices to get reasonable response times.
This is done by appending the quality level to the end of your voice:
tts:
  - platform: marytts
    voice: "larynx:harvard;low"
Available quality levels are high (the default), medium, and low.
Note that this only applies to Larynx and Glow-Speak voices.
Speaker ID
For multi-speaker models (currently just coqui-tts:en_vctk), you can append a speaker name or id to your voice:
tts:
  - platform: marytts
    voice: "coqui-tts:en_vctk#p228"
You can get the available speaker names from /api/voices or provide a 0-based index instead:
tts:
  - platform: marytts
    voice: "coqui-tts:en_vctk#42"
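The voice strings used in this section and the Larynx Voice Quality section follow a simple pattern: tts:voice, optionally followed by #speaker or ;quality. The sketch below splits such a string on the client side, for example to validate a configuration value before sending it. It is not how OpenTTS itself parses voices; VoiceSpec and parse_voice_spec are hypothetical names, and accepting both suffixes on one string (speaker first, then quality) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Quality levels listed in the Larynx Voice Quality section
QUALITY_LEVELS = ("high", "medium", "low")


@dataclass
class VoiceSpec:
    tts_name: str
    voice: str
    speaker: Optional[str] = None  # speaker name or 0-based index, kept as text
    quality: Optional[str] = None  # one of QUALITY_LEVELS


def parse_voice_spec(spec: str) -> VoiceSpec:
    """Split "tts:voice", "tts:voice#speaker", or "tts:voice;quality"."""
    quality = None
    if ";" in spec:
        spec, quality = spec.rsplit(";", 1)
        if quality not in QUALITY_LEVELS:
            raise ValueError(f"unknown quality level: {quality!r}")
    speaker = None
    if "#" in spec:
        spec, speaker = spec.split("#", 1)
    tts_name, separator, voice = spec.partition(":")
    if not separator or not tts_name or not voice:
        raise ValueError("voice must be in the form tts:voice")
    return VoiceSpec(tts_name, voice, speaker, quality)
```

For example, parse_voice_spec("coqui-tts:en_vctk#p228") yields a VoiceSpec with tts_name "coqui-tts", voice "en_vctk", and speaker "p228".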
Default Larynx Settings
Default settings for Larynx can be provided on the command-line:
- --larynx-quality - vocoder quality ("high", "medium", or "low", default: "high")
- --larynx-noise-scale - voice volatility (0-1, default: 0.667)
- --larynx-length-scale - voice speed (< 1 is faster, default: 1.0)
Building From Source
OpenTTS uses Docker buildx to build multi-platform images based on Debian bullseye.
Before building, make sure to download the voices you want to the voices directory. Each TTS system that uses external voices has a sub-directory with instructions on how to download voices.
If you only plan to build an image for your current platform, you should be able to run:
make <lang>
from the root of the cloned repository, where <lang> is one of the supported languages. If it builds successfully, you can run it with:
make <lang>-run
For example, the English image can be built and run with:
make en
make en-run
Under the hood, this does two things:
- Runs the configure script with --languages <lang>
- Runs docker buildx build with the appropriate arguments
You can manually run the configure script; see ./configure --help for more options. This script generates the following files (used by the build process):
- build_packages - Debian packages installed with apt-get during the build only
- packages - Debian packages installed with apt-get for runtime
- python_packages - Python packages installed with pip
- .dockerignore - Files that docker will ignore during building ("!" inverts)
- .dockerargs - Command-line arguments passed to docker buildx build
Multi-Platform images
To build an image for a different platform, you need to initialize a docker buildx builder:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
docker buildx create --config /etc/docker/buildx.conf --use --name mybuilder
docker buildx use mybuilder
docker buildx inspect --bootstrap
NOTE: For some reason, these steps have to be repeated each time you reboot. If you see errors like "Error while loading /usr/sbin/dpkg-split: No such file or directory", run docker buildx rm mybuilder and re-run the steps above.
When you run make, specify the platform(s) you want to build for:
DOCKER_PLATFORMS='--platform linux/amd64,linux/arm64,linux/arm/v7' make <lang>
You may place pre-compiled Python wheels in the download directory. They will be used during the installation of Python packages.