RealtimeTTS
Easy to use, low-latency text-to-speech library for realtime applications
About the Project
RealtimeTTS is a state-of-the-art text-to-speech (TTS) library designed for real-time applications. It stands out in its ability to quickly convert text streams into high-quality audio output with minimal latency.
Key Features
- Low Latency
- almost instantaneous text-to-speech conversion
- compatible with LLM outputs
- High-Quality Audio
- generates clear and natural-sounding speech
- Multiple TTS Engine Support
- supports OpenAI TTS, Elevenlabs, Azure Speech Services, Coqui TTS and System TTS
- Multilingual
- Robust and Reliable
- ensures continuous operation with a fallback mechanism
- switches to alternative engines in case of disruptions, keeping performance consistent for critical and professional use cases
Hint: check out RealtimeSTT, the input counterpart of this library, for speech-to-text capabilities. Together, they form a powerful realtime audio wrapper around large language models.
Updates
Latest Version: v0.3.41
New Features:
- 🔥NEW: predefined voices ✨Coqui Engine✨
- OpenAI TTS support
- more languages (Chinese, etc.)
- fallback engines (define alternate engines if one fails)
For more details, see the release history.
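The fallback idea mentioned above (alternate engines that take over if one fails) can be pictured with a small, library-independent sketch. This is not RealtimeTTS code: the `Synthesizer` type, `synthesize_with_fallback` and the two dummy engines are hypothetical names used only to illustrate the pattern.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for an engine: any callable that turns text into audio bytes.
Synthesizer = Callable[[str], bytes]


def synthesize_with_fallback(text: str, engines: Sequence[Synthesizer]) -> bytes:
    """Try each engine in order and return the first successful result."""
    errors = []
    for engine in engines:
        try:
            return engine(text)
        except Exception as exc:  # a real application would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} engines failed: {errors}")


def flaky_cloud_engine(text: str) -> bytes:
    # Simulates a disrupted network service.
    raise ConnectionError("service unavailable")


def local_engine(text: str) -> bytes:
    # Simulates a local engine that always works.
    return text.encode("utf-8")


# The first engine fails, so the second one produces the result.
audio = synthesize_with_fallback("Hello", [flaky_cloud_engine, local_engine])
```

Ordering the list from preferred to last-resort engine keeps the common case fast while still producing audio when a service is down.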
Tech Stack
This library uses:
Text-to-Speech Engines
- OpenAIEngine: OpenAI's TTS system offers 6 natural-sounding voices.
- CoquiEngine: High-quality local neural TTS.
- AzureEngine: Microsoft's leading TTS technology. 250,000 characters free per month.
- ElevenlabsEngine: Offers the best-sounding voices available.
- SystemEngine: Native engine for quick setup.
Sentence Boundary Detection
- NLTK Sentence Tokenizer: Uses the Natural Language Toolkit's sentence tokenizer for precise and efficient sentence segmentation.
By using "industry standard" components RealtimeTTS offers a reliable, high-end technological foundation for developing advanced voice solutions.
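To make the role of sentence boundary detection more concrete, here is a minimal sketch of splitting an incoming stream of text chunks into sentences. It deliberately uses a naive regular expression instead of the NLTK tokenizer, and `sentences_from_stream` is a hypothetical helper, not part of RealtimeTTS.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: ., ! or ? followed by whitespace. NLTK's tokenizer is far more
# careful (abbreviations, decimals, ...); this regex is only for illustration.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


result = list(sentences_from_stream(["Hello wor", "ld! How are", " you today?"]))
print(result)
```

Emitting a sentence as soon as its boundary is seen is what lets synthesis start before the full text has arrived.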
Installation
Simple installation:
pip install RealtimeTTS
This will install all the necessary dependencies, including a CPU-only version of PyTorch (needed for the Coqui engine).
Installation into a virtual environment with GPU support (Windows commands shown):
python -m venv env_realtimetts
env_realtimetts\Scripts\activate.bat
python.exe -m pip install --upgrade pip
pip install RealtimeTTS
pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
For more information, see the CUDA installation section.
Engine Requirements
Different engines supported by RealtimeTTS have unique requirements. Ensure you fulfill these requirements based on the engine you choose.
SystemEngine
The SystemEngine works out of the box using your system's built-in TTS capabilities. No additional setup is needed.
OpenAIEngine
To use the OpenAIEngine:
- set environment variable OPENAI_API_KEY
- install ffmpeg (see CUDA installation point 3)
AzureEngine
To use the AzureEngine, you will need:
- Microsoft Azure Text-to-Speech API key (provided via AzureEngine constructor parameter "speech_key" or in the environment variable AZURE_SPEECH_KEY)
- Microsoft Azure service region.
Make sure you have these credentials available and correctly configured when initializing the AzureEngine.
ElevenlabsEngine
For the ElevenlabsEngine, you need:
- Elevenlabs API key (provided via ElevenlabsEngine constructor parameter "api_key" or in the environment variable ELEVENLABS_API_KEY)
- mpv installed on your system (essential for streaming MPEG audio; Elevenlabs only delivers MPEG).
🔹 Installing mpv:
- macOS: brew install mpv
- Linux and Windows: Visit mpv.io for installation instructions.
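Since the cloud engines above all rely on an API key, it can help to check the environment before constructing an engine. The following is a minimal sketch using only the environment variable names listed above; `REQUIRED_ENV` and `missing_env_var` are hypothetical helpers, not part of RealtimeTTS. It does not cover keys passed via constructor parameters or the Azure service region.

```python
import os
from typing import Mapping, Optional

# Environment variables named in the requirements above.
REQUIRED_ENV = {
    "OpenAIEngine": "OPENAI_API_KEY",
    "AzureEngine": "AZURE_SPEECH_KEY",
    "ElevenlabsEngine": "ELEVENLABS_API_KEY",
}


def missing_env_var(engine_name: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for this engine, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = REQUIRED_ENV.get(engine_name)  # engines without an entry need no key
    if var is not None and not env.get(var):
        return var
    return None


# Report which keys are missing in the current environment.
for name in REQUIRED_ENV:
    missing = missing_env_var(name)
    if missing:
        print(f"{name}: set {missing} or pass the key to the constructor")
```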
CoquiEngine
Delivers high quality, local, neural TTS with voice-cloning.
Downloads a neural TTS model first. In most cases it will be fast enough for real-time use with GPU synthesis. Needs around 4-5 GB of VRAM.
- to clone a voice submit the filename of a wave file containing the source voice as cloning_reference_wav to the CoquiEngine constructor
- in my experience voice cloning works best with a 22050 Hz mono 16bit WAV file containing a short (~10-30 sec) sample
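Whether a reference file matches the format suggested above can be checked with Python's standard `wave` module. This is a small sketch; `describe_wav` and `matches_recommendation` are hypothetical helper names, and the thresholds simply mirror the recommendation (22050 Hz, mono, 16-bit, roughly 10 to 30 seconds).

```python
import wave


def describe_wav(path: str) -> dict:
    """Read the header of a WAV file and report the properties relevant for cloning."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "seconds": frames / rate,
        }


def matches_recommendation(info: dict) -> bool:
    """True for 22050 Hz, mono, 16-bit and roughly 10-30 seconds."""
    return (
        info["sample_rate"] == 22050
        and info["channels"] == 1
        and info["bits"] == 16
        and 10 <= info["seconds"] <= 30
    )
```

A file that fails the check may still work for cloning; the recommendation reflects experience, not a hard requirement.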
On most systems, GPU support is needed to run fast enough for real time; otherwise you will experience stuttering.
Quick Start
Here's a basic usage example:
from RealtimeTTS import TextToAudioStream, SystemEngine, AzureEngine, ElevenlabsEngine
engine = SystemEngine() # replace with your TTS engine
stream = TextToAudioStream(engine)
stream.feed("Hello world! How are you today?")
stream.play_async()
Feed Text
You can feed individual strings:
stream.feed("Hello, this is a sentence.")
Or you can feed generators and character iterators for real-time streaming:
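import openai  # this example uses the "old" OpenAI API (<1.0.0)

def write(prompt: str):
    # Yield the text chunks of a streamed chat completion as they arrive
    for chunk in openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        if (text_chunk := chunk["choices"][0]["delta"].get("content")) is not None:
            yield text_chunk

# Feed a generator: playback can start while the LLM is still producing text
text_stream = write("A three-sentence relaxing speech.")
stream.feed(text_stream)

# Feed a character iterator
char_iterator = iter("Streaming this character by character.")
stream.feed(char_iterator)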
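For the newer OpenAI Python package (1.0 and later), the same kind of generator can be sketched as follows. This is an illustration under assumptions rather than code taken from the library's tests: `text_from_chunks` is a hypothetical helper, and `write` needs the `openai` package plus a valid OPENAI_API_KEY to actually run.

```python
from typing import Iterable, Iterator


def text_from_chunks(chunks: Iterable) -> Iterator[str]:
    """Yield the text pieces of a streamed chat completion (openai>=1.0 chunk objects)."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        text = chunk.choices[0].delta.content
        if text is not None:
            yield text


def write(prompt: str) -> Iterator[str]:
    """Stream a chat completion; needs `pip install openai` (>=1.0) and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the helper above works without it

    client = OpenAI()
    chunks = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    yield from text_from_chunks(chunks)


# Usage, with `stream` being the TextToAudioStream from the Quick Start:
# stream.feed(write("A three-sentence relaxing speech."))
```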
Playback
Asynchronously:
import time

stream.play_async()
# Keep the main thread alive until playback has finished
while stream.is_playing():
    time.sleep(0.1)
Synchronously:
stream.play()
Testing the Library
The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeTTS library.
Note that most of the tests still rely on the "old" OpenAI API (<1.0.0). Usage of the new OpenAI API is demonstrated in openai_1.0_test.py.
simple_test.py
- Description: A "hello world" styled demonstration of the library's simplest usage.
complex_test.py
- Description: A comprehensive demonstration showcasing most of the features provided by the library.
coqui_test.py
- Description: Test of local coqui TTS engine.
translator.py
- Dependencies: Run pip install openai realtimestt.
- Description: Real-time translations into six different languages.
openai_voice_interface.py
- Dependencies: Run pip install openai realtimestt.
- Description: Wake word activated and voice based user interface to the OpenAI API.
advanced_talk.py
- Dependencies: Run pip install openai keyboard realtimestt.
- Description: Choose TTS engine and voice before starting AI conversation.
minimalistic_talkbot.py
- Dependencies: Run pip install openai realtimestt.
- Description: A basic talkbot in 20 lines of code.
simple_llm_test.py
- Dependencies: Run pip install openai.
- Description: Simple demonstration of how to integrate the library with large language models (LLMs).
test_callbacks.py
- Dependencies: Run pip install openai.
- Description: Showcases the callbacks and lets you check the latency times in a real-world application environment.
Pause, Resume & Stop
Pause the audio stream:
stream.pause()
Resume a paused stream:
stream.resume()
Stop the stream immediately:
stream.stop()
Requirements Explained
- Python 3.6+
- requests (>=2.31.0): to send HTTP requests for API calls and voice list retrieval
- PyAudio (>=0.2.13): to create an output audio stream
- stream2sentence (>=0.1.1): to split the incoming text stream into sentences
- pyttsx3 (>=2.90): System text-to-speech conversion engine
- azure-cognitiveservices-speech (>=1.31.0): Azure text-to-speech conversion engine
- elevenlabs (>=0.2.24): Elevenlabs text-to-speech conversion engine
Configuration
TextToAudioStream
Initialization Parameters
When you initialize the TextToAudioStream class, you have various options to customize its behavior. Here are the available parameters:
engine (BaseEngine)
- Type: BaseEngine
- Required: Yes
- Description: The underlying engine responsible for text-to-audio synthesis. You must provide an instance of BaseEngine or its subclass to enable audio synthesis.
on_text_stream_start (callable)
- Type: Callable function
- Required: No
- Description: This optional callback function is triggered when the text stream begins. Use it for any setup or logging you may need.
on_text_stream_stop (callable)
- Type: Callable function
- Required: No
- Description: This optional callback function is activated when the text stream ends. You can use this for cleanup tasks or logging.
on_audio_stream_start (callable)
- Type: Callable function
- Required: No
- Description: This optional callback function is invoked when the audio stream starts. Useful for UI updates or event logging.
on_audio_stream_stop (callable)
- Type: Callable function
- Required: No
- Description: This optional callback function is called when the audio stream stops. Ideal for resource cleanup or post-processing tasks.
on_character (callable)
- Type: Callable function
- Required: No
- Description: This optional callback function is called when a single character is processed.
level (int)
- Type: Integer
- Required: No
- Default: logging.WARNING
- Description: Sets the logging level for the internal logger. This can be any integer constant from Python's built-in logging module.
Example Usage:
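import logging
from RealtimeTTS import TextToAudioStream, SystemEngine

# The my_*_func names are placeholders for your own callback functions
engine = SystemEngine()  # substitute with your engine
stream = TextToAudioStream(
    engine=engine,
    on_text_stream_start=my_text_start_func,
    on_text_stream_stop=my_text_stop_func,
    on_audio_stream_start=my_audio_start_func,
    on_audio_stream_stop=my_audio_stop_func,
    level=logging.INFO
)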
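One practical use of the start callbacks is a rough latency measurement between the first text and the first audio. The sketch below is an assumption-laden illustration: `LatencyTracker` is a hypothetical class, and its methods accept `*args` because the exact arguments the library passes to each callback are not specified here.

```python
import time
from typing import Optional


class LatencyTracker:
    """Records when the text stream and the audio stream start."""

    def __init__(self) -> None:
        self.text_start: Optional[float] = None
        self.audio_start: Optional[float] = None

    def on_text_start(self, *args) -> None:
        # *args keeps the sketch tolerant of whatever the library passes in.
        self.text_start = time.perf_counter()

    def on_audio_start(self, *args) -> None:
        self.audio_start = time.perf_counter()

    def latency(self) -> Optional[float]:
        """Seconds between first text and first audio, or None if not yet known."""
        if self.text_start is None or self.audio_start is None:
            return None
        return self.audio_start - self.text_start


tracker = LatencyTracker()
# Intended wiring (not executed here):
# stream = TextToAudioStream(engine,
#                            on_text_stream_start=tracker.on_text_start,
#                            on_audio_stream_start=tracker.on_audio_start)
```

After playback has started, `tracker.latency()` gives the elapsed seconds, which is useful when comparing engines.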
Methods
play and play_async
These methods are responsible for executing the text-to-audio synthesis and playing the audio stream. The difference is that play is a blocking function, while play_async runs in a separate thread, allowing other operations to proceed.
fast_sentence_fragment (bool)
- Default: False
- Description: When set to True, the method will prioritize speed, generating and playing sentence fragments faster. This is useful for applications where latency matters.
buffer_threshold_seconds (float)
- Default: 2.0
- Description: Specifies the time in seconds for the buffering threshold, which impacts the smoothness and continuity of audio playback.
- How it Works: Before synthesizing a new sentence, the system checks if there is more audio material left in the buffer than the time specified by buffer_threshold_seconds. If so, it retrieves another sentence from the text generator, assuming that it can fetch and synthesize this new sentence within the time window provided by the remaining audio in the buffer. This process allows the text-to-speech engine to have more context for better synthesis, enhancing the user experience. A higher value ensures that there's more pre-buffered audio, reducing the likelihood of silence or gaps during playback.
- Hint: If you experience silence or breaks between sentences, consider raising this value to ensure smoother playback.
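The buffering rule described above can be modelled with a few lines of plain Python. This is a simplified reading of the description, not the library's actual implementation; `next_synthesis_batch` and its `buffered_seconds` argument are hypothetical names.

```python
from typing import Iterator, List


def next_synthesis_batch(
    sentences: Iterator[str],
    buffered_seconds: float,
    buffer_threshold_seconds: float = 2.0,
) -> List[str]:
    """Return the sentence(s) to synthesize next under the rule described above."""
    batch: List[str] = []
    first = next(sentences, None)
    if first is None:
        return batch  # the text stream is exhausted
    batch.append(first)
    # Enough audio is still queued: there is time to fetch one more sentence,
    # which gives the engine more context.
    if buffered_seconds > buffer_threshold_seconds:
        extra = next(sentences, None)
        if extra is not None:
            batch.append(extra)
    return batch


text = iter(["First.", "Second.", "Third."])
print(next_synthesis_batch(text, buffered_seconds=3.0))  # buffer above the threshold
print(next_synthesis_batch(text, buffered_seconds=0.5))  # buffer below the threshold
```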
minimum_sentence_length (int)
- Default: 3
- Description: Sets the minimum character length to consider a string as a sentence to be synthesized. This affects how text chunks are processed and played.
log_characters (bool)
- Default: False
- Description: Enable this to log the individual characters that are being processed for synthesis.
log_synthesized_text (bool)
- Default: False
- Description: When enabled, logs the text chunks as they are synthesized into audio. Helpful for auditing and debugging.
By understanding and setting these parameters and methods appropriately, you can tailor the TextToAudioStream to meet the specific needs of your application.
CUDA installation
These steps are recommended for those who require better performance and have a compatible NVIDIA GPU.
Note: to check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.
To use torch with support via CUDA please follow these steps:
Note: newer pytorch installations may (unverified) not need Toolkit (and possibly cuDNN) installation anymore.
1. Install NVIDIA CUDA Toolkit: For example, to install Toolkit 11.8:
  - Visit NVIDIA CUDA Toolkit Archive.
  - Select version 11.
  - Download and install the software.
2. Install NVIDIA cuDNN: For example, to install cuDNN 8.7.0 for CUDA 11.x:
  - Visit NVIDIA cuDNN Archive.
  - Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
  - Download and install the software.
3. Install ffmpeg:
  You can download an installer for your OS from the ffmpeg Website.
  Or use a package manager:
  - On Ubuntu or Debian:
    sudo apt update && sudo apt install ffmpeg
  - On Arch Linux:
    sudo pacman -S ffmpeg
  - On MacOS using Homebrew (https://brew.sh/):
    brew install ffmpeg
  - On Windows using Chocolatey (https://chocolatey.org/):
    choco install ffmpeg
  - On Windows using Scoop (https://scoop.sh/):
    scoop install ffmpeg
4. Install PyTorch with CUDA support:
  pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
5. Fix to resolve compatibility issues: If you run into library compatibility issues, try setting these libraries to fixed versions:
  pip install networkx==2.8.8
  pip install typing_extensions==4.8.0
  pip install fsspec==2023.6.0
  pip install imageio==2.31.6
  pip install numpy==1.24.3
  pip install requests==2.31.0
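After the installation, a short check can confirm whether PyTorch actually sees the GPU. This is a minimal sketch; `describe_cuda` is a hypothetical helper that returns a message instead of failing when PyTorch is missing.

```python
def describe_cuda() -> str:
    """Report whether the installed PyTorch build can use CUDA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} is installed, but CUDA is not available"
    return (
        f"PyTorch {torch.__version__} with CUDA {torch.version.cuda} "
        f"on {torch.cuda.get_device_name(0)}"
    )


print(describe_cuda())
```

If the message says CUDA is not available even though a GPU is present, the CPU-only PyTorch build is probably still installed.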
Acknowledgements
Huge shoutout to the team behind Coqui AI for being the first to give us local high-quality synthesis with real-time speed and even a clonable voice!
Contribution
Contributions are always welcome (e.g. PR to add a new engine).
License Information
Important Note:
While this library itself is open source, many of the engines it depends on are not: external engine providers often restrict commercial use in their free plans. This means the engines can be used for noncommercial projects, but commercial usage requires a paid plan.
Engine Licenses Summary:
CoquiEngine
- License: Open-source only for noncommercial projects.
- Commercial Use: Requires a paid plan.
- Details: CoquiEngine License
ElevenlabsEngine
- License: Open-source only for noncommercial projects.
- Commercial Use: Available with every paid plan.
- Details: ElevenlabsEngine License
AzureEngine
- License: Open-source only for noncommercial projects.
- Commercial Use: Available from the standard tier upwards.
- Details: AzureEngine License
SystemEngine
- License: Mozilla Public License 2.0 and GNU Lesser General Public License (LGPL) version 3.0.
- Commercial Use: Allowed under this license.
- Details: SystemEngine License
OpenAIEngine
- License: please read OpenAI Terms of Use
Disclaimer: This is a summarization of the licenses as understood at the time of writing. It is not legal advice. Please read and respect the licenses of the different engine providers yourself if you plan to use them in a project.
Author
Kolja Beigel
GitHub