XTTS
Text-to-speech service implementation using Coqui’s XTTS streaming server
Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.
Overview
XTTSService provides text-to-speech capabilities using Coqui’s XTTS (Cross-lingual Text-to-Speech) model through a streaming server. It supports multiple languages and custom voice cloning.
Installation
The service requires a running XTTS streaming server. You can start one using Docker:
For more information, visit the official repository.
Configuration
Constructor Parameters
Voice identifier from studio speakers
Language for speech synthesis
XTTS streaming server URL
HTTP client session for API requests
Output audio sample rate in Hz
Modifies text provided to the TTS. Learn more about the available filters.
Output Frames
Control Frames
Signals start of speech synthesis
Signals completion of speech synthesis
Audio Frames
Contains generated audio data with:
- PCM audio format
- Specified sample rate (resampled from 24kHz)
- Single channel (mono)
Error Frames
Contains XTTS server error information
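The frame types above imply an ordering for a single synthesis request. The sketch below illustrates it with stand-in classes: `Started`, `Stopped`, `Audio`, `Error` and `synthesize` are hypothetical names, not the library's frame classes. Emitting only an error frame when the server fails is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Union


@dataclass
class Started:
    """Stand-in for the control frame that signals start of synthesis."""


@dataclass
class Stopped:
    """Stand-in for the control frame that signals completion of synthesis."""


@dataclass
class Audio:
    """Stand-in for an audio frame: PCM bytes, sample rate, mono."""
    audio: bytes
    sample_rate: int
    num_channels: int = 1


@dataclass
class Error:
    """Stand-in for an error frame carrying server error information."""
    error: str


Frame = Union[Started, Stopped, Audio, Error]


def synthesize(
    chunks: Iterable[bytes],
    sample_rate: int,
    server_error: Optional[str] = None,
) -> Iterator[Frame]:
    """Yield frames in the order a consumer would see them."""
    if server_error is not None:
        # Assumption: a failed request produces an error frame and nothing else.
        yield Error(server_error)
        return
    yield Started()
    for chunk in chunks:
        yield Audio(chunk, sample_rate)
    yield Stopped()
```

A consumer can rely on audio frames arriving only between the start and stop markers, which makes it straightforward to detect when a bot starts and stops speaking.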
Methods
See the TTS base class methods for additional functionality.
Language Support
The service supports the following languages, each mapped to an XTTS service code:
Language Code | Description | Service Code |
---|---|---|
Language.CS | Czech | cs |
Language.DE | German | de |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RU | Russian | ru |
Language.TR | Turkish | tr |
Language.ZH | Chinese (Simplified) | zh-cn |
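The table can be read as a simple lookup. The sketch below is illustrative only: `SERVICE_CODES` and `to_service_code` are hypothetical names, and plain strings stand in for the Language enum members.

```python
from typing import Optional

# Mirrors the table above. Keys are the lowercase names of the Language
# members; plain strings replace the enum (an assumption for this sketch).
SERVICE_CODES = {
    "cs": "cs", "de": "de", "en": "en", "es": "es",
    "fr": "fr", "hi": "hi", "hu": "hu", "it": "it",
    "ja": "ja", "ko": "ko", "nl": "nl", "pl": "pl",
    "pt": "pt", "ru": "ru", "tr": "tr",
    "zh": "zh-cn",  # the only entry whose service code differs from its key
}


def to_service_code(language: str) -> Optional[str]:
    """Return the XTTS service code for a language key, or None if unsupported."""
    return SERVICE_CODES.get(language.lower())
```

Returning `None` for an unknown language lets the caller decide whether to fall back to a default or report an error.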
Usage Example
Streaming Process
The service handles audio streaming in chunks:
- Receives audio chunks from XTTS server
- Buffers chunks for processing
- Resamples audio to desired sample rate
- Delivers audio frames in real-time
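The steps above can be sketched with the standard library alone. This is a minimal illustration, not the service's implementation: `stream_audio` and `resample_linear` are hypothetical names, the input is assumed to be 16-bit mono PCM in the host's byte order, and linear interpolation stands in for whatever resampler the service actually uses.

```python
import array
from typing import Iterable, Iterator

SOURCE_RATE = 24000  # native XTTS output rate stated in this document


def resample_linear(samples: array.array, src_rate: int, dst_rate: int) -> array.array:
    """Resample 16-bit mono samples by linear interpolation (simple, low quality)."""
    if src_rate == dst_rate or len(samples) == 0:
        return array.array("h", samples)
    out_len = len(samples) * dst_rate // src_rate
    out = array.array("h")
    for i in range(out_len):
        pos = i * src_rate / dst_rate          # fractional index into the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return out


def stream_audio(chunks: Iterable[bytes], dst_rate: int) -> Iterator[bytes]:
    """Buffer incoming bytes and yield resampled PCM as soon as whole samples exist."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        usable = len(buffer) - (len(buffer) % 2)  # only complete 2-byte samples
        if usable == 0:
            continue
        samples = array.array("h")
        samples.frombytes(bytes(buffer[:usable]))
        del buffer[:usable]                       # keep any trailing odd byte
        yield resample_linear(samples, SOURCE_RATE, dst_rate).tobytes()
```

Buffering matters because a network chunk can end in the middle of a sample. Resampling each chunk independently, as done here, is a simplification that can introduce small discontinuities at chunk boundaries.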
Frame Flow
Metrics Support
The service collects processing metrics:
- Time to First Byte (TTFB)
- Processing duration
- Character usage
- Streaming performance
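As a rough illustration of how metrics such as TTFB and processing duration can be derived from a stream, the sketch below times an async byte stream. `measure_stream` and `fake_stream` are hypothetical helpers written for this example; they are not part of the service's API.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_stream(
    stream: AsyncIterator[bytes],
) -> Tuple[Optional[float], float, int]:
    """Consume a stream and return (ttfb_seconds, total_seconds, total_bytes)."""
    start = time.monotonic()
    ttfb: Optional[float] = None
    total_bytes = 0
    async for chunk in stream:
        if ttfb is None and chunk:
            ttfb = time.monotonic() - start  # time until the first non-empty chunk
        total_bytes += len(chunk)
    return ttfb, time.monotonic() - start, total_bytes


async def fake_stream() -> AsyncIterator[bytes]:
    """Stand-in for a server response: two chunks, each after a short delay."""
    for chunk in (b"\x00\x01", b"\x02\x03\x04"):
        await asyncio.sleep(0.01)
        yield chunk
```

Calling `asyncio.run(measure_stream(fake_stream()))` returns the three measurements for the fake stream. Character usage would be tracked separately, for example as the length of the text submitted for synthesis.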
Notes
- A GPU is recommended for optimal performance
- Supports real-time streaming
- Automatic audio resampling
- Buffer management for smooth playback
- Thread-safe processing
- Automatic error handling
- Manages server connection lifecycle
- Preprocesses text before synthesis by removing periods and asterisks
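The text preprocessing mentioned in the last note amounts to a small string transformation. A minimal sketch, assuming plain character removal is all that happens (`preprocess_text` is a hypothetical name):

```python
def preprocess_text(text: str) -> str:
    """Remove periods and asterisks from the text before it is synthesized."""
    return text.replace(".", "").replace("*", "")
```

For example, `preprocess_text("Hello. *World*.")` returns `"Hello World"`. Other punctuation, such as commas and question marks, is left untouched.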