Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTSService provides text-to-speech capabilities using Coqui’s XTTS (Cross-lingual Text-to-Speech) model through a streaming server. It supports multiple languages and custom voice cloning.

Installation

The service requires a running XTTS streaming server. You can start one using Docker:

docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121

For more information, see the official repository, coqui-ai/xtts-streaming-server on GitHub.
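Once the container is up, it can help to confirm that the server responds before wiring it into a pipeline. The sketch below uses only the standard library and assumes the server exposes a `/studio_speakers` endpoint that returns a JSON object keyed by speaker name; the helper names `studio_speakers_url` and `list_studio_speakers` are hypothetical and not part of Pipecat.

```python
import json
import urllib.request


def studio_speakers_url(base_url: str) -> str:
    """Build the speaker-listing URL (assumed endpoint path: /studio_speakers)."""
    return base_url.rstrip("/") + "/studio_speakers"


def list_studio_speakers(base_url: str = "http://localhost:8000",
                         timeout: float = 5.0) -> list[str]:
    """Return the speaker names reported by a running server.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(studio_speakers_url(base_url), timeout=timeout) as resp:
        speakers = json.load(resp)  # assumed shape: {speaker_name: {...}}
    return sorted(speakers)
```

Calling `list_studio_speakers()` against the Docker container started above should return the names that can be passed as `voice_id`, if the assumptions about the endpoint hold.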

Configuration

Constructor Parameters

voice_id
str
required

Voice identifier from studio speakers

language
Language
required

Language for speech synthesis

base_url
str
required

XTTS streaming server URL

aiohttp_session
aiohttp.ClientSession
required

HTTP client session for API requests

sample_rate
int
default: 24000

Output audio sample rate in Hz

text_filter
BaseTextFilter
default: None

Modifies text provided to the TTS. Learn more about the available filters.

Output Frames

Control Frames

TTSStartedFrame
Frame

Signals start of speech synthesis

TTSStoppedFrame
Frame

Signals completion of speech synthesis

Audio Frames

TTSAudioRawFrame
Frame

Contains generated audio data with:

  • PCM audio format
  • Specified sample rate (resampled from 24 kHz)
  • Single channel (mono)
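Because the audio is raw mono PCM, the playback duration of a frame follows directly from its byte length. The sketch below assumes 16-bit samples (2 bytes each), which the text above does not state explicitly; `pcm_duration_seconds` is a hypothetical helper, not a Pipecat function.

```python
def pcm_duration_seconds(audio: bytes, sample_rate: int = 24000,
                         num_channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Return the playback duration of a raw PCM buffer in seconds.

    Assumes 16-bit samples by default; adjust bytes_per_sample otherwise.
    """
    frame_size = num_channels * bytes_per_sample
    if len(audio) % frame_size:
        raise ValueError("buffer does not contain a whole number of samples")
    return len(audio) / (frame_size * sample_rate)
```

Under these assumptions, 48,000 bytes at the default 24,000 Hz correspond to one second of audio.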

Error Frames

ErrorFrame
Frame

Contains XTTS server error information

Methods

See the TTS base class methods for additional functionality.

Language Support

Supports multiple languages:

Language Code   Description            Service Code
Language.CS     Czech                  cs
Language.DE     German                 de
Language.EN     English                en
Language.ES     Spanish                es
Language.FR     French                 fr
Language.HI     Hindi                  hi
Language.HU     Hungarian              hu
Language.IT     Italian                it
Language.JA     Japanese               ja
Language.KO     Korean                 ko
Language.NL     Dutch                  nl
Language.PL     Polish                 pl
Language.PT     Portuguese             pt
Language.RU     Russian                ru
Language.TR     Turkish                tr
Language.ZH     Chinese (Simplified)   zh-cn
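The table can be read as a lookup from a language tag to the code sent to the server. The following is a minimal sketch of such a lookup using plain strings rather than Pipecat's `Language` enum. The fallback from a regional tag such as `en-US` to its base language is an assumption made for illustration, and `to_xtts_language` is a hypothetical name.

```python
from typing import Optional

# Service codes from the table above, keyed by lowercase language tag.
XTTS_LANGUAGE_CODES = {
    "cs": "cs", "de": "de", "en": "en", "es": "es",
    "fr": "fr", "hi": "hi", "hu": "hu", "it": "it",
    "ja": "ja", "ko": "ko", "nl": "nl", "pl": "pl",
    "pt": "pt", "ru": "ru", "tr": "tr", "zh": "zh-cn",
}


def to_xtts_language(tag: str) -> Optional[str]:
    """Map a tag such as 'en' or 'en-US' to an XTTS service code, or None."""
    normalized = tag.strip().lower()
    if normalized in XTTS_LANGUAGE_CODES:
        return XTTS_LANGUAGE_CODES[normalized]
    # Assumed fallback: drop the region and retry with the base language.
    base = normalized.split("-")[0]
    return XTTS_LANGUAGE_CODES.get(base)
```

Note that Chinese is the only entry whose service code differs from its language tag.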

Usage Example

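import aiohttp

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.xtts import XTTSService
from pipecat.transcriptions.language import Language


async def main():
    # Keep the HTTP session open for as long as the service is in use
    async with aiohttp.ClientSession() as session:
        # Configure service
        tts_service = XTTSService(
            voice_id="speaker_1",
            language=Language.EN,
            base_url="http://localhost:8000",
            aiohttp_session=session,
        )

        # Use in pipeline. text_input and audio_output are placeholders
        # for processors defined elsewhere in your application.
        pipeline = Pipeline([
            text_input,         # Produces text
            tts_service,        # Converts text to speech
            audio_output,       # Plays audio
        ])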

Streaming Process

The service handles audio streaming in chunks:

  1. Receives audio chunks from XTTS server
  2. Buffers chunks for processing
  3. Resamples audio to desired sample rate
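  4. Delivers audio frames in real-time

Each synthesis request sends the server a payload of this shape:

# Streaming configuration
payload = {
    "text": text,                      # Text to synthesize
    "language": language_code,         # Service code from the Language Support table
    "speaker_embedding": embeddings,   # Speaker embedding for the selected voice
    "gpt_cond_latent": latent_data,    # Conditioning latents for the selected voice
    "add_wav_header": False,           # Request raw PCM without a WAV header
    "stream_chunk_size": 20            # Chunk size the server uses when streaming
}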
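The payload construction and the buffering step can be sketched as two small helpers. This is an illustration under stated assumptions, not the service's actual implementation: `build_stream_payload` and `rechunk` are hypothetical names, the speaker entry is assumed to be a dict with `speaker_embedding` and `gpt_cond_latent` keys, and the fixed frame size is chosen by the caller.

```python
from typing import Iterable, Iterator


def build_stream_payload(text: str, language_code: str, speaker: dict,
                         chunk_size: int = 20) -> dict:
    """Assemble a request payload from one speaker entry (assumed dict shape)."""
    return {
        "text": text,
        "language": language_code,
        "speaker_embedding": speaker["speaker_embedding"],
        "gpt_cond_latent": speaker["gpt_cond_latent"],
        "add_wav_header": False,
        "stream_chunk_size": chunk_size,
    }


def rechunk(chunks: Iterable[bytes], frame_bytes: int) -> Iterator[bytes]:
    """Buffer arbitrarily sized chunks and yield fixed-size frames.

    Any remainder shorter than frame_bytes is yielded last.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_bytes:
            yield bytes(buffer[:frame_bytes])
            del buffer[:frame_bytes]
    if buffer:
        yield bytes(buffer)
```

Buffering like this decouples the size of the network chunks from the size of the audio frames delivered downstream, so each frame can be resampled and emitted as a whole unit.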

Metrics Support

The service collects processing metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • Character usage
  • Streaming performance
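To make the first metric concrete: time to first byte is the delay between sending a request and receiving the first non-empty audio chunk. The sketch below shows one way to measure it. `TTFBTimer` is a hypothetical class for illustration only and is not how the service records its metrics.

```python
import time
from typing import Callable, Optional


class TTFBTimer:
    """Record the delay between a request and its first non-empty chunk."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start: Optional[float] = None
        self.ttfb: Optional[float] = None

    def start(self) -> None:
        """Call immediately before sending the request."""
        self._start = self._clock()
        self.ttfb = None

    def on_chunk(self, chunk: bytes) -> None:
        """Call for every received chunk; only the first non-empty one counts."""
        if self._start is not None and self.ttfb is None and chunk:
            self.ttfb = self._clock() - self._start
```

The clock is injectable so the timer can be exercised without real delays; `time.monotonic` is the default because it is unaffected by system clock adjustments.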

Notes

  • Requires GPU for optimal performance
  • Supports real-time streaming
  • Automatic audio resampling
  • Buffer management for smooth playback
  • Thread-safe processing
  • Automatic error handling
  • Manages server connection lifecycle
  • Text preprocessing (removes periods and asterisks)
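The text preprocessing mentioned in the last note can be illustrated with a short function. This is a sketch of the described behavior only; `preprocess_text` is a hypothetical name.

```python
def preprocess_text(text: str) -> str:
    """Remove periods and asterisks before sending text to the server."""
    return text.replace(".", "").replace("*", "")
```

One consequence worth keeping in mind is that every period is removed, including those inside numbers, so under this sketch "3.5" would become "35".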