Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTS (Cross-lingual Text-to-Speech) provides multilingual voice synthesis with voice cloning capabilities through a locally hosted streaming server. The service supports real-time streaming and custom voice training using Coqui’s XTTS-v2 model.

Installation

XTTS requires a running streaming server. Start the server using Docker:
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121
GPU acceleration is recommended for real-time performance; the server requires an NVIDIA GPU with CUDA support. Setting COQUI_TOS_AGREED=1 accepts the Coqui model license.
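For longer-running deployments, the same server can be described in a Compose file. This is a sketch, not an official configuration: the image tag, environment variable, and port mapping mirror the docker run command above, and the GPU reservation syntax assumes a Docker Compose version that supports device reservations.

```yaml
services:
  xtts:
    image: ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121
    environment:
      - COQUI_TOS_AGREED=1   # accept the Coqui model license, as in the docker run command
    ports:
      - "8000:80"            # expose the server on localhost:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```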

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data (streaming, resampled from 24kHz)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - Server connection or processing errors

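The output frames above follow a fixed streaming contract: one TTSStartedFrame, zero or more TTSAudioRawFrame chunks, then one TTSStoppedFrame. A minimal illustration of that ordering, with plain strings standing in for the Pipecat frame classes (this is not the real API):

```python
# Hypothetical sketch of the output-frame ordering described above.
# Plain strings stand in for Pipecat frame classes.

def synthesize(text: str, chunk_size: int = 4):
    """Yield frame labels in the order an XTTS-style TTS service emits them."""
    yield "TTSStartedFrame"
    # Stream the "audio" in fixed-size chunks, like the real streaming server.
    for i in range(0, len(text), chunk_size):
        yield f"TTSAudioRawFrame[{text[i:i+chunk_size]}]"
    yield "TTSStoppedFrame"

frames = list(synthesize("hello world"))
```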
Language Support

XTTS supports multiple languages with cross-lingual capabilities:
| Language Code | Description          | Service Code |
|---------------|----------------------|--------------|
| Language.CS   | Czech                | cs           |
| Language.DE   | German               | de           |
| Language.EN   | English              | en           |
| Language.ES   | Spanish              | es           |
| Language.FR   | French               | fr           |
| Language.HI   | Hindi                | hi           |
| Language.HU   | Hungarian            | hu           |
| Language.IT   | Italian              | it           |
| Language.JA   | Japanese             | ja           |
| Language.KO   | Korean               | ko           |
| Language.NL   | Dutch                | nl           |
| Language.PL   | Polish               | pl           |
| Language.PT   | Portuguese           | pt           |
| Language.RU   | Russian              | ru           |
| Language.TR   | Turkish              | tr           |
| Language.ZH   | Chinese (Simplified) | zh-cn        |
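The table above is a direct lowercasing of the enum name except for Chinese, where the service code carries a region suffix. A minimal sketch of the mapping, using plain strings in place of Pipecat's Language enum:

```python
# Hypothetical helper mirroring the table above; plain strings stand in
# for Language enum members (Language.EN -> "EN", and so on).

XTTS_LANGUAGE_CODES = {
    "CS": "cs", "DE": "de", "EN": "en", "ES": "es", "FR": "fr",
    "HI": "hi", "HU": "hu", "IT": "it", "JA": "ja", "KO": "ko",
    "NL": "nl", "PL": "pl", "PT": "pt", "RU": "ru", "TR": "tr",
    "ZH": "zh-cn",  # Chinese is the one entry that is not just lowercased
}

def to_xtts_code(language: str):
    """Return the XTTS service code for a language, or None if unsupported."""
    return XTTS_LANGUAGE_CODES.get(language.upper())
```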

Usage Example

Basic Configuration

Initialize the XTTSService and use it in a pipeline:
import aiohttp

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.xtts.tts import XTTSService
from pipecat.transcriptions.language import Language


async def setup_tts():
    # Create an HTTP session for communicating with the XTTS streaming server
    session = aiohttp.ClientSession()

    tts = XTTSService(
        aiohttp_session=session,
        voice_id="Claribel Dervla",
        base_url="http://localhost:8000",
        language=Language.EN,
    )

    # Use in a pipeline (transport, stt, llm, and context_aggregator are
    # assumed to be configured elsewhere)
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    return pipeline, session

Dynamic Configuration

Update settings at runtime by pushing a TTSUpdateSettingsFrame to the XTTSService:
from pipecat.frames.frames import TTSUpdateSettingsFrame

await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "your-new-voice-id"})
)
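Conceptually, the settings dict in the frame is merged over the service's current configuration, leaving unmentioned keys untouched. A hedged sketch of that merge as a standalone helper (the real XTTSService applies updates through Pipecat's internal settings path, not this function):

```python
# Hypothetical illustration of how an update-settings frame folds into
# existing TTS settings; the key names ("voice", "language") follow the
# usage example above.

def apply_settings_update(current: dict, update: dict) -> dict:
    """Return a new settings dict with `update` merged over `current`."""
    merged = dict(current)
    merged.update(update)
    return merged

settings = {"voice": "Claribel Dervla", "language": "en"}
settings = apply_settings_update(settings, {"voice": "your-new-voice-id"})
```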

Metrics

The service provides comprehensive metrics:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Streaming Performance - Buffer utilization and chunk processing
Learn how to enable Metrics in your Pipeline.
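TTFB, as defined above, is the gap between issuing a request and receiving the first audio chunk. A minimal, framework-free sketch of measuring it around any chunk iterator (Pipecat collects these metrics itself; this helper is purely illustrative):

```python
import time

def measure_ttfb(chunks):
    """Consume a chunk iterator; return (ttfb_seconds, total_seconds, chunk_count)."""
    start = time.monotonic()
    ttfb = None
    count = 0
    for _ in chunks:
        if ttfb is None:
            ttfb = time.monotonic() - start  # first chunk arrived
        count += 1
    total = time.monotonic() - start
    return ttfb, total, count

# Usage with a stand-in generator for streamed audio chunks:
def fake_stream():
    for _ in range(3):
        yield b"\x00" * 320

ttfb, total, n = measure_ttfb(fake_stream())
```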

Additional Notes

  • Local Deployment: Runs entirely on local infrastructure for privacy
  • Voice Cloning: Supports custom voice training with audio samples
  • Cross-lingual: Can synthesize multiple languages with the same voice