Overview

XTTSVLLMTTSService streams audio from a self-hosted XTTSv2-vLLM streaming server, which serves Coqui XTTSv2 with vLLM for real-time, low-latency synthesis (~0.45s time-to-first-byte on the maintainer’s test hardware). The service is a thin HTTP client: the model server runs separately (as a Docker image, typically on a GPU host), and the service talks to it over an OpenAI-compatible streaming endpoint and emits TTSAudioRawFrame audio into your Pipecat pipeline. Voice-cloning conditioning is computed once from a short reference sample and cached for the lifetime of the service, so per-utterance requests stay fast.

Source Repository

Source code, examples, and issues for the XTTS-vLLM integration

PyPI Package

The pipecat-xtts-vllm package on PyPI

Model Server

The XTTSv2-vLLM streaming server this client connects to

Installation

This is a community-maintained package distributed separately from pipecat-ai:
uv add pipecat-xtts-vllm

Prerequisites

This service is a client for a self-hosted model server; there is no third-party account or API key.
  1. Run the model server. Deploy the XTTSv2-vLLM streaming server (Docker image, GPU recommended) and note its URL for base_url. See the server repository for deployment instructions.
  2. Provide a reference voice. A ~6-second reference audio clip (as bytes) is used for voice cloning. Alternatively, supply precomputed conditioning.
The integration code is MIT-licensed, but the underlying XTTSv2 model weights are distributed under the Coqui Public Model License (non-commercial use only). Review the server repository for licensing details before production use.
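
Before starting the service, it can help to confirm that the reference clip is roughly the recommended length. The sketch below is a minimal check that assumes the clip is an uncompressed PCM WAV file (as with reference.wav in the usage example). The helper names wav_duration_seconds and make_silent_wav are illustrative and not part of pipecat-xtts-vllm, and the server may accept other audio formats that this check cannot read.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an in-memory PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a 16-bit mono WAV of silence, used here only to demo the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


clip = make_silent_wav(6.0)
duration = wav_duration_seconds(clip)
print(f"reference clip is {duration:.1f}s long")
```

In a real bot you would pass Path("reference.wav").read_bytes() to wav_duration_seconds and warn if the result is far from 6 seconds.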

Configuration

base_url (str, required)
Base URL of the running XTTSv2-vLLM streaming server, e.g. http://localhost:8000.

reference_audio (bytes, default: None)
Reference voice sample (~6 seconds) used to compute voice-cloning conditioning. Required unless conditioning is provided.

conditioning (XTTSVLLMConditioning, default: None)
Optional precomputed conditioning (gpt_cond_latent_b64 + speaker_embeddings_b64). If set, it takes precedence over reference_audio and skips the conditioning request.

language (str, default: "en")
Language code for synthesis. XTTSv2 supports 17 languages: en (English), es (Spanish), fr (French), de (German), it (Italian), pt (Portuguese), pl (Polish), tr (Turkish), ru (Russian), nl (Dutch), cs (Czech), ar (Arabic), zh-cn (Chinese, Simplified), hu (Hungarian), ko (Korean), ja (Japanese), and hi (Hindi). Pass auto to let the server auto-detect the language.

chunk_size (int, default: 20)
Token-delta streaming chunk size sent to the server.

speed (float, default: 1.0)
Speech speed multiplier.

sample_rate (int, default: 24000)
Output audio sample rate in Hz (XTTSv2 native is 24 kHz, 16-bit mono PCM).

aiohttp_session (aiohttp.ClientSession, default: None)
Optional shared aiohttp session used for requests. If not provided, the service creates and manages its own session.
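
Two of the settings above lend themselves to a quick pre-flight check: the language code and the output audio format. The following is a minimal sketch, not part of pipecat-xtts-vllm. The names SUPPORTED_LANGUAGES, check_language, and pcm_bytes are hypothetical, and lowercasing the code before comparison is an assumption about what you want locally, not documented service behavior.

```python
# Language codes listed in the configuration reference (17 in total).
SUPPORTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
})


def check_language(code: str) -> str:
    """Return a normalized language code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()  # assumption: normalize locally before sending
    if normalized != "auto" and normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported XTTSv2 language code: {code!r}")
    return normalized


def pcm_bytes(seconds: float, sample_rate: int = 24000) -> int:
    """Bytes of 16-bit mono PCM needed to hold `seconds` of audio."""
    bytes_per_sample = 2  # 16-bit samples
    channels = 1          # mono
    return int(seconds * sample_rate) * bytes_per_sample * channels


print(check_language("zh-cn"))
print(pcm_bytes(1.0))  # one second at the native 24 kHz rate
```

At the native format, one second of audio is 24000 samples of 2 bytes each, which is useful when sizing buffers downstream of the service.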

Usage

from pathlib import Path

from pipecat.pipeline.pipeline import Pipeline
from pipecat_xtts_vllm import XTTSVLLMTTSService

tts = XTTSVLLMTTSService(
    base_url="http://localhost:8000",
    reference_audio=Path("reference.wav").read_bytes(),
    language="en",
)

# transport, stt, llm, and context_aggregator are assumed to be created elsewhere in your bot.
pipeline = Pipeline(
    [
        transport.input(),               # audio/user input
        stt,                             # speech to text
        context_aggregator.user(),       # add user text to context
        llm,                             # LLM generates response
        tts,                             # XTTS-vLLM synthesis
        transport.output(),              # stream audio back to user
        context_aggregator.assistant(),  # store assistant response
    ]
)

To reuse precomputed conditioning instead of a reference clip, import XTTSVLLMConditioning from pipecat_xtts_vllm alongside XTTSVLLMTTSService and pass an instance via the conditioning argument. See the foundational example in the source repository for a complete, runnable script.
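
If you store precomputed conditioning on disk, a small loader can validate it before the service starts. This is a sketch under stated assumptions: the JSON file layout and the helper load_conditioning_fields are hypothetical, and only the two field names (gpt_cond_latent_b64 and speaker_embeddings_b64) come from the configuration reference. Whether XTTSVLLMConditioning accepts them as keyword arguments is also an assumption, so that part is left commented out.

```python
import base64
import json
from pathlib import Path

REQUIRED_FIELDS = ("gpt_cond_latent_b64", "speaker_embeddings_b64")


def load_conditioning_fields(path: str) -> dict[str, str]:
    """Read the two base64 conditioning strings from a JSON file and validate them."""
    data = json.loads(Path(path).read_text())
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise KeyError(f"missing conditioning fields: {missing}")
    fields = {name: data[name] for name in REQUIRED_FIELDS}
    for value in fields.values():
        base64.b64decode(value, validate=True)  # raises binascii.Error on bad input
    return fields


# Assumption: XTTSVLLMConditioning accepts these two fields as keyword arguments.
# fields = load_conditioning_fields("conditioning.json")
# tts = XTTSVLLMTTSService(
#     base_url="http://localhost:8000",
#     conditioning=XTTSVLLMConditioning(**fields),
# )
```

Failing early on a missing or malformed field keeps the error close to its cause rather than surfacing later as a failed synthesis request.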

Compatibility

Tested with pipecat-ai v1.4.0. Check the source repository for the latest tested version and changelog.