Overview
XTTSVLLMTTSService streams audio from a self-hosted XTTSv2-vLLM streaming
server: Coqui XTTSv2 served with vLLM for real-time, low-latency synthesis
(~0.45 s time-to-first-byte on the maintainer's test hardware). It is a thin
HTTP client: the model server runs separately (as a Docker image, typically on
a GPU host), and the service talks to it over an OpenAI-compatible streaming
endpoint, emitting TTSAudioRawFrame audio into your Pipecat pipeline.
Voice cloning conditioning is computed once from a short reference sample and
cached for the lifetime of the service, so per-utterance requests stay fast.
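The compute-once behaviour described above can be illustrated with a small, generic cache. This is only a sketch of the pattern, not the package's actual implementation: `ConditioningCache`, `fake_compute`, and the placeholder payload values are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ConditioningCache:
    """Compute a value on first use, then reuse it for later requests."""

    def __init__(self, compute: Callable[[], Awaitable[dict]]):
        self._compute = compute  # hypothetical coroutine that would call the server
        self._value: Optional[dict] = None
        self._lock = asyncio.Lock()  # avoids duplicate requests under concurrency

    async def get(self) -> dict:
        if self._value is None:
            async with self._lock:
                # Re-check after acquiring the lock: another task may have
                # filled the cache while this one was waiting.
                if self._value is None:
                    self._value = await self._compute()
        return self._value


async def demo() -> int:
    """Request the conditioning five times concurrently; return the compute count."""
    calls = 0

    async def fake_compute() -> dict:
        # Stand-in for the real conditioning request (assumption, not the real API).
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)
        return {"gpt_cond_latent_b64": "<base64>", "speaker_embeddings_b64": "<base64>"}

    cache = ConditioningCache(fake_compute)
    await asyncio.gather(*(cache.get() for _ in range(5)))
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The double check around the lock is what keeps concurrent first requests from each triggering their own conditioning computation.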
Source Repository
Source code, examples, and issues for the XTTS-vLLM integration
PyPI Package
The pipecat-xtts-vllm package on PyPI
Model Server
The XTTSv2-vLLM streaming server this client connects to
Installation
This is a community-maintained package distributed separately from pipecat-ai. Install the pipecat-xtts-vllm package from PyPI.
Prerequisites
This service is a client for a self-hosted model server; there is no third-party account or API key.
- Run the model server. Deploy the XTTSv2-vLLM streaming server (Docker image, GPU recommended) and note its URL for base_url. See the server repository for deployment instructions.
- Provide a reference voice. A ~6-second reference audio clip (as bytes) is used for voice cloning. Alternatively, supply precomputed conditioning.
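Since the reference clip is supplied as bytes, a small loader can read it from disk and sanity-check its length. The sketch below assumes the clip is a WAV file, which the text above does not specify; `load_reference_audio` and `wav_duration_seconds` are hypothetical helpers, not part of the package.

```python
import io
import wave
from pathlib import Path


def load_reference_audio(path: str) -> bytes:
    """Read the reference clip from disk as raw bytes."""
    data = Path(path).read_bytes()
    if not data:
        raise ValueError(f"reference audio file is empty: {path}")
    return data


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of a WAV clip (assumes WAV-formatted input)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

A clip far from the suggested ~6 seconds could be flagged before it is handed to the service, for example by logging a warning when `wav_duration_seconds` returns less than a few seconds.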
The integration code is MIT-licensed, but the underlying XTTSv2 model
weights are distributed under the Coqui Public Model License (non-commercial
use only). Review the server repository for licensing details before
production use.
Configuration
base_url: Base URL of the running XTTSv2-vLLM streaming server, e.g. http://localhost:8000.

reference_audio: Reference voice sample (~6 seconds) used to compute voice-cloning conditioning. Required unless conditioning is provided.

conditioning: Optional precomputed conditioning (gpt_cond_latent_b64 + speaker_embeddings_b64). If set, it takes precedence over reference_audio and skips the conditioning request.

Language code for synthesis. XTTSv2 supports 17 languages: en (English), es (Spanish), fr (French), de (German), it (Italian), pt (Portuguese), pl (Polish), tr (Turkish), ru (Russian), nl (Dutch), cs (Czech), ar (Arabic), zh-cn (Chinese, Simplified), hu (Hungarian), ko (Korean), ja (Japanese), and hi (Hindi). Pass auto to let the server auto-detect the language.

Token-delta streaming chunk size sent to the server.

Speech speed multiplier.

Output audio sample rate in Hz (XTTSv2 native is 24 kHz, 16-bit mono PCM).

Optional shared aiohttp session used for requests. If not provided, the service creates and manages its own session.
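The rules stated in this section (the supported language codes, conditioning taking precedence over the reference audio, and the native PCM format) can be expressed as a few plain checks. This is a minimal sketch for validating settings before constructing the service; the helper names are hypothetical and the package may perform its own validation differently.

```python
from typing import Optional

# The 17 language codes listed above; "auto" is handled separately.
SUPPORTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
})


def validate_language(code: str) -> str:
    """Accept a listed language code or 'auto'; reject anything else."""
    if code != "auto" and code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return code


def pick_voice_source(reference_audio: Optional[bytes],
                      conditioning: Optional[object]) -> str:
    """Mirror the documented precedence: conditioning wins over reference audio."""
    if conditioning is not None:
        return "conditioning"
    if reference_audio:
        return "reference_audio"
    raise ValueError("provide reference_audio or precomputed conditioning")


def pcm_bytes_per_second(sample_rate: int = 24000,
                         sample_width_bytes: int = 2,
                         channels: int = 1) -> int:
    """Byte rate of raw PCM; defaults match 24 kHz, 16-bit mono."""
    return sample_rate * sample_width_bytes * channels
```

At the native format, one second of audio is 24,000 samples at 2 bytes each on a single channel, so `pcm_bytes_per_second()` returns 48000.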
Usage
To use precomputed conditioning, import XTTSVLLMConditioning alongside the
service (from pipecat_xtts_vllm import XTTSVLLMConditioning, XTTSVLLMTTSService)
and pass it via the conditioning= argument.
See the foundational example
in the source repository for a complete, runnable script.
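Passing precomputed conditioning to the service might look roughly like the sketch below. The import path, the `base_url` parameter, and the `conditioning=` argument come from this page; that XTTSVLLMConditioning accepts `gpt_cond_latent_b64` and `speaker_embeddings_b64` as keyword arguments is an assumption, and `build_tts_service` is a hypothetical wrapper. Check the source repository for the actual signatures.

```python
def build_tts_service(base_url: str,
                      gpt_cond_latent_b64: str,
                      speaker_embeddings_b64: str):
    """Construct the TTS service from precomputed conditioning values."""
    # Imported lazily so this helper can be defined without the package installed.
    from pipecat_xtts_vllm import XTTSVLLMConditioning, XTTSVLLMTTSService

    # Assumption: the conditioning object takes these two fields as keywords.
    conditioning = XTTSVLLMConditioning(
        gpt_cond_latent_b64=gpt_cond_latent_b64,
        speaker_embeddings_b64=speaker_embeddings_b64,
    )
    # With conditioning supplied, no reference audio is needed and the
    # conditioning request to the server is skipped.
    return XTTSVLLMTTSService(base_url=base_url, conditioning=conditioning)
```

The returned service would then be placed in a Pipecat pipeline like any other TTS service, with `base_url` pointing at the running model server (for example http://localhost:8000).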
Compatibility
Tested with pipecat-ai v1.4.0. Check the source
repository for the latest
tested version and changelog.