Text-to-Speech
Cartesia
Text-to-speech services using Cartesia’s WebSocket and HTTP APIs
Overview
Cartesia provides two TTS service implementations: CartesiaTTSService
(WebSocket-based with streaming and word timestamps) and CartesiaHttpTTSService
(HTTP-based for simpler synthesis). The WebSocket service is recommended for real-time applications.
- API Reference - Complete API documentation and method details
- Cartesia Docs - Official Cartesia documentation and features
- Example Code - Working example with interruption handling
Installation
To use Cartesia services, install the required dependencies:
You’ll also need to set up your Cartesia API key as an environment variable: CARTESIA_API_KEY.
Get your API key by signing up at Cartesia.
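As a minimal sketch of reading the key at startup, the helper below fails early when the variable is missing. The function name `load_cartesia_api_key` is hypothetical and not part of any library; only the `CARTESIA_API_KEY` variable name comes from this page.

```python
import os


def load_cartesia_api_key(env_var: str = "CARTESIA_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is unset."""
    # Hypothetical helper for illustration; not part of the Cartesia integration.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the app")
    return api_key
```

Calling the helper once during startup surfaces a missing key before any connection attempt is made.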
Frames
Input
- TextFrame - Text content to synthesize into speech
- TTSSpeakFrame - Text that the TTS service should speak
- TTSUpdateSettingsFrame - Runtime configuration updates (e.g., voice)
- LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries
Output
- TTSStartedFrame - Signals start of synthesis
- TTSAudioRawFrame - Generated audio data chunks
- TTSStoppedFrame - Signals completion of synthesis
- ErrorFrame - Connection or processing errors
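The output order for a single utterance can be illustrated with a toy generator. The dataclasses below are stand-ins that only borrow the frame names from the list above; they are not the framework's real classes, and `fake_synthesize` is a hypothetical function that emits placeholder bytes instead of audio.

```python
from dataclasses import dataclass
from typing import Iterator, Union


# Stand-in frame types for illustration only; the real classes come from the framework.
@dataclass
class TTSStartedFrame:
    pass


@dataclass
class TTSAudioRawFrame:
    audio: bytes
    sample_rate: int


@dataclass
class TTSStoppedFrame:
    pass


Frame = Union[TTSStartedFrame, TTSAudioRawFrame, TTSStoppedFrame]


def fake_synthesize(text: str, sample_rate: int = 24000, chunk_size: int = 4) -> Iterator[Frame]:
    """Yield a start frame, fake audio chunks, then a stop frame for one utterance."""
    yield TTSStartedFrame()
    fake_audio = text.encode("utf-8")  # placeholder bytes, not real PCM audio
    for start in range(0, len(fake_audio), chunk_size):
        chunk = fake_audio[start:start + chunk_size]
        yield TTSAudioRawFrame(audio=chunk, sample_rate=sample_rate)
    yield TTSStoppedFrame()


frames = list(fake_synthesize("hello world"))
print([type(frame).__name__ for frame in frames])
```

A downstream consumer can treat everything between the start and stop frames as one utterance, which is what makes interruption handling tractable.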
Service Comparison
Feature | CartesiaTTSService (WebSocket) | CartesiaHttpTTSService (HTTP) |
---|---|---|
Streaming | ✅ Real-time chunks | ❌ Single audio block |
Word Timestamps | ✅ Precise timing | ❌ Not available |
Interruption | ✅ Advanced handling | ⚠️ Basic support |
Latency | 🚀 Low | 📈 Higher |
Best For | Interactive apps | Batch processing |
Language Support
Supports multiple languages through the Language enum:
Language Code | Description | Service Code |
---|---|---|
Language.DE | German | de |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.HI | Hindi | hi |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RU | Russian | ru |
Language.SV | Swedish | sv |
Language.TR | Turkish | tr |
Language.ZH | Chinese (Mandarin) | zh |
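The table above amounts to a lookup from enum member to service code. The sketch below mirrors it with a stand-in `Language` enum (not the real one) and a hypothetical `to_service_code` helper; the extra `AR` member is added only to show what an unsupported language could look like.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Stand-in enum for illustration; the real Language enum has more members."""
    DE = "de"
    EN = "en"
    ES = "es"
    FR = "fr"
    HI = "hi"
    IT = "it"
    JA = "ja"
    KO = "ko"
    NL = "nl"
    PL = "pl"
    PT = "pt"
    RU = "ru"
    SV = "sv"
    TR = "tr"
    ZH = "zh"
    AR = "ar"  # hypothetical member that is absent from the table above


# Only the 15 languages listed in the table are mapped to a service code.
SERVICE_CODES = {lang: lang.value for lang in Language if lang is not Language.AR}


def to_service_code(language: Language) -> Optional[str]:
    """Return the service code for a language, or None if it is not in the table."""
    return SERVICE_CODES.get(language)


print(to_service_code(Language.EN))
print(to_service_code(Language.AR))
```

Returning `None` for an unmapped language lets the caller decide whether to fall back to a default or raise an error.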
Usage Example
WebSocket Service (Recommended)
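A minimal construction sketch follows. It assumes the service lives in the Pipecat package at the import paths shown and accepts `api_key`, `voice_id`, and an `InputParams` object with a `language` field; these may differ between releases, so check the API reference for your installed version. The voice ID is a placeholder.

```python
import os


def build_websocket_tts():
    """Create the WebSocket TTS service; imports are deferred until the call."""
    # Assumed import paths; they may differ between releases of the package.
    from pipecat.services.cartesia.tts import CartesiaTTSService
    from pipecat.transcriptions.language import Language

    return CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="your-voice-id",  # placeholder; use a voice from your Cartesia account
        params=CartesiaTTSService.InputParams(language=Language.EN),
    )
```

The returned service is intended to sit in a pipeline between the LLM and the audio output transport.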
HTTP Service
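The HTTP variant can be sketched the same way. The import path and the `api_key` and `voice_id` arguments are assumptions based on the class name used on this page, and some releases may require additional arguments; treat this as a starting point rather than a definitive signature.

```python
import os


def build_http_tts():
    """Create the HTTP TTS service; the import is deferred until the call."""
    # Assumed import path; it may differ between releases of the package.
    from pipecat.services.cartesia.tts import CartesiaHttpTTSService

    return CartesiaHttpTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="your-voice-id",  # placeholder; use a voice from your Cartesia account
    )
```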
Metrics
Both services provide:
- Time to First Byte (TTFB) - Latency from text input to first audio
- Processing Duration - Total synthesis time
- Usage Metrics - Character count and synthesis statistics
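To make the three metrics concrete, here is a self-contained sketch that measures them around a simulated stream. `fake_tts_stream` is a hypothetical stand-in for a streaming synthesis call, and this shows what the numbers mean rather than how the services compute them internally.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def fake_tts_stream(text: str) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields fake audio chunks."""
    for word in text.split():
        await asyncio.sleep(0.01)  # simulated network and synthesis delay
        yield word.encode("utf-8")


async def measure(text: str) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, processing_seconds, character_count) for one request."""
    start = time.perf_counter()
    ttfb = None
    async for _chunk in fake_tts_stream(text):
        if ttfb is None:
            # Time to First Byte: elapsed time until the first chunk arrives.
            ttfb = time.perf_counter() - start
    processing = time.perf_counter() - start  # total time for the whole stream
    return (ttfb if ttfb is not None else processing), processing, len(text)


ttfb, processing, characters = asyncio.run(measure("hello from the metrics sketch"))
print(f"TTFB={ttfb:.3f}s processing={processing:.3f}s characters={characters}")
```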
Additional Notes
- WebSocket Recommended: Use CartesiaTTSService for low-latency streaming and accurate context updates with word timestamps
- Connection Management: WebSocket lifecycle is handled automatically with reconnection support
- Sample Rate: Set globally in PipelineParams rather than per-service for consistency
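A sketch of the global sample rate setting is shown below. It assumes `PipelineParams` is importable from `pipecat.pipeline.task` and exposes an `audio_out_sample_rate` field; the 24000 Hz value is only an example, and both names should be checked against the installed version.

```python
def build_pipeline_params(sample_rate: int = 24000):
    """Create pipeline parameters with one output sample rate for all services."""
    # Assumed import path and field name; verify against your installed release.
    from pipecat.pipeline.task import PipelineParams

    return PipelineParams(audio_out_sample_rate=sample_rate)
```

Setting the rate once at the pipeline level avoids mismatches between the TTS service and the output transport.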