Overview

Cartesia provides two TTS service implementations: CartesiaTTSService (WebSocket-based with streaming and word timestamps) and CartesiaHttpTTSService (HTTP-based for simpler synthesis). The WebSocket service is recommended for real-time applications.

Installation

To use Cartesia services, install the required dependencies:

pip install "pipecat-ai[cartesia]"

You’ll also need to set up your Cartesia API key as an environment variable: CARTESIA_API_KEY.

Get your API key by signing up at Cartesia.
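
If the variable is unset, the problem may otherwise only surface later, when the service first tries to connect. A minimal sketch of an early check, assuming a hypothetical helper named require_cartesia_api_key (it is not part of Pipecat or the Cartesia SDK):

```python
import os


def require_cartesia_api_key(env=None) -> str:
    """Return CARTESIA_API_KEY, or raise a clear error if it is unset or blank.

    Hypothetical helper for illustration only; not part of Pipecat.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CARTESIA_API_KEY is not set; export it before starting the pipeline."
        )
    return key
```

The returned value can then be passed as api_key when constructing either service.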

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that the TTS service should speak
  • TTSUpdateSettingsFrame - Runtime configuration updates (e.g., voice)
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data chunks
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - Connection or processing errors

Service Comparison

| Feature | CartesiaTTSService (WebSocket) | CartesiaHttpTTSService (HTTP) |
| --- | --- | --- |
| Streaming | ✅ Real-time chunks | ❌ Single audio block |
| Word Timestamps | ✅ Precise timing | ❌ Not available |
| Interruption | ✅ Advanced handling | ⚠️ Basic support |
| Latency | 🚀 Low | 📈 Higher |
| Best For | Interactive apps | Batch processing |

Language Support

Both services take a language through the Language enum. Each enum value maps to the service code sent to Cartesia:

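| Language Code | Description | Service Code |
| --- | --- | --- |
| Language.DE | German | de |
| Language.EN | English | en |
| Language.ES | Spanish | es |
| Language.FR | French | fr |
| Language.HI | Hindi | hi |
| Language.IT | Italian | it |
| Language.JA | Japanese | ja |
| Language.KO | Korean | ko |
| Language.NL | Dutch | nl |
| Language.PL | Polish | pl |
| Language.PT | Portuguese | pt |
| Language.RU | Russian | ru |
| Language.SV | Swedish | sv |
| Language.TR | Turkish | tr |
| Language.ZH | Chinese (Mandarin) | zh |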
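
The mapping in the table can be sketched as a plain lookup. This is a standalone illustration that uses strings in place of the Language enum so it runs without Pipecat. The names CARTESIA_LANGUAGE_CODES and to_service_code are hypothetical, and falling back from a regional tag such as "pt-BR" to its base language is an assumption of this sketch, not documented service behavior.

```python
from typing import Optional

# Service codes copied from the table above, keyed by the Language member name.
CARTESIA_LANGUAGE_CODES = {
    "DE": "de", "EN": "en", "ES": "es", "FR": "fr", "HI": "hi",
    "IT": "it", "JA": "ja", "KO": "ko", "NL": "nl", "PL": "pl",
    "PT": "pt", "RU": "ru", "SV": "sv", "TR": "tr", "ZH": "zh",
}


def to_service_code(language: str) -> Optional[str]:
    """Map a tag such as "EN" or "pt-BR" to a service code, or None if unsupported.

    Hypothetical helper: the regional-tag fallback is an assumption.
    """
    # Normalize "pt_BR" / "pt-BR" / "pt" to the base language "PT".
    base = language.replace("_", "-").split("-")[0].upper()
    return CARTESIA_LANGUAGE_CODES.get(base)
```

Returning None for an unlisted language leaves the decision to the caller, for example whether to fall back to English or to raise an error.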

Usage Example

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transcriptions.language import Language

# Configure WebSocket service
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-2",
    params=CartesiaTTSService.InputParams(
        language=Language.EN,
        speed="normal"
    )
)

# Use in a pipeline (transport, stt, and llm are assumed to be configured elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,  # Word timestamps enable precise context updates
    transport.output()
])

HTTP Service

from pipecat.services.cartesia.tts import CartesiaHttpTTSService

# For simpler, non-streaming use cases
http_tts = CartesiaHttpTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-2",
    params=CartesiaHttpTTSService.InputParams(
        language=Language.EN
    )
)

Metrics

Both services provide:

  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Usage Metrics - Character count and synthesis statistics
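
To make these definitions concrete, the sketch below times a stand-in streaming call. It is not how Pipecat collects metrics internally: fake_tts_stream and measure are hypothetical names, and the dummy stream only imitates audio chunks arriving over time.

```python
import asyncio
import time


async def fake_tts_stream(text: str, chunks: int = 3, delay: float = 0.01):
    """Stand-in for a streaming TTS call: yields dummy audio chunks after a delay."""
    for _ in range(chunks):
        await asyncio.sleep(delay)
        yield b"\x00" * 320


async def measure(text: str) -> dict:
    """Time one synthesis request: TTFB, total duration, and character count."""
    start = time.perf_counter()
    ttfb = None
    audio_bytes = 0
    async for chunk in fake_tts_stream(text):
        if ttfb is None:
            # First audio has arrived: this is the time to first byte.
            ttfb = time.perf_counter() - start
        audio_bytes += len(chunk)
    return {
        "ttfb_s": ttfb,
        "processing_s": time.perf_counter() - start,
        "characters": len(text),
        "audio_bytes": audio_bytes,
    }


metrics = asyncio.run(measure("Hello from Cartesia"))
print(metrics)
```

With a streaming service, TTFB is typically much smaller than the processing duration, because playback can begin while later chunks are still being generated.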

Additional Notes

  • WebSocket Recommended: Use CartesiaTTSService for low-latency streaming and accurate context updates with word timestamps
  • Connection Management: WebSocket lifecycle is handled automatically with reconnection support
  • Sample Rate: Set globally in PipelineParams rather than per-service for consistency