Overview

Cartesia provides two TTS service implementations:
  • CartesiaTTSService: WebSocket-based with streaming and word timestamps
  • CartesiaHttpTTSService: HTTP-based for simpler synthesis
CartesiaTTSService is recommended for real-time applications.

Installation

To use Cartesia services, install the required dependencies:
pip install "pipecat-ai[cartesia]"
You’ll also need to set up your Cartesia API key as an environment variable: CARTESIA_API_KEY.
Get your API key by signing up at Cartesia.
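Assuming a POSIX shell, exporting the key so your application can read it might look like (placeholder value shown):

```shell
# Make the key available to your application (replace the placeholder)
export CARTESIA_API_KEY="your-api-key-here"
```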

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that the TTS service should speak
  • TTSUpdateSettingsFrame - Runtime configuration updates (e.g., voice)
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data chunks
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - Connection or processing errors
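For one synthesized utterance, the output frames arrive in a fixed order: a TTSStartedFrame, one or more TTSAudioRawFrame chunks, then a TTSStoppedFrame. A minimal sketch with stand-in classes (not the real pipecat frame types, which live in pipecat.frames.frames and carry more fields) that collects audio from such a stream:

```python
from dataclasses import dataclass

# Stand-in frame types for illustration only.
@dataclass
class TTSStartedFrame:
    pass

@dataclass
class TTSAudioRawFrame:
    audio: bytes

@dataclass
class TTSStoppedFrame:
    pass

def collect_audio(frames):
    """Accumulate raw audio chunks until the stop marker arrives."""
    chunks = []
    for frame in frames:
        if isinstance(frame, TTSAudioRawFrame):
            chunks.append(frame.audio)
        elif isinstance(frame, TTSStoppedFrame):
            break
    return b"".join(chunks)

stream = [TTSStartedFrame(), TTSAudioRawFrame(b"\x00\x01"),
          TTSAudioRawFrame(b"\x02\x03"), TTSStoppedFrame()]
audio = collect_audio(stream)  # four bytes of concatenated audio
```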

Service Comparison

Feature         | CartesiaTTSService (WebSocket) | CartesiaHttpTTSService (HTTP)
----------------|--------------------------------|------------------------------
Streaming       | ✅ Real-time chunks            | ❌ Single audio block
Word Timestamps | ✅ Precise timing              | ❌ Not available
Interruption    | ✅ Advanced handling           | ⚠️ Basic support
Latency         | 🚀 Low                         | 📈 Higher
Best For        | Interactive apps               | Batch processing

Language Support

Both services support multiple languages through the Language enum:
Language Code | Description        | Service Code
--------------|--------------------|-------------
Language.DE   | German             | de
Language.EN   | English            | en
Language.ES   | Spanish            | es
Language.FR   | French             | fr
Language.HI   | Hindi              | hi
Language.IT   | Italian            | it
Language.JA   | Japanese           | ja
Language.KO   | Korean             | ko
Language.NL   | Dutch              | nl
Language.PL   | Polish             | pl
Language.PT   | Portuguese         | pt
Language.RU   | Russian            | ru
Language.SV   | Swedish            | sv
Language.TR   | Turkish            | tr
Language.ZH   | Chinese (Mandarin) | zh
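As an illustration of the mapping above, a plain dictionary from enum member names to Cartesia service codes (a hypothetical helper for demonstration; pipecat performs this translation internally):

```python
# Lookup table mirroring the language table above.
LANGUAGE_TO_SERVICE_CODE = {
    "DE": "de", "EN": "en", "ES": "es", "FR": "fr", "HI": "hi",
    "IT": "it", "JA": "ja", "KO": "ko", "NL": "nl", "PL": "pl",
    "PT": "pt", "RU": "ru", "SV": "sv", "TR": "tr", "ZH": "zh",
}

def service_code(language_name: str) -> str:
    """Return the Cartesia service code for a Language enum member name."""
    return LANGUAGE_TO_SERVICE_CODE[language_name]
```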

Usage Example

Initialize the WebSocket service with your API key and desired voice:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transcriptions.language import Language

# Configure WebSocket service
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-2",
    params=CartesiaTTSService.InputParams(
        language=Language.EN,
        speed="normal"
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,  # Word timestamps enable precise context updates
    transport.output()
])

HTTP Service

Initialize the HTTP service with your API key and desired voice; it drops into a pipeline the same way as the WebSocket service:
# For simpler, non-streaming use cases
http_tts = CartesiaHttpTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-2",
    params=CartesiaHttpTTSService.InputParams(
        language=Language.EN
    )
)

Dynamic Configuration

Update settings at runtime by pushing a TTSUpdateSettingsFrame to the CartesiaTTSService:
from pipecat.frames.frames import TTSUpdateSettingsFrame

await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "your-new-voice-id"})
)

Metrics

Both services provide:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Usage Metrics - Character count and synthesis statistics
Learn how to enable Metrics in your Pipeline.
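A sketch of enabling these metrics, assuming the standard PipelineParams/PipelineTask setup (field names as in current pipecat releases; verify against your installed version):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is the Pipeline built earlier in this document.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # emit TTFB and processing-duration metrics
        enable_usage_metrics=True,  # emit character-count usage metrics
    ),
)
```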

Additional Notes

  • WebSocket Recommended: Use CartesiaTTSService for low-latency streaming and accurate context updates with word timestamps
  • Connection Management: WebSocket lifecycle is handled automatically with reconnection support
  • Sample Rate: Set globally in PipelineParams rather than per-service for consistency
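Following the sample-rate note above, a sketch of setting it globally rather than per-service (assuming current PipelineParams field names; check your pipecat version):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is the Pipeline built earlier in this document.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_out_sample_rate=24000,  # applied to all output audio, including TTS
    ),
)
```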