Overview

Cartesia provides two TTS service implementations:

  • CartesiaTTSService: WebSocket-based service with word-level timestamps and streaming
  • CartesiaHttpTTSService: HTTP-based service for simpler, non-streaming synthesis

Installation

To use Cartesia services, install the required dependencies:

pip install pipecat-ai[cartesia]

You’ll also need to set up your Cartesia API key as an environment variable: CARTESIA_API_KEY.

You can obtain a Cartesia API key by signing up at Cartesia.
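
As a minimal sketch, the key can be read from the environment once at startup so that a missing value fails early. The helper name load_cartesia_api_key is hypothetical and not part of Pipecat; only the CARTESIA_API_KEY variable name comes from this page.

```python
import os


def load_cartesia_api_key(env=None):
    """Return the Cartesia API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; it is not part of Pipecat.
    """
    env = os.environ if env is None else env
    api_key = env.get("CARTESIA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return api_key
```

The returned string can then be passed as the api_key constructor argument of either service.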

Choosing a Cartesia service

The two supported Cartesia services differ in transport:

  • CartesiaTTSService, a WebSocket-based implementation
  • CartesiaHttpTTSService, an HTTP-based implementation

CartesiaTTSService

The CartesiaTTSService is recommended for real-time streaming and interactive applications. It offers:

  • Streaming audio in chunks
  • Word-level timestamps
  • Text frame generation aligned with audio playback
  • Sophisticated interruption handling
  • Continuous session management through a persistent WebSocket connection
  • Non-blocking operation that allows other frames to be processed while audio is being generated

CartesiaHttpTTSService

The CartesiaHttpTTSService is simpler and suits non-interactive use cases. It:

  • Processes the entire text in one request
  • Returns audio in a single frame
  • Has a simpler implementation with fewer moving parts
  • May be more suitable for batch processing
  • Blocks during the HTTP request, preventing other frames from being processed until the audio is fully generated

Both services support usage metrics and start/stop frame events, but they differ in how they stream audio and how they handle interaction. Choose the WebSocket-based service if you need real-time responsiveness, or the HTTP service if you prefer simplicity and don’t mind the blocking behavior.
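
The difference between chunked, non-blocking output and a single blocking request can be illustrated with a toy asyncio model. This is not Pipecat code: streaming_synthesis, single_request_synthesis, other_work, and run are hypothetical stand-ins, and no audio or network traffic is involved.

```python
import asyncio


async def streaming_synthesis(text, events):
    # Stand-in for the WebSocket service: emit one chunk per word and
    # yield control to the event loop after each chunk.
    for word in text.split():
        events.append(f"audio:{word}")
        await asyncio.sleep(0)


async def single_request_synthesis(text, events):
    # Stand-in for the HTTP service: the whole utterance is produced
    # without yielding, then delivered as a single frame.
    words = text.split()
    events.append("audio:" + " ".join(words))


async def other_work(events):
    # Stand-in for any other frame waiting to be processed.
    events.append("other-frame")


async def run(synthesis, text):
    # Run synthesis and the other work concurrently and record the order
    # in which events were produced.
    events = []
    await asyncio.gather(synthesis(text, events), other_work(events))
    return events
```

In this model, asyncio.run(run(streaming_synthesis, "hello world")) records the other frame between the two audio chunks, while the single-request variant records it only after the full audio frame.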

Input Parameters

Both services use the same input parameters structure:

language
Language
default: "Language.EN"

The language to use for synthesis. See the Language Support section for available options.

speed
Union[str, float]
default: ""

Controls the speech rate.

Can be specified as either:

  • String options: "slowest", "slow", "normal", "fast", "fastest"
  • Float value: Between -1.0 (slowest) and 1.0 (fastest), where 0.0 is normal speed
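
The two accepted forms can be checked before a value is passed to the service. The sketch below is a hypothetical validator, not part of Pipecat; it assumes the empty-string default means "leave the speed unset" and accepts it unchanged.

```python
SPEED_NAMES = ("slowest", "slow", "normal", "fast", "fastest")


def validate_speed(speed):
    """Return a valid speed value, or raise if it is outside the documented options."""
    if isinstance(speed, str):
        # Assumption: "" (the documented default) means no speed control.
        if speed == "" or speed in SPEED_NAMES:
            return speed
        raise ValueError(f"Unknown speed name: {speed!r}")
    if isinstance(speed, bool) or not isinstance(speed, (int, float)):
        raise TypeError("speed must be a string or a number")
    if not -1.0 <= speed <= 1.0:
        raise ValueError("numeric speed must be between -1.0 and 1.0")
    return float(speed)
```

For example, validate_speed("fast") and validate_speed(0.5) pass through, while validate_speed("quick") and validate_speed(1.5) raise ValueError.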

emotion
List[str]
default: "[]"

List of emotion controls to apply.

Each emotion can be specified as:

  • Simple emotion: "anger", "positivity", "surprise", "sadness", "curiosity"
  • Emotion with level: "emotion:level" where level can be "lowest", "low", "high", "highest"

Example: ["positivity:high", "curiosity"]

Note: Emotion controls are additive and their effects may vary by voice and content.
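
The "emotion:level" format can be split and checked against the names listed above. The following is a hypothetical parser for illustration only; Pipecat passes these strings to Cartesia as-is, and the function name parse_emotion is an assumption.

```python
EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
LEVELS = {"lowest", "low", "high", "highest"}


def parse_emotion(control):
    """Split an emotion control into (name, level); level is None when omitted."""
    name, sep, level = control.partition(":")
    if name not in EMOTIONS:
        raise ValueError(f"Unknown emotion: {name!r}")
    if sep and level not in LEVELS:
        raise ValueError(f"Unknown emotion level: {level!r}")
    return name, (level if sep else None)
```

Applied to the example list, parse_emotion("positivity:high") returns ("positivity", "high") and parse_emotion("curiosity") returns ("curiosity", None).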

CartesiaTTSService

WebSocket-based implementation supporting real-time streaming and word timestamps.

Constructor Parameters

api_key
str
required

Cartesia API key

voice_id
str
required

Voice identifier

cartesia_version
str
default: "2024-06-10"

API version

url
str
default: "wss://api.cartesia.ai/tts/websocket"

WebSocket endpoint URL

model
str
default: "sonic-english"

Model identifier

sample_rate
int
default: "24000"

Output audio sample rate in Hz

encoding
str
default: "pcm_s16le"

Audio encoding format

container
str
default: "raw"

Audio container format

text_filter
BaseTextFilter
default: "None"

Modifies text provided to the TTS. Learn more about the available filters.
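
With the default sample_rate and encoding, the duration of a raw audio chunk follows from its byte length, since pcm_s16le stores each sample as 2 bytes. The helper below is a hypothetical illustration and assumes mono audio, which this page does not state.

```python
def pcm_s16le_duration_seconds(num_bytes, sample_rate=24000, channels=1):
    """Return the playback duration of raw pcm_s16le audio of the given size."""
    bytes_per_sample = 2  # 16-bit signed little-endian samples
    # Assumption: mono output unless the caller specifies otherwise.
    return num_bytes / (sample_rate * channels * bytes_per_sample)
```

At 24000 Hz mono, one second of audio is 24000 × 2 = 48000 bytes, so a 4800-byte chunk lasts 0.1 seconds.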

CartesiaHttpTTSService

HTTP-based implementation for simpler synthesis requirements.

Constructor Parameters

api_key
str
required

Cartesia API key

voice_id
str
required

Voice identifier

model
str
default: "sonic-english"

Model identifier

base_url
str
default: "https://api.cartesia.ai"

API base URL

sample_rate
int
default: "24000"

Output audio sample rate in Hz

encoding
str
default: "pcm_s16le"

Audio encoding format

container
str
default: "raw"

Audio container format

text_filter
BaseTextFilter
default: "None"

Modifies text provided to the TTS. Learn more about the available filters.

Output Frames

Control Frames

TTSStartedFrame
Frame

Signals start of synthesis

TTSStoppedFrame
Frame

Signals completion of synthesis

Audio Frames

TTSAudioRawFrame
Frame

Contains generated audio data

Error Frames

ErrorFrame
Frame

Contains error information

Methods

See the TTS base class methods for additional functionality.

Language Support

Supports multiple languages through the Language enum:

Language Code    Description           Service Code
Language.DE      German                de
Language.EN      English               en
Language.ES      Spanish               es
Language.FR      French                fr
Language.HI      Hindi                 hi
Language.IT      Italian               it
Language.JA      Japanese              ja
Language.KO      Korean                ko
Language.NL      Dutch                 nl
Language.PL      Polish                pl
Language.PT      Portuguese            pt
Language.RU      Russian               ru
Language.SV      Swedish               sv
Language.TR      Turkish               tr
Language.ZH      Chinese (Mandarin)    zh
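
The mapping in the table can be expressed as a lookup that rejects unsupported languages. This is a sketch with a hypothetical service_code helper keyed by the Language member name; it does not use Pipecat's own conversion code.

```python
# Language member name -> Cartesia service code, taken from the table above.
SERVICE_CODES = {
    "DE": "de", "EN": "en", "ES": "es", "FR": "fr", "HI": "hi",
    "IT": "it", "JA": "ja", "KO": "ko", "NL": "nl", "PL": "pl",
    "PT": "pt", "RU": "ru", "SV": "sv", "TR": "tr", "ZH": "zh",
}


def service_code(language_name):
    """Return the service code for a Language member name such as "EN"."""
    try:
        return SERVICE_CODES[language_name.upper()]
    except KeyError:
        raise ValueError(f"Unsupported language: {language_name!r}") from None
```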

Usage Examples

WebSocket Service

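import os

# Import paths follow the Pipecat package layout and may differ between releases.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transcriptions.language import Language

# Configure WebSocket service
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="voice-id",
    model="sonic-english",
    params=CartesiaTTSService.InputParams(
        language=Language.EN,
        speed="normal",
        emotion=[
            "positivity:high",
            "curiosity"
        ]
    )
)

# Use in pipeline
# (text_input and transport are assumed to be defined elsewhere in your application)
pipeline = Pipeline([
    text_input,
    tts,
    transport.output()
])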

HTTP Service

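import os

# Import paths follow the Pipecat package layout and may differ between releases.
from pipecat.services.cartesia import CartesiaHttpTTSService
from pipecat.transcriptions.language import Language

# Configure HTTP service
http_service = CartesiaHttpTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="voice-id",
    model="sonic-english",
    params=CartesiaHttpTTSService.InputParams(
        language=Language.EN,
        speed=1.0  # fastest on the -1.0 to 1.0 scale
    )
)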

Frame Flow

WebSocket Service

HTTP Service