Overview

Cartesia provides high-quality text-to-speech synthesis with two service implementations: CartesiaTTSService (WebSocket-based) for real-time streaming with word timestamps, and CartesiaHttpTTSService (HTTP-based) for simpler batch synthesis. CartesiaTTSService is recommended for interactive applications requiring low latency and interruption handling.

Installation

To use Cartesia services, install the required dependencies:
pip install "pipecat-ai[cartesia]"

Prerequisites

Cartesia Account Setup

Before using Cartesia TTS services, you need:
  1. Cartesia Account: Sign up at Cartesia
  2. API Key: Generate an API key from your account dashboard
  3. Voice Selection: Choose voice IDs from the voice library

Required Environment Variables

  • CARTESIA_API_KEY: Your Cartesia API key for authentication
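
As a minimal sketch, the key can be checked once at startup so that a missing variable fails with a clear message instead of passing None to the service. The helper name `require_env` is hypothetical and is not part of Pipecat or Cartesia.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of Pipecat or Cartesia.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early rather than hand an empty key to the TTS service.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, `api_key=require_env("CARTESIA_API_KEY")` could stand in for the `os.getenv(...)` call when constructing a service.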

Customizing Speech

CartesiaTTSService provides a set of helper methods for Cartesia-specific customizations, meant to be used inside text transformers. They cover spelling out text, expressing emotion, inserting pauses, and adjusting volume and speech rate. See the Text Transformers for TTS section in the Text-to-Speech guide for usage examples.
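
To make the role of the transformer keys concrete, here is a rough sketch of how a list of keyed transformers might be applied to a piece of text. It is an assumption for illustration only, not Pipecat's actual dispatch code: `apply_transforms`, `shout`, and `add_marker` are hypothetical names, and the only behavior taken from this page is that a key such as "sentence" targets one aggregation type while "*" applies to all text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A transformer receives the text and its aggregation type and returns new text.
Transform = Callable[[str, str], Awaitable[str]]


async def apply_transforms(
    text: str, aggregation_type: str, transforms: List[Dict[str, Transform]]
) -> str:
    """Run every transformer whose key matches the aggregation type or is "*".

    Hypothetical stand-in for the service's internal dispatch.
    """
    for mapping in transforms:
        for key, transform in mapping.items():
            if key in ("*", aggregation_type):
                text = await transform(text, aggregation_type)
    return text


async def shout(text: str, type: str) -> str:
    # Example transformer: upper-case the text.
    return text.upper()


async def add_marker(text: str, type: str) -> str:
    # Example transformer: prefix the text with its aggregation type.
    return f"[{type}] {text}"


if __name__ == "__main__":
    transforms = [{"sentence": shout}, {"*": add_marker}]
    print(asyncio.run(apply_transforms("hello there", "sentence", transforms)))
```

In this sketch a "sentence" aggregation passes through both transformers in list order, while any other aggregation type only reaches the "*" entry.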

SPELL(text: str) -> str:

A convenience method to wrap text in Cartesia’s spell tag for spelling out text character by character.
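# Text transformers for TTS
import os

# Import path as used in recent Pipecat releases; adjust if your version differs.
from pipecat.services.cartesia.tts import CartesiaTTSService

# This will insert Cartesia's spell tags around the provided text.
async def spell_out_text(text: str, type: str) -> str:
    return CartesiaTTSService.SPELL(text)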

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    text_transforms=[{
        "phone_number": spell_out_text,
    }],
)

EMOTION_TAG(emotion: CartesiaEmotion) -> str:

A convenience method to create an emotion tag for expressing emotions in speech.
# Text transformers for TTS
# This will insert Cartesia's sarcasm tag in front of any sentence that is just "whatever".
async def maybe_insert_sarcasm(text: str, type: str) -> str:
    if text.strip(".!").lower() == "whatever":
        return CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.SARCASM) + text + CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.NEUTRAL)
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    text_transforms=[{
        "sentence": maybe_insert_sarcasm,
    }],
)

PAUSE_TAG(seconds: float) -> str:

A convenience method to create Cartesia’s SSML tag for inserting pauses in speech.
# Text transformers for TTS
# This will insert a one second pause after questions.
async def pause_after_questions(text: str, type: str) -> str:
    if text.endswith("?"):
        return f"{text}{CartesiaTTSService.PAUSE_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    text_transforms=[{
        "sentence": pause_after_questions, # Only apply to sentence aggregations
    }],
)

VOLUME_TAG(volume: float) -> str:

A convenience method to create Cartesia’s SSML volume tag for dynamically adjusting speech volume in situ.
# Text transformers for TTS
# This will increase the volume for any full text aggregation that is in all caps.
async def maybe_say_it_loud(text: str, type: str) -> str:
    # isupper() requires at least one letter, so "123" or "?" is not treated as all caps.
    if text.isupper():
        return f"{CartesiaTTSService.VOLUME_TAG(2.0)}{text}{CartesiaTTSService.VOLUME_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    text_transforms=[{
        "*": maybe_say_it_loud, # Apply to all text
    }],
)

SPEED_TAG(speed: float) -> str:

A convenience method to create Cartesia’s SSML speed tag for dynamically adjusting the speech rate in situ.
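# Text transformers for TTS
import re

# This will make the word "slow" always be spoken more slowly.
async def slow_down_slow_words(text: str, type: str) -> str:
    slowed = f"{CartesiaTTSService.SPEED_TAG(0.6)}slow{CartesiaTTSService.SPEED_TAG(1.0)}"
    # \b limits the match to the standalone word, so "slowly" is left untouched.
    # A function replacement keeps the tag text from being parsed as a regex template.
    return re.sub(r"\bslow\b", lambda _match: slowed, text)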

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    text_transforms=[{
        "*": slow_down_slow_words, # Apply to all text
    }],
)
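
Because transformers are plain async functions, they can be exercised without a network connection or an API key. The sketch below shows one way to do that, assuming the tag helper is injected rather than referenced directly. `FakeTags` and `make_pause_after_questions` are hypothetical names, and the marker string that `FakeTags.PAUSE_TAG` returns is made up; it is not Cartesia's real tag syntax.

```python
import asyncio


class FakeTags:
    """Stand-in for the service's tag helpers so the transformer runs offline.

    The marker format is invented for testing; it is NOT Cartesia's tag syntax.
    """

    @staticmethod
    def PAUSE_TAG(seconds: float) -> str:
        return f"[pause {seconds}s]"


def make_pause_after_questions(tags=FakeTags):
    """Build the transformer around whichever tag provider is passed in."""

    async def pause_after_questions(text: str, type: str) -> str:
        # rstrip() so a trailing space after the question mark still matches.
        if text.rstrip().endswith("?"):
            return f"{text}{tags.PAUSE_TAG(1.0)}"
        return text

    return pause_after_questions


# Offline check using the fake tag provider.
transform = make_pause_after_questions()

if __name__ == "__main__":
    print(asyncio.run(transform("Are you ready?", "sentence")))
```

In an application, passing `CartesiaTTSService` as `tags` would produce the real pause tag, since the examples above call `PAUSE_TAG` on the class itself.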