Cartesia

Overview

CartesiaSTTService provides real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results.

API Reference

Complete API documentation and method details

Cartesia Docs

Official Cartesia STT documentation and features

Example Code

Working example with transcription logging

Installation

To use Cartesia services, install the required dependency:

pip install "pipecat-ai[cartesia]"

You’ll also need to set up your Cartesia API key as an environment variable: CARTESIA_API_KEY.

Get your API key from Cartesia.
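
The examples on this page read the key with os.getenv, which returns None when the variable is missing. A minimal sketch of a fail-fast check is shown here, using only the standard library. The helper name require_api_key is hypothetical and is not part of Pipecat or the Cartesia SDK.

```python
import os


def require_api_key(name: str = "CARTESIA_API_KEY") -> str:
    """Return the key from the environment, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the bot")
    return value
```

The returned string could then be passed as api_key when constructing the service, so a missing key surfaces at startup rather than at connection time.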

Frames

Input

InputAudioRawFrame - Raw PCM audio data (16-bit, 16 kHz, mono)
UserStartedSpeakingFrame - Triggers metrics collection
UserStoppedSpeakingFrame - Sends finalize command to flush session
STTUpdateSettingsFrame - Runtime transcription configuration updates
STTMuteFrame - Mute audio input for transcription

Output

InterimTranscriptionFrame - Real-time transcription updates
TranscriptionFrame - Final transcription results
ErrorFrame - Connection or processing errors
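
Interim and final results are usually handled differently downstream: an interim result replaces the previous hypothesis, while a final result is committed. The sketch below illustrates that bookkeeping with a standalone class. TranscriptView and its methods are hypothetical names, not Pipecat APIs. In a real pipeline the two handlers would presumably be called from a frame processor that distinguishes InterimTranscriptionFrame from TranscriptionFrame.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptView:
    """Tracks committed text plus the latest interim hypothesis."""

    committed: List[str] = field(default_factory=list)
    pending: str = ""

    def on_interim(self, text: str) -> None:
        # An interim result supersedes the previous hypothesis.
        self.pending = text

    def on_final(self, text: str) -> None:
        # A final result is committed and clears the hypothesis.
        self.committed.append(text)
        self.pending = ""

    def render(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)
```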

Models

Cartesia currently offers one primary STT model:

Model	Description	Best For
`ink-whisper`	Cartesia’s optimized Whisper implementation	General-purpose real-time transcription

Language Support

Cartesia STT supports multiple languages through standard language codes:

Language Enum	Description	Service Code
`Language.EN`	English (US)	`en`
`Language.ES`	Spanish	`es`
`Language.FR`	French	`fr`
`Language.DE`	German	`de`
`Language.IT`	Italian	`it`
`Language.PT`	Portuguese	`pt`
`Language.NL`	Dutch	`nl`
`Language.PL`	Polish	`pl`
`Language.RU`	Russian	`ru`
`Language.JA`	Japanese	`ja`
`Language.KO`	Korean	`ko`
`Language.ZH`	Chinese	`zh`

Language support may vary. Check Cartesia’s documentation for the most current language list.
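
As an illustration of how the service codes in the table relate to fuller language tags, the sketch below reduces a tag such as en-US to its base code and checks it against the codes listed above. The function to_service_code and the SERVICE_CODES set are hypothetical. Whether Pipecat applies the same fallback internally is an assumption not confirmed by this page.

```python
from typing import Optional

# Service codes copied from the table above; the live list may differ.
SERVICE_CODES = {"en", "es", "fr", "de", "it", "pt", "nl", "pl", "ru", "ja", "ko", "zh"}


def to_service_code(tag: str) -> Optional[str]:
    """Map a tag like 'en-US' or 'FR' to a listed service code, else None."""
    base = tag.replace("_", "-").split("-")[0].lower()
    return base if base in SERVICE_CODES else None
```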

Usage Example

Basic Configuration

Initialize the CartesiaSTTService and use it in a pipeline:

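import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia.stt import CartesiaSTTService

# Simple setup with defaults
stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY")
)

# Use in pipeline (transport, context_aggregator, llm, and tts are created elsewhere)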
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Dynamic Configuration

Update settings at runtime by pushing an STTUpdateSettingsFrame to the CartesiaSTTService:

from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.transcriptions.language import Language

await task.queue_frame(STTUpdateSettingsFrame(
    language=Language.FR,
))

Live Options Configuration

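Pass a CartesiaLiveOptions object to set the model and language when the service is constructed:

import os

from pipecat.services.cartesia.stt import CartesiaSTTService, CartesiaLiveOptions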
from pipecat.transcriptions.language import Language

# Custom configuration with live options
live_options = CartesiaLiveOptions(
    model="ink-whisper",
    language=Language.ES,  # Spanish transcription
)

stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    base_url="api.cartesia.ai",  # Custom endpoint if needed
    live_options=live_options
)

Metrics

The service reports two metrics:

Time to First Byte (TTFB) - Latency from audio input to first transcription
Processing Duration - Total time spent processing audio

Learn how to enable Metrics in your Pipeline.
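
To make the TTFB definition concrete, the sketch below measures the interval between a start event and the first result that follows it. It is a standalone illustration, not the service's own implementation. TTFBTimer is a hypothetical class, and the clock is injectable so the behavior can be checked without real audio.

```python
import time
from typing import Callable, Optional


class TTFBTimer:
    """Measures time from start() to the first mark_result() call."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start: Optional[float] = None
        self.ttfb: Optional[float] = None

    def start(self) -> None:
        # Call when audio for a new utterance begins.
        self._start = self._clock()
        self.ttfb = None

    def mark_result(self) -> None:
        # Only the first result after start() is recorded.
        if self._start is not None and self.ttfb is None:
            self.ttfb = self._clock() - self._start
```

Later results for the same utterance leave the recorded value unchanged until start() is called again.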

Additional Notes

Audio Format: Expects PCM S16LE format at 16 kHz sample rate by default
Session Management: Each connection represents a transcription session that can be finalized
Interim Results: Provides real-time interim transcriptions before final results
Language Detection: Automatic language detection available in transcription responses
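
The audio format note above implies a fixed relationship between chunk duration and byte count. The sketch below works that out for the default format and shows one way to pack float samples as S16LE with the standard struct module. The function names are hypothetical, and the constants assume the defaults stated on this page (16 kHz, 16-bit, mono).

```python
import struct
from typing import List

SAMPLE_RATE_HZ = 16_000  # default sample rate stated above
BYTES_PER_SAMPLE = 2     # signed 16-bit samples
CHANNELS = 1             # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes of S16LE mono audio needed to cover duration_ms."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def floats_to_s16le(samples: List[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

For example, a 20 ms chunk at these defaults is 320 samples, or 640 bytes.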
