Overview

CartesiaSTTService provides real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results.

Installation

To use Cartesia services, install the required dependency:
pip install "pipecat-ai[cartesia]"
You’ll also need to set your Cartesia API key as the CARTESIA_API_KEY environment variable.
Get your API key from Cartesia.
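
A minimal sketch of loading the key in Python (python-dotenv is an assumption here, just one way to populate the environment):

import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # picks up CARTESIA_API_KEY from a local .env file, if present
api_key = os.getenv("CARTESIA_API_KEY")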

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, 16kHz, mono)
  • UserStartedSpeakingFrame - Triggers metrics collection
  • UserStoppedSpeakingFrame - Sends finalize command to flush session
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription (see the sketch below)
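
For example, a minimal sketch of muting and unmuting transcription from application code, assuming a running PipelineTask named task:

from pipecat.frames.frames import STTMuteFrame

# Assumption: `task` is the PipelineTask driving the pipeline (see the runner sketch later on this page).
await task.queue_frame(STTMuteFrame(mute=True))   # pause transcription of incoming audio
await task.queue_frame(STTMuteFrame(mute=False))  # resume transcription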

Output

  • InterimTranscriptionFrame - Real-time transcription updates
  • TranscriptionFrame - Final transcription results (see the sketch after this list)
  • ErrorFrame - Connection or processing errors
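
A minimal sketch of a downstream processor that distinguishes the two transcription frame types; the FrameProcessor subclass pattern is standard Pipecat, but the class name and print logging here are illustrative:

from pipecat.frames.frames import Frame, InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, InterimTranscriptionFrame):
            print(f"[interim] {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"[final]   {frame.text}")

        # Always pass frames downstream so the rest of the pipeline keeps working.
        await self.push_frame(frame, direction)

Placing an instance of this processor immediately after stt in the pipeline lets you observe interim updates as they arrive and the final result for each utterance.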

Models

Cartesia currently offers one primary STT model:
Model          Description                                    Best For
ink-whisper    Cartesia’s optimized Whisper implementation    General-purpose real-time transcription

Language Support

Cartesia STT supports multiple languages through standard language codes:
Language Code   Description     Service Code
Language.EN     English (US)    en
Language.ES     Spanish         es
Language.FR     French          fr
Language.DE     German          de
Language.IT     Italian         it
Language.PT     Portuguese      pt
Language.NL     Dutch           nl
Language.PL     Polish          pl
Language.RU     Russian         ru
Language.JA     Japanese        ja
Language.KO     Korean          ko
Language.ZH     Chinese         zh
Language support may vary. Check Cartesia’s documentation for the most current language list.
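
Each Language value maps to the service code shown in the table above; a brief sketch of selecting one explicitly (the same pattern as the Live Options example later on this page):

from pipecat.services.cartesia.stt import CartesiaLiveOptions
from pipecat.transcriptions.language import Language

# Language.DE corresponds to the service code "de" from the table above.
live_options = CartesiaLiveOptions(language=Language.DE)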

Usage Example

Basic Configuration

Initialize the CartesiaSTTService and use it in a pipeline:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia.stt import CartesiaSTTService

# Simple setup with defaults
stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY")
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])
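
To run the pipeline, wrap it in a PipelineTask and a PipelineRunner (a minimal sketch of the standard Pipecat pattern; the task created here is the same task object used in the dynamic-configuration example below):

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

task = PipelineTask(pipeline)
runner = PipelineRunner()
await runner.run(task)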

Dynamic Configuration

Update settings at runtime by pushing an STTUpdateSettingsFrame to the CartesiaSTTService:
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.transcriptions.language import Language

await task.queue_frame(STTUpdateSettingsFrame(
    language=Language.FR,
))

Live Options Configuration

import os

from pipecat.services.cartesia.stt import CartesiaSTTService, CartesiaLiveOptions
from pipecat.transcriptions.language import Language

# Custom configuration with live options
live_options = CartesiaLiveOptions(
    model="ink-whisper",
    language=Language.ES,  # Spanish transcription
)

stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    base_url="api.cartesia.ai",  # Custom endpoint if needed
    live_options=live_options
)

Metrics

The service provides comprehensive metrics:
  • Time to First Byte (TTFB) - Latency from audio input to first transcription
  • Processing Duration - Total time spent processing audio
Learn how to enable Metrics in your Pipeline.
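
A sketch of enabling metrics when constructing the task, assuming Pipecat's PipelineParams flags (enable_metrics, enable_usage_metrics):

from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # TTFB and processing-duration metrics
        enable_usage_metrics=True,  # usage metrics where a service reports them
    ),
)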

Additional Notes

  • Audio Format: Expects PCM S16LE format at 16kHz sample rate by default (see the sketch after this list)
  • Session Management: Each connection represents a transcription session that can be finalized
  • Interim Results: Provides real-time interim transcriptions before final results
  • Language Detection: Automatic language detection available in transcription responses
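
A sketch of matching the transport's input audio to that default; the audio_in_enabled and audio_in_sample_rate fields are assumed from Pipecat's TransportParams and may differ for your transport:

from pipecat.transports.base_transport import TransportParams

# Assumption: your transport accepts TransportParams-style audio settings.
params = TransportParams(
    audio_in_enabled=True,
    audio_in_sample_rate=16000,  # matches the 16-bit, 16 kHz mono default above
)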