Cartesia
Speech-to-text service implementation using Cartesia’s real-time transcription API
Overview
CartesiaSTTService provides real-time speech-to-text capabilities using Cartesia’s WebSocket API. It supports streaming transcription with both interim and final results using the ink-whisper model.
Installation
To use CartesiaSTTService, install the required dependencies.
You’ll also need to set your Cartesia API key in the CARTESIA_API_KEY environment variable.
You can obtain a Cartesia API key by signing up at Cartesia.
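As a minimal sketch of the environment-variable setup described above, the key can be read once at startup and checked before the service is built. The helper name load_cartesia_api_key is hypothetical and not part of any library; only the CARTESIA_API_KEY variable name comes from this page.

```python
import os


def load_cartesia_api_key(env=os.environ):
    """Return the key stored in CARTESIA_API_KEY, or raise a clear error."""
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

Failing early like this tends to give a clearer message than an authentication error surfacing later from the WebSocket connection.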
Configuration
Constructor Parameters
Your Cartesia API key
Custom Cartesia API endpoint URL
Audio sample rate in Hz
Custom transcription options
CartesiaLiveOptions
The Cartesia transcription model to use
Language code for transcription
Audio encoding format
Audio sample rate in Hz
Default Options
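The defaults named elsewhere on this page can be summarized in a small stand-in dataclass. This is only a sketch: LiveOptionsSketch and its field names are hypothetical and are not the real CartesiaLiveOptions class, and the "en" language default is an assumption because this page does not state one.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LiveOptionsSketch:
    """Hypothetical stand-in for CartesiaLiveOptions (field names assumed)."""

    model: str = "ink-whisper"   # model named in the Overview
    language: str = "en"         # assumption: default not stated on this page
    encoding: str = "pcm_s16le"  # encoding listed under Input
    sample_rate: int = 16000     # 16kHz default listed under Input


# Default options, and the same options with only the language overridden.
defaults = asdict(LiveOptionsSketch())
spanish = asdict(LiveOptionsSketch(language="es"))
```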
Input
The service processes raw audio data with the following requirements:
- PCM audio format (pcm_s16le)
- 16-bit depth
- 16kHz sample rate (default)
- Single channel (mono)
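To make the input requirements above concrete, the following sketch computes how many bytes a chunk of audio occupies in this format and generates a test tone encoded as little-endian signed 16-bit mono samples. The function names are illustrative and not part of the service.

```python
import math
import struct

SAMPLE_RATE = 16000  # Hz, the default listed above
SAMPLE_WIDTH = 2     # bytes per sample for 16-bit PCM
CHANNELS = 1         # mono


def chunk_size_bytes(duration_ms):
    """Bytes of pcm_s16le mono audio covering duration_ms milliseconds."""
    return SAMPLE_RATE * duration_ms // 1000 * SAMPLE_WIDTH * CHANNELS


def sine_chunk(duration_ms, freq_hz=440.0, amplitude=0.3):
    """Generate a test tone as little-endian signed 16-bit samples."""
    n = SAMPLE_RATE * duration_ms // 1000
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)
```

At these settings a 20 ms chunk holds 320 samples, or 640 bytes.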
Output Frames
The service produces two types of frames during transcription:
TranscriptionFrame
Generated for final transcriptions, containing:
- Final transcribed text
- User identifier
- ISO 8601 formatted timestamp
- Detected or configured language
InterimTranscriptionFrame
Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
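The relationship between the two frame types can be sketched with stand-in dataclasses. The class and field names below are hypothetical mirrors of the fields listed above, not the library's real frame classes. The helper shows one way a consumer might combine final text with the most recent interim result.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TranscriptionFrameSketch:
    """Hypothetical mirror of the fields listed above (names assumed)."""

    text: str
    user_id: str
    timestamp: str  # ISO 8601 string
    language: str


@dataclass
class InterimTranscriptionFrameSketch(TranscriptionFrameSketch):
    """Same fields, but the text is preliminary and may still change."""


def now_iso8601():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def latest_text(frames):
    """Join all final texts, plus any interim text after the last final."""
    final_parts, interim = [], ""
    for frame in frames:
        # Check the subclass first, since interim frames are also instances
        # of the final frame class in this sketch.
        if isinstance(frame, InterimTranscriptionFrameSketch):
            interim = frame.text
        else:
            final_parts.append(frame.text)
            interim = ""
    return " ".join(final_parts + ([interim] if interim else []))
```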
Methods
See the STT base class methods for additional functionality.
Language Setting
The transcription language is configured through CartesiaLiveOptions.
Model Selection
Usage Example
Frame Flow
Connection Management
The service automatically manages WebSocket connections:
- Auto-reconnect: Reconnects automatically when the connection is closed due to timeout
- Finalization: Sends a “finalize” command when the user stops speaking to flush the transcription session
- Error handling: Gracefully handles connection errors and WebSocket exceptions
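The behaviors listed above can be illustrated with a small in-memory sketch. It is not the service's implementation: FakeSocket and ReconnectingSender are hypothetical classes, the single-retry policy is an assumption, and the plain "finalize" string stands in for whatever wire format the real command uses.

```python
import asyncio


class FakeSocket:
    """Hypothetical in-memory stand-in for a WebSocket connection."""

    def __init__(self):
        self.closed = False
        self.sent = []

    async def send(self, message):
        if self.closed:
            raise ConnectionError("socket is closed")
        self.sent.append(message)


class ReconnectingSender:
    """Reopens the connection once if a send fails, then retries the send."""

    def __init__(self, connect):
        self._connect = connect  # async factory returning a socket
        self._socket = None
        self.reconnects = 0

    async def send(self, message):
        if self._socket is None:
            self._socket = await self._connect()
        try:
            await self._socket.send(message)
        except ConnectionError:
            # Connection was closed (for example by a timeout): reconnect.
            self._socket = await self._connect()
            self.reconnects += 1
            await self._socket.send(message)

    async def user_stopped_speaking(self):
        # The page describes a "finalize" command; its wire format is assumed.
        await self.send("finalize")
```

Because the reconnect happens inside send, callers that stream audio do not need to track connection state themselves.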
Metrics Support
The service supports collection of the following metrics:
- Time to First Byte (TTFB)
- Processing duration
- Speech detection events
- Connection status
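As an illustration of the first metric, Time to First Byte can be measured as the interval between a start event and the first result that follows it. The TTFBTimer class below is a hypothetical sketch, not the service's metrics API, and where the real service places the start event is not stated on this page.

```python
import time


class TTFBTimer:
    """Measures seconds from start() to the first result that follows."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = None
        self.ttfb = None

    def start(self):
        """Begin a new measurement, discarding any previous value."""
        self._start = self._clock()
        self.ttfb = None

    def on_result(self):
        """Record the elapsed time; only the first result after start() counts."""
        if self._start is not None and self.ttfb is None:
            self.ttfb = self._clock() - self._start
```

Using a monotonic clock avoids negative or skewed readings if the system wall clock is adjusted during a session.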
Notes
- Requires a valid Cartesia API key
- Supports real-time streaming transcription
- Manages the WebSocket connection lifecycle automatically, including reconnection
- Handles connection errors and WebSocket exceptions