Speech-to-text service implementation using Cartesia’s real-time transcription API
CartesiaSTTService provides real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results.
Set your Cartesia API key in the CARTESIA_API_KEY environment variable.
Input frames:
InputAudioRawFrame - Raw PCM audio data (16-bit, 16 kHz, mono)
UserStartedSpeakingFrame - Triggers metrics collection
UserStoppedSpeakingFrame - Sends finalize command to flush session
STTUpdateSettingsFrame - Runtime transcription configuration updates
STTMuteFrame - Mute audio input for transcription

Output frames:
InterimTranscriptionFrame - Real-time transcription updates
TranscriptionFrame - Final transcription results
ErrorFrame - Connection or processing errors

Model | Description | Best For |
---|---|---|
ink-whisper | Cartesia’s optimized Whisper implementation | General-purpose real-time transcription |
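
The input format listed above (16-bit, 16 kHz, mono PCM) fixes how many bytes each slice of audio occupies. A minimal sketch of that arithmetic follows; the 20 ms chunk duration and the helper names are illustrative assumptions, not values taken from Cartesia or the framework.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the input format above
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes needed for duration_ms of 16-bit, 16 kHz, mono PCM."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_pcm(audio: bytes, duration_ms: int = 20) -> list[bytes]:
    """Split a PCM buffer into fixed-duration chunks (the last may be shorter)."""
    size = chunk_size_bytes(duration_ms)
    return [audio[i:i + size] for i in range(0, len(audio), size)]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)  # 1 s of silence
chunks = split_pcm(one_second)
print(chunk_size_bytes(20), len(chunks))  # 640 50
```

One second of audio in this format is 32,000 bytes, so 20 ms chunks come out at 640 bytes each.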
Language Code | Description | Service Codes |
---|---|---|
Language.EN | English (US) | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.DE | German | de |
Language.IT | Italian | it |
Language.PT | Portuguese | pt |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.RU | Russian | ru |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.ZH | Chinese | zh |
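
The table above amounts to a lookup from a Language value to the code sent to the service. The sketch below illustrates that lookup with a stand-in enum; the Language class defined here, the extra SV member, and the to_service_code helper are hypothetical and exist only for this example.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Hypothetical stand-in for the framework's Language enum."""
    EN = "en"
    ES = "es"
    FR = "fr"
    DE = "de"
    IT = "it"
    PT = "pt"
    NL = "nl"
    PL = "pl"
    RU = "ru"
    JA = "ja"
    KO = "ko"
    ZH = "zh"
    SV = "sv"  # not in the table above; used to show the fallback


# Service codes copied from the table above.
SERVICE_CODES = {
    Language.EN: "en", Language.ES: "es", Language.FR: "fr",
    Language.DE: "de", Language.IT: "it", Language.PT: "pt",
    Language.NL: "nl", Language.PL: "pl", Language.RU: "ru",
    Language.JA: "ja", Language.KO: "ko", Language.ZH: "zh",
}


def to_service_code(language: Language) -> Optional[str]:
    """Return the service code for a language, or None if it is not in the table."""
    return SERVICE_CODES.get(language)


print(to_service_code(Language.JA))  # ja
print(to_service_code(Language.SV))  # None
```

Returning None for a language outside the table lets the caller decide whether to fall back to a default or raise an error.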
Initialize the CartesiaSTTService and use it in a pipeline:
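
The original code for this step is not present here, so the following is only a sketch of what the setup might look like. The import paths, the CartesiaLiveOptions class, and the constructor arguments are assumptions based on Pipecat's usual layout and should be checked against the installed version; build_stt_service and build_pipeline are hypothetical helper names.

```python
import os
from typing import Optional


def build_stt_service(api_key: Optional[str] = None):
    """Create the STT service. Pipecat is imported lazily, inside the function."""
    # Assumed import path and constructor arguments; verify for your version.
    from pipecat.services.cartesia.stt import CartesiaLiveOptions, CartesiaSTTService

    return CartesiaSTTService(
        # Fall back to the CARTESIA_API_KEY environment variable.
        api_key=api_key or os.environ["CARTESIA_API_KEY"],
        live_options=CartesiaLiveOptions(model="ink-whisper", language="en"),
    )


def build_pipeline(transport, downstream_processors):
    """Place the STT service right after the transport's audio input."""
    from pipecat.pipeline.pipeline import Pipeline  # assumed import path

    return Pipeline([transport.input(), build_stt_service(), *downstream_processors])
```

Here transport stands for whatever transport the application already uses, and downstream_processors for the processors that consume the transcription frames.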
Update settings at runtime by pushing an STTUpdateSettingsFrame for the CartesiaSTTService:
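
As with the setup step, the original snippet is missing, so this is a hedged sketch rather than the documented call. It assumes that STTUpdateSettingsFrame lives in pipecat.frames.frames, that it takes a settings mapping, and that the pipeline task exposes an awaitable queue_frame method; switch_language is a hypothetical helper name.

```python
async def switch_language(task, language) -> None:
    """Queue a settings update asking the STT service to change language."""
    # Assumed import path and field name; verify for your Pipecat version.
    from pipecat.frames.frames import STTUpdateSettingsFrame

    # "language" is assumed to be the settings key the service reads.
    await task.queue_frame(STTUpdateSettingsFrame(settings={"language": language}))
```

A caller would pass its running pipeline task and a language value, for example Language.ES from the table above.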