Gladia
Speech-to-text service implementation using Gladia’s API
Overview
GladiaSTTService is a speech-to-text (STT) service that integrates with Gladia’s API to provide real-time transcription. It processes audio input and produces transcription frames in real time.
Installation
To use GladiaSTTService, you need to install the Gladia dependencies.
You’ll also need to set up your Gladia API key as an environment variable: GLADIA_API_KEY
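A minimal sketch of reading the key at startup, using only the standard library. The helper name `load_gladia_api_key` is hypothetical and not part of any Gladia or framework API; it simply fails early with a clear message when the variable is missing.

```python
import os


def load_gladia_api_key(env=None):
    """Return the Gladia API key from the environment, or raise if unset.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GLADIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GLADIA_API_KEY is not set")
    return key
```

The returned value can then be passed as the API key when constructing the service.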
Configuration
Service Parameters
Your Gladia API key for authentication
Gladia API endpoint URL
Minimum confidence threshold for transcriptions. Values range from 0 to 1.
Audio Processing Parameters
Audio sample rate in Hz
Primary language for transcription
Silence duration (in seconds) to mark end of speech
Maximum duration in seconds without detecting speech end
Enable audio enhancement preprocessing
Enable accurate word timestamps in transcription
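The parameters above can be collected and sanity-checked before the service is created. The following is only a sketch: the class `GladiaSettingsSketch` and every field name in it are hypothetical placeholders for the parameters described above, not the service's actual argument names, and the only default shown is the 16 kHz sample rate mentioned under Input Requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GladiaSettingsSketch:
    # Hypothetical field names; check the service signature for the real ones.
    api_key: str
    language: str
    confidence: float                   # minimum confidence, from 0 to 1
    endpointing_seconds: float          # silence that marks end of speech
    max_seconds_without_endpoint: float
    audio_enhancer: bool
    accurate_word_timestamps: bool
    sample_rate_hz: int = 16000         # 16 kHz, per Input Requirements

    def __post_init__(self):
        # Reject values outside the ranges the parameter descriptions imply.
        if not self.api_key:
            raise ValueError("api_key must not be empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.sample_rate_hz <= 0:
            raise ValueError("sample_rate_hz must be positive")
        if self.endpointing_seconds <= 0:
            raise ValueError("endpointing_seconds must be positive")
        if self.max_seconds_without_endpoint <= 0:
            raise ValueError("max_seconds_without_endpoint must be positive")
```

Validating up front keeps a mistyped threshold, such as 50 instead of 0.5, from silently suppressing every transcription.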
Input Requirements
The service processes InputAudioRawFrame instances with:
- Raw PCM audio data
- 16-bit depth
- Sample rate matching configuration (default 16 kHz)
- Single channel (mono)
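As a rough illustration of what these requirements mean in bytes, the sketch below checks that a buffer holds a whole number of 16-bit mono samples and computes its duration. The function name `pcm_duration_seconds` is hypothetical; it operates on plain `bytes` and does not touch the frame classes themselves.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono


def pcm_duration_seconds(audio: bytes, sample_rate: int = 16000) -> float:
    """Return the duration of a raw 16-bit mono PCM buffer in seconds."""
    frame_size = BYTES_PER_SAMPLE * CHANNELS
    if len(audio) % frame_size != 0:
        raise ValueError("buffer is not a whole number of 16-bit samples")
    return len(audio) / frame_size / sample_rate
```

At the default 16 kHz, one second of audio is 32,000 bytes, so a buffer whose length is odd or whose duration looks wrong usually points to a mismatched bit depth, channel count, or sample rate.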
See Audio Frames for detailed frame structure.
Output
The service produces two types of frames during transcription:
InterimTranscriptionFrame
Generated during ongoing speech when the confidence threshold is met. Contains:
- Preliminary transcribed text
- ID of the speaking user
- ISO 8601 formatted timestamp
- Detected language (if enabled)
TranscriptionFrame
Generated for final transcriptions when the confidence threshold is met. Contains the same fields as InterimTranscriptionFrame but represents confirmed text.
See Text Frames for detailed frame structures.
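The rule described above (interim versus final, both gated by the confidence threshold) can be sketched as a small decision function. `RecognitionResult` and `frame_kind` are hypothetical stand-ins for illustration, not the service's internals, and treating "met" as greater than or equal to the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    # Hypothetical stand-in for one result returned by the STT backend.
    text: str
    confidence: float
    is_final: bool


def frame_kind(result: RecognitionResult, threshold: float) -> Optional[str]:
    """Name the frame type to emit, or None if the result is dropped."""
    # Assumption: the threshold is "met" when confidence >= threshold.
    if result.confidence < threshold:
        return None
    if result.is_final:
        return "TranscriptionFrame"
    return "InterimTranscriptionFrame"
```

A low-confidence result produces no frame at all, so downstream processors only ever see text that passed the threshold.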
Example Usage
Methods
See the STT base class methods for additional functionality.
Language Setting
Language Support
Gladia STT supports the following languages:
Language Code | Description | Service Code |
---|---|---|
Language.BG | Bulgarian | bg |
Language.CA | Catalan | ca |
Language.ZH | Chinese | zh |
Language.CS | Czech | cs |
Language.DA | Danish | da |
Language.NL | Dutch | nl |
Language.EN | English | en |
Language.EN_US | English (US) | en |
Language.EN_AU | English (Australia) | en |
Language.EN_GB | English (UK) | en |
Language.EN_NZ | English (New Zealand) | en |
Language.EN_IN | English (India) | en |
Language.ET | Estonian | et |
Language.FI | Finnish | fi |
Language.FR | French | fr |
Language.FR_CA | French (Canada) | fr |
Language.DE | German | de |
Language.DE_CH | German (Switzerland) | de |
Language.EL | Greek | el |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.ID | Indonesian | id |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.LV | Latvian | lv |
Language.LT | Lithuanian | lt |
Language.MS | Malay | ms |
Language.NO | Norwegian | no |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.PT_BR | Portuguese (Brazil) | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SK | Slovak | sk |
Language.ES | Spanish | es |
Language.SV | Swedish | sv |
Language.TH | Thai | th |
Language.TR | Turkish | tr |
Language.UK | Ukrainian | uk |
Language.VI | Vietnamese | vi |
Usage Example
Note: Gladia uses simplified language codes without regional variants.
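To illustrate the collapse of regional variants shown in the table, here is a minimal sketch that strips the region from a language tag and checks the result against the service codes listed above. The function `to_service_code` and the assumption that tags look like `en-US` or `pt_BR` are hypothetical; the service performs its own mapping from the Language values.

```python
from typing import Optional

# Service codes taken from the Language Support table above.
SERVICE_CODES = {
    "bg", "ca", "zh", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hi", "hu", "id", "it", "ja", "ko", "lv", "lt", "ms", "no",
    "pl", "pt", "ro", "ru", "sk", "es", "sv", "th", "tr", "uk", "vi",
}


def to_service_code(language_tag: str) -> Optional[str]:
    """Drop any regional suffix and return the base code, or None if unsupported."""
    base = language_tag.replace("_", "-").split("-", 1)[0].lower()
    return base if base in SERVICE_CODES else None
```

For example, both `en-US` and `en-GB` map to `en`, which matches the Service Code column for the English variants.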
Frame Flow
Service Control
The service accepts STTUpdateSettingsFrame for dynamic configuration updates. See Service Control Frames for details.
Notes
- Audio input must be in PCM format
- Transcription frames are only generated when the confidence threshold is met
- Language detection is optional
- The service automatically handles WebSocket connections and cleanup
- Real-time processing occurs in parallel for natural conversation flow