Gladia
Speech-to-text service implementation using Gladia’s API
Overview
GladiaSTTService
is a speech-to-text (STT) service that integrates with Gladia’s API to provide real-time transcription capabilities. It processes audio input and produces transcription frames in real-time.
Installation
To use GladiaSTTService
, you need to install the Gladia dependencies:
You’ll also need to set up your Gladia API key as an environment variable: GLADIA_API_KEY
Configuration
Service Parameters
Your Gladia API key for authentication
Gladia API endpoint URL
Minimum confidence threshold for transcriptions. Values range from 0 to 1.
Audio Processing Parameters
Audio sample rate in Hz
Primary language for transcription
Silence duration (in seconds) to mark end of speech
Maximum duration in seconds without detecting speech end
Enable audio enhancement preprocessing
Enable accurate word timestamps in transcription
Input
The service processes raw audio data with the following requirements:
- PCM audio format
- 16-bit depth
- 16kHz sample rate (default)
- Single channel (mono)
Output
The service produces two types of frames during transcription:
TranscriptionFrame
Generated for final transcriptions, containing:
Transcribed text
User identifier
ISO 8601 formatted timestamp
Transcription language
InterimTranscriptionFrame
Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
ErrorFrame
Generated when transcription errors occur, containing error details.
Methods
See the STT base class methods for additional functionality.
Language Setting
Language Support
Gladia STT supports the following languages:
Language Code | Description | Service Code |
---|---|---|
Language.BG | Bulgarian | bg |
Language.CA | Catalan | ca |
Language.ZH | Chinese | zh |
Language.CS | Czech | cs |
Language.DA | Danish | da |
Language.NL | Dutch | nl |
Language.EN | English | en |
Language.EN_US | English (US) | en |
Language.EN_AU | English (Australia) | en |
Language.EN_GB | English (UK) | en |
Language.EN_NZ | English (New Zealand) | en |
Language.EN_IN | English (India) | en |
Language.ET | Estonian | et |
Language.FI | Finnish | fi |
Language.FR | French | fr |
Language.FR_CA | French (Canada) | fr |
Language.DE | German | de |
Language.DE_CH | German (Switzerland) | de |
Language.EL | Greek | el |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.ID | Indonesian | id |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.LV | Latvian | lv |
Language.LT | Lithuanian | lt |
Language.MS | Malay | ms |
Language.NO | Norwegian | no |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.PT_BR | Portuguese (Brazil) | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SK | Slovak | sk |
Language.ES | Spanish | es |
Language.SV | Swedish | sv |
Language.TH | Thai | th |
Language.TR | Turkish | tr |
Language.UK | Ukrainian | uk |
Language.VI | Vietnamese | vi |
Usage Example
Note: Gladia uses simplified language codes without regional variants.
Frame Flow
Metrics Support
The service collects processing metrics:
- Time to First Byte (TTFB)
- Processing duration
- Connection status
Notes
- Audio input must be in PCM format
- Transcription frames are only generated when confidence threshold is met
- Language detection is optional
- Service automatically handles websocket connections and cleanup
- Real-time processing occurs in parallel for natural conversation flow