Gladia
Speech-to-text service implementation using Gladia’s API
Overview
GladiaSTTService is a speech-to-text (STT) service that integrates with Gladia’s API to provide real-time transcription. It processes audio input and produces transcription frames in real time, with support for multiple languages, custom vocabulary, and various processing options.
Installation
To use GladiaSTTService, you need to install the Gladia dependencies:
You’ll also need to set up your Gladia API key as an environment variable: GLADIA_API_KEY.
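A minimal sketch of reading the key at startup, assuming only that the variable is named GLADIA_API_KEY as stated above. The helper name `load_gladia_api_key` is hypothetical and not part of any library.

```python
import os


def load_gladia_api_key(env=None):
    """Return the Gladia API key from the environment, or raise if missing.

    `load_gladia_api_key` is a hypothetical helper used for illustration.
    Pass a dict as `env` to test without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("GLADIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GLADIA_API_KEY is not set")
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error surfacing later from the connection.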
Configuration
Service Parameters
Your Gladia API key for authentication
Gladia API endpoint URL
Minimum confidence threshold to create interim and final transcriptions. Values range from 0 to 1.
Audio sample rate in Hz
Model to use for transcription. Options include solaria-1, solaria-mini-1, fast, and accurate. See Gladia’s docs for the latest supported models.
Additional configuration parameters for the service
GladiaInputParams
Audio encoding format
Audio bit depth
Number of audio channels
Additional metadata to include with requests
Silence duration in seconds to mark end of speech
Maximum utterance duration without silence
Primary language for transcription. Deprecated: use language_config instead.
Detailed language configuration
Audio pre-processing options
Real-time processing features
WebSocket message filtering options
LanguageConfig
Language(s) to use for transcription. If exactly one language is set, it is used for the whole transcription. If several languages or none are provided, the model auto-detects the language.
If true, the language is auto-detected on each utterance. Otherwise, it is auto-detected on the first utterance and then used for the rest of the transcription. This option is ignored when exactly one language is set.
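The interaction between the two LanguageConfig options can be sketched as a small decision function. This only restates the rules above in code. The function name, the parameter names `languages` and `code_switching`, and the returned labels are assumptions made for illustration and are not taken from Gladia’s API.

```python
def detection_mode(languages, code_switching=False):
    """Describe how the transcription language would be chosen.

    Hypothetical helper: parameter names and return labels are
    illustrative assumptions, not part of any library.
    """
    if len(languages) == 1:
        # A single language is used as-is; per-utterance detection is ignored.
        return f"fixed:{languages[0]}"
    if code_switching:
        # Zero or several languages: detect again on every utterance.
        return "auto-detect:per-utterance"
    # Zero or several languages: detect once, then keep that language.
    return "auto-detect:first-utterance"
```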
PreProcessingConfig
Sensitivity configuration for Speech Threshold. A value close to 1 will apply stricter thresholds, making it less likely to detect background sounds as speech. Must be between 0 and 1.
CustomVocabularyConfig
Specific vocabulary list to feed the transcription model with. Can be a list of strings or CustomVocabularyItem objects.
Default intensity for the custom vocabulary. Must be between 0 and 1.
CustomSpellingConfig
The spelling rules applied to the audio transcription. Keys are the correct spellings and values are lists of phonetic variations.
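To make the shape of the spelling rules concrete, here is a small sketch of such a mapping and a local function that applies it to text. Gladia applies these rules on its side. The `apply_spelling` function and the sample entries are hypothetical and only illustrate what the mapping means.

```python
import re

# Keys are the correct spellings; values are variants to be replaced.
# These entries are made-up examples.
spelling_rules = {
    "Gladia": ["gladya", "glad ia"],
    "SQL": ["sequel"],
}


def apply_spelling(text, rules):
    """Replace each listed variant with its correct spelling (case-insensitive)."""
    for correct, variants in rules.items():
        for variant in variants:
            # Word boundaries avoid rewriting substrings of longer words.
            pattern = r"\b" + re.escape(variant) + r"\b"
            text = re.sub(pattern, correct, text, flags=re.IGNORECASE)
    return text
```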
TranslationConfig
The target language(s) in ISO 639-1 format (e.g., “en”, “fr”, “es”)
Translation model to use. Options: “base” or “enhanced”
Align translated utterances with the original ones
RealtimeProcessingConfig
Whether to provide per-word timestamps
Whether to enable custom vocabulary
Custom vocabulary configuration
Whether to enable custom spelling
Custom spelling configuration
Whether to enable translation
Translation configuration
Whether to enable named entity recognition
Whether to enable sentiment analysis
MessagesConfig
If true, partial utterances will be sent via WebSocket
If true, final utterances will be sent via WebSocket
If true, begin and end speech events will be sent via WebSocket
If true, pre-processing events will be sent via WebSocket
If true, realtime processing events will be sent via WebSocket
If true, post-processing events will be sent via WebSocket
If true, acknowledgments will be sent via WebSocket
If true, errors will be sent via WebSocket
If true, lifecycle events will be sent via WebSocket
Input
The service processes raw audio data with the following requirements:
- PCM audio format
- 16-bit depth
- 16kHz sample rate (default)
- Single channel (mono)
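The requirements above determine how many bytes each slice of audio occupies, which is useful when sizing the chunks you send. The sketch below is plain arithmetic on the default values. The function name and the 20 ms chunk duration in the usage note are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 16_000  # default sample rate
BIT_DEPTH = 16           # bits per sample
CHANNELS = 1             # mono


def pcm_bytes(duration_ms, sample_rate=SAMPLE_RATE_HZ,
              bit_depth=BIT_DEPTH, channels=CHANNELS):
    """Return the size in bytes of `duration_ms` of raw PCM audio."""
    samples = sample_rate * duration_ms // 1000
    bytes_per_sample = bit_depth // 8
    return samples * bytes_per_sample * channels
```

With the defaults, one second of audio is 32,000 bytes and a 20 ms chunk is 640 bytes.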
Output
The service produces the following frame types:
TranscriptionFrame
Generated for final transcriptions, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp
- Transcription language
InterimTranscriptionFrame
Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
ErrorFrame
Generated when transcription errors occur, containing error details.
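A consumer typically treats interim results as provisional and final results as committed. The sketch below models that with a stand-in dataclass carrying the four fields listed above. `SketchTranscript`, its field names, and the `final` flag are hypothetical and do not represent the actual frame classes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SketchTranscript:
    """Hypothetical stand-in for a transcription result."""
    text: str
    user_id: str
    timestamp: str  # ISO 8601 formatted
    language: str
    final: bool     # True for final results, False for interim ones


def make_transcript(text, user_id, language, final):
    """Build a result stamped with the current UTC time in ISO 8601 format."""
    stamp = datetime.now(timezone.utc).isoformat()
    return SketchTranscript(text, user_id, stamp, language, final)


def render(transcripts):
    """Join all final texts, followed by the latest pending interim text."""
    finals = [t.text for t in transcripts if t.final]
    pending = ""
    for t in transcripts:
        # A final result clears any interim text that preceded it.
        pending = "" if t.final else t.text
    return " ".join(finals + ([pending] if pending else []))
```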
Methods
See the STT base class methods for additional functionality.
Language Setting
Language Support
Gladia STT supports a wide range of languages. Here’s a partial list:
Language Code | Description | Service Code |
---|---|---|
Language.AF | Afrikaans | af |
Language.AM | Amharic | am |
Language.AR | Arabic | ar |
Language.AS | Assamese | as |
Language.AZ | Azerbaijani | az |
Language.BA | Bashkir | ba |
Language.BE | Belarusian | be |
Language.BG | Bulgarian | bg |
Language.BN | Bengali | bn |
Language.BO | Tibetan | bo |
Language.BR | Breton | br |
Language.BS | Bosnian | bs |
Language.CA | Catalan | ca |
Language.CS | Czech | cs |
Language.CY | Welsh | cy |
Language.DA | Danish | da |
Language.DE | German | de |
Language.EL | Greek | el |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.ET | Estonian | et |
Language.EU | Basque | eu |
Language.FA | Persian | fa |
Language.FI | Finnish | fi |
Language.FO | Faroese | fo |
Language.FR | French | fr |
Language.GL | Galician | gl |
Language.GU | Gujarati | gu |
Language.HA | Hausa | ha |
Language.HAW | Hawaiian | haw |
Language.HE | Hebrew | he |
Language.HI | Hindi | hi |
Language.HR | Croatian | hr |
Language.HT | Haitian Creole | ht |
Language.HU | Hungarian | hu |
Language.HY | Armenian | hy |
Language.ID | Indonesian | id |
Language.IS | Icelandic | is |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.JV | Javanese | jv |
Language.KA | Georgian | ka |
Language.KK | Kazakh | kk |
Language.KM | Khmer | km |
Language.KN | Kannada | kn |
Language.KO | Korean | ko |
Language.LA | Latin | la |
Language.LB | Luxembourgish | lb |
Language.LN | Lingala | ln |
Language.LO | Lao | lo |
Language.LT | Lithuanian | lt |
Language.LV | Latvian | lv |
Language.MG | Malagasy | mg |
Language.MI | Maori | mi |
Language.MK | Macedonian | mk |
Language.ML | Malayalam | ml |
Language.MN | Mongolian | mn |
Language.MR | Marathi | mr |
Language.MS | Malay | ms |
Language.MT | Maltese | mt |
Language.MY_MR | Burmese | mymr |
Language.NE | Nepali | ne |
Language.NL | Dutch | nl |
Language.NN | Norwegian (Nynorsk) | nn |
Language.NO | Norwegian | no |
Language.OC | Occitan | oc |
Language.PA | Punjabi | pa |
Language.PL | Polish | pl |
Language.PS | Pashto | ps |
Language.PT | Portuguese | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SA | Sanskrit | sa |
Language.SD | Sindhi | sd |
Language.SI | Sinhala | si |
Language.SK | Slovak | sk |
Language.SL | Slovenian | sl |
Language.SN | Shona | sn |
Language.SO | Somali | so |
Language.SQ | Albanian | sq |
Language.SR | Serbian | sr |
Language.SU | Sundanese | su |
Language.SV | Swedish | sv |
Language.SW | Swahili | sw |
Language.TA | Tamil | ta |
Language.TE | Telugu | te |
Language.TG | Tajik | tg |
Language.TH | Thai | th |
Language.TK | Turkmen | tk |
Language.TL | Tagalog | tl |
Language.TR | Turkish | tr |
Language.TT | Tatar | tt |
Language.UK | Ukrainian | uk |
Language.UR | Urdu | ur |
Language.UZ | Uzbek | uz |
Language.VI | Vietnamese | vi |
Language.YI | Yiddish | yi |
Language.YO | Yoruba | yo |
Language.ZH | Chinese | zh |
For a complete list of supported languages, refer to Gladia’s documentation.
Advanced Features
Custom Vocabulary
You can provide custom vocabulary items with bias intensity:
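A minimal sketch of assembling such a configuration as a plain dictionary, assuming the shape described under CustomVocabularyConfig: a list whose entries are either strings or items with their own intensity, plus a default intensity between 0 and 1. The function name and the key names `vocabulary`, `value`, `intensity`, and `default_intensity` are assumptions made for illustration.

```python
def build_custom_vocabulary(items, default_intensity=0.5):
    """Build a vocabulary config dict (hypothetical key names).

    `items` may mix plain strings and (value, intensity) tuples.
    """
    def check(intensity):
        if not 0 <= intensity <= 1:
            raise ValueError("intensity must be between 0 and 1")
        return intensity

    vocabulary = []
    for item in items:
        if isinstance(item, str):
            # Plain strings fall back to the default intensity.
            vocabulary.append(item)
        else:
            value, intensity = item
            vocabulary.append({"value": value, "intensity": check(intensity)})
    return {
        "vocabulary": vocabulary,
        "default_intensity": check(default_intensity),
    }
```

For example, `build_custom_vocabulary(["solaria", ("Gladia", 0.8)], default_intensity=0.4)` biases "Gladia" more strongly than "solaria".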
Translation
Enable real-time translation:
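A minimal sketch of a translation configuration as a plain dictionary, based on the options listed under TranslationConfig: ISO 639-1 target languages, a model that is either "base" or "enhanced", and an alignment flag. The function name and the key names are assumptions made for illustration.

```python
def build_translation(target_languages, model="base", align_utterances=True):
    """Build a translation config dict (hypothetical key names)."""
    if model not in ("base", "enhanced"):
        raise ValueError("model must be 'base' or 'enhanced'")
    for code in target_languages:
        # ISO 639-1 codes are two lowercase letters, e.g. "en" or "fr".
        if len(code) != 2 or not code.isalpha() or not code.islower():
            raise ValueError(f"not an ISO 639-1 code: {code!r}")
    return {
        "target_languages": list(target_languages),
        "model": model,
        "align_utterances": bool(align_utterances),
    }
```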
Multi-language Support
Configure multiple languages with automatic language switching:
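A minimal sketch of such a configuration as a plain dictionary serialized to JSON. It assumes a list of candidate languages plus a flag that requests detection on each utterance. The function name and the key names `languages` and `code_switching` are assumptions made for illustration.

```python
import json


def build_language_config(languages, code_switching=False):
    """Build a language config dict (hypothetical key names)."""
    for code in languages:
        if not isinstance(code, str) or not code:
            raise ValueError("language codes must be non-empty strings")
    return {
        "languages": list(languages),
        "code_switching": bool(code_switching),
    }


# Three candidate languages, re-detected on every utterance.
config = build_language_config(["en", "fr", "es"], code_switching=True)
payload = json.dumps(config)
```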
Usage Example
Frame Flow
Metrics Support
The service collects processing metrics:
- Time to First Byte (TTFB)
- Processing duration
- Connection status
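As an illustration of what Time to First Byte means here, the sketch below measures the delay between starting to send audio and receiving the first result. It does not show how the service collects the metric internally. The `TTFBTimer` class is hypothetical.

```python
import time


class TTFBTimer:
    """Hypothetical timer measuring seconds from start to first result."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable for testing
        self._started_at = None
        self.ttfb = None

    def start(self):
        """Call when the first audio of an utterance is sent."""
        self._started_at = self._clock()
        self.ttfb = None

    def mark_first_result(self):
        """Call on every result; only the first one after start() is recorded."""
        if self._started_at is None or self.ttfb is not None:
            return
        self.ttfb = self._clock() - self._started_at
```

A monotonic clock is used so that system clock adjustments cannot produce negative or inflated durations.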
Notes
- Audio input must be in PCM format
- Transcription frames are only generated when the confidence threshold is met
- The service automatically handles WebSocket connections and cleanup
- Real-time processing occurs in parallel for natural conversation flow