Overview

GladiaSTTService is a speech-to-text (STT) service that integrates with Gladia’s API to provide real-time transcription. It processes streaming audio input and produces interim and final transcription frames, with support for multiple languages, custom vocabulary, and various processing options.

Installation

To use GladiaSTTService, you need to install the Gladia dependencies:

pip install "pipecat-ai[gladia]"

You’ll also need to set up your Gladia API key as an environment variable: GLADIA_API_KEY.
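In application code, the key is typically read from that environment variable rather than hard-coded. A minimal sketch using the standard library:

```python
import os

# Read the Gladia API key from the environment; empty string if unset.
api_key = os.getenv("GLADIA_API_KEY", "")
```

You can then pass `api_key` to the service constructor shown in the usage example below.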

Configuration

Service Parameters

api_key
string
required

Your Gladia API key for authentication

url
string
default:"https://api.gladia.io/v2/live"

Gladia API endpoint URL

confidence
float
default:"0.5"

Minimum confidence threshold to create interim and final transcriptions. Values range from 0 to 1.

sample_rate
integer
default:"None"

Audio sample rate in Hz

model
string
default:"solaria-1"

Model to use for transcription. Options include:

  • solaria-1
  • solaria-mini-1
  • fast
  • accurate

See Gladia’s docs for the latest supported models.

params
GladiaInputParams
default:"GladiaInputParams()"

Additional configuration parameters for the service

GladiaInputParams

encoding
string
default:"wav/pcm"

Audio encoding format

bit_depth
integer
default:"16"

Audio bit depth

channels
integer
default:"1"

Number of audio channels

custom_metadata
Dict[str, Any]

Additional metadata to include with requests

endpointing
float

Silence duration in seconds to mark end of speech

maximum_duration_without_endpointing
integer
default:"10"

Maximum utterance duration in seconds before an endpoint is forced, even without silence

language
Language
deprecated

Primary language for transcription. Deprecated: use language_config instead.

language_config
LanguageConfig

Detailed language configuration

pre_processing
PreProcessingConfig

Audio pre-processing options

realtime_processing
RealtimeProcessingConfig

Real-time processing features

messages_config
MessagesConfig

WebSocket message filtering options
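Taken together, the audio and endpointing fields above might be combined like this (a sketch; field names follow the listing above and the values shown are illustrative):

```python
from pipecat.services.gladia.config import GladiaInputParams

params = GladiaInputParams(
    encoding="wav/pcm",   # default audio encoding
    bit_depth=16,         # 16-bit samples
    channels=1,           # mono
    endpointing=0.4,      # 400 ms of silence marks end of speech
    maximum_duration_without_endpointing=10,  # force an endpoint after 10 s
)
```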

LanguageConfig

languages
List[str]

Specify language(s) for transcription. If one language is set, it will be used for all transcription. If multiple languages are provided or none, language will be auto-detected by the model.

code_switching
boolean
default:"false"

If true, language will be auto-detected on each utterance. Otherwise, language will be auto-detected on first utterance and then used for the rest of the transcription. If one language is set, this option will be ignored.

PreProcessingConfig

speech_threshold
float
default:"0.8"

Sensitivity configuration for Speech Threshold. A value close to 1 will apply stricter thresholds, making it less likely to detect background sounds as speech. Must be between 0 and 1.

CustomVocabularyConfig

vocabulary
List[Union[str, CustomVocabularyItem]]
required

Specific vocabulary list to feed the transcription model with. Can be a list of strings or CustomVocabularyItem objects.

default_intensity
float

Default intensity for the custom vocabulary. Must be between 0 and 1.

CustomSpellingConfig

spelling_dictionary
Dict[str, List[str]]
required

The list of spelling rules applied on the audio transcription. Keys are the correct spellings and values are lists of phonetic variations.
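For example, to map phonetic variants onto their correct spellings and enable the feature (a sketch following the field descriptions above; the dictionary entries are illustrative):

```python
from pipecat.services.gladia.config import CustomSpellingConfig, RealtimeProcessingConfig

spelling = CustomSpellingConfig(
    spelling_dictionary={
        # correct spelling -> phonetic variations the model might produce
        "Pipecat": ["pipe cat", "pip cat"],
        "SQL": ["sequel"],
    }
)

realtime_config = RealtimeProcessingConfig(
    custom_spelling=True,
    custom_spelling_config=spelling,
)
```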

TranslationConfig

target_languages
List[str]
required

The target language(s) in ISO 639-1 format (e.g., “en”, “fr”, “es”)

model
string
default:"base"

Translation model to use. Options: “base” or “enhanced”

match_original_utterances
boolean
default:"true"

Align translated utterances with the original ones

RealtimeProcessingConfig

words_accurate_timestamps
boolean

Whether to provide per-word timestamps

custom_vocabulary
boolean

Whether to enable custom vocabulary

custom_vocabulary_config
CustomVocabularyConfig

Custom vocabulary configuration

custom_spelling
boolean

Whether to enable custom spelling

custom_spelling_config
CustomSpellingConfig

Custom spelling configuration

translation
boolean

Whether to enable translation

translation_config
TranslationConfig

Translation configuration

named_entity_recognition
boolean

Whether to enable named entity recognition

sentiment_analysis
boolean

Whether to enable sentiment analysis

MessagesConfig

receive_partial_transcripts
boolean
default:"true"

If true, partial utterances will be sent via WebSocket

receive_final_transcripts
boolean
default:"true"

If true, final utterances will be sent via WebSocket

receive_speech_events
boolean
default:"true"

If true, begin and end speech events will be sent via WebSocket

receive_pre_processing_events
boolean
default:"true"

If true, pre-processing events will be sent via WebSocket

receive_realtime_processing_events
boolean
default:"true"

If true, realtime processing events will be sent via WebSocket

receive_post_processing_events
boolean
default:"true"

If true, post-processing events will be sent via WebSocket

receive_acknowledgments
boolean
default:"true"

If true, acknowledgments will be sent via WebSocket

receive_errors
boolean
default:"true"

If true, errors will be sent via WebSocket

receive_lifecycle_events
boolean
default:"false"

If true, lifecycle events will be sent via WebSocket
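To reduce WebSocket traffic to final transcripts only, the other message types can be disabled (a sketch using the fields above; note that disabling partial transcripts also suppresses interim transcription frames):

```python
from pipecat.services.gladia.config import MessagesConfig

messages_config = MessagesConfig(
    receive_partial_transcripts=False,  # skip interim results
    receive_speech_events=False,
    receive_pre_processing_events=False,
    receive_realtime_processing_events=False,
    receive_post_processing_events=False,
    receive_acknowledgments=False,
)
```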

Input

The service processes raw audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • 16kHz sample rate (default)
  • Single channel (mono)
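With the default format above, the raw input bandwidth works out to 32 kB per second of audio:

```python
# Raw PCM bandwidth for the default input format listed above.
sample_rate = 16_000   # Hz
bit_depth = 16         # bits per sample
channels = 1           # mono

bytes_per_second = sample_rate * (bit_depth // 8) * channels
print(bytes_per_second)  # 32000
```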

Output

The service produces two types of frames during transcription:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Transcription language

InterimTranscriptionFrame

Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.

ErrorFrame

Generated when transcription errors occur, containing error details.

Methods

See the STT base class methods for additional functionality.

Language Setting

await service.set_language(Language.FR)

Language Support

Gladia STT supports a wide range of languages. Here’s a partial list:

Language Code     Description           Service Code
Language.AF       Afrikaans             af
Language.AM       Amharic               am
Language.AR       Arabic                ar
Language.AS       Assamese              as
Language.AZ       Azerbaijani           az
Language.BA       Bashkir               ba
Language.BE       Belarusian            be
Language.BG       Bulgarian             bg
Language.BN       Bengali               bn
Language.BO       Tibetan               bo
Language.BR       Breton                br
Language.BS       Bosnian               bs
Language.CA       Catalan               ca
Language.CS       Czech                 cs
Language.CY       Welsh                 cy
Language.DA       Danish                da
Language.DE       German                de
Language.EL       Greek                 el
Language.EN       English               en
Language.ES       Spanish               es
Language.ET       Estonian              et
Language.EU       Basque                eu
Language.FA       Persian               fa
Language.FI       Finnish               fi
Language.FO       Faroese               fo
Language.FR       French                fr
Language.GL       Galician              gl
Language.GU       Gujarati              gu
Language.HA       Hausa                 ha
Language.HAW      Hawaiian              haw
Language.HE       Hebrew                he
Language.HI       Hindi                 hi
Language.HR       Croatian              hr
Language.HT       Haitian Creole        ht
Language.HU       Hungarian             hu
Language.HY       Armenian              hy
Language.ID       Indonesian            id
Language.IS       Icelandic             is
Language.IT       Italian               it
Language.JA       Japanese              ja
Language.JV       Javanese              jv
Language.KA       Georgian              ka
Language.KK       Kazakh                kk
Language.KM       Khmer                 km
Language.KN       Kannada               kn
Language.KO       Korean                ko
Language.LA       Latin                 la
Language.LB       Luxembourgish         lb
Language.LN       Lingala               ln
Language.LO       Lao                   lo
Language.LT       Lithuanian            lt
Language.LV       Latvian               lv
Language.MG       Malagasy              mg
Language.MI       Maori                 mi
Language.MK       Macedonian            mk
Language.ML       Malayalam             ml
Language.MN       Mongolian             mn
Language.MR       Marathi               mr
Language.MS       Malay                 ms
Language.MT       Maltese               mt
Language.MY_MR    Burmese               mymr
Language.NE       Nepali                ne
Language.NL       Dutch                 nl
Language.NN       Norwegian (Nynorsk)   nn
Language.NO       Norwegian             no
Language.OC       Occitan               oc
Language.PA       Punjabi               pa
Language.PL       Polish                pl
Language.PS       Pashto                ps
Language.PT       Portuguese            pt
Language.RO       Romanian              ro
Language.RU       Russian               ru
Language.SA       Sanskrit              sa
Language.SD       Sindhi                sd
Language.SI       Sinhala               si
Language.SK       Slovak                sk
Language.SL       Slovenian             sl
Language.SN       Shona                 sn
Language.SO       Somali                so
Language.SQ       Albanian              sq
Language.SR       Serbian               sr
Language.SU       Sundanese             su
Language.SV       Swedish               sv
Language.SW       Swahili               sw
Language.TA       Tamil                 ta
Language.TE       Telugu                te
Language.TG       Tajik                 tg
Language.TH       Thai                  th
Language.TK       Turkmen               tk
Language.TL       Tagalog               tl
Language.TR       Turkish               tr
Language.TT       Tatar                 tt
Language.UK       Ukrainian             uk
Language.UR       Urdu                  ur
Language.UZ       Uzbek                 uz
Language.VI       Vietnamese            vi
Language.YI       Yiddish               yi
Language.YO       Yoruba                yo
Language.ZH       Chinese               zh

For a complete list of supported languages, refer to Gladia’s documentation.

Advanced Features

Custom Vocabulary

You can provide custom vocabulary items with bias intensity:

from pipecat.services.gladia.config import CustomVocabularyItem, CustomVocabularyConfig, RealtimeProcessingConfig

custom_vocab = CustomVocabularyConfig(
    vocabulary=[
        CustomVocabularyItem(value="Pipecat", intensity=0.8),
        CustomVocabularyItem(value="Daily", intensity=0.7),
    ],
    default_intensity=0.5
)

realtime_config = RealtimeProcessingConfig(
    custom_vocabulary=True,
    custom_vocabulary_config=custom_vocab
)

Translation

Enable real-time translation:

from pipecat.services.gladia.config import TranslationConfig, RealtimeProcessingConfig

translation_config = TranslationConfig(
    target_languages=["fr", "es", "de"],
    model="enhanced",
    match_original_utterances=True
)

realtime_config = RealtimeProcessingConfig(
    translation=True,
    translation_config=translation_config
)

Multi-language Support

Configure multiple languages with automatic language switching:

from pipecat.services.gladia.config import LanguageConfig, GladiaInputParams

language_config = LanguageConfig(
    languages=["en", "fr", "es"],
    code_switching=True
)

params = GladiaInputParams(
    language_config=language_config
)

Usage Example

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gladia.stt import GladiaSTTService
from pipecat.services.gladia.config import (
    GladiaInputParams,
    LanguageConfig,
    RealtimeProcessingConfig
)
from pipecat.transcriptions.language import Language

# Configure the service
stt = GladiaSTTService(
    api_key="your-api-key",
    model="solaria-1",
    params=GladiaInputParams(
        language_config=LanguageConfig(
            languages=[Language.EN, Language.FR],
            code_switching=True
        ),
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Frame Flow

Audio frames flow into the service, which emits InterimTranscriptionFrame while speech is in progress, TranscriptionFrame for final results, and ErrorFrame when transcription fails.

Metrics Support

The service collects processing metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • Connection status

Notes

  • Audio input must be in PCM format
  • Transcription frames are generated only when the confidence threshold is met
  • The service automatically manages the WebSocket connection, including setup and cleanup
  • Real-time processing features run in parallel with transcription, preserving natural conversation flow