Overview

GladiaSTTService is a speech-to-text (STT) service that uses Gladia’s API for real-time transcription. It consumes audio input and emits transcription frames as speech is recognized.

Installation

To use GladiaSTTService, you need to install the Gladia dependencies:

pip install pipecat-ai[gladia]

You’ll also need to store your Gladia API key in the GLADIA_API_KEY environment variable.
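
As a minimal sketch, the key can be read from the environment and checked before the service is constructed. The helper name `load_gladia_api_key` is hypothetical and is not part of Pipecat; only the variable name GLADIA_API_KEY comes from this page.

```python
import os


def load_gladia_api_key(env=None):
    """Return the Gladia API key from the environment, or raise if it is unset.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GLADIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GLADIA_API_KEY is not set")
    return key
```

The returned value can then be passed as the `api_key` argument of GladiaSTTService.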

Configuration

Service Parameters

api_key
string
required

Your Gladia API key for authentication

url
string
default: "https://api.gladia.io/v2/live"

Gladia API endpoint URL

confidence
float
default: 0.5

Minimum confidence threshold for transcriptions. Values range from 0 to 1.
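
To illustrate what a confidence threshold means, the sketch below filters results by score. It is not the service's implementation: the function name is hypothetical, and whether the service compares with `>=` or `>` is an assumption.

```python
def passes_confidence(confidence, threshold=0.5):
    """Return True when a transcription's confidence meets the threshold.

    Assumes an inclusive comparison; the real service may differ.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return confidence >= threshold


# Keep only results whose score reaches a stricter 0.7 threshold.
scored = [("hello", 0.92), ("hollow", 0.41), ("world", 0.7)]
kept = [text for text, score in scored if passes_confidence(score, 0.7)]
```

Raising the threshold trades fewer low-quality transcriptions for a higher chance of dropping correct ones.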

Audio Processing Parameters

sample_rate
integer
default: 16000

Audio sample rate in Hz

language
Language
default: Language.EN

Primary language for transcription

endpointing
float
default: 0.2

Silence duration (in seconds) to mark end of speech

maximum_duration_without_endpointing
integer
default: 10

Maximum duration in seconds without detecting speech end

audio_enhancer
boolean

Enable audio enhancement preprocessing

words_accurate_timestamps
boolean

Enable accurate word timestamps in transcription
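
The endpointing and maximum_duration_without_endpointing parameters above can be read as two conditions for closing an utterance. The following sketch shows that reading only. The decision is made on Gladia's side, and the function name and exact comparison rules here are assumptions.

```python
def should_close_utterance(silence_s, utterance_s,
                           endpointing=0.2,
                           maximum_duration_without_endpointing=10):
    """Decide whether the current utterance should be treated as finished.

    silence_s:   seconds of silence observed since the last speech
    utterance_s: seconds elapsed since the utterance started
    """
    # Enough trailing silence marks the end of speech.
    if silence_s >= endpointing:
        return True
    # Otherwise, force an end once the utterance has run too long.
    return utterance_s >= maximum_duration_without_endpointing
```

With the defaults, a 0.25 s pause closes the utterance, while continuous speech is cut at 10 s even without a pause.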

Input

The service processes raw audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • 16kHz sample rate (default)
  • Single channel (mono)
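
As a rough illustration of the format listed above, this sketch packs floating-point samples into 16-bit mono PCM and computes the byte size of a chunk at 16 kHz. Little-endian byte order is an assumption, and the function names are hypothetical.

```python
import math
import struct

SAMPLE_RATE = 16000   # Hz, the documented default
BYTES_PER_SAMPLE = 2  # 16-bit depth
CHANNELS = 1          # mono


def floats_to_pcm16(samples):
    """Pack floats in [-1.0, 1.0] as little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_size_bytes(duration_ms):
    """Number of bytes needed for duration_ms of audio in this format."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


# 20 ms of a 440 Hz tone: 320 samples at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(320)]
pcm = floats_to_pcm16(tone)
```

At these settings one second of audio is 32,000 bytes, so a 20 ms chunk is 640 bytes.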

Output

The service produces two types of transcription frames, plus an error frame when transcription fails:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Transcription language

InterimTranscriptionFrame

Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.

ErrorFrame

Generated when transcription errors occur, containing error details.
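
A downstream consumer typically treats the three frame types differently. The sketch below uses stand-in dataclasses so that it runs without Pipecat installed. The `Stub…` classes, the `error` field name, and `handle_frame` are all hypothetical, and the real frame classes may carry additional fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


# Stand-in types for illustration only; the real frame classes come from Pipecat.
@dataclass
class StubTranscriptionFrame:
    text: str
    user_id: str
    timestamp: str  # ISO 8601
    language: str


@dataclass
class StubInterimTranscriptionFrame:
    text: str
    user_id: str
    timestamp: str  # ISO 8601
    language: str


@dataclass
class StubErrorFrame:
    error: str  # hypothetical field name


def handle_frame(frame, transcript):
    """Append final text to the transcript and return a short status label."""
    if isinstance(frame, StubTranscriptionFrame):
        transcript.append(frame.text)
        return "final"
    if isinstance(frame, StubInterimTranscriptionFrame):
        return "interim"  # preliminary text that may still change
    if isinstance(frame, StubErrorFrame):
        return "error"
    return "ignored"


now = datetime.now(timezone.utc).isoformat()
transcript = []
handle_frame(StubInterimTranscriptionFrame("hel", "user-1", now, "en"), transcript)
handle_frame(StubTranscriptionFrame("hello", "user-1", now, "en"), transcript)
```

Only final frames are added to the transcript here; interim frames are better suited to live captions that get overwritten.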

Methods

See the STT base class methods for additional functionality.

Language Setting

await service.set_language(Language.FR)

Language Support

Gladia STT supports the following languages:

Language Code   Description             Service Code
Language.BG     Bulgarian               bg
Language.CA     Catalan                 ca
Language.ZH     Chinese                 zh
Language.CS     Czech                   cs
Language.DA     Danish                  da
Language.NL     Dutch                   nl
Language.EN     English                 en
Language.EN_US  English (US)            en
Language.EN_AU  English (Australia)     en
Language.EN_GB  English (UK)            en
Language.EN_NZ  English (New Zealand)   en
Language.EN_IN  English (India)         en
Language.ET     Estonian                et
Language.FI     Finnish                 fi
Language.FR     French                  fr
Language.FR_CA  French (Canada)         fr
Language.DE     German                  de
Language.DE_CH  German (Switzerland)    de
Language.EL     Greek                   el
Language.HI     Hindi                   hi
Language.HU     Hungarian               hu
Language.ID     Indonesian              id
Language.IT     Italian                 it
Language.JA     Japanese                ja
Language.KO     Korean                  ko
Language.LV     Latvian                 lv
Language.LT     Lithuanian              lt
Language.MS     Malay                   ms
Language.NO     Norwegian               no
Language.PL     Polish                  pl
Language.PT     Portuguese              pt
Language.PT_BR  Portuguese (Brazil)     pt
Language.RO     Romanian                ro
Language.RU     Russian                 ru
Language.SK     Slovak                  sk
Language.ES     Spanish                 es
Language.SV     Swedish                 sv
Language.TH     Thai                    th
Language.TR     Turkish                 tr
Language.UK     Ukrainian               uk
Language.VI     Vietnamese              vi
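
The mapping in this list follows one rule: every regional variant collapses to its base language code. A minimal sketch of that rule is shown below. It works on plain strings such as "en-US" or "EN_US", which is an assumed representation, and `to_service_code` is a hypothetical helper rather than a Pipecat function.

```python
def to_service_code(language_code):
    """Reduce a code such as "en-US" or "PT_BR" to its base language code."""
    # Accept both "-" and "_" separators, then keep only the first part.
    return language_code.replace("_", "-").split("-")[0].lower()


codes = [to_service_code(c) for c in ("en-US", "EN_GB", "fr-CA", "pt-BR", "de")]
```

This sketch does not check that the resulting base code is one of the languages listed above.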

Usage Example

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gladia import GladiaSTTService
from pipecat.transcriptions.language import Language

# Configure the service
stt_service = GladiaSTTService(
    api_key="your-api-key",
    confidence=0.7,
    params=GladiaSTTService.InputParams(
        language=Language.EN,
        audio_enhancer=True,
        sample_rate=16000,
    ),
)

# Use in a pipeline. `transport` and `llm_processor` are assumed to be
# created elsewhere in your application.
pipeline = Pipeline([
    transport.input(),    # Produces InputAudioRawFrame
    stt_service,          # Processes audio → produces transcription frames
    llm_processor,        # Consumes TranscriptionFrame
])

Note: Gladia uses simplified language codes without regional variants, so variants such as Language.EN_US and Language.EN_GB are sent as the base code (en).

Metrics Support

The service collects processing metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • Connection status
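
For readers unfamiliar with TTFB in this context, it is the delay between starting a request and receiving the first result. The sketch below measures it with a monotonic clock. `TTFBTimer` is a hypothetical illustration and does not reflect how the service records its metrics internally.

```python
import time


class TTFBTimer:
    """Measure the delay between a start event and the first result."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested
        self._start = None
        self.ttfb = None

    def start(self):
        """Mark the moment audio starts being sent."""
        self._start = self._clock()
        self.ttfb = None

    def first_result(self):
        """Record TTFB on the first call after start(); later calls are no-ops."""
        if self._start is not None and self.ttfb is None:
            self.ttfb = self._clock() - self._start
        return self.ttfb
```

A monotonic clock is used because it is not affected by system clock adjustments during a session.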

Notes

  • Audio input must be in PCM format
  • Transcription frames are only generated when the confidence threshold is met
  • Language detection is optional
  • The service automatically handles WebSocket connections and cleanup
  • Real-time processing occurs in parallel for natural conversation flow