Overview

GladiaSTTService is a speech-to-text (STT) service that integrates with Gladia’s API to provide real-time transcription. It consumes raw audio input and emits transcription frames as speech is recognized.

Installation

To use GladiaSTTService, you need to install the Gladia dependencies:

pip install pipecat-ai[gladia]

You’ll also need to set your Gladia API key in the GLADIA_API_KEY environment variable.
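As a minimal sketch, the key can be read from the environment at startup and passed to the service as api_key, so it never appears in source code. The helper name load_gladia_api_key is hypothetical and is not part of Pipecat or Gladia.

```python
import os


def load_gladia_api_key(env=os.environ):
    """Return the Gladia API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of Pipecat or Gladia.
    """
    api_key = env.get("GLADIA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GLADIA_API_KEY is not set")
    return api_key
```

Failing early with an explicit message tends to be easier to debug than an authentication error surfacing later from the connection.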

Configuration

Service Parameters

api_key
string
required

Your Gladia API key for authentication

url
string
default: "https://api.gladia.io/v2/live"

Gladia API endpoint URL

confidence
float
default: 0.5

Minimum confidence threshold for transcriptions. Values range from 0 to 1.
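The effect of the threshold can be pictured with a small sketch. The function meets_confidence is hypothetical, and the inclusive comparison (>=) is an assumption; the service's exact comparison is not stated here.

```python
def meets_confidence(confidence: float, threshold: float = 0.5) -> bool:
    """Return True when a transcription's confidence reaches the threshold.

    Hypothetical illustration; assumes an inclusive comparison.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return confidence >= threshold
```

With the default of 0.5, a result scored 0.42 would be dropped, while raising the threshold to 0.7 would also drop a result scored 0.65.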

Audio Processing Parameters

sample_rate
integer
default: 16000

Audio sample rate in Hz

language
Language
default: Language.EN

Primary language for transcription

endpointing
float
default: 0.2

Silence duration (in seconds) to mark end of speech

maximum_duration_without_endpointing
integer
default: 10

Maximum duration (in seconds) to wait for the end of speech before the utterance is treated as finished

audio_enhancer
boolean

Enable audio enhancement preprocessing

words_accurate_timestamps
boolean

Enable accurate word timestamps in transcription
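The interaction between endpointing and maximum_duration_without_endpointing can be sketched as two independent rules. This is only an illustration of the semantics described above; the actual detection happens on Gladia's side, and the function name should_finalize and its arguments are hypothetical.

```python
def should_finalize(
    silence_s: float,
    utterance_s: float,
    endpointing: float = 0.2,
    maximum_duration_without_endpointing: float = 10,
) -> bool:
    """Return True when an utterance would be closed under either rule.

    silence_s: length of the current trailing silence, in seconds.
    utterance_s: time since the utterance started, in seconds.
    Hypothetical illustration; not Gladia's implementation.
    """
    ended_by_silence = silence_s >= endpointing
    ended_by_limit = utterance_s >= maximum_duration_without_endpointing
    return ended_by_silence or ended_by_limit
```

A lower endpointing value makes the service respond sooner after a pause, at the cost of splitting sentences when the speaker hesitates.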

Input Requirements

The service processes InputAudioRawFrame instances with:

  • Raw PCM audio data
  • 16-bit depth
  • Sample rate matching the configured sample_rate (default 16 kHz)
  • Single channel (mono)

See Audio Frames for detailed frame structure.
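To make the format requirements concrete, the sketch below computes the byte size of a chunk and converts float samples to 16-bit mono PCM using only the standard library. Little-endian byte order is an assumption (it is the common convention for raw PCM), and the helper names are hypothetical.

```python
import struct

SAMPLE_RATE = 16000  # Hz, matching the default sample_rate
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono


def chunk_size_bytes(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Return the number of bytes in a PCM chunk of the given duration."""
    samples = sample_rate * duration_ms // 1000
    return samples * SAMPLE_WIDTH * CHANNELS


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

At the default settings, a 20 ms chunk holds 320 samples, or 640 bytes.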

Output

The service produces two types of frames during transcription:

InterimTranscriptionFrame

Generated during ongoing speech when the confidence threshold is met. Contains:

text
string

Preliminary transcribed text

user_id
string

ID of the speaking user

timestamp
string

ISO 8601 formatted timestamp

language
Language

Detected language (if enabled)

TranscriptionFrame

Generated for final transcriptions when the confidence threshold is met. Contains the same fields as InterimTranscriptionFrame, but the text is confirmed rather than preliminary.

See Text Frames for detailed frame structures.
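A downstream consumer typically treats the two frame types differently: interim text is overwritten as it changes, while final text is appended. The sketch below shows that pattern with stand-in dataclasses that only mirror the fields listed above. InterimStandIn, FinalStandIn, and TranscriptCollector are hypothetical names and are not Pipecat classes; in a real pipeline you would check against InterimTranscriptionFrame and TranscriptionFrame instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InterimStandIn:
    """Stand-in mirroring the InterimTranscriptionFrame fields (hypothetical)."""
    text: str
    user_id: str
    timestamp: str
    language: Optional[str] = None


@dataclass
class FinalStandIn:
    """Stand-in mirroring the TranscriptionFrame fields (hypothetical)."""
    text: str
    user_id: str
    timestamp: str
    language: Optional[str] = None


class TranscriptCollector:
    """Keep the latest partial text and the list of confirmed utterances."""

    def __init__(self) -> None:
        self.partial = ""
        self.final: List[str] = []

    def handle(self, frame) -> None:
        if isinstance(frame, FinalStandIn):
            # Confirmed text replaces whatever partial text was pending
            self.final.append(frame.text)
            self.partial = ""
        elif isinstance(frame, InterimStandIn):
            # Interim text is provisional, so overwrite rather than append
            self.partial = frame.text
```

Keeping partial and final text separate avoids showing duplicated words when an interim result is later confirmed.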

Example Usage

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.gladia import GladiaSTTService
from pipecat.transcriptions.language import Language

# Configure the service
stt_service = GladiaSTTService(
    api_key="your-api-key",
    confidence=0.7,
    params=GladiaSTTService.InputParams(
        language=Language.EN,
        audio_enhancer=True,
        sample_rate=16000,
    ),
)

# Use in a pipeline (transport and llm_processor are assumed to be created elsewhere)
pipeline = Pipeline([
    transport.input(),    # Produces InputAudioRawFrame
    stt_service,          # Processes audio → produces transcription frames
    llm_processor,        # Consumes TranscriptionFrame
])

Methods

See the STT base class methods for additional functionality.

Language Setting

# Switch the transcription language at runtime (must be called from async code)
await stt_service.set_language(Language.FR)

Language Support

Gladia STT supports the following languages:

Language Code     Description             Service Code
Language.BG       Bulgarian               bg
Language.CA       Catalan                 ca
Language.ZH       Chinese                 zh
Language.CS       Czech                   cs
Language.DA       Danish                  da
Language.NL       Dutch                   nl
Language.EN       English                 en
Language.EN_US    English (US)            en
Language.EN_AU    English (Australia)     en
Language.EN_GB    English (UK)            en
Language.EN_NZ    English (New Zealand)   en
Language.EN_IN    English (India)         en
Language.ET       Estonian                et
Language.FI       Finnish                 fi
Language.FR       French                  fr
Language.FR_CA    French (Canada)         fr
Language.DE       German                  de
Language.DE_CH    German (Switzerland)    de
Language.EL       Greek                   el
Language.HI       Hindi                   hi
Language.HU       Hungarian               hu
Language.ID       Indonesian              id
Language.IT       Italian                 it
Language.JA       Japanese                ja
Language.KO       Korean                  ko
Language.LV       Latvian                 lv
Language.LT       Lithuanian              lt
Language.MS       Malay                   ms
Language.NO       Norwegian               no
Language.PL       Polish                  pl
Language.PT       Portuguese              pt
Language.PT_BR    Portuguese (Brazil)     pt
Language.RO       Romanian                ro
Language.RU       Russian                 ru
Language.SK       Slovak                  sk
Language.ES       Spanish                 es
Language.SV       Swedish                 sv
Language.TH       Thai                    th
Language.TR       Turkish                 tr
Language.UK       Ukrainian               uk
Language.VI       Vietnamese              vi

Usage Example

# Configure service with specific language
stt_service = GladiaSTTService(
    api_key="your-api-key",
    params=GladiaSTTService.InputParams(
        language=Language.FR  # French
    )
)

Note: Gladia uses simplified language codes without regional variants, so regional values such as Language.EN_US and Language.EN_GB map to the same service code (en).
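The collapse of regional variants into a base code can be sketched as follows. This assumes language tags written like "en-US" or "pt_BR"; the function to_gladia_code is hypothetical, and the mapping Pipecat actually applies may be implemented differently.

```python
def to_gladia_code(language_tag: str) -> str:
    """Strip the regional suffix from a tag such as "en-US" or "pt_BR".

    Hypothetical illustration of the simplification, not Pipecat's code.
    """
    base = language_tag.replace("_", "-").split("-")[0]
    return base.lower()
```

Because the region is discarded, choosing Language.FR_CA over Language.FR does not change what is sent to Gladia.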

Service Control

The service accepts STTUpdateSettingsFrame for dynamic configuration updates. See Service Control Frames for details.

Notes

  • Audio input must be raw PCM (16-bit, mono)
  • Transcription frames are only generated when the confidence threshold is met
  • Language detection is optional
  • The service automatically handles WebSocket connections and cleanup
  • Real-time processing occurs in parallel for natural conversation flow