Overview

GoogleSTTService provides real-time speech recognition using Google Cloud’s Speech-to-Text V2 API with support for 125+ languages, multiple models, voice activity detection, and advanced features like automatic punctuation and word-level confidence scores.

Installation

To use Google Cloud Speech services, install the required dependency:
pip install "pipecat-ai[google]"
You’ll need Google Cloud credentials, supplied as environment variables, a JSON string, or a service account file.
To get credentials, create a service account in the Google Cloud Console with Speech-to-Text API access.
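The three credential sources can be checked in a fixed order before the service is constructed. The following is a minimal sketch, assuming a hypothetical helper named resolve_google_credentials (not part of Pipecat) and the GOOGLE_SERVICE_ACCOUNT_JSON variable name used later on this page. It returns keyword arguments for the credentials and credentials_path parameters, or an empty dict when the service should fall back to environment credentials.

```python
import json
import os
from pathlib import Path
from typing import Optional


def resolve_google_credentials(
    json_env: str = "GOOGLE_SERVICE_ACCOUNT_JSON",
    key_file: Optional[str] = None,
) -> dict:
    """Pick a credential source and return kwargs for GoogleSTTService.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    raw = os.getenv(json_env)
    if raw:
        json.loads(raw)  # Fail early if the string is not valid JSON
        return {"credentials": raw}
    if key_file and Path(key_file).is_file():
        return {"credentials_path": key_file}
    # Nothing explicit was found: rely on environment credentials
    return {}
```

The result can then be unpacked into the constructor, for example GoogleSTTService(**resolve_google_credentials(key_file="credentials.json")).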

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, configurable sample rate, mono)
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription
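Because the input is 16-bit mono PCM, the byte size of an audio chunk follows directly from the sample rate and the chunk duration. A small sketch of that arithmetic; the function name is illustrative, and the 16 kHz and 20 ms values are example inputs rather than Pipecat defaults.

```python
def pcm_chunk_bytes(sample_rate_hz: int, duration_ms: int, channels: int = 1) -> int:
    """Bytes needed to hold a chunk of 16-bit PCM audio."""
    bytes_per_sample = 2  # 16-bit samples
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample * channels


print(pcm_chunk_bytes(16000, 20))  # 640
```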

Output

  • InterimTranscriptionFrame - Real-time transcription updates
  • TranscriptionFrame - Final transcription results with confidence scores
  • ErrorFrame - Connection or processing errors

Models

Google Cloud offers specialized models for different use cases:
Model | Description | Best For
latest_long | Optimized for long-form speech | Conversations, meetings, podcasts
chirp_2 | LLM-powered ASR model | Streaming and multilingual
telephony | Optimized for phone call audio | Call centers, phone interviews
medical_dictation | Medical terminology optimized | Healthcare dictation
medical_conversation | Doctor-patient conversation optimized | Medical consultations
See Google’s model documentation for detailed performance comparisons.

Regional Support

Google Cloud Speech-to-Text V2 supports different regional endpoints:
Region | Description | Best For
global | Default global endpoint | General use, auto-routing
us-central1 | US Central region | North American users
europe-west1 | Europe West region | European users
asia-northeast1 | Asia Northeast region | Asian users
Configure the region to improve latency and meet data residency requirements:
stt = GoogleSTTService(
    location="us-central1",  # Regional endpoint
    credentials_path="credentials.json"
)
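For context, Google's Speech-to-Text V2 API addresses a region through a per-region hostname and a location segment in the recognizer resource name. The sketch below illustrates that mapping. The helper names are hypothetical, and how GoogleSTTService applies location internally is an assumption here rather than something this page states.

```python
def speech_endpoint(location: str) -> str:
    """Hostname for a Speech-to-Text V2 location."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def recognizer_path(project_id: str, location: str, recognizer: str = "_") -> str:
    """Resource name of a recognizer; "_" selects the default recognizer."""
    return f"projects/{project_id}/locations/{location}/recognizers/{recognizer}"
```

Keeping the hostname and the resource name in sync matters: a request sent to one region's endpoint must name a recognizer in that same location.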

Language Support

Google Cloud STT supports 125+ languages, including regional variants. Common languages and their Google language codes:
  • Language.EN_US - English (US) - en-US
  • Language.ES - Spanish - es-ES
  • Language.FR - French - fr-FR
  • Language.DE - German - de-DE
  • Language.ZH - Chinese (Simplified) - cmn-Hans-CN
  • Language.JA - Japanese - ja-JP
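The correspondence between language values and Google's language codes can be pictured as a lookup table. This is a minimal sketch that uses plain strings in place of the Language enum so that it runs on its own. The table and helper are illustrative, cover only the entries listed above, and are not Pipecat's actual mapping.

```python
# Illustrative subset only; keys stand in for Language enum values.
GOOGLE_LANGUAGE_CODES = {
    "en-US": "en-US",
    "es": "es-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "zh": "cmn-Hans-CN",
    "ja": "ja-JP",
}


def to_google_code(language: str) -> str:
    """Return the Google language code for a listed language value."""
    try:
        return GOOGLE_LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"No Google code listed for language {language!r}") from None
```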

Usage Example

Basic Configuration

Initialize the GoogleSTTService and use it in a pipeline:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.google.stt import GoogleSTTService
from pipecat.transcriptions.language import Language

# Using environment credentials
stt = GoogleSTTService(
    params=GoogleSTTService.InputParams(
        languages=Language.EN_US,
        model="latest_long",
        enable_automatic_punctuation=True,
        enable_interim_results=True
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Credentials Configuration

import os

# Using service account file
stt = GoogleSTTService(
    credentials_path="path/to/service-account.json",
    location="us-central1",
    params=GoogleSTTService.InputParams(languages=Language.EN_US)
)

# Using credentials JSON string
stt = GoogleSTTService(
    credentials=os.getenv("GOOGLE_SERVICE_ACCOUNT_JSON"),
    params=GoogleSTTService.InputParams(languages=Language.EN_US)
)

Multi-language Configuration

# Multiple languages (first is primary)
params = GoogleSTTService.InputParams(
    languages=[Language.EN_US, Language.ES_MX, Language.FR],
    model="latest_long",
    enable_automatic_punctuation=True
)

stt = GoogleSTTService(
    credentials_path="credentials.json",
    params=params
)

Dynamic Configuration Updates

Update settings at runtime by calling update_options on the GoogleSTTService instance:
await stt.update_options(
    languages=[Language.FR, Language.EN_US],
)

Advanced Features

Multi-language Support

  • Support for multiple languages simultaneously
  • First language in list is considered primary
  • Automatic language detection within configured set

Voice Activity Detection

  • Built-in VAD events from Google’s service
  • Integrates with Pipecat’s VAD framework
  • Configurable sensitivity and detection

Content Processing

  • Automatic Punctuation: Smart punctuation insertion
  • Profanity Filtering: Optional content filtering
  • Format Control: Handle spoken vs written formats

Metrics

The service reports the following metrics:
  • Time to First Byte (TTFB) - Latency from audio input to first transcription
  • Processing Duration - Total time spent processing audio
Learn how to enable Metrics in your Pipeline.
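To make the two metrics concrete, the sketch below measures them by hand with a monotonic clock. It illustrates what the numbers mean and is not Pipecat's metrics implementation. The class and method names are hypothetical, and treating the final transcription as the end of processing is a simplifying assumption.

```python
import time
from typing import Callable, Optional


class TranscriptionTimer:
    """Track TTFB and processing duration for one utterance (illustrative)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start: Optional[float] = None
        self.ttfb: Optional[float] = None
        self.processing: Optional[float] = None

    def audio_started(self) -> None:
        # Reset and mark the moment audio is first sent for recognition
        self._start = self._clock()
        self.ttfb = None
        self.processing = None

    def transcription_received(self, final: bool = False) -> None:
        if self._start is None:
            return  # No audio has been marked yet
        elapsed = self._clock() - self._start
        if self.ttfb is None:
            self.ttfb = elapsed  # First interim or final result
        if final:
            self.processing = elapsed  # Final result closes the utterance
```

Passing the clock in as a parameter keeps the timer easy to test with a fake time source.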