Overview

GoogleSTTService provides real-time speech-to-text capabilities using Google Cloud’s Speech-to-Text V2 API. It supports interim results, multiple languages, and voice activity detection (VAD).

Installation

To use GoogleSTTService, install the required dependencies:

pip install pipecat-ai[google]

You’ll need Google Cloud credentials, supplied either as a JSON string or as a path to a JSON file.

You can obtain Google Cloud credentials by creating a service account in the Google Cloud Console.
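
As a rough sketch of how the two credential options might be chosen at startup, the helper below returns keyword arguments for either a JSON string or a file path. The helper name credential_kwargs and the variable GOOGLE_CREDENTIALS_JSON are hypothetical; GOOGLE_APPLICATION_CREDENTIALS is the variable Google's client libraries conventionally read for a key file path.

```python
import json
import os
from typing import Dict, Mapping, Optional


def credential_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return constructor keyword arguments for one of the two credential options.

    Hypothetical helper: prefers an inline JSON string, then falls back to a file path.
    """
    env = os.environ if env is None else env

    raw_json = env.get("GOOGLE_CREDENTIALS_JSON")  # hypothetical variable name
    if raw_json:
        json.loads(raw_json)  # fail early if the string is not valid JSON
        return {"credentials": raw_json}

    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path:
        return {"credentials_path": path}

    raise ValueError("No Google Cloud credentials configured")
```

The result could then be unpacked into the constructor, for example GoogleSTTService(**credential_kwargs()).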

Configuration

Constructor Parameters

credentials
str

Google Cloud service account credentials as JSON string

credentials_path
str

Path to service account credentials JSON file

location
str
default:
"global"

Google Cloud location for the service

sample_rate
int

Audio sample rate in Hertz

params
InputParams

Configuration parameters for the service

InputParams

The InputParams class provides configuration options for the Google STT service.

languages
Language | List[Language]
default:
"Language.EN_US"

A single language or a list of recognition languages. Examples:

  • Language.EN_US
  • [Language.EN_US, Language.ES_US]

The first language in the list is treated as the primary language. Recognition accuracy may vary when multiple languages are configured.

When using multiple languages, list them in order of expected usage frequency for best recognition results.
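
The two accepted shapes and the ordering advice can be illustrated with a small sketch. The Language enum here is a stand-in with a few members so the snippet runs on its own, and both helper functions are hypothetical, not part of the service.

```python
from enum import Enum
from typing import Dict, List, Union


class Language(str, Enum):
    """Stand-in for pipecat's Language enum, limited to a few members."""
    EN_US = "en-US"
    ES_US = "es-US"
    FR_CA = "fr-CA"


def as_language_list(languages: Union[Language, List[Language]]) -> List[Language]:
    """Accept a single language or a list and always return a list (primary first)."""
    if isinstance(languages, Language):
        return [languages]
    return list(languages)


def order_by_expected_usage(expected_share: Dict[Language, float]) -> List[Language]:
    """Sort languages so the one expected most often comes first."""
    return sorted(expected_share, key=lambda lang: expected_share[lang], reverse=True)
```

With expected shares of 0.7 for English and 0.2 for Spanish, the English entry would end up first and therefore act as the primary language.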

model
str
default:
"latest_long"

Speech recognition model to use.

use_separate_recognition_per_channel
bool
default:
"False"

Process each audio channel separately for multi-channel audio.

enable_automatic_punctuation
bool
default:
"True"

Automatically add punctuation marks to transcriptions.

enable_spoken_punctuation
bool
default:
"False"

Include spoken punctuation (e.g., “period”, “comma”) in transcript.

enable_spoken_emojis
bool
default:
"False"

Include spoken emojis (e.g., “smiley face”) in transcript.

profanity_filter
bool
default:
"False"

Filter profanity from transcriptions.

enable_word_time_offsets
bool
default:
"False"

Include timing information for each word.

enable_word_confidence
bool
default:
"False"

Include confidence scores for each word.

enable_interim_results
bool
default:
"True"

Stream partial recognition results as they become available.

enable_voice_activity_events
bool
default:
"False"

Enable voice activity detection events.

  • Not all features are available for all models or languages
  • Some combinations of options may affect latency or accuracy
  • Model selection should match your use case for best results
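
Because option names are easy to mistype, one possible approach is to check overrides against the documented names before building InputParams. The sketch below is a hypothetical helper, not part of the service; its defaults are copied from the parameter reference above.

```python
from typing import Any, Dict

# Defaults copied from the parameter reference above (languages is omitted
# because its default is an enum member rather than a plain value).
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "model": "latest_long",
    "use_separate_recognition_per_channel": False,
    "enable_automatic_punctuation": True,
    "enable_spoken_punctuation": False,
    "enable_spoken_emojis": False,
    "profanity_filter": False,
    "enable_word_time_offsets": False,
    "enable_word_confidence": False,
    "enable_interim_results": True,
    "enable_voice_activity_events": False,
}


def input_param_overrides(**overrides: Any) -> Dict[str, Any]:
    """Return only the options that differ from the documented defaults.

    Raises KeyError on an unknown option name, which catches typos early.
    """
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return {k: v for k, v in overrides.items() if DOCUMENTED_DEFAULTS[k] != v}
```

The returned dict could be passed on as GoogleSTTService.InputParams(**input_param_overrides(model="telephony")).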

Input

The service processes raw audio data with:

  • Linear16 PCM encoding
  • 16-bit depth
  • Configurable sample rate
  • Single channel (mono)
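
To make the input format concrete, the following sketch computes the size of a mono 16-bit chunk and converts floating-point samples to Linear16 bytes. Both function names are hypothetical, and the snippet assumes little-endian sample order, which is what Google documents for LINEAR16.

```python
import struct
from typing import Sequence


def pcm16_bytes_per_chunk(sample_rate: int, chunk_ms: int) -> int:
    """Bytes in one mono 16-bit chunk: samples per chunk times 2 bytes per sample."""
    return sample_rate * chunk_ms // 1000 * 2


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    # Clip out-of-range values before scaling so they cannot overflow 16 bits.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

For example, a 20 ms chunk at 16000 Hz holds 320 samples, which is 640 bytes.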

Output Frames

The service produces two types of frames:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Recognition language

InterimTranscriptionFrame

Generated during ongoing speech. It contains the same fields as TranscriptionFrame, but the text is preliminary and may change before the final result arrives.
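
One way a consumer might combine the two frame types is to replace the pending text on every interim frame and commit text only on a final frame. The classes below are stand-ins carrying the documented fields so the sketch runs on its own; they are not pipecat's actual frame classes, and Transcript is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class _TranscriptFields:
    text: str
    user_id: str
    timestamp: str
    language: Optional[str] = None


@dataclass
class TranscriptionFrame(_TranscriptFields):
    """Stand-in for a final transcription frame."""


@dataclass
class InterimTranscriptionFrame(_TranscriptFields):
    """Stand-in for an interim transcription frame."""


@dataclass
class Transcript:
    finals: List[str] = field(default_factory=list)
    pending: str = ""  # most recent interim text, replaced on every update

    def on_frame(self, frame: _TranscriptFields) -> None:
        if isinstance(frame, InterimTranscriptionFrame):
            self.pending = frame.text  # overwrite, interim results are not additive
        elif isinstance(frame, TranscriptionFrame):
            self.finals.append(frame.text)
            self.pending = ""  # the final result supersedes the interim text

    def display(self) -> str:
        return " ".join(self.finals + ([self.pending] if self.pending else []))
```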

Methods

set_languages
method

Updates the service’s recognition languages.

async def set_languages(language: List[Language]) -> None

Example:

await service.set_languages([Language.FR_FR])

set_model
method

Updates the service’s recognition model.

async def set_model(model: str) -> None

Example:

await service.set_model("medical_dictation")

update_options
method

Updates multiple service options dynamically.

async def update_options(
    *,
    languages: Optional[List[Language]] = None,
    model: Optional[str] = None,
    enable_automatic_punctuation: Optional[bool] = None,
    enable_spoken_punctuation: Optional[bool] = None,
    enable_spoken_emojis: Optional[bool] = None,
    profanity_filter: Optional[bool] = None,
    enable_word_time_offsets: Optional[bool] = None,
    enable_word_confidence: Optional[bool] = None,
    enable_interim_results: Optional[bool] = None,
    enable_voice_activity_events: Optional[bool] = None,
    location: Optional[str] = None,
) -> None

Example:

await service.update_options(
    languages=[Language.ES_ES, Language.EN_US],
    enable_interim_results=True,
    profanity_filter=True
)
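
Since every keyword defaults to None, a plausible reading is that only the options actually passed are changed. The sketch below illustrates that merge behaviour on a plain dict; it is an assumption about the semantics, not the service's implementation, and apply_updates is a hypothetical name.

```python
from typing import Any, Dict


def apply_updates(current: Dict[str, Any], **updates: Any) -> Dict[str, Any]:
    """Return a new options dict where only non-None updates replace current values."""
    merged = dict(current)  # copy so the caller's dict is left untouched
    merged.update({k: v for k, v in updates.items() if v is not None})
    return merged
```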

See the STT base class methods for additional functionality.

Usage Example

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.google import GoogleSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = GoogleSTTService(
    credentials_path="path/to/credentials.json",
    params=GoogleSTTService.InputParams(
        languages=Language.EN_US,
        model="latest_long",
        enable_automatic_punctuation=True,
        enable_interim_results=True,
    ),
)

# Use in pipeline (transport, context_aggregator, and llm are created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    ...
])

Regional Support

Google Cloud Speech-to-Text V2 supports different regional endpoints for improved latency and data residency requirements.

Available Regions

See supported languages, models, and features for each region in Google’s Speech-to-Text documentation.

Configuration

Specify the region during service initialization:

stt_service = GoogleSTTService(
    credentials=credentials,
    location="us-central1",  # Use us-central1 endpoint
    params=GoogleSTTService.InputParams(
        model="chirp_2"
    )
)

Dynamic Region Updates

The region can be updated during runtime:

await stt_service.update_options(
    location="asia"
)

Notes

  • The global endpoint is used by default
  • Regional endpoints may provide lower latency for users in those regions
  • Some features or models might only be available in specific regions
  • Regional selection may affect pricing
  • Data residency requirements may dictate region selection
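
For orientation, Google's Speech-to-Text V2 API addresses a location through a location-prefixed host name and a recognizer resource path. The sketch below builds both strings; the function names are hypothetical, and whether GoogleSTTService constructs them exactly this way internally is an assumption.

```python
def speech_endpoint(location: str = "global") -> str:
    """Host name for a Speech-to-Text V2 location; "global" uses the unprefixed host."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def recognizer_path(project_id: str, location: str = "global", recognizer: str = "_") -> str:
    """Resource name of a recognizer; "_" selects the implicit default recognizer."""
    return f"projects/{project_id}/locations/{location}/recognizers/{recognizer}"
```

Here speech_endpoint("us-central1") yields "us-central1-speech.googleapis.com", matching the region used in the configuration example above.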

Models

| Model Name | Description | Best For |
| --- | --- | --- |
| chirp_2 | Google’s latest ASR model | General use cases |
| latest_long | Latest model optimized for long-form speech | Conversations, meetings |
| latest_short | Latest model optimized for short-form speech | Short messages, notes |
| telephony | Optimized for phone calls | Call centers |
| medical_dictation | Optimized for medical terminology | Healthcare dictation |
| medical_conversation | Optimized for doctor-patient interactions | Medical consultations |

See Google Cloud’s Speech-to-Text documentation for more details.
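
If the model is chosen from application configuration, the table above could be encoded as a simple lookup. The use-case labels and the pick_model helper are hypothetical; only the model names come from the table.

```python
from typing import Dict

# Keys are hypothetical use-case labels; values are model names from the table above.
MODEL_BY_USE_CASE: Dict[str, str] = {
    "general": "chirp_2",
    "meeting": "latest_long",
    "short_message": "latest_short",
    "call_center": "telephony",
    "medical_dictation": "medical_dictation",
    "medical_consultation": "medical_conversation",
}


def pick_model(use_case: str, default: str = "latest_long") -> str:
    """Return the model suggested for a use case, or the documented default model."""
    return MODEL_BY_USE_CASE.get(use_case, default)
```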

Language Support

| Language Code | Description | Service Codes |
| --- | --- | --- |
| Language.AF | Afrikaans | af-ZA |
| Language.SQ | Albanian | sq-AL |
| Language.AM | Amharic | am-ET |
| Language.AR | Arabic (Default: Egypt) | ar-EG |
| Language.AR_AE | Arabic (UAE) | ar-AE |
| Language.AR_BH | Arabic (Bahrain) | ar-BH |
| Language.AR_DZ | Arabic (Algeria) | ar-DZ |
| Language.AR_EG | Arabic (Egypt) | ar-EG |
| Language.AR_IQ | Arabic (Iraq) | ar-IQ |
| Language.AR_JO | Arabic (Jordan) | ar-JO |
| Language.AR_KW | Arabic (Kuwait) | ar-KW |
| Language.AR_LB | Arabic (Lebanon) | ar-LB |
| Language.AR_MA | Arabic (Morocco) | ar-MA |
| Language.AR_OM | Arabic (Oman) | ar-OM |
| Language.AR_QA | Arabic (Qatar) | ar-QA |
| Language.AR_SA | Arabic (Saudi Arabia) | ar-SA |
| Language.AR_SY | Arabic (Syria) | ar-SY |
| Language.AR_TN | Arabic (Tunisia) | ar-TN |
| Language.AR_YE | Arabic (Yemen) | ar-YE |
| Language.HY | Armenian | hy-AM |
| Language.AZ | Azerbaijani | az-AZ |
| Language.EU | Basque | eu-ES |
| Language.BN | Bengali (Default: India) | bn-IN |
| Language.BN_BD | Bengali (Bangladesh) | bn-BD |
| Language.BN_IN | Bengali (India) | bn-IN |
| Language.BS | Bosnian | bs-BA |
| Language.BG | Bulgarian | bg-BG |
| Language.MY | Burmese | my-MM |
| Language.CA | Catalan | ca-ES |
| Language.ZH | Chinese (Default: Simplified) | cmn-Hans-CN |
| Language.ZH_CN | Chinese (Simplified) | cmn-Hans-CN |
| Language.ZH_HK | Chinese (Hong Kong) | cmn-Hans-HK |
| Language.ZH_TW | Chinese (Traditional) | cmn-Hant-TW |
| Language.YUE | Chinese (Cantonese) | yue-Hant-HK |
| Language.HR | Croatian | hr-HR |
| Language.CS | Czech | cs-CZ |
| Language.DA | Danish | da-DK |
| Language.NL | Dutch (Default: Netherlands) | nl-NL |
| Language.NL_BE | Dutch (Belgium) | nl-BE |
| Language.NL_NL | Dutch (Netherlands) | nl-NL |
| Language.EN | English (Default: US) | en-US |
| Language.EN_AU | English (Australia) | en-AU |
| Language.EN_CA | English (Canada) | en-CA |
| Language.EN_GB | English (UK) | en-GB |
| Language.EN_GH | English (Ghana) | en-GH |
| Language.EN_HK | English (Hong Kong) | en-HK |
| Language.EN_IN | English (India) | en-IN |
| Language.EN_IE | English (Ireland) | en-IE |
| Language.EN_KE | English (Kenya) | en-KE |
| Language.EN_NG | English (Nigeria) | en-NG |
| Language.EN_NZ | English (New Zealand) | en-NZ |
| Language.EN_PH | English (Philippines) | en-PH |
| Language.EN_SG | English (Singapore) | en-SG |
| Language.EN_TZ | English (Tanzania) | en-TZ |
| Language.EN_US | English (US) | en-US |
| Language.EN_ZA | English (South Africa) | en-ZA |
| Language.ET | Estonian | et-EE |
| Language.FIL | Filipino | fil-PH |
| Language.FI | Finnish | fi-FI |
| Language.FR | French (Default: France) | fr-FR |
| Language.FR_BE | French (Belgium) | fr-BE |
| Language.FR_CA | French (Canada) | fr-CA |
| Language.FR_CH | French (Switzerland) | fr-CH |
| Language.GL | Galician | gl-ES |
| Language.KA | Georgian | ka-GE |
| Language.DE | German (Default: Germany) | de-DE |
| Language.DE_AT | German (Austria) | de-AT |
| Language.DE_CH | German (Switzerland) | de-CH |
| Language.EL | Greek | el-GR |
| Language.GU | Gujarati | gu-IN |
| Language.HE | Hebrew | iw-IL |
| Language.HI | Hindi | hi-IN |
| Language.HU | Hungarian | hu-HU |
| Language.IS | Icelandic | is-IS |
| Language.ID | Indonesian | id-ID |
| Language.IT | Italian | it-IT |
| Language.IT_CH | Italian (Switzerland) | it-CH |
| Language.JA | Japanese | ja-JP |
| Language.JV | Javanese | jv-ID |
| Language.KN | Kannada | kn-IN |
| Language.KK | Kazakh | kk-KZ |
| Language.KM | Khmer | km-KH |
| Language.KO | Korean | ko-KR |
| Language.LO | Lao | lo-LA |
| Language.LV | Latvian | lv-LV |
| Language.LT | Lithuanian | lt-LT |
| Language.MK | Macedonian | mk-MK |
| Language.MS | Malay | ms-MY |
| Language.ML | Malayalam | ml-IN |
| Language.MR | Marathi | mr-IN |
| Language.MN | Mongolian | mn-MN |
| Language.NE | Nepali | ne-NP |
| Language.NO | Norwegian | no-NO |
| Language.FA | Persian | fa-IR |
| Language.PL | Polish | pl-PL |
| Language.PT | Portuguese (Default: Portugal) | pt-PT |
| Language.PT_BR | Portuguese (Brazil) | pt-BR |
| Language.PT_PT | Portuguese (Portugal) | pt-PT |
| Language.PA | Punjabi | pa-Guru-IN |
| Language.RO | Romanian | ro-RO |
| Language.RU | Russian | ru-RU |
| Language.SR | Serbian | sr-RS |
| Language.SI | Sinhala | si-LK |
| Language.SK | Slovak | sk-SK |
| Language.SL | Slovenian | sl-SI |
| Language.ES | Spanish (Default: Spain) | es-ES |
| Language.ES_AR | Spanish (Argentina) | es-AR |
| Language.ES_BO | Spanish (Bolivia) | es-BO |
| Language.ES_CL | Spanish (Chile) | es-CL |
| Language.ES_CO | Spanish (Colombia) | es-CO |
| Language.ES_CR | Spanish (Costa Rica) | es-CR |
| Language.ES_DO | Spanish (Dominican Republic) | es-DO |
| Language.ES_EC | Spanish (Ecuador) | es-EC |
| Language.ES_GT | Spanish (Guatemala) | es-GT |
| Language.ES_HN | Spanish (Honduras) | es-HN |
| Language.ES_MX | Spanish (Mexico) | es-MX |
| Language.ES_NI | Spanish (Nicaragua) | es-NI |
| Language.ES_PA | Spanish (Panama) | es-PA |
| Language.ES_PE | Spanish (Peru) | es-PE |
| Language.ES_PR | Spanish (Puerto Rico) | es-PR |
| Language.ES_PY | Spanish (Paraguay) | es-PY |
| Language.ES_SV | Spanish (El Salvador) | es-SV |
| Language.ES_US | Spanish (US) | es-US |
| Language.ES_UY | Spanish (Uruguay) | es-UY |
| Language.ES_VE | Spanish (Venezuela) | es-VE |
| Language.SU | Sundanese | su-ID |
| Language.SW | Swahili (Default: Tanzania) | sw-TZ |
| Language.SW_KE | Swahili (Kenya) | sw-KE |
| Language.SW_TZ | Swahili (Tanzania) | sw-TZ |
| Language.SV | Swedish | sv-SE |
| Language.TA | Tamil (Default: India) | ta-IN |
| Language.TA_IN | Tamil (India) | ta-IN |
| Language.TA_MY | Tamil (Malaysia) | ta-MY |
| Language.TA_SG | Tamil (Singapore) | ta-SG |
| Language.TA_LK | Tamil (Sri Lanka) | ta-LK |
| Language.TE | Telugu | te-IN |
| Language.TH | Thai | th-TH |
| Language.TR | Turkish | tr-TR |
| Language.UK | Ukrainian | uk-UA |
| Language.UR | Urdu (Default: India) | ur-IN |
| Language.UR_IN | Urdu (India) | ur-IN |
| Language.UR_PK | Urdu (Pakistan) | ur-PK |
| Language.UZ | Uzbek | uz-UZ |
| Language.VI | Vietnamese | vi-VN |
| Language.XH | Xhosa | xh-ZA |
| Language.ZU | Zulu | zu-ZA |
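
The service codes cannot always be derived from the enum name (Hebrew maps to iw-IL, Chinese to cmn-Hans-CN), so a lookup table is the safer approach. The sketch below uses a short excerpt of the table with plain string keys as a stand-in for the Language enum; to_service_codes is a hypothetical helper.

```python
from typing import Dict, List

# Excerpt of the table above, keyed by enum member name (stand-in for Language).
SERVICE_CODES: Dict[str, str] = {
    "EN": "en-US",
    "EN_GB": "en-GB",
    "ES": "es-ES",
    "ES_MX": "es-MX",
    "HE": "iw-IL",
    "ZH": "cmn-Hans-CN",
    "PA": "pa-Guru-IN",
}


def to_service_codes(names: List[str]) -> List[str]:
    """Translate enum member names to service codes, preserving order."""
    missing = [n for n in names if n not in SERVICE_CODES]
    if missing:
        raise KeyError(f"No service code known for: {missing}")
    return [SERVICE_CODES[n] for n in names]
```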

Special Features

  • Supports multiple languages simultaneously
  • Provides regional variants for many languages
  • Handles different Chinese scripts (simplified/traditional)
  • Supports medical-specific models

Frame Flow

Notes

  • Requires Google Cloud credentials
  • Supports real-time transcription
  • Handles streaming connection management
  • Provides dynamic configuration updates
  • Supports model switching
  • Includes VAD capabilities
  • Manages connection lifecycle