The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google's Gemini model, with built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. Key features include:

Real-time Interaction

Stream audio and video in real-time with low latency response times

Speech Processing

Built-in speech-to-text and text-to-speech capabilities with multiple voice options

Voice Activity Detection

Automatic detection of speech start/stop for natural conversations

Context Management

Intelligent handling of conversation history and system instructions

Want to start building? Check out our Gemini Multimodal Live Guide.

Installation

To use GeminiMultimodalLiveLLMService, install the required dependencies:

pip install "pipecat-ai[google]"

You’ll need to set up your Google API key as an environment variable: GOOGLE_API_KEY.
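For example, in a POSIX shell you might export the key before starting your bot (the placeholder value is illustrative):

```shell
# Replace the placeholder with your real API key.
export GOOGLE_API_KEY="your-google-api-key"

# Confirm the variable is set before starting your bot.
echo "GOOGLE_API_KEY is ${GOOGLE_API_KEY:+set}"
```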

Basic Usage

Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:

import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
    GeminiMultimodalModalities,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Aoede",                    # Voices: Aoede, Charon, Fenrir, Kore, Puck
    transcribe_user_audio=True,          # Enable speech-to-text for user input
    params=InputParams(
        temperature=0.7,                 # Set model input params
        language=Language.EN_US,         # Set language (30+ languages supported)
        modalities=GeminiMultimodalModalities.AUDIO  # Response modality
    )
)

Configuration

Constructor Parameters

api_key
str
required

Your Google API key

base_url
str

API endpoint URL

model
str
default:"models/gemini-2.0-flash-live-001"

Gemini model to use

voice_id
str
default:"Charon"

Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Choose your preferred voice
)

transcribe_user_audio
bool
default:"False"

Enable transcription of user audio

system_instruction
str

High-level instructions that guide the model’s behavior

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="Talk like a pirate.",
)

Input Parameters

frequency_penalty
float
default:"None"

Penalizes repeated tokens. Range: 0.0 to 2.0

max_tokens
int
default:"4096"

Maximum number of tokens to generate

modalities
GeminiMultimodalModalities
default:"AUDIO"

Response modalities to include (options: AUDIO, TEXT).

presence_penalty
float
default:"None"

Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0

temperature
float
default:"None"

Controls randomness in responses. Range: 0.0 to 2.0

language
Language
default:"Language.EN_US"

Language for generation. Over 30 languages are supported.

media_resolution
GeminiMediaResolution
default:"UNSPECIFIED"

Controls image processing quality and token usage:

  • LOW: Uses 64 tokens
  • MEDIUM: Uses 256 tokens
  • HIGH: Zoomed reframing with 256 tokens

vad
GeminiVADParams

Voice Activity Detection configuration:

  • disabled: Toggle VAD on/off
  • start_sensitivity: How quickly speech is detected (HIGH/LOW)
  • end_sensitivity: How quickly turns end after pauses (HIGH/LOW)
  • prefix_padding_ms: Milliseconds of audio to keep before speech
  • silence_duration_ms: Milliseconds of silence to end a turn

import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
    GeminiVADParams,
    GeminiMediaResolution,
    StartSensitivity,
    EndSensitivity,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        temperature=0.7,
        language=Language.ES,  # Spanish language
        media_resolution=GeminiMediaResolution.HIGH,  # Higher quality image processing
        vad=GeminiVADParams(
            start_sensitivity=StartSensitivity.HIGH,  # Detect speech quickly
            end_sensitivity=EndSensitivity.LOW,      # Allow longer pauses
            prefix_padding_ms=300,                   # Keep 300ms before speech
            silence_duration_ms=1000,                # End turn after 1s silence
        )
    )
)

top_k
int
default:"None"

Limits vocabulary to k most likely tokens. Minimum: 0

top_p
float
default:"None"

Cumulative probability cutoff for token selection. Range: 0.0 to 1.0

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        top_p=0.9,     # More focused token selection
        top_k=40       # Limit vocabulary options
    )
)

Frame Types

Input Frames

InputAudioRawFrame
Frame

Raw audio data for speech input

StartInterruptionFrame
Frame

Signals start of user interruption

UserStartedSpeakingFrame
Frame

Signals user started speaking

UserStoppedSpeakingFrame
Frame

Signals user stopped speaking

OpenAILLMContextFrame
Frame

Contains conversation context

Output Frames

TTSAudioRawFrame
Frame

Generated speech audio

TTSStartedFrame
Frame

Signals start of speech synthesis

TTSStoppedFrame
Frame

Signals end of speech synthesis

TextFrame
Frame

Generated text responses

TranscriptionFrame
Frame

Speech transcriptions

Function Calling

This service supports function calling (also known as tool calling), which allows the LLM to request information from external services and APIs. For example, you can enable your bot to:

  • Check current weather conditions
  • Query databases
  • Access external APIs
  • Perform custom actions
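The sketch below shows the general shape of a function-call handler using only the standard library. The tool schema, handler name, and callback wiring are illustrative assumptions rather than Pipecat's actual registration API (which is described in the Function Calling guide); it demonstrates the pattern of an async handler that reports its result through a callback:

```python
import asyncio
import json

# Hypothetical tool schema in the JSON-schema style Gemini function
# declarations use; the name and fields are illustrative.
weather_tool = {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
        },
        "required": ["location"],
    },
}

# Collects the handler's result; in a real bot the framework would
# feed this back to the model as the function-call response.
captured = {}

async def result_callback(result):
    captured.update(result)

async def fetch_weather(args, result_callback):
    # A real handler would call an external weather API; this one
    # returns canned data so the control flow is easy to follow.
    await result_callback(
        {"location": args["location"], "conditions": "sunny", "temperature_c": 22}
    )

async def main():
    # Simulate the service dispatching a function call from the model.
    await fetch_weather({"location": "Lisbon"}, result_callback)

asyncio.run(main())
print(json.dumps(captured))  # → {"location": "Lisbon", "conditions": "sunny", "temperature_c": 22}
```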

See the Function Calling guide for:

  • Detailed implementation instructions
  • Provider-specific function definitions
  • Handler registration examples
  • Control over function call behavior
  • Complete usage examples

Language Support

Gemini Multimodal Live supports the following languages:

Language Code        Description             Gemini Code
Language.AR          Arabic                  ar-XA
Language.BN_IN       Bengali (India)         bn-IN
Language.CMN_CN      Chinese (Mandarin)      cmn-CN
Language.DE_DE       German (Germany)        de-DE
Language.EN_US       English (US)            en-US
Language.EN_AU       English (Australia)     en-AU
Language.EN_GB       English (UK)            en-GB
Language.EN_IN       English (India)         en-IN
Language.ES_ES       Spanish (Spain)         es-ES
Language.ES_US       Spanish (US)            es-US
Language.FR_FR       French (France)         fr-FR
Language.FR_CA       French (Canada)         fr-CA
Language.GU_IN       Gujarati (India)        gu-IN
Language.HI_IN       Hindi (India)           hi-IN
Language.ID_ID       Indonesian              id-ID
Language.IT_IT       Italian (Italy)         it-IT
Language.JA_JP       Japanese (Japan)        ja-JP
Language.KN_IN       Kannada (India)         kn-IN
Language.KO_KR       Korean (Korea)          ko-KR
Language.ML_IN       Malayalam (India)       ml-IN
Language.MR_IN       Marathi (India)         mr-IN
Language.NL_NL       Dutch (Netherlands)     nl-NL
Language.PL_PL       Polish (Poland)         pl-PL
Language.PT_BR       Portuguese (Brazil)     pt-BR
Language.RU_RU       Russian (Russia)        ru-RU
Language.TA_IN       Tamil (India)           ta-IN
Language.TE_IN       Telugu (India)          te-IN
Language.TH_TH       Thai (Thailand)         th-TH
Language.TR_TR       Turkish (Turkey)        tr-TR
Language.VI_VN       Vietnamese (Vietnam)    vi-VN

You can set the language using the language parameter:

import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
)

# Set language during initialization
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(language=Language.ES_ES)  # Spanish (Spain)
)

Next Steps

Examples

  • Foundational Example Basic implementation showing core features and transcription

  • Simple Chatbot A client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot.

Learn More

Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.