The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google's Gemini model, with built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. It provides:

  • Real-time Interaction: Stream audio and video in real time with low-latency responses
  • Speech Processing: Built-in speech-to-text and text-to-speech with multiple voice options
  • Voice Activity Detection: Automatic detection of speech start and stop for natural conversations
  • Context Management: Intelligent handling of conversation history and system instructions

Want to start building? Check out our Gemini Multimodal Live Guide.

Installation

To use GeminiMultimodalLiveLLMService, install the required dependencies:
pip install "pipecat-ai[google]"
You’ll need to set up your Google API key as an environment variable: GOOGLE_API_KEY.
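For example, in your shell:
export GOOGLE_API_KEY=your_api_key_here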

Basic Usage

Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:
import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
    GeminiMultimodalModalities,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Aoede",                                # Voices: Aoede, Charon, Fenrir, Kore, Puck
    params=InputParams(
        temperature=0.7,                             # Set model input params
        language=Language.EN_US,                     # Set language (30+ languages supported)
        modalities=GeminiMultimodalModalities.AUDIO  # Response modality
    )
)
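
In a full bot, the service sits in a pipeline between a transport and a pair of context aggregators. Below is a minimal sketch of that wiring, assuming a transport object created elsewhere (for example, a Daily or WebRTC transport) and using the create_context_aggregator method described under Methods:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Seed the conversation; the service responds when the context is set
# (see inference_on_context_initialization below)
context = OpenAILLMContext(
    [{"role": "user", "content": "Greet the user and ask how you can help."}]
)
context_aggregator = llm.create_context_aggregator(context)

pipeline = Pipeline([
    transport.input(),               # Audio/video from the user (transport assumed)
    context_aggregator.user(),       # Add user turns to the context
    llm,                             # Gemini Multimodal Live service from above
    transport.output(),              # Synthesized audio back to the user
    context_aggregator.assistant(),  # Add assistant turns to the context
])

task = PipelineTask(pipeline)
await PipelineRunner().run(task)     # Run inside your async entry point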

Configuration

Constructor Parameters

api_key
str
required
Your Google API key
base_url
str
API endpoint URL
model
str
Gemini model to use (defaults to the current v1beta Multimodal Live model)
voice_id
str
default:"Charon"
Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Choose your preferred voice
)
system_instruction
str
High-level instructions that guide the model’s behavior
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="Talk like a pirate.",
)
start_audio_paused
bool
default:"False"
Whether to start with audio input paused
start_video_paused
bool
default:"False"
Whether to start with video input paused
tools
Union[List[dict], ToolsSchema]
Tools/functions available to the model
inference_on_context_initialization
bool
default:"True"
Whether to generate a response when context is first set
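For example, to start with the microphone muted and defer the model's first response until your app is ready (a minimal sketch using only the parameters and methods documented on this page):
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful assistant.",
    start_audio_paused=True,                    # Don't process mic audio yet
    inference_on_context_initialization=False,  # Don't respond when context is set
)

# Unpause later, once your app is ready to listen
llm.set_audio_input_paused(False)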

Input Parameters

frequency_penalty
float
default:"None"
Penalizes repeated tokens. Range: 0.0 to 2.0
max_tokens
int
default:"4096"
Maximum number of tokens to generate
modalities
GeminiMultimodalModalities
default:"AUDIO"
Response modalities to include (options: AUDIO, TEXT).
presence_penalty
float
default:"None"
Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0
temperature
float
default:"None"
Controls randomness in responses. Range: 0.0 to 2.0
language
Language
default:"Language.EN_US"
Language for generation. Over 30 languages are supported.
media_resolution
GeminiMediaResolution
default:"UNSPECIFIED"
Controls image processing quality and token usage:
  • LOW: Uses 64 tokens
  • MEDIUM: Uses 256 tokens
  • HIGH: Zoomed reframing with 256 tokens
vad
GeminiVADParams
Voice Activity Detection configuration:
  • disabled: Toggle VAD on/off
  • start_sensitivity: How quickly speech is detected (HIGH/LOW)
  • end_sensitivity: How quickly turns end after pauses (HIGH/LOW)
  • prefix_padding_ms: Milliseconds of audio to keep before speech
  • silence_duration_ms: Milliseconds of silence to end a turn
import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.events import (
    StartSensitivity,
    EndSensitivity,
)
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiVADParams,
    GeminiMediaResolution,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        temperature=0.7,
        language=Language.ES,                         # Spanish language
        media_resolution=GeminiMediaResolution.HIGH,  # Higher quality image processing
        vad=GeminiVADParams(
            start_sensitivity=StartSensitivity.HIGH,  # Detect speech quickly
            end_sensitivity=EndSensitivity.LOW,       # Allow longer pauses
            prefix_padding_ms=300,                    # Keep 300ms before speech
            silence_duration_ms=1000,                 # End turn after 1s silence
        )
    )
)
top_k
int
default:"None"
Limits vocabulary to k most likely tokens. Minimum: 0
top_p
float
default:"None"
Cumulative probability cutoff for token selection. Range: 0.0 to 1.0
context_window_compression
ContextWindowCompressionParams
Parameters for managing the context window:
  • enabled: Enable/disable compression (default: False)
  • trigger_tokens: Number of tokens that trigger compression (default: None, which uses 80% of the context window)
import os

from pipecat.services.gemini_multimodal_live.gemini import (
    ContextWindowCompressionParams,
    GeminiMultimodalLiveLLMService,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        top_p=0.9,               # More focused token selection
        top_k=40,                # Limit vocabulary options
        context_window_compression=ContextWindowCompressionParams(
            enabled=True,
            trigger_tokens=8000  # Compress when reaching 8000 tokens
        )
    )
)

Methods

set_audio_input_paused(paused: bool)
method
Pause or unpause audio input processing
set_video_input_paused(paused: bool)
method
Pause or unpause video input processing
set_model_modalities(modalities: GeminiMultimodalModalities)
method
Change the response modality (TEXT or AUDIO)
set_language(language: Language)
method
Change the language for generation
set_context(context: OpenAILLMContext)
method
Set the conversation context explicitly
create_context_aggregator(context: OpenAILLMContext, user_params: LLMUserAggregatorParams, assistant_params: LLMAssistantAggregatorParams)
method
Create context aggregators for managing conversation state
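For example, to adjust the service at runtime (a minimal sketch calling the methods above on the llm instance created earlier; where you call them depends on your application logic):
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalModalities
from pipecat.transcriptions.language import Language

llm.set_model_modalities(GeminiMultimodalModalities.TEXT)  # Respond with text instead of audio
llm.set_language(Language.FR_FR)                           # Generate in French
llm.set_video_input_paused(True)                           # Stop processing camera frames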

Frame Types

Input Frames

InputAudioRawFrame
Frame
Raw audio data for speech input
InputImageRawFrame
Frame
Raw image data for visual input
StartInterruptionFrame
Frame
Signals start of user interruption
UserStartedSpeakingFrame
Frame
Signals user started speaking
UserStoppedSpeakingFrame
Frame
Signals user stopped speaking
OpenAILLMContextFrame
Frame
Contains conversation context
LLMMessagesAppendFrame
Frame
Adds messages to the conversation
LLMUpdateSettingsFrame
Frame
Updates LLM settings
LLMSetToolsFrame
Frame
Sets available tools for the LLM

Output Frames

TTSAudioRawFrame
Frame
Generated speech audio
TTSStartedFrame
Frame
Signals start of speech synthesis
TTSStoppedFrame
Frame
Signals end of speech synthesis
LLMTextFrame
Frame
Generated text responses from the LLM
TTSTextFrame
Frame
Text used for speech synthesis
TranscriptionFrame
Frame
Speech transcriptions from user audio
LLMFullResponseStartFrame
Frame
Signals the start of a complete LLM response
LLMFullResponseEndFrame
Frame
Signals the end of a complete LLM response
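Downstream processors can react to these frames. Below is a hedged sketch of a custom FrameProcessor that logs user transcriptions and generated LLM text, following Pipecat's standard process_frame pattern; place it after the LLM service in your pipeline to observe its output frames:
from pipecat.frames.frames import Frame, LLMTextFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class ConversationLogger(FrameProcessor):
    """Logs user transcriptions and generated LLM text."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            print(f"User: {frame.text}")
        elif isinstance(frame, LLMTextFrame):
            print(f"Model: {frame.text}")

        # Always forward frames so the rest of the pipeline sees them
        await self.push_frame(frame, direction)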

Function Calling

This service supports function calling (also known as tool calling), which allows the LLM to request information from external services and APIs; a registration sketch follows the lists below. For example, you can enable your bot to:
  • Check current weather conditions
  • Query databases
  • Access external APIs
  • Perform custom actions
See the Function Calling guide for:
  • Detailed implementation instructions
  • Provider-specific function definitions
  • Handler registration examples
  • Control over function call behavior
  • Complete usage examples
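As a rough sketch, tools can be passed to the constructor as a list of dicts in Gemini's function_declarations format and handlers registered with register_function. The weather tool below is hypothetical, and the handler signature shown is the positional style; consult the Function Calling guide for the signature your Pipecat version expects:
async def fetch_weather(function_name, tool_call_id, args, llm, context, result_callback):
    # args holds the model-provided arguments, e.g. {"location": "San Francisco"}
    # A real bot would call a weather API here (hypothetical values below)
    await result_callback({"conditions": "sunny", "temperature_f": 72})

# Hypothetical tool definition in Gemini's function_declarations format
tools = [
    {
        "function_declarations": [
            {
                "name": "get_current_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            }
        ]
    }
]

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    tools=tools,
)
llm.register_function("get_current_weather", fetch_weather)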

Token Usage Tracking

Gemini Multimodal Live automatically tracks token usage metrics, providing:
  • Prompt token counts
  • Completion token counts
  • Total token counts
  • Detailed token breakdowns by modality (text, audio)
These metrics can be used for monitoring usage, optimizing costs, and understanding model performance.
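Metrics are delivered as MetricsFrame objects once they are enabled on the pipeline task. A minimal sketch, assuming the standard PipelineParams flags:
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # Emit MetricsFrame objects
        enable_usage_metrics=True,  # Include token usage in the metrics
    ),
)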

Language Support

Gemini Multimodal Live supports the following languages:
Language Code       Description            Gemini Code
Language.AR         Arabic                 ar-XA
Language.BN_IN      Bengali (India)        bn-IN
Language.CMN_CN     Chinese (Mandarin)     cmn-CN
Language.DE_DE      German (Germany)       de-DE
Language.EN_US      English (US)           en-US
Language.EN_AU      English (Australia)    en-AU
Language.EN_GB      English (UK)           en-GB
Language.EN_IN      English (India)        en-IN
Language.ES_ES      Spanish (Spain)        es-ES
Language.ES_US      Spanish (US)           es-US
Language.FR_FR      French (France)        fr-FR
Language.FR_CA      French (Canada)        fr-CA
Language.GU_IN      Gujarati (India)       gu-IN
Language.HI_IN      Hindi (India)          hi-IN
Language.ID_ID      Indonesian             id-ID
Language.IT_IT      Italian (Italy)        it-IT
Language.JA_JP      Japanese (Japan)       ja-JP
Language.KN_IN      Kannada (India)        kn-IN
Language.KO_KR      Korean (Korea)         ko-KR
Language.ML_IN      Malayalam (India)      ml-IN
Language.MR_IN      Marathi (India)        mr-IN
Language.NL_NL      Dutch (Netherlands)    nl-NL
Language.PL_PL      Polish (Poland)        pl-PL
Language.PT_BR      Portuguese (Brazil)    pt-BR
Language.RU_RU      Russian (Russia)       ru-RU
Language.TA_IN      Tamil (India)          ta-IN
Language.TE_IN      Telugu (India)         te-IN
Language.TH_TH      Thai (Thailand)        th-TH
Language.TR_TR      Turkish (Turkey)       tr-TR
Language.VI_VN      Vietnamese (Vietnam)   vi-VN
You can set the language using the language parameter:
import os

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
)

# Set language during initialization
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(language=Language.ES_ES)  # Spanish (Spain)
)

Next Steps

Examples

  • Foundational Example: Basic implementation showing core features and transcription
  • Simple Chatbot: A client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot.

Learn More

Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.