Gemini Multimodal Live
A real-time, multimodal conversational AI service powered by Google’s Gemini
The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. Key capabilities:
Real-time Interaction
Stream audio and video in real-time with low latency response times
Speech Processing
Built-in speech-to-text and text-to-speech capabilities with multiple voice options
Voice Activity Detection
Automatic detection of speech start/stop for natural conversations
Context Management
Intelligent handling of conversation history and system instructions
Want to start building? Check out our Gemini Multimodal Live Guide.
Installation
To use GeminiMultimodalLiveLLMService, install the required dependencies:
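For Pipecat-based projects, this typically means installing the Google extra of the pipecat-ai package (the package name and extra below are assumptions; check your installation docs):

```shell
pip install "pipecat-ai[google]"
```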
You’ll also need to set your Google API key in the GOOGLE_API_KEY environment variable.
Basic Usage
Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:
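The sketch below shows how the service is typically constructed and wired. The import path and parameter names (api_key, voice_id, system_instruction, start_audio_paused, start_video_paused) follow common Pipecat conventions and may differ in your installed version; treat them as assumptions.

```python
import os

# Import path and parameter names are assumptions based on typical
# Pipecat usage; verify against your installed version.
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Aoede, Charon, Fenrir, Kore, or Puck
    system_instruction="You are a helpful, concise voice assistant.",
    start_audio_paused=False,
    start_video_paused=True,
)

# The service then sits in a pipeline between the transport's input and
# output, alongside context aggregators, e.g.:
#   pipeline = Pipeline([transport.input(), context_aggregator.user(),
#                        llm, transport.output(), context_aggregator.assistant()])
```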
Configuration
Constructor Parameters
Your Google API key
API endpoint URL
Gemini model to use (a v1beta Multimodal Live model)
Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)
High-level instructions that guide the model’s behavior
Whether to start with audio input paused
Whether to start with video input paused
Tools/functions available to the model
Whether to generate a response when context is first set
Input Parameters
Penalizes repeated tokens. Range: 0.0 to 2.0
Maximum number of tokens to generate
Response modalities to include (options: AUDIO, TEXT)
Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0
Controls randomness in responses. Range: 0.0 to 2.0
Language for generation. Over 30 languages are supported.
Controls image processing quality and token usage:
- LOW: uses 64 tokens
- MEDIUM: uses 256 tokens
- HIGH: zoomed reframing with 256 tokens
Voice Activity Detection configuration:
- disabled: toggle VAD on/off
- start_sensitivity: how quickly speech is detected (HIGH/LOW)
- end_sensitivity: how quickly turns end after pauses (HIGH/LOW)
- prefix_padding_ms: milliseconds of audio to keep before speech starts
- silence_duration_ms: milliseconds of silence required to end a turn
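To make the endpointing parameters concrete, here is a toy turn-end check (an illustration of the idea, not the service's actual implementation) showing how silence_duration_ms determines when a turn ends:

```python
def turn_end_index(frames, frame_ms=20, silence_duration_ms=500):
    """Return the index of the frame where the turn ends, or None.

    `frames` is a list of booleans: True = speech, False = silence.
    A turn ends once `silence_duration_ms` of consecutive silence
    follows some speech. Toy illustration only.
    """
    needed = silence_duration_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            if heard_speech and silent_run >= needed:
                return i
    return None

# 10 speech frames followed by 30 silent frames (20 ms each):
frames = [True] * 10 + [False] * 30
# 500 ms / 20 ms = 25 consecutive silent frames end the turn.
print(turn_end_index(frames))  # 34
```

Raising silence_duration_ms makes the bot wait longer before responding; lowering it makes turn-taking snappier but risks cutting off slow speakers.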
Limits vocabulary to the k most likely tokens. Minimum: 0
Cumulative probability cutoff for token selection. Range: 0.0 to 1.0
Parameters for managing the context window:
- enabled: enable/disable compression (default: False)
- trigger_tokens: number of tokens that triggers compression (default: None, which uses 80% of the context window)
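The trigger logic described above can be sketched as follows (assumed behavior matching the documented defaults, not the service's source):

```python
def should_compress(token_count, context_window, trigger_tokens=None, enabled=False):
    """Decide whether to compress the conversation context.

    Mirrors the documented defaults: compression is off unless enabled,
    and with trigger_tokens=None the threshold falls back to 80% of the
    context window. Illustrative sketch only.
    """
    if not enabled:
        return False
    threshold = trigger_tokens if trigger_tokens is not None else int(context_window * 0.8)
    return token_count >= threshold

print(should_compress(900, 1000, enabled=True))                      # True: 900 >= 800
print(should_compress(700, 1000, enabled=True))                      # False
print(should_compress(700, 1000, trigger_tokens=650, enabled=True))  # True
print(should_compress(900, 1000))                                    # False: disabled
```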
Methods
Pause or unpause audio input processing
Pause or unpause video input processing
Change the response modality (TEXT or AUDIO)
Change the language for generation
Set the conversation context explicitly
Create context aggregators for managing conversation state
Frame Types
Input Frames
Raw audio data for speech input
Raw image data for visual input
Signals start of user interruption
Signals user started speaking
Signals user stopped speaking
Contains conversation context
Adds messages to the conversation
Updates LLM settings
Sets available tools for the LLM
Output Frames
Generated speech audio
Signals start of speech synthesis
Signals end of speech synthesis
Generated text responses from the LLM
Text used for speech synthesis
Speech transcriptions from user audio
Signals the start of a complete LLM response
Signals the end of a complete LLM response
Function Calling
This service supports function calling (also known as tool calling), which allows the LLM to request information from external services and APIs. For example, you can enable your bot to:
- Check current weather conditions
- Query databases
- Access external APIs
- Perform custom actions
See the Function Calling guide for:
- Detailed implementation instructions
- Provider-specific function definitions
- Handler registration examples
- Control over function call behavior
- Complete usage examples
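As a hedged sketch, registering a handler in a Pipecat pipeline usually looks like the following; the handler signature and register_function call reflect common Pipecat patterns and should be verified against your version:

```python
# Assumed Pipecat-style function registration; names are illustrative.
async def fetch_weather(params):
    # In a real bot you would call a weather API here, then return the
    # result to the LLM via the provided callback.
    await params.result_callback({"conditions": "sunny", "temperature_f": 72})

# llm is a GeminiMultimodalLiveLLMService instance; "get_weather" must
# match a tool declared in the tools passed to the constructor.
llm.register_function("get_weather", fetch_weather)
```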
Token Usage Tracking
Gemini Multimodal Live automatically tracks token usage metrics, providing:
- Prompt token counts
- Completion token counts
- Total token counts
- Detailed token breakdowns by modality (text, audio)
These metrics can be used for monitoring usage, optimizing costs, and understanding model performance.
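The per-modality breakdown can be aggregated with a small helper like this (a sketch; the service's own metric frames may use different field names):

```python
from dataclasses import dataclass, field

@dataclass
class TokenUsage:
    """Accumulates token metrics across LLM responses."""
    prompt: int = 0
    completion: int = 0
    by_modality: dict = field(default_factory=dict)

    def add(self, prompt, completion, modality):
        self.prompt += prompt
        self.completion += completion
        self.by_modality[modality] = (
            self.by_modality.get(modality, 0) + prompt + completion
        )

    @property
    def total(self):
        return self.prompt + self.completion

usage = TokenUsage()
usage.add(prompt=120, completion=40, modality="text")
usage.add(prompt=800, completion=300, modality="audio")
print(usage.total)                 # 1260
print(usage.by_modality["audio"])  # 1100
```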
Language Support
Gemini Multimodal Live supports the following languages:
Language Code | Description | Gemini Code |
---|---|---|
Language.AR | Arabic | ar-XA |
Language.BN_IN | Bengali (India) | bn-IN |
Language.CMN_CN | Chinese (Mandarin) | cmn-CN |
Language.DE_DE | German (Germany) | de-DE |
Language.EN_US | English (US) | en-US |
Language.EN_AU | English (Australia) | en-AU |
Language.EN_GB | English (UK) | en-GB |
Language.EN_IN | English (India) | en-IN |
Language.ES_ES | Spanish (Spain) | es-ES |
Language.ES_US | Spanish (US) | es-US |
Language.FR_FR | French (France) | fr-FR |
Language.FR_CA | French (Canada) | fr-CA |
Language.GU_IN | Gujarati (India) | gu-IN |
Language.HI_IN | Hindi (India) | hi-IN |
Language.ID_ID | Indonesian | id-ID |
Language.IT_IT | Italian (Italy) | it-IT |
Language.JA_JP | Japanese (Japan) | ja-JP |
Language.KN_IN | Kannada (India) | kn-IN |
Language.KO_KR | Korean (Korea) | ko-KR |
Language.ML_IN | Malayalam (India) | ml-IN |
Language.MR_IN | Marathi (India) | mr-IN |
Language.NL_NL | Dutch (Netherlands) | nl-NL |
Language.PL_PL | Polish (Poland) | pl-PL |
Language.PT_BR | Portuguese (Brazil) | pt-BR |
Language.RU_RU | Russian (Russia) | ru-RU |
Language.TA_IN | Tamil (India) | ta-IN |
Language.TE_IN | Telugu (India) | te-IN |
Language.TH_TH | Thai (Thailand) | th-TH |
Language.TR_TR | Turkish (Turkey) | tr-TR |
Language.VI_VN | Vietnamese (Vietnam) | vi-VN |
You can set the language using the language parameter:
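For example (import paths and the InputParams name are assumptions based on typical Pipecat layouts; adjust for your version):

```python
# Assumed import paths; verify against your Pipecat installation.
from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key="...",  # your Google API key
    params=InputParams(language=Language.ES_ES),  # Spanish (Spain)
)
```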
Next Steps
Examples
- Foundational Example: a basic implementation showing core features and transcription
- Simple Chatbot: a client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot
Learn More
Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.