The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini Multimodal Live API, with built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. Key features:

Real-time Interaction

Stream audio and video in real time with low-latency responses

Speech Processing

Built-in speech-to-text and text-to-speech capabilities with multiple voice options

Voice Activity Detection

Automatic detection of speech start/stop for natural conversations

Context Management

Intelligent handling of conversation history and system instructions

Want to start building? Check out our Gemini Multimodal Live Guide.

Installation

To use GeminiMultimodalLiveLLMService, install the required dependencies:

pip install "pipecat-ai[google]"

You’ll need to set up your Google API key as an environment variable: GOOGLE_API_KEY.
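
For example, in your shell (the key value below is a placeholder):

export GOOGLE_API_KEY="your-api-key"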

Basic Usage

Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:

import os

from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    GeminiMultimodalModalities,
    InputParams,
)
from pipecat.transcriptions.language import Language

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Aoede",                                # Voices: Aoede, Charon, Fenrir, Kore, Puck
    params=InputParams(
        temperature=0.7,                             # Set model input params
        language=Language.EN_US,                     # Set language (30+ languages supported)
        modalities=GeminiMultimodalModalities.AUDIO  # Response modality
    )
)
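
In a full application, the service runs inside a Pipecat pipeline. The sketch below shows minimal wiring; transport is a placeholder for whichever Pipecat transport you use (for example, Daily or WebRTC), and the context setup uses the context aggregator pattern described under Methods below:

from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Seed the conversation; a system prompt can also be passed via system_instruction
context = OpenAILLMContext(
    [{"role": "user", "content": "Greet the user and ask how you can help."}]
)
context_aggregator = llm.create_context_aggregator(context)

pipeline = Pipeline([
    transport.input(),               # Audio (and optionally video) from the user
    context_aggregator.user(),       # Collect user turns into the context
    llm,                             # Gemini Multimodal Live
    transport.output(),              # Synthesized audio back to the user
    context_aggregator.assistant(),  # Collect assistant responses
])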

Configuration

Constructor Parameters

api_key
str
required

Your Google API key

base_url
str

API endpoint URL

model
str

Gemini model to use (the service targets the v1beta Live API)

voice_id
str
default:"Charon"

Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Choose your preferred voice
)

system_instruction
str

High-level instructions that guide the model’s behavior

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="Talk like a pirate.",
)

start_audio_paused
bool
default:"False"

Whether to start with audio input paused

start_video_paused
bool
default:"False"

Whether to start with video input paused

tools
Union[List[dict], ToolsSchema]

Tools/functions available to the model

inference_on_context_initialization
bool
default:"True"

Whether to generate a response when context is first set
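
For example, a bot that ignores the camera and stays silent until the user speaks first might combine these flags as follows (an illustrative sketch, not a required configuration):

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    start_video_paused=True,                    # Ignore video input until unpaused
    inference_on_context_initialization=False,  # Don't respond until the user speaks
)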

Input Parameters

frequency_penalty
float
default:"None"

Penalizes repeated tokens. Range: 0.0 to 2.0

max_tokens
int
default:"4096"

Maximum number of tokens to generate

modalities
GeminiMultimodalModalities
default:"AUDIO"

Response modalities to include (options: AUDIO, TEXT).

presence_penalty
float
default:"None"

Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0

temperature
float
default:"None"

Controls randomness in responses. Range: 0.0 to 2.0

language
Language
default:"Language.EN_US"

Language for generation. Over 30 languages are supported.

media_resolution
GeminiMediaResolution
default:"UNSPECIFIED"

Controls image processing quality and token usage:

  • LOW: Uses 64 tokens
  • MEDIUM: Uses 256 tokens
  • HIGH: Zoomed reframing with 256 tokens

vad
GeminiVADParams

Voice Activity Detection configuration:

  • disabled: Toggle VAD on/off
  • start_sensitivity: How quickly speech is detected (HIGH/LOW)
  • end_sensitivity: How quickly turns end after pauses (HIGH/LOW)
  • prefix_padding_ms: Milliseconds of audio to keep before speech
  • silence_duration_ms: Milliseconds of silence to end a turn

from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiVADParams,
    GeminiMediaResolution,
    StartSensitivity,
    EndSensitivity
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        temperature=0.7,
        language=Language.ES_ES,                      # Spanish (Spain)
        media_resolution=GeminiMediaResolution.HIGH,  # Higher quality image processing
        vad=GeminiVADParams(
            start_sensitivity=StartSensitivity.HIGH,  # Detect speech quickly
            end_sensitivity=EndSensitivity.LOW,       # Allow longer pauses
            prefix_padding_ms=300,                    # Keep 300ms before speech
            silence_duration_ms=1000,                 # End turn after 1s silence
        )
    )
)

top_k
int
default:"None"

Limits vocabulary to k most likely tokens. Minimum: 0

top_p
float
default:"None"

Cumulative probability cutoff for token selection. Range: 0.0 to 1.0

context_window_compression
ContextWindowCompressionParams

Parameters for managing the context window:

  • enabled: Enable/disable compression (default: False)
  • trigger_tokens: Number of tokens that trigger compression (default: None, which uses 80% of the context window)

from pipecat.services.gemini_multimodal_live.gemini import (
    ContextWindowCompressionParams
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        top_p=0.9,               # More focused token selection
        top_k=40,                # Limit vocabulary options
        context_window_compression=ContextWindowCompressionParams(
            enabled=True,
            trigger_tokens=8000  # Compress when reaching 8000 tokens
        )
    )
)

Methods

set_audio_input_paused(paused: bool)
method

Pause or unpause audio input processing

set_video_input_paused(paused: bool)
method

Pause or unpause video input processing

set_model_modalities(modalities: GeminiMultimodalModalities)
method

Change the response modality (TEXT or AUDIO)

set_language(language: Language)
method

Change the language for generation

set_context(context: OpenAILLMContext)
method

Set the conversation context explicitly

create_context_aggregator(context: OpenAILLMContext, user_params: LLMUserAggregatorParams, assistant_params: LLMAssistantAggregatorParams)
method

Create context aggregators for managing conversation state
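
An illustrative sketch of adjusting the service mid-session (whether each setter must be awaited can vary by Pipecat version, so check the signatures above):

llm.set_video_input_paused(True)                           # Stop processing camera frames
llm.set_model_modalities(GeminiMultimodalModalities.TEXT)  # Switch to text-only responses
llm.set_language(Language.FR_FR)                           # Continue the session in French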

Frame Types

Input Frames

InputAudioRawFrame
Frame

Raw audio data for speech input

InputImageRawFrame
Frame

Raw image data for visual input

StartInterruptionFrame
Frame

Signals start of user interruption

UserStartedSpeakingFrame
Frame

Signals user started speaking

UserStoppedSpeakingFrame
Frame

Signals user stopped speaking

OpenAILLMContextFrame
Frame

Contains conversation context

LLMMessagesAppendFrame
Frame

Adds messages to the conversation

LLMUpdateSettingsFrame
Frame

Updates LLM settings

LLMSetToolsFrame
Frame

Sets available tools for the LLM
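
For example, here is a minimal sketch of appending a message mid-conversation by queueing an LLMMessagesAppendFrame (assuming a running PipelineTask named task):

from pipecat.frames.frames import LLMMessagesAppendFrame

await task.queue_frames([
    LLMMessagesAppendFrame(
        messages=[{"role": "user", "content": "Please summarize our conversation so far."}]
    )
])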

Output Frames

TTSAudioRawFrame
Frame

Generated speech audio

TTSStartedFrame
Frame

Signals start of speech synthesis

TTSStoppedFrame
Frame

Signals end of speech synthesis

LLMTextFrame
Frame

Generated text responses from the LLM

TTSTextFrame
Frame

Text used for speech synthesis

TranscriptionFrame
Frame

Speech transcriptions from user audio

LLMFullResponseStartFrame
Frame

Signals the start of a complete LLM response

LLMFullResponseEndFrame
Frame

Signals the end of a complete LLM response
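
To observe these output frames, you can add a small custom processor to the pipeline. The TextLogger below is a hypothetical example following Pipecat's standard FrameProcessor pattern:

from pipecat.frames.frames import Frame, LLMTextFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TextLogger(FrameProcessor):
    """Logs generated text and user transcriptions as they pass through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, LLMTextFrame):
            print(f"LLM: {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"User: {frame.text}")
        await self.push_frame(frame, direction)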

Function Calling

This service supports function calling (also known as tool calling) which allows the LLM to request information from external services and APIs. For example, you can enable your bot to:

  • Check current weather conditions
  • Query databases
  • Access external APIs
  • Perform custom actions

See the Function Calling guide for:

  • Detailed implementation instructions
  • Provider-specific function definitions
  • Handler registration examples
  • Control over function call behavior
  • Complete usage examples
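
As a minimal sketch, the snippet below defines a weather tool with a ToolsSchema and registers a handler. The names get_current_weather and fetch_weather are hypothetical, and the handler signature follows recent Pipecat versions (a FunctionCallParams object carrying the arguments and a result callback):

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema

# Hypothetical tool definition
weather_function = FunctionSchema(
    name="get_current_weather",
    description="Get the current weather for a location",
    properties={"location": {"type": "string", "description": "City name"}},
    required=["location"],
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    tools=ToolsSchema(standard_tools=[weather_function]),
)

async def fetch_weather(params):
    # params.arguments holds the model-supplied arguments
    await params.result_callback({"conditions": "sunny", "temperature": "75°F"})

llm.register_function("get_current_weather", fetch_weather)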

Token Usage Tracking

Gemini Multimodal Live automatically tracks token usage metrics, providing:

  • Prompt token counts
  • Completion token counts
  • Total token counts
  • Detailed token breakdowns by modality (text, audio)

These metrics can be used for monitoring usage, optimizing costs, and understanding model performance.
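
These metrics are emitted when usage metrics are enabled on the pipeline task, for example:

from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # Emit processing metrics
        enable_usage_metrics=True,  # Include token usage in the metrics
    ),
)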

Language Support

Gemini Multimodal Live supports the following languages:

Language Code       Description             Gemini Code
Language.AR         Arabic                  ar-XA
Language.BN_IN      Bengali (India)         bn-IN
Language.CMN_CN     Chinese (Mandarin)      cmn-CN
Language.DE_DE      German (Germany)        de-DE
Language.EN_US      English (US)            en-US
Language.EN_AU      English (Australia)     en-AU
Language.EN_GB      English (UK)            en-GB
Language.EN_IN      English (India)         en-IN
Language.ES_ES      Spanish (Spain)         es-ES
Language.ES_US      Spanish (US)            es-US
Language.FR_FR      French (France)         fr-FR
Language.FR_CA      French (Canada)         fr-CA
Language.GU_IN      Gujarati (India)        gu-IN
Language.HI_IN      Hindi (India)           hi-IN
Language.ID_ID      Indonesian              id-ID
Language.IT_IT      Italian (Italy)         it-IT
Language.JA_JP      Japanese (Japan)        ja-JP
Language.KN_IN      Kannada (India)         kn-IN
Language.KO_KR      Korean (Korea)          ko-KR
Language.ML_IN      Malayalam (India)       ml-IN
Language.MR_IN      Marathi (India)         mr-IN
Language.NL_NL      Dutch (Netherlands)     nl-NL
Language.PL_PL      Polish (Poland)         pl-PL
Language.PT_BR      Portuguese (Brazil)     pt-BR
Language.RU_RU      Russian (Russia)        ru-RU
Language.TA_IN      Tamil (India)           ta-IN
Language.TE_IN      Telugu (India)          te-IN
Language.TH_TH      Thai (Thailand)         th-TH
Language.TR_TR      Turkish (Turkey)        tr-TR
Language.VI_VN      Vietnamese (Vietnam)    vi-VN

You can set the language using the language parameter:

from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams
)

# Set language during initialization
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(language=Language.ES_ES)  # Spanish (Spain)
)

Next Steps

Examples

  • Foundational Example Basic implementation showing core features and transcription

  • Simple Chatbot A client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot.

Learn More

Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.