Gemini Multimodal Live
A real-time, multimodal conversational AI service powered by Google’s Gemini
The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. Key capabilities:
Real-time Interaction
Stream audio and video in real-time with low latency response times
Speech Processing
Built-in speech-to-text and text-to-speech capabilities with multiple voice options
Voice Activity Detection
Automatic detection of speech start/stop for natural conversations
Context Management
Intelligent handling of conversation history and system instructions
Want to start building? Check out our Gemini Multimodal Live Guide.
Installation
To use GeminiMultimodalLiveLLMService, install the required dependencies:
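For Pipecat-based projects, this typically means installing the Google extra of the pipecat-ai package (the package name and extra below are assumptions; check your installation docs):

```shell
pip install "pipecat-ai[google]"
```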
You’ll also need to set your Google API key in the GOOGLE_API_KEY environment variable.
Basic Usage
Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:
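The sketch below shows how the service is typically constructed and wired. The import path and parameter names (api_key, voice_id, system_instruction, start_audio_paused, start_video_paused) follow common Pipecat conventions and may differ in your installed version; treat them as assumptions.

```python
import os

# Import path and parameter names are assumptions based on typical
# Pipecat usage; verify against your installed version.
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Aoede, Charon, Fenrir, Kore, or Puck
    system_instruction="You are a helpful, concise voice assistant.",
    start_audio_paused=False,
    start_video_paused=True,
)

# The service then sits in a pipeline between the transport's input and
# output, alongside context aggregators, e.g.:
#   pipeline = Pipeline([transport.input(), context_aggregator.user(),
#                        llm, transport.output(), context_aggregator.assistant()])
```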
Configuration
Constructor Parameters
Your Google API key
API endpoint URL
Gemini model to use (a v1beta Multimodal Live model)
Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)
High-level instructions that guide the model’s behavior
Whether to start with audio input paused
Whether to start with video input paused
Tools/functions available to the model
Whether to generate a response when context is first set
Input Parameters
Penalizes repeated tokens. Range: 0.0 to 2.0
Maximum number of tokens to generate
Response modalities to include (options: AUDIO, TEXT)
Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0
Controls randomness in responses. Range: 0.0 to 2.0
Language for generation. Over 30 languages are supported.
Controls image processing quality and token usage:
- LOW: uses 64 tokens
- MEDIUM: uses 256 tokens
- HIGH: zoomed reframing with 256 tokens
Voice Activity Detection configuration:
- disabled: toggle VAD on/off
- start_sensitivity: how quickly speech is detected (HIGH/LOW)
- end_sensitivity: how quickly turns end after pauses (HIGH/LOW)
- prefix_padding_ms: milliseconds of audio to keep before speech starts
- silence_duration_ms: milliseconds of silence required to end a turn
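To make the endpointing parameters concrete, here is a toy turn-end check (an illustration of the idea, not the service's actual implementation) showing how silence_duration_ms determines when a turn ends:

```python
def turn_end_index(frames, frame_ms=20, silence_duration_ms=500):
    """Return the index of the frame where the turn ends, or None.

    `frames` is a list of booleans: True = speech, False = silence.
    A turn ends once `silence_duration_ms` of consecutive silence
    follows some speech. Toy illustration only.
    """
    needed = silence_duration_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            if heard_speech and silent_run >= needed:
                return i
    return None

# 10 speech frames followed by 30 silent frames (20 ms each):
frames = [True] * 10 + [False] * 30
# 500 ms / 20 ms = 25 consecutive silent frames end the turn.
print(turn_end_index(frames))  # 34
```

Raising silence_duration_ms makes the bot wait longer before responding; lowering it makes turn-taking snappier but risks cutting off slow speakers.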
Limits vocabulary to the k most likely tokens. Minimum: 0
Cumulative probability cutoff for token selection. Range: 0.0 to 1.0
Parameters for managing the context window:
- enabled: enable/disable compression (default: False)
- trigger_tokens: number of tokens that triggers compression (default: None, which uses 80% of the context window)
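The trigger logic described above can be sketched as follows (assumed behavior matching the documented defaults, not the service's source):

```python
def should_compress(token_count, context_window, trigger_tokens=None, enabled=False):
    """Decide whether to compress the conversation context.

    Mirrors the documented defaults: compression is off unless enabled,
    and with trigger_tokens=None the threshold falls back to 80% of the
    context window. Illustrative sketch only.
    """
    if not enabled:
        return False
    threshold = trigger_tokens if trigger_tokens is not None else int(context_window * 0.8)
    return token_count >= threshold

print(should_compress(900, 1000, enabled=True))                      # True: 900 >= 800
print(should_compress(700, 1000, enabled=True))                      # False
print(should_compress(700, 1000, trigger_tokens=650, enabled=True))  # True
print(should_compress(900, 1000))                                    # False: disabled
```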
Methods
Pause or unpause audio input processing
Pause or unpause video input processing
Change the response modality (TEXT or AUDIO)
Change the language for generation
Set the conversation context explicitly
Create context aggregators for managing conversation state
Frame Types
Input Frames
Raw audio data for speech input
Raw image data for visual input
Signals start of user interruption
Signals user started speaking
Signals user stopped speaking
Contains conversation context
Adds messages to the conversation
Updates LLM settings
Sets available tools for the LLM
Output Frames
Generated speech audio
Signals start of speech synthesis
Signals end of speech synthesis
Generated text responses from the LLM
Text used for speech synthesis
Speech transcriptions from user audio
Signals the start of a complete LLM response
Signals the end of a complete LLM response
Function Calling
This service supports function calling (also known as tool calling), which allows the LLM to request information from external services and APIs. For example, you can enable your bot to:
- Check current weather conditions
- Query databases
- Access external APIs
- Perform custom actions
See the Function Calling guide for:
- Detailed implementation instructions
- Provider-specific function definitions
- Handler registration examples
- Control over function call behavior
- Complete usage examples
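As a hedged sketch, registering a handler in a Pipecat pipeline usually looks like the following; the handler signature and register_function call reflect common Pipecat patterns and should be verified against your version:

```python
# Assumed Pipecat-style function registration; names are illustrative.
async def fetch_weather(params):
    # In a real bot you would call a weather API here, then return the
    # result to the LLM via the provided callback.
    await params.result_callback({"conditions": "sunny", "temperature_f": 72})

# llm is a GeminiMultimodalLiveLLMService instance; "get_weather" must
# match a tool declared in the tools passed to the constructor.
llm.register_function("get_weather", fetch_weather)
```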
Token Usage Tracking
Gemini Multimodal Live automatically tracks token usage metrics, providing:
- Prompt token counts
- Completion token counts
- Total token counts
- Detailed token breakdowns by modality (text, audio)
These metrics can be used for monitoring usage, optimizing costs, and understanding model performance.
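The per-modality breakdown can be aggregated with a small helper like this (a sketch; the service's own metric frames may use different field names):

```python
from dataclasses import dataclass, field

@dataclass
class TokenUsage:
    """Accumulates token metrics across LLM responses."""
    prompt: int = 0
    completion: int = 0
    by_modality: dict = field(default_factory=dict)

    def add(self, prompt, completion, modality):
        self.prompt += prompt
        self.completion += completion
        self.by_modality[modality] = (
            self.by_modality.get(modality, 0) + prompt + completion
        )

    @property
    def total(self):
        return self.prompt + self.completion

usage = TokenUsage()
usage.add(prompt=120, completion=40, modality="text")
usage.add(prompt=800, completion=300, modality="audio")
print(usage.total)                 # 1260
print(usage.by_modality["audio"])  # 1100
```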
Language Support
Gemini Multimodal Live supports the following languages:
Language Code | Description | Gemini Code |
---|---|---|
Language.AR | Arabic | ar-XA |
Language.BN_IN | Bengali (India) | bn-IN |
Language.CMN_CN | Chinese (Mandarin) | cmn-CN |
Language.DE_DE | German (Germany) | de-DE |
Language.EN_US | English (US) | en-US |
Language.EN_AU | English (Australia) | en-AU |
Language.EN_GB | English (UK) | en-GB |
Language.EN_IN | English (India) | en-IN |
Language.ES_ES | Spanish (Spain) | es-ES |
Language.ES_US | Spanish (US) | es-US |
Language.FR_FR | French (France) | fr-FR |
Language.FR_CA | French (Canada) | fr-CA |
Language.GU_IN | Gujarati (India) | gu-IN |
Language.HI_IN | Hindi (India) | hi-IN |
Language.ID_ID | Indonesian | id-ID |
Language.IT_IT | Italian (Italy) | it-IT |
Language.JA_JP | Japanese (Japan) | ja-JP |
Language.KN_IN | Kannada (India) | kn-IN |
Language.KO_KR | Korean (Korea) | ko-KR |
Language.ML_IN | Malayalam (India) | ml-IN |
Language.MR_IN | Marathi (India) | mr-IN |
Language.NL_NL | Dutch (Netherlands) | nl-NL |
Language.PL_PL | Polish (Poland) | pl-PL |
Language.PT_BR | Portuguese (Brazil) | pt-BR |
Language.RU_RU | Russian (Russia) | ru-RU |
Language.TA_IN | Tamil (India) | ta-IN |
Language.TE_IN | Telugu (India) | te-IN |
Language.TH_TH | Thai (Thailand) | th-TH |
Language.TR_TR | Turkish (Turkey) | tr-TR |
Language.VI_VN | Vietnamese (Vietnam) | vi-VN |
You can set the language using the language parameter:
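For example (import paths and the InputParams name are assumptions based on typical Pipecat layouts; adjust for your version):

```python
# Assumed import paths; verify against your Pipecat installation.
from pipecat.transcriptions.language import Language
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
    InputParams,
)

llm = GeminiMultimodalLiveLLMService(
    api_key="...",  # your Google API key
    params=InputParams(language=Language.ES_ES),  # Spanish (Spain)
)
```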
Next Steps
Examples
- Foundational Example: a basic implementation showing core features and transcription
- Simple Chatbot: a client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot
Learn More
Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.