Gemini Multimodal Live
A real-time, multimodal conversational AI service powered by Google’s Gemini
The GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini model, with built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences. Its main features are:
Real-time Interaction
Stream audio and video in real time with low-latency responses
Speech Processing
Built-in speech-to-text and text-to-speech capabilities with multiple voice options
Voice Activity Detection
Automatic detection of speech start/stop for natural conversations
Context Management
Intelligent handling of conversation history and system instructions
Want to start building? Check out our Gemini Multimodal Live Guide.
Installation
To use GeminiMultimodalLiveLLMService, install the required dependencies:
You’ll also need to set your Google API key in the GOOGLE_API_KEY environment variable.
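As a minimal sketch of reading that variable at startup, the helper below fails early with a clear message when the key is missing. The function name load_google_api_key is hypothetical and not part of Pipecat; only the GOOGLE_API_KEY variable name comes from this page.

```python
import os


def load_google_api_key(env=os.environ):
    """Return the Google API key from the environment, or raise a clear error.

    `load_google_api_key` is an illustrative helper, not a Pipecat function.
    """
    # Strip whitespace so a stray newline in a .env file does not break auth.
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it before starting the bot"
        )
    return key
```

The returned value is what you would pass as the service's API key. Accepting the environment as an argument keeps the helper easy to test without touching the real process environment.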
Basic Usage
Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:
Configuration
Constructor Parameters
Your Google API key
API endpoint URL
Gemini model to use
Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)
Enable transcription of user audio
Enable transcription of model responses
High-level instructions that guide the model’s behavior
Input Parameters
Penalizes repeated tokens. Range: 0.0 to 2.0
Maximum number of tokens to generate
Response modalities to include (options: AUDIO, TEXT).
Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0
Controls randomness in responses. Range: 0.0 to 2.0
Limits vocabulary to k most likely tokens. Minimum: 0
Cumulative probability cutoff for token selection. Range: 0.0 to 1.0
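The ranges above can be checked before a configuration is sent to the service. The sketch below is a standalone illustration: the GenerationSettings class and its field names are assumptions made for this example, not Pipecat's own parameter class, and only the numeric ranges and the AUDIO/TEXT options are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Modalities listed on this page.
ALLOWED_MODALITIES = {"AUDIO", "TEXT"}


def _check_range(name, value, low, high):
    """Raise ValueError if a provided value falls outside [low, high]."""
    if value is not None and not (low <= value <= high):
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass
class GenerationSettings:
    # Hypothetical field names chosen for illustration only.
    frequency_penalty: Optional[float] = None  # penalizes repeated tokens
    presence_penalty: Optional[float] = None   # penalizes tokens already present
    temperature: Optional[float] = None        # randomness of responses
    top_k: Optional[int] = None                # k most likely tokens
    top_p: Optional[float] = None              # cumulative probability cutoff
    modalities: List[str] = field(default_factory=list)

    def __post_init__(self):
        _check_range("frequency_penalty", self.frequency_penalty, 0.0, 2.0)
        _check_range("presence_penalty", self.presence_penalty, 0.0, 2.0)
        _check_range("temperature", self.temperature, 0.0, 2.0)
        _check_range("top_p", self.top_p, 0.0, 1.0)
        if self.top_k is not None and self.top_k < 0:
            raise ValueError(f"top_k must be at least 0, got {self.top_k}")
        unknown = set(self.modalities) - ALLOWED_MODALITIES
        if unknown:
            raise ValueError(f"unsupported modalities: {sorted(unknown)}")
```

For example, GenerationSettings(temperature=0.7, top_p=0.9, modalities=["AUDIO"]) is accepted, while GenerationSettings(temperature=2.5) raises a ValueError. Leaving a field as None means "use the service default" in this sketch.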
Frame Types
Input Frames
Raw audio data for speech input
Signals start of user interruption
Signals user started speaking
Signals user stopped speaking
Contains conversation context
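To illustrate how the speaking signals relate to the raw audio, here is a small sketch that collects audio only between a "started speaking" and a "stopped speaking" signal. The AudioIn, UserStartedSpeaking, and UserStoppedSpeaking classes are stand-ins defined for this example and are not Pipecat's real frame classes, and the service itself may handle audio differently.

```python
from dataclasses import dataclass
from typing import List, Optional


# Stand-in frame types for illustration only; not Pipecat's real classes.
@dataclass
class AudioIn:
    data: bytes


class UserStartedSpeaking:
    pass


class UserStoppedSpeaking:
    pass


class UtteranceBuffer:
    """Collects raw audio between the speech-start and speech-stop signals."""

    def __init__(self):
        self._chunks: List[bytes] = []
        self._speaking = False

    def handle(self, frame) -> Optional[bytes]:
        """Process one frame; return the full utterance when speech stops."""
        if isinstance(frame, UserStartedSpeaking):
            # Start a fresh utterance and discard any stale audio.
            self._speaking = True
            self._chunks = []
        elif isinstance(frame, AudioIn) and self._speaking:
            self._chunks.append(frame.data)
        elif isinstance(frame, UserStoppedSpeaking) and self._speaking:
            self._speaking = False
            return b"".join(self._chunks)
        # Audio outside an utterance and repeated stop signals are ignored.
        return None
```

Feeding the buffer a start signal, two audio chunks, and a stop signal returns the two chunks joined together; audio that arrives before the start signal is dropped.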
Output Frames
Generated speech audio
Signals start of speech synthesis
Signals end of speech synthesis
Generated text responses
Speech transcriptions
Next Steps
Examples
- Foundational Example: Basic implementation showing core features and transcription
- Simple Chatbot: A client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot.
Learn More
Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.