Building with Gemini Multimodal Live
Create a real-time AI chatbot using Gemini Multimodal Live and Pipecat
This guide will walk you through building a real-time AI chatbot using Gemini Multimodal Live and Pipecat. We’ll create a complete application with a Pipecat server and a Pipecat React client that enables natural conversations with an AI assistant.
- API Reference: Gemini Multimodal Live API documentation
- Example Code: find the complete client and server code on GitHub
- Client SDK: Pipecat React SDK documentation
What We’ll Build
In this guide, you’ll create:
- A FastAPI server that manages bot instances
- A Gemini-powered conversational AI bot
- A React client with real-time audio/video
- A complete pipeline for speech-to-speech interaction
Key Concepts
Before we dive into implementation, let’s cover some important concepts that will help you understand how Pipecat and Gemini work together.
Understanding Pipelines
At the heart of Pipecat is the pipeline system. A pipeline is a sequence of processors that handle different aspects of the conversation flow. Think of it like an assembly line where each station (processor) performs a specific task.
For our chatbot, the pipeline looks like this:
Processors
Each processor in the pipeline handles a specific task:
- Transport: `transport.input()` and `transport.output()` handle media streaming with Daily
- Context: `context_aggregator` maintains conversation history for natural dialogue
- Speech Processing: `rtvi_user_transcription` and `rtvi_bot_transcription` handle speech-to-text
- Animation: `talking_animation` controls the bot’s visual state based on speaking activity
The order of processors matters! Data flows through the pipeline in sequence, so each processor should receive the data it needs from previous processors.
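To make the ordering idea concrete, here is a toy sketch in plain Python (this is illustrative only, not the actual Pipecat API): each processor transforms a frame and hands it to the next stage, so a processor that depends on a transcript must sit downstream of the processor that produces one.

```python
# Conceptual sketch of a Pipecat-style pipeline: each processor receives
# frames from the previous stage and passes (possibly enriched) frames on.
# Illustrative plain Python -- not the real Pipecat classes.

class Processor:
    def process(self, frame: dict) -> dict:
        return frame

class TranscriptionProcessor(Processor):
    """Adds a transcript field -- must run after audio is present."""
    def process(self, frame: dict) -> dict:
        if "audio" in frame:
            frame["transcript"] = f"text-for:{frame['audio']}"
        return frame

class AnimationProcessor(Processor):
    """Flags the bot as talking -- relies on the transcript existing."""
    def process(self, frame: dict) -> dict:
        frame["talking"] = "transcript" in frame
        return frame

class Pipeline:
    def __init__(self, processors):
        self.processors = processors

    def run(self, frame: dict) -> dict:
        # Frames flow through the processors strictly in order.
        for p in self.processors:
            frame = p.process(frame)
        return frame

pipeline = Pipeline([TranscriptionProcessor(), AnimationProcessor()])
result = pipeline.run({"audio": "utterance-1"})
```

Swap the two processors and the animation stage would never see a transcript; that is exactly the kind of dependency to keep in mind when ordering real Pipecat processors.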
Learn more in the Pipecat server’s Core Concepts documentation.
Gemini Integration
The `GeminiMultimodalLiveLLMService` is a speech-to-speech LLM service that interfaces with the Gemini Multimodal Live API. It provides:
- Real-time speech-to-speech conversation
- Context management
- Voice activity detection
- Tool use
Pipecat manages two types of connections:
- A WebRTC connection between the Pipecat client and server for reliable audio/video streaming
- A WebSocket connection between the Pipecat server and Gemini for real-time AI processing
This architecture ensures stable media streaming while maintaining responsive AI interactions.
Prerequisites
Before we begin, you’ll need:
- Python 3.10 or higher
- Node.js 16 or higher
- A Daily API key
- A Google API key with Gemini Multimodal Live access
- Clone the Pipecat repo:
Server Implementation
Let’s start by setting up the server components. Our server will handle bot management, room creation, and client connections.
Environment Setup
- Navigate to the simple-chatbot’s server directory:
- Set up a python virtual environment:
- Install requirements:
- Copy env.example to .env and configure your API keys:
Server Setup (server.py)
server.py
is a FastAPI server that creates the meeting room where clients and bots interact, manages bot instances, and handles client connections. It’s the orchestrator that brings everything on the server-side together.
Creating Meeting Room
The server uses Daily’s API via a REST API helper to create rooms where clients and bots can meet. Each room is a secure space for audio/video communication:
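As a minimal sketch of what the REST helper does under the hood, you can call Daily’s `/rooms` endpoint directly with the standard library. The expiry time and payload shape here are assumptions for illustration; the example code uses Pipecat’s Daily REST helper instead.

```python
import json
import os
import time
import urllib.request

DAILY_API_URL = "https://api.daily.co/v1"

def build_room_payload(expiry_minutes: int = 60) -> dict:
    """Room properties: set an expiry so unused rooms clean themselves up."""
    return {"properties": {"exp": int(time.time()) + expiry_minutes * 60}}

def create_room(api_key: str) -> dict:
    """POST /rooms returns the room object, including its `url`."""
    req = urllib.request.Request(
        f"{DAILY_API_URL}/rooms",
        data=json.dumps(build_room_payload()).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a real key):
# room = create_room(os.environ["DAILY_API_KEY"])
# print(room["url"])
```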
Managing Bot Instances
When a client connects, the server starts a new bot instance configured specifically for that room. It keeps track of running bots and ensures there’s only one bot per room:
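The bookkeeping for one-bot-per-room can be as simple as a dict keyed by room URL. This sketch uses `subprocess` to launch the bot script; the script name and flags are illustrative, not the exact ones from the example repo.

```python
import subprocess
from typing import Callable, Dict

# Track one bot subprocess per room URL.
bot_procs: Dict[str, subprocess.Popen] = {}

def start_bot(room_url: str, token: str,
              spawn: Callable[..., subprocess.Popen] = subprocess.Popen):
    """Launch a bot for the room unless one is already running there."""
    proc = bot_procs.get(room_url)
    if proc is not None and proc.poll() is None:
        # poll() returning None means the process is still alive.
        raise RuntimeError(f"Bot already running for {room_url}")
    # Script name and CLI flags are illustrative.
    bot_procs[room_url] = spawn(
        ["python", "bot-gemini.py", "-u", room_url, "-t", token]
    )
    return bot_procs[room_url]
```

The injectable `spawn` parameter is just a convenience for testing; in the server it defaults to `subprocess.Popen`.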
Connection Endpoints
The server provides two ways to connect:
Browser Access (/)
Creates a room, starts a bot, and redirects the browser to the Daily meeting URL. Perfect for quick testing and development.
RTVI Client (/connect)
Creates a room, starts a bot, and returns connection credentials. Used by RTVI clients for custom implementations.
Bot Implementation (bot-gemini.py)
The bot implementation connects all the pieces: Daily transport, Gemini service, conversation context, and processors.
Let’s break down each component:
Transport Setup
First, we configure the Daily transport, which handles WebRTC communication between the client and server.
Gemini Multimodal Live audio requirements:
- Input: 16 kHz sample rate
- Output: 24 kHz sample rate
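A transport configuration along these lines sets the sample rates Gemini expects. Import paths and parameter names follow recent pipecat-ai releases and should be checked against your installed version; `room_url` and `token` come from the server’s room creation step.

```python
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer

transport = DailyTransport(
    room_url,
    token,
    "Chatbot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        audio_in_sample_rate=16000,   # Gemini expects 16 kHz input
        audio_out_sample_rate=24000,  # Gemini produces 24 kHz output
        vad_analyzer=SileroVADAnalyzer(),  # voice activity detection
    ),
)
```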
Gemini Service Configuration
Next, we initialize the Gemini service which will provide speech-to-speech inference and communication:
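Initialization looks roughly like the following; the import path, `voice_id` value, and transcription flag reflect recent pipecat-ai releases and are assumptions to verify against your version.

```python
import os
from pipecat.services.gemini_multimodal_live.gemini import (
    GeminiMultimodalLiveLLMService,
)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.environ["GOOGLE_API_KEY"],
    voice_id="Puck",              # one of Gemini's built-in voices
    transcribe_user_audio=True,   # emit user transcription frames
)
```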
Conversation Context
We give our bot its personality and initial instructions:
`OpenAILLMContext` is used as a common base for LLM context management across services. In the future, we may add a Gemini-specific context manager.
The context aggregator automatically maintains conversation history, helping the bot remember previous interactions.
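A context setup in this spirit (the system prompt wording is illustrative, and `llm` is the Gemini service instance created earlier) looks like:

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# The system message gives the bot its personality and ground rules.
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot. Keep your responses short "
                   "and conversational.",
    },
]

context = OpenAILLMContext(messages)
# The aggregator pair records user and assistant turns around the LLM,
# so the bot remembers earlier parts of the conversation.
context_aggregator = llm.create_context_aggregator(context)
```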
Processor Setup
We initialize several processors to handle different aspects of the interaction:
RTVI Processors
- `RTVISpeakingProcessor`: manages speaking states
- `RTVITranscriptionProcessor`: handles transcription events
- `RTVIMetricsProcessor`: tracks performance metrics
Animation
- `TalkingAnimation`: controls the bot’s visual state, switching between static and animated frames based on speaking status
Learn more about the RTVI framework and available processors.
Pipeline Assembly
Finally, we bring everything together in a pipeline:
The order of processors is crucial! For example, transcription processors should come after the LLM to capture the processed speech.
Traditional STT, LLM, TTS pipelines have a different ordering, so be sure to tailor your processor ordering based on the elements in the pipeline.
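The assembled pipeline lists the processors in flow order. Variable names follow this guide; the `rtvi_speaking` and `rtvi_metrics` names and the exact ordering are illustrative, so compare against the example repo’s `bot-gemini.py` before relying on them.

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineParams, PipelineTask

pipeline = Pipeline(
    [
        transport.input(),               # client audio in
        context_aggregator.user(),       # add user turns to the context
        llm,                             # Gemini speech-to-speech inference
        rtvi_speaking,                   # speaking-state events
        rtvi_user_transcription,         # user transcript events
        rtvi_bot_transcription,          # bot transcript events
        talking_animation,               # static vs. animated frames
        rtvi_metrics,                    # performance metrics
        transport.output(),              # audio/video out to the client
        context_aggregator.assistant(),  # add bot turns to the context
    ]
)

task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
```

Note that the transcription processors sit after `llm`, matching the ordering caveat above for speech-to-speech pipelines.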
Client Implementation
Our React client uses the Pipecat React SDK to communicate with the bot. Let’s explore how the client connects and interacts with our Pipecat server.
Connection Setup
The client needs to connect to our bot server using the same transport type (Daily WebRTC) that we configured on the server:
The connection configuration must match your server:
- `DailyTransport`: matches the WebRTC transport used in `bot-gemini.py`
- `connect` endpoint: matches the `/connect` route in `server.py`
- Media settings: controls which devices are enabled on join
Media Handling
Pipecat’s React components handle all the complex media stream management for you:
The `RTVIClientProvider` is the root component that supplies the RTVI client context to your application. By wrapping your `RTVIClientAudio` and `RTVIClientVideo` components in this provider, they can access the client instance and process the media streams received from the Pipecat server.
Real-time Events
The RTVI processors we configured in the pipeline emit events that we can handle in our client:
Available Events
- Speaking state changes
- Transcription updates
- Bot responses
- Connection status
- Performance metrics
Event Usage
Use these events to:
- Show speaking indicators
- Display transcripts
- Update UI state
- Monitor performance
Optionally, use callbacks to handle events in your application. Learn more in the Pipecat client docs.
Complete Example
Here’s a basic client implementation with connection status and transcription display:
Check out the example repository for a complete client implementation with styling and error handling.
Running the Application
From the `simple-chatbot` directory, start the server and client to test the chatbot:
1. Start the Server
In one terminal:
2. Start the Client
In another terminal:
3. Testing the Connection
- Open http://localhost:5173 in your browser
- Click “Connect” to join a room
- Allow microphone access when prompted
- Start talking with your AI assistant
Troubleshooting:
- Check that all API keys are properly configured in .env
- Grant your browser access to your microphone so it can capture your audio input
- Verify WebRTC ports aren’t blocked by firewalls
Next Steps
Now that you have a working chatbot, consider these enhancements:
- Add custom avatar animations
- Implement function calling for external integrations
- Add support for multiple languages
- Enhance error recovery and reconnection logic
Examples
Foundational Example
A basic implementation demonstrating core Gemini Multimodal Live features and transcription capabilities
Simple Chatbot
A complete client/server implementation showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot