This guide will walk you through building a real-time AI chatbot using Gemini Multimodal Live and Pipecat. We’ll create a complete application with a Pipecat server and a Pipecat React client that enables natural conversations with an AI assistant.

What We’ll Build

In this guide, you’ll create:

  • A FastAPI server that manages bot instances
  • A Gemini-powered conversational AI bot
  • A React client with real-time audio/video
  • A complete pipeline for speech-to-speech interaction

Key Concepts

Before we dive into implementation, let’s cover some important concepts that will help you understand how Pipecat and Gemini work together.

Understanding Pipelines

At the heart of Pipecat is the pipeline system. A pipeline is a sequence of processors that handle different aspects of the conversation flow. Think of it like an assembly line where each station (processor) performs a specific task.

For our chatbot, the pipeline looks like this:

pipeline = Pipeline([
    transport.input(),             # Receives audio/video from the user via WebRTC
    context_aggregator.user(),     # Manages user message history
    llm,                           # Processes speech through Gemini
    rtvi_speaking,                 # Tracks speaking states
    rtvi_user_transcription,       # Handles user speech transcription
    rtvi_bot_transcription,        # Handles bot speech transcription
    talking_animation,             # Controls bot's avatar
    transport.output(),            # Sends audio/video back to the user via WebRTC
])

Processors

Each processor in the pipeline handles a specific task:

Transport

transport.input() and transport.output() handle media streaming with Daily

Context

context_aggregator maintains conversation history for natural dialogue

Speech Processing

rtvi_user_transcription and rtvi_bot_transcription handle speech-to-text

Animation

talking_animation controls the bot’s visual state based on speaking activity

The order of processors matters! Data flows through the pipeline in sequence, so each processor should receive the data it needs from previous processors.

Learn more about the core concepts of the Pipecat server.

Gemini Integration

The GeminiMultimodalLiveLLMService is a speech-to-speech LLM service that interfaces with the Gemini Multimodal Live API.

It provides:

  • Real-time speech-to-speech conversation
  • Context management
  • Voice activity detection
  • Tool use

Pipecat manages two types of connections:

  1. A WebRTC connection between the Pipecat client and server for reliable audio/video streaming
  2. A WebSocket connection between the Pipecat server and Gemini for real-time AI processing

This architecture ensures stable media streaming while maintaining responsive AI interactions.

Prerequisites

Before we begin, you’ll need:

  • Python 3.10 or higher
  • Node.js 16 or higher
  • A Daily API key
  • A Google API key with Gemini Multimodal Live access
  • Clone the Pipecat repo:
git clone git@github.com:pipecat-ai/pipecat.git

Server Implementation

Let’s start by setting up the server components. Our server will handle bot management, room creation, and client connections.

Environment Setup

  1. Navigate to the simple-chatbot’s server directory:
cd examples/simple-chatbot/server
  2. Set up a Python virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install requirements:
pip install -r requirements.txt
  4. Copy env.example to .env and configure your API keys:
# Add your Daily and Gemini API keys
DAILY_API_KEY=
GEMINI_API_KEY=

# Use Gemini implementation
BOT_IMPLEMENTATION=gemini

Server Setup (server.py)

server.py is a FastAPI server that creates the meeting room where clients and bots interact, manages bot instances, and handles client connections. It's the orchestrator that brings everything together on the server side.

Creating a Meeting Room

The server uses Daily’s API via a REST API helper to create rooms where clients and bots can meet. Each room is a secure space for audio/video communication:

async def create_room_and_token():
    """Create a Daily room and generate access credentials."""
    room = await daily_helpers["rest"].create_room(DailyRoomParams())
    token = await daily_helpers["rest"].get_token(room.url)
    return room.url, token
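
Before these helpers can be used, server.py initializes a DailyRESTHelper when the app starts. Here's a minimal sketch of that setup, assuming the DAILY_API_KEY from your .env; the exact startup/lifespan wiring in the example may differ slightly:

import os

import aiohttp
from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomParams

daily_helpers = {}

async def setup_daily_helper():
    # Call this once at startup; close the session on shutdown.
    aiohttp_session = aiohttp.ClientSession()
    daily_helpers["rest"] = DailyRESTHelper(
        daily_api_key=os.getenv("DAILY_API_KEY", ""),
        daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"),
        aiohttp_session=aiohttp_session,
    )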

Managing Bot Instances

When a client connects, the server starts a new bot instance configured specifically for that room. It keeps track of running bots and ensures there’s only one bot per room:

# Start the bot process for a specific room
bot_file = "bot-gemini.py"
proc = subprocess.Popen(["python3", bot_file, "-u", room_url, "-t", token])
bot_procs[proc.pid] = (proc, room_url)
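
To enforce the one-bot-per-room rule, the server can check its bookkeeping before spawning a new process. A rough sketch; the guard logic in the example may be organized differently:

from fastapi import HTTPException

# Count live bot processes already attached to this room
num_bots_in_room = sum(
    1 for proc, url in bot_procs.values()
    if url == room_url and proc.poll() is None  # poll() is None means still running
)
if num_bots_in_room > 0:
    raise HTTPException(status_code=500, detail=f"Bot already active for room: {room_url}")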

Connection Endpoints

The server provides two ways to connect:

Browser Access (/)

Creates a room, starts a bot, and redirects the browser to the Daily meeting URL. Perfect for quick testing and development.

RTVI Client (/connect)

Creates a room, starts a bot, and returns connection credentials. Used by RTVI clients for custom implementations.
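
Putting the pieces together, the /connect endpoint looks roughly like this. This is a simplified sketch that assumes app = FastAPI(), the bot_procs dict, and create_room_and_token() from above; see server.py in the example for the full version with error handling:

@app.post("/connect")
async def rtvi_connect() -> dict:
    """Create a room, start a bot, and return credentials for an RTVI client."""
    room_url, token = await create_room_and_token()

    # Start a bot instance for this room (see "Managing Bot Instances" above)
    proc = subprocess.Popen(["python3", "bot-gemini.py", "-u", room_url, "-t", token])
    bot_procs[proc.pid] = (proc, room_url)

    # The Pipecat client joins the Daily room with these credentials
    return {"room_url": room_url, "token": token}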

Bot Implementation (bot-gemini.py)

The bot implementation connects all the pieces: Daily transport, Gemini service, conversation context, and processors.

Let’s break down each component:

Transport Setup

First, we configure the Daily transport, which handles WebRTC communication between the client and server.

transport = DailyTransport(
    room_url,
    token,
    "Chatbot",
    DailyParams(
        audio_in_sample_rate=16000,   # 16khz input sample rate
        audio_out_sample_rate=24000,  # 24khz output sample rate
        audio_out_enabled=True,       # Enable audio output
        camera_out_enabled=True,      # Enable video output
        vad_enabled=True,             # Enable voice activity detection, 0.5 sec stop time
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
    ),
)

Gemini Multimodal Live audio requirements:

  • Input: 16 kHz sample rate
  • Output: 24 kHz sample rate

Gemini Service Configuration

Next, we initialize the Gemini service which will provide speech-to-speech inference and communication:

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GEMINI_API_KEY"),
    voice_id="Puck",                     # Choose your bot's voice
    transcribe_user_audio=True,          # Enable speech-to-text
    transcribe_model_audio=True,         # Log bot responses
    params=InputParams(temperature=0.7)  # Set model input params
)

Conversation Context

We give our bot its personality and initial instructions:

messages = [{
    "role": "user",
    "content": """You are Chatbot, a friendly, helpful robot.
                 Keep responses brief and avoid special characters
                 since output will be converted to audio."""
}]

context = OpenAILLMContext(messages)
context_aggregator = llm.create_context_aggregator(context)

OpenAILLMContext serves as the common base for context management across LLM services. In the future, we may add a context manager specific to Gemini.

The context aggregator automatically maintains conversation history, helping the bot remember previous interactions.

Processor Setup

We initialize several processors to handle different aspects of the interaction:

RTVI Processors

  • RTVISpeakingProcessor: Manages speaking states
  • RTVITranscriptionProcessor: Handles transcription events
  • RTVIMetricsProcessor: Tracks performance metrics

Animation

TalkingAnimation: Controls the bot’s visual state, switching between static and animated frames based on speaking status

Learn more about the RTVI framework and available processors.
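
In bot-gemini.py these processors are plain Python objects created before the pipeline is assembled. The sketch below uses the class names from this guide; it is illustrative only, and constructor arguments can vary between Pipecat versions, so check the example source for the exact calls:

# Illustrative only -- see bot-gemini.py for the exact classes and arguments
rtvi_speaking = RTVISpeakingProcessor()                 # Emits speaking start/stop events
rtvi_user_transcription = RTVITranscriptionProcessor()  # Forwards user transcripts to the client
rtvi_bot_transcription = RTVITranscriptionProcessor()   # Forwards bot transcripts to the client
rtvi_metrics = RTVIMetricsProcessor()                   # Reports latency and usage metrics
ta = TalkingAnimation()                                 # Custom processor defined in bot-gemini.py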

Pipeline Assembly

Finally, we bring everything together in a pipeline:

pipeline = Pipeline([
    transport.input(),             # Receive media
    context_aggregator.user(),     # Process user context
    llm,                           # Gemini processing
    rtvi_speaking,                 # Speaking states
    rtvi_user_transcription,       # User transcripts
    rtvi_bot_transcription,        # Bot transcripts
    ta,                            # Animation
    rtvi_metrics,                  # Metrics
    transport.output(),            # Send media
    context_aggregator.assistant() # Process bot context
])

The order of processors is crucial! For example, transcription processors should come after the LLM to capture the processed speech.

Traditional STT, LLM, and TTS pipelines use a different ordering, so tailor your processor order to the elements in your pipeline.
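
To actually run the pipeline, bot-gemini.py wraps it in a task and hands it to a runner. Here's a minimal sketch, assuming Pipecat's standard task/runner API, with interruptions enabled so the user can talk over the bot:

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Inside the bot's async main()
task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

runner = PipelineRunner()
await runner.run(task)

The example also kicks off the conversation when the first participant joins the Daily room by queueing the initial context, which is what prompts the bot to greet the user.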

Client Implementation

Our React client uses the Pipecat React SDK to communicate with the bot. Let’s explore how the client connects and interacts with our Pipecat server.

Connection Setup

The client needs to connect to our bot server using the same transport type (Daily WebRTC) that we configured on the server:

const transport = new DailyTransport();

const client = new RTVIClient({
  transport,
  params: {
    baseUrl: "http://localhost:7860", // Your bot server address
    endpoints: {
      connect: "/connect", // Matches server.py endpoint
    },
  },
  enableMic: true, // Enable audio input
  enableCam: false, // Disable video input
});

The connection configuration must match your server:

  • DailyTransport: Matches the WebRTC transport used in bot-gemini.py
  • connect endpoint: Matches the /connect route in server.py
  • Media settings: Controls which devices are enabled on join

Media Handling

Pipecat’s React components handle all the complex media stream management for you:

function App() {
  return (
    <RTVIClientProvider client={client}>
      <div className="app">
        <RTVIClientVideo participant="bot" /> {/* Bot's video feed */}
        <RTVIClientAudio /> {/* Audio input/output */}
      </div>
    </RTVIClientProvider>
  );
}

The RTVIClientProvider is the root component that supplies the RTVI client context to your application. Wrapping your RTVIClientAudio and RTVIClientVideo components in this provider lets them access the client instance and render the media streams received from the Pipecat server.

Real-time Events

The RTVI processors we configured in the pipeline emit events that we can handle in our client:

// Listen for transcription events
useRTVIClientEvent(RTVIEvent.UserTranscript, (data: TranscriptData) => {
  if (data.final) {
    console.log(`User said: ${data.text}`);
  }
});

// Listen for bot responses
useRTVIClientEvent(RTVIEvent.BotTranscript, (data: BotLLMTextData) => {
  console.log(`Bot responded: ${data.text}`);
});

Available Events

  • Speaking state changes
  • Transcription updates
  • Bot responses
  • Connection status
  • Performance metrics

Event Usage

Use these events to:

  • Show speaking indicators
  • Display transcripts
  • Update UI state
  • Monitor performance

Optionally, use callbacks to handle events in your application. Learn more in the Pipecat client docs.

Complete Example

Here’s a basic client implementation with connection status and transcription display:

function ChatApp() {
  return (
    <RTVIClientProvider client={client}>
      <div className="app">
        {/* Connection UI */}
        <StatusDisplay />
        <ConnectButton />

        {/* Media Components */}
        <BotVideo />
        <RTVIClientAudio />

        {/* Debug/Transcript Display */}
        <DebugDisplay />
      </div>
    </RTVIClientProvider>
  );
}

Check out the example repository for a complete client implementation with styling and error handling.

Running the Application

From the simple-chatbot directory, start the server and client to test the chatbot:

1. Start the Server

In one terminal:

python server/server.py

2. Start the Client

In another terminal:

cd examples/react
npm install
npm run dev

3. Testing the Connection

  1. Open http://localhost:5173 in your browser
  2. Click “Connect” to join a room
  3. Allow microphone access when prompted
  4. Start talking with your AI assistant

Troubleshooting:

  • Check that all API keys are properly configured in .env
  • Ensure your browser has permission to access your microphone
  • Verify WebRTC ports aren’t blocked by firewalls

Next Steps

Now that you have a working chatbot, consider these enhancements:

  • Add custom avatar animations
  • Implement function calling for external integrations
  • Add support for multiple languages
  • Enhance error recovery and reconnection logic
