This guide will walk you through building a real-time AI chatbot using Gemini Multimodal Live and Pipecat. We’ll create a complete application with a Pipecat server and a Pipecat React client that enables natural conversations with an AI assistant.

What We’ll Build

In this guide, you’ll create:

  • A FastAPI server that manages bot instances
  • A Gemini-powered conversational AI bot
  • A React client with real-time audio/video
  • A complete pipeline for speech-to-speech interaction

Key Concepts

Before we dive into implementation, let’s cover some important concepts that will help you understand how Pipecat and Gemini work together.

Understanding Pipelines

At the heart of Pipecat is the pipeline system. A pipeline is a sequence of processors that handle different aspects of the conversation flow. Think of it like an assembly line where each station (processor) performs a specific task.

For our chatbot, the pipeline looks like this:

pipeline = Pipeline([
    transport.input(),             # Receives audio/video from the user via WebRTC
    context_aggregator.user(),     # Manages user message history
    llm,                           # Processes speech through Gemini
    rtvi_speaking,                 # Tracks speaking states
    rtvi_user_transcription,       # Handles user speech transcription
    rtvi_bot_transcription,        # Handles bot speech transcription
    talking_animation,             # Controls bot's avatar
    transport.output(),            # Sends audio/video back to the user via WebRTC
])

Processors

Each processor in the pipeline handles a specific task:

Transport

transport.input() and transport.output() handle media streaming with Daily

Context

context_aggregator maintains conversation history for natural dialogue

Speech Processing

rtvi_user_transcription and rtvi_bot_transcription handle speech-to-text

Animation

talking_animation controls the bot’s visual state based on speaking activity

The order of processors matters! Data flows through the pipeline in sequence, so each processor should receive the data it needs from previous processors.

Learn more about the core concepts of the Pipecat server.

Gemini Integration

The GeminiMultimodalLiveLLMService is a speech-to-speech LLM service that interfaces with the Gemini Multimodal Live API.

It provides:

  • Real-time speech-to-speech conversation
  • Context management
  • Voice activity detection
  • Tool use

Pipecat manages two types of connections:

  1. A WebRTC connection between the Pipecat client and server for reliable audio/video streaming
  2. A WebSocket connection between the Pipecat server and Gemini for real-time AI processing

This architecture ensures stable media streaming while maintaining responsive AI interactions.

Prerequisites

Before we begin, you’ll need:

  • Python 3.10 or higher
  • Node.js 16 or higher
  • A Daily API key
  • A Google API key with Gemini Multimodal Live access
  • Clone the Pipecat repo:
git clone git@github.com:pipecat-ai/pipecat.git

Server Implementation

Let’s start by setting up the server components. Our server will handle bot management, room creation, and client connections.

Environment Setup

  1. Navigate to the simple-chatbot’s server directory:
cd examples/simple-chatbot/server
  2. Set up a Python virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install requirements:
pip install -r requirements.txt
  4. Copy env.example to .env and configure your API keys:
# Add your Daily and Gemini API keys
DAILY_API_KEY=
GEMINI_API_KEY=

# Use Gemini implementation
BOT_IMPLEMENTATION=gemini

Server Setup (server.py)

server.py is a FastAPI server that creates the meeting room where clients and bots interact, manages bot instances, and handles client connections. It's the orchestrator that brings everything together on the server side.

Creating a Meeting Room

The server uses Daily’s API via a REST API helper to create rooms where clients and bots can meet. Each room is a secure space for audio/video communication:

async def create_room_and_token():
    """Create a Daily room and generate access credentials."""
    room = await daily_helpers["rest"].create_room(DailyRoomParams())
    token = await daily_helpers["rest"].get_token(room.url)
    return room.url, token
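
Before these helpers can be used, server.py initializes a DailyRESTHelper when the app starts. Here's a minimal sketch of that setup, assuming the DAILY_API_KEY from your .env; the exact startup/lifespan wiring in the example may differ slightly:

import os

import aiohttp
from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomParams

daily_helpers = {}

async def setup_daily_helper():
    # Call this once at startup; close the session on shutdown.
    aiohttp_session = aiohttp.ClientSession()
    daily_helpers["rest"] = DailyRESTHelper(
        daily_api_key=os.getenv("DAILY_API_KEY", ""),
        daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"),
        aiohttp_session=aiohttp_session,
    )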

Managing Bot Instances

When a client connects, the server starts a new bot instance configured specifically for that room. It keeps track of running bots and ensures there’s only one bot per room:

# Start the bot process for a specific room
bot_file = "bot-gemini.py"
proc = subprocess.Popen(["python3", bot_file, "-u", room_url, "-t", token])
bot_procs[proc.pid] = (proc, room_url)
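
To enforce the one-bot-per-room rule, the server can check its bookkeeping before spawning a new process. A rough sketch; the guard logic in the example may be organized differently:

from fastapi import HTTPException

# Count live bot processes already attached to this room
num_bots_in_room = sum(
    1 for proc, url in bot_procs.values()
    if url == room_url and proc.poll() is None  # poll() is None means still running
)
if num_bots_in_room > 0:
    raise HTTPException(status_code=500, detail=f"Bot already active for room: {room_url}")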

Connection Endpoints

The server provides two ways to connect:

Browser Access (/)

Creates a room, starts a bot, and redirects the browser to the Daily meeting URL. Perfect for quick testing and development.

RTVI Client (/connect)

Creates a room, starts a bot, and returns connection credentials. Used by RTVI clients for custom implementations.
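
Putting the pieces together, the /connect endpoint looks roughly like this. This is a simplified sketch that assumes app = FastAPI(), the bot_procs dict, and create_room_and_token() from above; see server.py in the example for the full version with error handling:

@app.post("/connect")
async def rtvi_connect() -> dict:
    """Create a room, start a bot, and return credentials for an RTVI client."""
    room_url, token = await create_room_and_token()

    # Start a bot instance for this room (see "Managing Bot Instances" above)
    proc = subprocess.Popen(["python3", "bot-gemini.py", "-u", room_url, "-t", token])
    bot_procs[proc.pid] = (proc, room_url)

    # The Pipecat client joins the Daily room with these credentials
    return {"room_url": room_url, "token": token}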

Bot Implementation (bot-gemini.py)

The bot implementation connects all the pieces: Daily transport, Gemini service, conversation context, and processors.

Let’s break down each component:

Transport Setup

First, we configure the Daily transport, which handles WebRTC communication between the client and server.

transport = DailyTransport(
    room_url,
    token,
    "Chatbot",
    DailyParams(
        audio_in_sample_rate=16000,   # 16khz input sample rate
        audio_out_sample_rate=24000,  # 24khz output sample rate
        audio_out_enabled=True,       # Enable audio output
        camera_out_enabled=True,      # Enable video output
        vad_enabled=True,             # Enable voice activity detection, 0.5 sec stop time
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
    ),
)

Gemini Multimodal Live audio requirements:

  • Input: 16 kHz sample rate
  • Output: 24 kHz sample rate

Gemini Service Configuration

Next, we initialize the Gemini service which will provide speech-to-speech inference and communication:

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GEMINI_API_KEY"),
    voice_id="Puck",                     # Choose your bot's voice
    transcribe_user_audio=True,          # Enable speech-to-text
    transcribe_model_audio=True,         # Log bot responses
    params=InputParams(temperature=0.7)  # Set model input params
)

Conversation Context

We give our bot its personality and initial instructions:

messages = [{
    "role": "user",
    "content": """You are Chatbot, a friendly, helpful robot.
                 Keep responses brief and avoid special characters
                 since output will be converted to audio."""
}]

context = OpenAILLMContext(messages)
context_aggregator = llm.create_context_aggregator(context)

OpenAILLMContext serves as the common base for context management across LLM services. In the future, we may add a context manager specific to Gemini.

The context aggregator automatically maintains conversation history, helping the bot remember previous interactions.

Processor Setup

We initialize several processors to handle different aspects of the interaction:

RTVI Processors

  • RTVISpeakingProcessor: Manages speaking states
  • RTVITranscriptionProcessor: Handles transcription events
  • RTVIMetricsProcessor: Tracks performance metrics

Animation

TalkingAnimation: Controls the bot’s visual state, switching between static and animated frames based on speaking status

Learn more about the RTVI framework and available processors.
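
In bot-gemini.py these processors are plain Python objects created before the pipeline is assembled. The sketch below uses the class names from this guide; it is illustrative only, and constructor arguments can vary between Pipecat versions, so check the example source for the exact calls:

# Illustrative only -- see bot-gemini.py for the exact classes and arguments
rtvi_speaking = RTVISpeakingProcessor()                 # Emits speaking start/stop events
rtvi_user_transcription = RTVITranscriptionProcessor()  # Forwards user transcripts to the client
rtvi_bot_transcription = RTVITranscriptionProcessor()   # Forwards bot transcripts to the client
rtvi_metrics = RTVIMetricsProcessor()                   # Reports latency and usage metrics
ta = TalkingAnimation()                                 # Custom processor defined in bot-gemini.py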

Pipeline Assembly

Finally, we bring everything together in a pipeline:

pipeline = Pipeline([
    transport.input(),             # Receive media
    context_aggregator.user(),     # Process user context
    llm,                           # Gemini processing
    rtvi_speaking,                 # Speaking states
    rtvi_user_transcription,       # User transcripts
    rtvi_bot_transcription,        # Bot transcripts
    ta,                            # Animation
    rtvi_metrics,                  # Metrics
    transport.output(),            # Send media
    context_aggregator.assistant() # Process bot context
])

The order of processors is crucial! For example, transcription processors should come after the LLM to capture the processed speech.

Traditional STT, LLM, and TTS pipelines use a different ordering, so tailor your processor order to the elements in your pipeline.
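
To actually run the pipeline, bot-gemini.py wraps it in a task and hands it to a runner. Here's a minimal sketch, assuming Pipecat's standard task/runner API, with interruptions enabled so the user can talk over the bot:

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Inside the bot's async main()
task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

runner = PipelineRunner()
await runner.run(task)

The example also kicks off the conversation when the first participant joins the Daily room by queueing the initial context, which is what prompts the bot to greet the user.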

Client Implementation

Our React client uses the Pipecat React SDK to communicate with the bot. Let’s explore how the client connects and interacts with our Pipecat server.

Connection Setup

The client needs to connect to our bot server using the same transport type (Daily WebRTC) that we configured on the server:

const transport = new DailyTransport();

const client = new RTVIClient({
  transport,
  params: {
    baseUrl: "http://localhost:7860", // Your bot server address
    endpoints: {
      connect: "/connect", // Matches server.py endpoint
    },
  },
  enableMic: true, // Enable audio input
  enableCam: false, // Disable video input
});

The connection configuration must match your server:

  • DailyTransport: Matches the WebRTC transport used in bot-gemini.py
  • connect endpoint: Matches the /connect route in server.py
  • Media settings: Controls which devices are enabled on join

Media Handling

Pipecat’s React components handle all the complex media stream management for you:

function App() {
  return (
    <RTVIClientProvider client={client}>
      <div className="app">
        <RTVIClientVideo participant="bot" /> {/* Bot's video feed */}
        <RTVIClientAudio /> {/* Audio input/output */}
      </div>
    </RTVIClientProvider>
  );
}

The RTVIClientProvider is the root component that supplies the RTVI client context to your application. Wrapping your RTVIClientAudio and RTVIClientVideo components in this provider lets them access the client instance and render the media streams received from the Pipecat server.

Real-time Events

The RTVI processors we configured in the pipeline emit events that we can handle in our client:

// Listen for transcription events
useRTVIClientEvent(RTVIEvent.UserTranscript, (data: TranscriptData) => {
  if (data.final) {
    console.log(`User said: ${data.text}`);
  }
});

// Listen for bot responses
useRTVIClientEvent(RTVIEvent.BotTranscript, (data: BotLLMTextData) => {
  console.log(`Bot responded: ${data.text}`);
});

Available Events

  • Speaking state changes
  • Transcription updates
  • Bot responses
  • Connection status
  • Performance metrics

Event Usage

Use these events to:

  • Show speaking indicators
  • Display transcripts
  • Update UI state
  • Monitor performance

Optionally, use callbacks to handle events in your application. Learn more in the Pipecat client docs.

Complete Example

Here’s a basic client implementation with connection status and transcription display:

function ChatApp() {
  return (
    <RTVIClientProvider client={client}>
      <div className="app">
        {/* Connection UI */}
        <StatusDisplay />
        <ConnectButton />

        {/* Media Components */}
        <BotVideo />
        <RTVIClientAudio />

        {/* Debug/Transcript Display */}
        <DebugDisplay />
      </div>
    </RTVIClientProvider>
  );
}

Check out the example repository for a complete client implementation with styling and error handling.

Running the Application

From the simple-chatbot directory, start the server and client to test the chatbot:

1. Start the Server

In one terminal:

python server/server.py

2. Start the Client

In another terminal:

cd examples/react
npm install
npm run dev

3. Testing the Connection

  1. Open http://localhost:5173 in your browser
  2. Click “Connect” to join a room
  3. Allow microphone access when prompted
  4. Start talking with your AI assistant

Troubleshooting:

  • Check that all API keys are properly configured in .env
  • Ensure your browser has permission to access your microphone
  • Verify WebRTC ports aren’t blocked by firewalls

Next Steps

Now that you have a working chatbot, consider these enhancements:

  • Add custom avatar animations
  • Implement function calling for external integrations
  • Add support for multiple languages
  • Enhance error recovery and reconnection logic
