GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini model, with built-in audio transcription, voice activity detection, and context management for building interactive AI experiences. It provides:

Real-time Interaction

Stream audio and video in real time with low-latency responses

Speech Processing

Built-in speech-to-text and text-to-speech capabilities with multiple voice options

Voice Activity Detection

Automatic detection of speech start/stop for natural conversations

Context Management

Intelligent handling of conversation history and system instructions

Want to start building? Check out our Gemini Multimodal Live Guide.

Installation

To use GeminiMultimodalLiveLLMService, install the required dependencies:

pip install pipecat-ai[google]

You’ll also need to set your Google API key in the GOOGLE_API_KEY environment variable.
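Since os.getenv returns None when a variable is unset, it can help to check for the key once at startup. A minimal sketch, assuming a hypothetical helper named load_google_api_key (it is not part of Pipecat):

```python
import os


def load_google_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the bot.")
    return key
```

The returned string can then be passed as the api_key argument when constructing the service.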

Basic Usage

Here’s a simple example of setting up a conversational AI bot with Gemini Multimodal Live:

import os

from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService, InputParams

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Aoede",                    # Voices: Aoede, Charon, Fenrir, Kore, Puck
    transcribe_user_audio=True,          # Enable speech-to-text for user input
    transcribe_model_audio=True,         # Enable speech-to-text for model responses
    params=InputParams(temperature=0.7)  # Set model input params
)

Configuration

Constructor Parameters

api_key
str
required

Your Google API key

base_url
str
default:
"preprod-generativelanguage.googleapis.com"

API endpoint URL

model
str
default:
"models/gemini-2.0-flash-exp"

Gemini model to use

voice_id
str
default:
"Charon"

Voice for text-to-speech (options: Aoede, Charon, Fenrir, Kore, Puck)

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # Choose your preferred voice
)

transcribe_user_audio
bool
default:
"False"

Enable transcription of user audio

transcribe_model_audio
bool
default:
"False"

Enable transcription of model responses

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    transcribe_user_audio=True,   # Log user speech as text
    transcribe_model_audio=True,  # Log model responses as text
)

system_instruction
str

High-level instructions that guide the model’s behavior

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="Talk like a pirate.",
)

Input Parameters

frequency_penalty
float
default:
"None"

Penalizes repeated tokens. Range: 0.0 to 2.0

max_tokens
int
default:
"4096"

Maximum number of tokens to generate

modalities
enum
default:
"AUDIO"

Response modalities to include (options: AUDIO, TEXT).

presence_penalty
float
default:
"None"

Penalizes tokens based on their presence in the text. Range: 0.0 to 2.0

temperature
float
default:
"None"

Controls randomness in responses. Range: 0.0 to 2.0

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        temperature=0.7,  # More creative responses
    )
)
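To build intuition for what temperature does, the sketch below applies the standard temperature-scaled softmax to made-up scores. It illustrates the general technique, not Gemini's internal sampler; the function name and the scores are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature.

    Illustrative only. This sketch rejects 0 to avoid dividing by zero,
    even though the documented range starts at 0.0.
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across candidates.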

top_k
int
default:
"None"

Limits sampling to the k most likely tokens. Minimum: 0

top_p
float
default:
"None"

Cumulative probability cutoff for token selection. Range: 0.0 to 1.0

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        top_p=0.9,     # More focused token selection
        top_k=40       # Limit vocabulary options
    )
)
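The interaction between top_k and top_p can be sketched on a toy distribution. This is the generic top-k / nucleus filtering technique, not a description of how Gemini implements it; the function name, the sample tokens, and the choice to treat a top_k of 0 or None as "no limit" are all assumptions of this sketch.

```python
def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the most likely tokens under top-k and top-p, then renormalize.

    probs maps token -> probability. Illustrative only.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k:  # assumption: None or 0 means no top-k limit
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p  # uses the original, unrenormalized probabilities
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


probs = {"ahoy": 0.5, "hello": 0.3, "hi": 0.15, "yo": 0.05}
print(filter_top_k_top_p(probs, top_k=3, top_p=0.7))
```

With these values, top_k first narrows the candidates to three tokens, and top_p then keeps only the leading tokens needed to reach a cumulative probability of 0.7.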

Frame Types

Input Frames

InputAudioRawFrame
Frame

Raw audio data for speech input

StartInterruptionFrame
Frame

Signals start of user interruption

UserStartedSpeakingFrame
Frame

Signals user started speaking

UserStoppedSpeakingFrame
Frame

Signals user stopped speaking

OpenAILLMContextFrame
Frame

Contains conversation context

Output Frames

TTSAudioRawFrame
Frame

Generated speech audio

TTSStartedFrame
Frame

Signals start of speech synthesis

TTSStoppedFrame
Frame

Signals end of speech synthesis

TextFrame
Frame

Generated text responses

TranscriptionFrame
Frame

Speech transcriptions
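Downstream code typically handles output frames by checking their type. The sketch below shows that dispatch pattern using simplified stand-in classes named after the frames listed above. They are not Pipecat's classes (the real frames carry more fields), and OutputCollector is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass, field


# Simplified stand-ins named after the frames listed above.
# They are NOT Pipecat's classes; real frames carry more fields.
@dataclass
class TextFrame:
    text: str


@dataclass
class TranscriptionFrame:
    text: str


@dataclass
class TTSAudioRawFrame:
    audio: bytes


@dataclass
class OutputCollector:
    """Hypothetical sink that sorts output frames by type."""

    model_text: list = field(default_factory=list)
    transcripts: list = field(default_factory=list)
    audio: bytearray = field(default_factory=bytearray)

    def handle(self, frame) -> None:
        # Check the more specific type first; this matters whenever
        # one frame type subclasses another.
        if isinstance(frame, TranscriptionFrame):
            self.transcripts.append(frame.text)
        elif isinstance(frame, TextFrame):
            self.model_text.append(frame.text)
        elif isinstance(frame, TTSAudioRawFrame):
            self.audio.extend(frame.audio)
        # Any other frame is ignored in this sketch.


collector = OutputCollector()
for frame in [TranscriptionFrame("hello"), TextFrame("Ahoy!"), TTSAudioRawFrame(b"\x00\x01")]:
    collector.handle(frame)
print(collector.transcripts, collector.model_text, len(collector.audio))
```

In a real pipeline the same type checks would live inside a frame processor, with the actual frame classes imported from Pipecat.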

Next Steps

Examples

  • Foundational Example: Basic implementation showing core features and transcription

  • Simple Chatbot: A client/server example showing how to build a Pipecat JS or React client that connects to a Gemini Live Pipecat bot

Learn More

Check out our Gemini Multimodal Live Guide for detailed explanations and best practices.