
Overview

OpenAIRealtimeLLMService provides real-time, multimodal conversation capabilities using OpenAI’s Realtime API. It supports speech-to-speech interactions with integrated LLM processing, function calling, and advanced conversation management, all with low-latency responses.

  • OpenAI Realtime API Reference: Pipecat’s API methods for OpenAI Realtime integration
  • Example Implementation: Complete OpenAI Realtime conversation example
  • OpenAI Documentation: Official OpenAI Realtime API documentation
  • OpenAI Platform: Access Realtime models and manage API keys

Installation

To use OpenAI Realtime services, install the required dependencies:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI Realtime services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an OpenAI API key from your account dashboard
  3. Model Access: Ensure access to GPT-4o Realtime models
  4. Usage Limits: Configure appropriate usage limits and billing

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication
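For local development, the key is typically exported in the shell before starting the bot, so the service can read it via os.getenv() (the key value below is a placeholder):

```shell
# Export the OpenAI API key for the current shell session
export OPENAI_API_KEY="sk-your-key-here"
```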

Key Features

  • Real-time Speech-to-Speech: Direct audio processing with minimal latency
  • Advanced Turn Detection: Multiple voice activity detection options including semantic detection
  • Function Calling: Seamless support for external functions and APIs
  • Voice Options: Multiple voice personalities and speaking styles
  • Conversation Management: Intelligent context handling and conversation flow control

Configuration

OpenAIRealtimeLLMService

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | OpenAI API key for authentication. |
| model (deprecated) | str | "gpt-realtime-1.5" | OpenAI Realtime model name. This is a connection-level parameter set via the WebSocket URL and cannot be changed during the session. Deprecated in v0.0.105; use settings=OpenAIRealtimeLLMService.Settings(model=...) instead. |
| base_url | str | "wss://api.openai.com/v1/realtime" | WebSocket base URL for the Realtime API. Override for custom or proxied deployments. |
| session_properties (deprecated) | SessionProperties | None | Configuration properties for the realtime session. These are session-level settings that can be updated during the session (except for voice and model). See SessionProperties below. Deprecated in v0.0.105; use settings=OpenAIRealtimeLLMService.Settings(session_properties=...) instead. |
| settings | OpenAIRealtimeLLMService.Settings | None | Runtime-configurable settings. See Settings below. |
| start_audio_paused | bool | False | Whether to start with audio input paused. Useful when you want to control when audio processing begins. |
| start_video_paused | bool | False | Whether to start with video input paused. |
| video_frame_detail | str | "auto" | Detail level for video processing: "auto" lets the model decide, "low" is faster and uses fewer tokens, "high" provides more detail. |
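As a sketch of the video-related constructor options (the parameter values here are illustrative, not recommendations):

```python
import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService

# Start with video paused and use the cheaper "low" detail level;
# video input can be resumed later with llm.set_video_input_paused(False).
llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    start_video_paused=True,
    video_frame_detail="low",
)
```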

Settings

Runtime-configurable settings passed via the settings constructor argument using OpenAIRealtimeLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | NOT_GIVEN | Model identifier. (Inherited from base settings.) |
| system_instruction | str | NOT_GIVEN | System instruction/prompt. (Inherited from base settings.) |
| session_properties | SessionProperties | NOT_GIVEN | Session-level configuration (modalities, audio, tools, etc.). |
NOT_GIVEN values are omitted, letting the service use its own defaults ("gpt-realtime-1.5" for model). Only parameters that are explicitly set are included.

SessionProperties

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_modalities | List[Literal["text", "audio"]] | None | Modalities the model can respond with. The API supports single-modality responses: either ["text"] or ["audio"]. |
| instructions | str | None | System instructions for the assistant. |
| audio | AudioConfiguration | None | Configuration for input and output audio (format, transcription, turn detection, voice, speed). |
| tools | List[Dict] | None | Available function tools for the assistant. |
| tool_choice | Literal["auto", "none", "required"] | None | Tool usage strategy. |
| max_output_tokens | int or Literal["inf"] | None | Maximum tokens in response, or "inf" for unlimited. |
| tracing | Literal["auto"] or Dict | None | Configuration options for tracing. |
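The tools field takes function definitions in the Realtime API's flat function-tool format. A minimal sketch, where the get_weather tool and its schema are made up for illustration:

```python
from pipecat.services.openai.realtime.events import SessionProperties

# Hypothetical weather-lookup tool in the Realtime API's flat
# function-tool format (type / name / description / parameters).
session_properties = SessionProperties(
    tools=[
        {
            "type": "function",
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    tool_choice="auto",
)
```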

AudioConfiguration

The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations.

AudioInput (audio.input):

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Input audio format (PCMAudioFormat, PCMUAudioFormat, or PCMAAudioFormat). |
| transcription | InputAudioTranscription | None | Transcription settings: model (e.g. "gpt-4o-transcribe"), language, and prompt. |
| noise_reduction | InputAudioNoiseReduction | None | Noise reduction type: "near_field" or "far_field". |
| turn_detection | TurnDetection, SemanticTurnDetection, or bool | None | Turn detection config, or False to disable server-side turn detection. |

AudioOutput (audio.output):

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Output audio format. |
| voice | str | None | Voice the model uses to respond (e.g. "alloy", "echo", "shimmer"). |
| speed | float | None | Speed of the model’s spoken response. |

TurnDetection

Server-side VAD configuration via TurnDetection:
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["server_vad"] | "server_vad" | Detection type. |
| threshold | float | 0.5 | Voice activity detection threshold (0.0-1.0). |
| prefix_padding_ms | int | 300 | Padding before speech starts, in milliseconds. |
| silence_duration_ms | int | 500 | Silence duration to detect speech end, in milliseconds. |
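Tuning the server-side VAD might look like the sketch below (the threshold and duration values are illustrative, and TurnDetection is assumed to be exported from the same events module as SemanticTurnDetection):

```python
from pipecat.services.openai.realtime.events import (
    AudioConfiguration,
    AudioInput,
    SessionProperties,
    TurnDetection,
)

# Slightly less sensitive VAD with a longer end-of-speech silence window,
# which can help in noisy environments (values are illustrative).
session_properties = SessionProperties(
    audio=AudioConfiguration(
        input=AudioInput(
            turn_detection=TurnDetection(
                threshold=0.6,
                prefix_padding_ms=300,
                silence_duration_ms=800,
            ),
        ),
    ),
)
```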
Alternatively, use SemanticTurnDetection for semantic-based detection:

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["semantic_vad"] | "semantic_vad" | Detection type. |
| eagerness | Literal["low", "medium", "high", "auto"] | None | Turn detection eagerness level. |
| create_response | bool | None | Whether to automatically create responses on turn detection. |
| interrupt_response | bool | None | Whether to interrupt ongoing responses on turn detection. |

Usage

Basic Setup

import os
from pipecat.services.openai.realtime import OpenAIRealtimeLLMService

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(model="gpt-realtime-1.5"),
)

With Session Configuration

import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    SessionProperties,
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    InputAudioTranscription,
    SemanticTurnDetection,
)

session_properties = SessionProperties(
    audio=AudioConfiguration(
        input=AudioInput(
            transcription=InputAudioTranscription(model="gpt-4o-transcribe"),
            turn_detection=SemanticTurnDetection(eagerness="medium"),
        ),
        output=AudioOutput(
            voice="alloy",
            speed=1.0,
        ),
    ),
    max_output_tokens=4096,
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(
        model="gpt-realtime-1.5",
        session_properties=session_properties,
        system_instruction="You are a helpful assistant.",
    ),
)

With Disabled Turn Detection (Manual Control)

import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    SessionProperties,
    AudioConfiguration,
    AudioInput,
)

session_properties = SessionProperties(
    audio=AudioConfiguration(
        input=AudioInput(
            turn_detection=False,
        ),
    ),
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(
        model="gpt-realtime-1.5",
        session_properties=session_properties,
        system_instruction="You are a helpful assistant.",
    ),
)

Updating Settings at Runtime

from pipecat.frames.frames import LLMUpdateSettingsFrame
from pipecat.services.openai.realtime.llm import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import SessionProperties

await task.queue_frame(
    LLMUpdateSettingsFrame(
        delta=OpenAIRealtimeLLMService.Settings(
            system_instruction="Now speak in Spanish.",
            session_properties=SessionProperties(
                max_output_tokens=2048,
            ),
        )
    )
)
As of v0.0.105, the model and session_properties constructor parameters are deprecated in favor of settings=OpenAIRealtimeLLMService.Settings(...). See the Service Settings guide for migration details.

Notes

  • Model is connection-level: The model parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
  • Output modalities are single-mode: The API supports either ["text"] or ["audio"] output, not both simultaneously.
  • Turn detection options: Use TurnDetection for traditional VAD, SemanticTurnDetection for AI-based turn detection, or False to disable server-side detection and manage turns manually.
  • Audio output format: The service outputs 24kHz PCM audio by default.
  • Video support: Video frames can be sent to the model for multimodal input. Control the detail level with video_frame_detail and pause/resume with set_video_input_paused().
  • Transcription frames: User speech transcription frames are always emitted upstream when input audio transcription is configured.

Event Handlers

| Event | Description |
|---|---|
| on_conversation_item_created | Called when a new conversation item is created in the session |
| on_conversation_item_updated | Called when a conversation item is updated or completed |

@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")