OpenAI Realtime Beta
Real-time speech-to-speech service implementation using OpenAI’s Realtime Beta API
`OpenAIRealtimeBetaLLMService` provides real-time, multimodal conversation capabilities using OpenAI’s Realtime Beta API. It supports speech-to-speech interactions with integrated LLM processing, function calling, and advanced conversation management.
Real-time Interaction
Stream audio in real time with low-latency responses
Speech Processing
Built-in speech-to-text and text-to-speech capabilities with voice options
Advanced Turn Detection
Multiple voice activity detection options including semantic turn detection
Powerful Function Calling
Seamless support for calling external functions and APIs
Installation
To use `OpenAIRealtimeBetaLLMService`, install the required dependencies:
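Assuming Pipecat’s standard packaging, the OpenAI extra covers this service:

```bash
pip install "pipecat-ai[openai]"
```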
You’ll also need to set your OpenAI API key in the `OPENAI_API_KEY` environment variable:
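```bash
export OPENAI_API_KEY=your-api-key
```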
Configuration
Constructor Parameters
- `api_key` (str, required): Your OpenAI API key
- `model` (str): The speech-to-speech model used for processing
- `base_url` (str): WebSocket endpoint URL
- `session_properties` (SessionProperties): Configuration for the realtime session
- `start_audio_paused` (bool): Whether to start with audio input paused
- `send_transcription_frames` (bool): Whether to emit transcription frames
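A minimal construction sketch (the model string shown is an assumption; check your Pipecat version for the current default):

```python
import os

from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

llm = OpenAIRealtimeBetaLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview-2024-12-17",  # assumed default model name
    start_audio_paused=False,
    send_transcription_frames=True,
)
```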
Session Properties
The `SessionProperties` object configures the behavior of the realtime session:
- `modalities`: The modalities to enable (default includes both text and audio)
- `instructions`: System instructions that guide the model’s behavior
- `voice`: Voice ID for text-to-speech (options: alloy, echo, fable, onyx, nova, shimmer)
- `input_audio_format`: Format of the input audio
- `output_audio_format`: Format of the output audio
- `input_audio_transcription`: Configuration for audio transcription
- `input_audio_noise_reduction`: Configuration for audio noise reduction
- `turn_detection`: Configuration for turn detection (set to `False` to disable)
- `tools`: List of function definitions for tool/function calling
- `tool_choice`: Controls when the model calls functions
- `temperature`: Controls randomness in responses (0.0 to 2.0)
- `max_response_output_tokens`: Maximum number of tokens to generate
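A sketch of a typical session configuration (import paths follow recent Pipecat releases and may differ in yours):

```python
from pipecat.services.openai_realtime_beta import (
    InputAudioTranscription,
    SessionProperties,
    TurnDetection,
)

session_properties = SessionProperties(
    instructions="You are a helpful, concise voice assistant.",
    voice="alloy",
    input_audio_transcription=InputAudioTranscription(),
    turn_detection=TurnDetection(
        type="server_vad",
        threshold=0.5,            # VAD confidence threshold
        prefix_padding_ms=300,    # audio retained before detected speech
        silence_duration_ms=200,  # silence required to end a turn
    ),
    temperature=0.8,
)
```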
Input Frames
Audio Input
- `InputAudioRawFrame`: Raw audio data for speech input

Control Input
- `StartInterruptionFrame`: Signals start of user interruption
- `UserStartedSpeakingFrame`: Signals user started speaking
- `UserStoppedSpeakingFrame`: Signals user stopped speaking

Context Input
- `OpenAILLMContextFrame`: Contains conversation context
- `LLMMessagesAppendFrame`: Appends messages to conversation
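For example, `LLMMessagesAppendFrame` can inject a message mid-session (a sketch; assumes a running `PipelineTask` named `task`):

```python
from pipecat.frames.frames import LLMMessagesAppendFrame

await task.queue_frames([
    LLMMessagesAppendFrame(messages=[{"role": "user", "content": "And in Celsius?"}])
])
```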
Output Frames
Audio Output
- `TTSAudioRawFrame`: Generated speech audio

Control Output
- `TTSStartedFrame`: Signals start of speech synthesis
- `TTSStoppedFrame`: Signals end of speech synthesis

Text Output
- `TextFrame`: Generated text responses
- `TranscriptionFrame`: Speech transcriptions
Events
`on_conversation_item_created`
Emitted when a conversation item on the server is created. Handler receives:
- `item_id: str`
- `item: ConversationItem`

`on_conversation_item_updated`
Emitted when a conversation item on the server is updated. Handler receives:
- `item_id: str`
- `item: Optional[ConversationItem]` (may not be present for some updates)
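A registration sketch (the service instance is passed as the first handler argument, per Pipecat’s event-handler convention):

```python
@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"Conversation item created: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")
```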
Methods
`retrieve_conversation_item(item_id: str)`
Retrieves a conversation item’s details from the server.
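For example (a sketch; assumes `item_id` was captured from one of the events above):

```python
item = await llm.retrieve_conversation_item(item_id)
print(item)
```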
Usage Example
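A minimal end-to-end sketch using a Daily transport (the transport choice, room URL handling, and runner details are assumptions; adapt them to your setup):

```python
import asyncio
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.openai_realtime_beta import (
    InputAudioTranscription,
    OpenAIRealtimeBetaLLMService,
    SessionProperties,
)
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    transport = DailyTransport(
        os.getenv("DAILY_ROOM_URL"),  # your Daily room URL
        None,
        "Realtime bot",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    llm = OpenAIRealtimeBetaLLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        session_properties=SessionProperties(
            instructions="You are a friendly voice assistant. Keep replies short.",
            voice="alloy",
            input_audio_transcription=InputAudioTranscription(),
        ),
    )

    context = OpenAILLMContext([{"role": "user", "content": "Greet the user."}])
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),              # audio from the user
        context_aggregator.user(),
        llm,                            # speech-to-speech processing
        transport.output(),             # audio back to the user
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)


if __name__ == "__main__":
    asyncio.run(main())
```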
Function Calling
The service supports function calling with automatic response handling:
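A sketch using Pipecat’s function-calling interface (the weather tool and handler are hypothetical; the `FunctionCallParams` style follows recent Pipecat releases):

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.llm_service import FunctionCallParams

# Hypothetical weather tool
weather_function = FunctionSchema(
    name="get_current_weather",
    description="Get the current weather for a location",
    properties={"location": {"type": "string", "description": "City and state"}},
    required=["location"],
)
tools = ToolsSchema(standard_tools=[weather_function])

async def fetch_weather(params: FunctionCallParams):
    # Call your weather API here; return the result via the callback
    await params.result_callback({"conditions": "sunny", "temperature_f": 75})

llm.register_function("get_current_weather", fetch_weather)
```

Depending on your setup, the tools are supplied to the model through the `OpenAILLMContext` or `SessionProperties.tools`.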
See the Function Calling guide for:
- Detailed implementation instructions
- Provider-specific function definitions
- Handler registration examples
- Control over function call behavior
- Complete usage examples
Frame Flow
Metrics Support
The service collects comprehensive metrics:
- Token usage (prompt and completion)
- Processing duration
- Time to First Byte (TTFB)
- Audio processing metrics
- Function call metrics
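Metrics collection is enabled on the pipeline task (a sketch; flag names follow recent Pipecat releases):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # TTFB and processing metrics
        enable_usage_metrics=True,  # token usage metrics
    ),
)
```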
Advanced Features
Turn Detection
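The API offers server-side VAD and semantic turn detection, both configured through `SessionProperties` (a sketch; class names follow recent Pipecat releases):

```python
from pipecat.services.openai_realtime_beta import (
    SemanticTurnDetection,
    SessionProperties,
    TurnDetection,
)

# Server-side VAD with tuned thresholds
props = SessionProperties(
    turn_detection=TurnDetection(
        type="server_vad",
        threshold=0.5,
        prefix_padding_ms=300,
        silence_duration_ms=200,
    )
)

# Semantic turn detection: the model judges when the user has finished
props = SessionProperties(turn_detection=SemanticTurnDetection())

# Disable server-side detection entirely and rely on local VAD
props = SessionProperties(turn_detection=False)
```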
Context Management
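The service keeps the server-side conversation in sync with Pipecat’s context aggregators (a sketch):

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful voice assistant."}]
)
context_aggregator = llm.create_context_aggregator(context)

# In the pipeline, place context_aggregator.user() before the service
# and context_aggregator.assistant() after it so both sides of the
# conversation are captured.
```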
Foundational Examples
OpenAI Realtime Beta Example
Basic implementation showing core realtime features including audio streaming, turn detection, and function calling.
Notes
- Supports real-time speech-to-speech conversation
- Handles interruptions and turn-taking
- Manages WebSocket connection lifecycle
- Provides function calling capabilities
- Supports conversation context management
- Includes comprehensive error handling
- Manages audio streaming and processing
- Handles both text and audio modalities