Real-time speech-to-speech service implementation using OpenAI’s Realtime Beta API.

`OpenAIRealtimeBetaLLMService` provides real-time, multimodal conversation capabilities using OpenAI’s Realtime Beta API. It supports speech-to-speech interactions with integrated LLM processing, function calling, and advanced conversation management.
- Stream audio in real time with low-latency responses
- Built-in speech-to-text and text-to-speech with a choice of voices
- Multiple voice activity detection options, including semantic turn detection
- Seamless support for calling external functions and APIs
To use `OpenAIRealtimeBetaLLMService`, install the required dependencies:
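A typical install looks like the following; the `openai` extra is an assumption based on Pipecat’s usual extras naming, so check your version’s install docs:

```shell
pip install "pipecat-ai[openai]"
```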
You’ll also need to set your OpenAI API key as the `OPENAI_API_KEY` environment variable.
The constructor accepts the following parameters:

- `api_key` - Your OpenAI API key
- `model` - The speech-to-speech model used for processing
- `base_url` - WebSocket endpoint URL
- `session_properties` - Configuration for the realtime session
- `start_audio_paused` - Whether to start with audio input paused
- `send_transcription_frames` - Whether to emit transcription frames
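A minimal construction sketch; the import path and model name are assumptions, so verify them against your installed Pipecat version:

```python
import os

from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

# Assumed import path and model name; check your Pipecat version's reference.
llm = OpenAIRealtimeBetaLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview-2024-12-17",
    start_audio_paused=False,
)
```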
The `SessionProperties` object configures the behavior of the realtime session:
- `modalities` - The modalities to enable (default includes both text and audio)
- `instructions` - System instructions that guide the model’s behavior
- `voice` - Voice ID for text-to-speech (options: alloy, echo, fable, onyx, nova, shimmer)
- `input_audio_format` - Format of the input audio
- `output_audio_format` - Format of the output audio
- `input_audio_transcription` - Configuration for audio transcription
- `input_audio_noise_reduction` - Configuration for audio noise reduction
- `turn_detection` - Configuration for turn detection (set to `False` to disable)
- `tools` - List of function definitions for tool/function calling
- `tool_choice` - Controls when the model calls functions
- `temperature` - Controls randomness in responses (0.0 to 2.0)
- `max_response_output_tokens` - Maximum number of tokens to generate
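As a sketch, a session might be configured as below; the `SessionProperties` and `TurnDetection` import path is an assumption, and the field names mirror the OpenAI Realtime session schema:

```python
import os

from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService
from pipecat.services.openai_realtime_beta.events import (  # assumed module path
    SessionProperties,
    TurnDetection,
)

session_properties = SessionProperties(
    instructions="You are a helpful assistant. Keep answers brief.",
    voice="alloy",
    temperature=0.8,
    # Server-side VAD; pass turn_detection=False to handle turns yourself.
    turn_detection=TurnDetection(silence_duration_ms=800),
)

llm = OpenAIRealtimeBetaLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    session_properties=session_properties,
)
```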
The service consumes the following input frames:

- `InputAudioRawFrame` - Raw audio data for speech input
- `StartInterruptionFrame` - Signals start of user interruption
- `UserStartedSpeakingFrame` - Signals user started speaking
- `UserStoppedSpeakingFrame` - Signals user stopped speaking
- `OpenAILLMContextFrame` - Contains conversation context
- `LLMMessagesAppendFrame` - Appends messages to the conversation

It produces the following output frames:

- `TTSAudioRawFrame` - Generated speech audio
- `TTSStartedFrame` - Signals start of speech synthesis
- `TTSStoppedFrame` - Signals end of speech synthesis
- `TextFrame` - Generated text responses
- `TranscriptionFrame` - Speech transcriptions
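For example, you can append messages mid-conversation by queueing an `LLMMessagesAppendFrame` onto a running pipeline task; this sketch assumes the standard frame import and an existing `PipelineTask` named `task`:

```python
from pipecat.frames.frames import LLMMessagesAppendFrame

# Inside an async context, with `task` being a running PipelineTask.
await task.queue_frames([
    LLMMessagesAppendFrame(
        messages=[{"role": "user", "content": "Summarize what we discussed."}]
    )
])
```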
The service emits the following events:

- `on_conversation_item_created` - Emitted when a conversation item is created on the server. The handler receives `item_id: str` and `item: ConversationItem`.
- `on_conversation_item_updated` - Emitted when a conversation item is updated on the server. The handler receives `item_id: str` and `item: Optional[ConversationItem]` (the item may not exist for some updates).

The service also provides `retrieve_conversation_item(item_id)`, which retrieves a conversation item’s details from the server.
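Handlers are registered with Pipecat’s event-handler decorator; a sketch, assuming the event names above and the `llm` service constructed earlier:

```python
@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"Conversation item created: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    if item is not None:  # item may be absent for some updates
        print(f"Conversation item updated: {item_id}")
```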
The service supports function calling with automatic response handling:
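A sketch of a function-call round trip; the `FunctionCallParams` import path is an assumption (older Pipecat versions use a positional-argument handler signature), and `get_weather` is a hypothetical function:

```python
from pipecat.services.llm_service import FunctionCallParams  # assumed import path

# Advertise the tool to the model. The Realtime API uses a flat
# function definition rather than the nested chat-completions format.
session_properties = SessionProperties(
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }],
)

async def get_weather(params: FunctionCallParams):
    # Call a real weather API here; the result is returned to the model.
    await params.result_callback({"conditions": "sunny", "temperature_f": 72})

# Register the handler on the service constructed earlier.
llm.register_function("get_weather", get_weather)
```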
See the Function Calling guide for more on defining functions and handling their results.
The service collects comprehensive metrics, including token usage, processing duration, and time to first byte (TTFB).
Basic implementation showing core realtime features, including audio streaming, turn detection, and function calling:
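The end-to-end sketch below wires the service between a transport’s input and output; the pipeline, task, and context-aggregator calls follow Pipecat’s usual patterns, but treat the transport setup (elided here) and exact imports as assumptions:

```python
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

async def main():
    # `transport` is your audio transport (e.g. Daily or WebRTC),
    # created elsewhere; its setup is omitted from this sketch.
    transport = ...

    llm = OpenAIRealtimeBetaLLMService(api_key=os.getenv("OPENAI_API_KEY"))

    context = OpenAILLMContext([{"role": "user", "content": "Say hello!"}])
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio from the user
        context_aggregator.user(),       # user turns into context
        llm,                             # realtime speech-to-speech model
        transport.output(),              # audio back to the user
        context_aggregator.assistant(),  # assistant turns into context
    ])

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)

if __name__ == "__main__":
    asyncio.run(main())
```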