# OpenAI Realtime

> Real-time speech-to-speech service implementation using OpenAI's Realtime API

## Overview

`OpenAIRealtimeLLMService` provides real-time, multimodal conversation capabilities using OpenAI's Realtime API. It supports low-latency speech-to-speech interactions with integrated LLM processing, function calling, and conversation management.

<CardGroup cols={2}>
  <Card title="OpenAI Realtime API Reference" icon="code" href="https://reference-server.pipecat.ai/en/latest/api/pipecat.services.openai.realtime.llm.html">
    Pipecat's API methods for OpenAI Realtime integration
  </Card>

  <Card title="Example Implementation" icon="play" href="https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-openai.py">
    Complete OpenAI Realtime conversation example
  </Card>

  <Card title="OpenAI Documentation" icon="book" href="https://platform.openai.com/docs/guides/realtime">
    Official OpenAI Realtime API documentation
  </Card>

  <Card title="OpenAI Platform" icon="external-link" href="https://platform.openai.com/">
    Access Realtime models and manage API keys
  </Card>
</CardGroup>

## Installation

To use OpenAI Realtime services, install the required dependencies:

```bash theme={null}
uv add "pipecat-ai[openai]"
```

## Prerequisites

### OpenAI Account Setup

Before using OpenAI Realtime services, you need:

1. **OpenAI Account**: Sign up at [OpenAI Platform](https://platform.openai.com/)
2. **API Key**: Generate an OpenAI API key from your account dashboard
3. **Model Access**: Ensure your account has access to the Realtime models you plan to use (for example, `gpt-realtime-2`)
4. **Usage Limits**: Configure appropriate usage limits and billing

### Required Environment Variables

* `OPENAI_API_KEY`: Your OpenAI API key for authentication
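
A missing key typically only surfaces as an authentication error once the WebSocket connection is attempted, so it can help to fail fast at startup. The helper below is a minimal sketch; `require_api_key` is a hypothetical name and not part of Pipecat.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the bot")
    return key
```

The returned value can then be passed as `api_key=` when constructing the service.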

### Key Features

* **Real-time Speech-to-Speech**: Direct audio processing with minimal latency
* **Advanced Turn Detection**: Multiple voice activity detection options including semantic detection
* **Function Calling**: Seamless support for external functions and APIs
* **Voice Options**: Multiple voice personalities and speaking styles
* **Conversation Management**: Intelligent context handling and conversation flow control

## Configuration

### OpenAIRealtimeLLMService

<ParamField path="api_key" type="str" required>
  OpenAI API key for authentication.
</ParamField>

<ParamField path="model" type="str" default="gpt-realtime-2" deprecated>
  OpenAI Realtime model name. This is a connection-level parameter set via the
  WebSocket URL and cannot be changed during the session.

  *Deprecated in v0.0.105. Use `settings=OpenAIRealtimeLLMService.Settings(model=...)` instead.*
</ParamField>

<ParamField path="base_url" type="str" default="wss://api.openai.com/v1/realtime">
  WebSocket base URL for the Realtime API. Override for custom or proxied
  deployments.
</ParamField>

<ParamField path="session_properties" type="SessionProperties" default="None" deprecated>
  Configuration properties for the realtime session. These are session-level
  settings that can be updated during the session (except for voice and model).
  See [SessionProperties](#sessionproperties) below.

  *Deprecated in v0.0.105. Use `settings=OpenAIRealtimeLLMService.Settings(session_properties=...)` instead.*
</ParamField>

<ParamField path="settings" type="OpenAIRealtimeLLMService.Settings" default="None">
  Runtime-configurable settings. See [Settings](#settings) below.
</ParamField>

<ParamField path="start_audio_paused" type="bool" default="False">
  Whether to start with audio input paused. Useful when you want to control when
  audio processing begins.
</ParamField>

<ParamField path="start_video_paused" type="bool" default="False">
  Whether to start with video input paused.
</ParamField>

<ParamField path="video_frame_detail" type="str" default="auto">
  Detail level for video processing. Can be `"auto"`, `"low"`, or `"high"`.
  `"auto"` lets the model decide, `"low"` is faster and uses fewer tokens,
  `"high"` provides more detail.
</ParamField>

<ParamField path="user_audio_preroll_secs" type="float | None" default="None">
  In manual turn-detection mode (`turn_detection=False`, locally-driven turns),
  how much recent audio to replay after an interruption clears the input audio
  buffer, so the speech onset isn't lost. Defaults to `None`: auto-sized to the
  upstream VAD's `start_secs` plus a small margin, falling back to `0.5` seconds
  when no VAD is present. Auto-sizing assumes VAD drives turn starts (the
  default `VADUserTurnStartStrategy`); set this explicitly if you use a non-VAD
  turn-start strategy. No effect when server-side turn detection is enabled.
</ParamField>

<ParamField path="**kwargs" type="Any">
  Additional arguments passed to the parent `LLMService`.
</ParamField>
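
The `user_audio_preroll_secs` behavior described above can be pictured as a duration resolver plus a rolling byte buffer. This is a standalone sketch, not Pipecat's implementation: the margin value, the 24 kHz sample rate, and 16-bit mono samples are assumptions made for illustration, and `resolve_preroll_secs` and `PrerollBuffer` are hypothetical names.

```python
from collections import deque

FALLBACK_PREROLL_SECS = 0.5  # documented fallback when no VAD is present
ASSUMED_MARGIN_SECS = 0.2    # hypothetical; the docs only say "a small margin"


def resolve_preroll_secs(user_value=None, vad_start_secs=None):
    """Pick the pre-roll duration: explicit value, then VAD-based, then fallback."""
    if user_value is not None:
        return user_value
    if vad_start_secs is not None:
        return vad_start_secs + ASSUMED_MARGIN_SECS
    return FALLBACK_PREROLL_SECS


class PrerollBuffer:
    """Keeps only the most recent `secs` of PCM audio (assumed 16-bit mono)."""

    def __init__(self, secs, sample_rate=24000, bytes_per_sample=2):
        self._max_bytes = int(secs * sample_rate * bytes_per_sample)
        self._chunks = deque()
        self._size = 0

    def append(self, chunk: bytes):
        self._chunks.append(chunk)
        self._size += len(chunk)
        # Drop whole chunks from the front while the rest still covers the window
        while self._chunks and self._size - len(self._chunks[0]) >= self._max_bytes:
            self._size -= len(self._chunks.popleft())

    def replay(self) -> bytes:
        """Return the buffered audio, trimmed to exactly the window size."""
        data = b"".join(self._chunks)
        return data[-self._max_bytes:] if self._max_bytes else b""
```

After an interruption clears the input audio buffer, the bytes from `replay()` would be sent ahead of new audio so the start of the user's speech is not lost.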

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `OpenAIRealtimeLLMService.Settings(...)`. These can be updated mid-conversation with `LLMUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter            | Type                | Default     | Description                                                   |
| -------------------- | ------------------- | ----------- | ------------------------------------------------------------- |
| `model`              | `str`               | `NOT_GIVEN` | Model identifier. *(Inherited from base settings.)*           |
| `system_instruction` | `str`               | `NOT_GIVEN` | System instruction/prompt. *(Inherited from base settings.)*  |
| `session_properties` | `SessionProperties` | `NOT_GIVEN` | Session-level configuration (modalities, audio, tools, etc.). |

<Note>
  `NOT_GIVEN` values are omitted, letting the service use its own defaults
  (`"gpt-realtime-2"` for model). Only parameters that are explicitly set are
  included.
</Note>
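
To make the `NOT_GIVEN` semantics concrete, here is a standalone sketch of how a delta with unset fields might be merged onto existing settings. `SketchSettings` and `apply_delta` are hypothetical names used for illustration only; Pipecat's own settings classes and merge logic may differ.

```python
from dataclasses import dataclass, fields, replace


class _NotGiven:
    """Sentinel type distinguishing 'unset' from an explicit None."""

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSettings:
    model: object = NOT_GIVEN
    system_instruction: object = NOT_GIVEN


def apply_delta(current, delta):
    """Copy only the explicitly set fields of `delta` onto `current`."""
    changes = {
        f.name: getattr(delta, f.name)
        for f in fields(delta)
        if getattr(delta, f.name) is not NOT_GIVEN
    }
    return replace(current, **changes)


current = SketchSettings(model="gpt-realtime-2", system_instruction="Be brief.")
updated = apply_delta(current, SketchSettings(system_instruction="Now speak in Spanish."))
print(updated)
# SketchSettings(model='gpt-realtime-2', system_instruction='Now speak in Spanish.')
```

Because `model` was left as `NOT_GIVEN` in the delta, the existing value is preserved rather than being reset.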

### SessionProperties

| Parameter           | Type                                  | Default | Description                                                                                                          |
| ------------------- | ------------------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------- |
| `output_modalities` | `List[Literal["text", "audio"]]`      | `None`  | Modalities the model can respond with. The API supports single modality responses: either `["text"]` or `["audio"]`. |
| `instructions`      | `str`                                 | `None`  | System instructions for the assistant.                                                                               |
| `audio`             | `AudioConfiguration`                  | `None`  | Configuration for input and output audio (format, transcription, turn detection, voice, speed).                      |
| `tools`             | `List[Dict]`                          | `None`  | Available function tools for the assistant.                                                                          |
| `tool_choice`       | `Literal["auto", "none", "required"]` | `None`  | Tool usage strategy.                                                                                                 |
| `max_output_tokens` | `int \| Literal["inf"]`               | `None`  | Maximum tokens in response, or `"inf"` for unlimited.                                                                |
| `tracing`           | `Literal["auto"] \| Dict`             | `None`  | Configuration options for tracing.                                                                                   |
| `reasoning`         | `Reasoning`                           | `None`  | Reasoning configuration. Only supported by reasoning-capable models such as `gpt-realtime-2`.                        |

### AudioConfiguration

The `audio` field in `SessionProperties` accepts an `AudioConfiguration` with `input` and `output` sub-configurations:

**AudioInput** (`audio.input`):

| Parameter         | Type                                             | Default | Description                                                                                                                                                   |
| ----------------- | ------------------------------------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `format`          | `AudioFormat`                                    | `None`  | Input audio format (`PCMAudioFormat`, `PCMUAudioFormat`, or `PCMAAudioFormat`).                                                                               |
| `transcription`   | `InputAudioTranscription`                        | `None`  | Transcription settings: `model` (e.g. `"gpt-realtime-whisper"`, `"gpt-4o-transcribe"`), `language`, and `prompt` (not supported by `"gpt-realtime-whisper"`). |
| `noise_reduction` | `InputAudioNoiseReduction`                       | `None`  | Noise reduction type: `"near_field"` or `"far_field"`.                                                                                                        |
| `turn_detection`  | `TurnDetection \| SemanticTurnDetection \| bool` | `None`  | Turn detection config, or `False` to disable server-side turn detection.                                                                                      |

**AudioOutput** (`audio.output`):

| Parameter | Type          | Default | Description                                                              |
| --------- | ------------- | ------- | ------------------------------------------------------------------------ |
| `format`  | `AudioFormat` | `None`  | Output audio format.                                                     |
| `voice`   | `str`         | `None`  | Voice the model uses to respond (e.g. `"alloy"`, `"echo"`, `"shimmer"`). |
| `speed`   | `float`       | `None`  | Speed of the model's spoken response.                                    |

### TurnDetection

Server-side VAD configuration via `TurnDetection`:

| Parameter             | Type                    | Default        | Description                                            |
| --------------------- | ----------------------- | -------------- | ------------------------------------------------------ |
| `type`                | `Literal["server_vad"]` | `"server_vad"` | Detection type.                                        |
| `threshold`           | `float`                 | `0.5`          | Voice activity detection threshold (0.0-1.0).          |
| `prefix_padding_ms`   | `int`                   | `300`          | Padding before speech starts in milliseconds.          |
| `silence_duration_ms` | `int`                   | `500`          | Silence duration to detect speech end in milliseconds. |

Alternatively, use `SemanticTurnDetection` for semantic-based detection:

| Parameter            | Type                                       | Default          | Description                                                  |
| -------------------- | ------------------------------------------ | ---------------- | ------------------------------------------------------------ |
| `type`               | `Literal["semantic_vad"]`                  | `"semantic_vad"` | Detection type.                                              |
| `eagerness`          | `Literal["low", "medium", "high", "auto"]` | `None`           | Turn detection eagerness level.                              |
| `create_response`    | `bool`                                     | `None`           | Whether to automatically create responses on turn detection. |
| `interrupt_response` | `bool`                                     | `None`           | Whether to interrupt ongoing responses on turn detection.    |
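
For intuition about how the `TurnDetection` parameters interact, the toy detector below applies `threshold`, `prefix_padding_ms`, and `silence_duration_ms` to a list of per-frame speech probabilities. It is only an illustration: the real detection runs on OpenAI's servers and its internals are not documented here. `detect_turns` and the 20 ms frame size are assumptions.

```python
def detect_turns(speech_probs, frame_ms=20, threshold=0.5,
                 prefix_padding_ms=300, silence_duration_ms=500):
    """Return (start_ms, end_ms) turns from per-frame speech probabilities."""
    turns, start, silence = [], None, 0
    for i, prob in enumerate(speech_probs):
        now = i * frame_ms
        if prob >= threshold:
            if start is None:
                # Back-date the start so leading audio is not clipped
                start = max(0, now - prefix_padding_ms)
            silence = 0
        elif start is not None:
            silence += frame_ms
            if silence >= silence_duration_ms:
                # End of turn: the point where the silence began
                turns.append((start, now + frame_ms - silence))
                start, silence = None, 0
    if start is not None:
        # Input ended mid-turn; close it where the trailing silence began
        turns.append((start, len(speech_probs) * frame_ms - silence))
    return turns
```

In this sketch, raising `silence_duration_ms` makes the detector tolerate longer pauses before ending a turn, and raising `threshold` requires more confident speech before a turn starts.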

### Reasoning

Reasoning configuration for reasoning-capable Realtime models (e.g. `gpt-realtime-2`):

| Parameter | Type                                                   | Default | Description                                                                                              |
| --------- | ------------------------------------------------------ | ------- | -------------------------------------------------------------------------------------------------------- |
| `effort`  | `Literal["minimal", "low", "medium", "high", "xhigh"]` | `None`  | How much reasoning effort the model should apply. Omit to let the server pick the default for the model. |

<Note>
  Reasoning configuration is automatically filtered out when used with
  non-reasoning models (e.g. `gpt-realtime-1.5`) to prevent session errors. A
  warning is logged when this filtering occurs.
</Note>
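
The filtering described in the note can be sketched as follows. This is not Pipecat's code: `filter_session` and `REASONING_MODELS` are hypothetical, the session is modeled as a plain dict, and the set contains only the model this page names as reasoning-capable.

```python
import logging

logger = logging.getLogger("realtime_sketch")

# Assumption for illustration: only the model this page names as reasoning-capable
REASONING_MODELS = {"gpt-realtime-2"}


def filter_session(model, session):
    """Return a copy of `session` without `reasoning` for unsupported models."""
    if "reasoning" in session and model not in REASONING_MODELS:
        logger.warning("Dropping 'reasoning': %s does not support it", model)
        return {k: v for k, v in session.items() if k != "reasoning"}
    return dict(session)
```

Dropping the field before it is sent keeps the rest of the session update valid instead of failing the whole request.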

## Usage

<Tip>
  Pair this service with
  `LLMContextAggregatorPair(context, realtime_service_mode=True)`. Realtime mode
  keeps context-writing correct for speech-to-speech services and adapts turn
  handling to the service. See [Realtime (Speech-to-Speech)
  Services](/api-reference/server/utilities/turn-management/external-turn-management#realtime-speech-to-speech-services).
</Tip>

### Basic Setup

```python theme={null}
import os
from pipecat.services.openai.realtime import OpenAIRealtimeLLMService

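llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    # `model=` as a constructor argument is deprecated since v0.0.105
    settings=OpenAIRealtimeLLMService.Settings(model="gpt-realtime-2"),
)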
```

### With Session Configuration

```python theme={null}
import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    InputAudioTranscription,
    SemanticTurnDetection,
    SessionProperties,
)

session_properties = SessionProperties(
    audio=AudioConfiguration(
        input=AudioInput(
            transcription=InputAudioTranscription(model="gpt-realtime-whisper"),
            turn_detection=SemanticTurnDetection(eagerness="medium"),
        ),
        output=AudioOutput(
            voice="alloy",
            speed=1.0,
        ),
    ),
    max_output_tokens=4096,
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(
        model="gpt-realtime-2",
        session_properties=session_properties,
        system_instruction="You are a helpful assistant.",
    ),
)
```

### With Disabled Turn Detection (Manual Control)

```python theme={null}
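import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    AudioConfiguration,
    AudioInput,
    SessionProperties,
)

# Passing False disables server-side turn detection
session_properties = SessionProperties(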
    audio=AudioConfiguration(
        input=AudioInput(
            turn_detection=False,
        ),
    ),
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(
        model="gpt-realtime-2",
        session_properties=session_properties,
        system_instruction="You are a helpful assistant.",
    ),
)
```

### With Reasoning Configuration (gpt-realtime-2)

```python theme={null}
import os

from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import SessionProperties, Reasoning

session_properties = SessionProperties(
    reasoning=Reasoning(effort="high"),
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIRealtimeLLMService.Settings(
        model="gpt-realtime-2",
        session_properties=session_properties,
        system_instruction="You are a helpful assistant.",
    ),
)
```

### Updating Settings at Runtime

```python theme={null}
from pipecat.frames.frames import LLMUpdateSettingsFrame
from pipecat.services.openai.realtime.llm import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import SessionProperties, Reasoning

# Update system instruction and max tokens
await worker.queue_frame(
    LLMUpdateSettingsFrame(
        delta=OpenAIRealtimeLLMService.Settings(
            system_instruction="Now speak in Spanish.",
            session_properties=SessionProperties(
                max_output_tokens=2048,
            ),
        )
    )
)

# Update reasoning effort (gpt-realtime-2 only)
await worker.queue_frame(
    LLMUpdateSettingsFrame(
        delta=OpenAIRealtimeLLMService.Settings(
            session_properties=SessionProperties(
                reasoning=Reasoning(effort="xhigh"),
            ),
        )
    )
)
```

<Tip>
  The `model` and `session_properties` constructor parameters are deprecated as
  of v0.0.105. Pass both through
  `settings=OpenAIRealtimeLLMService.Settings(...)` instead. See the
  [Service Settings guide](/pipecat/fundamentals/service-settings) for
  migration details.
</Tip>

## Notes

* **Model is connection-level**: The `model` parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
* **Output modalities are single-mode**: The API supports either `["text"]` or `["audio"]` output, not both simultaneously.
* **Turn detection options**: Use `TurnDetection` for traditional VAD, `SemanticTurnDetection` for AI-based turn detection, or `False` to disable server-side detection and manage turns manually.
* **Manual turn detection pre-roll**: When server-side turn detection is disabled (`turn_detection=False`), the service maintains a rolling audio buffer that is replayed after interruptions to preserve speech onsets. Configure the buffer duration with `user_audio_preroll_secs` or let it auto-size from the upstream VAD's `start_secs`.
* **Audio output format**: The service outputs 24 kHz PCM audio by default.
* **Video support**: Video frames can be sent to the model for multimodal input. Control the detail level with `video_frame_detail` and pause/resume with `set_video_input_paused()`.
* **Transcription frames**: User speech transcription frames are always emitted upstream when input audio transcription is configured.
* **System instruction precedence**: The `system_instruction` from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.
* **Reasoning support**: The `reasoning` configuration is only supported by reasoning-capable models such as `gpt-realtime-2`. When used with non-supporting models, the reasoning configuration is automatically filtered out to prevent session errors, and a warning is logged.
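
Because the model is fixed by the WebSocket URL, changing it means opening a new connection. OpenAI's Realtime endpoint takes the model as a `model` query parameter; the helper below sketches how such a URL could be assembled from `base_url`. `realtime_url` is a hypothetical name, and the way Pipecat builds the URL internally may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def realtime_url(base_url="wss://api.openai.com/v1/realtime", model="gpt-realtime-2"):
    """Append (or override) the `model` query parameter on the base URL."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["model"] = model
    return urlunsplit(parts._replace(query=urlencode(query)))


print(realtime_url())
# wss://api.openai.com/v1/realtime?model=gpt-realtime-2
```

Any query parameters already present on a proxied `base_url` are kept, and only `model` is added or replaced.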

## Event Handlers

| Event                          | Description                                                   |
| ------------------------------ | ------------------------------------------------------------- |
| `on_conversation_item_created` | Called when a new conversation item is created in the session |
| `on_conversation_item_updated` | Called when a conversation item is updated or completed       |

```python theme={null}
@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")
```