Overview

InworldRealtimeLLMService provides real-time, multimodal conversation capabilities using Inworld’s Realtime API. It operates as a cascade STT/LLM/TTS pipeline under the hood with built-in semantic voice activity detection (VAD) for turn management, offering low-latency speech-to-speech interactions with integrated LLM processing and function calling.

Inworld Realtime API Reference

Pipecat’s API methods for Inworld Realtime integration

Example Implementation

Complete Inworld Realtime conversation example

Inworld Realtime Documentation

Official Inworld Realtime API documentation

Inworld Console

Access Inworld models and manage API keys

Installation

To use Inworld Realtime services, install the required dependencies:
pip install "pipecat-ai[inworld]"

Prerequisites

Inworld Account Setup

Before using Inworld Realtime services, you need:
  1. Inworld Account: Sign up at Inworld Studio
  2. API Key: Generate an Inworld API key from your account dashboard
  3. Model Access: Ensure access to Inworld Realtime models
  4. Usage Limits: Configure appropriate usage limits and billing

Required Environment Variables

  • INWORLD_API_KEY: Your Inworld API key for authentication
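
Since the service only needs the key at construction time, it can help to fail fast when the variable is missing. A minimal sketch (the helper name is illustrative, not part of Pipecat):

```python
import os

def require_env(name: str = "INWORLD_API_KEY") -> str:
    """Return an environment variable's value, raising a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting your bot")
    return value
```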

Key Features

  • Real-time Speech-to-Speech: Direct audio processing with low latency
  • Cascade Pipeline: Integrated STT → LLM → TTS processing
  • Semantic VAD: Advanced semantic voice activity detection for natural turn-taking
  • Multilingual Support: Support for multiple languages via STT model selection
  • Function Calling: Seamless support for external functions and tool integration
  • Multiple Voice Options: Various voice personalities available
  • WebSocket Support: Real-time bidirectional audio streaming
  • Streaming Transcription: Real-time user speech transcription

Configuration

InworldRealtimeLLMService

api_key
str
required
Inworld API key for authentication.
llm_model
str
default:"openai/gpt-4.1-mini"
LLM model to use (e.g. “openai/gpt-4.1-nano”). Shorthand for session_properties.model.
voice
str
default:"Clive"
Voice ID for TTS output (e.g. “Sarah”, “Clive”). Shorthand for session_properties.audio.output.voice.
tts_model
str
default:"inworld-tts-1.5-max"
TTS model to use (e.g. “inworld-tts-1.5-max”). Shorthand for session_properties.audio.output.model.
stt_model
str
default:"assemblyai/u3-rt-pro"
STT model for input transcription (e.g. “assemblyai/universal-streaming-multilingual”). Shorthand for session_properties.audio.input.transcription.model.
base_url
str
default:"wss://api.inworld.ai/api/v1/realtime/session"
WebSocket base URL for the Inworld Realtime API. Override for custom deployments.
auth_type
Literal['basic', 'bearer']
default:"basic"
Authentication type. "basic" for server-side API key auth, "bearer" for client-side JWT auth.
settings
InworldRealtimeLLMService.Settings
default:"None"
Runtime-configurable settings. See Settings below.
start_audio_paused
bool
default:"False"
Whether to start with audio input paused.

Settings

Runtime-configurable settings passed via the settings constructor argument using InworldRealtimeLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | NOT_GIVEN | Model identifier. (Inherited from base settings.) |
| `system_instruction` | `str` | NOT_GIVEN | System instruction/prompt. (Inherited from base settings.) |
| `temperature` | `float` | NOT_GIVEN | Temperature for response generation. (Inherited from base settings.) |
| `session_properties` | `SessionProperties` | NOT_GIVEN | Session-level configuration (voice, audio config, tools, etc.). |
NOT_GIVEN values are omitted, letting the service use its own defaults. Only parameters that are explicitly set are included.
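
The NOT_GIVEN behavior can be pictured with a small sketch of the sentinel pattern (an illustration only, not Pipecat's actual implementation):

```python
# A distinct sentinel object distinguishes "explicitly set to None" from
# "never set", so unset fields can be dropped from the outgoing payload.
NOT_GIVEN = object()

def build_payload(**fields):
    """Keep only the fields the caller explicitly provided."""
    return {key: value for key, value in fields.items() if value is not NOT_GIVEN}

build_payload(model="openai/gpt-4.1-nano", temperature=NOT_GIVEN)
# only "model" is included; the service applies its own default temperature
```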

SessionProperties

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `None` | LLM model to use (e.g. `"openai/gpt-4.1-nano"`). |
| `instructions` | `str` | `None` | System instructions for the assistant. |
| `temperature` | `float` | `None` | Temperature for response generation. |
| `output_modalities` | `List[str]` | `["audio", "text"]` | Output modalities for the assistant. |
| `audio` | `AudioConfiguration` | `None` | Configuration for input and output audio formats. |
| `tools` | `List[FunctionTool]` | `None` | Available custom function tools. |

AudioConfiguration

The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations.

AudioInput (audio.input):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | `AudioFormat` | `None` | Input audio format. Supports `PCMAudioFormat` (configurable rate), `PCMUAudioFormat` (8 kHz), or `PCMAAudioFormat` (8 kHz). |
| `transcription` | `InputTranscription` | `None` | Configuration for input audio transcription. Includes a `model` field for STT model selection. |
| `turn_detection` | `TurnDetection` | `None` | Turn detection configuration. Supports `"semantic_vad"` and `"server_vad"` types. |

AudioOutput (audio.output):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | `AudioFormat` | `None` | Output audio format. Same format options as input. |
| `model` | `str` | `None` | TTS model to use (e.g. `"inworld-tts-1.5-max"`). |
| `voice` | `str` | `None` | Voice ID (e.g. `"Sarah"`, `"Clive"`). |

Inworld PCM audio supports sample rates of 8000, 16000, 24000, 32000, 44100, and 48000 Hz.
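
A small guard on those rates can catch a configuration mistake before the session is opened. The helper below is illustrative, not part of Pipecat; the rate list comes from the table above:

```python
# Sample rates accepted for Inworld PCM audio.
SUPPORTED_PCM_RATES = {8000, 16000, 24000, 32000, 44100, 48000}

def validate_pcm_rate(rate: int) -> int:
    """Raise early if a PCM sample rate is not supported by the Realtime API."""
    if rate not in SUPPORTED_PCM_RATES:
        raise ValueError(
            f"Unsupported PCM rate {rate}; choose one of {sorted(SUPPORTED_PCM_RATES)}"
        )
    return rate
```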

TurnDetection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `type` | `Literal["server_vad", "semantic_vad"]` | `"semantic_vad"` | Detection type. `"semantic_vad"` for semantic-based, `"server_vad"` for standard VAD. |
| `eagerness` | `str` | `None` | How eagerly to detect end of turn. Options: `"low"`, `"medium"`, `"high"`. |
| `create_response` | `bool` | `None` | Whether to automatically create a response on turn end. |
| `interrupt_response` | `bool` | `None` | Whether user speech interrupts the current response. |

Usage

Basic Setup

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
)

With Model and Voice Configuration

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    llm_model="openai/gpt-4.1-nano",
    voice="Sarah",
    tts_model="inworld-tts-1.5-max",
    stt_model="assemblyai/universal-streaming-multilingual",
)

With Full Session Configuration

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService
from pipecat.services.inworld.realtime.events import (
    SessionProperties,
    TurnDetection,
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    PCMAudioFormat,
    InputTranscription,
)

session_properties = SessionProperties(
    model="openai/gpt-4.1-nano",
    instructions="You are a helpful assistant.",
    temperature=0.7,
    audio=AudioConfiguration(
        input=AudioInput(
            format=PCMAudioFormat(rate=24000),
            transcription=InputTranscription(
                model="assemblyai/u3-rt-pro"
            ),
            turn_detection=TurnDetection(
                type="semantic_vad",
                eagerness="low",
                create_response=True,
                interrupt_response=True,
            ),
        ),
        output=AudioOutput(
            format=PCMAudioFormat(rate=24000),
            model="inworld-tts-1.5-max",
            voice="Sarah",
        ),
    ),
)

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    settings=InworldRealtimeLLMService.Settings(
        session_properties=session_properties,
    ),
)

Updating Settings at Runtime

from pipecat.frames.frames import LLMUpdateSettingsFrame
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMSettings
from pipecat.services.inworld.realtime.events import SessionProperties

await task.queue_frame(
    LLMUpdateSettingsFrame(
        delta=InworldRealtimeLLMSettings(
            session_properties=SessionProperties(
                instructions="Now speak in Spanish.",
                voice="Sarah",
            ),
        )
    )
)

Notes

  • Audio format auto-configuration: If audio format is not specified in session_properties, the service automatically configures PCM input/output using the pipeline’s sample rates (defaults to 24000 Hz).
  • Semantic VAD by default: The service uses semantic VAD ("semantic_vad") by default for more natural turn detection. When VAD is enabled, the server handles speech detection and turn management automatically.
  • Cascade architecture: The service operates as an integrated STT → LLM → TTS pipeline on the server side, simplifying client-side implementation.
  • Audio before setup: Audio is not sent to Inworld until the conversation setup is complete, preventing sample rate mismatches.
  • G.711 support: PCMU and PCMA formats are supported at a fixed 8000 Hz rate, useful for telephony integrations.
  • System instruction precedence: The system_instruction from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.
  • Settings replacement: When providing session_properties in settings, it replaces all defaults wholesale — provide a complete SessionProperties configuration in that case. Use the constructor shortcuts (llm_model, voice, tts_model, stt_model) for simpler configuration.

Event Handlers

| Event | Description |
| --- | --- |
| `on_conversation_item_created` | Called when a new conversation item is created in the session |
| `on_conversation_item_updated` | Called when a conversation item is updated or completed |

@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")