Overview

InworldRealtimeLLMService provides real-time, multimodal conversation capabilities using Inworld’s Realtime API. It operates as a cascade STT/LLM/TTS pipeline under the hood with built-in semantic voice activity detection (VAD) for turn management, offering low-latency speech-to-speech interactions with integrated LLM processing and function calling.

Inworld Realtime API Reference

Pipecat’s API methods for Inworld Realtime integration

Example Implementation

Complete Inworld Realtime conversation example

Inworld Realtime Documentation

Official Inworld Realtime API documentation

Inworld Console

Access Inworld models and manage API keys

Installation

To use Inworld Realtime services, install the required dependencies:
pip install "pipecat-ai[inworld]"

Prerequisites

Inworld Account Setup

Before using Inworld Realtime services, you need:
  1. Inworld Account: Sign up at Inworld Studio
  2. API Key: Generate an Inworld API key from your account dashboard
  3. Model Access: Ensure access to Inworld Realtime models
  4. Usage Limits: Configure appropriate usage limits and billing

Required Environment Variables

  • INWORLD_API_KEY: Your Inworld API key for authentication
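
Since the service only needs the key at construction time, it can help to fail fast when the variable is missing. A minimal sketch (the helper name is illustrative, not part of Pipecat):

```python
import os

def require_env(name: str = "INWORLD_API_KEY") -> str:
    """Return an environment variable's value, raising a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting your bot")
    return value
```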

Key Features

  • Real-time Speech-to-Speech: Direct audio processing with low latency
  • Cascade Pipeline: Integrated STT → LLM → TTS processing
  • Semantic VAD: Advanced semantic voice activity detection for natural turn-taking
  • Multilingual Support: Support for multiple languages via STT model selection
  • Function Calling: Seamless support for external functions and tool integration
  • Multiple Voice Options: Various voice personalities available
  • WebSocket Support: Real-time bidirectional audio streaming
  • Streaming Transcription: Real-time user speech transcription

Configuration

InworldRealtimeLLMService

api_key
str
required
Inworld API key for authentication.
llm_model
str
default:"openai/gpt-4.1-mini"
LLM model to use (e.g. “openai/gpt-4.1-nano”). Shorthand for session_properties.model.
voice
str
default:"Clive"
Voice ID for TTS output (e.g. “Sarah”, “Clive”). Shorthand for session_properties.audio.output.voice.
tts_model
str
default:"inworld-tts-1.5-max"
TTS model to use (e.g. “inworld-tts-1.5-max”). Shorthand for session_properties.audio.output.model.
stt_model
str
default:"assemblyai/u3-rt-pro"
STT model for input transcription (e.g. “assemblyai/universal-streaming-multilingual”). Shorthand for session_properties.audio.input.transcription.model.
base_url
str
default:"wss://api.inworld.ai/api/v1/realtime/session"
WebSocket base URL for the Inworld Realtime API. Override for custom deployments.
auth_type
Literal['basic', 'bearer']
default:"basic"
Authentication type. "basic" for server-side API key auth, "bearer" for client-side JWT auth.
settings
InworldRealtimeLLMService.Settings
default:"None"
Runtime-configurable settings. See Settings below.
start_audio_paused
bool
default:"False"
Whether to start with audio input paused.

Settings

Runtime-configurable settings passed via the settings constructor argument using InworldRealtimeLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | NOT_GIVEN | Model identifier. (Inherited from base settings.) |
| `system_instruction` | `str` | NOT_GIVEN | System instruction/prompt. (Inherited from base settings.) |
| `temperature` | `float` | NOT_GIVEN | Temperature for response generation. (Inherited from base settings.) |
| `session_properties` | `SessionProperties` | NOT_GIVEN | Session-level configuration (voice, audio config, tools, etc.). |
NOT_GIVEN values are omitted, letting the service use its own defaults. Only parameters that are explicitly set are included.
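
The NOT_GIVEN behavior can be pictured with a small sketch of the sentinel pattern (an illustration only, not Pipecat's actual implementation):

```python
# A distinct sentinel object distinguishes "explicitly set to None" from
# "never set", so unset fields can be dropped from the outgoing payload.
NOT_GIVEN = object()

def build_payload(**fields):
    """Keep only the fields the caller explicitly provided."""
    return {key: value for key, value in fields.items() if value is not NOT_GIVEN}

build_payload(model="openai/gpt-4.1-nano", temperature=NOT_GIVEN)
# only "model" is included; the service applies its own default temperature
```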

SessionProperties

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `None` | LLM model to use (e.g. `"openai/gpt-4.1-nano"`). |
| `instructions` | `str` | `None` | System instructions for the assistant. |
| `temperature` | `float` | `None` | Temperature for response generation. |
| `output_modalities` | `List[str]` | `["audio", "text"]` | Output modalities for the assistant. |
| `audio` | `AudioConfiguration` | `None` | Configuration for input and output audio formats. |
| `tools` | `List[FunctionTool]` | `None` | Available custom function tools. |

AudioConfiguration

The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations.

AudioInput (audio.input):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | `AudioFormat` | `None` | Input audio format. Supports `PCMAudioFormat` (configurable rate), `PCMUAudioFormat` (8 kHz), or `PCMAAudioFormat` (8 kHz). |
| `transcription` | `InputTranscription` | `None` | Configuration for input audio transcription. Includes a `model` field for STT model selection. |
| `turn_detection` | `TurnDetection` | `None` | Turn detection configuration. Supports `"semantic_vad"` and `"server_vad"` types. |

AudioOutput (audio.output):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | `AudioFormat` | `None` | Output audio format. Same format options as input. |
| `model` | `str` | `None` | TTS model to use (e.g. `"inworld-tts-1.5-max"`). |
| `voice` | `str` | `None` | Voice ID (e.g. `"Sarah"`, `"Clive"`). |

Inworld PCM audio supports sample rates of 8000, 16000, 24000, 32000, 44100, and 48000 Hz.
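
A small guard on those rates can catch a configuration mistake before the session is opened. The helper below is illustrative, not part of Pipecat; the rate list comes from the table above:

```python
# Sample rates accepted for Inworld PCM audio.
SUPPORTED_PCM_RATES = {8000, 16000, 24000, 32000, 44100, 48000}

def validate_pcm_rate(rate: int) -> int:
    """Raise early if a PCM sample rate is not supported by the Realtime API."""
    if rate not in SUPPORTED_PCM_RATES:
        raise ValueError(
            f"Unsupported PCM rate {rate}; choose one of {sorted(SUPPORTED_PCM_RATES)}"
        )
    return rate
```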

TurnDetection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `type` | `Literal["server_vad", "semantic_vad"]` | `"semantic_vad"` | Detection type. `"semantic_vad"` for semantic-based, `"server_vad"` for standard VAD. |
| `eagerness` | `str` | `None` | How eagerly to detect end of turn. Options: `"low"`, `"medium"`, `"high"`. |
| `create_response` | `bool` | `None` | Whether to automatically create a response on turn end. |
| `interrupt_response` | `bool` | `None` | Whether user speech interrupts the current response. |

Usage

Basic Setup

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
)

With Model and Voice Configuration

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    llm_model="openai/gpt-4.1-nano",
    voice="Sarah",
    tts_model="inworld-tts-1.5-max",
    stt_model="assemblyai/universal-streaming-multilingual",
)

With Full Session Configuration

import os
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService
from pipecat.services.inworld.realtime.events import (
    SessionProperties,
    TurnDetection,
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    PCMAudioFormat,
    InputTranscription,
)

session_properties = SessionProperties(
    model="openai/gpt-4.1-nano",
    instructions="You are a helpful assistant.",
    temperature=0.7,
    audio=AudioConfiguration(
        input=AudioInput(
            format=PCMAudioFormat(rate=24000),
            transcription=InputTranscription(
                model="assemblyai/u3-rt-pro"
            ),
            turn_detection=TurnDetection(
                type="semantic_vad",
                eagerness="low",
                create_response=True,
                interrupt_response=True,
            ),
        ),
        output=AudioOutput(
            format=PCMAudioFormat(rate=24000),
            model="inworld-tts-1.5-max",
            voice="Sarah",
        ),
    ),
)

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    settings=InworldRealtimeLLMService.Settings(
        session_properties=session_properties,
    ),
)

Updating Settings at Runtime

from pipecat.frames.frames import LLMUpdateSettingsFrame
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMSettings
from pipecat.services.inworld.realtime.events import SessionProperties

await task.queue_frame(
    LLMUpdateSettingsFrame(
        delta=InworldRealtimeLLMSettings(
            session_properties=SessionProperties(
                instructions="Now speak in Spanish.",
                voice="Sarah",
            ),
        )
    )
)

Notes

  • Audio format auto-configuration: If audio format is not specified in session_properties, the service automatically configures PCM input/output using the pipeline’s sample rates (defaults to 24000 Hz).
  • Semantic VAD by default: The service uses semantic VAD ("semantic_vad") by default for more natural turn detection. When VAD is enabled, the server handles speech detection and turn management automatically.
  • Cascade architecture: The service operates as an integrated STT → LLM → TTS pipeline on the server side, simplifying client-side implementation.
  • Audio before setup: Audio is not sent to Inworld until the conversation setup is complete, preventing sample rate mismatches.
  • G.711 support: PCMU and PCMA formats are supported at a fixed 8000 Hz rate, useful for telephony integrations.
  • System instruction precedence: The system_instruction from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.
  • Settings replacement: When providing session_properties in settings, it replaces all defaults wholesale — provide a complete SessionProperties configuration in that case. Use the constructor shortcuts (llm_model, voice, tts_model, stt_model) for simpler configuration.

Event Handlers

| Event | Description |
| --- | --- |
| `on_conversation_item_created` | Called when a new conversation item is created in the session |
| `on_conversation_item_updated` | Called when a conversation item is updated or completed |

@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")