Overview
InworldRealtimeLLMService provides real-time, multimodal conversation capabilities using Inworld’s Realtime API. It operates as a cascade STT/LLM/TTS pipeline under the hood with built-in semantic voice activity detection (VAD) for turn management, offering low-latency speech-to-speech interactions with integrated LLM processing and function calling.
- Inworld Realtime API Reference: Pipecat's API methods for Inworld Realtime integration
- Example Implementation: Complete Inworld Realtime conversation example
- Inworld Realtime Documentation: Official Inworld Realtime API documentation
- Inworld Console: Access Inworld models and manage API keys
Installation
To use Inworld Realtime services, install the required dependencies.

Prerequisites
Inworld Account Setup
Before using Inworld Realtime services, you need:
- Inworld Account: Sign up at Inworld Studio
- API Key: Generate an Inworld API key from your account dashboard
- Model Access: Ensure access to Inworld Realtime models
- Usage Limits: Configure appropriate usage limits and billing
Required Environment Variables
INWORLD_API_KEY: Your Inworld API key for authentication
Key Features
- Real-time Speech-to-Speech: Direct audio processing with low latency
- Cascade Pipeline: Integrated STT → LLM → TTS processing
- Semantic VAD: Advanced semantic voice activity detection for natural turn-taking
- Multilingual Support: Support for multiple languages via STT model selection
- Function Calling: Seamless support for external functions and tool integration
- Multiple Voice Options: Various voice personalities available
- WebSocket Support: Real-time bidirectional audio streaming
- Streaming Transcription: Real-time user speech transcription
Configuration
InworldRealtimeLLMService
Constructor parameters:

- `api_key`: Inworld API key for authentication.
- `llm_model`: LLM model to use (e.g. “openai/gpt-4.1-nano”). Shorthand for `session_properties.model`.
- `voice`: Voice ID for TTS output (e.g. “Sarah”, “Clive”). Shorthand for `session_properties.audio.output.voice`.
- `tts_model`: TTS model to use (e.g. “inworld-tts-1.5-max”). Shorthand for `session_properties.audio.output.model`.
- `stt_model`: STT model for input transcription (e.g. “assemblyai/universal-streaming-multilingual”). Shorthand for `session_properties.audio.input.transcription.model`.
- `base_url`: WebSocket base URL for the Inworld Realtime API. Override for custom deployments.
- Authentication type: `"basic"` for server-side API key auth, `"bearer"` for client-side JWT auth.
- `settings`: Runtime-configurable settings. See Settings below.
- Audio input paused: Whether to start with audio input paused.
Settings
Runtime-configurable settings are passed via the `settings` constructor argument as `InworldRealtimeLLMService.Settings(...)`. They can be updated mid-conversation with `LLMUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | NOT_GIVEN | Model identifier. (Inherited from base settings.) |
| system_instruction | str | NOT_GIVEN | System instruction/prompt. (Inherited from base settings.) |
| temperature | float | NOT_GIVEN | Temperature for response generation. (Inherited from base settings.) |
| session_properties | SessionProperties | NOT_GIVEN | Session-level configuration (voice, audio config, tools, etc.). |
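The NOT_GIVEN defaults in the table above are a sentinel value, not a real default: fields left at NOT_GIVEN are dropped before settings are applied. A minimal, self-contained illustration of this sentinel pattern (this sketch is hypothetical, not Pipecat's actual implementation):

```python
from dataclasses import dataclass, fields

# Hypothetical sentinel mirroring the NOT_GIVEN marker.
class _NotGiven:
    def __repr__(self):
        return "NOT_GIVEN"

NOT_GIVEN = _NotGiven()

@dataclass
class Settings:
    model: object = NOT_GIVEN
    system_instruction: object = NOT_GIVEN
    temperature: object = NOT_GIVEN

def to_payload(settings: Settings) -> dict:
    """Include only the fields that were explicitly set."""
    return {
        f.name: getattr(settings, f.name)
        for f in fields(settings)
        if getattr(settings, f.name) is not NOT_GIVEN
    }

print(to_payload(Settings(temperature=0.7)))  # {'temperature': 0.7}
```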
NOT_GIVEN values are omitted, letting the service use its own defaults; only parameters that are explicitly set are included.

SessionProperties
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | LLM model to use (e.g. “openai/gpt-4.1-nano”). |
| instructions | str | None | System instructions for the assistant. |
| temperature | float | None | Temperature for response generation. |
| output_modalities | List[str] | ["audio", "text"] | Output modalities for the assistant. |
| audio | AudioConfiguration | None | Configuration for input and output audio formats. |
| tools | List[FunctionTool] | None | Available custom function tools. |
AudioConfiguration
The `audio` field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations:
AudioInput (audio.input):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Input audio format. Supports PCMAudioFormat (configurable rate), PCMUAudioFormat (8kHz), or PCMAAudioFormat (8kHz). |
| transcription | InputTranscription | None | Configuration for input audio transcription. Includes model field for STT model selection. |
| turn_detection | TurnDetection | None | Turn detection configuration. Supports "semantic_vad" and "server_vad" types. |
AudioOutput (audio.output):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Output audio format. Same format options as input. |
| model | str | None | TTS model to use (e.g. “inworld-tts-1.5-max”). |
| voice | str | None | Voice ID (e.g. “Sarah”, “Clive”). |
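Putting the input and output halves together, an AudioConfiguration might be built like this. The import path and the `rate` field name are assumptions; check them against your Pipecat version:

```python
# Import path and the `rate` field name are assumptions;
# adjust to your Pipecat version.
from pipecat.services.inworld.events import (
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    InputTranscription,
    PCMAudioFormat,
)

audio = AudioConfiguration(
    input=AudioInput(
        format=PCMAudioFormat(rate=16000),  # configurable-rate PCM input
        transcription=InputTranscription(
            model="assemblyai/universal-streaming-multilingual",
        ),
    ),
    output=AudioOutput(
        format=PCMAudioFormat(rate=24000),
        model="inworld-tts-1.5-max",
        voice="Sarah",
    ),
)
```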
TurnDetection
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["server_vad", "semantic_vad"] | "semantic_vad" | Detection type. “semantic_vad” for semantic-based, “server_vad” for standard VAD. |
| eagerness | str | None | How eagerly to detect end of turn. Options: “low”, “medium”, “high”. |
| create_response | bool | None | Whether to automatically create a response on turn end. |
| interrupt_response | bool | None | Whether user speech interrupts the current response. |
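For example, a less eager semantic VAD that still auto-creates responses and allows barge-in might look like this (the import path is an assumption):

```python
# Import path is an assumption; adjust to your Pipecat version.
from pipecat.services.inworld.events import TurnDetection

turn_detection = TurnDetection(
    type="semantic_vad",      # semantic end-of-turn detection (the default)
    eagerness="low",          # wait longer before deciding the user is done
    create_response=True,     # automatically respond when the turn ends
    interrupt_response=True,  # user speech interrupts the assistant
)
```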
Usage
Basic Setup
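A minimal construction sketch. The module path and the `api_key` parameter name are assumptions based on other Pipecat services; check your Pipecat version:

```python
import os

# Module path and parameter name are assumptions;
# adjust to your Pipecat version.
from pipecat.services.inworld.realtime import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
)
```

The service can then be placed in a Pipecat pipeline like any other LLM service.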
With Model and Voice Configuration
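The constructor shortcuts set the corresponding `session_properties` fields without building a full SessionProperties object (the import path is an assumption):

```python
import os

# Import path is an assumption; adjust to your Pipecat version.
from pipecat.services.inworld.realtime import InworldRealtimeLLMService

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    llm_model="openai/gpt-4.1-nano",  # session_properties.model
    voice="Sarah",                    # session_properties.audio.output.voice
    tts_model="inworld-tts-1.5-max",  # session_properties.audio.output.model
    stt_model="assemblyai/universal-streaming-multilingual",
)
```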
With Full Session Configuration
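For full control, pass a complete SessionProperties through the service settings. A sketch, with import paths assumed:

```python
import os

# Import paths are assumptions; adjust to your Pipecat version.
from pipecat.services.inworld.realtime import InworldRealtimeLLMService
from pipecat.services.inworld.events import (
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    InputTranscription,
    SessionProperties,
    TurnDetection,
)

session_properties = SessionProperties(
    model="openai/gpt-4.1-nano",
    instructions="You are a friendly, concise voice assistant.",
    temperature=0.7,
    audio=AudioConfiguration(
        input=AudioInput(
            transcription=InputTranscription(
                model="assemblyai/universal-streaming-multilingual",
            ),
            turn_detection=TurnDetection(type="semantic_vad", eagerness="medium"),
        ),
        output=AudioOutput(model="inworld-tts-1.5-max", voice="Clive"),
    ),
)

llm = InworldRealtimeLLMService(
    api_key=os.getenv("INWORLD_API_KEY"),
    settings=InworldRealtimeLLMService.Settings(
        session_properties=session_properties,
    ),
)
```

Because `session_properties` in `settings` replaces the service defaults wholesale, the configuration above is written out in full rather than relying on defaults.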
Updating Settings at Runtime
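Settings updates travel through the pipeline as a frame. A sketch, assuming a running PipelineTask named `task`; the exact payload shape accepted by the frame is an assumption:

```python
from pipecat.frames.frames import LLMUpdateSettingsFrame

async def lower_temperature(task):
    # Only the fields provided here change; settings left unset
    # (NOT_GIVEN) keep their current values.
    await task.queue_frames(
        [LLMUpdateSettingsFrame(settings={"temperature": 0.3})]
    )
```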
Notes
- Audio format auto-configuration: If an audio format is not specified in `session_properties`, the service automatically configures PCM input/output using the pipeline's sample rates (default 24000 Hz).
- Semantic VAD by default: The service uses semantic VAD (`"semantic_vad"`) by default for more natural turn detection. When VAD is enabled, the server handles speech detection and turn management automatically.
- Cascade architecture: The service operates as an integrated STT → LLM → TTS pipeline on the server side, simplifying client-side implementation.
- Audio before setup: Audio is not sent to Inworld until the conversation setup is complete, preventing sample rate mismatches.
- G.711 support: PCMU and PCMA formats are supported at a fixed 8000 Hz rate, useful for telephony integrations.
- System instruction precedence: The `system_instruction` from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.
- Settings replacement: When `session_properties` is provided in `settings`, it replaces all defaults wholesale, so provide a complete `SessionProperties` configuration in that case. Use the constructor shortcuts (`llm_model`, `voice`, `tts_model`, `stt_model`) for simpler configuration.
Event Handlers
| Event | Description |
|---|---|
| on_conversation_item_created | Called when a new conversation item is created in the session |
| on_conversation_item_updated | Called when a conversation item is updated or completed |
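Handlers are registered with Pipecat's standard event-handler decorator. A sketch, where `llm` is an InworldRealtimeLLMService instance and the handler signature (service plus conversation item) is an assumption based on other Pipecat realtime services:

```python
# `llm` is an InworldRealtimeLLMService instance; the handler
# signature (service, item) is an assumption.
@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item):
    print(f"Conversation item created: {item}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item):
    print(f"Conversation item updated: {item}")
```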