Overview

Inworld provides high-quality, low-latency speech synthesis through two services: InworldTTSService, which uses WebSockets for real-time, minimal-latency use cases, and InworldHttpTTSService, which supports streaming and non-streaming use cases over HTTP. Inworld supports 12+ languages, timestamps, custom pronunciation, and instant voice cloning.

Installation

To use Inworld services, no additional dependencies are required beyond the base installation:
pip install "pipecat-ai"

Prerequisites

Inworld Account Setup

Before using Inworld TTS services, you need:
  1. Inworld Account: Sign up at Inworld Studio
  2. API Key: Generate an API key from your account dashboard
  3. Voice Selection: Choose from available voice models

Required Environment Variables

  • INWORLD_API_KEY: Your Inworld API key for authentication
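
A missing key is easier to diagnose at startup than at the first synthesis request. The following is a minimal sketch of a fail-fast lookup; load_inworld_api_key is a hypothetical helper, not part of pipecat.

```python
import os


def load_inworld_api_key(env=None):
    """Return the Inworld API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("INWORLD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("INWORLD_API_KEY is not set")
    return key
```

The returned value can be passed as api_key to either service in place of os.getenv("INWORLD_API_KEY").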

Configuration

InworldTTSService

WebSocket-based service for the lowest-latency streaming.
api_key
str
required
Inworld API key.
voice_id
str
default:"Ashley"
ID of the voice to use for synthesis.
model
str
default:"inworld-tts-1.5-max"
ID of the model to use for synthesis.
url
str
URL of the Inworld WebSocket API.
sample_rate
int
default:"None"
Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
encoding
str
default:"LINEAR16"
Audio encoding format.
aggregate_sentences
bool
default:"True"
Whether to aggregate sentences before synthesis.
append_trailing_space
bool
default:"True"
Whether to append a trailing space to text before sending to TTS.
params
InputParams
default:"None"
Runtime-configurable synthesis settings. See InworldTTSService InputParams below.

InworldTTSService InputParams

Parameter | Type | Default | Description
temperature | float | None | Temperature for speech synthesis.
speaking_rate | float | None | Speaking rate for speech synthesis.
apply_text_normalization | str | None | Whether to apply text normalization.
max_buffer_delay_ms | int | None | Maximum buffer delay in milliseconds. Defaults to 3000 if not set.
buffer_char_threshold | int | None | Buffer character threshold. Defaults to 250 if not set.
auto_mode | bool | True | Server-controlled flushing for optimal latency and quality. Recommended when text is sent in full sentences/phrases.
timestamp_transport_strategy | Literal["ASYNC", "SYNC"] | None | Strategy for timestamp transport.
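
The two buffer parameters default to None and fall back to 3000 ms and 250 characters. The sketch below shows that fallback in isolation. BufferSettings and resolve_buffer_settings are hypothetical names used only for illustration; this page does not say where the real fallback is applied, so this is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Fallback values quoted in the parameter table above.
DEFAULT_MAX_BUFFER_DELAY_MS = 3000
DEFAULT_BUFFER_CHAR_THRESHOLD = 250


@dataclass
class BufferSettings:
    """Hypothetical stand-in for the two buffer fields of InputParams."""

    max_buffer_delay_ms: Optional[int] = None
    buffer_char_threshold: Optional[int] = None


def resolve_buffer_settings(settings: BufferSettings) -> dict:
    """Return effective values, substituting the documented fallbacks for None."""
    delay = settings.max_buffer_delay_ms
    threshold = settings.buffer_char_threshold
    return {
        "max_buffer_delay_ms": (
            DEFAULT_MAX_BUFFER_DELAY_MS if delay is None else delay
        ),
        "buffer_char_threshold": (
            DEFAULT_BUFFER_CHAR_THRESHOLD if threshold is None else threshold
        ),
    }
```

Checking `is None` rather than truthiness keeps an explicit 0 from being replaced by the fallback.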

InworldHttpTTSService

HTTP-based service supporting both streaming and non-streaming modes.
api_key
str
required
Inworld API key.
aiohttp_session
aiohttp.ClientSession
required
aiohttp ClientSession for HTTP requests.
voice_id
str
default:"Ashley"
ID of the voice to use for synthesis.
model
str
default:"inworld-tts-1.5-max"
ID of the model to use for synthesis.
streaming
bool
default:"True"
Whether to use streaming mode.
sample_rate
int
default:"None"
Audio sample rate in Hz.
encoding
str
default:"LINEAR16"
Audio encoding format.
params
InputParams
default:"None"
Runtime-configurable synthesis settings. See InworldHttpTTSService InputParams below.

InworldHttpTTSService InputParams

Parameter | Type | Default | Description
temperature | float | None | Temperature for speech synthesis.
speaking_rate | float | None | Speaking rate for speech synthesis.
timestamp_transport_strategy | Literal["ASYNC", "SYNC"] | None | Strategy for timestamp transport.

Usage

Basic Setup (WebSocket)

import os

from pipecat.services.inworld import InworldTTSService

# Read the API key from the environment rather than hard-coding it.
tts = InworldTTSService(
    api_key=os.getenv("INWORLD_API_KEY"),
    voice_id="Ashley",
)

With Custom Settings

tts = InworldTTSService(
    api_key=os.getenv("INWORLD_API_KEY"),
    voice_id="Ashley",
    model="inworld-tts-1.5-max",
    params=InworldTTSService.InputParams(
        temperature=0.8,
        speaking_rate=1.1,
        auto_mode=True,
    ),
)

HTTP Service

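import os

import aiohttp
from pipecat.services.inworld import InworldHttpTTSService


async def main():
    # The session must stay open for as long as the service is in use.
    async with aiohttp.ClientSession() as session:
        tts = InworldHttpTTSService(
            api_key=os.getenv("INWORLD_API_KEY"),
            aiohttp_session=session,
            voice_id="Ashley",
            streaming=True,  # set to False for non-streaming mode
        )
        # Build and run the pipeline that uses `tts` here, inside the block.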

Notes

  • WebSocket vs HTTP: The WebSocket service (InworldTTSService) provides the lowest latency with bidirectional streaming and supports multiple independent audio contexts per connection (max 5). The HTTP service supports both streaming and non-streaming modes via the streaming parameter.
  • Word timestamps: Both services provide word-level timestamps for synchronized text display. Timestamps are tracked cumulatively across utterances within a turn.
  • Auto mode: When auto_mode=True (default), the server controls flushing of buffered text for optimal latency and quality. This is recommended when text is sent in full sentences or phrases (i.e., when aggregate_sentences=True).
  • Keepalive: The WebSocket service sends periodic keepalive messages every 60 seconds to maintain the connection.
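
To illustrate what "tracked cumulatively across utterances within a turn" means for word timestamps, here is a minimal standalone sketch. It assumes each utterance reports word start times relative to its own beginning along with its audio duration. The to_turn_timeline helper and this data shape are hypothetical; the service's internal bookkeeping may differ.

```python
from typing import List, Tuple

Word = Tuple[str, float]  # (word, start time in seconds)


def to_turn_timeline(utterances: List[Tuple[List[Word], float]]) -> List[Word]:
    """Shift per-utterance word times onto one timeline for the whole turn.

    Each entry in `utterances` is (words, duration_seconds), where the word
    start times are relative to the start of that utterance.
    """
    timeline: List[Word] = []
    offset = 0.0  # total audio duration of the utterances already processed
    for words, duration in utterances:
        for word, start in words:
            timeline.append((word, offset + start))
        offset += duration
    return timeline
```

With two one-second utterances, a word that starts 0.25 s into the second utterance lands at 1.25 s on the turn timeline, so text display stays aligned with the audio actually played.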

Event Handlers

Inworld TTS supports the standard service connection events:
Event | Description
on_connected | Connected to Inworld WebSocket
on_disconnected | Disconnected from Inworld WebSocket
on_connection_error | WebSocket connection error occurred

@tts.event_handler("on_connected")
async def on_connected(service):
    print("Connected to Inworld")
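
The other two events can be handled the same way. The sketch below registers handlers through the same event_handler decorator, applied as a plain function call. It assumes on_connection_error passes an error argument after the service, which this page does not confirm; register_handlers is a hypothetical helper.

```python
import logging

logger = logging.getLogger("inworld-tts")


async def on_disconnected(service):
    # Called when the WebSocket connection closes.
    logger.warning("Disconnected from Inworld")


async def on_connection_error(service, error):
    # Assumption: the error details arrive as a second argument.
    logger.error("Inworld connection error: %s", error)


def register_handlers(tts):
    """Attach both handlers to a service exposing event_handler(name)."""
    tts.event_handler("on_disconnected")(on_disconnected)
    tts.event_handler("on_connection_error")(on_connection_error)
```

Calling register_handlers(tts) once after constructing the service is equivalent to decorating each function with @tts.event_handler(...).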