Overview

Inworld AI provides high-quality text-to-speech synthesis with natural-sounding voices and real-time streaming capabilities. The service supports both streaming and non-streaming modes, making it suitable for various use cases from low-latency conversational AI to batch audio generation.
Streaming mode is recommended for real-time applications requiring low latency.

Installation

To use Inworld services, no additional dependencies are required beyond the base installation:
pip install "pipecat-ai"
You’ll also need to set up your Inworld API key as an environment variable: INWORLD_API_KEY.
Get your API key from Inworld Studio. Make sure to base64-encode your API key.
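The encoding step can be sketched with the Python standard library. This is only an illustration: the `my-key:my-secret` string is a placeholder, and the exact shape of the raw credentials (or whether Inworld Studio already shows a Base64 value you can copy directly) is an assumption to verify against your own workspace.

```python
import base64
import os


def encode_credentials(raw_credentials: str) -> str:
    """Return the Base64 encoding of a credential string as ASCII text."""
    return base64.b64encode(raw_credentials.encode("utf-8")).decode("ascii")


# Placeholder value for illustration only; substitute the credentials
# shown for your workspace in Inworld Studio.
encoded = encode_credentials("my-key:my-secret")

# Expose the encoded value to the current process so that
# os.getenv("INWORLD_API_KEY") picks it up.
os.environ["INWORLD_API_KEY"] = encoded
```

In practice you would usually export the encoded value from your shell or a `.env` file rather than setting it in code.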

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data (LINEAR16 PCM, WAV header stripped)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - API or processing errors

Features

  • High-Quality Voices: Natural-sounding voices including Ashley, Hades, and more
  • Streaming & Non-Streaming: Unified interface supporting both real-time and batch processing
  • Automatic Language Detection: No need to specify language manually - Inworld detects it from your text
  • Voice Temperature Control: Accepts 0-2 (best results 0.6 to 1.0); lower values yield steadier, deterministic speech, while higher values add expressive variation.
  • Model Selection: Choose inworld-tts-1 for real-time, cost-sensitive use (lowest latency); use inworld-tts-1-max (experimental) when you can trade a bit more latency for richer expressiveness and broader multilingual support.
  • Professional-quality Audio Output: LINEAR16 PCM audio at up to 48kHz
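As a rough sketch, the temperature range and model choice described above could be checked before constructing the service. The helper names `check_temperature` and `pick_model` are hypothetical and not part of Pipecat or Inworld; treating both ends of the 0-2 range as inclusive is also an assumption.

```python
import warnings

# Bounds taken from the feature list above; inclusive ends are an assumption.
MIN_TEMPERATURE = 0.0
MAX_TEMPERATURE = 2.0
RECOMMENDED_RANGE = (0.6, 1.0)


def check_temperature(value: float) -> float:
    """Reject out-of-range values and warn outside the recommended band."""
    if not MIN_TEMPERATURE <= value <= MAX_TEMPERATURE:
        raise ValueError(f"temperature must be between 0 and 2, got {value}")
    low, high = RECOMMENDED_RANGE
    if not low <= value <= high:
        warnings.warn(f"temperature {value} is outside the recommended {low}-{high} range")
    return value


def pick_model(prefer_low_latency: bool) -> str:
    """Return the model name matching the latency/expressiveness trade-off."""
    return "inworld-tts-1" if prefer_low_latency else "inworld-tts-1-max"


temperature = check_temperature(0.8)
model = pick_model(prefer_low_latency=True)
```

The returned values can then be passed as `model` and as the `temperature` input parameter when creating the service.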

Audio Markups

Inworld supports experimental audio markups for enhanced expressiveness in English.
Emotion and Delivery Style (use at the beginning of text):
  • Emotions: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
  • Delivery Styles: [laughing], [whispering]
Non-verbal Vocalizations (place anywhere in text):
  • Sound Effects: [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
Audio markup features are experimental and currently support English only. For best results, use only one emotion/delivery style at the beginning of text. For detailed usage guidelines and best practices, refer to Inworld’s documentation on Audio Markups Best Practices.
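One way to keep markup usage consistent is to build the text through small helpers that only accept the tags listed above. This is a sketch: `with_style` and `vocalization` are hypothetical helpers, and the single space placed after a tag is an assumption rather than a documented requirement.

```python
# Tags copied from the lists above.
STYLE_TAGS = {
    "happy", "sad", "angry", "surprised", "fearful", "disgusted",
    "laughing", "whispering",
}
VOCALIZATION_TAGS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}


def with_style(text: str, style: str) -> str:
    """Prefix text with exactly one emotion or delivery-style tag."""
    if style not in STYLE_TAGS:
        raise ValueError(f"unknown emotion or delivery style: {style}")
    return f"[{style}] {text}"


def vocalization(name: str) -> str:
    """Return a non-verbal vocalization tag that can be placed anywhere."""
    if name not in VOCALIZATION_TAGS:
        raise ValueError(f"unknown vocalization: {name}")
    return f"[{name}]"


sentence = "That actually worked " + vocalization("laugh") + " shall we try it again?"
line = with_style(sentence, "happy")
print(line)
```

The resulting string is ordinary text, so it can be sent to the service like any other text to synthesize.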

Usage Examples

Streaming Mode (Real-time)

Perfect for conversational AI applications requiring low latency:
import asyncio
import aiohttp
import os
from pipecat.services.inworld.tts import InworldTTSService

async def main():
    async with aiohttp.ClientSession() as session:
        tts = InworldTTSService(
            api_key=os.getenv("INWORLD_API_KEY"),
            aiohttp_session=session,
            voice_id="Ashley",
            model="inworld-tts-1",
            streaming=True,  # Use streaming mode for real-time audio
            params=InworldTTSService.InputParams(
                temperature=0.8,
            ),
        )

        # Use in your pipeline
        # pipeline = Pipeline([...other_processors..., tts, ...])

asyncio.run(main())

Non-Streaming Mode (Complete Audio)

Ideal for scenarios where you need the complete audio file before playback:
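import asyncio
import os
import aiohttp
from pipecat.services.inworld.tts import InworldTTSService

async def main():
    async with aiohttp.ClientSession() as session:
        tts = InworldTTSService(
            api_key=os.getenv("INWORLD_API_KEY"),
            aiohttp_session=session,
            voice_id="Hades",
            model="inworld-tts-1-max",  # Higher quality model
            streaming=False,  # Complete audio generation first
            params=InworldTTSService.InputParams(
                temperature=1.2,  # More expressive speech
            ),
        )

asyncio.run(main())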

Streaming vs Non-Streaming

Mode          | Best For               | Use Cases
Streaming     | Real-time applications | Building conversational AI, minimal-latency interactions, processing text as it becomes available
Non-Streaming | Batch processing       | Longer content generation, complete audio files, batch scenarios, slightly better quality

Audio Specifications

  • Sample Rate Range: 8kHz - 48kHz (default comes from StartFrame)
  • Bit Depth: 16-bit
  • Encoding: LINEAR16 PCM (uncompressed)
  • Format: WAV headers automatically stripped
Sample Rate | Quality | Use Case
16000 Hz    | Basic   | Voice calls, simple applications
24000 Hz    | Good    | General conversational AI
48000 Hz    | High    | Professional applications, music
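To make these numbers concrete, the sketch below computes the raw data rate of LINEAR16 audio and shows what stripping a WAV header means, using only the Python standard library. The service already strips headers for you; the helper names and the single-channel (mono) default are assumptions made for the illustration.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # LINEAR16 means 16-bit (2-byte) samples


def pcm_bytes_per_second(sample_rate: int, channels: int = 1) -> int:
    """Raw PCM data rate; mono output is an assumption here."""
    return sample_rate * BYTES_PER_SAMPLE * channels


def make_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw mono PCM in a WAV container (used only to build test input)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def strip_wav_header(wav_bytes: bytes) -> bytes:
    """Return only the PCM frames from a WAV byte string."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes())


pcm = b"\x00\x01" * 100
wav_bytes = make_wav(pcm, 24000)
recovered = strip_wav_header(wav_bytes)
print(pcm_bytes_per_second(24000), len(wav_bytes) - len(recovered))
```

Doubling the sample rate doubles the data rate, which is worth weighing against bandwidth when picking a rate from the table above.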

Monitoring and Metrics

  • Time To First Byte (TTFB): Latency measurement from request start to first audio chunk
  • Processing Time: Total duration for the complete TTS operation
  • Usage Metrics: Character count of processed text for billing and analytics
Learn how to enable Metrics in your Pipeline.
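The first two metrics can be illustrated with a small timing sketch. It does not use Pipecat's metrics machinery: `fake_audio_stream` is a stand-in for a real TTS stream, and `measure` is a hypothetical helper that records when the first chunk arrives and when the stream ends.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def fake_audio_stream(chunks: int, delay: float) -> AsyncIterator[bytes]:
    """Stand-in for a TTS stream: yields silent PCM chunks after a delay."""
    for _ in range(chunks):
        await asyncio.sleep(delay)
        yield b"\x00\x00" * 160  # 160 silent 16-bit samples


async def measure(stream: AsyncIterator[bytes]) -> Tuple[Optional[float], float, int]:
    """Return (ttfb, processing_time, total_bytes) for an audio stream."""
    start = time.perf_counter()
    ttfb: Optional[float] = None
    total_bytes = 0
    async for chunk in stream:
        if ttfb is None:
            # Time from request start to the first audio chunk.
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    # Total duration of the whole operation.
    processing_time = time.perf_counter() - start
    return ttfb, processing_time, total_bytes


ttfb, processing_time, total_bytes = asyncio.run(measure(fake_audio_stream(3, 0.01)))
print(f"TTFB: {ttfb:.3f}s, total: {processing_time:.3f}s, bytes: {total_bytes}")
```

With a real service, the same pattern shows why streaming mode feels faster: TTFB stays small even when the total processing time grows with the length of the text.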

Resources