Text-to-speech (TTS) services convert text into natural-sounding speech audio. They receive text input from LLMs and other sources, then generate audio output that users hear through their connected devices.

Pipeline Placement

TTS processors must be positioned correctly in your pipeline to receive text and generate audio frames:
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,                           # Generates LLMTextFrames
    tts,                           # Processes text → creates TTSAudioRawFrames
    transport.output(),            # Sends audio to user
    context_aggregator.assistant(), # Processes TTSTextFrames for context
])
Placement requirements:
  • After LLM processing: TTS needs LLMTextFrames from language model responses
  • Before transport output: Audio must be generated before sending to user
  • Before assistant context aggregator: Ensures spoken text is captured in conversation history

Frame Processing Flow

TTS generates speech through two primary mechanisms:
  1. Streamed LLM tokens via LLMTextFrames:
    • TTS aggregates streaming tokens into complete sentences
    • Sentences are sent to TTS service for audio generation
    • Audio bytes stream back and play immediately
    • End-to-end latency often under 200ms
  2. Direct speech requests via TTSSpeakFrames:
    • Bypasses LLM and context aggregators
    • Immediate audio generation for specific text
    • Useful for system messages or prompts
Frame output:
  • TTSAudioRawFrames: Raw audio data for playback
  • TTSTextFrames: Text that was actually spoken (for context updates)
  • TTSStartedFrame/TTSStoppedFrame: Speech boundary markers
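If you want to observe these frames in a running pipeline, you can drop a small passthrough processor between tts and transport.output(). A minimal sketch (TTSFrameLogger is a hypothetical name; the frame types and FrameProcessor API are Pipecat's):
from pipecat.frames.frames import (
    Frame,
    TTSStartedFrame,
    TTSStoppedFrame,
    TTSTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TTSFrameLogger(FrameProcessor):
    """Logs TTS boundary and text frames as they flow downstream."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSStartedFrame):
            print("TTS started speaking")
        elif isinstance(frame, TTSTextFrame):
            print(f"Spoken text: {frame.text}")
        elif isinstance(frame, TTSStoppedFrame):
            print("TTS stopped speaking")
        # Pass every frame along, including TTSAudioRawFrames
        await self.push_frame(frame, direction)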

Supported TTS Services

Pipecat supports a wide range of TTS providers with different capabilities and performance characteristics:

View the complete list of supported text-to-speech providers

Service Categories

WebSocket-Based Services (Recommended):
  • Cartesia: Ultra-low latency with word timestamps
  • ElevenLabs: High-quality voices with emotion control
  • Rime: Ultra-realistic voices with advanced features
HTTP-Based Services:
  • OpenAI TTS: High-quality synthesis with multiple voices
  • Azure Speech: Enterprise-grade with extensive language support
  • Google Text-to-Speech: Reliable with WaveNet voices
Advanced Features:
  • Word timestamps: Enable word-level accuracy for context and subtitles
  • Voice cloning: Custom voice creation from samples
  • Emotion control: Dynamic emotional expression
  • SSML support: Fine-grained pronunciation control
WebSocket services typically provide the lowest latency, while HTTP services can exhibit higher, more variable latency due to their request/response nature.
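Because every TTS service implements the same processor interface, switching providers is just a matter of constructing a different service. A minimal sketch, following the per-provider module layout used elsewhere in these docs (verify OpenAITTSService's path and parameters against your installed Pipecat version):
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.tts import OpenAITTSService

# WebSocket-based: ultra-low latency, word timestamps
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="voice-id-here",
)

# HTTP-based alternative: same pipeline slot, request/response latency
# tts = OpenAITTSService(
#     api_key=os.getenv("OPENAI_API_KEY"),
#     voice="alloy",
# )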

TTS Configuration

Service-Specific Configuration

Each TTS service has its own configuration options. Here’s an example with Cartesia:
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transcriptions.language import Language

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="voice-id-here",
    model="sonic-2",              # TTS model to use
    params=CartesiaTTSService.InputParams(
        language=Language.EN,     # Speech language
        speed="normal",           # Speech rate control
    ),
    # Word timestamps automatically enabled for precise context updates
)
Word timestamps: Services like Cartesia, ElevenLabs, and Rime provide word-level timestamps that enable precise context updates during interruptions and better synchronization with other pipeline components. For example, if an interruption occurs while the bot is speaking, word timestamps let you accurately capture which words were spoken up to that point, enabling better context management and a better user experience. They also keep transcription events streamed from server to client in sync with the audio output, enabling real-time subtitles or captions.
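Configuration for another word-timestamp-capable service, ElevenLabs, looks much like the Cartesia example above. A sketch (verify ElevenLabsTTSService's module path and parameters against your installed version):
import os

from pipecat.services.elevenlabs.tts import ElevenLabsTTSService

tts = ElevenLabsTTSService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id="voice-id-here",
    # Word timestamps from the service are used automatically for
    # precise context updates during interruptions.
)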

Individual TTS Services

Explore configuration options for each supported TTS provider

Pipeline-Level Audio Configuration

Set consistent audio settings across your entire pipeline:
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # Input audio quality
        audio_out_sample_rate=24000,  # Output audio quality (TTS)
    ),
)
Set audio_out_sample_rate to match your TTS service’s requirements for optimal quality. This is preferable to setting sample_rate directly on the TTS service, since configuring it in PipelineParams ensures that all output sample rates match.

Text Processing and Filtering

Custom Text Aggregation

By default, TTS services use a built-in text aggregator that collects streaming text into sentences before passing them to the underlying service. You can customize this behavior by inserting an LLMTextProcessor with a different text aggregator before the TTS in your pipeline. This lets you categorize and structure text into logical units beyond simple sentences, such as code blocks, URLs, or custom tags, and then configure the TTS to handle each type appropriately, for example by skipping code blocks or transforming them just in time before they are spoken.

Skipping Text Aggregations

To skip certain text aggregations (e.g., code snippets or URLs) and keep them from being spoken, use a custom text aggregator like PatternPairAggregator within an LLMTextProcessor, and configure it to identify and handle specific patterns in the text stream. With this, you can then pass any aggregated types you want to skip (like “code”) to the TTS service’s skip_aggregator_types parameter.
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
# PatternPairAggregator, MatchAction, and LLMTextProcessor import paths
# depend on your Pipecat version's module layout

# Create pattern aggregator
pattern_aggregator = PatternPairAggregator()

# Add a pattern for code blocks
pattern_aggregator.add_pattern(
    type="code",
    start_pattern="<code>",
    end_pattern="</code>",
    action=MatchAction.AGGREGATE
)

# Set the aggregator on an LLMTextProcessor
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)

# Initialize the TTS service; don't speak code blocks
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    # These strings should match the types defined in the PatternPairAggregator
    skip_aggregator_types=["code"],
)

# Add the llm_text_processor to your pipeline after the llm and before the tts:
#   llm -> llm_text_processor -> tts
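Concretely, using the pipeline from the placement example earlier, the llm_text_processor slots in like this:
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    llm_text_processor,            # Re-aggregates LLM tokens into typed segments
    tts,                           # Skips segments of type "code"
    transport.output(),
    context_aggregator.assistant(),
])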

Text Transforms

For TTS-specific text preprocessing, you can provide custom text transforms that modify text in a just-in-time manner before sending the text off to the TTS service. This is useful for handling special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. These text transforms can be mapped to a specific text aggregation type, like with skip_aggregator_types, or applied globally to all text using '*' as the type. Text transforms are registered directly on the TTS service instance via the add_text_transformer() method or during initialization using the text_transforms parameter.
Text transforms are intended to be TTS-specific modifications that do not affect the underlying LLM text or context. That said, because the context aggregator bases its context on what was actually spoken, for services that support word timestamps (like Cartesia, ElevenLabs, and Rime) these transforms will also modify the context, since they change what is spoken.
# Create pattern aggregator
pattern_aggregator = PatternPairAggregator()

# Add a pattern for phone numbers
pattern_aggregator.add_pattern(
    type="phone_number",
    start_pattern="<pnum>",
    end_pattern="</pnum>",
    action=MatchAction.AGGREGATE
)

# Set the aggregator on an LLMTextProcessor
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)

# Text-to-Speech service
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

# Text transformers for TTS
# This will insert Cartesia's spell tags around the provided text.
async def spell_out_text(text: str, type: str) -> str:
    # CartesiaTTSService provides a helper for this along with other common transforms
    return CartesiaTTSService.SPELL(text)

async def replace_acronyms(text: str, type: str) -> str:
    # Replace "SEC" with "Southeastern Conference"
    return text.replace(" SEC ", " Southeastern Conference ")

# Set up the text transformers in TTS to spell out phone numbers and replace
# acronyms. The string below matches the type defined in the PatternPairAggregator
# above so that whenever those segments are encountered, this transform
# is applied
tts.add_text_transformer(spell_out_text, "phone_number")
tts.add_text_transformer(replace_acronyms, "*")  # Apply to all text

# Add the llm_text_processor to your pipeline after the llm and before the tts:
#   llm -> llm_text_processor -> tts

Text Filters

Text filters are no longer the preferred method for text preprocessing and will be deprecated in future releases. Instead, you should use one of the methods described above.
Apply preprocessing to text before synthesis:
from pipecat.utils.text.markdown_text_filter import MarkdownTextFilter

tts = YourTTSService(
    # ... other options
    text_filters=[
        MarkdownTextFilter(),     # Remove markdown formatting
        CustomTextFilter(),       # Your custom processing
    ],
)
Common filters:
  • MarkdownTextFilter: Strips markdown formatting from LLM responses
  • Custom filters: Implement your own text preprocessing logic

Advanced TTS Features

Direct Speech Commands

Use TTSSpeakFrame for immediate speech synthesis:
from pipecat.frames.frames import TTSSpeakFrame

# Make bot speak directly
await tts.queue_frame(TTSSpeakFrame("Hello, how can I help you?"))
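A common use is greeting the user as soon as they connect, before any LLM turn. A sketch, assuming a transport that emits an on_client_connected event (event names vary by transport) and a task built as in the pipeline configuration above:
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Speak a fixed greeting without involving the LLM or context
    await task.queue_frame(TTSSpeakFrame("Hello! How can I help you today?"))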

Dynamic Settings Updates

Update TTS settings during conversation:
from pipecat.frames.frames import TTSUpdateSettingsFrame

# Change voice speed during conversation
await task.queue_frames([
    TTSUpdateSettingsFrame({"speed": "fast"}),
    TTSSpeakFrame("I'm speaking faster now!")
])

Key Takeaways

  • Pipeline placement matters - TTS must come after LLM, before transport output
  • Service types differ - WebSocket services provide lower latency than HTTP
  • Text processing affects quality - use aggregation and filters for better results
  • Word timestamps enable precision - better interruption handling and context accuracy
  • Configuration impacts performance - balance quality, latency, and bandwidth needs
  • Services are modular - easily swap providers without changing pipeline code

What’s Next

You’ve now learned how to build a complete voice AI pipeline! Let’s explore some additional topics to enhance your implementation.

Pipeline Termination

Learn how to terminate your voice AI pipeline at the end of a conversation