Pipeline Placement
TTS processors must be positioned correctly in your pipeline to receive text and generate audio frames:

- After LLM processing: TTS needs LLMTextFrames from language model responses
- Before transport output: Audio must be generated before it is sent to the user
- Before the assistant context aggregator: Ensures spoken text is captured in conversation history
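A typical ordering can be sketched as follows. This is illustrative only: the service variables (`transport`, `stt`, `llm`, `tts`, `context_aggregator`) stand in for whatever providers you construct, and import paths may differ by Pipecat version.

```python
from pipecat.pipeline.pipeline import Pipeline

# Illustrative ordering only; the processor instances are assumed to be
# constructed elsewhere for your chosen providers.
pipeline = Pipeline([
    transport.input(),               # audio in from the user
    stt,                             # speech-to-text
    context_aggregator.user(),       # add user turns to the context
    llm,                             # emits LLMTextFrames
    tts,                             # after the LLM: consumes LLMTextFrames
    transport.output(),              # before output: audio must exist to send
    context_aggregator.assistant(),  # after TTS: records what was spoken
])
```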
Frame Processing Flow
TTS generates speech through two primary mechanisms:

- Streamed LLM tokens via LLMTextFrames:
  - TTS aggregates streaming tokens into complete sentences
  - Sentences are sent to the TTS service for audio generation
  - Audio bytes stream back and play immediately
  - End-to-end latency is often under 200 ms
- Direct speech requests via TTSSpeakFrames:
  - Bypasses the LLM and context aggregators
  - Immediate audio generation for specific text
  - Useful for system messages or prompts
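The first mechanism, aggregating streamed tokens into sentences before synthesis, can be sketched in plain Python (a simplified stand-in for the aggregation a TTS service performs internally):

```python
import re

def aggregate_sentences(tokens):
    """Collect streamed tokens into a buffer, yielding each complete
    sentence as soon as a terminator (. ! ?) arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Split off any complete sentence accumulated so far.
        while (m := re.search(r"[.!?](\s|$)", buffer)):
            sentence, buffer = buffer[:m.end()].strip(), buffer[m.end():]
            yield sentence
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hel", "lo the", "re! How", " can I hel", "p you today?"]
print(list(aggregate_sentences(tokens)))
# → ['Hello there!', 'How can I help you today?']
```

Yielding each sentence as soon as it completes is what lets audio playback begin while the LLM is still generating the rest of the response.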
During synthesis, TTS services emit several frame types downstream:

- TTSAudioRawFrames: Raw audio data for playback
- TTSTextFrames: Text that was actually spoken (for context updates)
- TTSStartedFrame/TTSStoppedFrame: Speech boundary markers
Supported TTS Services
Pipecat supports a wide range of TTS providers with different capabilities and performance characteristics. See Supported TTS Services in the docs for the complete list of supported text-to-speech providers.
Service Categories
WebSocket-Based Services (Recommended):

- Cartesia: Ultra-low latency with word timestamps
- ElevenLabs: High-quality voices with emotion control
- Rime: Ultra-realistic voices with advanced features
- OpenAI TTS: High-quality synthesis with multiple voices
- Azure Speech: Enterprise-grade with extensive language support
- Google Text-to-Speech: Reliable with WaveNet voices
Advanced capabilities vary by provider:

- Word timestamps: Enable word-level accuracy for context and subtitles
- Voice cloning: Custom voice creation from samples
- Emotion control: Dynamic emotional expression
- SSML support: Fine-grained pronunciation control
WebSocket services typically provide the lowest latency, while HTTP services may exhibit higher, more variable latency due to their request/response nature.
TTS Configuration
Service-Specific Configuration
Each TTS service has its own configuration options; Cartesia, for example, exposes voice, model, and sample-rate settings. See Individual TTS Services in the docs to explore configuration options for each supported TTS provider.
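A Cartesia setup might look like the following sketch. The import path, model name, and voice ID are assumptions and vary by Pipecat version and account; check the service reference before using them.

```python
import os

# Assumed import path; older Pipecat versions may use pipecat.services.cartesia.
from pipecat.services.cartesia.tts import CartesiaTTSService

# Illustrative configuration; parameter names may differ by version.
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",   # placeholder: choose a voice from Cartesia
    model="sonic-2",            # assumed model name
    sample_rate=24000,
)
```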
Pipeline-Level Audio Configuration
Set consistent audio settings across your entire pipeline.

Text Processing and Filtering
Custom Text Aggregation
By default, TTS services have a built-in text aggregator that collects streaming text into sentences before passing them to the underlying service. You can customize this behavior by inserting an LLMTextProcessor with a different text aggregator before the TTS in your pipeline. This lets you group text into logical units beyond simple sentences, such as code blocks, URLs, or custom tags, and then configure the TTS to handle each type appropriately, for example skipping code blocks or transforming them just in time before speaking.
Skipping Text Aggregations
To skip certain text aggregations (e.g., code snippets or URLs) and keep them from being spoken, use a custom text aggregator like PatternPairAggregator within an LLMTextProcessor, and configure it to identify and handle specific patterns in the text stream. You can then pass any aggregated types you want to skip (like "code") to the TTS service's skip_aggregator_types parameter.
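The idea behind pattern-pair aggregation can be shown with a stdlib-only stand-in (this is not Pipecat's PatternPairAggregator, just a simplified illustration of tagging text between paired markers so a later stage can skip it):

```python
import re

# Minimal stand-in for a pattern-pair aggregator: text between a start and
# end marker is tagged with a type (here "code") so a later TTS stage can
# skip or transform it; everything else is tagged "text".
def split_by_pattern(text, start="<code>", end="</code>", tag="code"):
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    parts, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()]))
        parts.append((tag, m.group(1)))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts

chunks = split_by_pattern("Run this: <code>print('hi')</code> then stop.")
spoken = "".join(body for kind, body in chunks if kind != "code")
print(spoken)  # → "Run this:  then stop."
```

Dropping every chunk whose tag appears in a skip list is exactly what skip_aggregator_types does on the service side.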
Text Transforms
For TTS-specific text preprocessing, you can provide custom text transforms that modify text just in time, before it is sent to the TTS service. This is useful for special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. Text transforms can be mapped to a specific text aggregation type, as with skip_aggregator_types, or applied globally to all text using '*' as the type.
Text transforms are registered directly on the TTS service instance via the add_text_transformer() method or during initialization using the text_transforms parameter.
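A transform is just a function from text to text. As an illustration, here is one that spells out long digit runs so a phone number is read digit by digit (the transform itself is hypothetical, not part of Pipecat):

```python
import re

# Illustrative just-in-time transform: spell out runs of four or more
# digits so the TTS reads "5 5 5 1 2 3 4" rather than a large number.
def spell_out_numbers(text: str) -> str:
    return re.sub(r"\d{4,}", lambda m: " ".join(m.group()), text)

print(spell_out_numbers("Call 5551234 now"))  # → "Call 5 5 5 1 2 3 4 now"
```

You would then register it on the service, e.g. via text_transforms={"*": spell_out_numbers} at initialization, per the registration options described above.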
Text transforms are intended to be TTS-specific modifications that do not affect the underlying LLM text or context. That said, since the context aggregator bases its context on what was actually spoken, for services that support word timestamps, such as Cartesia, ElevenLabs, and Rime, these transforms will change the context because they change what is spoken.
Text Filters
Apply preprocessing to text before synthesis:

- MarkdownTextFilter: Strips markdown formatting from LLM responses
- Custom filters: Implement your own text preprocessing logic
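A custom filter in the spirit of MarkdownTextFilter can be sketched with the stdlib alone (a simplified illustration, not Pipecat's implementation):

```python
import re

# Strip the markdown markers an LLM commonly emits so that asterisks,
# backticks, and raw URLs are not read aloud.
def strip_markdown(text: str) -> str:
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # links -> label text
    text = re.sub(r"[*_`#]+", "", text)                    # emphasis, code, headings
    return text

print(strip_markdown("**Bold** and a [link](https://example.com)."))
# → "Bold and a link."
```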
Advanced TTS Features
Direct Speech Commands
Use a TTSSpeakFrame to synthesize specific text immediately.
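A minimal sketch, assuming `task` is your running PipelineTask (the frame import path is from Pipecat; the surrounding async context is up to your application):

```python
from pipecat.frames.frames import TTSSpeakFrame

# Queue an immediate utterance onto the running pipeline task; the frame
# bypasses the LLM and context aggregators and goes straight to the TTS.
await task.queue_frames([TTSSpeakFrame("Hold on while I look that up.")])
```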
Dynamic Settings Updates
Update TTS settings during the conversation.

Key Takeaways
- Pipeline placement matters - TTS must come after LLM, before transport output
- Service types differ - WebSocket services provide lower latency than HTTP
- Text processing affects quality - use aggregation and filters for better results
- Word timestamps enable precision - better interruption handling and context accuracy
- Configuration impacts performance - balance quality, latency, and bandwidth needs
- Services are modular - easily swap providers without changing pipeline code
What’s Next
You’ve now learned how to build a complete voice AI pipeline! Let’s explore some additional topics to enhance your implementation.

Pipeline Termination
Learn how to terminate your voice AI pipeline at the end of a conversation