# Text to Speech

> Learn how to configure speech synthesis to convert text into natural-sounding audio in your voice AI pipeline

**Text to Speech (TTS)** services are responsible for converting text into natural-sounding speech audio. They receive text input from LLMs and other sources, then generate audio output that users can hear through their connected devices.

## Pipeline Placement

TTS processors must be positioned correctly in your pipeline to receive text and generate audio frames:

```python theme={null}
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,                           # Generates LLMTextFrames
    tts,                           # Processes text → creates TTSAudioRawFrames
    transport.output(),            # Sends audio to user
    context_aggregator.assistant(), # Processes TTSTextFrames for context
])
```

**Placement requirements:**

* **After LLM processing**: TTS needs `LLMTextFrame`s from language model responses
* **Before transport output**: Audio must be generated before sending to user
* **Before assistant context aggregator**: Ensures spoken text is captured in conversation history

### Frame Processing Flow

**TTS generates speech through two primary mechanisms:**

1. **Streamed LLM tokens** via `LLMTextFrame`s:

   * By default, TTS aggregates streaming tokens into complete sentences before synthesis (`TextAggregationMode.SENTENCE`)
   * Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency
   * Audio bytes stream back and play immediately
   * End-to-end latency often under 200ms

2. **Direct speech requests** via `TTSSpeakFrame`s:
   * Bypasses LLM for immediate audio generation
   * Optionally appends text to conversation context via `append_to_context` parameter
   * Useful for developer messages, greetings, or injected speech

**Frame output:**

* `TTSAudioRawFrame`s: Raw audio data for playback
* `TTSTextFrame`s: Text that was actually spoken (for context updates)
* `TTSStartedFrame`/`TTSStoppedFrame`: Speech boundary markers
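
The sentence aggregation described above can be illustrated with a small standalone sketch. It does not use Pipecat: `aggregate_sentences` is a hypothetical helper, and the punctuation-based boundary rule is a simplifying assumption rather than the library's actual sentence detection.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for sentence aggregation; not Pipecat's implementation.
SENTENCE_ENDINGS = (".", "!", "?")


def aggregate_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield text each time a sentence ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Assumption: a sentence ends when the buffer ends with ., ! or ?
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
sentences = list(aggregate_sentences(tokens))
print(sentences)
```

In this picture, token mode would correspond to forwarding each token as it arrives instead of buffering, which is where the latency saving comes from.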

## Supported TTS Services

Pipecat supports a wide range of TTS providers with different capabilities and performance characteristics:

<Card title="Supported TTS Services" icon="list" href="/api-reference/server/services/supported-services#text-to-speech">
  View the complete list of supported text-to-speech providers
</Card>

### Service Categories

**WebSocket-Based Services (Recommended):**

* **Cartesia**: Ultra-low latency with word timestamps
* **ElevenLabs**: High-quality voices with emotion control
* **Rime**: Ultra-realistic voices with advanced features

**HTTP-Based Services:**

* **OpenAI TTS**: High-quality synthesis with multiple voices
* **Azure Speech**: Enterprise-grade with extensive language support
* **Google Text-to-Speech**: Reliable with WaveNet voices

**Advanced Features:**

* **Word timestamps**: Enable word-level accuracy for context and subtitles
* **Voice cloning**: Custom voice creation from samples
* **Emotion control**: Dynamic emotional expression
* **SSML support**: Fine-grained pronunciation control

<Note>
  WebSocket services typically provide the lowest latency, while HTTP services
  may have intermittent higher latency due to their request/response nature.
</Note>

## TTS Configuration

### Service-Specific Configuration

Each TTS service has its own configuration options. Here's an example with Cartesia:

```python theme={null}
import os

from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transcriptions.language import Language

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    settings=CartesiaTTSService.Settings(
        model="sonic-3.5",
        voice="voice-id-here",
        language=Language.EN,     # Speech language
    ),
    # Word timestamps automatically enabled for precise context updates
)
```

**Word timestamps:** Services like Cartesia, ElevenLabs, and Rime provide word-level timestamps, which enable precise context updates during interruptions and tighter synchronization with other pipeline components. For example, if the user interrupts while the bot is speaking, the timestamps identify exactly which words were spoken up to that point, so the conversation context can be updated accurately. Transcription events streamed from server to client can also be synchronized with the audio output, enabling real-time subtitles or captions.
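
As a rough illustration of how word timestamps help during an interruption, the sketch below keeps only the words that started before the interruption time. It is a standalone example: `spoken_before` and the `(word, start_time)` tuple format are hypothetical and do not reflect how any particular service reports timestamps.

```python
from typing import List, Tuple

# Hypothetical (word, start_time_seconds) pairs; real services report
# timestamps in their own formats.
WordTimestamp = Tuple[str, float]


def spoken_before(words: List[WordTimestamp], interrupted_at: float) -> str:
    """Return the words whose start time precedes the interruption."""
    return " ".join(word for word, start in words if start < interrupted_at)


timestamps = [
    ("Your", 0.0),
    ("order", 0.25),
    ("ships", 0.6),
    ("on", 0.9),
    ("Friday", 1.1),
]

# The user interrupts 0.75 seconds into the bot's speech
context_text = spoken_before(timestamps, interrupted_at=0.75)
print(context_text)
```

Only the text returned here would be recorded as spoken; the words after the interruption never reached the user.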

<Card title="Individual TTS Services" icon="settings" href="/api-reference/server/services/supported-services#text-to-speech">
  Explore configuration options for each supported TTS provider
</Card>

### Pipeline-Level Audio Configuration

Set consistent audio settings across your entire pipeline:

```python theme={null}
worker = PipelineWorker(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # Input audio quality
        audio_out_sample_rate=24000,  # Output audio quality (TTS)
    ),
)
```

<Tip>
  Set the `audio_out_sample_rate` to match your TTS service's requirements for
  optimal quality. This is preferred over setting `sample_rate` directly on the
  TTS service, because `PipelineParams` ensures that all output sample rates match.
</Tip>
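
To get a feel for what these sample rates mean for bandwidth, the arithmetic can be sketched as follows. This assumes uncompressed 16-bit mono PCM, which is an assumption for illustration; `pcm_bytes_per_second` is a hypothetical helper, and actual transports may encode or compress audio differently.

```python
def pcm_bytes_per_second(
    sample_rate: int, channels: int = 1, bytes_per_sample: int = 2
) -> int:
    """Raw PCM data rate: samples per second x channels x bytes per sample."""
    return sample_rate * channels * bytes_per_sample


# Assumption: 16-bit (2-byte) mono audio in both directions
input_rate = pcm_bytes_per_second(16000)
output_rate = pcm_bytes_per_second(24000)

print(f"input:  {input_rate} bytes/s")
print(f"output: {output_rate} bytes/s")
```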

## Text Processing and Filtering

### Custom Text Aggregation

By default, TTS services use a built-in text aggregator that collects streaming text into sentences before passing them to the underlying service. You can customize this behavior by inserting an [`LLMTextProcessor`](/api-reference/server/utilities/frame/llm-text-processor) with a different text aggregator before the TTS in your pipeline. This lets you categorize and structure text into logical units beyond simple sentences, such as code blocks, URLs, or custom tags. You can then configure the TTS to handle each text type appropriately, for example by skipping code blocks or transforming them just in time before they are spoken.

### Skipping Text Aggregations

To keep certain text aggregations (e.g., code snippets or URLs) from being spoken, use a custom text aggregator like [`PatternPairAggregator`](/api-reference/server/utilities/text/pattern-pair-aggregator) within an [`LLMTextProcessor`](/api-reference/server/utilities/frame/llm-text-processor) and configure it to identify specific patterns in the text stream. Then pass the aggregation types you want to skip (like `"code"`) to the TTS service's `skip_aggregator_types` parameter.

```python theme={null}
import os

# Import paths below reflect recent Pipecat releases; verify against your version
from pipecat.processors.aggregators.llm_text_processor import LLMTextProcessor
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.utils.text.pattern_pair_aggregator import MatchAction, PatternPairAggregator

# Create pattern aggregator
pattern_aggregator = PatternPairAggregator()

# Add a pattern that tags code snippets
pattern_aggregator.add_pattern(
    type="code",
    start_pattern="<code>",
    end_pattern="</code>",
    action=MatchAction.AGGREGATE,
)

# Set the aggregator on an LLMTextProcessor
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)

# Initialize the TTS service and don't speak code snippets
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    # These strings must match the types defined in the PatternPairAggregator
    skip_aggregator_types=["code"],
)

# Add the llm_text_processor to your pipeline after the llm and before the tts:
# llm -> llm_text_processor -> tts
```
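
Conceptually, the combination of pattern tagging and `skip_aggregator_types` behaves something like the standalone sketch below. It is not Pipecat code: `split_by_pattern`, `speakable`, and the `"sentence"` label for untagged text are hypothetical names chosen for illustration.

```python
import re
from typing import List, Tuple

Segment = Tuple[str, str]  # (aggregation type, text)


def split_by_pattern(text: str, type_name: str, start: str, end: str) -> List[Segment]:
    """Label text between start/end tags with type_name, the rest as 'sentence'."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    segments: List[Segment] = []
    cursor = 0
    for match in pattern.finditer(text):
        before = text[cursor:match.start()].strip()
        if before:
            segments.append(("sentence", before))
        segments.append((type_name, match.group(1)))
        cursor = match.end()
    rest = text[cursor:].strip()
    if rest:
        segments.append(("sentence", rest))
    return segments


def speakable(segments: List[Segment], skip_types: List[str]) -> List[str]:
    """Drop segments whose type is listed in skip_types."""
    return [text for kind, text in segments if kind not in skip_types]


llm_output = "Here is the fix: <code>x = 1</code> Try it now."
segments = split_by_pattern(llm_output, "code", "<code>", "</code>")
spoken = speakable(segments, skip_types=["code"])
print(spoken)
```

The tagged segment still exists in the stream, so other processors could use it; it is only excluded from what gets spoken.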

### Text Transforms

For TTS-specific text preprocessing, you can provide custom text transforms that modify text in a just-in-time manner before sending the text off to the TTS service. This is useful for handling special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. These text transforms can be mapped to a specific text aggregation type, like with `skip_aggregator_types`, or applied globally to all text using `'*'` as the type.

Text transforms are registered directly on the TTS service instance via the `add_text_transformer()` method or during initialization using the `text_transforms` parameter.

<Note>
  Text transforms are intended to be TTS-specific modifications that do not
  affect the underlying LLM text or context. That said, the context aggregator
  bases its context on what was actually spoken, so for services that support
  word timestamps, like Cartesia, ElevenLabs, and Rime, these transforms will
  modify the context as they modify what is spoken.
</Note>

```python theme={null}
# Create pattern aggregator
pattern_aggregator = PatternPairAggregator()

# Add a pattern that tags phone numbers
pattern_aggregator.add_pattern(
    type="phone_number",
    start_pattern="<pnum>",
    end_pattern="</pnum>",
    action=MatchAction.AGGREGATE
)

# Set the aggregator on an LLMTextProcessor
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)

# Text-to-Speech service
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

# Text transformers for TTS
# This will insert Cartesia's spell tags around the provided text.
async def spell_out_text(text: str, type: str) -> str:
    # CartesiaTTSService provides a helper for this along with other common transforms
    return CartesiaTTSService.SPELL(text)

async def replace_acronyms(text: str, type: str) -> str:
    # Replace "SEC" with "Southeastern Conference"
    return text.replace(" SEC ", " Southeastern Conference ")

# Setup the text transformers in TTS to spell out phone numbers and replace
# acronyms. The string below matches the type defined in the PatternPairAggregator
# above so that whenever those segments are encountered, this transform
# is applied
tts.add_text_transformer(spell_out_text, "phone_number")
tts.add_text_transformer(replace_acronyms, "*")  # Apply to all text

# add the llm_text_processor to your pipeline after the llm and before the tts
#   llm -> llm_text_processor -> tts
```
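
The way type-specific and global (`'*'`) transforms combine can be sketched without Pipecat as a small registry. `TransformRegistry` is a hypothetical stand-in, and the ordering shown (type-specific transforms first, then global ones) is an assumption made for this example rather than documented library behavior.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Transform = Callable[[str, str], Awaitable[str]]


class TransformRegistry:
    """Hypothetical stand-in for a TTS service's transform table."""

    def __init__(self) -> None:
        self._transforms: Dict[str, List[Transform]] = {}

    def add(self, transform: Transform, type_name: str) -> None:
        self._transforms.setdefault(type_name, []).append(transform)

    async def apply(self, text: str, type_name: str) -> str:
        # Assumption: type-specific transforms run first, then global ones.
        specific = self._transforms.get(type_name, []) if type_name != "*" else []
        for transform in specific + self._transforms.get("*", []):
            text = await transform(text, type_name)
        return text


async def space_out_digits(text: str, type_name: str) -> str:
    # Insert spaces so each character is read individually
    return " ".join(text)


async def expand_acronym(text: str, type_name: str) -> str:
    return text.replace("SEC", "Southeastern Conference")


registry = TransformRegistry()
registry.add(space_out_digits, "phone_number")
registry.add(expand_acronym, "*")  # applies to every aggregation type

spelled = asyncio.run(registry.apply("555", "phone_number"))
expanded = asyncio.run(registry.apply("The SEC schedule", "sentence"))
print(spelled)
print(expanded)
```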

### Text Filters

<Warning>
  Text filters are no longer the preferred method for text preprocessing and
  will be deprecated in future releases. Instead, you should use one of the
  methods described above.
</Warning>

Apply preprocessing to text before synthesis:

```python theme={null}
from pipecat.utils.text.markdown_text_filter import MarkdownTextFilter

tts = YourTTSService(
    # ... other options
    text_filters=[
        MarkdownTextFilter(),     # Remove markdown formatting
        CustomTextFilter(),       # Your custom processing
    ],
)
```

**Common filters:**

* **MarkdownTextFilter**: Strips markdown formatting from LLM responses
* **Custom filters**: Implement your own text preprocessing logic

## Skipping TTS Output

Sometimes you want text from the LLM to flow through the pipeline—updating the conversation context, reaching observers, or being processed by custom frame processors—without being spoken by the TTS service. Pipecat provides a `skip_tts` attribute on text and response frames for this purpose.

When `skip_tts` is `True` on a frame, the TTS service passes it through without generating audio, but the text still reaches downstream processors like the assistant context aggregator.
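
The pass-through behavior can be pictured with a minimal standalone sketch. `TextFrameSketch` and `run_tts_stage` are hypothetical names, not Pipecat classes; they only model the idea that a skipped frame produces no audio but still travels downstream.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextFrameSketch:
    """Hypothetical stand-in for a text frame; not a Pipecat class."""

    text: str
    skip_tts: bool = False


def run_tts_stage(frames: List[TextFrameSketch]) -> Tuple[List[str], List[str]]:
    """Return (texts that would be synthesized, texts forwarded downstream)."""
    synthesized: List[str] = []
    forwarded: List[str] = []
    for frame in frames:
        if not frame.skip_tts:
            synthesized.append(frame.text)  # would become audio
        forwarded.append(frame.text)  # downstream processors still see it
    return synthesized, forwarded


frames = [
    TextFrameSketch("Sure, one moment."),
    TextFrameSketch("✓", skip_tts=True),
]
synthesized, forwarded = run_tts_stage(frames)
print(synthesized)
print(forwarded)
```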

### Configuring All LLM Output

Use `LLMConfigureOutputFrame` to tell the LLM service to mark **all** subsequent output frames (`LLMTextFrame`, `LLMFullResponseStartFrame`, `LLMFullResponseEndFrame`) with `skip_tts`:

```python theme={null}
from pipecat.frames.frames import LLMConfigureOutputFrame

# Tell the LLM to skip TTS for all output
await worker.queue_frame(LLMConfigureOutputFrame(skip_tts=True))

# ... LLM responses will not be spoken ...

# Re-enable TTS
await worker.queue_frame(LLMConfigureOutputFrame(skip_tts=False))
```

This is useful when you want to toggle TTS on or off for an entire stretch of conversation, such as switching between voice and text input modes.

### Setting skip\_tts on Individual Frames

For more granular control, set `skip_tts=True` directly on individual text frames. This is useful when building custom frame processors that selectively silence certain parts of the LLM output:

```python theme={null}
from pipecat.frames.frames import LLMTextFrame

# In a custom frame processor
frame = LLMTextFrame(text)
frame.skip_tts = True
await self.push_frame(frame)
```

<Note>
  The `skip_tts` attribute is available on `TextFrame` and all its subclasses
  (`LLMTextFrame`, `AggregatedTextFrame`, `TTSTextFrame`, etc.), as well as
  `LLMFullResponseStartFrame` and `LLMFullResponseEndFrame`.
</Note>

### Common Use Cases

**Encoding structured output from the LLM.** You can instruct the LLM to include markers or metadata in its response that should be processed by pipeline logic but not spoken. For example, Pipecat's [turn completion detection](/api-reference/server/utilities/turn-management/filter-incomplete-turns) uses this approach — the LLM outputs completion markers (`✓`, `○`, `◐`) that are pushed with `skip_tts=True` so they update the context but aren't spoken.

**Switching between voice and text input.** When a client sends text input instead of speech, you may want the bot to respond with text only. The client SDKs support this via `sendText()` with `audio_response: false`, which uses `LLMConfigureOutputFrame` internally to temporarily disable TTS for that response.

**Testing without audio.** When building test pipelines, you can use `LLMConfigureOutputFrame(skip_tts=True)` to bypass audio generation entirely while still exercising the rest of the pipeline.

## Advanced TTS Features

### Direct Speech Commands

Use `TTSSpeakFrame` for immediate speech synthesis:

```python theme={null}
from pipecat.frames.frames import TTSSpeakFrame

# Make bot speak directly (added to context by default)
await tts.queue_frame(TTSSpeakFrame("Hello, how can I help you?"))

# Explicitly append spoken text to conversation context
await tts.queue_frame(
    TTSSpeakFrame("Welcome! Let's begin.", append_to_context=True)
)

# Speak without adding to context
await tts.queue_frame(
    TTSSpeakFrame("Processing...", append_to_context=False)
)
```

The `append_to_context` parameter controls whether the spoken text is added to the conversation history. When `append_to_context=True`, the text is automatically committed to the context after being spoken, making it useful for bot greetings and injected speech that should be part of the conversation flow.

<Note>
  As of Pipecat v1.4.0, `append_to_context` defaults to `True`. A plain
  `TTSSpeakFrame("...")` **is** added to the conversation context after it is
  spoken; pass `append_to_context=False` to speak without recording it. (`None`
  was the previous default and is no longer supported.)
</Note>

### Dynamic Settings Updates

Update TTS settings during conversation using typed settings objects:

```python theme={null}
from pipecat.frames.frames import TTSSpeakFrame, TTSUpdateSettingsFrame
from pipecat.services.cartesia.tts import CartesiaTTSSettings

# Change voice speed during conversation
await worker.queue_frames([
    TTSUpdateSettingsFrame(delta=CartesiaTTSSettings(speed="fast")),
    TTSSpeakFrame("I'm speaking faster now!")
])
```

## Key Takeaways

* **Pipeline placement matters** - TTS must come after LLM, before transport output
* **Service types differ** - WebSocket services provide lower latency than HTTP
* **Text processing affects quality** - use custom aggregation and text transforms for better results
* **Word timestamps enable precision** - better interruption handling and context accuracy
* **Configuration impacts performance** - balance quality, latency, and bandwidth needs
* **Services are modular** - easily swap providers without changing pipeline code

## What's Next

You've now learned how to build a complete voice AI pipeline! Let's explore some additional topics to enhance your implementation.

<Card title="Pipeline Termination" icon="arrow-right" href="/pipecat/learn/pipeline-termination">
  Learn how to terminate your voice AI pipeline at the end of a conversation
</Card>