Pipeline Placement
TTS processors must be positioned correctly in your pipeline to receive text and generate audio frames:

- After LLM processing: TTS needs LLMTextFrames from language model responses
- Before transport output: Audio must be generated before it is sent to the user
- Before the assistant context aggregator: Ensures spoken text is captured in conversation history
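A typical ordering can be sketched as follows. This is illustrative only: the service variables (`transport`, `stt`, `llm`, `tts`, `context_aggregator`) stand in for whatever providers you construct, and import paths may differ by Pipecat version.

```python
from pipecat.pipeline.pipeline import Pipeline

# Illustrative ordering only; the processor instances are assumed to be
# constructed elsewhere for your chosen providers.
pipeline = Pipeline([
    transport.input(),               # audio in from the user
    stt,                             # speech-to-text
    context_aggregator.user(),       # add user turns to the context
    llm,                             # emits LLMTextFrames
    tts,                             # after the LLM: consumes LLMTextFrames
    transport.output(),              # before output: audio must exist to send
    context_aggregator.assistant(),  # after TTS: records what was spoken
])
```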
Frame Processing Flow
TTS generates speech through two primary mechanisms:

- Streamed LLM tokens via LLMTextFrames:
  - TTS aggregates streaming tokens into complete sentences
  - Sentences are sent to the TTS service for audio generation
  - Audio bytes stream back and play immediately
  - End-to-end latency is often under 200 ms
- Direct speech requests via TTSSpeakFrames:
  - Bypasses the LLM and context aggregators
  - Immediate audio generation for specific text
  - Useful for system messages or prompts
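The first mechanism, aggregating streamed tokens into sentences before synthesis, can be sketched in plain Python (a simplified stand-in for the aggregation a TTS service performs internally):

```python
import re

def aggregate_sentences(tokens):
    """Collect streamed tokens into a buffer, yielding each complete
    sentence as soon as a terminator (. ! ?) arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Split off any complete sentence accumulated so far.
        while (m := re.search(r"[.!?](\s|$)", buffer)):
            sentence, buffer = buffer[:m.end()].strip(), buffer[m.end():]
            yield sentence
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hel", "lo the", "re! How", " can I hel", "p you today?"]
print(list(aggregate_sentences(tokens)))
# → ['Hello there!', 'How can I help you today?']
```

Yielding each sentence as soon as it completes is what lets audio playback begin while the LLM is still generating the rest of the response.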
During synthesis, TTS services emit several frame types downstream:

- TTSAudioRawFrames: Raw audio data for playback
- TTSTextFrames: Text that was actually spoken (for context updates)
- TTSStartedFrame/TTSStoppedFrame: Speech boundary markers
Supported TTS Services
Pipecat supports a wide range of TTS providers with different capabilities and performance characteristics. See Supported TTS Services in the docs for the complete list of supported text-to-speech providers.
Service Categories
WebSocket-Based Services (Recommended):

- Cartesia: Ultra-low latency with word timestamps
- ElevenLabs: High-quality voices with emotion control
- Rime: Ultra-realistic voices with advanced features
- OpenAI TTS: High-quality synthesis with multiple voices
- Azure Speech: Enterprise-grade with extensive language support
- Google Text-to-Speech: Reliable with WaveNet voices
Advanced capabilities vary by provider:

- Word timestamps: Enable word-level accuracy for context and subtitles
- Voice cloning: Custom voice creation from samples
- Emotion control: Dynamic emotional expression
- SSML support: Fine-grained pronunciation control
WebSocket services typically provide the lowest latency, while HTTP services may exhibit higher, more variable latency due to their request/response nature.
TTS Configuration
Service-Specific Configuration
Each TTS service has its own configuration options; Cartesia, for example, exposes voice, model, and sample-rate settings. See Individual TTS Services in the docs to explore configuration options for each supported TTS provider.
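A Cartesia setup might look like the following sketch. The import path, model name, and voice ID are assumptions and vary by Pipecat version and account; check the service reference before using them.

```python
import os

# Assumed import path; older Pipecat versions may use pipecat.services.cartesia.
from pipecat.services.cartesia.tts import CartesiaTTSService

# Illustrative configuration; parameter names may differ by version.
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",   # placeholder: choose a voice from Cartesia
    model="sonic-2",            # assumed model name
    sample_rate=24000,
)
```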
Pipeline-Level Audio Configuration
Set consistent audio settings across your entire pipeline.

Text Processing and Filtering
Custom Text Aggregation
By default, TTS services have a built-in text aggregator that collects streaming text into sentences before passing them to the underlying service. You can customize this behavior by inserting an LLMTextProcessor with a different text aggregator before the TTS in your pipeline. This lets you group text into logical units beyond simple sentences, such as code blocks, URLs, or custom tags, and then configure the TTS to handle each type appropriately, for example skipping code blocks or transforming them just in time before speaking.
Skipping Text Aggregations
To skip certain text aggregations (e.g., code snippets or URLs) and keep them from being spoken, use a custom text aggregator like PatternPairAggregator within an LLMTextProcessor, and configure it to identify and handle specific patterns in the text stream. You can then pass any aggregated types you want to skip (like "code") to the TTS service's skip_aggregator_types parameter.
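The idea behind pattern-pair aggregation can be shown with a stdlib-only stand-in (this is not Pipecat's PatternPairAggregator, just a simplified illustration of tagging text between paired markers so a later stage can skip it):

```python
import re

# Minimal stand-in for a pattern-pair aggregator: text between a start and
# end marker is tagged with a type (here "code") so a later TTS stage can
# skip or transform it; everything else is tagged "text".
def split_by_pattern(text, start="<code>", end="</code>", tag="code"):
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    parts, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()]))
        parts.append((tag, m.group(1)))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts

chunks = split_by_pattern("Run this: <code>print('hi')</code> then stop.")
spoken = "".join(body for kind, body in chunks if kind != "code")
print(spoken)  # → "Run this:  then stop."
```

Dropping every chunk whose tag appears in a skip list is exactly what skip_aggregator_types does on the service side.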
Text Transforms
For TTS-specific text preprocessing, you can provide custom text transforms that modify text just in time, before it is sent to the TTS service. This is useful for special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. Text transforms can be mapped to a specific text aggregation type, as with skip_aggregator_types, or applied globally to all text using '*' as the type.
Text transforms are registered directly on the TTS service instance via the add_text_transformer() method or during initialization using the text_transforms parameter.
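A transform is just a function from text to text. As an illustration, here is one that spells out long digit runs so a phone number is read digit by digit (the transform itself is hypothetical, not part of Pipecat):

```python
import re

# Illustrative just-in-time transform: spell out runs of four or more
# digits so the TTS reads "5 5 5 1 2 3 4" rather than a large number.
def spell_out_numbers(text: str) -> str:
    return re.sub(r"\d{4,}", lambda m: " ".join(m.group()), text)

print(spell_out_numbers("Call 5551234 now"))  # → "Call 5 5 5 1 2 3 4 now"
```

You would then register it on the service, e.g. via text_transforms={"*": spell_out_numbers} at initialization, per the registration options described above.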
Text transforms are intended to be TTS-specific modifications that do not affect the underlying LLM text or context. That said, since the context aggregator bases its context on what was actually spoken, for services that support word timestamps, such as Cartesia, ElevenLabs, and Rime, these transforms will change the context because they change what is spoken.
Text Filters
Apply preprocessing to text before synthesis:

- MarkdownTextFilter: Strips markdown formatting from LLM responses
- Custom filters: Implement your own text preprocessing logic
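A custom filter in the spirit of MarkdownTextFilter can be sketched with the stdlib alone (a simplified illustration, not Pipecat's implementation):

```python
import re

# Strip the markdown markers an LLM commonly emits so that asterisks,
# backticks, and raw URLs are not read aloud.
def strip_markdown(text: str) -> str:
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # links -> label text
    text = re.sub(r"[*_`#]+", "", text)                    # emphasis, code, headings
    return text

print(strip_markdown("**Bold** and a [link](https://example.com)."))
# → "Bold and a link."
```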
Advanced TTS Features
Direct Speech Commands
Use a TTSSpeakFrame to synthesize specific text immediately.
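A minimal sketch, assuming `task` is your running PipelineTask (the frame import path is from Pipecat; the surrounding async context is up to your application):

```python
from pipecat.frames.frames import TTSSpeakFrame

# Queue an immediate utterance onto the running pipeline task; the frame
# bypasses the LLM and context aggregators and goes straight to the TTS.
await task.queue_frames([TTSSpeakFrame("Hold on while I look that up.")])
```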
Dynamic Settings Updates
Update TTS settings during the conversation.

Key Takeaways
- Pipeline placement matters - TTS must come after LLM, before transport output
- Service types differ - WebSocket services provide lower latency than HTTP
- Text processing affects quality - use aggregation and filters for better results
- Word timestamps enable precision - better interruption handling and context accuracy
- Configuration impacts performance - balance quality, latency, and bandwidth needs
- Services are modular - easily swap providers without changing pipeline code
What’s Next
You’ve now learned how to build a complete voice AI pipeline! Let’s explore some additional topics to enhance your implementation.

Pipeline Termination
Learn how to terminate your voice AI pipeline at the end of a conversation