RTVI (Real-Time Voice Interaction)
Build real-time voice and multimodal applications with Pipecat’s RTVI protocol
Pipecat’s client SDKs and server implement the RTVI (Real-Time Voice Interaction) standard for real-time voice and multimodal applications. RTVI provides a common protocol for handling voice, text, and other multimodal interactions between clients and servers.
- Speaking States - Track when users and bots start and stop speaking for natural turn-taking
- Transcription - Handle real-time transcriptions from both users and bots
- LLM Processing - Manage LLM responses and function calls with proper client notifications
- TTS Management - Control text-to-speech state and audio delivery
How It Works
RTVI uses a pipeline of specialized processors to convert internal Pipecat frames into standardized messages that clients can understand.
Each processor handles a specific aspect of the conversation:
- Speaking State - Tracks when users and bots are speaking
- Transcription - Converts speech to text in real time
- LLM - Appends to or replaces the LLM context
- Metrics - Collects performance data
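The frame-to-message conversion described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Pipecat's implementation: the frame classes and the `to_client_message` helper are hypothetical stand-ins, and the `label`/`type`/`data` envelope and the type strings are assumptions about the message shape.

```python
import json
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for internal frames; these are not Pipecat's classes.
@dataclass
class UserStartedSpeaking:
    pass


@dataclass
class UserTranscription:
    text: str
    final: bool


def to_client_message(frame: object) -> Optional[dict]:
    """Map an internal frame to a client-facing message, or None to skip it."""
    if isinstance(frame, UserStartedSpeaking):
        return {"label": "rtvi-ai", "type": "user-started-speaking"}
    if isinstance(frame, UserTranscription):
        return {
            "label": "rtvi-ai",
            "type": "user-transcription",
            "data": {"text": frame.text, "final": frame.final},
        }
    # Frames with no client-facing meaning produce no message.
    return None


frames = [UserStartedSpeaking(), UserTranscription("hello", True)]
messages = [to_client_message(f) for f in frames]
for message in messages:
    print(json.dumps(message))
```

Keeping the mapping in one small function per concern mirrors the idea of one processor per aspect of the conversation: each one recognizes only the frames it cares about and ignores the rest.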
Basic Example
Here’s a simple example showing how to set up RTVI processors:
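The sketch below is a stand-in written under stated assumptions rather than Pipecat's actual API: `Pipeline`, `SpeakingProcessor`, and `UserTranscriptionProcessor` are hypothetical, simplified classes, and frames are plain dicts with a `kind` key. It shows the general pattern of placing several single-purpose processors in a pipeline so that each frame is offered to every processor and any resulting client messages are collected in order.

```python
from typing import Callable, Dict, List

Frame = Dict[str, object]      # hypothetical frame representation
Message = Dict[str, object]    # hypothetical client message
Send = Callable[[Message], None]


class SpeakingProcessor:
    """Emits a message when the user starts or stops speaking."""

    def handle(self, frame: Frame, send: Send) -> None:
        if frame["kind"] in ("user-started-speaking", "user-stopped-speaking"):
            send({"type": frame["kind"]})


class UserTranscriptionProcessor:
    """Emits a message for each user transcription frame."""

    def handle(self, frame: Frame, send: Send) -> None:
        if frame["kind"] == "transcription":
            send({"type": "user-transcription", "data": {"text": frame["text"]}})


class Pipeline:
    """Offers every frame to each processor and collects outgoing messages."""

    def __init__(self, processors: List[object]) -> None:
        self.processors = processors
        self.outbox: List[Message] = []

    def push(self, frame: Frame) -> None:
        for processor in self.processors:
            processor.handle(frame, self.outbox.append)


pipeline = Pipeline([SpeakingProcessor(), UserTranscriptionProcessor()])
for frame in (
    {"kind": "user-started-speaking"},
    {"kind": "transcription", "text": "hello"},
    {"kind": "user-stopped-speaking"},
):
    pipeline.push(frame)

for message in pipeline.outbox:
    print(message)
```

In a real application the outbox would be replaced by the transport that delivers messages to the client.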
Key Components
RTVIProcessor
The main coordinator that manages:
- Client communication
- Service configuration
- Action execution
- Function calls
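One way to picture the coordinating role is a small dispatcher that routes client requests to registered handlers. The following is only a sketch under assumptions: the `Coordinator` class, its `register_action` and `handle_message` methods, and the `action`/`action-response`/`error-response` message types are hypothetical names chosen for illustration, not RTVIProcessor's actual interface.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Any]


class Coordinator:
    """Hypothetical coordinator that routes client action requests to handlers."""

    def __init__(self) -> None:
        self._actions: Dict[str, Handler] = {}

    def register_action(self, name: str, handler: Handler) -> None:
        self._actions[name] = handler

    def handle_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
        """Run the requested action and build a reply that echoes the request id."""
        data = message.get("data", {})
        handler = self._actions.get(data.get("action"))
        if message.get("type") != "action" or handler is None:
            return {
                "id": message.get("id"),
                "type": "error-response",
                "data": {"error": "unsupported request"},
            }
        result = handler(data.get("arguments", {}))
        return {
            "id": message.get("id"),
            "type": "action-response",
            "data": {"result": result},
        }


coordinator = Coordinator()
coordinator.register_action("say", lambda args: f"speaking: {args['text']}")

reply = coordinator.handle_message(
    {"id": "1", "type": "action", "data": {"action": "say", "arguments": {"text": "hello"}}}
)
print(reply)
```

Echoing the request `id` in the reply lets a client match each response to the request that triggered it.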
Learn more about RTVIProcessor →
Frame Processors
Specialized processors that handle different aspects of the conversation:
- RTVISpeakingProcessor - Speaking state changes
- RTVIUserTranscriptionProcessor - User speech transcription
- RTVIBotTranscriptionProcessor - Bot speech transcription
- RTVIBotLLMProcessor - Language model responses
- RTVIBotTTSProcessor - Text-to-speech processing
- RTVIMetricsProcessor - Performance metrics
Learn more about Frame Processors →