Pipecat is an open source Python framework for building voice and multimodal AI bots that can see, hear, and speak in real time. The framework orchestrates AI services, network transports, and audio processing to enable ultra-low latency conversations that feel natural and responsive. You can build everything from simple voice assistants to complex multimodal applications that combine audio, video, images, and text.

Want to dive right in? Check out the Quickstart example to run your first Pipecat application.
Pipecat orchestrates AI services in a pipeline: a series of processors that handle real-time audio, text, and video frames with ultra-low latency. Here's what happens in a typical voice conversation (a minimal code sketch follows the list):
1. Transport receives audio from the user (browser, phone, etc.)
2. Speech Recognition converts speech to text in real time
3. LLM generates intelligent responses based on context
4. Speech Synthesis converts responses back to natural speech
5. Transport streams audio back to the user
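
To make the flow concrete, here is a minimal sketch of how these five steps map onto a Pipecat pipeline. It assumes Deepgram for speech recognition, OpenAI for the LLM, Cartesia for speech synthesis, and a Daily WebRTC transport; module paths and constructor arguments can differ between Pipecat releases, so treat it as an illustration of the structure rather than copy-paste code (the Quickstart has a current, runnable version).

```python
# A minimal sketch of the pipeline above, assuming Deepgram STT, an OpenAI LLM,
# Cartesia TTS, and a Daily WebRTC transport. Module paths and constructor
# arguments vary between Pipecat releases; see the Quickstart for a current,
# runnable example.
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    # 1. Transport: receives user audio and streams bot audio back (step 5).
    #    The env vars and room URL here are placeholders for this sketch.
    transport = DailyTransport(
        os.environ["DAILY_ROOM_URL"],
        None,
        "Voice Bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # 2. Speech recognition: audio frames in, text frames out
    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])

    # 3. LLM: conversation context in, response text out
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")

    # 4. Speech synthesis: response text in, audio frames out
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="REPLACE_WITH_A_VOICE_ID",
    )

    # Context aggregators keep the running conversation history for the LLM.
    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a helpful voice assistant."}]
    )
    context_aggregator = llm.create_context_aggregator(context)

    # Frames flow top to bottom, mirroring the numbered steps above.
    pipeline = Pipeline([
        transport.input(),               # 1. audio in from the user
        stt,                             # 2. speech -> text
        context_aggregator.user(),       #    add user text to the context
        llm,                             # 3. context -> response text
        tts,                             # 4. response text -> speech
        transport.output(),              # 5. audio out to the user
        context_aggregator.assistant(),  #    record what the bot said
    ])

    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```

Because each step is an independent processor, swapping providers (a different STT or TTS service, for example) only changes which processors you construct; the shape of the pipeline stays the same.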
In most cases, the entire round-trip interaction completes in 500-800 ms, creating a natural conversation experience for the user.

The diagram below shows a typical voice assistant pipeline, where each step happens in real time: