Core Concepts
Understanding how Pipecat works: frames, processors, and pipelines.
Pipecat uses a pipeline-based architecture to handle real-time AI processing. Let’s look at how this works in practice, then break down the key components.
Real-time Processing in Action
Consider how a voice assistant processes a user’s question and generates a response.
Instead of waiting for complete responses at each step, Pipecat processes data in small units called frames that flow through the pipeline:
- Speech is transcribed in real-time as the user speaks
- Transcription is sent to the LLM as it becomes available
- LLM responses are processed as they stream in
- Text-to-speech begins synthesizing audio for early sentences while later ones are still being generated
- Audio playback starts as soon as the first sentence is ready
- LLM context is aggregated and updated continuously throughout the conversation
This streaming approach creates natural, responsive interactions.
Architecture Overview
Here’s how Pipecat organizes these processes.
The architecture consists of three key components:
1. Frames
Frames are containers for data moving through your application. Think of them like packages on a conveyor belt: each one carries a specific type of cargo. For example:
- Audio data from a microphone
- Text from transcription
- LLM responses
- Generated speech audio
- Images or video
- Control signals and system messages
Frames can flow in two directions:
- Downstream (normal processing flow)
- Upstream (for errors and control signals)
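In code, a frame is just a small typed data object. Here’s a minimal sketch of creating frames, assuming Pipecat’s Python API (module paths and frame fields reflect recent releases and may differ in your version):

```python
from pipecat.frames.frames import TextFrame, TranscriptionFrame

# A text frame, e.g. a chunk of streamed LLM output
greeting = TextFrame(text="Hello! How can I help you today?")

# A transcription frame, e.g. emitted by a speech-to-text service
transcript = TranscriptionFrame(
    text="What's the weather like?",
    user_id="user-1",
    timestamp="2024-01-01T00:00:00Z",
)

print(greeting.text, transcript.text)
```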
2. Processors (Services)
Processors are the workers along that conveyor belt. Each one:
- Receives frames as inputs
- Processes specific frame types
- Passes through frames it doesn’t handle
- Generates new frames as output
Frame processors can do anything, but for real-time, multimodal AI applications they commonly include:
- A speech-to-text processor that receives raw audio input frames and outputs transcription frames
- An LLM processor that takes context frames and produces streamed text frames
- A text-to-speech processor that receives text frames and generates raw audio output frames
- An image generation processor that takes text frames and outputs image URL frames
- A logging processor that watches all frames without modifying them
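In Pipecat, these are subclasses of `FrameProcessor`. As a hedged sketch of the pattern (class and method names match Pipecat’s public interface, but check your version), here’s a logging processor that watches text frames and passes everything through untouched:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TextLogger(FrameProcessor):
    """Observes TextFrames without modifying or consuming them."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        # Let the base class handle lifecycle/system frames first
        await super().process_frame(frame, direction)

        if isinstance(frame, TextFrame):
            print(f"Saw text: {frame.text!r}")

        # Pass every frame along unchanged, preserving its direction
        await self.push_frame(frame, direction)
```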
3. Pipelines
Pipelines connect processors together, creating a path for frames to flow through your application. They can be simple linear chains or more complex arrangements with parallel branches.
The pipeline also contains the transport: the connection to the real world (e.g., a microphone and speakers).
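Building one is a matter of listing processors in order. A hedged sketch follows, where `stt`, `llm`, `tts`, and `transport` are placeholders for whatever services and transport your app configures (e.g., Deepgram, OpenAI, Cartesia, Daily); the `Pipeline` class and `transport.input()`/`transport.output()` calls follow Pipecat’s API:

```python
from pipecat.pipeline.pipeline import Pipeline

# Order matters: frames flow from top to bottom.
pipeline = Pipeline([
    transport.input(),   # audio frames in from the microphone
    stt,                 # speech-to-text -> transcription frames
    llm,                 # context -> streamed text frames
    tts,                 # text -> raw audio frames
    transport.output(),  # audio frames out to the speakers
])
```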
How It All Works Together
Let’s see how these components handle a simple voice interaction:
1. Input
- User speaks into their microphone
- Transport converts audio into frames
- Frames enter the pipeline
2. Processing
- Transcription processor converts speech to text frames
- LLM processor takes text frames, generates response frames
- TTS processor converts response frames to audio frames
- Error frames flow upstream if issues occur
- System frames can bypass normal processing for immediate handling
3. Output
- Audio frames reach the transport
- Transport plays the audio for the user
This happens continuously and in parallel, creating smooth, real-time interactions.
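To round out the picture, here’s a sketch of actually running a pipeline, assuming Pipecat’s `PipelineTask` and `PipelineRunner` classes (with `pipeline` built as in the earlier sketch):

```python
import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


async def main():
    task = PipelineTask(pipeline)  # wraps the pipeline built above
    runner = PipelineRunner()
    await runner.run(task)         # processes frames until the task ends


if __name__ == "__main__":
    asyncio.run(main())
```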
Next Steps
Now that you understand the big picture, the next step is to install and run your first Pipecat application. Check out the Installation & Setup guide to get started.
Need help? Join our Discord community for support and discussions.