Pipecat uses a pipeline-based architecture to handle real-time AI processing. Let’s look at how this works in practice, then break down the key components.

Real-time Processing in Action

Consider how a voice assistant processes a user’s question and generates a response.

Instead of waiting for complete responses at each step, Pipecat processes data in small units called frames that flow through the pipeline:

  1. Speech is transcribed in real time as the user speaks
  2. Transcription is sent to the LLM as it becomes available
  3. LLM responses are processed as they stream in
  4. Text-to-speech begins synthesizing audio for early sentences while later ones are still being generated
  5. Audio playback starts as soon as the first sentence is ready
  6. LLM context is aggregated and updated continuously in real time

This streaming approach creates natural, responsive interactions.
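
To see what this buys you, here’s a toy asyncio sketch of the idea (plain Python, not Pipecat code): a sentence aggregator hands each completed sentence to a mock TTS step as soon as it’s ready, instead of waiting for the full LLM response.

# Toy illustration of streaming (plain asyncio, not Pipecat code)
import asyncio

async def fake_llm_tokens():
    # Stand-in for a streaming LLM: yields small text chunks over time.
    for chunk in ["Sure! ", "Paris is ", "the capital. ", "It sits ", "on the Seine. "]:
        await asyncio.sleep(0.1)
        yield chunk

async def sentences(token_stream):
    # Aggregate chunks and emit each sentence as soon as it is complete.
    buffer = ""
    async for chunk in token_stream:
        buffer += chunk
        while "." in buffer:
            sentence, _, buffer = buffer.partition(".")
            yield sentence.strip() + "."

async def main():
    # "TTS" starts on the first sentence while later chunks are still arriving.
    async for sentence in sentences(fake_llm_tokens()):
        print(f"Synthesizing: {sentence!r}")

asyncio.run(main())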

Architecture Overview

Here’s how Pipecat organizes these processes into three key components:

1. Frames

Frames are containers for data moving through your application. Think of them like packages on a conveyor belt, each carrying a specific type of cargo. For example:

  • Audio data from a microphone
  • Text from transcription
  • LLM responses
  • Generated speech audio
  • Images or video
  • Control signals and system messages

Frames can flow in two directions:

  • Downstream (normal processing flow)
  • Upstream (for errors and control signals)
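
In code, frames are small typed data objects. As a rough sketch, assuming frame classes along the lines of TextFrame and ErrorFrame from pipecat.frames.frames (exact class names and fields vary between Pipecat versions):

# Sketch only: frame names and fields may differ between Pipecat versions
from pipecat.frames.frames import ErrorFrame, TextFrame

text_frame = TextFrame(text="Hello there!")              # data flowing downstream
error_frame = ErrorFrame(error="TTS request timed out")  # typically pushed upstream

print(text_frame.text)
print(error_frame.error)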

2. Processors (Services)

Processors are workers along our conveyor belt. Each one:

  • Receives frames as inputs
  • Processes specific frame types
  • Passes through frames it doesn’t handle
  • Generates new frames as output

Frame processors can do anything, but for real-time, multimodal AI applications, they commonly include:

  • A speech-to-text processor that receives raw audio input frames and outputs transcription frames
  • An LLM processor that takes context frames and produces text frames
  • A text-to-speech processor that receives text frames and generates raw audio output frames
  • An image generation processor that takes in text frames and outputs an image URL frame
  • A logging processor that watches all frames without modifying them
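
Here’s a hedged sketch of what a minimal custom processor can look like, assuming Pipecat’s FrameProcessor base class and its push_frame API (details may differ between versions): it uppercases text frames and passes everything else through untouched.

# Sketch of a custom processor; API details may vary between Pipecat versions
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class UppercaseTextProcessor(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TextFrame):
            # Handle the frame type we care about: emit a new text frame.
            await self.push_frame(TextFrame(text=frame.text.upper()), direction)
        else:
            # Pass through frames this processor doesn't handle.
            await self.push_frame(frame, direction)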

3. Pipelines

Pipelines connect processors together, creating a path for frames to flow through your application. They can be simple linear chains or include parallel branches:

# Simple linear pipeline
pipeline = Pipeline([
    transport.input(),   # Speech   -> Audio
    transcriber,         # Audio    -> Text
    llm_processor,       # Text     -> Response
    tts_service,         # Response -> Audio
    transport.output()   # Audio    -> Playback
])

# Complex parallel pipeline
pipeline = Pipeline([
    input_source,
    ParallelPipeline(
        [image_processor, image_output],  # Handle images
        [audio_processor, audio_output]   # Handle audio
    )
])

The pipeline also contains the transport, which is the connection to the real world (e.g., microphone, speakers).
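
A pipeline doesn’t process anything until it’s run. As a minimal sketch, assuming Pipecat’s PipelineTask and PipelineRunner (constructor options vary between versions), running the simple pipeline above looks roughly like this:

# Sketch of running the pipeline defined above; parameters vary by version
import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    task = PipelineTask(pipeline)  # wraps the pipeline so it can be managed
    runner = PipelineRunner()      # drives the task until the pipeline ends
    await runner.run(task)

asyncio.run(main())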

How It All Works Together

Let’s see how these components handle a simple voice interaction:

  1. Input

    • User speaks into their microphone
    • Transport converts audio into frames
    • Frames enter the pipeline
  2. Processing

    • Transcription processor converts speech to text frames
    • LLM processor takes text frames, generates response frames
    • TTS processor converts response frames to audio frames
    • Error frames flow upstream if issues occur (see the sketch after these steps)
    • System frames can bypass normal processing for immediate handling
  3. Output

    • Audio frames reach the transport
    • Transport plays the audio for the user

This happens continuously and in parallel, creating smooth, real-time interactions.
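
The error handling mentioned in step 2 can be sketched as well. Assuming Pipecat’s ErrorFrame and FrameDirection.UPSTREAM (names may differ between versions), a processor can report a failure upstream instead of silently dropping frames; the _call_external_api helper below is a hypothetical placeholder for any call that might fail.

# Sketch only: reporting a problem upstream from inside a processor
from pipecat.frames.frames import ErrorFrame, Frame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class ExternalAPIProcessor(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        try:
            await self._call_external_api(frame)     # hypothetical call that may fail
            await self.push_frame(frame, direction)  # normal downstream flow
        except Exception as e:
            # Errors flow upstream so earlier stages and the transport can react.
            await self.push_frame(ErrorFrame(error=str(e)), FrameDirection.UPSTREAM)

    async def _call_external_api(self, frame: Frame):
        ...  # placeholder: stands in for any external call that might raise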

Next Steps

Now that you understand the big picture, the next step is to install and run your first Pipecat application. Check out the Installation & Setup guide to get started.

Need help? Join our Discord community for support and discussions.