Context Summarization

Overview

In long-running voice AI conversations, context grows with every exchange. This increases token usage, raises costs, and can eventually hit context window limits. Pipecat includes built-in context summarization that automatically compresses older conversation history while preserving recent messages and important context.

How It Works

Context summarization automatically triggers when either condition is met:

Token limit reached: Context size exceeds max_context_tokens (estimated using ~4 characters per token)
Message count reached: Number of new messages exceeds max_unsummarized_messages

When triggered, the system:

Sends a LLMContextSummaryRequestFrame to the LLM service
The LLM generates a concise summary of older messages
Context is reconstructed as: [system_message] + [summary] + [recent_messages]
Incomplete function call sequences and recent messages are preserved

Context summarization is asynchronous and happens in the background without blocking the pipeline. The system uses request IDs to match summary requests with results and handles interruptions gracefully.

Enabling Context Summarization

Enable summarization by setting enable_auto_context_summarization=True in LLMAssistantAggregatorParams:

from pipecat.processors.aggregators.llm_response_universal import (
    LLMAssistantAggregatorParams,
    LLMContextAggregatorPair,
)

# Create aggregators with summarization enabled
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    assistant_params=LLMAssistantAggregatorParams(
        enable_auto_context_summarization=True,
    ),
)

With the default configuration, summarization triggers at 8000 estimated tokens or after 20 new messages, whichever comes first.

Customizing Behavior

Use LLMAutoContextSummarizationConfig and LLMContextSummaryConfig to tune the summarization triggers and output:

from pipecat.utils.context.llm_context_summarization import (
    LLMAutoContextSummarizationConfig,
    LLMContextSummaryConfig,
)

user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    assistant_params=LLMAssistantAggregatorParams(
        enable_auto_context_summarization=True,
        auto_context_summarization_config=LLMAutoContextSummarizationConfig(
            max_context_tokens=4000,           # Trigger at 4000 tokens
            max_unsummarized_messages=10,      # Or trigger after 10 new messages
            summary_config=LLMContextSummaryConfig(
                target_context_tokens=3000,    # Target summary size
                min_messages_after_summary=2,  # Keep last 2 messages uncompressed
            ),
        ),
    ),
)

See the reference page for all available configuration parameters.

What Gets Preserved

Context summarization intelligently preserves:

System messages: The first system message (defining assistant behavior) is always kept
Recent messages: The last N messages stay uncompressed (configured by min_messages_after_summary)
Function call sequences: Incomplete function call/result pairs are not split during summarization

Custom Summarization Prompts

You can override the default summarization prompt to control how the LLM generates summaries:

custom_prompt = """Summarize this conversation concisely.
Focus on: key decisions, user preferences, and action items.
Keep the summary under {target_tokens} tokens."""

config = LLMAutoContextSummarizationConfig(
    summary_config=LLMContextSummaryConfig(
        summarization_prompt=custom_prompt,
    ),
)

When no custom prompt is provided, Pipecat uses a built-in prompt that instructs the LLM to create a concise summary preserving key information, user preferences, and conversation flow.

Dedicated Summarization LLM

By default, summarization uses the same LLM service that handles conversation. You can route summarization to a separate, cheaper model by setting the llm field:

from pipecat.services.google import GoogleLLMService

# Use a fast/cheap model for summarization
summarization_llm = GoogleLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    model="gemini-2.5-flash",
)

config = LLMAutoContextSummarizationConfig(
    summary_config=LLMContextSummaryConfig(
        llm=summarization_llm,
    ),
)

When a dedicated LLM is configured, summarization requests bypass the pipeline entirely and call the dedicated service directly, so the primary conversation LLM is never interrupted.

On-Demand Summarization

In addition to automatic summarization, you can trigger context summarization on demand by pushing an LLMSummarizeContextFrame into the pipeline. This is useful when you want to give users explicit control over when summarization happens — for example, via a function call tool.

from pipecat.frames.frames import LLMSummarizeContextFrame
from pipecat.services.llm_service import FunctionCallParams

async def summarize_conversation(params: FunctionCallParams):
    """Trigger manual context summarization via a pipeline frame."""
    await params.result_callback({"status": "summarization_requested"})
    await params.llm.queue_frame(LLMSummarizeContextFrame())

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema

llm.register_function("summarize_conversation", summarize_conversation)

summarize_function = FunctionSchema(
    name="summarize_conversation",
    description=(
        "Summarize and compress the conversation history. "
        "Call this when the user asks you to summarize the conversation "
        "or when you want to free up context space."
    ),
    properties={},
    required=[],
)
tools = ToolsSchema(standard_tools=[summarize_function])
context = LLMContext(messages, tools=tools)

On-demand summarization works even when enable_auto_context_summarization is False — the summarizer is always created internally to handle manually pushed frames. You can also pass a per-request LLMContextSummaryConfig to override the default settings:

from pipecat.utils.context.llm_context_summarization import LLMContextSummaryConfig

await llm.queue_frame(
    LLMSummarizeContextFrame(
        config=LLMContextSummaryConfig(
            target_context_tokens=2000,
            min_messages_after_summary=2,
        )
    )
)

See the complete example for a full working implementation.

Observability

The summarizer emits an on_summary_applied event after each successful summarization, providing message count metrics:

from pipecat.processors.aggregators.llm_context_summarizer import SummaryAppliedEvent

summarizer = assistant_aggregator._summarizer
if summarizer:

    @summarizer.event_handler("on_summary_applied")
    async def on_summary_applied(summarizer, event: SummaryAppliedEvent):
        logger.info(
            f"Context summarized: {event.original_message_count} messages -> "
            f"{event.new_message_count} messages "
            f"({event.summarized_message_count} summarized, "
            f"{event.preserved_message_count} preserved)"
        )

Learning Pipecat

Fundamentals

Features

Telephony