> ## Documentation Index
> Fetch the complete documentation index at: https://docs.pipecat.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluations

> Test and improve your voice AI agents from local prompt iteration to production monitoring.

## Overview

Building a voice AI agent is only half the challenge. You also need to know it handles real conversations reliably. A good evaluation strategy progresses through two phases:

1. **Local testing**: Iterate on your LLM prompts quickly without needing live audio services, reducing cost and tightening the feedback loop during development.
2. **Production evaluation**: Automated simulations and observability for deployed agents, catching regressions and tracking quality over time with real user traffic.

Starting locally and layering in production tooling as your agent matures gives you the fastest path to a reliable, well-tested agent.

## Local prompt testing

Before investing in full end-to-end simulations, focus on getting your LLM prompts right. Pipecat's architecture makes it straightforward to test your agent's conversational logic without running STT or TTS services, saving both time and cost during development.

The most efficient way to iterate on prompts is to bypass audio entirely and send text directly to your LLM pipeline. This lets you validate conversational logic, function calling, and response quality in seconds rather than minutes.

You can configure your pipeline to accept text input instead of audio by replacing STT with a transcript-based input:

```python theme={null}
from pipecat.frames.frames import TranscriptionFrame

# Send a simulated user utterance directly into the pipeline
frame = TranscriptionFrame(
    text="I'd like to schedule an appointment for tomorrow at 3pm",
    user_id="test-user",
    timestamp=0,
)
```

This approach lets you:

* Test prompt variations rapidly without waiting for audio processing
* Validate function calling behavior with specific user inputs
* Build repeatable test cases for edge cases and failure modes
* Run tests in CI without audio infrastructure

## Production evaluation

Once your prompts are solid and you've validated the local experience, production evaluation tools help you scale testing and monitor quality across real deployments. This is where evaluation platforms come in.

### Simulations

Automated test conversations exercise your agent's behavior across scenarios, edge cases, and failure modes before they reach users. Simulation platforms can connect to your agent via API, WebSocket, or telephony to run scripted or AI-driven test calls.

Key things to test with simulations:

* **Multi-turn flows**: Verify your agent handles complete conversation paths correctly
* **Edge cases**: Test interruptions, unexpected input, silence, and barge-in
* **Telephony behavior**: End-to-end testing over real phone networks catches issues that only surface in production call conditions
* **Regressions**: Run simulation suites before each deployment to catch breaking changes

### Observability

Continuous evaluation of live calls lets you catch regressions, track quality over time, and close the loop between what you test and what users experience. Common approaches include:

* Submitting call recordings and transcripts for automated quality scoring
* Tracking evaluation metrics over time to detect quality drift
* Using [OpenTelemetry traces](/api-reference/server/utilities/opentelemetry) to monitor latency and execution flow

Together, simulations and observability form a feedback loop: simulations validate changes before deployment, and observability surfaces issues that inform your next round of tests.

### Evaluation platforms

Several platforms offer simulation testing and production monitoring for voice AI agents:

<CardGroup cols={2}>
  <Card title="Coval" icon="flask-vial" iconType="duotone" href="/pipecat/fundamentals/evaluations/coval">
    AI-native simulation and evaluation platform for voice agents, trusted by QA, Engineering, Operations, AI, and Executive teams.
  </Card>

  <Card title="Bluejay" icon="bird" iconType="duotone" href="/pipecat/fundamentals/evaluations/bluejay">
    Simulation, observability, and evaluation platform with native Pipecat Cloud integration. Supports no-code API, WebSocket, and telephony testing.
  </Card>

  <Card title="Cekura" icon="shield-check" iconType="duotone" href="/pipecat/fundamentals/evaluations/cekura">
    Automated testing and monitoring platform with native Pipecat Integration for WebRTC/Text based testing and support for Mock Tools, Custom Dynamic Variables and more!
  </Card>
</CardGroup>

<Note>
  Building an evaluation integration for Pipecat? We welcome contributions to
  this page. Open a PR on the [docs
  repository](https://github.com/pipecat-ai/docs).
</Note>

## Pipecat's built-in tools

Pipecat provides several building blocks that feed into any evaluation workflow:

* **[Metrics](/pipecat/fundamentals/metrics)**: Built-in TTFB, processing time, and usage tracking for LLM and TTS services
* **[Saving transcripts](/pipecat/fundamentals/saving-transcripts)**: Capture conversation transcripts for offline analysis and evaluation
* **[OpenTelemetry](/api-reference/server/utilities/opentelemetry)**: Export traces to any OTel-compatible backend for latency and performance monitoring
* **[Observers](/api-reference/server/utilities/observers/observer-pattern)**: Monitor frame flow without modifying the pipeline, useful for custom instrumentation

## Next steps

<CardGroup cols={2}>
  <Card title="Metrics" icon="chart-line" iconType="duotone" href="/pipecat/fundamentals/metrics">
    Monitor performance and LLM/TTS usage with Pipecat's built-in metrics.
  </Card>

  <Card title="Saving Transcripts" icon="scroll" iconType="duotone" href="/pipecat/fundamentals/saving-transcripts">
    Capture conversation transcripts to use with evaluation tools.
  </Card>

  <Card title="OpenTelemetry" icon="tower-broadcast" iconType="duotone" href="/api-reference/server/utilities/opentelemetry">
    Export traces for performance monitoring and debugging.
  </Card>

  <Card title="Custom Frame Processor" icon="puzzle-piece" iconType="duotone" href="/pipecat/fundamentals/custom-frame-processor">
    Build custom processors for evaluation-specific instrumentation.
  </Card>
</CardGroup>
