# Speech Input & Turn Detection

> Learn how Pipecat detects user turns using VAD, transcriptions, and turn detection models

A key to natural conversations is properly detecting when the user starts and stops speaking. This is more nuanced than simply detecting audio; a brief pause doesn't always mean the user is done talking.

## Overview

Pipecat uses [user turn strategies](/api-reference/server/utilities/turn-management/user-turn-strategies) to determine when user turns start and end. These strategies can use different techniques:

**For detecting turn start:**

* Voice Activity Detection (VAD): triggers when speech is detected
* Transcription-based (fallback): triggers when transcription is received but VAD didn't detect speech
* Minimum words: waits for a minimum number of spoken words before triggering

**For detecting turn end (default: Smart Turn):**

* Turn detection model (default): uses AI to understand if the user has finished their thought
* Speech timeout: waits for silence after transcription to determine when the user is done

Custom strategies can also be implemented for specific use cases. By combining these techniques, you can create responsive yet natural conversations that don't interrupt users mid-sentence or wait too long after they've finished.
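
To make the turn-start techniques above concrete, here is a minimal, library-free sketch. The function names (`vad_start`, `transcription_start`, `min_words_start`, `turn_started`) are hypothetical and not part of Pipecat, and the sketch assumes a turn starts as soon as any one of the configured checks fires.

```python
from typing import Callable, Iterable

# Each hypothetical check receives the current VAD state and the transcript so far.
StartCheck = Callable[[bool, str], bool]


def vad_start(vad_speech: bool, transcript: str) -> bool:
    """Trigger as soon as VAD reports speech."""
    return vad_speech


def transcription_start(vad_speech: bool, transcript: str) -> bool:
    """Fallback: trigger when text arrived even though VAD stayed silent."""
    return bool(transcript.strip())


def min_words_start(min_words: int) -> StartCheck:
    """Build a check that waits for at least `min_words` spoken words."""
    def check(vad_speech: bool, transcript: str) -> bool:
        return len(transcript.split()) >= min_words
    return check


def turn_started(checks: Iterable[StartCheck], vad_speech: bool, transcript: str) -> bool:
    """Assumption: the turn starts when any configured check fires."""
    return any(check(vad_speech, transcript) for check in checks)
```

With `[vad_start, transcription_start]`, a transcript that arrives while VAD is silent still starts the turn. With `[min_words_start(3)]`, a short "uh huh" does not.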

## Voice Activity Detection (VAD)

### What VAD Does

VAD is responsible for detecting when a user starts and stops speaking. Pipecat includes [Silero VAD](https://github.com/snakers4/silero-vad), an open-source model that runs locally on CPU with minimal overhead. [Krisp VIVA VAD](/api-reference/server/utilities/audio/krisp-viva-vad-analyzer) is also available for applications requiring support for higher sample rates.

**Silero VAD performance characteristics:**

* Processes 30+ms audio chunks in less than 1ms
* Runs on a single CPU thread
* Minimal system resource impact

### VAD Configuration

VAD is configured by passing `VADParams` to the VAD analyzer, which is then handed to the `LLMContextAggregatorPair` through `LLMUserAggregatorParams`:

```python theme={null}
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.7,      # Minimum confidence for voice detection
        start_secs=0.2,      # Time to wait before confirming speech start
        stop_secs=0.2,       # Time to wait before confirming speech stop
        min_volume=0.6,      # Minimum volume threshold
    )
)

# `context` is the LLM context object created elsewhere in your bot
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(vad_analyzer=vad_analyzer),
)
```

In the vast majority of cases, the default values work well. Only adjust these parameters if you have specific audio conditions that require it.

### Key Parameters

**`start_secs` (default: 0.2)**

* How long a user must speak before VAD confirms speech has started
* Lower values = more responsive, but may trigger on brief sounds
* Higher values = less sensitive, but may miss quick utterances, like "yes", "no", or "ok"

**`stop_secs` (default: 0.2)**

* How much silence must be detected before confirming speech has stopped
* Critical for turn-taking behavior
* A short value (0.2s) allows STT services to finalize sooner, improving transcription speed
* **Important**: Built-in STT P99 latency values are measured with `stop_secs=0.2`. If you change this value, re-run the [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark) with your settings and pass the measured latency to your STT service via `ttfs_p99_latency`
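
The effect of `start_secs` and `stop_secs` can be illustrated with a toy debouncer. This is a simplified sketch, not Pipecat's implementation: `ToyVADDebouncer`, `events_for`, and the 20 ms frame duration are assumptions made for the example. A state change is confirmed only after the opposite condition has been observed continuously for the configured duration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyVADDebouncer:
    """Illustrative only: confirms speech start/stop after sustained evidence."""

    start_secs: float = 0.2
    stop_secs: float = 0.2
    frame_secs: float = 0.02  # assumed duration of each analyzed frame
    speaking: bool = False
    _opposite_frames: int = 0  # consecutive frames contradicting the current state

    def update(self, frame_is_speech: bool) -> Optional[str]:
        """Feed one per-frame decision; return "started", "stopped", or None."""
        if frame_is_speech == self.speaking:
            self._opposite_frames = 0  # evidence agrees with the current state
            return None
        self._opposite_frames += 1
        needed_secs = self.stop_secs if self.speaking else self.start_secs
        if self._opposite_frames >= round(needed_secs / self.frame_secs):
            self.speaking = not self.speaking
            self._opposite_frames = 0
            return "started" if self.speaking else "stopped"
        return None


def events_for(frames: List[bool], **params: float) -> List[Tuple[int, str]]:
    """Return (frame_index, event) pairs for a sequence of per-frame decisions."""
    vad = ToyVADDebouncer(**params)
    return [(i, event) for i, f in enumerate(frames) if (event := vad.update(f))]
```

In this sketch, a 100 ms burst of speech (five frames) is ignored with the default `start_secs=0.2` but confirmed with `start_secs=0.1`, which is the responsiveness versus false-trigger tradeoff described above.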

**`confidence` and `min_volume`**

* Generally work well with defaults
* Only adjust after extensive testing with your specific audio conditions

<Warning>
  Changing `confidence` and `min_volume` requires careful profiling to ensure
  optimal performance across different audio environments and use cases.
</Warning>
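
One way to approach that profiling is to replay recorded per-frame scores against candidate thresholds. The sketch below is library-free and rests on an assumption: a frame counts as speech only when both its confidence and its volume meet their thresholds. The helper names `frame_is_speech` and `speech_ratio` are hypothetical.

```python
from typing import Iterable, Tuple


def frame_is_speech(
    confidence: float,
    volume: float,
    min_confidence: float = 0.7,
    min_volume: float = 0.6,
) -> bool:
    """Assumed gate: both thresholds must be met for a frame to count as speech."""
    return confidence >= min_confidence and volume >= min_volume


def speech_ratio(frames: Iterable[Tuple[float, float]], **thresholds: float) -> float:
    """Fraction of (confidence, volume) frames classified as speech."""
    frames = list(frames)
    hits = sum(frame_is_speech(c, v, **thresholds) for c, v in frames)
    return hits / len(frames)
```

Comparing `speech_ratio(recorded_frames)` with `speech_ratio(recorded_frames, min_volume=0.4)` on audio from your own environment shows how many extra frames a looser threshold would admit.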

## User Turn Detection

While VAD distinguishes speech from silence, it can't understand linguistic context: a pause doesn't necessarily mean the user is done. User turn strategies interpret VAD signals and transcriptions to determine the actual turn boundaries.

### How It Works

1. **Turn Start**: When VAD detects speech (or transcription arrives), the start strategy emits `UserStartedSpeakingFrame` and optionally triggers an interruption
2. **Turn End**: When the stop strategy determines the user is done, it emits `UserStoppedSpeakingFrame`

<Note>
  VAD also emits its own frames (`VADUserStartedSpeakingFrame`,
  `VADUserStoppedSpeakingFrame`) which indicate raw speech/silence detection.
  These are inputs to the turn strategies, not the final turn decisions.
</Note>
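
The relationship between raw VAD frames and turn-level frames can be sketched as a small function over frame names. This is an illustration only: `derive_turn_frames` and the `turn_is_complete` callback are hypothetical stand-ins for the start and stop strategies, and real strategies work with frame objects rather than strings.

```python
from typing import Callable, Iterable, List


def derive_turn_frames(
    inputs: Iterable[str],
    turn_is_complete: Callable[[int], bool],
) -> List[str]:
    """Map raw VAD/transcription frame names to turn-level frame names.

    `turn_is_complete(pause_count)` stands in for a stop strategy: it is asked
    at every VAD pause whether the user has finished their turn.
    """
    emitted: List[str] = []
    in_turn = False
    pauses = 0
    for name in inputs:
        starts_turn = name in ("VADUserStartedSpeakingFrame", "TranscriptionFrame")
        if starts_turn and not in_turn:
            in_turn, pauses = True, 0
            emitted.append("UserStartedSpeakingFrame")
        elif name == "VADUserStoppedSpeakingFrame" and in_turn:
            pauses += 1
            if turn_is_complete(pauses):
                in_turn = False
                emitted.append("UserStoppedSpeakingFrame")
    return emitted
```

If the stop callback only accepts the second pause, two VAD start/stop pairs collapse into a single user turn, which is how a mid-sentence pause avoids ending the turn early.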

### Detecting Turn End

Turn end detection determines when the user has finished speaking and expects a response:

**Smart Turn Model (Default)**: Uses an AI model to analyze audio and determine if the user has finished their thought. This is the default turn stop strategy.

```python theme={null}
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy

stop_strategy = TurnAnalyzerUserTurnStopStrategy(
    turn_analyzer=LocalSmartTurnAnalyzerV3()
)
```

**Speech Timeout**: Waits for a configurable timeout after VAD detects silence and a transcript is received. Useful as a simpler alternative to Smart Turn.

```python theme={null}
from pipecat.turns.user_stop import SpeechTimeoutUserTurnStopStrategy

stop_strategy = SpeechTimeoutUserTurnStopStrategy(user_speech_timeout=0.6)
```
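
The timing logic behind a speech timeout can be sketched without the library. The helper below is hypothetical and encodes one plausible reading of the rule: the turn ends once a transcript has arrived and `user_speech_timeout` seconds have passed since VAD reported silence, unless the user resumes speaking first.

```python
from typing import Optional


def turn_end_time(
    vad_stop_at: float,
    transcript_at: Optional[float],
    resumed_at: Optional[float] = None,
    user_speech_timeout: float = 0.6,
) -> Optional[float]:
    """Return the time (in seconds) the turn would end, or None if it stays open."""
    if transcript_at is None:
        return None  # no transcript yet, so keep waiting
    candidate = max(vad_stop_at + user_speech_timeout, transcript_at)
    if resumed_at is not None and resumed_at < candidate:
        return None  # the user resumed speaking before the timeout elapsed
    return candidate
```

A longer timeout tolerates longer mid-sentence pauses at the cost of a slower response, which is the tradeoff Smart Turn aims to avoid by analyzing the audio itself.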

<CardGroup cols={2}>
  <Card title="User Turn Strategies" icon="arrows-turn-to-dots" href="/api-reference/server/utilities/turn-management/user-turn-strategies">
    Complete reference for start and stop strategies
  </Card>

  <Card title="Smart Turn Overview" icon="brain" href="/api-reference/server/utilities/turn-detection/smart-turn-overview">
    Smart Turn model implementation guide
  </Card>
</CardGroup>

### Interruptions

Interruptions stop the bot when the user starts speaking. This is controlled by the `enable_interruptions` parameter on start strategies (enabled by default).

When a user turn starts with interruptions enabled:

1. The bot immediately stops speaking
2. Pending audio and text are cleared
3. The pipeline is ready for new user input
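
The three steps above can be sketched with a toy output queue. `ToyBotOutput` is a hypothetical class written for illustration; it only mirrors the behavior described here and is not how Pipecat implements interruptions internally.

```python
from collections import deque


class ToyBotOutput:
    """Illustrative stand-in for the bot's queued audio and text output."""

    def __init__(self) -> None:
        self.pending: deque = deque()
        self.speaking = False

    def say(self, *chunks: str) -> None:
        """Queue output chunks and mark the bot as speaking."""
        self.pending.extend(chunks)
        self.speaking = True

    def on_user_turn_started(self, enable_interruptions: bool = True) -> int:
        """Handle a user turn start; return how many pending chunks were dropped."""
        if not enable_interruptions:
            return 0  # the bot keeps talking and its queue is untouched
        dropped = len(self.pending)
        self.pending.clear()   # step 2: pending output is cleared
        self.speaking = False  # step 1: the bot stops speaking
        return dropped         # step 3: the queue is empty and ready for new input
```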

To disable interruptions:

```python theme={null}
from pipecat.turns.user_start import VADUserTurnStartStrategy

start_strategy = VADUserTurnStartStrategy(enable_interruptions=False)
```

<Note>
  Keep interruptions enabled (default) for natural conversations. This enables
  users to interrupt the bot mid-sentence, just like human conversations.
</Note>

## Best Practices

### Optimal Configuration

**For most voice AI use cases, the defaults work well:**

```python theme={null}
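from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.transports.base_transport import TransportParams

# YourTransport is a placeholder for the transport class you use;
# `context` and `pipeline` are assumed to be created elsewhere in your bot.
transport = YourTransport(
    params=TransportParams(),
)

# Default configuration: Smart Turn detection + VAD with stop_secs=0.2
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)

task = PipelineTask(pipeline)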
```

The defaults use `TurnAnalyzerUserTurnStopStrategy` with `LocalSmartTurnAnalyzerV3` for turn detection and `SileroVADAnalyzer` with `stop_secs=0.2` for voice activity detection.

### Performance Considerations

* **Use local VAD**: 150-200ms faster than remote VAD services
* **Tune for your use case**: Test with real audio conditions
* **Monitor CPU usage**: VAD adds minimal overhead, but track it in production anyway
* **Consider turn detection**: Improves conversation quality but adds complexity
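
To check per-chunk overhead in your own environment, a small timing harness is usually enough. The sketch below is generic: `chunk_latency_ms` is a hypothetical helper, and `analyze` is any callable you supply that processes one audio chunk. It does not assume a particular analyzer API.

```python
import time
from statistics import mean
from typing import Callable, Sequence, Tuple


def chunk_latency_ms(
    analyze: Callable[[bytes], object],
    chunks: Sequence[bytes],
) -> Tuple[float, float]:
    """Time `analyze(chunk)` for each chunk and return (mean_ms, worst_ms)."""
    timings = []
    for chunk in chunks:
        start = time.perf_counter()
        analyze(chunk)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings), max(timings)


# 30 ms of 16 kHz, 16-bit mono audio is 480 samples, i.e. 960 bytes.
SILENT_CHUNK = b"\x00" * 960
```

Calling `chunk_latency_ms(your_analyze_fn, [SILENT_CHUNK] * 1000)` gives a rough baseline. Real recordings from your target environment give a more representative number than silence.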

## Key Takeaways

* **VAD detects speech activity** but turn detection understands conversation context
* **Configuration affects user experience** - tune parameters for your specific use case
* **System frames coordinate behavior** - they make interruptions and natural turn-taking possible
* **Local processing is faster** - Silero VAD provides low-latency speech detection
* **Turn detection improves quality** - but requires careful VAD configuration

## What's Next

Now that you understand how speech input is detected and processed, let's explore how that audio gets converted to text through speech recognition.

<Card title="Speech to Text" icon="arrow-right" href="/pipecat/learn/speech-to-text">
  Learn how to configure speech recognition in your voice AI pipeline
</Card>
