Overview

Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD), which only distinguishes speech from non-speech, Smart Turn Detection uses a machine learning model to recognize natural conversational cues such as intonation patterns and linguistic signals.

Pipecat provides LocalSmartTurnAnalyzerV3, which runs inference locally using ONNX. This is the recommended approach because Smart Turn v3 offers fast CPU inference.

Installation

pip install "pipecat-ai[local-smart-turn-v3]"
The Smart Turn model weights are bundled with Pipecat, so there is no need to download them separately.

Integration with User Turn Strategies

Smart Turn Detection is integrated into your application by configuring a TurnAnalyzerUserTurnStopStrategy with LocalSmartTurnAnalyzerV3 in your context aggregator:
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy
from pipecat.turns.user_turn_strategies import UserTurnStrategies

transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
    ),
)

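# "context" is the LLM context object created earlier in your bot setup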
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        user_turn_strategies=UserTurnStrategies(
            stop=[TurnAnalyzerUserTurnStopStrategy(
                turn_analyzer=LocalSmartTurnAnalyzerV3()
            )]
        ),
    ),
)
Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short stop_secs value. We recommend 0.2 seconds:
audio_in_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2))

Configuration

The SmartTurnParams class configures turn detection behavior:

stop_secs (float, default: 3.0)
    Duration of silence, in seconds, required before triggering a silence-based end of turn.

pre_speech_ms (float, default: 0.0)
    Amount of audio, in milliseconds, to include before speech is detected.

max_duration_secs (float, default: 8.0)
    Maximum allowed segment duration in seconds. For segments longer than this value, a rolling window is used.
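
To override a default, pass a SmartTurnParams to the analyzer. A minimal sketch, assuming SmartTurnParams is exported from pipecat.audio.turn.smart_turn.base_smart_turn:

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3

# Shorten the silence fallback from 3 seconds to 2, keeping the
# other defaults (no pre-speech audio, 8-second rolling window).
turn_analyzer = LocalSmartTurnAnalyzerV3(
    params=SmartTurnParams(stop_secs=2.0)
)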

Local Smart Turn

The LocalSmartTurnAnalyzerV3 runs inference locally. Version 3 of the model supports fast CPU inference on ordinary cloud instances.

Constructor Parameters

smart_turn_model_path (Optional[str], default: None)
    Path to the Smart Turn v3 ONNX file containing the model weights, available from https://huggingface.co/pipecat-ai/smart-turn-v3/tree/main. This parameter is optional: Pipecat bundles a copy of the model internally, which is used when no path is set.

sample_rate (Optional[int], default: None)
    Audio sample rate (set by the transport if not provided).

params (SmartTurnParams, default: SmartTurnParams())
    Configuration parameters for turn detection.
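
If you would rather load the weights yourself, pass an explicit path. A minimal sketch; the file path below is a placeholder for your local copy of the Hugging Face download:

from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3

# Use explicitly downloaded weights instead of the bundled copy.
# Placeholder path; point it at your local Smart Turn v3 ONNX file.
turn_analyzer = LocalSmartTurnAnalyzerV3(
    smart_turn_model_path="/path/to/smart-turn-v3.onnx"
)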

Example

from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy
from pipecat.turns.user_turn_strategies import UserTurnStrategies

# Create the transport
transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
    ),
)

# Configure Smart Turn Detection via user turn strategies
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        user_turn_strategies=UserTurnStrategies(
            stop=[TurnAnalyzerUserTurnStopStrategy(
                turn_analyzer=LocalSmartTurnAnalyzerV3()
            )]
        ),
    ),
)

How It Works

Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:
  1. Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.
  2. VAD Processing: Voice Activity Detection (using the Silero model) detects when there is a pause in the user’s speech.
  3. Smart Turn Analysis: When VAD detects a pause in speech, the Smart Turn model analyzes the audio from the most recent 8 seconds of the user’s turn, and makes a decision about whether the turn is complete or incomplete.
The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than stop_secs, the turn is automatically marked as complete.
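
A conceptual sketch of that decision flow (illustrative only; this is not Pipecat's internal code):

from dataclasses import dataclass

@dataclass
class FallbackParams:
    stop_secs: float = 3.0  # silence threshold for the fallback

def is_turn_complete(model_says_complete: bool, silence_secs: float,
                     params: FallbackParams) -> bool:
    """Combine the Smart Turn model's verdict with the silence fallback."""
    if model_says_complete:
        # The model recognized a natural turn completion point.
        return True
    # Fallback: a turn classified as incomplete still ends once
    # silence has lasted longer than stop_secs.
    return silence_secs >= params.stop_secs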

Notes

  • The model supports 23 languages; see the source repository for more details
  • You can adjust the stop_secs parameter based on your application’s needs for responsiveness
  • Smart Turn generally provides a more natural conversational experience but is computationally more intensive than simple VAD
  • LocalSmartTurnAnalyzerV3 is designed to run on CPU, and inference completes in under 100 ms on low-cost cloud instances. For higher performance, you can install the onnxruntime-gpu dependency to enable GPU inference, as shown below.
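
A sketch of that dependency swap, assuming the standard ONNX Runtime packaging (you may need to remove the CPU-only onnxruntime package first so the GPU build is picked up):

pip uninstall onnxruntime
pip install onnxruntime-gpu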