Overview

Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD) which only detects speech vs. non-speech, Smart Turn Detection uses machine learning to recognize natural conversational cues like intonation patterns and linguistic signals.

Smart Turn Model

Open source model for advanced conversational turn detection. Contribute to model training and development.

Available Implementations

Pipecat provides multiple implementations of Smart Turn Detection for different deployment scenarios:

Installation

The Smart Turn Detection feature requires additional dependencies depending on which implementation you choose.

# For Fal Smart Turn implementations:
pip install "pipecat-ai[remote-smart-turn]"

# For local CoreML inference (Apple Silicon only):
pip install "pipecat-ai[local-smart-turn]"

How It Works

Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:

  1. Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.

  2. VAD Processing: Voice Activity Detection segments the audio into speech and non-speech portions.

  3. Turn Analysis: When VAD detects a pause in speech:

    • The ML model analyzes the speech segment for natural completion cues
    • It identifies acoustic and linguistic patterns that indicate turn completion
    • A decision is made whether the turn is complete or incomplete

The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than stop_secs, the turn is automatically marked as complete.

Integration with Transport

All Smart Turn implementations use the same basic integration pattern:

from pipecat.transports.base_transport import TransportParams

transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        # Other transport parameters...
        turn_analyzer=SmartTurnAnalyzer(url=remote_smart_turn_url),
    ),
)

Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short stop_secs value. We recommend 0.2 seconds.

vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2))

SmartTurnParams

All Smart Turn implementations use the same SmartTurnParams class to configure behavior:

stop_secs
float
default:"3.0"

Duration of silence in seconds required before triggering a silence-based end of turn

pre_speech_ms
float
default:"0.0"

Amount of audio (in milliseconds) to include before speech is detected

max_duration_secs
float
default:"8.0"

Maximum allowed segment duration in seconds. For segments longer than this value, a rolling window is used.

Notes

  • The model is designed for English speech; performance may vary with other languages
  • You can adjust the stop_secs parameter based on your application’s needs for responsiveness
  • Smart Turn generally provides a more natural conversational experience but is computationally more intensive than simple VAD