Overview
Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD), which only distinguishes speech from non-speech, Smart Turn Detection uses a machine learning model to recognize natural conversational cues like intonation patterns and linguistic signals.
Related resources:
- Smart Turn Model: the open source model for advanced conversational turn detection. Contribute to model training and development.
- Data Collector: contribute conversational data to improve the smart-turn model.
- Data Classifier: help classify turn completion patterns in conversations.
Two implementations are available:
- LocalSmartTurnAnalyzerV3 - Runs inference locally using ONNX. This is the recommended option, thanks to Smart Turn v3's fast CPU inference times.
- FalSmartTurnAnalyzer - Uses Fal's hosted smart-turn model for inference.
Installation
The Smart Turn Detection feature requires additional dependencies depending on which implementation you choose. For local inference:
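A minimal install sketch; the exact name of the extra (`local-smart-turn-v3` here) is an assumption and may differ by Pipecat version:

```bash
pip install "pipecat-ai[local-smart-turn-v3]"
```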
Integration with Transport
Smart Turn Detection is integrated into your application by setting one of the available turn analyzers as the turn_analyzer parameter in your transport configuration:
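For example, a sketch using the Daily transport; the import paths and transport parameters are assumptions based on recent Pipecat releases, so check them against your installed version:

```python
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

room_url = "https://example.daily.co/my-room"  # placeholder room URL
token = None  # placeholder; supply a meeting token if your room requires one

transport = DailyTransport(
    room_url,
    token,
    "My Bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        # A short VAD stop_secs pairs well with Smart Turn (see the note below)
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        turn_analyzer=LocalSmartTurnAnalyzerV3(),
    ),
)
```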
Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short stop_secs value. We recommend 0.2 seconds.
Configuration
All implementations use the same SmartTurnParams class to configure behavior:
- stop_secs - Duration of silence in seconds required before triggering a silence-based end of turn
- pre_speech_ms - Amount of audio (in milliseconds) to include before speech is detected
- max_duration_secs - Maximum allowed segment duration in seconds. For segments longer than this value, a rolling window is used.
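For instance, a minimal sketch; the module path and the default values shown here are assumptions, so verify them against your installed version:

```python
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams

params = SmartTurnParams(
    stop_secs=3.0,          # silence (in seconds) before a silence-based end of turn
    pre_speech_ms=0.0,      # audio (in ms) to keep from before speech is detected
    max_duration_secs=8.0,  # longer segments are analyzed with a rolling window
)
```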
Remote Smart Turn
The FalSmartTurnAnalyzer class uses a remote service for turn detection inference.
Constructor Parameters
- url - The URL of the remote Smart Turn service
- sample_rate - Audio sample rate (will be set by the transport if not provided)
- params - Configuration parameters for turn detection
Example
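A hedged sketch using only the constructor parameters documented above; the environment variable is a placeholder, and depending on your Pipecat version the constructor may also expect extras such as an API key or an HTTP session:

```python
import os

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer

turn_analyzer = FalSmartTurnAnalyzer(
    url=os.getenv("FAL_SMART_TURN_URL"),  # URL of the hosted smart-turn service
    params=SmartTurnParams(),
)
```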
Local Smart Turn
The LocalSmartTurnAnalyzerV3 class runs inference locally. Version 3 of the model supports fast CPU inference on ordinary cloud instances.
Constructor Parameters
- smart_turn_model_path - Path to the Smart Turn v3 ONNX file containing the model weights. Download this from https://huggingface.co/pipecat-ai/smart-turn-v3/tree/main. This parameter is optional: Pipecat includes a copy of the model internally, which is used if the path is unset.
- sample_rate - Audio sample rate (will be set by the transport if not provided)
- params - Configuration parameters for turn detection
Example
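A minimal sketch relying on the bundled model weights; the import paths and the commented-out file path are assumptions:

```python
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3

# With no model path given, Pipecat's internally bundled Smart Turn v3 weights are used
turn_analyzer = LocalSmartTurnAnalyzerV3(params=SmartTurnParams())

# Or point at a downloaded copy of the ONNX weights:
# turn_analyzer = LocalSmartTurnAnalyzerV3(
#     smart_turn_model_path="path/to/smart-turn-v3.onnx",
#     params=SmartTurnParams(),
# )
```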
How It Works
Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:
- Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.
- VAD Processing: Voice Activity Detection (using the Silero model) detects when there is a pause in the user's speech.
- Smart Turn Analysis: When VAD detects a pause in speech, the Smart Turn model analyzes the audio from the most recent 8 seconds of the user's turn and decides whether the turn is complete or incomplete. A sketch of this decision flow follows the list.
- Fallback Timeout: If the model judges the turn incomplete but the user stays silent for stop_secs, the turn is automatically marked as complete.
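As a schematic of the decision flow described above; the helper names are hypothetical, and this is not Pipecat's internal API:

```python
def on_vad_pause(audio_buffer, model, silence_secs, stop_secs):
    # Analyze up to the last 8 seconds of the user's turn
    segment = audio_buffer.last_seconds(8)  # hypothetical helper
    if model.predict(segment) == "complete":
        return "end_turn"  # model says the user is done speaking
    if silence_secs >= stop_secs:
        return "end_turn"  # fallback: silence exceeded stop_secs
    return "wait"          # incomplete turn: keep listening
```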
Notes
- The model supports 23 languages; see the source repository for more details.
- You can adjust the stop_secs parameter based on your application's needs for responsiveness.
- Smart Turn generally provides a more natural conversational experience but is computationally more intensive than simple VAD.
- LocalSmartTurnAnalyzerV3 is designed to run on CPU, and inference can be performed on low-cost cloud instances in under 100 ms. However, by installing the onnxruntime-gpu dependency, you can achieve higher performance through GPU inference.
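For example; note that onnxruntime-gpu typically replaces the CPU onnxruntime package in the same environment:

```bash
pip install onnxruntime-gpu
```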