Advanced conversational turn detection powered by the smart-turn model
Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD) which only detects speech vs. non-speech, Smart Turn Detection uses a machine learning model to recognize natural conversational cues like intonation patterns and linguistic signals.
Open source model for advanced conversational turn detection. Contribute to model training and development.
Contribute conversational data to improve the smart-turn model
Help classify turn completion patterns in conversations
Pipecat provides three implementations of Smart Turn Detection: FalSmartTurnAnalyzer (remote inference via Fal's hosted service), LocalCoreMLSmartTurnAnalyzer (local CoreML inference), and LocalSmartTurnAnalyzer (local PyTorch inference).
All implementations share the same underlying API and parameters, making it easy to switch between them based on your deployment requirements.
The Smart Turn Detection feature requires additional dependencies depending on which implementation you choose.
For Fal’s hosted service inference:
For local inference (CoreML-based):
For local inference (PyTorch-based):
Smart Turn Detection is integrated into your application by setting one of the available turn analyzers as the turn_analyzer parameter in your transport configuration:
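Below is a minimal sketch of such a configuration. The Daily transport, module paths, and parameter names are assumptions based on common Pipecat usage and may differ between versions; any of the three analyzers can be passed as turn_analyzer.

```python
# Illustrative sketch; import paths and DailyParams fields may vary by Pipecat version.
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn import LocalSmartTurnAnalyzer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

room_url = "https://example.daily.co/my-room"  # placeholder room URL
token = "DAILY_TOKEN"                          # placeholder token

transport = DailyTransport(
    room_url,
    token,
    "Smart Turn Bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        # Smart Turn requires VAD; a short stop_secs keeps the bot responsive.
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        # Swap in FalSmartTurnAnalyzer or LocalCoreMLSmartTurnAnalyzer as needed.
        turn_analyzer=LocalSmartTurnAnalyzer(params=SmartTurnParams()),
    ),
)
```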
Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short stop_secs value. We recommend 0.2 seconds.
All implementations use the same SmartTurnParams class to configure behavior:
stop_secs: Duration of silence in seconds required before triggering a silence-based end of turn.
pre_speech_ms: Amount of audio (in milliseconds) to include before speech is detected.
max_duration_secs: Maximum allowed segment duration in seconds. For segments longer than this value, a rolling window is used.
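As a concrete example, the sketch below constructs SmartTurnParams explicitly. The values shown are illustrative, not documented defaults, and the import path is an assumption to check against your Pipecat version.

```python
# Illustrative sketch; values and import path are assumptions, not documented defaults.
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams

params = SmartTurnParams(
    stop_secs=3.0,          # silence (seconds) before a silence-based end of turn
    pre_speech_ms=0.0,      # audio (milliseconds) kept from before speech starts
    max_duration_secs=8.0,  # longer segments are analyzed with a rolling window
)
```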
The FalSmartTurnAnalyzer class uses a remote service for turn detection inference.
url: The URL of the remote Smart Turn service.
sample_rate: Audio sample rate (will be set by the transport if not provided).
params: Configuration parameters for turn detection.
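A minimal construction sketch follows. The import path, environment variable name, and endpoint URL are assumptions; depending on your Pipecat version, an aiohttp session and a Fal API key may also be required.

```python
# Illustrative sketch; SMART_TURN_SERVICE_URL and the fallback URL are placeholders.
import os

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer

turn_analyzer = FalSmartTurnAnalyzer(
    url=os.getenv("SMART_TURN_SERVICE_URL", "https://your-smart-turn-endpoint"),
    params=SmartTurnParams(stop_secs=3.0),
)
```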
The LocalCoreMLSmartTurnAnalyzer runs inference locally using CoreML, providing lower latency and no network dependencies.
smart_turn_model_path: Path to the directory containing the Smart Turn model.
sample_rate: Audio sample rate (will be set by the transport if not provided).
params: Configuration parameters for turn detection.
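A minimal sketch of constructing the CoreML analyzer; the import path is an assumption, and the model path should point at a locally cloned copy of the smart-turn repository (see the setup steps below).

```python
# Illustrative sketch; import path is an assumption, model path points at a local clone.
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_coreml_smart_turn import LocalCoreMLSmartTurnAnalyzer

turn_analyzer = LocalCoreMLSmartTurnAnalyzer(
    smart_turn_model_path="./smart-turn",  # directory containing the CoreML model
    params=SmartTurnParams(),
)
```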
The LocalSmartTurnAnalyzer runs inference locally using PyTorch and Hugging Face Transformers, providing a cross-platform solution.
smart_turn_model_path: Path to the Smart Turn model or Hugging Face model identifier. Defaults to the official “pipecat-ai/smart-turn” model.
sample_rate: Audio sample rate (will be set by the transport if not provided).
params: Configuration parameters for turn detection.
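A minimal sketch of the PyTorch analyzer; the import path is an assumption. With no explicit path, the analyzer is documented to fall back to the “pipecat-ai/smart-turn” Hugging Face model.

```python
# Illustrative sketch; import path is an assumption.
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn import LocalSmartTurnAnalyzer

# Uses the default "pipecat-ai/smart-turn" Hugging Face model.
turn_analyzer = LocalSmartTurnAnalyzer(params=SmartTurnParams())

# Or point at a local clone of the model repository:
# turn_analyzer = LocalSmartTurnAnalyzer(
#     smart_turn_model_path="./smart-turn",
#     params=SmartTurnParams(),
# )
```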
To use the LocalCoreMLSmartTurnAnalyzer or LocalSmartTurnAnalyzer, you need to set up the model locally:
1. Install and initialize Git LFS (Large File Storage).
2. Clone the Smart Turn model repository.
3. Set an environment variable to the cloned repository path, then pass it to the analyzer as in the sketch below.
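The following sketch shows how the cloned path can be wired into the analyzer. The variable name LOCAL_SMART_TURN_MODEL_PATH is a hypothetical choice; use whatever name you set in the step above.

```python
# Illustrative sketch; LOCAL_SMART_TURN_MODEL_PATH is a hypothetical variable name.
import os

from pipecat.audio.turn.smart_turn.local_coreml_smart_turn import LocalCoreMLSmartTurnAnalyzer

model_path = os.getenv("LOCAL_SMART_TURN_MODEL_PATH", "./smart-turn")
turn_analyzer = LocalCoreMLSmartTurnAnalyzer(smart_turn_model_path=model_path)
```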
Note that the CoreML model is optimized for Apple Silicon devices. If you’re using a different platform, consider using the PyTorch-based LocalSmartTurnAnalyzer or the remote Smart Turn service.
Learn more about the CoreML setup in the official repository instructions
Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:
Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.
VAD Processing: Voice Activity Detection segments the audio into speech and non-speech portions.
Turn Analysis: When VAD detects a pause in speech, the buffered audio segment is passed to the smart-turn model, which classifies the turn as either complete or incomplete.
The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than stop_secs, the turn is automatically marked as complete.
Adjust the stop_secs parameter based on your application’s needs for responsiveness.
The LocalSmartTurnAnalyzer runs on CPU by default but will use CUDA if available.