Smart Turn Overview
Advanced conversational turn detection powered by machine learning models
Overview
Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD) which only detects speech vs. non-speech, Smart Turn Detection uses machine learning to recognize natural conversational cues like intonation patterns and linguistic signals.
The underlying Smart Turn model is open source; contributions to model training and development are welcome.
Available Implementations
Pipecat provides multiple implementations of Smart Turn Detection for different deployment scenarios:
- Fal Smart Turn: Cloud-hosted inference using Fal.ai; easy to set up with just an API key
- Local CoreML Smart Turn: Runs inference locally on Apple Silicon devices for low latency
Installation
Smart Turn Detection requires additional dependencies; which ones you need depends on the implementation you choose.
How It Works
Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:
1. Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.
2. VAD Processing: Voice Activity Detection segments the audio into speech and non-speech portions.
3. Turn Analysis: When VAD detects a pause in speech:
   - The ML model analyzes the speech segment for natural completion cues
   - It identifies acoustic and linguistic patterns that indicate turn completion
   - A decision is made whether the turn is complete or incomplete
The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than `stop_secs`, the turn is automatically marked as complete.
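The fallback logic described above can be sketched as a small decision function. This is a minimal illustration, not Pipecat's actual API; `decide_turn`, `TurnDecision`, and the argument names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnDecision:
    complete: bool
    reason: str  # "model", "silence_fallback", or "incomplete"


def decide_turn(model_says_complete: bool, silence_secs: float,
                stop_secs: float) -> TurnDecision:
    """Combine the ML classification with the silence-based fallback."""
    if model_says_complete:
        # The model found natural completion cues (intonation, phrasing).
        return TurnDecision(True, "model")
    if silence_secs >= stop_secs:
        # Fallback: the model judged the turn incomplete, but silence has
        # lasted longer than stop_secs, so the turn ends anyway.
        return TurnDecision(True, "silence_fallback")
    return TurnDecision(False, "incomplete")
```

For example, `decide_turn(False, 3.5, 3.0)` ends the turn via the silence fallback even though the model judged it incomplete.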
Integration with Transport
All Smart Turn implementations use the same basic integration pattern:
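As a sketch of that pattern, here is a transport configured with a turn analyzer, assuming the Fal implementation and a Daily transport. The import paths, parameter names, and environment variable follow Pipecat's documented layout but may differ between versions, so verify them against the reference for your release.

```python
import os

# Import paths below are assumptions based on Pipecat's module layout;
# check your installed version before copying.
from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

room_url = "https://example.daily.co/room"  # placeholder
token = "..."  # placeholder

transport = DailyTransport(
    room_url,
    token,
    "Smart Turn bot",
    DailyParams(
        audio_in_enabled=True,
        vad_enabled=True,
        # Short VAD stop_secs (0.2 s) so the turn analyzer, not the VAD,
        # decides when the turn is over.
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        # Environment variable name is illustrative.
        turn_analyzer=FalSmartTurnAnalyzer(api_key=os.getenv("FAL_SMART_TURN_API_KEY")),
    ),
)
```

The Local CoreML implementation plugs in the same way: only the `turn_analyzer` argument changes.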
Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short `stop_secs` value. We recommend 0.2 seconds.
SmartTurnParams
All Smart Turn implementations use the same `SmartTurnParams` class to configure behavior:
- `stop_secs`: Duration of silence in seconds required before triggering a silence-based end of turn
- Pre-speech buffer: Amount of audio (in milliseconds) to include from before speech is detected
- Maximum duration: Maximum allowed segment duration in seconds; for segments longer than this value, a rolling window is used
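A self-contained sketch of these parameters and the rolling-window behavior follows. Field names other than `stop_secs`, and all default values, are assumptions for illustration; check `SmartTurnParams` in your Pipecat version for the real names and defaults.

```python
from dataclasses import dataclass


@dataclass
class SmartTurnParamsSketch:
    stop_secs: float = 3.0          # silence-based end-of-turn fallback (assumed default)
    pre_speech_ms: float = 0.0      # audio kept from before speech onset (assumed name)
    max_duration_secs: float = 8.0  # cap on analyzed segment length (assumed name)


def rolling_window(samples: list[float], sample_rate: int,
                   params: SmartTurnParamsSketch) -> list[float]:
    """Keep only the most recent max_duration_secs of audio for long segments."""
    max_samples = int(params.max_duration_secs * sample_rate)
    if len(samples) > max_samples:
        # Segment exceeds the cap: analyze only the trailing window.
        return samples[-max_samples:]
    return samples
```

With a 10 Hz toy sample rate and `max_duration_secs=5.0`, a 100-sample segment is trimmed to its last 50 samples, while shorter segments pass through unchanged.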
Notes
- The model is designed for English speech; performance may vary with other languages
- You can adjust the `stop_secs` parameter based on your application’s needs for responsiveness
- Smart Turn generally provides a more natural conversational experience but is computationally more intensive than simple VAD