Smart Turn Overview

Overview

Smart Turn Detection is an advanced feature in Pipecat that determines when a user has finished speaking and the bot should respond. Unlike basic Voice Activity Detection (VAD) which only detects speech vs. non-speech, Smart Turn Detection uses a machine learning model to recognize natural conversational cues like intonation patterns and linguistic signals.

Smart Turn Model

Open source model for advanced conversational turn detection. Contribute to model training and development.

Data Collector

Contribute conversational data to improve the smart-turn model

Data Classifier

Help classify turn completion patterns in conversations

Pipecat provides three implementations of Smart Turn Detection:

FalSmartTurnAnalyzer - Uses Fal’s hosted smart-turn model for inference
LocalCoreMLSmartTurnAnalyzer - Runs inference locally on Apple Silicon using CoreML (not currently recommended)
LocalSmartTurnAnalyzerV2 - Runs inference locally using PyTorch and Hugging Face Transformers

All implementations share the same underlying API and parameters, making it easy to switch between them based on your deployment requirements.
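Because the three classes take the same `params` and `sample_rate` arguments, switching between them can be reduced to a lookup. The following is a minimal sketch, not part of Pipecat: the `ANALYZERS` table, the short keys (`"fal"`, `"coreml"`, `"pytorch"`), and the `make_turn_analyzer` helper are hypothetical names introduced for illustration. The module paths and class names are the ones used in this page's examples.

```python
import importlib

# Module paths and class names are taken from the examples on this page.
# The short keys are hypothetical labels chosen for this sketch.
ANALYZERS = {
    "fal": (
        "pipecat.audio.turn.smart_turn.fal_smart_turn",
        "FalSmartTurnAnalyzer",
    ),
    "coreml": (
        "pipecat.audio.turn.smart_turn.local_coreml_smart_turn",
        "LocalCoreMLSmartTurnAnalyzer",
    ),
    "pytorch": (
        "pipecat.audio.turn.smart_turn.local_smart_turn_v2",
        "LocalSmartTurnAnalyzerV2",
    ),
}


def make_turn_analyzer(kind, **kwargs):
    """Lazily import the chosen analyzer class and construct it.

    This helper is hypothetical and not part of Pipecat. Keyword arguments
    are passed through unchanged, e.g. url=... or smart_turn_model_path=...
    """
    if kind not in ANALYZERS:
        raise ValueError(f"unknown analyzer kind: {kind!r}")
    module_name, class_name = ANALYZERS[kind]
    # The import only succeeds if the matching pipecat-ai extra is installed.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**kwargs)
```

The lazy import means that only the dependencies of the selected implementation need to be installed.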

Installation

The Smart Turn Detection feature requires additional dependencies depending on which implementation you choose. For Fal’s hosted service inference:

pip install "pipecat-ai[remote-smart-turn]"

For local inference (CoreML or PyTorch based):

pip install "pipecat-ai[local-smart-turn]"

Integration with Transport

Smart Turn Detection is integrated into your application by setting one of the available turn analyzers as the turn_analyzer parameter in your transport configuration:

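from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer
from pipecat.transports.base_transport import TransportParams
# SmallWebRTCTransport's import path may differ between Pipecat versions
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport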

transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        # Other transport parameters...
        turn_analyzer=FalSmartTurnAnalyzer(url=remote_smart_turn_url),
    ),
)

Smart Turn Detection requires VAD to be enabled and works best when the VAD analyzer is set to a short stop_secs value. We recommend 0.2 seconds.

audio_in_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2))

Configuration

All implementations use the same SmartTurnParams class to configure behavior:

stop_secs (float, default: 3.0)
Duration of silence, in seconds, required before triggering a silence-based end of turn.

pre_speech_ms (float, default: 0.0)
Amount of audio, in milliseconds, to include before speech is detected.

max_duration_secs (float, default: 8.0)
Maximum allowed segment duration, in seconds. For segments longer than this value, a rolling window is used.
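To make the rolling window concrete, here is a minimal sketch of one way to cap a segment at max_duration_secs. It assumes the window keeps the most recent audio, which this page does not state, and the function name `trim_to_window` is hypothetical. It is an illustration, not Pipecat's implementation.

```python
def trim_to_window(samples, sample_rate, max_duration_secs=8.0):
    """Return at most the last `max_duration_secs` of audio samples.

    Assumption: the rolling window keeps the most recent audio.
    """
    max_samples = int(sample_rate * max_duration_secs)
    if max_samples <= 0:
        return samples[:0]  # empty window
    if len(samples) <= max_samples:
        return samples  # short segments pass through unchanged
    return samples[-max_samples:]


# Ten seconds of 16 kHz audio is reduced to the final eight seconds.
audio = list(range(10 * 16000))
window = trim_to_window(audio, sample_rate=16000, max_duration_secs=8.0)
```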

Remote Smart Turn

The FalSmartTurnAnalyzer class uses a remote service for turn detection inference.

Constructor Parameters

url (str, required)
The URL of the remote Smart Turn service.

sample_rate (Optional[int], default: None)
Audio sample rate. If not provided, it is set by the transport.

params (SmartTurnParams, default: SmartTurnParams())
Configuration parameters for turn detection.

Example

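import os

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.base_transport import TransportParams
# SmallWebRTCTransport's import path may differ between Pipecat versions
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport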

# Get the URL for the remote Smart Turn service
remote_smart_turn_url = os.getenv("REMOTE_SMART_TURN_URL")

# Create the transport with Smart Turn detection
transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        turn_analyzer=FalSmartTurnAnalyzer(
            url=remote_smart_turn_url,
            params=SmartTurnParams(
                stop_secs=3.0,
                pre_speech_ms=0.0,
                max_duration_secs=8.0
            )
        ),
    ),
)

Local Smart Turn (CoreML)

The LocalCoreMLSmartTurnAnalyzer runs inference locally using CoreML, providing lower latency and no network dependencies. We currently recommend using the PyTorch implementation with the MPS backend on Apple Silicon, rather than CoreML, due to improved performance.

Constructor Parameters

smart_turn_model_path (str, required)
Path to the directory containing the Smart Turn model.

sample_rate (Optional[int], default: None)
Audio sample rate. If not provided, it is set by the transport.

params (SmartTurnParams, default: SmartTurnParams())
Configuration parameters for turn detection.

Example

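import os

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_coreml_smart_turn import LocalCoreMLSmartTurnAnalyzer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.base_transport import TransportParams
# SmallWebRTCTransport's import path may differ between Pipecat versions
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport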

# Path to the Smart Turn model directory
smart_turn_model_path = os.getenv("LOCAL_SMART_TURN_MODEL_PATH")

# Create the transport with local Smart Turn detection
transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        turn_analyzer=LocalCoreMLSmartTurnAnalyzer(
            smart_turn_model_path=smart_turn_model_path,
            params=SmartTurnParams(
                stop_secs=2.0,  # Shorter stop time when using Smart Turn
                pre_speech_ms=0.0,
                max_duration_secs=8.0
            )
        ),
    ),
)

Local Smart Turn (PyTorch)

The LocalSmartTurnAnalyzerV2 runs inference locally using PyTorch and Hugging Face Transformers, providing a cross-platform solution.

Constructor Parameters

smart_turn_model_path (str, default: "pipecat-ai/smart-turn-v2")
Path to the Smart Turn model or a Hugging Face model identifier. Defaults to the official "pipecat-ai/smart-turn-v2" model.

sample_rate (Optional[int], default: None)
Audio sample rate. If not provided, it is set by the transport.

params (SmartTurnParams, default: SmartTurnParams())
Configuration parameters for turn detection.

Example

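import os

from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v2 import LocalSmartTurnAnalyzerV2
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.base_transport import TransportParams
# SmallWebRTCTransport's import path may differ between Pipecat versions
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport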

# Optional: Path to the local Smart Turn model
# If not provided, it will download from Hugging Face
smart_turn_model_path = os.getenv("LOCAL_SMART_TURN_MODEL_PATH")

# Create the transport with PyTorch-based Smart Turn detection
transport = SmallWebRTCTransport(
    webrtc_connection=webrtc_connection,
    params=TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
        turn_analyzer=LocalSmartTurnAnalyzerV2(
            smart_turn_model_path=smart_turn_model_path,
            params=SmartTurnParams(
                stop_secs=2.0,
                pre_speech_ms=0.0,
                max_duration_secs=8.0
            )
        ),
    ),
)

Local Model Setup

To use the LocalCoreMLSmartTurnAnalyzer or LocalSmartTurnAnalyzerV2, you need to set up the model locally:

Install Git LFS (Large File Storage). On macOS with Homebrew:

brew install git-lfs

Initialize Git LFS:

git lfs install

Clone the Smart Turn model repository:

git clone https://huggingface.co/pipecat-ai/smart-turn-v2

Set the environment variable to the cloned repository path:

# Add to your .env file or environment
export LOCAL_SMART_TURN_MODEL_PATH=/path/to/smart-turn-v2
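Before passing the environment variable to LocalSmartTurnAnalyzerV2, it can help to check that it points at a real directory. The sketch below is one possible approach under stated assumptions: the helper name `resolve_model_path` is hypothetical, and the fallback relies on the default identifier "pipecat-ai/smart-turn-v2" documented above. The fallback is not suitable for LocalCoreMLSmartTurnAnalyzer, which requires a local path.

```python
import os

# Default Hugging Face identifier documented for LocalSmartTurnAnalyzerV2.
DEFAULT_MODEL_ID = "pipecat-ai/smart-turn-v2"


def resolve_model_path(env=None):
    """Return the local model directory if it exists, else the default ID.

    Hypothetical helper: `env` is any mapping, defaulting to os.environ.
    """
    env = os.environ if env is None else env
    path = env.get("LOCAL_SMART_TURN_MODEL_PATH")
    if path and os.path.isdir(path):
        return path
    return DEFAULT_MODEL_ID
```

Falling back silently can hide a mistyped path, so an application may prefer to log or raise when the variable is set but the directory is missing.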

How It Works

Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points:

Audio Buffering: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio.
VAD Processing: Voice Activity Detection segments the audio into speech and non-speech portions.
Turn Analysis: When VAD detects a pause in speech:
- The ML model analyzes the speech segment for natural completion cues
- It identifies acoustic and linguistic patterns that indicate turn completion
- A decision is made whether the turn is complete or incomplete

The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than stop_secs, the turn is automatically marked as complete.
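The decision described above, including the silence fallback, can be sketched as a small pure function. This is an illustration of the logic as described on this page, not Pipecat's code: the `TurnState` enum and the `decide_turn` function are hypothetical names, and the model's verdict is passed in as a boolean rather than computed.

```python
from enum import Enum


class TurnState(Enum):
    """Hypothetical result type for this sketch."""

    COMPLETE = "complete"
    INCOMPLETE = "incomplete"


def decide_turn(model_says_complete, silence_secs, stop_secs=3.0):
    """Combine the model's verdict with the silence-based fallback."""
    if model_says_complete:
        return TurnState.COMPLETE
    # Fallback: silence lasting longer than stop_secs ends the turn anyway.
    if silence_secs > stop_secs:
        return TurnState.COMPLETE
    return TurnState.INCOMPLETE
```

With the default stop_secs of 3.0, an utterance the model considers unfinished is still closed once the user has been silent for more than three seconds.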

Notes

The model supports 14 languages; see the source repository for details.
You can adjust the stop_secs parameter to match your application’s responsiveness needs.
Smart Turn generally provides a more natural conversational experience but is more computationally intensive than simple VAD.
The PyTorch-based LocalSmartTurnAnalyzerV2 uses CUDA or MPS if available and otherwise runs on CPU.
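To check which device is likely to be used on a given machine, a selection of this kind can be sketched as follows. The function names are hypothetical, and the CUDA-before-MPS order is an assumption, since this page does not state a priority. The two PyTorch availability checks are standard PyTorch calls, and the sketch falls back to CPU when PyTorch is not installed.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device name from availability flags.

    Assumption: CUDA is preferred over MPS; the order is not documented here.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available backends, defaulting to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed
    return pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```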
