Overview
User turn strategies provide fine-grained control over how user speaking turns are detected in conversations. They determine when a user’s turn starts (user begins speaking) and when it stops (user finishes speaking and expects a response). By default, Pipecat uses a combination of VAD (Voice Activity Detection) and transcription-based detection:
- Start: VAD detection or transcription received
- Stop: Transcription received after VAD indicates silence
How It Works
- Turn Start Detection: When any start strategy triggers, the user aggregator:
  - Marks the start of a user turn
  - Optionally emits UserStartedSpeakingFrame
  - Optionally emits an interruption frame (if the bot is speaking)
- During User Turn: The aggregator collects transcriptions and audio frames.
- Turn Stop Detection: When a stop strategy triggers, the user aggregator:
  - Marks the end of the user turn
  - Emits UserStoppedSpeakingFrame
  - Pushes the aggregated user message to the LLM context
- Timeout Handling: If no stop strategy triggers within user_turn_stop_timeout seconds (default: 5.0), the turn is automatically ended.
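Any processor placed after the user aggregator can observe this flow by watching for the speaking frames. The sketch below uses Pipecat's standard frame-processor pattern; treat the exact import paths as version-dependent.

```python
from pipecat.frames.frames import Frame, UserStartedSpeakingFrame, UserStoppedSpeakingFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TurnLogger(FrameProcessor):
    """Logs the user turn frames described above as they pass through the pipeline."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, UserStartedSpeakingFrame):
            print("User turn started")  # a start strategy triggered
        elif isinstance(frame, UserStoppedSpeakingFrame):
            print("User turn ended")  # a stop strategy (or the timeout) triggered

        await self.push_frame(frame, direction)
```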
Configuration
User turn strategies are configured via LLMUserAggregatorParams when creating an LLMContextAggregatorPair:
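The snippet below sketches the overall pattern. The strategy classes are the ones documented on this page, but the import paths and keyword names (user_turn_strategies, start, stop) are assumptions for illustration; check the API reference for your Pipecat version.

```python
# Import paths and keyword names below are assumptions; adjust for your Pipecat version.
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.turns.user_turn_strategies import (  # hypothetical module path
    TranscriptionUserTurnStopStrategy,
    UserTurnStrategies,
    VADUserTurnStartStrategy,
)

context = LLMContext(messages=[{"role": "system", "content": "You are a helpful assistant."}])

user_params = LLMUserAggregatorParams(
    user_turn_strategies=UserTurnStrategies(  # assumed keyword name
        start=[VADUserTurnStartStrategy()],
        stop=[TranscriptionUserTurnStopStrategy()],
    ),
    user_turn_stop_timeout=5.0,  # assumed keyword; end the turn if no stop strategy fires
)

context_aggregator = LLMContextAggregatorPair(context, user_params=user_params)
```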
Start Strategies
Start strategies determine when a user’s turn begins. Multiple strategies can be provided, and the first one to trigger will signal the start of a user turn.
Base Parameters
All start strategies inherit these parameters:
- enable_interruptions: If True, the user aggregator will emit an interruption frame when the user turn starts, allowing the user to interrupt the bot.
- enable_user_speaking_frames: If True, the user aggregator will emit frames indicating when the user starts speaking. Disable this if another component (e.g., an STT service) already generates these frames.
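For example, to keep interruptions enabled while letting an upstream STT service own the speaking frames (a sketch; the keyword names mirror the fields above, and the import path is an assumption):

```python
from pipecat.turns.user_turn_strategies import VADUserTurnStartStrategy  # hypothetical path

start_strategy = VADUserTurnStartStrategy(
    enable_interruptions=True,          # user speech interrupts the bot
    enable_user_speaking_frames=False,  # an upstream component already emits speaking frames
)
```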
VADUserTurnStartStrategy
Triggers a user turn start based on Voice Activity Detection. This is the most responsive strategy, detecting speech as soon as the VAD indicates the user has started speaking.
TranscriptionUserTurnStartStrategy
Triggers a user turn start when a transcription is received. This serves as a fallback for scenarios where VAD-based detection fails (e.g., when the user speaks very softly) but the STT service still produces transcriptions.
- Whether to trigger on interim (partial) transcription frames for earlier detection.
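Combining both start strategies gives VAD responsiveness with a transcription fallback; the first one to trigger starts the turn. A sketch, assuming the import path and keyword names shown:

```python
from pipecat.turns.user_turn_strategies import (  # hypothetical module path
    TranscriptionUserTurnStartStrategy,
    UserTurnStrategies,
    VADUserTurnStartStrategy,
)

turn_strategies = UserTurnStrategies(
    start=[
        VADUserTurnStartStrategy(),            # fast path: trigger as soon as speech is detected
        TranscriptionUserTurnStartStrategy(),  # fallback: trigger when a transcription arrives
    ],
)
```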
MinWordsUserTurnStartStrategy
Requires the user to speak a minimum number of words before triggering a turn start. This is useful for preventing brief utterances like “okay” or “yeah” from triggering responses.
- min_words: Minimum number of spoken words required to trigger the start of a user turn.
- Whether to consider interim transcription frames for earlier detection.
When the bot is not speaking, this strategy will trigger after just 1 word. The min_words threshold only applies when the bot is actively speaking, preventing short affirmations from interrupting the bot.
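A minimal sketch, assuming the import path shown; min_words is the parameter documented above:

```python
from pipecat.turns.user_turn_strategies import MinWordsUserTurnStartStrategy  # hypothetical path

# While the bot is speaking, short utterances like "okay" will not start a turn;
# once the bot is silent, a single word is enough.
start_strategy = MinWordsUserTurnStartStrategy(min_words=3)
```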
ExternalUserTurnStartStrategy
Delegates turn start detection to an external processor. This strategy listens for UserStartedSpeakingFrame frames emitted by other components in the pipeline (such as speech-to-speech services).
This strategy automatically sets enable_interruptions=False and enable_user_speaking_frames=False since these are expected to be handled by the external processor.
Stop Strategies
Stop strategies determine when a user’s turn ends and the bot should respond.
Base Parameters
All stop strategies inherit these parameters:
- If True, the aggregator will emit frames indicating when the user stops speaking. Disable this if another component already generates these frames.
TranscriptionUserTurnStopStrategy
The default stop strategy that signals the end of a user turn when transcription is received and VAD indicates silence.
- A short delay in seconds used to handle consecutive or slightly delayed transcriptions gracefully.
TurnAnalyzerUserTurnStopStrategy
Uses an AI-powered turn detection model to determine when the user has finished speaking. This provides more intelligent end-of-turn detection that can understand conversational context.
- The turn detection analyzer instance to use for end-of-turn detection.
- A short delay in seconds used to handle consecutive or slightly delayed transcriptions.
ExternalUserTurnStopStrategy
Delegates turn stop detection to an external processor. This strategy listens for UserStoppedSpeakingFrame frames emitted by other components in the pipeline.
- A short delay in seconds used to handle consecutive or slightly delayed transcriptions.
UserTurnStrategies
Container for configuring user turn start and stop strategies.
- List of strategies used to detect when the user starts speaking. The first strategy to trigger will signal the start of the user’s turn.
- List of strategies used to detect when the user stops speaking and expects a response.
ExternalUserTurnStrategies
A convenience class that preconfigures UserTurnStrategies with external strategies for both start and stop detection. Use this when an external processor (such as a speech-to-speech service) controls turn management.
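A sketch of using it, assuming the import paths and the user_turn_strategies keyword shown:

```python
# Import paths and keyword names below are assumptions; adjust for your Pipecat version.
from pipecat.processors.aggregators.llm_response_universal import LLMUserAggregatorParams
from pipecat.turns.user_turn_strategies import ExternalUserTurnStrategies

# The speech-to-speech service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame,
# so the aggregator only listens for them instead of detecting turns itself.
user_params = LLMUserAggregatorParams(user_turn_strategies=ExternalUserTurnStrategies())
```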
Usage Examples
Default Behavior
The default configuration uses VAD and transcription for turn detection:
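A sketch, assuming the import paths and keyword names used in the Configuration section; passing no strategies keeps the defaults, and the explicit form below is its equivalent:

```python
# Import paths and keyword names below are assumptions; adjust for your Pipecat version.
from pipecat.processors.aggregators.llm_response_universal import LLMUserAggregatorParams
from pipecat.turns.user_turn_strategies import (
    TranscriptionUserTurnStartStrategy,
    TranscriptionUserTurnStopStrategy,
    UserTurnStrategies,
    VADUserTurnStartStrategy,
)

# No strategies specified: the VAD + transcription defaults apply.
user_params = LLMUserAggregatorParams()

# Explicit equivalent of the defaults:
user_params = LLMUserAggregatorParams(
    user_turn_strategies=UserTurnStrategies(
        start=[VADUserTurnStartStrategy(), TranscriptionUserTurnStartStrategy()],
        stop=[TranscriptionUserTurnStopStrategy()],
    )
)
```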
Minimum Words for Interruption
Require users to speak at least 3 words before they can interrupt the bot:
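A sketch, assuming the same import paths and keyword names as above; min_words comes from this page:

```python
# Import paths and keyword names below are assumptions; adjust for your Pipecat version.
from pipecat.processors.aggregators.llm_response_universal import LLMUserAggregatorParams
from pipecat.turns.user_turn_strategies import MinWordsUserTurnStartStrategy, UserTurnStrategies

user_params = LLMUserAggregatorParams(
    user_turn_strategies=UserTurnStrategies(
        # While the bot is speaking, at least 3 words are required to interrupt it.
        start=[MinWordsUserTurnStartStrategy(min_words=3)],
    )
)
```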
Local Smart Turn Detection
Use a local turn detection model instead of a cloud service:
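A sketch, assuming the import paths, the turn_analyzer keyword, and the LocalSmartTurnAnalyzerV3 class name; see Smart Turn Detection for the analyzers available in your version:

```python
# Import paths, keyword names, and the analyzer class below are assumptions; adjust for your version.
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.processors.aggregators.llm_response_universal import LLMUserAggregatorParams
from pipecat.turns.user_turn_strategies import TurnAnalyzerUserTurnStopStrategy, UserTurnStrategies

user_params = LLMUserAggregatorParams(
    user_turn_strategies=UserTurnStrategies(
        stop=[
            # The end-of-turn model runs locally instead of calling a cloud service.
            TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())
        ],
    )
)
```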
Related
- User Input Muting - Control when user input is ignored
- Smart Turn Detection - AI-powered turn detection