Overview

SileroVAD is a frame processor that performs Voice Activity Detection (VAD) using the Silero VAD model. It analyzes audio frames to detect when users start and stop speaking, and can handle interruptions in conversations.

Constructor Parameters

sample_rate
int
default: "16000"

Audio sample rate in Hz

vad_params
VADParams
default: "VADParams()"

Voice Activity Detection parameters

audio_passthrough
bool
default: "False"

Whether to pass audio frames downstream

VADParams Configuration

class VADParams:
    threshold: float              # Speech detection threshold
    min_speech_duration_ms: int   # Minimum speech duration, in milliseconds
    max_speech_duration_s: int    # Maximum speech duration, in seconds
    min_silence_duration_ms: int  # Minimum silence duration, in milliseconds
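
To make the duration parameters concrete, the following sketch converts a duration in milliseconds into the number of consecutive analysis frames it spans. The helper name `frames_needed` and the 512-sample frame size are assumptions made for illustration; neither is taken from the SileroVAD API.

```python
def frames_needed(duration_ms: int, sample_rate: int = 16000,
                  frame_samples: int = 512) -> int:
    """Return how many whole analysis frames cover duration_ms (rounded up).

    frame_samples is an assumed analysis window size, not a documented value.
    """
    if duration_ms < 0 or sample_rate <= 0 or frame_samples <= 0:
        raise ValueError("duration must be >= 0; rate and frame size must be > 0")
    total_samples_x1000 = duration_ms * sample_rate  # samples, scaled by 1000
    per_frame_x1000 = 1000 * frame_samples           # same scaling per frame
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_samples_x1000 // per_frame_x1000)


if __name__ == "__main__":
    print(frames_needed(250))  # 8 frames for min_speech_duration_ms=250
    print(frames_needed(100))  # 4 frames for min_silence_duration_ms=100
```

Under these assumptions, a 250 ms minimum speech duration corresponds to 8 consecutive speech frames, and a 100 ms minimum silence duration to 4 consecutive silent frames.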

Input Frames

AudioRawFrame
Frame
required

Raw audio data for VAD analysis. The audio must match the configured sample rate.
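
A minimal sketch of guarding against a sample-rate mismatch before audio reaches the VAD. The `AudioRawFrame` class here is a stand-in dataclass, and its field names, the 16-bit PCM assumption, and both helper functions are hypothetical rather than part of the documented API.

```python
from dataclasses import dataclass


@dataclass
class AudioRawFrame:
    """Stand-in for the real frame type; field names are assumptions."""
    audio: bytes
    sample_rate: int
    num_channels: int = 1


def frame_duration_ms(frame: AudioRawFrame, bytes_per_sample: int = 2) -> float:
    """Duration of the frame in milliseconds, assuming 16-bit PCM by default."""
    samples = len(frame.audio) // (bytes_per_sample * frame.num_channels)
    return 1000.0 * samples / frame.sample_rate


def check_sample_rate(frame: AudioRawFrame, expected: int) -> None:
    """Raise ValueError if the frame's sample rate differs from the expected one."""
    if frame.sample_rate != expected:
        raise ValueError(
            f"frame sample rate {frame.sample_rate} Hz != configured {expected} Hz"
        )


if __name__ == "__main__":
    frame = AudioRawFrame(audio=b"\x00" * 1024, sample_rate=16000)
    check_sample_rate(frame, 16000)   # passes silently
    print(frame_duration_ms(frame))   # 32.0
```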

Output Frames

Speech Detection Frames

UserStartedSpeakingFrame
SystemFrame

Emitted when speech is detected

UserStoppedSpeakingFrame
SystemFrame

Emitted when speech ends

Interruption Frames

StartInterruptionFrame
SystemFrame

Emitted out-of-band when speech interrupts ongoing processing

StopInterruptionFrame
SystemFrame

Emitted when interrupting speech ends

State Management

VAD States

from enum import Enum, auto

class VADState(Enum):
    QUIET = auto()     # No speech detected
    STARTING = auto()  # Potential speech beginning
    SPEAKING = auto()  # Active speech
    STOPPING = auto()  # Potential speech ending

The processor tracks state transitions to generate appropriate frames:

  • QUIET → SPEAKING: Generates UserStartedSpeakingFrame
  • SPEAKING → QUIET: Generates UserStoppedSpeakingFrame
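
The transitions above can be sketched as a small state machine that passes through STARTING and STOPPING internally but reports an event only once speech or silence is confirmed. This is an illustrative sketch, not the processor's actual implementation: `VADStateTracker`, its frame-count thresholds, and the use of frame names as plain strings are all assumptions.

```python
from enum import Enum, auto
from typing import Optional


class VADState(Enum):
    QUIET = auto()
    STARTING = auto()
    SPEAKING = auto()
    STOPPING = auto()


class VADStateTracker:
    """Hypothetical tracker: confirms a change after N consecutive frames."""

    def __init__(self, start_frames: int, stop_frames: int) -> None:
        self.state = VADState.QUIET
        self._start_frames = start_frames
        self._stop_frames = stop_frames
        self._count = 0  # consecutive frames supporting a pending transition

    def update(self, is_speech: bool) -> Optional[str]:
        """Feed one per-frame decision; return a frame name on a confirmed change."""
        if self.state in (VADState.QUIET, VADState.STARTING):
            if is_speech:
                self._count += 1
                self.state = VADState.STARTING
                if self._count >= self._start_frames:
                    self.state = VADState.SPEAKING
                    self._count = 0
                    return "UserStartedSpeakingFrame"
            else:
                # Speech did not last long enough; fall back to QUIET.
                self.state = VADState.QUIET
                self._count = 0
        else:  # SPEAKING or STOPPING
            if not is_speech:
                self._count += 1
                self.state = VADState.STOPPING
                if self._count >= self._stop_frames:
                    self.state = VADState.QUIET
                    self._count = 0
                    return "UserStoppedSpeakingFrame"
            else:
                # Silence was too short; speech continues.
                self.state = VADState.SPEAKING
                self._count = 0
        return None


if __name__ == "__main__":
    tracker = VADStateTracker(start_frames=2, stop_frames=2)
    decisions = [True, False, True, True, True, False, True, False, False]
    events = [e for e in map(tracker.update, decisions) if e is not None]
    print(events)  # ['UserStartedSpeakingFrame', 'UserStoppedSpeakingFrame']
```

In this sketch a single speech frame followed by silence, or a single silent frame in the middle of speech, produces no event, which is the practical effect of not reporting the intermediate states.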

Usage Example

# Basic VAD setup
vad = SileroVAD(
    sample_rate=16000,
    vad_params=VADParams(
        threshold=0.5,
        min_speech_duration_ms=250,
        min_silence_duration_ms=100
    ),
    audio_passthrough=True
)

# Pipeline integration
pipeline = Pipeline([
    audio_input,
    vad,                 # Detects speech
    transcriber,         # Receives audio if audio_passthrough=True
    response_handler     # Handles speech events
])

Frame Flow

Interruption Handling

The processor provides special handling for interruptions:

  1. When speech is detected:

    # Normal speech event
    await self.push_frame(UserStartedSpeakingFrame())
    
    # Out-of-band interruption notification
    await self.push_frame(StartInterruptionFrame())
    
  2. When speech ends:

    # Normal speech event
    await self.push_frame(UserStoppedSpeakingFrame())
    
    # Out-of-band interruption end
    await self.push_frame(StopInterruptionFrame())
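
As an illustration of how a downstream consumer might react to these frames, the sketch below cancels an in-progress response when an interruption starts. The frame classes are empty stand-ins, and `ResponsePlayer` with its methods is hypothetical; only the frame names come from this page.

```python
import asyncio
from typing import List, Optional


class StartInterruptionFrame:
    """Stand-in for the real frame type."""


class StopInterruptionFrame:
    """Stand-in for the real frame type."""


class ResponsePlayer:
    """Hypothetical consumer that stops its response when interrupted."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self.log: List[str] = []

    async def _speak(self, text: str) -> None:
        try:
            for word in text.split():
                await asyncio.sleep(0.05)  # simulate time spent per word
                self.log.append(word)
        except asyncio.CancelledError:
            self.log.append("<cancelled>")
            raise

    def start_response(self, text: str) -> None:
        self._task = asyncio.create_task(self._speak(text))

    async def handle_frame(self, frame: object) -> None:
        if isinstance(frame, StartInterruptionFrame):
            # Cancel any ongoing response and wait for it to unwind.
            if self._task is not None and not self._task.done():
                self._task.cancel()
                try:
                    await self._task
                except asyncio.CancelledError:
                    pass
            self.log.append("interrupted")
        elif isinstance(frame, StopInterruptionFrame):
            self.log.append("interruption ended")


async def demo() -> List[str]:
    player = ResponsePlayer()
    player.start_response("one two three four five")
    await asyncio.sleep(0.01)  # the user starts speaking mid-response
    await player.handle_frame(StartInterruptionFrame())
    await player.handle_frame(StopInterruptionFrame())
    return player.log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Awaiting the cancelled task before logging ensures the response has fully stopped before the consumer treats the interruption as handled.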
    

Notes

  • Requires audio input at the configured sample rate
  • Interruption frames are sent out-of-band for immediate handling
  • Frames are emitted only on confirmed transitions into SPEAKING or QUIET; the intermediate STARTING and STOPPING states generate no frames
  • Audio passthrough can be enabled for downstream processing
  • Uses the Silero VAD model for accurate speech detection
  • Thread-safe for pipeline processing