SileroVAD
Voice Activity Detection processor that uses the Silero VAD model for speech detection
Overview
SileroVAD is a frame processor that performs Voice Activity Detection (VAD) using the Silero VAD model. It analyzes audio frames to detect when users start and stop speaking, and can handle interruptions in conversations.
Constructor Parameters
Audio sample rate in Hz
Voice Activity Detection parameters
Whether to pass audio frames downstream
VADParams Configuration
Input Frames
Raw audio data for VAD analysis. The audio should match the configured sample rate.
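The sample-rate requirement can be checked before audio reaches the processor. The following is a minimal sketch, not part of the SileroVAD API: the `RawAudio` dataclass, `duration_ms`, and `check_sample_rate` are hypothetical names introduced for illustration, and the audio is assumed to be 16-bit PCM.

```python
from dataclasses import dataclass


@dataclass
class RawAudio:
    """Hypothetical stand-in for a raw audio frame (assumed 16-bit PCM)."""
    audio: bytes
    sample_rate: int
    num_channels: int = 1


def duration_ms(frame: RawAudio) -> float:
    """Length of the chunk in milliseconds, assuming 2 bytes per sample."""
    bytes_per_sample = 2
    num_samples = len(frame.audio) // (bytes_per_sample * frame.num_channels)
    return 1000.0 * num_samples / frame.sample_rate


def check_sample_rate(frame: RawAudio, configured_rate: int) -> None:
    """Raise if the frame's sample rate differs from the configured one."""
    if frame.sample_rate != configured_rate:
        raise ValueError(
            f"frame is {frame.sample_rate} Hz, expected {configured_rate} Hz"
        )
```

A check like this fails fast on mismatched audio instead of letting the detector analyze audio at the wrong rate.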
Output Frames
Speech Detection Frames
UserStartedSpeakingFrame: Emitted when speech is detected
UserStoppedSpeakingFrame: Emitted when speech ends
Interruption Frames
Emitted out-of-band when speech interrupts ongoing processing
Emitted when interrupting speech ends
State Management
VAD States
The processor tracks state transitions to generate appropriate frames:
- QUIET → SPEAKING: Generates UserStartedSpeakingFrame
- SPEAKING → QUIET: Generates UserStoppedSpeakingFrame
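The transitions above can be sketched as a small state filter. This is an illustration under stated assumptions, not the processor's actual implementation: the `VADState` enum and the `frames_for_states` helper are hypothetical, and frame objects are represented by their names as plain strings. Only the state names and the two frame names come from this page.

```python
from enum import Enum
from typing import Iterable, List


class VADState(Enum):
    # Hypothetical enum; the state names follow this page.
    QUIET = 1
    STARTING = 2
    SPEAKING = 3
    STOPPING = 4


def frames_for_states(states: Iterable[VADState]) -> List[str]:
    """Return the frame names emitted for a sequence of per-chunk VAD states."""
    emitted: List[str] = []
    current = VADState.QUIET  # last confirmed (stable) state
    for new in states:
        # Transitional states are filtered out and never change the confirmed state.
        if new in (VADState.STARTING, VADState.STOPPING):
            continue
        if new == current:
            continue  # no transition, nothing to emit
        if new == VADState.SPEAKING:
            emitted.append("UserStartedSpeakingFrame")
        else:
            emitted.append("UserStoppedSpeakingFrame")
        current = new
    return emitted
```

In this sketch, a sequence that passes through STARTING but falls back to QUIET produces no frames, so brief noises that never reach SPEAKING do not trigger a start event.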
Usage Example
Frame Flow
Interruption Handling
The processor provides special handling for interruptions:
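- When speech is detected, an interruption frame is emitted out-of-band so that ongoing processing can be interrupted immediately.
- When speech ends, a frame marking the end of the interrupting speech is emitted.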
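The difference between in-band and out-of-band delivery can be sketched with a toy downstream processor. Everything here is hypothetical and only illustrates the idea: `SketchProcessor`, its methods, and the string frame names are not part of any real API, and dropping queued work on interruption is an assumption about what a downstream processor might do.

```python
from collections import deque
from typing import Deque, List


class SketchProcessor:
    """Hypothetical downstream processor with a normal queue and an out-of-band path."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()
        self.log: List[str] = []

    def queue_frame(self, frame: str) -> None:
        # In-band: the frame waits behind anything already queued.
        self.queue.append(frame)

    def push_out_of_band(self, frame: str) -> None:
        # Out-of-band: the frame is handled immediately, ahead of queued work.
        self.log.append(f"handled {frame}")
        if frame == "interruption-start":
            # Assumption: an interruption discards work that is still pending.
            dropped = len(self.queue)
            self.queue.clear()
            self.log.append(f"dropped {dropped} queued frame(s)")

    def drain(self) -> None:
        # Process whatever is left in the in-band queue, in order.
        while self.queue:
            self.log.append(f"handled {self.queue.popleft()}")
```

If two frames are queued and an interruption then arrives out-of-band, the interruption is handled first and the queued frames never run, which is the "immediate handling" behavior that out-of-band delivery is meant to allow.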
Notes
- Requires audio input at the configured sample rate
- Interruption frames are sent out-of-band for immediate handling
- The transitional STARTING and STOPPING states are filtered out; frames are generated only on QUIET → SPEAKING and SPEAKING → QUIET transitions
- Audio passthrough can be enabled for downstream processing
- Uses the Silero VAD model for accurate speech detection
- Thread-safe for pipeline processing