Overview
FireVadAnalyzer is a Pipecat VAD analyzer backed by
FireRedVAD, a streaming
voice activity detection model that supports 100+ languages. It processes
audio one 10 ms frame at a time and reports speech probability to Pipecat’s
VAD layer, letting transports detect when a user starts and stops speaking.
Source Repository
Source code, examples, and issues for the FireRedVAD integration
PyPI Package
The pipecat-firered-vad package on PyPI
FireRedVAD Model
The upstream FireRedVAD model and benchmarks
Model Weights
Download the FireRedVAD model weights from Hugging Face
Installation
This is a community-maintained package distributed separately from pipecat-ai:
Prerequisites
This integration requires no API key. It does, however, depend on the upstream FireRedVAD package (not published to PyPI) and locally downloaded model weights.
1. Install FireRedVAD
fireredvad is not on PyPI. Clone and install it from GitHub:
2. Download model weights
3. Audio requirements
FireRedVAD only accepts 16 kHz, 16-bit mono PCM audio (enforced at construction time). When using a transport such as DailyTransport, set
sample_rate=16000.
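The format requirement can be checked before audio ever reaches the analyzer. The following is a minimal sketch using only the Python standard library; `check_wav_format` and `split_frames` are hypothetical helper names, not part of the integration, and the 10 ms frame length is taken from the overview above.

```python
import wave

SAMPLE_RATE = 16_000   # Hz, the only rate FireRedVAD accepts
SAMPLE_WIDTH = 2       # bytes per sample (16-bit PCM)
CHANNELS = 1           # mono
FRAME_MS = 10          # frame duration stated in the overview

# 16000 samples/s * 10 ms = 160 samples, at 2 bytes each = 320 bytes per frame
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * SAMPLE_WIDTH * CHANNELS


def check_wav_format(wav_file) -> None:
    """Raise ValueError unless the WAV (path or file object) is 16 kHz, 16-bit, mono."""
    with wave.open(wav_file, "rb") as wav:
        actual = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
    expected = (SAMPLE_RATE, SAMPLE_WIDTH, CHANNELS)
    if actual != expected:
        raise ValueError(f"expected (rate, width, channels)={expected}, got {actual}")


def split_frames(pcm: bytes) -> list[bytes]:
    """Split raw PCM into whole 10 ms frames, dropping any trailing partial frame."""
    usable = len(pcm) - len(pcm) % FRAME_BYTES
    return [pcm[i:i + FRAME_BYTES] for i in range(0, usable, FRAME_BYTES)]
```

A check like this is mainly useful for offline testing with recorded files. In a live pipeline the transport's sample rate setting determines the format.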
Environment Variables
The integration does not read environment variables directly. The example uses the following for convenience:
FIREREDVAD_MODEL_DIR: Path to the downloaded Stream-VAD model directory, passed to the analyzer’s model_dir argument.
FIREREDVAD_USE_GPU: Set to 1 to enable GPU inference (default: 0).
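Because the integration itself ignores the environment, application code has to read these variables and pass the values on. One way to do that is sketched below. `FireRedSettings` and `load_settings` are hypothetical names, and failing fast when the model directory is unset is a choice made for this sketch, not documented behavior of the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FireRedSettings:
    model_dir: str   # forwarded to the analyzer's model_dir argument
    use_gpu: bool    # True only when FIREREDVAD_USE_GPU is exactly "1"


def load_settings(env: Mapping[str, str] = os.environ) -> FireRedSettings:
    """Read the two convenience variables from a mapping (os.environ by default)."""
    model_dir = env.get("FIREREDVAD_MODEL_DIR")
    if not model_dir:
        # Assumption of this sketch: a missing model path is treated as an error.
        raise RuntimeError(
            "FIREREDVAD_MODEL_DIR must point to the Stream-VAD model directory"
        )
    use_gpu = env.get("FIREREDVAD_USE_GPU", "0") == "1"
    return FireRedSettings(model_dir=model_dir, use_gpu=use_gpu)
```

Taking the environment as a parameter keeps the function easy to test with a plain dictionary.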
Configuration
Constructor parameters for FireVadAnalyzer (all keyword-only):
Path to the downloaded Stream-VAD model directory, e.g.
"pretrained_models/FireRedVAD/Stream-VAD".
Audio sample rate in Hz. Must be 16000 if provided (enforced).
Pipecat-level VAD parameters controlling turn-detection smoothing
(confidence, start_secs, stop_secs).
Optional VadMode sensitivity preset (0–3). When set, it overrides the
individual threshold/frame parameters below. See VAD modes.
Run DFSMN inference on GPU (requires CUDA).
Frames in the model’s internal sliding-window smoother. Larger values reduce
jitter at the cost of slightly more onset latency.
Model-level gate. Frames with a smoothed probability above this value are
considered speech. Range 0.0–1.0.
Extra frames prepended at speech onset to avoid clipping the leading edge of a
word.
Minimum consecutive speech frames before a segment is confirmed. Prevents
single-frame false positives.
Maximum frames in one speech segment before a forced split.
Silence frames required to close a speech segment. Higher values make the bot
wait longer before deciding the turn ended.
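To make the interaction between the smoothing window, the threshold, and the frame counts concrete, here is a simplified illustration. It is not FireRedVAD's implementation: the moving-average smoother, the default values, and the omission of onset padding and forced splits are all assumptions of this sketch. Only the names speech_threshold, min_speech_frame, and min_silence_frame come from this page.

```python
from collections import deque


def smooth(probs, window):
    """Trailing moving average over the last `window` frame probabilities."""
    recent = deque(maxlen=window)
    out = []
    for p in probs:
        recent.append(p)
        out.append(sum(recent) / len(recent))
    return out


def segment(probs, *, speech_threshold=0.5, min_speech_frame=3, min_silence_frame=5):
    """Return (start, end) frame index pairs, end exclusive.

    Default values are illustrative, not the package's defaults.
    """
    segments = []
    start = None      # first frame of the currently open segment, if any
    speech_run = 0    # consecutive speech frames seen while no segment is open
    silence_run = 0   # consecutive silence frames seen inside an open segment
    for i, p in enumerate(probs):
        is_speech = p > speech_threshold
        if start is None:
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= min_speech_frame:
                # Confirmed: the segment begins where the speech run began.
                start = i - speech_run + 1
                silence_run = 0
        else:
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run >= min_silence_frame:
                # Closed: the segment ends where the silence run began.
                segments.append((start, i - silence_run + 1))
                start, speech_run = None, 0
    if start is not None:
        # A segment still open at the end of input is closed at the last frame.
        segments.append((start, len(probs)))
    return segments
```

With 10 ms frames, each frame count translates directly into time: a silence requirement of 20 frames, for instance, means 200 ms of quiet before a segment closes.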
VAD modes
VadMode provides pre-tuned sensitivity presets. Passing one to the mode
argument adjusts speech_threshold, min_speech_frame, and min_silence_frame
together as a matched set.
| Preset | Value | Description |
|---|---|---|
| VadMode.VERY_PERMISSIVE | 0 | Catches soft/distant speech. May increase false alarms. |
| VadMode.PERMISSIVE | 1 | Balanced; a good starting point for most use cases. |
| VadMode.AGGRESSIVE | 2 | Suppresses background noise well. May clip quiet speech. |
| VadMode.VERY_AGGRESSIVE | 3 | Maximum noise rejection. Best for loud environments. |
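When the preset comes from configuration rather than code, it helps to validate it before constructing the analyzer. The sketch below mirrors the table with a stand-in enum. `VadModeSketch` and `parse_mode` are hypothetical, and whether the package's own VadMode is an IntEnum is not stated on this page.

```python
from enum import IntEnum


class VadModeSketch(IntEnum):
    """Stand-in mirroring the preset names and values in the table above."""
    VERY_PERMISSIVE = 0
    PERMISSIVE = 1
    AGGRESSIVE = 2
    VERY_AGGRESSIVE = 3


def parse_mode(raw: str) -> VadModeSketch:
    """Accept either a preset name ("aggressive") or its number ("2")."""
    text = raw.strip()
    if text.isdigit():
        return VadModeSketch(int(text))  # raises ValueError outside 0-3
    try:
        return VadModeSketch[text.upper().replace("-", "_")]
    except KeyError:
        raise ValueError(f"unknown VAD mode: {raw!r}") from None
```

The integer value of the parsed member can then be mapped to the corresponding VadMode preset.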
Usage
Pass the analyzer to a transport via vad_analyzer, the same way you would use
SileroVADAnalyzer:
Call vad.reset() between sessions (for example in an on_participant_left handler) so
one caller’s audio context does not bleed into the next.
Compatibility
Requires pipecat-ai >= 0.0.90. Check the source repository for the latest
tested version and changelog.