Whisper
Speech-to-text service implementation using locally-downloaded Whisper models
Overview
WhisperSTTService provides speech-to-text capabilities using OpenAI's Whisper models running locally. It supports multiple model sizes and configurations for offline transcription.
Installation
To use WhisperSTTService, install the required dependencies:
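The exact dependency set depends on how the service is packaged; the extra name below assumes the pipecat-ai distribution of this service, so adjust it to your environment:

```bash
# Assumption: the package exposes a "whisper" extra that pulls in the local
# Whisper model runtime.
pip install "pipecat-ai[whisper]"
```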
Configuration
Constructor Parameters
The constructor accepts the following parameters (a hedged construction sketch follows this list):

- Whisper model to use; can be a string or a Model enum value
- Device to run the model on ("cpu", "cuda", or "auto")
- Computation type for model inference
- Threshold for filtering out non-speech segments
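The import path and the parameter names in this sketch (model, device, compute_type, no_speech_prob) are assumptions inferred from the descriptions above, so verify them against the signature in your installed version:

```python
# Hedged sketch: import path and keyword names are assumptions.
from pipecat.services.whisper import Model, WhisperSTTService

stt = WhisperSTTService(
    model=Model.DISTIL_MEDIUM_EN,  # Whisper model: string or Model enum value
    device="auto",                 # "cpu", "cuda", or "auto"
    compute_type="default",        # computation type for model inference
    no_speech_prob=0.4,            # threshold for filtering non-speech segments
)
```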
Available Models
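The models listed in the Model Selection Guide below map onto an enum roughly like the following sketch; the checkpoint identifier strings are assumptions and may differ between versions:

```python
from enum import Enum

# Reconstructed from the selection guide below; string values are assumptions.
class Model(Enum):
    TINY = "tiny"
    BASE = "base"
    MEDIUM = "medium"
    LARGE = "large-v3"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"
    DISTIL_LARGE_V2 = "Systran/faster-distil-whisper-large-v2"
```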
Input
The service processes raw audio data with the following requirements (a conversion sketch follows the list):
- PCM audio format
- 16-bit depth
- Single channel (mono)
- Normalized to float32 range [-1.0, 1.0]
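If your capture path produces raw 16-bit PCM, the usual conversion to normalized float32 looks like the following NumPy sketch; whether you need to perform it yourself depends on the surrounding pipeline:

```python
import numpy as np

def pcm16_to_float32(pcm_bytes: bytes) -> np.ndarray:
    """Convert 16-bit mono PCM bytes to float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0
```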
Output Frames
TranscriptionFrame
Generated for transcriptions, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp

ErrorFrame
Generated when transcription errors occur, containing error details. A sketch of handling both frame types downstream follows.
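Downstream processors can branch on the frame type. The import paths, field names, and FrameProcessor API in this sketch are assumptions about the host framework:

```python
# Paths, field names, and the base-class API are assumptions; adjust to your
# framework version.
from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Logs transcriptions and errors as they pass through the pipeline."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"Transcription error: {frame.error}")
        await self.push_frame(frame, direction)
```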
Usage Example
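A minimal sketch follows, assuming the pipecat-ai packaging of this service and a run_stt entry point inherited from the STT base class; in a real application the service usually sits in a pipeline and is fed audio frames by a transport:

```python
import asyncio
import wave

# Import paths are assumptions; adjust to your installation.
from pipecat.frames.frames import TranscriptionFrame
from pipecat.services.whisper import Model, WhisperSTTService


async def main():
    # The first run downloads the selected model; later runs are fully offline.
    stt = WhisperSTTService(model=Model.DISTIL_MEDIUM_EN, device="auto")

    # speech.wav is a hypothetical 16 kHz, 16-bit mono recording.
    with wave.open("speech.wav", "rb") as wav:
        audio = wav.readframes(wav.getnframes())

    # run_stt is assumed to yield TranscriptionFrame/ErrorFrame objects.
    async for frame in stt.run_stt(audio):
        if isinstance(frame, TranscriptionFrame):
            print(frame.text)


asyncio.run(main())
```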
Methods
See the STT base class methods for additional functionality.
Model Selection Guide
| Model | Size | Speed | Accuracy | Memory Usage |
|---|---|---|---|---|
| TINY | 39M | Fastest | Basic | Minimal |
| BASE | 74M | Fast | Good | Low |
| MEDIUM | 769M | Medium | Better | Moderate |
| LARGE | 1.5GB | Slow | Best | High |
| DISTIL_MEDIUM_EN | ~400M | Fast | Good (English) | Moderate |
| DISTIL_LARGE_V2 | ~750M | Medium | Better | Moderate |
Frame Flow
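In outline, raw audio frames flow into the service, which emits TranscriptionFrame objects for recognized speech and ErrorFrame objects when transcription fails.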
Metrics Support
The service collects the following processing metrics (a sketch for enabling them follows the list):
- Time to First Byte (TTFB)
- Processing duration
- Model loading time
- Inference time
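How these metrics are enabled and surfaced depends on the pipeline runner; the sketch below assumes a PipelineTask with an enable_metrics flag:

```python
# Import paths and the enable_metrics flag are assumptions about the host
# framework's metrics support.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.whisper import Model, WhisperSTTService

stt = WhisperSTTService(model=Model.BASE)
pipeline = Pipeline([stt])

# enable_metrics is assumed to turn on TTFB and processing-duration reporting
# for the services in the pipeline, including this one.
task = PipelineTask(pipeline, params=PipelineParams(enable_metrics=True))
```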
Notes
- Runs completely offline after model download
- First run requires model download
- Supports CPU and CUDA acceleration
- Processes audio in segments
- Filters out non-speech segments
- Thread-safe processing
- Automatic error handling