Whisper
Speech-to-text service implementation using locally-downloaded Whisper models
Overview
WhisperSTTService
provides offline speech recognition using OpenAI’s Whisper models running locally. Supports multiple model sizes and hardware acceleration options including CPU, CUDA, and Apple Silicon (MLX).
API Reference
Complete API documentation and method details
Whisper Github
OpenAI’s Whisper research paper and model details
Faster Whisper Example
Working example with standard Whisper
MLX Whisper Example
Working example with Apple Silicon optimization
Installation
Choose your installation based on your hardware:
Standard Whisper (CPU/CUDA)
MLX Whisper (Apple Silicon)
First run will download the selected model from Hugging Face. Model sizes range from 39MB to 1.5GB.
Frames
Input
InputAudioRawFrame
- Raw PCM audio data (16-bit, mono)UserStartedSpeakingFrame
- VAD signal to start buffering audioUserStoppedSpeakingFrame
- VAD signal indicating speech segment completionSTTUpdateSettingsFrame
- Runtime transcription configuration updatesSTTMuteFrame
- Mute audio input for transcription
Output
TranscriptionFrame
- Final transcription results (segmented processing)ErrorFrame
- Model loading or processing errors
Service Comparison
Service | Hardware | Performance | Memory | Best For |
---|---|---|---|---|
WhisperSTTService | CPU/CUDA | Good | Moderate | General use, GPU acceleration |
WhisperSTTServiceMLX | Apple Silicon | Better | Lower | Mac users, optimized performance |
Model Selection
Standard Whisper Models
TINY
: Smallest multilingual model, fastest inferenceBASE
: Basic multilingual model, good speed/quality balanceSMALL
: Small multilingual model, better speed/quality balance than BASEMEDIUM
: Medium-sized multilingual model, better qualityLARGE
: Best quality multilingual model, slower inferenceLARGE_V3_TURBO
: Fast multilingual model, slightly lower quality than LARGEDISTIL_LARGE_V2
: Fast multilingual distilled model.DISTIL_MEDIUM_EN
: Fast English-only distilled model.
MLX Whisper Models (Apple Silicon)
TINY
: Smallest multilingual model for MLXMEDIUM
: Medium-sized multilingual model for MLXLARGE_V3
: Best quality multilingual model for MLXLARGE_V3_TURBO
: Finetuned, pruned Whisper large-v3, much faster with slightly lower qualityDISTIL_LARGE_V3
: Fast multilingual distilled model for MLXLARGE_V3_TURBO_Q4
: LARGE_V3_TURBO quantized to Q4 for reduced memory usage
Language Support
Whisper supports a number of languages with automatic detection and regional variants:
View All Supported Languages
View All Supported Languages
Language Code | Description | Service Codes |
---|---|---|
Language.AR | Arabic | ar |
Language.AR_AE | Arabic (UAE) | ar |
Language.AR_BH | Arabic (Bahrain) | ar |
Language.AR_DZ | Arabic (Algeria) | ar |
Language.AR_EG | Arabic (Egypt) | ar |
Language.AR_IQ | Arabic (Iraq) | ar |
Language.AR_JO | Arabic (Jordan) | ar |
Language.AR_KW | Arabic (Kuwait) | ar |
Language.AR_LB | Arabic (Lebanon) | ar |
Language.AR_LY | Arabic (Libya) | ar |
Language.AR_MA | Arabic (Morocco) | ar |
Language.AR_OM | Arabic (Oman) | ar |
Language.AR_QA | Arabic (Qatar) | ar |
Language.AR_SA | Arabic (Saudi Arabia) | ar |
Language.AR_SY | Arabic (Syria) | ar |
Language.AR_TN | Arabic (Tunisia) | ar |
Language.AR_YE | Arabic (Yemen) | ar |
Language.BN | Bengali | bn |
Language.BN_BD | Bengali (Bangladesh) | bn |
Language.BN_IN | Bengali (India) | bn |
Language.CS | Czech | cs |
Language.CS_CZ | Czech (Czech Republic) | cs |
Language.DA | Danish | da |
Language.DA_DK | Danish (Denmark) | da |
Language.DE | German | de |
Language.DE_AT | German (Austria) | de |
Language.DE_CH | German (Switzerland) | de |
Language.DE_DE | German (Germany) | de |
Language.EL | Greek | el |
Language.EL_GR | Greek (Greece) | el |
Language.EN | English | en |
Language.EN_AU | English (Australia) | en |
Language.EN_CA | English (Canada) | en |
Language.EN_GB | English (UK) | en |
Language.EN_HK | English (Hong Kong) | en |
Language.EN_IE | English (Ireland) | en |
Language.EN_IN | English (India) | en |
Language.EN_KE | English (Kenya) | en |
Language.EN_NG | English (Nigeria) | en |
Language.EN_NZ | English (New Zealand) | en |
Language.EN_PH | English (Philippines) | en |
Language.EN_SG | English (Singapore) | en |
Language.EN_TZ | English (Tanzania) | en |
Language.EN_US | English (US) | en |
Language.EN_ZA | English (South Africa) | en |
Language.ES | Spanish | es |
Language.ES_AR | Spanish (Argentina) | es |
Language.ES_BO | Spanish (Bolivia) | es |
Language.ES_CL | Spanish (Chile) | es |
Language.ES_CO | Spanish (Colombia) | es |
Language.ES_CR | Spanish (Costa Rica) | es |
Language.ES_CU | Spanish (Cuba) | es |
Language.ES_DO | Spanish (Dominican Republic) | es |
Language.ES_EC | Spanish (Ecuador) | es |
Language.ES_ES | Spanish (Spain) | es |
Language.ES_GQ | Spanish (Equatorial Guinea) | es |
Language.ES_GT | Spanish (Guatemala) | es |
Language.ES_HN | Spanish (Honduras) | es |
Language.ES_MX | Spanish (Mexico) | es |
Language.ES_NI | Spanish (Nicaragua) | es |
Language.ES_PA | Spanish (Panama) | es |
Language.ES_PE | Spanish (Peru) | es |
Language.ES_PR | Spanish (Puerto Rico) | es |
Language.ES_PY | Spanish (Paraguay) | es |
Language.ES_SV | Spanish (El Salvador) | es |
Language.ES_US | Spanish (US) | es |
Language.ES_UY | Spanish (Uruguay) | es |
Language.ES_VE | Spanish (Venezuela) | es |
Language.FA | Persian | fa |
Language.FA_IR | Persian (Iran) | fa |
Language.FI | Finnish | fi |
Language.FI_FI | Finnish (Finland) | fi |
Language.FR | French | fr |
Language.FR_BE | French (Belgium) | fr |
Language.FR_CA | French (Canada) | fr |
Language.FR_CH | French (Switzerland) | fr |
Language.FR_FR | French (France) | fr |
Language.HI | Hindi | hi |
Language.HI_IN | Hindi (India) | hi |
Language.HU | Hungarian | hu |
Language.HU_HU | Hungarian (Hungary) | hu |
Language.ID | Indonesian | id |
Language.ID_ID | Indonesian (Indonesia) | id |
Language.IT | Italian | it |
Language.IT_IT | Italian (Italy) | it |
Language.JA | Japanese | ja |
Language.JA_JP | Japanese (Japan) | ja |
Language.KO | Korean | ko |
Language.KO_KR | Korean (Korea) | ko |
Language.NL | Dutch | nl |
Language.NL_BE | Dutch (Belgium) | nl |
Language.NL_NL | Dutch (Netherlands) | nl |
Language.PL | Polish | pl |
Language.PL_PL | Polish (Poland) | pl |
Language.PT | Portuguese | pt |
Language.PT_BR | Portuguese (Brazil) | pt |
Language.PT_PT | Portuguese (Portugal) | pt |
Language.RO | Romanian | ro |
Language.RO_RO | Romanian (Romania) | ro |
Language.RU | Russian | ru |
Language.RU_RU | Russian (Russia) | ru |
Language.SK | Slovak | sk |
Language.SK_SK | Slovak (Slovakia) | sk |
Language.SV | Swedish | sv |
Language.SV_SE | Swedish (Sweden) | sv |
Language.TH | Thai | th |
Language.TH_TH | Thai (Thailand) | th |
Language.TR | Turkish | tr |
Language.TR_TR | Turkish (Turkey) | tr |
Language.UK | Ukrainian | uk |
Language.UK_UA | Ukrainian (Ukraine) | uk |
Language.UR | Urdu | ur |
Language.UR_IN | Urdu (India) | ur |
Language.UR_PK | Urdu (Pakistan) | ur |
Language.VI | Vietnamese | vi |
Language.VI_VN | Vietnamese (Vietnam) | vi |
Language.ZH | Chinese | zh |
Language.ZH_CN | Chinese (China) | zh |
Language.ZH_HK | Chinese (Hong Kong) | zh |
Language.ZH_TW | Chinese (Taiwan) | zh |
Common languages:
Language.EN
- English -en
Language.ES
- Spanish -es
Language.FR
- French -fr
Language.DE
- German -de
Language.IT
- Italian -it
Language.JA
- Japanese -ja
Language.KO
- Korean -ko
Language.ZH
- Chinese -zh
Language.PT
- Portuguese -pt
Language.RU
- Russian -ru
Language.AR
- Arabic -ar
Language.HI
- Hindi -hi
Whisper can automatically detect language or you can specify it for better performance. All regional variants map to the same base language code.
Usage Example
Basic Configuration
GPU Acceleration
Apple Silicon Optimization
Metrics
Both services provide comprehensive metrics:
- Time to First Byte (TTFB) - Latency from audio input to first transcription
- Processing Duration - Total time spent processing audio segments
Additional Notes
- Segmented Processing: Both services use VAD to process speech in segments rather than continuously
- Offline Operation: Runs completely offline after initial model download
- Speech Filtering:
no_speech_prob
threshold filters out non-speech audio segments - Automatic Normalization: Audio is automatically normalized to float32 [-1.0, 1.0] range