Speech-to-text service implementation using locally-downloaded Whisper models
WhisperSTTService
provides offline speech recognition using OpenAI’s Whisper models running locally. Supports multiple model sizes and hardware acceleration options including CPU, CUDA, and Apple Silicon (MLX).
InputAudioRawFrame
- Raw PCM audio data (16-bit, mono)UserStartedSpeakingFrame
- VAD signal to start buffering audioUserStoppedSpeakingFrame
- VAD signal indicating speech segment completionSTTUpdateSettingsFrame
- Runtime transcription configuration updatesSTTMuteFrame
- Mute audio input for transcriptionTranscriptionFrame
- Final transcription results (segmented processing)ErrorFrame
- Model loading or processing errorsService | Hardware | Performance | Memory | Best For |
---|---|---|---|---|
WhisperSTTService | CPU/CUDA | Good | Moderate | General use, GPU acceleration |
WhisperSTTServiceMLX | Apple Silicon | Better | Lower | Mac users, optimized performance |
TINY
: Smallest multilingual model, fastest inferenceBASE
: Basic multilingual model, good speed/quality balanceSMALL
: Small multilingual model, better speed/quality balance than BASEMEDIUM
: Medium-sized multilingual model, better qualityLARGE
: Best quality multilingual model, slower inferenceLARGE_V3_TURBO
: Fast multilingual model, slightly lower quality than LARGEDISTIL_LARGE_V2
: Fast multilingual distilled model.DISTIL_MEDIUM_EN
: Fast English-only distilled model.TINY
: Smallest multilingual model for MLXMEDIUM
: Medium-sized multilingual model for MLXLARGE_V3
: Best quality multilingual model for MLXLARGE_V3_TURBO
: Finetuned, pruned Whisper large-v3, much faster with slightly lower qualityDISTIL_LARGE_V3
: Fast multilingual distilled model for MLXLARGE_V3_TURBO_Q4
: LARGE_V3_TURBO quantized to Q4 for reduced memory usageView All Supported Languages
Language Code | Description | Service Codes |
---|---|---|
Language.AR | Arabic | ar |
Language.AR_AE | Arabic (UAE) | ar |
Language.AR_BH | Arabic (Bahrain) | ar |
Language.AR_DZ | Arabic (Algeria) | ar |
Language.AR_EG | Arabic (Egypt) | ar |
Language.AR_IQ | Arabic (Iraq) | ar |
Language.AR_JO | Arabic (Jordan) | ar |
Language.AR_KW | Arabic (Kuwait) | ar |
Language.AR_LB | Arabic (Lebanon) | ar |
Language.AR_LY | Arabic (Libya) | ar |
Language.AR_MA | Arabic (Morocco) | ar |
Language.AR_OM | Arabic (Oman) | ar |
Language.AR_QA | Arabic (Qatar) | ar |
Language.AR_SA | Arabic (Saudi Arabia) | ar |
Language.AR_SY | Arabic (Syria) | ar |
Language.AR_TN | Arabic (Tunisia) | ar |
Language.AR_YE | Arabic (Yemen) | ar |
Language.BN | Bengali | bn |
Language.BN_BD | Bengali (Bangladesh) | bn |
Language.BN_IN | Bengali (India) | bn |
Language.CS | Czech | cs |
Language.CS_CZ | Czech (Czech Republic) | cs |
Language.DA | Danish | da |
Language.DA_DK | Danish (Denmark) | da |
Language.DE | German | de |
Language.DE_AT | German (Austria) | de |
Language.DE_CH | German (Switzerland) | de |
Language.DE_DE | German (Germany) | de |
Language.EL | Greek | el |
Language.EL_GR | Greek (Greece) | el |
Language.EN | English | en |
Language.EN_AU | English (Australia) | en |
Language.EN_CA | English (Canada) | en |
Language.EN_GB | English (UK) | en |
Language.EN_HK | English (Hong Kong) | en |
Language.EN_IE | English (Ireland) | en |
Language.EN_IN | English (India) | en |
Language.EN_KE | English (Kenya) | en |
Language.EN_NG | English (Nigeria) | en |
Language.EN_NZ | English (New Zealand) | en |
Language.EN_PH | English (Philippines) | en |
Language.EN_SG | English (Singapore) | en |
Language.EN_TZ | English (Tanzania) | en |
Language.EN_US | English (US) | en |
Language.EN_ZA | English (South Africa) | en |
Language.ES | Spanish | es |
Language.ES_AR | Spanish (Argentina) | es |
Language.ES_BO | Spanish (Bolivia) | es |
Language.ES_CL | Spanish (Chile) | es |
Language.ES_CO | Spanish (Colombia) | es |
Language.ES_CR | Spanish (Costa Rica) | es |
Language.ES_CU | Spanish (Cuba) | es |
Language.ES_DO | Spanish (Dominican Republic) | es |
Language.ES_EC | Spanish (Ecuador) | es |
Language.ES_ES | Spanish (Spain) | es |
Language.ES_GQ | Spanish (Equatorial Guinea) | es |
Language.ES_GT | Spanish (Guatemala) | es |
Language.ES_HN | Spanish (Honduras) | es |
Language.ES_MX | Spanish (Mexico) | es |
Language.ES_NI | Spanish (Nicaragua) | es |
Language.ES_PA | Spanish (Panama) | es |
Language.ES_PE | Spanish (Peru) | es |
Language.ES_PR | Spanish (Puerto Rico) | es |
Language.ES_PY | Spanish (Paraguay) | es |
Language.ES_SV | Spanish (El Salvador) | es |
Language.ES_US | Spanish (US) | es |
Language.ES_UY | Spanish (Uruguay) | es |
Language.ES_VE | Spanish (Venezuela) | es |
Language.FA | Persian | fa |
Language.FA_IR | Persian (Iran) | fa |
Language.FI | Finnish | fi |
Language.FI_FI | Finnish (Finland) | fi |
Language.FR | French | fr |
Language.FR_BE | French (Belgium) | fr |
Language.FR_CA | French (Canada) | fr |
Language.FR_CH | French (Switzerland) | fr |
Language.FR_FR | French (France) | fr |
Language.HI | Hindi | hi |
Language.HI_IN | Hindi (India) | hi |
Language.HU | Hungarian | hu |
Language.HU_HU | Hungarian (Hungary) | hu |
Language.ID | Indonesian | id |
Language.ID_ID | Indonesian (Indonesia) | id |
Language.IT | Italian | it |
Language.IT_IT | Italian (Italy) | it |
Language.JA | Japanese | ja |
Language.JA_JP | Japanese (Japan) | ja |
Language.KO | Korean | ko |
Language.KO_KR | Korean (Korea) | ko |
Language.NL | Dutch | nl |
Language.NL_BE | Dutch (Belgium) | nl |
Language.NL_NL | Dutch (Netherlands) | nl |
Language.PL | Polish | pl |
Language.PL_PL | Polish (Poland) | pl |
Language.PT | Portuguese | pt |
Language.PT_BR | Portuguese (Brazil) | pt |
Language.PT_PT | Portuguese (Portugal) | pt |
Language.RO | Romanian | ro |
Language.RO_RO | Romanian (Romania) | ro |
Language.RU | Russian | ru |
Language.RU_RU | Russian (Russia) | ru |
Language.SK | Slovak | sk |
Language.SK_SK | Slovak (Slovakia) | sk |
Language.SV | Swedish | sv |
Language.SV_SE | Swedish (Sweden) | sv |
Language.TH | Thai | th |
Language.TH_TH | Thai (Thailand) | th |
Language.TR | Turkish | tr |
Language.TR_TR | Turkish (Turkey) | tr |
Language.UK | Ukrainian | uk |
Language.UK_UA | Ukrainian (Ukraine) | uk |
Language.UR | Urdu | ur |
Language.UR_IN | Urdu (India) | ur |
Language.UR_PK | Urdu (Pakistan) | ur |
Language.VI | Vietnamese | vi |
Language.VI_VN | Vietnamese (Vietnam) | vi |
Language.ZH | Chinese | zh |
Language.ZH_CN | Chinese (China) | zh |
Language.ZH_HK | Chinese (Hong Kong) | zh |
Language.ZH_TW | Chinese (Taiwan) | zh |
Language.EN
- English - en
Language.ES
- Spanish - es
Language.FR
- French - fr
Language.DE
- German - de
Language.IT
- Italian - it
Language.JA
- Japanese - ja
Language.KO
- Korean - ko
Language.ZH
- Chinese - zh
Language.PT
- Portuguese - pt
Language.RU
- Russian - ru
Language.AR
- Arabic - ar
Language.HI
- Hindi - hi
WhisperSTTService
and use it in a pipeline:
STTUpdateSettingsFrame
for the WhisperSTTService
:
no_speech_prob
threshold filters out non-speech audio segments