Groq (Whisper)
Speech-to-text service implementation using Groq’s Whisper API
Overview
`GroqSTTService` provides speech-to-text capabilities using Groq’s hosted Whisper API. It offers high-accuracy transcription with minimal setup requirements. The service uses Voice Activity Detection (VAD) to process only speech segments, optimizing API usage and improving response time.
Installation
To use `GroqSTTService`, install the required dependencies:
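A typical install uses the Groq extra of the Pipecat package (the exact extra name is an assumption; check your package documentation if it differs):

```bash
pip install "pipecat-ai[groq]"
```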
You’ll need to set your Groq API key in the `GROQ_API_KEY` environment variable.
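For example (the key value is a placeholder):

```bash
export GROQ_API_KEY=your_groq_api_key
```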
You can obtain a Groq API key from the Groq Console.
Configuration
Constructor Parameters
- Whisper model to use. Currently only “whisper-large-v3-turbo” is available.
- Groq API key. If not provided, the `GROQ_API_KEY` environment variable is used.
- Custom base URL for Groq API requests.
- Language of the audio input. Defaults to English.
- Optional text to guide the model’s style or continue a previous segment.
- Sampling temperature between 0 and 1; lower values are more deterministic, higher values more creative. Defaults to 0.0.
- Audio sample rate in Hz. If not provided, the pipeline’s sample rate is used.
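A minimal construction sketch is shown below. The import path and keyword names (`model`, `api_key`, `base_url`, `language`, `prompt`, `temperature`) follow the descriptions above and common Pipecat conventions, but may differ slightly between versions:

```python
import os

from pipecat.services.groq import GroqSTTService  # import path may vary by version
from pipecat.transcriptions.language import Language

stt = GroqSTTService(
    model="whisper-large-v3-turbo",     # only available Whisper model
    api_key=os.getenv("GROQ_API_KEY"),  # falls back to the env var if omitted
    language=Language.EN,               # defaults to English
    prompt="Names: Pipecat, Groq.",     # optional style/continuation hint
    temperature=0.0,                    # deterministic output
)
```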
Input
The service processes audio data with the following requirements:
- PCM audio format
- 16-bit depth
- Single channel (mono)
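If you are producing audio frames yourself rather than relying on a transport, a frame along these lines meets the requirements (the `InputAudioRawFrame` name and fields are assumptions based on Pipecat’s raw-audio frame):

```python
from pipecat.frames.frames import InputAudioRawFrame

pcm_bytes = b"\x00\x00" * 16000  # one second of silence: 16-bit mono at 16 kHz

frame = InputAudioRawFrame(
    audio=pcm_bytes,   # raw little-endian 16-bit PCM
    sample_rate=16000,
    num_channels=1,    # mono
)
```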
Output Frames
The service produces two types of frames during transcription:
`TranscriptionFrame`
Generated for final transcriptions, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp
- Detected language (if available)
`ErrorFrame`
Generated when transcription errors occur, containing error details.
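As a sketch of consuming these frames downstream (assuming Pipecat’s standard `FrameProcessor` API and frame fields):

```python
from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Logs final transcriptions and reports STT errors."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")

        # Pass every frame along so the rest of the pipeline still sees it.
        await self.push_frame(frame, direction)
```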
Methods
Set Model
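Assuming the base class’s asynchronous `set_model` method, the model can be switched at runtime, although only one Whisper model is currently offered:

```python
await stt.set_model("whisper-large-v3-turbo")
```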
See the STT base class methods for additional functionality.
Language Support
Groq’s Whisper API supports a wide range of languages. The service automatically maps `Language` enum values to the appropriate Whisper language codes.
Language Code | Description | Whisper Code |
---|---|---|
Language.AF | Afrikaans | af |
Language.AR | Arabic | ar |
Language.HY | Armenian | hy |
Language.AZ | Azerbaijani | az |
Language.BE | Belarusian | be |
Language.BS | Bosnian | bs |
Language.BG | Bulgarian | bg |
Language.CA | Catalan | ca |
Language.ZH | Chinese | zh |
Language.HR | Croatian | hr |
Language.CS | Czech | cs |
Language.DA | Danish | da |
Language.NL | Dutch | nl |
Language.EN | English | en |
Language.ET | Estonian | et |
Language.FI | Finnish | fi |
Language.FR | French | fr |
Language.GL | Galician | gl |
Language.DE | German | de |
Language.EL | Greek | el |
Language.HE | Hebrew | he |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IS | Icelandic | is |
Language.ID | Indonesian | id |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KN | Kannada | kn |
Language.KK | Kazakh | kk |
Language.KO | Korean | ko |
Language.LV | Latvian | lv |
Language.LT | Lithuanian | lt |
Language.MK | Macedonian | mk |
Language.MS | Malay | ms |
Language.MR | Marathi | mr |
Language.MI | Maori | mi |
Language.NE | Nepali | ne |
Language.NO | Norwegian | no |
Language.FA | Persian | fa |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SR | Serbian | sr |
Language.SK | Slovak | sk |
Language.SL | Slovenian | sl |
Language.ES | Spanish | es |
Language.SW | Swahili | sw |
Language.SV | Swedish | sv |
Language.TL | Tagalog | tl |
Language.TA | Tamil | ta |
Language.TH | Thai | th |
Language.TR | Turkish | tr |
Language.UK | Ukrainian | uk |
Language.UR | Urdu | ur |
Language.VI | Vietnamese | vi |
Language.CY | Welsh | cy |
Groq’s Whisper implementation supports language variants (like `en-US`, `fr-CA`) by mapping them to their base language. For example, `Language.EN_US` and `Language.EN_GB` will both map to `en`.
The service will automatically detect the language if none is specified, but specifying the language typically improves transcription accuracy.
For the most up-to-date list of supported languages, refer to the Groq documentation.
Usage Example
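A minimal sketch of wiring the service into a pipeline, assuming a VAD-enabled transport named `transport` (see the next section) and standard Pipecat pipeline classes; import paths may vary by version:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.groq import GroqSTTService
from pipecat.transcriptions.language import Language

stt = GroqSTTService(
    api_key=os.getenv("GROQ_API_KEY"),
    language=Language.EN,
)

# Audio flows from the transport input, through STT, into the rest of the
# pipeline (context aggregation, LLM, TTS, transport output, ...).
pipeline = Pipeline([
    transport.input(),   # emits audio and VAD speaking frames
    stt,                 # emits a TranscriptionFrame for each utterance
    # ... downstream processors go here ...
    transport.output(),
])
```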
Voice Activity Detection Integration
This service inherits from `SegmentedSTTService`, which uses Voice Activity Detection (VAD) to identify speech segments for processing. This approach:
- Processes only actual speech, not silence or background noise
- Maintains a small audio buffer (default 1 second) to capture speech that occurs slightly before VAD detection
- Receives `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` from a VAD component in the pipeline
- Only sends complete utterances to the API when speech has ended
Ensure your transport includes a VAD component (like `SileroVADAnalyzer`) to properly detect speech segments.
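For example, with the Daily transport and Silero VAD (a sketch; class and parameter names follow common Pipecat usage and may differ in your version):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/my-room",  # placeholder room
    token=None,
    bot_name="Groq STT bot",
    params=DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # emits UserStarted/StoppedSpeakingFrame
    ),
)
```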
Metrics Support
The service collects the following metrics:
- Time to First Byte (TTFB)
- Processing duration
- API response time
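These metrics are reported through the pipeline’s metrics frames once metrics collection is enabled on the task; a sketch assuming the standard `PipelineParams` option:

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,  # the pipeline built in the usage example above
    params=PipelineParams(enable_metrics=True),
)
```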
Notes
- Requires valid Groq API key
- Uses Groq’s hosted Whisper model
- Requires VAD component in transport
- Processes complete utterances, not continuous audio
- Handles API rate limiting
- Automatic error handling
- Thread-safe processing
Error Handling
The service handles common API errors including:
- Authentication errors
- Rate limiting
- Invalid audio format
- Network connectivity issues
- API timeouts
Errors are propagated through ErrorFrames with descriptive messages.