SambaNova (Whisper)
Speech-to-text service implementation using SambaNova’s Whisper API
Overview
SambaNovaSTTService provides speech-to-text capabilities using SambaNova’s hosted Whisper API. It offers high-accuracy transcription with minimal setup requirements.
The service uses Voice Activity Detection (VAD) to process only speech segments, optimizing API usage and improving response time.
Installation
To use SambaNovaSTTService, install the required dependencies:
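A typical install looks like the following (the extras name is assumed from Pipecat’s usual packaging convention; check the docs for your installed version):

```shell
# Install Pipecat with the SambaNova extra (extras name assumed)
pip install "pipecat-ai[sambanova]"
```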
Set your SambaNova API key as the SAMBANOVA_API_KEY environment variable.
Get your SambaNova API key here.
Configuration
Constructor Parameters
- model: Whisper model to use. Currently only “Whisper-Large-v3” is available.
- api_key: Your SambaNova API key. If not provided, the SAMBANOVA_API_KEY environment variable is used.
- base_url: Custom API base URL for SambaNova API requests.
- language: Language of the audio input. Defaults to English.
- prompt: Optional text to guide the model’s style or continue a previous segment.
- temperature: Sampling temperature between 0 and 1. Lower values are more deterministic, higher values more creative. Defaults to 0.0.
- sample_rate: Audio sample rate in Hz. If not provided, uses the pipeline’s sample rate.
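A minimal construction sketch follows; the import path and keyword names are assumed from Pipecat’s conventions, so verify them against your installed version:

```python
import os

# Import path assumed from Pipecat's per-vendor service layout
from pipecat.services.sambanova.stt import SambaNovaSTTService
from pipecat.transcriptions.language import Language

stt = SambaNovaSTTService(
    model="Whisper-Large-v3",                # only model currently available
    api_key=os.getenv("SAMBANOVA_API_KEY"),  # falls back to the env var if omitted
    language=Language.EN,                    # optional; specifying it improves accuracy
    prompt="Transcribe a technical support call.",  # optional style hint
    temperature=0.0,                         # deterministic output
)
```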
Input
The service processes audio data with the following requirements:
- PCM audio format.
- 16-bit depth.
- Single channel (mono).
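For illustration, audio matching these requirements (raw 16-bit mono PCM) can be synthesized with the standard library alone. The tone generator below is a hypothetical helper for testing, not part of the service:

```python
import math
import struct

def sine_pcm_16bit_mono(duration_s: float = 1.0, sample_rate: int = 16000,
                        freq_hz: float = 440.0) -> bytes:
    """Generate a sine tone as raw little-endian 16-bit mono PCM bytes."""
    n_samples = int(duration_s * sample_rate)
    frames = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in frames)

pcm = sine_pcm_16bit_mono(duration_s=0.5)
# 16-bit depth -> 2 bytes per sample; mono -> one channel
assert len(pcm) == 2 * int(0.5 * 16000)
```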
Output Frames
The service produces two types of frames during transcription:
TranscriptionFrame
Generated for final transcriptions, containing:
- Transcribed text.
- User identifier.
- ISO 8601 formatted timestamp.
- Detected language (if available).
ErrorFrame
Generated when transcription errors occur, containing error details.
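As a sketch, a downstream processor could log both frame types. The class and attribute names follow Pipecat conventions but are assumptions; check your installed version:

```python
from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    """Log final transcriptions and STT errors as they pass through."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")
        # Always forward the frame so the rest of the pipeline still sees it
        await self.push_frame(frame, direction)
```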
Methods
Set Model
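The model can be switched at runtime through the STT base class’s set_model method (a sketch; the method name is assumed from Pipecat’s STT base class, and only “Whisper-Large-v3” is currently accepted):

```python
# Assumes `stt` is an existing SambaNovaSTTService instance,
# called from within an async context
await stt.set_model("Whisper-Large-v3")
```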
See the STT base class methods for additional functionality.
Language Support
SambaNova’s Whisper API supports a wide range of languages.
The service automatically maps Language enum values to the appropriate Whisper language codes.
Language Enum | Description | Whisper Code |
---|---|---|
Language.AF | Afrikaans | af |
Language.AR | Arabic | ar |
Language.HY | Armenian | hy |
Language.AZ | Azerbaijani | az |
Language.BE | Belarusian | be |
Language.BS | Bosnian | bs |
Language.BG | Bulgarian | bg |
Language.CA | Catalan | ca |
Language.ZH | Chinese | zh |
Language.HR | Croatian | hr |
Language.CS | Czech | cs |
Language.DA | Danish | da |
Language.NL | Dutch | nl |
Language.EN | English | en |
Language.ET | Estonian | et |
Language.FI | Finnish | fi |
Language.FR | French | fr |
Language.GL | Galician | gl |
Language.DE | German | de |
Language.EL | Greek | el |
Language.HE | Hebrew | he |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IS | Icelandic | is |
Language.ID | Indonesian | id |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KN | Kannada | kn |
Language.KK | Kazakh | kk |
Language.KO | Korean | ko |
Language.LV | Latvian | lv |
Language.LT | Lithuanian | lt |
Language.MK | Macedonian | mk |
Language.MS | Malay | ms |
Language.MR | Marathi | mr |
Language.MI | Maori | mi |
Language.NE | Nepali | ne |
Language.NO | Norwegian | no |
Language.FA | Persian | fa |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SR | Serbian | sr |
Language.SK | Slovak | sk |
Language.SL | Slovenian | sl |
Language.ES | Spanish | es |
Language.SW | Swahili | sw |
Language.SV | Swedish | sv |
Language.TL | Tagalog | tl |
Language.TA | Tamil | ta |
Language.TH | Thai | th |
Language.TR | Turkish | tr |
Language.UK | Ukrainian | uk |
Language.UR | Urdu | ur |
Language.VI | Vietnamese | vi |
Language.CY | Welsh | cy |
SambaNova’s Whisper implementation supports language variants (like en-US, fr-CA) by mapping them to their base language. For example, Language.EN_US and Language.EN_GB will both map to en.
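The variant-to-base mapping amounts to stripping the region tag from the code. A standalone sketch (a hypothetical helper, not Pipecat’s internal function):

```python
def to_whisper_code(language_code: str) -> str:
    """Map a BCP-47-style code like 'en-US' to its Whisper base code 'en'."""
    return language_code.split("-")[0].lower()

print(to_whisper_code("en-US"))  # en
print(to_whisper_code("fr-CA"))  # fr
print(to_whisper_code("zh"))     # zh
```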
The service will automatically detect the language if none is specified, but specifying the language typically improves transcription accuracy.
For the most up-to-date list of supported languages, refer to SambaNova’s docs.
Usage Example
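A minimal pipeline sketch is shown below. The import paths, transport setup, and parameter names are assumed from common Pipecat examples; adjust them to your setup:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.services.sambanova.stt import SambaNovaSTTService
from pipecat.transcriptions.language import Language
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Transport with VAD enabled, required for segmented STT
transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder URL
    token=None,
    bot_name="transcriber",
    params=DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

stt = SambaNovaSTTService(
    api_key=os.getenv("SAMBANOVA_API_KEY"),
    language=Language.EN,
)

# Audio flows from the transport into the STT service,
# which emits TranscriptionFrame objects downstream.
pipeline = Pipeline([transport.input(), stt])
task = PipelineTask(pipeline)
```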
Voice Activity Detection Integration
This service inherits from SegmentedSTTService, which uses Voice Activity Detection (VAD) to identify speech segments for processing.
This approach:
- Processes only actual speech, not silence or background noise.
- Maintains a small audio buffer (default 1 second) to capture speech that occurs slightly before VAD detection.
- Receives UserStartedSpeakingFrame and UserStoppedSpeakingFrame from a VAD component in the pipeline.
- Only sends complete utterances to the API when speech has ended.
Ensure your transport includes a VAD component (like SileroVADAnalyzer) to properly detect speech segments.
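The one-second pre-speech buffer described above can be sketched as a rolling byte buffer (illustrative only, not Pipecat’s actual implementation):

```python
from collections import deque

class PreSpeechBuffer:
    """Keep roughly the most recent `max_seconds` of 16-bit mono PCM audio."""

    def __init__(self, max_seconds: float = 1.0, sample_rate: int = 16000):
        self._max_bytes = int(max_seconds * sample_rate) * 2  # 2 bytes per sample
        self._chunks: deque = deque()
        self._size = 0

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)
        self._size += len(chunk)
        # Evict oldest chunks once the buffer exceeds its capacity
        while self._size > self._max_bytes and self._chunks:
            self._size -= len(self._chunks.popleft())

    def drain(self) -> bytes:
        """Return buffered audio (prepended to the utterance) and reset."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        self._size = 0
        return data

buf = PreSpeechBuffer(max_seconds=1.0, sample_rate=4)  # tiny 8-byte buffer for demo
for chunk in (b"1234", b"5678", b"abcd"):
    buf.append(chunk)
assert buf.drain() == b"5678abcd"  # oldest chunk was evicted
```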
Metrics Support
The service collects the following metrics:
- Time to First Byte (TTFB).
- Processing duration.
- API response time.
Notes
- Requires a valid SambaNova API key.
- Uses SambaNova’s hosted Whisper model.
- Requires VAD component in transport.
- Processes complete utterances, not continuous audio.
- Handles API rate limiting.
- Automatic error handling.
- Thread-safe processing.
Error Handling
The service handles common API errors including:
- Authentication errors.
- Rate limiting.
- Invalid audio format.
- Network connectivity issues.
- API timeouts.
Errors are propagated through ErrorFrame instances with descriptive messages.