OpenAI
Speech-to-text service implementation using OpenAI’s Speech-to-Text APIs
Overview
OpenAISTTService provides speech-to-text capabilities using OpenAI’s latest models, including the GPT-4o transcription model and the hosted Whisper API. It offers high-accuracy transcription with minimal setup, using Voice Activity Detection (VAD) to process only speech segments.
Installation
To use OpenAISTTService, install the required dependencies:
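A typical install uses Pipecat’s OpenAI extra (the extra name below is the commonly documented one; adjust if your version differs):

```bash
pip install "pipecat-ai[openai]"
```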
You’ll need to set your OpenAI API key as the OPENAI_API_KEY environment variable.
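For example, in a shell (the key value shown is a placeholder):

```bash
export OPENAI_API_KEY=your-api-key
```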
You can obtain an OpenAI API key from the OpenAI platform.
Configuration
Constructor Parameters
Parameter | Description |
---|---|
model | Model to use. Supported models include “gpt-4o-transcribe” (recommended) and “whisper-1”. |
api_key | Your OpenAI API key. |
base_url | Custom API base URL for OpenAI API requests. |
language | Language of the audio input. Defaults to English. |
prompt | Optional text to guide the model’s style or continue a previous segment. |
temperature | Sampling temperature between 0 and 1. Lower values are more deterministic; higher values are more creative. Defaults to 0.0. |
sample_rate | Audio sample rate in Hz. If not provided, uses the pipeline’s sample rate. |
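A minimal construction sketch using the parameters above (the import paths and the prompt text are assumptions based on recent Pipecat releases; verify against your installed version):

```python
import os

from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language

stt = OpenAISTTService(
    model="gpt-4o-transcribe",            # or "whisper-1"
    api_key=os.getenv("OPENAI_API_KEY"),
    language=Language.EN,                 # defaults to English if omitted
    prompt="Expect product names like Pipecat.",  # hypothetical guidance text
    temperature=0.0,                      # deterministic output
)
```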
Input
The service processes audio data with the following requirements:
- PCM audio format
- 16-bit depth
- Single channel (mono)
Output Frames
The service produces two types of frames during transcription:
TranscriptionFrame
Generated for final transcriptions, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp
- Detected language (if available)
ErrorFrame
Generated when transcription errors occur, containing error details.
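As a sketch, a downstream processor that consumes both frame types (the FrameProcessor subclassing pattern and field names follow Pipecat conventions; treat them as assumptions):

```python
from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            # Final transcription: text, user id, ISO 8601 timestamp.
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")
        # Always pass frames through so the rest of the pipeline sees them.
        await self.push_frame(frame, direction)
```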
Methods
Set Model
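A one-line sketch, assuming an existing stt instance and the async set_model method inherited from the STT base class:

```python
# Swap models at runtime; takes effect for subsequent speech segments.
await stt.set_model("whisper-1")
```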
See the STT base class methods for additional functionality.
Models
Model | Description | Best For |
---|---|---|
gpt-4o-transcribe | Latest GPT-4o model fine-tuned for transcription | High accuracy, robustness to accents, better context understanding |
whisper-1 | OpenAI’s Whisper model | Broad language support, good performance on clean audio |
Language Support
OpenAI’s speech-to-text models support a wide range of languages. The service automatically maps Language enum values to the appropriate language codes.
Language Code | Description | Service Code |
---|---|---|
Language.AF | Afrikaans | af |
Language.AR | Arabic | ar |
Language.HY | Armenian | hy |
Language.AZ | Azerbaijani | az |
Language.BE | Belarusian | be |
Language.BS | Bosnian | bs |
Language.BG | Bulgarian | bg |
Language.CA | Catalan | ca |
Language.ZH | Chinese | zh |
Language.HR | Croatian | hr |
Language.CS | Czech | cs |
Language.DA | Danish | da |
Language.NL | Dutch | nl |
Language.EN | English | en |
Language.ET | Estonian | et |
Language.FI | Finnish | fi |
Language.FR | French | fr |
Language.GL | Galician | gl |
Language.DE | German | de |
Language.EL | Greek | el |
Language.HE | Hebrew | he |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IS | Icelandic | is |
Language.ID | Indonesian | id |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KN | Kannada | kn |
Language.KK | Kazakh | kk |
Language.KO | Korean | ko |
Language.LV | Latvian | lv |
Language.LT | Lithuanian | lt |
Language.MK | Macedonian | mk |
Language.MS | Malay | ms |
Language.MR | Marathi | mr |
Language.MI | Maori | mi |
Language.NE | Nepali | ne |
Language.NO | Norwegian | no |
Language.FA | Persian | fa |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RO | Romanian | ro |
Language.RU | Russian | ru |
Language.SR | Serbian | sr |
Language.SK | Slovak | sk |
Language.SL | Slovenian | sl |
Language.ES | Spanish | es |
Language.SW | Swahili | sw |
Language.SV | Swedish | sv |
Language.TL | Tagalog | tl |
Language.TA | Tamil | ta |
Language.TH | Thai | th |
Language.TR | Turkish | tr |
Language.UK | Ukrainian | uk |
Language.UR | Urdu | ur |
Language.VI | Vietnamese | vi |
Language.CY | Welsh | cy |
OpenAI’s models support language variants (like en-US, fr-CA) by mapping them to their base language. For example, Language.EN_US and Language.EN_GB both map to en.
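For illustration, both of these configurations send en to the API (construction details as in the earlier examples):

```python
# Regional variants collapse to their base language code.
stt_us = OpenAISTTService(api_key=api_key, language=Language.EN_US)  # -> "en"
stt_gb = OpenAISTTService(api_key=api_key, language=Language.EN_GB)  # -> "en"
```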
The service will automatically detect the language if none is specified, but specifying the language typically improves transcription accuracy.
Usage Example
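A pipeline sketch placing the service between a VAD-equipped transport and the rest of the pipeline (the DailyTransport setup, parameter names, and room URL are illustrative of common Pipecat configurations, not prescribed by this service):

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    "https://example.daily.co/room",  # placeholder room URL
    None,                             # token, if your room requires one
    "Transcription bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # segments speech for the STT service
    ),
)

stt = OpenAISTTService(
    model="gpt-4o-transcribe",
    api_key=os.getenv("OPENAI_API_KEY"),
    language=Language.EN,
)

pipeline = Pipeline([
    transport.input(),   # audio and VAD events from the transport
    stt,                 # emits TranscriptionFrame per utterance
    transport.output(),
])

task = PipelineTask(pipeline)
```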
Voice Activity Detection Integration
This service inherits from SegmentedSTTService, which uses Voice Activity Detection (VAD) to identify speech segments for processing. This approach:
- Processes only actual speech, not silence or background noise
- Maintains a small audio buffer (default 1 second) to capture speech that occurs slightly before VAD detection
- Receives UserStartedSpeakingFrame and UserStoppedSpeakingFrame from a VAD component in the pipeline
- Only sends complete utterances to the API when speech has ended
Ensure your transport includes a VAD component (like SileroVADAnalyzer) to properly detect speech segments.
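As a sketch, a Silero analyzer with a tuned silence threshold (VADParams and its stop_secs field follow Pipecat’s VAD interface; treat the exact names as assumptions):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# Wait half a second of silence before declaring the utterance finished.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.5))
```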
Metrics Support
The service collects the following metrics:
- Time to First Byte (TTFB)
- Processing duration
- API response time
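To surface these metrics, collection is typically enabled on the pipeline task (the PipelineParams flags follow Pipecat’s task API; treat them as assumptions):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is the Pipeline instance from the usage example above.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,       # TTFB and processing-duration metrics
        enable_usage_metrics=True,
    ),
)
```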
Notes
- Requires valid OpenAI API key
- The GPT-4o transcription model offers higher accuracy than Whisper
- Requires VAD component in transport
- Handles API rate limiting
- Automatic error handling
- Thread-safe processing
Error Handling
The service handles common API errors including:
- Authentication errors
- Rate limiting
- Invalid audio format
- Network connectivity issues
- API timeouts
Errors are propagated through ErrorFrame instances with descriptive messages.