Overview

OpenAISTTService provides speech-to-text capabilities using OpenAI’s latest models, including the GPT-4o transcription model and the hosted Whisper API. It offers high-accuracy transcription with minimal setup requirements, using Voice Activity Detection (VAD) to process only speech segments.

Installation

To use OpenAISTTService, install the required dependencies:

pip install "pipecat-ai[openai]"

You’ll need to set up your OpenAI API key as an environment variable: OPENAI_API_KEY.

You can obtain an OpenAI API key from the OpenAI platform.
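If you prefer not to hard-code the key, a minimal sketch like the one below reads it from the OPENAI_API_KEY environment variable at startup (the import path matches this page; adjust secret handling to your own setup):

import os

from pipecat.services.openai import OpenAISTTService

# Read the key from the environment rather than embedding it in source code.
stt = OpenAISTTService(api_key=os.getenv("OPENAI_API_KEY"))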

Configuration

Constructor Parameters

model
str
default:"gpt-4o-transcribe"

Model to use. Supported models include “gpt-4o-transcribe” (recommended) and “whisper-1”.

api_key
str

Your OpenAI API key.

base_url
str

Custom base URL for OpenAI API requests.

language
Language
default:"Language.EN"

Language of the audio input. Defaults to English.

prompt
str

Optional text to guide the model’s style or continue a previous segment.

temperature
float

Sampling temperature between 0 and 1. Lower values produce more deterministic output; higher values produce more varied output. Defaults to 0.0.

sample_rate
int

Audio sample rate in Hz. If not provided, uses the pipeline’s sample rate.
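Taken together, a constructor call that exercises the optional parameters might look like the sketch below. The base_url, temperature, and sample_rate values are illustrative assumptions, not recommendations:

from pipecat.services.openai import OpenAISTTService
from pipecat.transcriptions.language import Language

stt = OpenAISTTService(
    model="gpt-4o-transcribe",
    api_key="your-api-key",
    base_url="https://api.openai.com/v1",  # optional custom endpoint (illustrative)
    language=Language.EN,
    prompt="Expect technical vocabulary.",
    temperature=0.0,
    sample_rate=16000,  # omit to use the pipeline's sample rate
)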

Input

The service processes audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • Single channel (mono)
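In a typical pipeline the transport produces these audio frames for you; the sketch below only illustrates the expected format. The InputAudioRawFrame class and its fields follow common Pipecat usage but are assumptions here, so check them against your installed version:

from pipecat.frames.frames import InputAudioRawFrame

# 20 ms of silence at 16 kHz: 16000 samples/s * 0.02 s * 2 bytes per 16-bit sample.
pcm_chunk = b"\x00\x00" * 320

frame = InputAudioRawFrame(
    audio=pcm_chunk,    # raw 16-bit PCM bytes
    sample_rate=16000,  # Hz
    num_channels=1,     # mono
)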

Output Frames

The service produces two types of frames during transcription:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Detected language (if available)

ErrorFrame

Generated when transcription errors occur, containing error details.
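A minimal sketch of a downstream processor that logs both frame types is shown below. The FrameProcessor base class and frame imports follow common Pipecat patterns, but verify them against your installed version; place the processor after the STT service in the pipeline:

from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    """Logs final transcriptions and STT errors as they pass through."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")

        # Always forward the frame so the rest of the pipeline keeps running.
        await self.push_frame(frame, direction)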

Methods

Set Model

await service.set_model("gpt-4o-transcribe")  # For the latest GPT-4o transcription model
# or
await service.set_model("whisper-1")  # For the Whisper model

See the STT base class methods for additional functionality.

Models

Model             | Description                                      | Best For
gpt-4o-transcribe | Latest GPT-4o model fine-tuned for transcription | High accuracy, robustness to accents, better context understanding
whisper-1         | OpenAI’s Whisper model                           | Broad language support, good performance on clean audio

Language Support

OpenAI’s speech-to-text models support a wide range of languages. The service automatically maps Language enum values to the appropriate language codes.

Language Code | Description | Service Code
Language.AF   | Afrikaans   | af
Language.AR   | Arabic      | ar
Language.HY   | Armenian    | hy
Language.AZ   | Azerbaijani | az
Language.BE   | Belarusian  | be
Language.BS   | Bosnian     | bs
Language.BG   | Bulgarian   | bg
Language.CA   | Catalan     | ca
Language.ZH   | Chinese     | zh
Language.HR   | Croatian    | hr
Language.CS   | Czech       | cs
Language.DA   | Danish      | da
Language.NL   | Dutch       | nl
Language.EN   | English     | en
Language.ET   | Estonian    | et
Language.FI   | Finnish     | fi
Language.FR   | French      | fr
Language.GL   | Galician    | gl
Language.DE   | German      | de
Language.EL   | Greek       | el
Language.HE   | Hebrew      | he
Language.HI   | Hindi       | hi
Language.HU   | Hungarian   | hu
Language.IS   | Icelandic   | is
Language.ID   | Indonesian  | id
Language.IT   | Italian     | it
Language.JA   | Japanese    | ja
Language.KN   | Kannada     | kn
Language.KK   | Kazakh      | kk
Language.KO   | Korean      | ko
Language.LV   | Latvian     | lv
Language.LT   | Lithuanian  | lt
Language.MK   | Macedonian  | mk
Language.MS   | Malay       | ms
Language.MR   | Marathi     | mr
Language.MI   | Maori       | mi
Language.NE   | Nepali      | ne
Language.NO   | Norwegian   | no
Language.FA   | Persian     | fa
Language.PL   | Polish      | pl
Language.PT   | Portuguese  | pt
Language.RO   | Romanian    | ro
Language.RU   | Russian     | ru
Language.SR   | Serbian     | sr
Language.SK   | Slovak      | sk
Language.SL   | Slovenian   | sl
Language.ES   | Spanish     | es
Language.SW   | Swahili     | sw
Language.SV   | Swedish     | sv
Language.TL   | Tagalog     | tl
Language.TA   | Tamil       | ta
Language.TH   | Thai        | th
Language.TR   | Turkish     | tr
Language.UK   | Ukrainian   | uk
Language.UR   | Urdu        | ur
Language.VI   | Vietnamese  | vi
Language.CY   | Welsh       | cy

OpenAI’s models support language variants (like en-US, fr-CA) by mapping them to their base language. For example, Language.EN_US and Language.EN_GB will both map to en.

The service will automatically detect the language if none is specified, but specifying the language typically improves transcription accuracy.
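For example, a variant value passes straight through the constructor and is mapped internally (assuming the same constructor shown in the usage example below):

from pipecat.services.openai import OpenAISTTService
from pipecat.transcriptions.language import Language

# Language.EN_US is mapped to the base code "en" before the API request.
stt = OpenAISTTService(api_key="your-api-key", language=Language.EN_US)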

Usage Example

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai import OpenAISTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = OpenAISTTService(
    model="gpt-4o-transcribe",
    api_key="your-api-key",
    language=Language.EN,
    prompt="Transcribe technical terms accurately. Format numbers as digits rather than words."
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Voice Activity Detection Integration

This service inherits from SegmentedSTTService, which uses Voice Activity Detection (VAD) to identify speech segments for processing. This approach:

  • Processes only actual speech, not silence or background noise
  • Maintains a small audio buffer (default 1 second) to capture speech that occurs slightly before VAD detection
  • Receives UserStartedSpeakingFrame and UserStoppedSpeakingFrame from a VAD component in the pipeline
  • Only sends complete utterances to the API when speech has ended

Ensure your transport includes a VAD component (like SileroVADAnalyzer) to properly detect speech segments.
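As an illustration, a Daily transport configured with Silero VAD might look like the sketch below. The import paths and parameter names vary between Pipecat versions, so treat this as a sketch rather than a drop-in snippet:

from pipecat.audio.vad.silero import SileroVADAnalyzer  # path may differ by version
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder room URL
    token=None,
    bot_name="Transcription Bot",
    params=DailyParams(
        audio_in_enabled=True,             # deliver microphone audio to the pipeline
        vad_analyzer=SileroVADAnalyzer(),  # emits UserStarted/StoppedSpeakingFrame
    ),
)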

Metrics Support

The service collects the following metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • API response time
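These metrics are typically only emitted when metrics collection is enabled on the pipeline task; a minimal sketch, assuming the standard PipelineTask and PipelineParams API, looks like this:

from pipecat.pipeline.task import PipelineParams, PipelineTask

# Enable metrics for every service in the pipeline, including the STT service.
# `pipeline` is the Pipeline object built in the usage example above.
task = PipelineTask(pipeline, params=PipelineParams(enable_metrics=True))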

Notes

  • Requires valid OpenAI API key
  • GPT-4o transcription model offers superior accuracy to Whisper
  • Requires VAD component in transport
  • Handles API rate limiting
  • Automatic error handling
  • Thread-safe processing

Error Handling

The service handles common API errors including:

  • Authentication errors
  • Rate limiting
  • Invalid audio format
  • Network connectivity issues
  • API timeouts

Errors are propagated through ErrorFrames with descriptive messages.