Overview

OpenAISTTService provides high-accuracy speech recognition using OpenAI's transcription models, including the latest GPT-4o transcription model and the proven Whisper API. It relies on Voice Activity Detection (VAD) to segment incoming audio, transcribing each complete utterance for better accuracy and context understanding.

Installation

To use OpenAI services, install the required dependency:
pip install "pipecat-ai[openai]"
You’ll need to set up your OpenAI API key as an environment variable: OPENAI_API_KEY.
Get your API key from the OpenAI Platform.

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, mono)
  • UserStartedSpeakingFrame - VAD signal to start buffering audio
  • UserStoppedSpeakingFrame - VAD signal to process buffered audio

Output

  • TranscriptionFrame - Final transcription results (no interim results)
  • ErrorFrame - API or processing errors
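A downstream processor can watch for these frames directly. Below is a minimal sketch, assuming Pipecat's standard FrameProcessor API; TranscriptionLogger is a hypothetical name, not part of the library:
from pipecat.frames.frames import ErrorFrame, Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptionLogger(FrameProcessor):
    # Hypothetical processor that logs STT output while passing frames through.
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            print(f"Final transcription: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")

        # Forward every frame so the rest of the pipeline still sees it.
        await self.push_frame(frame, direction)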

Models

OpenAI offers two transcription models with different strengths:
  • gpt-4o-transcribe - Latest GPT-4o model fine-tuned for transcription. Best for high accuracy, robustness to accents, and context understanding. Accuracy: highest; speed: fast.
  • whisper-1 - OpenAI's proven Whisper model. Best for broad language support and clean audio. Accuracy: high; speed: fast.
Recommended: Use gpt-4o-transcribe for the best accuracy and context understanding, especially with challenging audio or technical content.
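Switching models is just a constructor argument. A minimal sketch selecting whisper-1 from the list above:
import os

from pipecat.services.openai.stt import OpenAISTTService

# whisper-1 trades some accuracy for broad language coverage on clean audio.
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="whisper-1",
)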

Language Support

OpenAI’s speech-to-text models support 60+ languages with automatic language detection:
Common languages:
  • Language.EN - English - en
  • Language.ES - Spanish - es
  • Language.FR - French - fr
  • Language.DE - German - de
  • Language.IT - Italian - it
  • Language.JA - Japanese - ja
Regional variants (like EN_US, FR_CA) are automatically mapped to their base language codes.
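For example, passing a regional variant behaves the same as passing its base language; a short sketch assuming the enum's EN_US member:
import os

from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language

# Language.EN_US is mapped to the base code "en" before the API request,
# so this is equivalent to passing Language.EN.
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    language=Language.EN_US,
)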

Usage Example

Basic Configuration

Initialize the OpenAISTTService and use it in a pipeline:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language

# Simple setup with GPT-4o (recommended)
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    language=Language.EN
)

# Use in pipeline with VAD
pipeline = Pipeline([
    transport.input(),  # Must include VAD analyzer
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])
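The transport must supply the VAD analyzer that produces the speaking frames above. A sketch, assuming a Daily transport with Silero VAD; your transport and its parameter names may differ:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# room_url and token are placeholders for your own Daily room credentials.
transport = DailyTransport(
    room_url,
    token,
    "Transcription bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # emits UserStarted/StoppedSpeakingFrame
    ),
)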

Advanced Configuration

# Optimized for technical content
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    language=Language.EN,
    prompt="Transcribe technical terms accurately. Format numbers as digits rather than words.",
    temperature=0.0  # Deterministic output
)

Dynamic Configuration

Update settings at runtime by pushing an STTUpdateSettingsFrame to the OpenAISTTService:
from pipecat.frames.frames import STTUpdateSettingsFrame

await task.queue_frame(STTUpdateSettingsFrame(
    language=Language.FR,
))

Metrics

The service provides the following metrics:
  • Time to First Byte (TTFB) - API response latency
  • Processing Duration - Total transcription time
Learn how to enable Metrics in your Pipeline.
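A short sketch of turning metrics on when creating the task, assuming Pipecat's PipelineParams options:
from pipecat.pipeline.task import PipelineParams, PipelineTask

# pipeline is the Pipeline instance built earlier.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,  # TTFB and processing-duration metrics
    ),
)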

Additional Notes

  • Segmented Processing: Processes complete utterances, not continuous streams
  • No Interim Results: Only final transcriptions are provided (typical for batch APIs)
  • Audio Buffer: Maintains a 1-second buffer to capture speech that begins before VAD detection
  • Language Variants: Regional language codes automatically map to base languages
  • Context Prompts: GPT-4o especially benefits from domain-specific prompts
  • Rate Limits: Check your OpenAI plan for concurrent request and usage limits
  • Quality Focus: OpenAI prioritizes accuracy and context understanding over speed