Overview

WhisperSTTService provides speech-to-text capabilities using OpenAI’s Whisper models running locally. It supports multiple model sizes and configurations for offline transcription.

Installation

To use WhisperSTTService, install the required dependencies:

pip install pipecat-ai[whisper]

Configuration

Constructor Parameters

model
str | Model
default:
Model.DISTIL_MEDIUM_EN

Whisper model to use. Can be a string or Model enum value

device
str
default:
"auto"

Device to run the model on (‘cpu’, ‘cuda’, or ‘auto’)

compute_type
str
default:
"default"

Computation type for model inference

no_speech_prob
float
default:
0.4

Threshold for filtering out non-speech segments
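
As a rough illustration of how the no_speech_prob threshold might be applied, the sketch below keeps only segments whose estimated no-speech probability falls below the threshold. The Segment class and join_speech_segments helper are hypothetical stand-ins, not part of the pipecat API, and the strict "less than" comparison is an assumption; the filtering inside WhisperSTTService may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one transcription segment from the model."""
    text: str
    no_speech_prob: float  # model's estimate that the segment contains no speech


def join_speech_segments(segments, no_speech_prob=0.4):
    """Concatenate the text of segments judged to contain speech.

    A segment is kept only if its no-speech probability is strictly below
    the threshold (assumed comparison, see the note above).
    """
    kept = [s.text.strip() for s in segments if s.no_speech_prob < no_speech_prob]
    return " ".join(kept)


segments = [
    Segment("Hello there.", 0.05),
    Segment("(background hum)", 0.85),
    Segment("How are you?", 0.10),
]
print(join_speech_segments(segments))
```

Under this reading, lowering the threshold filters more aggressively, while raising it lets more borderline segments through.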

Available Models

class Model(Enum):
    TINY = "tiny"                   # Smallest, fastest model
    BASE = "base"                   # Basic model
    MEDIUM = "medium"               # Medium-sized model
    LARGE = "large-v3"              # Largest, most accurate model
    DISTIL_LARGE_V2 = "Systran/faster-distil-whisper-large-v2"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"
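
Because the model parameter accepts either a Model member or a plain string, one plausible way to normalize the two forms is shown below. This is a sketch only: resolve_model_name is a hypothetical helper, and the enum is a two-member copy of the listing above so the example runs on its own.

```python
from enum import Enum


class Model(Enum):
    # Subset copied from the listing above so the sketch is self-contained
    TINY = "tiny"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"


def resolve_model_name(model):
    """Return the identifier string for a Model member, or pass a plain string through."""
    return model.value if isinstance(model, Model) else model


print(resolve_model_name(Model.DISTIL_MEDIUM_EN))
print(resolve_model_name("path/to/custom/model"))
```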

Input

The service processes raw audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • Single channel (mono)
  • Converted to float32 and normalized to the range [-1.0, 1.0] before inference
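
The 16-bit and float requirements above fit together if the 16-bit samples are scaled into the floating-point range. A minimal sketch of that step follows, assuming little-endian signed 16-bit mono PCM and a scale factor of 1/32768. The pcm16_to_float helper is hypothetical, and it returns ordinary Python floats; a real implementation would more likely produce a float32 array.

```python
import struct


def pcm16_to_float(pcm_bytes):
    """Convert little-endian signed 16-bit mono PCM bytes to floats in [-1.0, 1.0)."""
    if len(pcm_bytes) % 2 != 0:
        raise ValueError("16-bit PCM data must contain an even number of bytes")
    count = len(pcm_bytes) // 2
    samples = struct.unpack(f"<{count}h", pcm_bytes)
    # Dividing by 32768 maps the int16 range [-32768, 32767] into [-1.0, 1.0)
    return [s / 32768.0 for s in samples]


# Three samples: silence, half of positive full scale, negative full scale
raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_float(raw))  # [0.0, 0.5, -1.0]
```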

Output Frames

TranscriptionFrame

Generated for transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

ErrorFrame

Generated when transcription errors occur, containing error details.
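
To make the TranscriptionFrame fields concrete, the sketch below formats a frame into a log line. TranscriptionFrameSketch is a hypothetical stand-in that mirrors only the three documented fields; it is not the pipecat class, and format_transcript_line is likewise an illustrative helper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TranscriptionFrameSketch:
    """Hypothetical stand-in mirroring the documented TranscriptionFrame fields."""
    text: str
    user_id: str
    timestamp: str  # ISO 8601 formatted


def format_transcript_line(frame):
    """Render a frame as 'HH:MM:SS user: text' using its ISO 8601 timestamp."""
    when = datetime.fromisoformat(frame.timestamp)
    return f"{when:%H:%M:%S} {frame.user_id}: {frame.text}"


frame = TranscriptionFrameSketch(
    text="Hello there.",
    user_id="user-1",
    timestamp=datetime(2024, 1, 1, 12, 30, 0, tzinfo=timezone.utc).isoformat(),
)
print(format_transcript_line(frame))
```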

Usage Example

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.whisper import WhisperSTTService, Model

# Configure service with the default model on a CUDA device
stt_service = WhisperSTTService(
    model=Model.DISTIL_MEDIUM_EN,
    device="cuda",
    no_speech_prob=0.4
)

# Or use a custom model path on CPU (this replaces the instance above)
stt_service = WhisperSTTService(
    model="path/to/custom/model",
    device="cpu"
)

# Use in pipeline
# `transport` and `text_handler` are assumed to be created elsewhere in your application
pipeline = Pipeline([
    transport.input(),   # Produces audio frames
    stt_service,         # Processes audio → produces transcription frames
    text_handler         # Consumes transcription frames
])

Methods

See the STT base class methods for additional functionality.

Model Selection Guide

Model             Size    Speed    Accuracy        Memory Usage
TINY              39M     Fastest  Basic           Minimal
BASE              74M     Fast     Good            Low
MEDIUM            769M    Medium   Better          Moderate
LARGE             1.5GB   Slow     Best            High
DISTIL_MEDIUM_EN  ~400M   Fast     Good (English)  Moderate
DISTIL_LARGE_V2   ~750M   Medium   Better          Moderate

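Frame Flow

Audio frames (input) → WhisperSTTService → TranscriptionFrame, or ErrorFrame if transcription fails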

Metrics Support

The service collects processing metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • Model loading time
  • Inference time

Notes

  • Runs completely offline after model download
  • First run requires model download
  • Supports CPU and CUDA acceleration
  • Processes audio in segments
  • Filters out non-speech segments
  • Thread-safe processing
  • Reports transcription errors as ErrorFrame