Overview

WhisperSTTService provides offline speech recognition using OpenAI’s Whisper models running locally. It supports multiple model sizes and hardware acceleration options, including CPU, CUDA, and Apple Silicon (MLX).

Installation

Choose your installation based on your hardware:

Standard Whisper (CPU/CUDA)

pip install "pipecat-ai[whisper]"

MLX Whisper (Apple Silicon)

pip install "pipecat-ai[mlx-whisper]"
The first run downloads the selected model from Hugging Face. Model sizes range from 39 MB to 1.5 GB.

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, mono)
  • UserStartedSpeakingFrame - VAD signal to start buffering audio
  • UserStoppedSpeakingFrame - VAD signal indicating speech segment completion
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription

Output

  • TranscriptionFrame - Final transcription results (segmented processing)
  • ErrorFrame - Model loading or processing errors
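The frame flow above can be sketched as a small stand-alone buffer: audio arriving between the VAD start and stop signals is accumulated, and a complete segment is emitted when the user stops speaking. This is a simplified illustration with stand-in frame classes, not pipecat's actual implementation (the real frames live in pipecat.frames.frames and carry more fields).

```python
from dataclasses import dataclass


# Stand-in frame types for illustration only.
@dataclass
class UserStartedSpeakingFrame:
    pass


@dataclass
class UserStoppedSpeakingFrame:
    pass


@dataclass
class InputAudioRawFrame:
    audio: bytes  # raw 16-bit mono PCM


class SegmentedBuffer:
    """Accumulates audio between VAD start/stop signals, mimicking
    the segmented processing described above."""

    def __init__(self):
        self._buffering = False
        self._audio = bytearray()

    def process(self, frame):
        """Returns a completed audio segment on UserStoppedSpeakingFrame,
        otherwise None."""
        if isinstance(frame, UserStartedSpeakingFrame):
            self._buffering = True
            self._audio.clear()
        elif isinstance(frame, InputAudioRawFrame) and self._buffering:
            self._audio.extend(frame.audio)
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._buffering = False
            return bytes(self._audio)
        return None
```

Audio received before a UserStartedSpeakingFrame is ignored, which is why VAD signals are required input for both services.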

Service Comparison

Service               Hardware       Performance  Memory    Best For
WhisperSTTService     CPU/CUDA       Good         Moderate  General use, GPU acceleration
WhisperSTTServiceMLX  Apple Silicon  Better       Lower     Mac users, optimized performance

Model Selection

Standard Whisper Models

  • TINY: Smallest multilingual model, fastest inference
  • BASE: Basic multilingual model, good speed/quality balance
  • SMALL: Small multilingual model, better speed/quality balance than BASE
  • MEDIUM: Medium-sized multilingual model, better quality
  • LARGE: Best quality multilingual model, slower inference
  • LARGE_V3_TURBO: Fast multilingual model, slightly lower quality than LARGE
  • DISTIL_LARGE_V2: Fast multilingual distilled model
  • DISTIL_MEDIUM_EN: Fast English-only distilled model

MLX Whisper Models (Apple Silicon)

  • TINY: Smallest multilingual model for MLX
  • MEDIUM: Medium-sized multilingual model for MLX
  • LARGE_V3: Best quality multilingual model for MLX
  • LARGE_V3_TURBO: Finetuned, pruned Whisper large-v3, much faster with slightly lower quality
  • DISTIL_LARGE_V3: Fast multilingual distilled model for MLX
  • LARGE_V3_TURBO_Q4: LARGE_V3_TURBO quantized to Q4 for reduced memory usage

Language Support

Whisper supports a number of languages with automatic detection and regional variants:
Common languages:
  • Language.EN - English - en
  • Language.ES - Spanish - es
  • Language.FR - French - fr
  • Language.DE - German - de
  • Language.IT - Italian - it
  • Language.JA - Japanese - ja
  • Language.KO - Korean - ko
  • Language.ZH - Chinese - zh
  • Language.PT - Portuguese - pt
  • Language.RU - Russian - ru
  • Language.AR - Arabic - ar
  • Language.HI - Hindi - hi
Whisper can automatically detect language or you can specify it for better performance. All regional variants map to the same base language code.
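The regional-variant mapping can be illustrated with a small helper. This is a hypothetical function for clarity, not part of pipecat's API; it assumes BCP-47-style codes such as "en-US" or "pt_BR".

```python
def base_language_code(code: str) -> str:
    """Map a regional variant like "en-US" or "pt_BR" to the base
    two-letter code Whisper expects ("en", "pt")."""
    return code.replace("_", "-").split("-")[0].lower()
```

For example, `base_language_code("en-US")` and `base_language_code("en-GB")` both yield `"en"`, which is the value ultimately passed to the model.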

Usage Example

Basic Configuration

Initialize the WhisperSTTService and use it in a pipeline:
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.whisper.stt import WhisperSTTService, Model
from pipecat.transcriptions.language import Language

# Standard Whisper with default settings
stt = WhisperSTTService()

# Use in a pipeline with VAD (transport, llm, tts, and
# context_aggregator are assumed to be configured elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

GPU Acceleration

# CUDA acceleration for faster processing
stt = WhisperSTTService(
    model=Model.LARGE_V3_TURBO,
    device="cuda",
    compute_type="float16",  # Reduce memory usage
    no_speech_prob=0.3,      # Drop segments likely to contain no speech
    language=Language.EN     # Specify language for better performance
)

Apple Silicon Optimization

from pipecat.services.whisper.stt import WhisperSTTServiceMLX, MLXModel

# MLX Whisper optimized for Apple Silicon
stt = WhisperSTTServiceMLX(
    model=MLXModel.LARGE_V3_TURBO_Q4,  # Quantized for efficiency
    no_speech_prob=0.6,
    language=Language.EN,
)

Dynamic Configuration

Update settings at runtime by pushing an STTUpdateSettingsFrame to the WhisperSTTService:
from pipecat.frames.frames import STTUpdateSettingsFrame

await task.queue_frame(
    STTUpdateSettingsFrame(language=Language.FR)
)

Metrics

Both services provide comprehensive metrics:
  • Time to First Byte (TTFB) - Latency from audio input to first transcription
  • Processing Duration - Total time spent processing audio segments
Learn how to enable Metrics in your Pipeline.
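What these two metrics measure can be sketched as a minimal tracker. This is an illustration of the definitions above, not pipecat's metrics implementation; timestamps are passed in explicitly so the logic stays deterministic.

```python
class STTMetricsSketch:
    """Illustrative TTFB and processing-duration accounting for
    segmented STT. Times are in seconds."""

    def __init__(self):
        self.ttfb = None               # latency to the first transcription
        self.processing_duration = 0.0  # total time across all segments
        self._segment_start = None

    def segment_started(self, now: float):
        self._segment_start = now

    def transcription_ready(self, now: float):
        elapsed = now - self._segment_start
        if self.ttfb is None:
            self.ttfb = elapsed
        self.processing_duration += elapsed
```

TTFB is recorded once, for the first result; processing duration keeps accumulating as each speech segment is transcribed.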

Additional Notes

  • Segmented Processing: Both services use VAD to process speech in segments rather than continuously
  • Offline Operation: Runs completely offline after initial model download
  • Speech Filtering: no_speech_prob threshold filters out non-speech audio segments
  • Automatic Normalization: Audio is automatically normalized to float32 [-1.0, 1.0] range
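The last two notes can be sketched in a few lines. This is a stand-alone illustration of the normalization and filtering behavior described above, not the services' internal code; the helper names are hypothetical.

```python
import struct


def normalize_pcm16(audio: bytes) -> list[float]:
    """Convert raw 16-bit little-endian mono PCM to float values
    in the [-1.0, 1.0] range, as described in the notes above."""
    samples = struct.unpack(f"<{len(audio) // 2}h", audio)
    return [s / 32768.0 for s in samples]


def keep_segment(no_speech_prob: float, threshold: float = 0.6) -> bool:
    """Keep a segment only if Whisper's no-speech probability is below
    the threshold. A lower threshold filters more aggressively."""
    return no_speech_prob < threshold
```

For example, with the default threshold of 0.6, a segment Whisper scores at 0.9 no-speech probability is discarded, while one at 0.2 is transcribed.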