Overview

UltravoxSTTService provides real-time speech-to-text using the Ultravox multimodal model running locally. Ultravox directly encodes audio into the LLM’s embedding space, eliminating traditional ASR components and providing faster, more efficient transcription with built-in conversational understanding.

Installation

To use Ultravox services, install the required dependency:
pip install "pipecat-ai[ultravox]"
You’ll also need a Hugging Face token to access the models; set it in the HF_TOKEN environment variable.
Get your Hugging Face token from Hugging Face Settings.
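Before constructing the service, it can help to confirm the token is actually visible to your process, since a missing token otherwise surfaces later as a less obvious download failure. A minimal sketch (the error message is illustrative):

import os

# Fail fast if the Hugging Face token isn't set; model downloads would
# otherwise fail later with a less obvious authentication error.
if not os.getenv("HF_TOKEN"):
    raise RuntimeError("HF_TOKEN is not set; export your Hugging Face token first")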

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, 16kHz, mono)
  • UserStartedSpeakingFrame - Triggers audio buffering
  • UserStoppedSpeakingFrame - Processes collected audio
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription

Output

  • LLMFullResponseStartFrame - Indicates transcription generation start
  • LLMTextFrame - Streaming text tokens as they’re generated
  • LLMFullResponseEndFrame - Indicates transcription completion
  • ErrorFrame - Processing errors or resource issues
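Because the service emits LLM-style frames rather than TranscriptionFrame, a downstream processor can reassemble the streamed tokens into a complete utterance. A minimal sketch, where TranscriptAccumulator is an illustrative custom processor, not part of Pipecat:

from pipecat.frames.frames import (
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    LLMTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptAccumulator(FrameProcessor):
    """Collects streamed LLMTextFrame tokens into one utterance string."""

    def __init__(self):
        super().__init__()
        self._parts = []

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, LLMFullResponseStartFrame):
            self._parts = []  # a new utterance begins
        elif isinstance(frame, LLMTextFrame):
            self._parts.append(frame.text)
        elif isinstance(frame, LLMFullResponseEndFrame):
            print("Transcript:", "".join(self._parts))
        await self.push_frame(frame, direction)

Placing this processor after ultravox_processor in the pipeline lets you log or store full transcriptions while the token frames continue downstream.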

Models

Ultravox offers several models with different resource requirements:
  • fixie-ai/ultravox-v0_6-llama-3_3-70b - Latest model with improved accuracy and efficiency
  • fixie-ai/ultravox-v0_5-llama-3_3-70b - Recommended for new deployments
  • fixie-ai/ultravox-v0_5-llama-3_1-8b - Smaller model for resource-constrained environments
  • fixie-ai/ultravox-v0_4_1-llama-3_1-8b - Previous version for compatibility
  • fixie-ai/ultravox-v0_4_1-llama-3_1-70b - Larger model for high accuracy
See the Ultravox models on Hugging Face for more details.
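To run a specific checkpoint, pass it when constructing the service. A hedged sketch, assuming the constructor exposes a model_name parameter; verify the parameter name against your installed Pipecat version:

import os

from pipecat.services.ultravox.stt import UltravoxSTTService

# "model_name" is an assumption here; check the constructor signature in
# your installed version of UltravoxSTTService.
ultravox_processor = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)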

Usage Example

Basic Configuration

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.ultravox.stt import UltravoxSTTService

# Simple setup with the default model
ultravox_processor = UltravoxSTTService(
    hf_token=os.getenv("HF_TOKEN"),
    temperature=0.7,
    max_tokens=100,
)

# Use in a pipeline (requires VAD for speech detection; see the sketch below)
pipeline = Pipeline([
    transport.input(),   # transport must provide a VAD analyzer
    ultravox_processor,  # Note: Ultravox outputs LLM frames, not transcription frames
    tts,
    transport.output(),
])
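The VAD requirement is met by configuring a VAD analyzer on the transport input. A sketch using the Daily transport with Silero VAD; room_url is a placeholder, and other transports are configured analogously:

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Ultravox buffers audio between UserStartedSpeakingFrame and
# UserStoppedSpeakingFrame, so the transport must emit those frames via VAD.
transport = DailyTransport(
    room_url,  # placeholder: your Daily room URL
    None,      # optional meeting token
    "Ultravox bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)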

Metrics

The service provides the following metrics:
  • Time to First Byte (TTFB) - Latency from speech end to first token
  • Processing Duration - Total time for audio processing and generation
Learn how to enable Metrics in your Pipeline.
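Metrics are switched on when creating the pipeline task; a minimal sketch:

from pipecat.pipeline.task import PipelineParams, PipelineTask

# Enable TTFB and processing-duration metrics for services in the pipeline.
task = PipelineTask(
    pipeline,
    params=PipelineParams(enable_metrics=True),
)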

Additional Notes

  • VAD Dependency: Requires Voice Activity Detection (VAD) to trigger audio processing
  • GPU Acceleration: Designed for GPU deployment; consider using Cerebrium, Modal, or other GPU-optimized environments
  • Model Loading: First model load can take several minutes; consider pre-initialization
  • Memory Usage: Audio buffer grows with speech duration; automatically cleared after processing
  • Output Format: Generates LLMTextFrame objects rather than traditional TranscriptionFrame objects
  • Local Processing: All processing happens locally; no external API calls after model download
  • Hugging Face Authentication: Required for downloading models from Hugging Face Hub