Ultravox
Speech-to-text service implementation using a locally-loaded Ultravox multimodal model
Overview
UltravoxSTTService provides real-time speech-to-text capabilities using the Ultravox multimodal model running locally. Ultravox encodes audio directly into the LLM's embedding space, eliminating the need for a separate ASR component and providing faster, more efficient transcription.
Installation
To use UltravoxSTTService, install the required dependencies:
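Assuming the service ships with the pipecat-ai package (which the frame types on this page come from), the ultravox extra pulls in the model dependencies; the extra's exact name is an assumption, so verify it against your installed version:

```bash
pip install "pipecat-ai[ultravox]"
```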
Configuration
Constructor Parameters
- Model: The Ultravox model to use. Defaults to the 8B parameter model based on Llama 3.1.
- Hugging Face token: Your token for accessing the model. Uses the HF_TOKEN environment variable if not provided.
- Temperature: Sampling temperature for text generation, controlling creativity vs. determinism.
- Max tokens: Maximum number of tokens to generate for each transcription.
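A minimal construction sketch follows. The keyword names (model, hf_token, temperature, max_tokens) and the import path are assumptions inferred from the parameter descriptions above; check the service's actual signature before use.

```python
import os

# Import path is an assumption; adjust to where UltravoxSTTService
# lives in your installation.
from pipecat.services.ultravox.stt import UltravoxSTTService

stt = UltravoxSTTService(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",  # default 8B Llama 3.1 model
    hf_token=os.getenv("HF_TOKEN"),  # omit to fall back to the env variable
    temperature=0.2,  # lower values favor deterministic transcripts
    max_tokens=256,   # cap on tokens generated per transcription
)
```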
Input
The service processes AudioRawFrame instances containing:
- Raw PCM audio data
- 16-bit depth (int16)
- 16kHz sample rate (recommended)
- Single channel (mono)
The service buffers audio between UserStartedSpeakingFrame and UserStoppedSpeakingFrame events, so only complete speech segments are sent to the model.
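As an illustration of the expected format, the sketch below wraps mono int16 samples in an AudioRawFrame. The field names follow Pipecat's frames module, but treat the import path as an assumption.

```python
import numpy as np

from pipecat.frames.frames import AudioRawFrame  # path assumed

def make_audio_frame(samples: np.ndarray, sample_rate: int = 16000) -> AudioRawFrame:
    """Wrap mono 16-bit PCM samples in an AudioRawFrame."""
    assert samples.dtype == np.int16, "service expects 16-bit PCM (int16)"
    return AudioRawFrame(
        audio=samples.tobytes(),  # raw little-endian PCM bytes
        sample_rate=sample_rate,  # 16kHz recommended
        num_channels=1,           # mono
    )
```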
Output Frames
The service produces the following frame during transcription:
TranscriptionFrame
Generated when speech processing is complete, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp
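One way to consume these frames downstream is a small frame processor. The following is a sketch using Pipecat's standard processor pattern; it also logs the ErrorFrames described under Error Handling below.

```python
from pipecat.frames.frames import ErrorFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    """Logs transcripts (and errors) as they flow through the pipeline."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            # text, user_id, and timestamp match the fields listed above
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"STT error: {frame.error}")
        await self.push_frame(frame, direction)
```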
Usage Example
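A minimal pipeline sketch, assuming a Pipecat-style transport that emits 16kHz mono audio plus the speaking events above; the service import path and the transport are placeholders to adapt to your setup:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

# Import path is an assumption; adjust to your installation.
from pipecat.services.ultravox.stt import UltravoxSTTService

async def main(transport):
    # `transport` is a placeholder for any transport that produces raw
    # audio frames plus UserStarted/StoppedSpeakingFrame events.
    stt = UltravoxSTTService(hf_token=os.getenv("HF_TOKEN"))

    pipeline = Pipeline([
        transport.input(),   # audio + speaking events in
        stt,                 # audio -> TranscriptionFrame
        TranscriptLogger(),  # from the Output Frames section above
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```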
Internal Architecture
UltravoxSTTService uses the following internal components:
AudioBuffer
Collects audio frames during speech events for batch processing:
- Manages frames between UserStartedSpeakingFrame and UserStoppedSpeakingFrame
- Tracks speech start timestamp
- Prevents concurrent processing with status flag
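The actual internals are not shown here; the following is an illustrative sketch of that buffering contract, with invented names:

```python
import time
from dataclasses import dataclass, field

@dataclass
class SpeechAudioBuffer:
    """Illustrative stand-in for the internal audio buffer."""
    chunks: list = field(default_factory=list)
    started_at: float = 0.0   # speech start timestamp
    processing: bool = False  # status flag preventing concurrent runs

    def on_user_started_speaking(self) -> None:
        self.chunks.clear()
        self.started_at = time.time()

    def append(self, audio: bytes) -> None:
        if not self.processing:
            self.chunks.append(audio)

    def on_user_stopped_speaking(self) -> bytes:
        self.processing = True        # block a second run until reset
        return b"".join(self.chunks)  # one contiguous PCM buffer
```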
UltravoxModel
Handles loading and running the Ultravox model:
- Initializes vLLM engine for optimal GPU inference
- Manages model tokenization and prompt formatting
- Provides streaming text generation from audio input
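The sketch below illustrates that flow with vLLM's offline API rather than the service's actual streaming engine. The <|audio|> placeholder and the (samples, sample_rate) multimodal format follow vLLM's published Ultravox examples, but treat the details as assumptions.

```python
import numpy as np
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "fixie-ai/ultravox-v0_4_1-llama-3_1-8b"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)  # the real service streams via vLLM's async engine

def transcribe(pcm: bytes, sample_rate: int = 16000) -> str:
    # int16 PCM -> float32 in [-1, 1], the form vLLM's audio input expects
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    # The chat template places the audio at the <|audio|> marker.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "<|audio|>\nTranscribe the audio."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"audio": (samples, sample_rate)}},
        SamplingParams(temperature=0.2, max_tokens=256),
    )
    return outputs[0].outputs[0].text
```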
Frame Flow
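At a high level, a single utterance flows through the service as:

UserStartedSpeakingFrame → AudioRawFrame(s) (buffered) → UserStoppedSpeakingFrame → model inference → TranscriptionFrame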
Metrics Support
The service supports metrics collection:
- Time to First Byte (TTFB)
- Total processing duration
- Buffer collection time
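Metrics are reported through the framework's standard metrics hooks. A typical setup (a sketch, reusing the pipeline from the usage example above) enables them on the task:

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# enable_metrics turns on TTFB and processing-duration reporting
task = PipelineTask(pipeline, params=PipelineParams(enable_metrics=True))
```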
Error Handling
The service produces ErrorFrames in the following scenarios:
- Empty audio buffer
- Invalid audio data
- Model inference errors
- Timeout during processing
Available Models
| Model ID | Parameters | Base LLM | Min VRAM | Languages |
| --- | --- | --- | --- | --- |
| fixie-ai/ultravox-v0_4_1-llama-3_1-8b | 8B | Llama 3.1 | 16GB | English, limited multilingual |
| fixie-ai/ultravox-v0_4_1-llama-3_1-70b | 70B | Llama 3.1 | 80GB+ | English, broader multilingual |
| fixie-ai/ultravox-v0_4-mistral-7b | 7B | Mistral | 14GB | English |
See the Fixie.ai Hugging Face page for the latest model availability.
Ensure your environment has sufficient GPU resources for the selected model.
Memory Considerations
The Ultravox model requires GPU resources:
- For the 8B parameter model (fixie-ai/ultravox-v0_4_1-llama-3_1-8b):
  - Benchmarked on A100-40GB GPUs
  - Can run on consumer GPUs with appropriate memory (at least 8GB VRAM, but performance may vary)
- For larger variants like the 70B parameter model:
  - Requires significantly more memory (40GB+ VRAM recommended)
- Audio buffer size grows with speech duration
- Processing occurs in batches, not streaming
According to the Hugging Face model card, the 8B model achieves a time-to-first-token of approximately 150 ms and generates 50–100 tokens per second on an A100-40GB GPU.
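As a sanity check on the table above: fp16 weights take about 2 bytes per parameter, so the 8B model's weights alone occupy roughly 8 × 10⁹ × 2 bytes ≈ 16GB, before activations and the KV cache are counted.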