Overview
UltravoxSTTService provides real-time speech-to-text using the Ultravox multimodal model running locally. Ultravox directly encodes audio into the LLM’s embedding space, eliminating traditional ASR components and providing faster, more efficient transcription with built-in conversational understanding.
- Ultravox STT API Reference: Pipecat's API methods for Ultravox STT integration
- Example Implementation: Complete example with GPU optimization
- Ultravox Documentation: Official Ultravox documentation and features
- Hugging Face Models: Access Ultravox models and get HF tokens
Installation
To use Ultravox services, install the required dependency — the `ultravox` extra of the `pipecat-ai` package (e.g. `pip install "pipecat-ai[ultravox]"`).

Prerequisites
Ultravox Model Setup
Before using Ultravox STT services, you need:
- Hugging Face Account: Sign up at Hugging Face
- HF Token: Generate a Hugging Face token for model access
- GPU Resources: Recommended for optimal performance with local model inference
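Since local inference is GPU-bound, it can help to verify CUDA visibility before loading the model. The check below is an illustrative helper (not part of Pipecat) and assumes PyTorch is installed alongside the Ultravox dependencies:

```python
def gpu_status() -> str:
    """Report whether a CUDA-capable GPU is visible to PyTorch."""
    try:
        import torch  # typically pulled in by the Ultravox dependencies
    except ImportError:
        return "PyTorch not installed"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CPU only (expect high transcription latency)"


print(gpu_status())
```

Running on CPU still works for experimentation, but real-time transcription generally needs a GPU.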
Required Environment Variables
HF_TOKEN: Your Hugging Face token for model access
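Since the model weights are fetched from Hugging Face, it is worth failing fast when the token is missing. The helper below is an illustrative sketch (not part of Pipecat) that reads `HF_TOKEN` from the environment:

```python
import os


def require_hf_token() -> str:
    """Return the Hugging Face token, raising a clear error if it is unset."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; generate one in your Hugging Face "
            "account settings and export it before starting the pipeline."
        )
    return token
```

Calling this once at startup surfaces a misconfigured environment immediately, instead of failing later during model download.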
Usage
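The usage example did not survive on this page, so the sketch below shows roughly how the service slots into a Pipecat pipeline. The import path for `UltravoxSTTService`, its constructor arguments, and the `transport` object are assumptions — confirm them against the API reference linked above:

```python
import asyncio
import os


async def run_stt_pipeline(transport):
    """Wire UltravoxSTTService into a minimal Pipecat pipeline.

    `transport` is any configured Pipecat audio transport (e.g. a local
    or WebRTC transport created elsewhere). Imports are kept local so
    this sketch documents the wiring without importing pipecat eagerly.
    """
    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.runner import PipelineRunner
    from pipecat.pipeline.task import PipelineTask
    from pipecat.services.ultravox.stt import UltravoxSTTService  # assumed path

    # The model runs locally; HF_TOKEN is needed to download the weights.
    stt = UltravoxSTTService(hf_token=os.environ["HF_TOKEN"])  # assumed parameter

    pipeline = Pipeline([transport.input(), stt])
    await PipelineRunner().run(PipelineTask(pipeline))
```

Downstream processors (an LLM service, TTS, an output transport) would be appended to the same `Pipeline` list in a full voice agent.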
Notes
- Local inference: Ultravox runs entirely locally, requiring no external API calls. A GPU is recommended for acceptable latency.
- Multimodal approach: Unlike traditional ASR models, Ultravox encodes audio directly into the LLM’s embedding space, which can improve conversational understanding.
- Documentation: For detailed configuration options and model variants, see the official Ultravox documentation.