# UltravoxSTTService

Speech-to-text service implementation using a locally loaded Ultravox multimodal model.

`UltravoxSTTService` provides real-time speech-to-text using the Ultravox multimodal model running locally. Ultravox encodes audio directly into the LLM’s embedding space, eliminating traditional ASR components and providing faster, more efficient transcription with built-in conversational understanding.
Model weights are downloaded from Hugging Face; set your access token in the `HF_TOKEN` environment variable.
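Before the service can fetch gated model weights, the token has to be available in the process environment. A minimal sketch of that check, assuming the standard Hugging Face Hub convention of reading the token from `HF_TOKEN` (`resolve_hf_token` is a hypothetical helper for illustration, not part of the library):

```python
import os

def resolve_hf_token() -> str:
    # Assumption: the Ultravox checkpoints are gated on the Hugging Face
    # Hub, so a token must be present before weights can be downloaded.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "Set HF_TOKEN to a Hugging Face access token with access "
            "to the Ultravox model weights."
        )
    return token
```

Failing fast like this surfaces a missing token at startup rather than mid-pipeline when the first download is attempted.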
## Input Frames

- `InputAudioRawFrame` - Raw PCM audio data (16-bit, 16kHz, mono)
- `UserStartedSpeakingFrame` - Triggers audio buffering
- `UserStoppedSpeakingFrame` - Processes collected audio
- `STTUpdateSettingsFrame` - Runtime transcription configuration updates
- `STTMuteFrame` - Mutes audio input for transcription

## Output Frames

- `LLMFullResponseStartFrame` - Indicates transcription generation start
- `LLMTextFrame` - Streaming text tokens as they’re generated
- `LLMFullResponseEndFrame` - Indicates transcription completion
- `ErrorFrame` - Processing errors or resource issues

## Available Models

- `fixie-ai/ultravox-v0_6-llama-3_3-70b` - Latest model with improved accuracy and efficiency
- `fixie-ai/ultravox-v0_5-llama-3_3-70b` - Recommended for new deployments
- `fixie-ai/ultravox-v0_5-llama-3_1-8b` - Smaller model for resource-constrained environments
- `fixie-ai/ultravox-v0_4_1-llama-3_1-8b` - Previous version for compatibility
- `fixie-ai/ultravox-v0_4_1-llama-3_1-70b` - Larger model for high accuracy

Note: The service outputs transcriptions as streamed `LLMTextFrame` objects, not traditional `TranscriptionFrame` objects.
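Because the service streams `LLMTextFrame` tokens rather than emitting a single `TranscriptionFrame`, downstream code that wants one final transcript has to assemble the tokens between the start and end markers itself. A minimal sketch, using stand-in dataclasses in place of Pipecat's real frame classes (`collect_transcription` is a hypothetical helper, not part of the library):

```python
from dataclasses import dataclass

# Stand-in frame types for illustration only; in a real pipeline these
# are the frame classes provided by the framework.
@dataclass
class LLMFullResponseStartFrame: ...

@dataclass
class LLMTextFrame:
    text: str

@dataclass
class LLMFullResponseEndFrame: ...

def collect_transcription(frames) -> str:
    """Join streamed LLMTextFrame tokens between the start and end
    markers into one transcript string."""
    parts: list[str] = []
    collecting = False
    for frame in frames:
        if isinstance(frame, LLMFullResponseStartFrame):
            collecting = True
            parts = []  # new utterance: discard any stale tokens
        elif isinstance(frame, LLMTextFrame) and collecting:
            parts.append(frame.text)
        elif isinstance(frame, LLMFullResponseEndFrame):
            collecting = False
    return "".join(parts)
```

For example, a stream of `LLMFullResponseStartFrame`, `LLMTextFrame("Hello")`, `LLMTextFrame(" world")`, `LLMFullResponseEndFrame` would yield the transcript `"Hello world"`.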