NVIDIA Riva
Speech-to-text service implementation using NVIDIA Riva
Overview
NVIDIA Riva provides two STT services:
- RivaSTTService for real-time streaming transcription using Parakeet models
- RivaSegmentedSTTService for segmented transcription using Canary models with advanced language support
API Reference
Complete API documentation and method details
NVIDIA Riva Docs
Official NVIDIA Riva ASR documentation
Example Code
Working example with NVIDIA services integration
Installation
To use NVIDIA Riva services, install the required dependency:
You’ll also need to set up your NVIDIA API key as an environment variable: NVIDIA_API_KEY.
Get your API key from NVIDIA’s developer portal.
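A minimal sketch of reading the key at startup, so that a missing variable fails with a clear message rather than a later authentication error. The helper name load_nvidia_api_key is hypothetical and not part of any library:

```python
import os


def load_nvidia_api_key(env=os.environ):
    """Return the NVIDIA API key from the environment, or raise a clear error."""
    # Treat an empty or whitespace-only value the same as a missing one.
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NVIDIA_API_KEY is not set; export it before starting the service."
        )
    return key
```

The env parameter exists only so the function can be exercised with a plain dictionary in tests.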
Frames
Input
- InputAudioRawFrame - Raw PCM audio data (16-bit, mono)
- STTUpdateSettingsFrame - Runtime transcription configuration updates
- STTMuteFrame - Mute audio input for transcription
Output
- InterimTranscriptionFrame - Real-time transcription updates (streaming only)
- TranscriptionFrame - Final transcription results
- ErrorFrame - Connection or processing errors
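To illustrate how a downstream consumer might treat the three output frame types differently, here is a sketch that uses stand-in dataclasses. These are not the framework's real classes: the field names text and error and the TranscriptCollector class are assumptions made for the example.

```python
from dataclasses import dataclass


# Stand-ins that only mirror the frame names listed above (fields are assumed).
@dataclass
class InterimTranscriptionFrame:
    text: str


@dataclass
class TranscriptionFrame:
    text: str


@dataclass
class ErrorFrame:
    error: str


class TranscriptCollector:
    """Keep the latest partial text, the list of final texts, and any errors."""

    def __init__(self):
        self.partial = ""
        self.finals = []
        self.errors = []

    def handle(self, frame):
        if isinstance(frame, InterimTranscriptionFrame):
            # Interim results replace each other until a final result arrives.
            self.partial = frame.text
        elif isinstance(frame, TranscriptionFrame):
            self.finals.append(frame.text)
            self.partial = ""
        elif isinstance(frame, ErrorFrame):
            self.errors.append(frame.error)
```

With the segmented service, only the TranscriptionFrame and ErrorFrame branches would be exercised, since interim results are streaming only.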
Service Comparison
Feature | RivaSTTService | RivaSegmentedSTTService |
---|---|---|
Processing | Real-time streaming | Segmented (VAD-based) |
Model | Parakeet CTC 1.1B | Canary 1B |
Latency | Ultra-low | Higher (batch processing) |
Languages | English-focused | Multi-language |
Interim Results | ✅ Yes | ❌ No |
Best For | Real-time conversation | Multi-language accuracy |
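The comparison can be read as a simple decision rule. The sketch below encodes one possible reading; the function name choose_riva_service is hypothetical, and treating "English-focused" as English-only is a simplifying assumption.

```python
def choose_riva_service(language_code: str, needs_interim_results: bool) -> str:
    """Return the name of the service class that fits the comparison table."""
    is_english = language_code.lower().startswith("en")
    if needs_interim_results:
        # Interim results are only produced by the streaming service.
        if not is_english:
            raise ValueError(
                "Interim results are streaming-only, and the streaming "
                "service is English-focused."
            )
        return "RivaSTTService"
    # Without interim results: streaming for English (lowest latency),
    # segmented for every other language.
    return "RivaSTTService" if is_english else "RivaSegmentedSTTService"
```

For example, choose_riva_service("fr-FR", False) selects the segmented service, while asking for interim results in French raises an error because no row of the table satisfies both requirements.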
Models
Model | Service Class | Description | Languages |
---|---|---|---|
parakeet-ctc-1.1b-asr | RivaSTTService | Streaming ASR optimized for low latency | English (various accents) |
canary-1b-asr | RivaSegmentedSTTService | Multilingual ASR with high accuracy | 15+ languages |
See NVIDIA’s model cards for detailed performance metrics.
Language Support
RivaSTTService (Parakeet)
Primarily supports English with various regional accents:
- Language.EN_US - English (US) - en-US
RivaSegmentedSTTService (Canary)
Supports multiple languages:
Language Code | Description | Service Codes |
---|---|---|
Language.EN_US | English (US) | en-US |
Language.EN_GB | English (UK) | en-GB |
Language.ES | Spanish | es-ES |
Language.ES_US | Spanish (US) | es-US |
Language.FR | French | fr-FR |
Language.DE | German | de-DE |
Language.IT | Italian | it-IT |
Language.PT_BR | Portuguese (Brazil) | pt-BR |
Language.JA | Japanese | ja-JP |
Language.KO | Korean | ko-KR |
Language.RU | Russian | ru-RU |
Language.HI | Hindi | hi-IN |
Language.AR | Arabic | ar-AR |
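The table above can be expressed as a small lookup. This sketch keys on the enum member names as plain strings rather than importing the framework's Language enum, and the helper name to_service_code is hypothetical:

```python
# Service codes copied from the table above, keyed by Language member name.
CANARY_SERVICE_CODES = {
    "EN_US": "en-US",
    "EN_GB": "en-GB",
    "ES": "es-ES",
    "ES_US": "es-US",
    "FR": "fr-FR",
    "DE": "de-DE",
    "IT": "it-IT",
    "PT_BR": "pt-BR",
    "JA": "ja-JP",
    "KO": "ko-KR",
    "RU": "ru-RU",
    "HI": "hi-IN",
    "AR": "ar-AR",
}


def to_service_code(language_name: str) -> str:
    """Map 'Language.JA' or 'JA' to the service code listed in the table."""
    member = language_name.split(".")[-1].upper()
    try:
        return CANARY_SERVICE_CODES[member]
    except KeyError:
        supported = ", ".join(sorted(CANARY_SERVICE_CODES))
        raise ValueError(
            f"{language_name!r} is not in the table; supported: {supported}"
        ) from None
```

Failing loudly on an unlisted language is a design choice for the sketch; a real integration might prefer to fall back to a default such as en-US.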
Usage Example
Real-time Streaming (RivaSTTService)
Segmented Multi-language (RivaSegmentedSTTService)
Advanced Configuration
Both services support advanced ASR parameters:
Word Boosting
- boosted_lm_words: List of domain-specific terms to emphasize
- boosted_lm_score: Boost intensity (default: 4.0, recommended: 4.0-8.0)
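A sketch of assembling the two word-boosting parameters with a sanity check against the recommended range. The helper build_word_boost_params is hypothetical; only the parameter names, the default, and the range come from the list above, and how the resulting dictionary is passed to a service is not shown.

```python
import warnings

DEFAULT_BOOST_SCORE = 4.0
RECOMMENDED_RANGE = (4.0, 8.0)


def build_word_boost_params(words, score=DEFAULT_BOOST_SCORE):
    """Return a dict with cleaned boosted words and the boost score."""
    # Drop empty entries and duplicates while preserving the original order.
    cleaned = [w.strip() for w in words if w and w.strip()]
    unique = list(dict.fromkeys(cleaned))

    low, high = RECOMMENDED_RANGE
    if not low <= score <= high:
        warnings.warn(
            f"boosted_lm_score={score} is outside the recommended range "
            f"{low}-{high}"
        )
    return {"boosted_lm_words": unique, "boosted_lm_score": float(score)}
```

A call such as build_word_boost_params(["Riva", "Parakeet"], score=6.0) yields a dictionary whose keys match the parameter names above.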
Audio Processing
- profanity_filter: Filter inappropriate content
- automatic_punctuation: Add punctuation automatically
- verbatim_transcripts: Control transcript formatting
Voice Activity Detection (Streaming only)
- start_history: History frames for speech start detection
- start_threshold: Confidence threshold for speech start
- stop_threshold: Confidence threshold for speech end
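To build intuition for how a history window and two thresholds can interact, here is a toy hysteresis detector over per-frame speech confidences. It is not Riva's endpointing algorithm: the units, the default values, and the rule itself (start when every frame in the window clears start_threshold, stop on the first frame below stop_threshold) are assumptions made purely for illustration.

```python
from collections import deque


def detect_speech_segments(
    confidences, start_history=3, start_threshold=0.6, stop_threshold=0.3
):
    """Return (start_index, end_index) pairs of detected speech, end exclusive."""
    recent = deque(maxlen=start_history)
    segments = []
    start = None
    for i, confidence in enumerate(confidences):
        recent.append(confidence)
        if start is None:
            # Speech starts once the window is full and all frames are confident.
            if len(recent) == start_history and min(recent) >= start_threshold:
                start = i - start_history + 1
        elif confidence < stop_threshold:
            # Speech ends at the first frame that drops below the stop threshold.
            segments.append((start, i))
            start = None
            recent.clear()
    if start is not None:
        segments.append((start, len(confidences)))
    return segments
```

Using a lower stop threshold than start threshold keeps a segment open through brief dips in confidence, which is the usual motivation for two separate thresholds.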
Metrics
- Time to First Byte (TTFB) - Latency from audio segment to transcription
- Processing Duration - Time spent processing each segment
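The two metrics can be made concrete with a small timer. This sketch is independent of any framework: the SegmentTimer class and its method names are hypothetical, and it assumes TTFB is measured from the start of a segment to the first transcription received for it.

```python
import time


class SegmentTimer:
    """Record TTFB and processing duration for one audio segment."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = None
        self._first_result = None
        self._end = None

    def start_segment(self):
        self._start = self._clock()
        self._first_result = None
        self._end = None

    def on_transcription(self):
        # Only the first result after start_segment counts toward TTFB.
        if self._first_result is None:
            self._first_result = self._clock()

    def finish(self):
        self._end = self._clock()

    @property
    def ttfb(self):
        if self._start is None or self._first_result is None:
            return None
        return self._first_result - self._start

    @property
    def processing_duration(self):
        if self._start is None or self._end is None:
            return None
        return self._end - self._start
```

The clock is injectable so the timer can be tested deterministically; time.monotonic is used by default because it is unaffected by system clock adjustments.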
Additional Notes
- Authentication: Uses NVIDIA Cloud Functions with Bearer token authentication
- Real-time vs Batch: Choose streaming for real-time conversation; choose segmented for accuracy and multi-language support
- VAD Requirement: Segmented service requires Voice Activity Detection in the pipeline
- Custom Endpoints: Supports custom Riva server endpoints for on-premise deployments
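As a sketch of what Bearer token authentication and a custom endpoint amount to at the connection level, the helper below assembles a server address and gRPC-style metadata pairs. The function name build_connection_settings, the function_id value, and the default hosted address are assumptions for illustration; an on-premise Riva server may not need the metadata at all.

```python
# Assumed address of the hosted NVIDIA Cloud Functions gRPC endpoint.
DEFAULT_SERVER = "grpc.nvcf.nvidia.com:443"


def build_connection_settings(api_key, function_id, server=None):
    """Return the server address and auth metadata for a Riva connection."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        # A custom (for example on-premise) endpoint overrides the hosted one.
        "server": server or DEFAULT_SERVER,
        "metadata": [
            ("function-id", function_id),
            ("authorization", f"Bearer {api_key}"),
        ],
    }
```

Passing server="localhost:50051" (a hypothetical local address) would point the same settings at a self-hosted deployment.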