Overview

NVIDIA Riva provides two STT services:

  • RivaSTTService for real-time streaming transcription using Parakeet models
  • RivaSegmentedSTTService for segmented transcription using Canary models with advanced language support

Installation

To use NVIDIA Riva services, install the required dependency:

pip install "pipecat-ai[riva]"

You’ll also need to set your NVIDIA API key in the NVIDIA_API_KEY environment variable.

Get your API key from NVIDIA’s developer portal.
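
It can help to fail early when the variable is missing, rather than passing an empty key to the service. A minimal sketch; require_api_key is a hypothetical helper for illustration and is not part of Pipecat:

```python
import os


def require_api_key(env=None, name="NVIDIA_API_KEY"):
    """Return the API key from the environment, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the bot")
    return key
```

The result can then be passed as api_key when constructing either service.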

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (16-bit, mono)
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription

Output

  • InterimTranscriptionFrame - Real-time transcription updates (streaming only)
  • TranscriptionFrame - Final transcription results
  • ErrorFrame - Connection or processing errors
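
Because the input is raw 16-bit mono PCM, the size of an audio chunk follows directly from the sample rate and duration. The arithmetic can be sketched as below; the 16 kHz rate in the usage note is an assumption for illustration, and the helper names are hypothetical:

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono


def pcm_chunk_size(sample_rate_hz: int, duration_ms: int) -> int:
    """Number of bytes in a 16-bit mono PCM chunk of the given duration."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def silence(sample_rate_hz: int, duration_ms: int) -> bytes:
    """A chunk of digital silence (all-zero samples), handy for tests."""
    return bytes(pcm_chunk_size(sample_rate_hz, duration_ms))
```

For example, at an assumed 16 kHz a 20 ms chunk holds 320 samples, so pcm_chunk_size(16000, 20) is 640 bytes.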

Service Comparison

Feature          RivaSTTService          RivaSegmentedSTTService
Processing       Real-time streaming     Segmented (VAD-based)
Model            Parakeet CTC 1.1B       Canary 1B
Latency          Ultra-low               Higher (batch processing)
Languages        English-focused         Multi-language
Interim Results  Yes                     No
Best For         Real-time conversation  Multi-language accuracy
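
The comparison above can be turned into a simple selection rule. This is a sketch that encodes only what this page states; choose_service is a hypothetical helper, it returns class names as strings so it runs without Pipecat installed, and defaulting US English to the streaming service is a design choice of the sketch:

```python
def choose_service(language_code: str, need_interim_results: bool) -> str:
    """Pick a service class name using the comparison on this page.

    Hypothetical helper; the rules mirror the documentation, not Pipecat logic.
    """
    is_us_english = language_code.lower() == "en-us"
    if is_us_english:
        # Streaming offers interim results and the lowest latency for English.
        return "RivaSTTService"
    if need_interim_results:
        # Only the streaming service emits interim results, and it is English-focused.
        raise ValueError(f"No interim results available for {language_code!r}")
    return "RivaSegmentedSTTService"
```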

Models

Model                  Service Class            Description                              Languages
parakeet-ctc-1.1b-asr  RivaSTTService           Streaming ASR optimized for low latency  English (various accents)
canary-1b-asr          RivaSegmentedSTTService  Multilingual ASR with high accuracy      15+ languages

See NVIDIA’s model cards for detailed performance metrics.

Language Support

RivaSTTService (Parakeet)

Primarily supports English with various regional accents:

  • Language.EN_US - English (US) - en-US

RivaSegmentedSTTService (Canary)

Supports multiple languages:

Language Code     Description          Service Code
Language.EN_US    English (US)         en-US
Language.EN_GB    English (UK)         en-GB
Language.ES       Spanish              es-ES
Language.ES_US    Spanish (US)         es-US
Language.FR       French               fr-FR
Language.DE       German               de-DE
Language.IT       Italian              it-IT
Language.PT_BR    Portuguese (Brazil)  pt-BR
Language.JA       Japanese             ja-JP
Language.KO       Korean               ko-KR
Language.RU       Russian              ru-RU
Language.HI       Hindi                hi-IN
Language.AR       Arabic               ar-AR
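
The table above amounts to a lookup from a Language member to the code sent to the service. A sketch of that lookup follows; it uses plain strings for the member names so it runs without Pipecat installed, and service_code is a hypothetical helper (the real services are assumed to perform their own mapping internally):

```python
# Service codes copied from the table above, keyed by the Language member name.
CANARY_SERVICE_CODES = {
    "EN_US": "en-US",
    "EN_GB": "en-GB",
    "ES": "es-ES",
    "ES_US": "es-US",
    "FR": "fr-FR",
    "DE": "de-DE",
    "IT": "it-IT",
    "PT_BR": "pt-BR",
    "JA": "ja-JP",
    "KO": "ko-KR",
    "RU": "ru-RU",
    "HI": "hi-IN",
    "AR": "ar-AR",
}


def service_code(language_name: str) -> str:
    """Return the service code for a Language member name such as "ES"."""
    try:
        return CANARY_SERVICE_CODES[language_name.upper()]
    except KeyError:
        raise ValueError(f"{language_name!r} is not listed for the Canary model") from None
```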

Usage Example

Real-time Streaming (RivaSTTService)

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.riva.stt import RivaSTTService
from pipecat.transcriptions.language import Language

# Basic streaming configuration
stt = RivaSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    params=RivaSTTService.InputParams(
        language=Language.EN_US
    )
)

# Use in pipeline for real-time conversation
# (transport, context_aggregator, llm, and tts are assumed to be created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Segmented Multi-language (RivaSegmentedSTTService)

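import os

from pipecat.audio.vad.silero import SileroVADAnalyzer  # VAD is required; see Additional Notes
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.riva.stt import RivaSegmentedSTTService
from pipecat.transcriptions.language import Language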

# Multi-language segmented transcription
stt = RivaSegmentedSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    params=RivaSegmentedSTTService.InputParams(
        language=Language.ES,  # Spanish
        profanity_filter=False,
        automatic_punctuation=True,
        boosted_lm_words=["inteligencia", "artificial"],
        boosted_lm_score=5.0
    )
)



# Use in pipeline
# (transport, context_aggregator, llm, and tts are assumed to be created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,  # Processes audio segments
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])
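
To illustrate what "segmented" means here: audio is buffered while speech is detected and handed off as one utterance when speech stops. The following is a rough sketch of that idea, not Pipecat's implementation; segment_audio is a hypothetical helper, and the per-chunk is_speech flag stands in for whatever a VAD analyzer reports:

```python
from typing import Iterable, Iterator, Tuple


def segment_audio(chunks: Iterable[Tuple[bytes, bool]]) -> Iterator[bytes]:
    """Group (audio_chunk, is_speech) pairs into one bytes object per utterance."""
    buffer = bytearray()
    for audio, is_speech in chunks:
        if is_speech:
            buffer.extend(audio)
        elif buffer:
            # Speech just ended: emit the buffered utterance and start over.
            yield bytes(buffer)
            buffer.clear()
    if buffer:
        # The stream ended mid-utterance; flush what was collected.
        yield bytes(buffer)
```

Each yielded segment corresponds to one transcription request, which is why the segmented service produces no interim results.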

Advanced Configuration

Both services support advanced ASR parameters:

Word Boosting

  • boosted_lm_words: List of domain-specific terms to emphasize
  • boosted_lm_score: Boost intensity (default: 4.0, recommended: 4.0-8.0)
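
Word-boosting settings are easy to get wrong (empty terms, out-of-range scores), so a small validation step can be useful. A sketch under stated assumptions: boost_settings is a hypothetical helper, and rejecting scores outside the recommended 4.0-8.0 range is a strictness choice of this sketch, not a documented service limit:

```python
def boost_settings(words, score=4.0):
    """Validate word-boosting settings and return them as keyword arguments."""
    # Drop empty or whitespace-only terms and trim the rest.
    cleaned = [w.strip() for w in words if w and w.strip()]
    if not cleaned:
        raise ValueError("boosted_lm_words needs at least one non-empty term")
    if not 4.0 <= score <= 8.0:
        raise ValueError("boosted_lm_score is outside the recommended 4.0-8.0 range")
    return {"boosted_lm_words": cleaned, "boosted_lm_score": float(score)}
```

The returned dictionary can be unpacked into the InputParams call from the segmented usage example, for instance RivaSegmentedSTTService.InputParams(language=Language.ES, **boost_settings(["inteligencia", "artificial"], 5.0)).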

Audio Processing

  • profanity_filter: Filter inappropriate content
  • automatic_punctuation: Add punctuation automatically
  • verbatim_transcripts: Return text as spoken, without inverse text normalization (for example, "twenty" rather than "20")

Voice Activity Detection (Streaming only)

  • start_history: History frames for speech start detection
  • start_threshold: Confidence threshold for speech start
  • stop_threshold: Confidence threshold for speech end

Metrics

  • Time to First Byte (TTFB) - Latency from audio segment to transcription
  • Processing Duration - Time spent processing each segment
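
To make the TTFB metric concrete, the measurement boils down to a timestamp taken when a segment is submitted and another when the first transcription arrives. A minimal sketch of that measurement; TTFBTimer is a hypothetical class for illustration and is not how Pipecat collects its metrics:

```python
import time


class TTFBTimer:
    """Measure the time from submitting audio to receiving the first transcription."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self._start = None

    def start(self):
        """Call when the audio segment is handed to the service."""
        self._start = self._clock()

    def stop(self):
        """Call when the first transcription arrives; returns elapsed seconds."""
        if self._start is None:
            raise RuntimeError("start() was not called")
        elapsed = self._clock() - self._start
        self._start = None
        return elapsed
```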

Additional Notes

  • Authentication: Uses NVIDIA Cloud Functions with Bearer token authentication
  • Real-time vs Batch: Choose the streaming service for live conversation and the segmented service for multi-language support and accuracy
  • VAD Requirement: Segmented service requires Voice Activity Detection in the pipeline
  • Custom Endpoints: Supports custom Riva server endpoints for on-premise deployments
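
For readers curious what Bearer token authentication looks like at the request level, the sketch below builds the per-call metadata for a gRPC request to NVIDIA Cloud Functions. The services are assumed to handle this internally; nvcf_metadata is a hypothetical helper, and the "function-id" entry reflects the usual NVIDIA Cloud Functions convention rather than anything stated on this page:

```python
def nvcf_metadata(api_key: str, function_id: str):
    """Build gRPC call metadata for a Bearer-authenticated request.

    Assumption: the endpoint expects a "function-id" entry alongside the
    standard "authorization: Bearer <key>" entry.
    """
    if not api_key:
        raise ValueError("api_key is required")
    return [
        ("function-id", function_id),
        ("authorization", f"Bearer {api_key}"),
    ]
```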