Overview

RivaSTTService provides real-time speech-to-text using NVIDIA’s Riva Parakeet model, with interim results and configurable recognition parameters (such as word boosting and automatic punctuation) to improve accuracy. RivaSegmentedSTTService provides segmented speech-to-text via NVIDIA’s Riva Canary model.

Installation

To use RivaSTTService or RivaSegmentedSTTService, install the required dependencies:

pip install "pipecat-ai[riva]"

You’ll also need to set up your NVIDIA API key as an environment variable: NVIDIA_API_KEY.

You can obtain an NVIDIA API key by signing up through NVIDIA’s developer portal.
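For example, on macOS or Linux the key can be exported in your shell before starting your app (the value below is a placeholder, not a real key):

```shell
export NVIDIA_API_KEY="your-nvidia-api-key"
```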

RivaSTTService

Configuration

api_key
str
required

Your NVIDIA API key

server
str
default:"grpc.nvcf.nvidia.com:443"

NVIDIA Riva server address

model_function_map
Mapping[str, str]

A mapping containing the NVIDIA function identifier and model name for the STT service

sample_rate
Optional[int]
default:"None"

Audio sample rate in Hz

params
InputParams
default:"InputParams()"

Additional configuration parameters

InputParams

language
Language
default:"Language.EN_US"

The language for speech recognition
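As an illustrative sketch, model_function_map pairs an NVCF function identifier with a model name; the keys shown here are assumptions based on the parameter description, and the function ID is a placeholder:

```python
# Hypothetical model_function_map value. The function ID below is a
# placeholder, not a real NVCF function identifier.
model_function_map = {
    "function_id": "your-nvcf-function-id",
    "model_name": "parakeet-ctc-1.1b-asr",
}
```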

Input

The service processes audio frames containing:

  • Raw PCM audio data
  • 16-bit depth
  • Single channel (mono)
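As a quick sanity check on buffer sizes, the byte rate of this format is sample_rate × 2 (16-bit mono). A small helper, not part of Pipecat itself:

```python
def pcm_byte_rate(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> int:
    """Bytes per second of raw PCM audio at the given format."""
    return sample_rate_hz * (bit_depth // 8) * channels

# 16 kHz, 16-bit mono audio produces 32,000 bytes per second
print(pcm_byte_rate(16000))
```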

Output Frames

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Language used for transcription

InterimTranscriptionFrame

Generated during ongoing speech, containing same fields as TranscriptionFrame but with preliminary results.
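Downstream code typically renders interim results provisionally and commits only final ones. A minimal sketch of that pattern, using stand-in dataclasses rather than Pipecat's actual frame classes:

```python
from dataclasses import dataclass

@dataclass
class TranscriptionFrame:  # stand-in for Pipecat's TranscriptionFrame
    text: str
    user_id: str
    timestamp: str

@dataclass
class InterimTranscriptionFrame(TranscriptionFrame):  # stand-in; preliminary results
    pass

def render(frame: TranscriptionFrame) -> str:
    # Check the interim subclass first, since it also matches the base class.
    if isinstance(frame, InterimTranscriptionFrame):
        return f"[interim] {frame.text}"
    return f"[final] {frame.text}"
```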

RivaSegmentedSTTService

Configuration

api_key
str
required

Your NVIDIA API key

server
str
default:"grpc.nvcf.nvidia.com:443"

NVIDIA Riva server address

model_function_map
Mapping[str, str]

A mapping containing the NVIDIA function identifier and model name for the STT service

sample_rate
Optional[int]
default:"None"

Audio sample rate in Hz

params
InputParams
default:"InputParams()"

Additional configuration parameters

InputParams

language
Language
default:"Language.EN_US"

The language for speech recognition

Input

The service processes audio frames containing:

  • Raw audio bytes in WAV format
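Since this service expects WAV-formatted input, raw PCM can be wrapped in a WAV container with the standard library; a sketch assuming 16-bit mono audio:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```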

Output Frames

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Language used for transcription

InterimTranscriptionFrame

Generated during ongoing speech, containing same fields as TranscriptionFrame but with preliminary results.

Methods

See the STT base class methods for additional functionality.

Models

Model                    Pipecat Class              Model Card Link
parakeet-ctc-1.1b-asr    RivaSTTService             NVIDIA Model Card
canary-1b-asr            RivaSegmentedSTTService    NVIDIA Model Card

Usage Examples

RivaSTTService

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.riva.stt import RivaSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = RivaSTTService(
    api_key="your-nvidia-api-key",
    params=RivaSTTService.InputParams(
        language=Language.EN_US
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

RivaSegmentedSTTService

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.riva.stt import RivaSegmentedSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = RivaSegmentedSTTService(
    api_key="your-nvidia-api-key",
    params=RivaSegmentedSTTService.InputParams(
        language=Language.EN_US
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Language Support

The default Riva model, parakeet-ctc-1.1b-asr, primarily supports English with various regional accents:

Language Code     Description     Service Codes
Language.EN_US    English (US)    en-US

Frame Flow

Audio frames flow into the service, which emits InterimTranscriptionFrame objects during ongoing speech and a TranscriptionFrame once a final result is available.

Advanced Configuration

The service exposes several advanced options as instance attributes that can be adjusted after construction:

_profanity_filter
bool
default:"False"

Filter profanity from transcription

_automatic_punctuation
bool
default:"False"

Automatically add punctuation

_no_verbatim_transcripts
bool
default:"False"

Whether to disable verbatim transcripts

_boosted_lm_words
list
default:"None"

List of words to boost in the language model

_boosted_lm_score
float
default:"4.0"

Score applied to boosted words

Example with Advanced Configuration

# Configure service with advanced parameters
stt = RivaSTTService(
    api_key="your-nvidia-api-key",
    params=RivaSTTService.InputParams(
        language=Language.EN_US
    )
)

# Configure advanced options
stt._profanity_filter = True
stt._automatic_punctuation = True
stt._boosted_lm_words = ["PipeCat", "AI", "speech"]

Notes

  • Uses NVIDIA’s Riva AI Services platform
  • Handles streaming audio input
  • Provides real-time transcription results
  • Manages connection lifecycle
  • Uses asyncio for asynchronous processing
  • Automatically cleans up resources on stop/cancel