# NVIDIA Nemotron Speech

> Speech-to-text service implementation using NVIDIA Nemotron Speech

## Overview

NVIDIA Nemotron Speech provides three STT service implementations:

* **`NvidiaSTTService`** -- Real-time streaming transcription using Nemotron ASR Streaming models with interim results and continuous audio processing.
* **`NvidiaSegmentedSTTService`** -- Segmented transcription using Canary models with advanced language support, word boosting, and enterprise-grade accuracy.
* **`NvidiaSageMakerSTTService`** -- Streaming transcription using NVIDIA Nemotron ASR via an AWS SageMaker bidirectional-stream endpoint with automatic reconnection on error.

## Installation

To use NVIDIA Nemotron Speech services, install the required dependency:

```bash
uv add "pipecat-ai[nvidia]"
```

## Prerequisites

### NVIDIA Nemotron Speech Setup

Before using NVIDIA Nemotron Speech STT services, you need:

1. **NVIDIA Developer Account** (for cloud deployments): Sign up at [NVIDIA Developer Portal](https://developer.nvidia.com)
2. **API Key** (for cloud deployments): Generate an NVIDIA API key for Nemotron Speech services
3. **Model Selection**: Choose between Nemotron ASR Streaming (streaming) and Canary (segmented) models

For local deployments, you can run NVIDIA ASR NIM locally without an API key. See the [NVIDIA ASR NIM documentation](https://docs.nvidia.com/nim/speech/latest/asr/) for deployment instructions.
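The cloud-versus-local requirement above can be checked before a service is constructed. The following is a minimal sketch, not part of Pipecat: `resolve_nvidia_api_key` is a hypothetical helper, and it assumes the key is stored in the `NVIDIA_API_KEY` environment variable, as the usage examples on this page do.

```python
import os
from typing import Mapping, Optional


def resolve_nvidia_api_key(
    local_deployment: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the API key for cloud use, or None for a local NIM deployment.

    Hypothetical helper for illustration only; it is not a Pipecat API.
    """
    env = os.environ if env is None else env
    if local_deployment:
        # Local ASR NIM deployments run without an API key.
        return None
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NVIDIA_API_KEY is not set; it is required for the NVIDIA cloud endpoint."
        )
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised later from inside the gRPC connection.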
### Environment Variables

* `NVIDIA_API_KEY`: Your NVIDIA API key for authentication (required for cloud endpoint, not needed for local deployments)

## NvidiaSTTService

Real-time streaming transcription using NVIDIA Nemotron Speech's streaming ASR models.

Constructor parameters:

* NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
* NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
* Mapping containing `function_id` and `model_name` for the ASR model.
* Audio sample rate in Hz. When `None`, uses the pipeline's configured sample rate.
* Additional configuration parameters. *Deprecated in v0.0.105. Use `settings=NvidiaSTTService.Settings(...)` instead.*
* Runtime-configurable settings. See [Settings](#settings) below.
* Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint. Set to `False` for local deployments.
* Number of audio channels.
* VAD start history in frames. Use `-1` for Nemotron Speech default.
* VAD start threshold. Use `-1.0` for Nemotron Speech default.
* VAD stop history in frames. Use `-1` for Nemotron Speech default.
* VAD stop threshold. Use `-1.0` for Nemotron Speech default.
* End-of-utterance stop history in frames. Use `-1` for Nemotron Speech default.
* End-of-utterance stop threshold. Use `-1.0` for Nemotron Speech default.
* Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
* P99 latency from speech end to final transcript in seconds. Override for your deployment. See [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark).

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.
| Parameter                  | Type              | Default          | Description                                                              |
| -------------------------- | ----------------- | ---------------- | ------------------------------------------------------------------------ |
| `model`                    | `str`             | `None`           | STT model identifier. *(Inherited from base STT settings.)*              |
| `language`                 | `Language \| str` | `Language.EN_US` | Target language for transcription. *(Inherited from base STT settings.)* |
| `profanity_filter`         | `bool`            | `False`          | Whether to filter profanity from results.                                |
| `automatic_punctuation`    | `bool`            | `True`           | Whether to add automatic punctuation.                                    |
| `verbatim_transcripts`     | `bool`            | `True`           | Whether to return verbatim transcripts.                                  |
| `boosted_lm_words`         | `list[str]`       | `None`           | List of words to boost in the language model.                            |
| `boosted_lm_score`         | `float`           | `4.0`            | Score boost for specified words.                                         |
| `max_alternatives`         | `int`             | `1`              | Maximum number of recognition alternatives.                              |
| `interim_results`          | `bool`            | `True`           | Whether to return interim (partial) results.                             |
| `word_time_offsets`        | `bool`            | `False`          | Whether to include word-level time offsets.                              |
| `speaker_diarization`      | `bool`            | `False`          | Whether to enable speaker diarization.                                   |
| `diarization_max_speakers` | `int`             | `0`              | Maximum number of speakers for diarization.                              |

### Usage

```python
import os

from pipecat.services.nvidia.stt import NvidiaSTTService

# Read the API key from the environment (required for the cloud endpoint).
stt = NvidiaSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
)
```

### Notes

* **Model cannot be changed after initialization**: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
* **Streaming**: Provides real-time interim and final results through continuous audio streaming.
* **Automatic reconnection**: The service automatically reconnects if the gRPC stream drops unexpectedly, maintaining continuous transcription.
* **Metrics support**: This service supports metrics generation (`can_generate_metrics()` returns `True`).
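The custom configuration string shown in the constructor parameters above (`"enable_vad_endpointing:true,neural_vad.onset:0.65"`) appears to be a comma-separated list of `key:value` pairs. The sketch below assembles such a string from a dictionary. `build_custom_configuration` is a hypothetical helper, and both the pair format and the lowercase boolean spelling are assumptions inferred from that single example rather than from a documented grammar.

```python
from typing import Mapping, Union

ConfigValue = Union[bool, int, float, str]


def build_custom_configuration(options: Mapping[str, ConfigValue]) -> str:
    """Join options into a "key:value,key:value" string.

    Hypothetical helper; the format is inferred from the example in this page.
    """
    parts = []
    for key, value in options.items():
        # Booleans are written in lowercase, matching "enable_vad_endpointing:true".
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        # Reject separators that would make the resulting string ambiguous.
        if "," in key or ":" in key or "," in text:
            raise ValueError(f"unsupported separator in option {key!r}")
        parts.append(f"{key}:{text}")
    return ",".join(parts)


config = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
print(config)  # enable_vad_endpointing:true,neural_vad.onset:0.65
```

Building the string from a mapping keeps each option on its own line in code review and avoids hand-editing one long literal.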
## NvidiaSegmentedSTTService

Batch/segmented transcription using NVIDIA Nemotron Speech's Canary models. Processes complete audio segments after VAD detects speech boundaries.

Constructor parameters:

* NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
* NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
* Mapping containing `function_id` and `model_name` for the ASR model.
* Audio sample rate in Hz. When `None`, uses the pipeline's configured sample rate.
* Additional configuration parameters. *Deprecated in v0.0.105. Use `settings=NvidiaSegmentedSTTService.Settings(...)` instead.*
* Runtime-configurable settings. See [Settings](#settings-2) below.
* Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint. Set to `False` for local deployments.
* Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
* P99 latency from speech end to final transcript in seconds. Override for your deployment. See [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark).

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSegmentedSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter                  | Type              | Default          | Description                                                              |
| -------------------------- | ----------------- | ---------------- | ------------------------------------------------------------------------ |
| `model`                    | `str`             | `None`           | STT model identifier. *(Inherited from base STT settings.)*              |
| `language`                 | `Language \| str` | `Language.EN_US` | Target language for transcription. *(Inherited from base STT settings.)* |
| `profanity_filter`         | `bool`            | `False`          | Whether to filter profanity from results.                                |
| `automatic_punctuation`    | `bool`            | `True`           | Whether to add automatic punctuation.                                    |
| `verbatim_transcripts`     | `bool`            | `False`          | Whether to return verbatim transcripts.                                  |
| `boosted_lm_words`         | `list[str]`       | `None`           | List of words to boost in the language model.                            |
| `boosted_lm_score`         | `float`           | `4.0`            | Score boost for specified words.                                         |
| `max_alternatives`         | `int`             | `1`              | Maximum number of recognition alternatives.                              |
| `word_time_offsets`        | `bool`            | `False`          | Whether to include word-level time offsets.                              |
| `speaker_diarization`      | `bool`            | `False`          | Whether to enable speaker diarization.                                   |
| `diarization_max_speakers` | `int`             | `0`              | Maximum number of speakers for diarization.                              |

### Usage

```python
import os

from pipecat.services.nvidia.stt import NvidiaSegmentedSTTService
from pipecat.transcriptions.language import Language

stt = NvidiaSegmentedSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    settings=NvidiaSegmentedSTTService.Settings(
        language=Language.ES,
        automatic_punctuation=True,
        # Boost domain-specific terms that the model might otherwise miss.
        boosted_lm_words=["Pipecat", "NVIDIA"],
        boosted_lm_score=6.0,
    ),
)
```

### Notes

* **Model cannot be changed after initialization**: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
* **Segmented processing**: Processes complete audio segments for higher accuracy compared to streaming.
* **Language support**: Supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US). See the [NVIDIA ASR NIM documentation](https://docs.nvidia.com/nim/speech/latest/reference/support-matrix/asr.html#supported-languages-by-model-type) for the complete list.
* **Word boosting**: Use `boosted_lm_words` and `boosted_lm_score` to improve recognition of domain-specific terms.

The `InputParams` / `params=` pattern is deprecated as of v0.0.105. Use `Settings` / `settings=` instead. See the [Service Settings guide](/pipecat/fundamentals/service-settings) for migration details.
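Segmented processing, as described in the notes above, means audio is collected between the VAD speech-start and speech-stop boundaries and transcribed as one complete segment. The class below is a conceptual sketch of that buffering step only. `SegmentBuffer` and its method names are hypothetical and do not reflect how Pipecat implements the service internally.

```python
from typing import List, Optional


class SegmentBuffer:
    """Collects audio chunks between VAD start and stop events (conceptual sketch)."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []
        self._speaking = False

    def on_speech_started(self) -> None:
        # A new utterance begins: discard leftovers and start collecting.
        self._chunks.clear()
        self._speaking = True

    def add_audio(self, chunk: bytes) -> None:
        # Audio that arrives outside a speech region is ignored.
        if self._speaking:
            self._chunks.append(chunk)

    def on_speech_stopped(self) -> Optional[bytes]:
        # Return the complete segment, or None if nothing was captured.
        if not self._speaking:
            return None
        self._speaking = False
        segment = b"".join(self._chunks)
        self._chunks.clear()
        return segment or None


buffer = SegmentBuffer()
buffer.add_audio(b"\x00\x00")  # ignored: speech has not started
buffer.on_speech_started()
buffer.add_audio(b"\x01\x02")
buffer.add_audio(b"\x03\x04")
segment = buffer.on_speech_stopped()
print(len(segment))  # 4
```

The trade-off is latency: no transcript can be produced until the speech-stop boundary arrives, which is why the segmented service has no interim results setting.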
## NvidiaSageMakerSTTService

Streaming speech recognition using NVIDIA Nemotron ASR via an AWS SageMaker bidirectional-stream endpoint. This service maintains a persistent HTTP/2 bidirectional-stream connection to a deployed SageMaker endpoint that proxies to NVIDIA NIM's realtime WebSocket API.

Constructor parameters:

* Name of the deployed SageMaker endpoint.
* AWS region where the endpoint is deployed.
* Audio sample rate in Hz. When `None`, uses the pipeline's configured sample rate.
* Runtime-configurable settings. See [Settings](#settings-3) below.
* P99 latency from speech end to final transcript in seconds. Override for your deployment.

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSageMakerSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter  | Type              | Default                                                    | Description                                                                 |
| ---------- | ----------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------- |
| `model`    | `str`             | `cache-aware-parakeet-rnnt-en-US-asr-streaming-sortformer` | STT model identifier.                                                       |
| `language` | `Language \| str` | `en-US`                                                    | Language code passed to NIM, given as a language-region tag such as `en-US`. |

### Usage

```python
import os

from pipecat.services.nvidia.sagemaker.stt import NvidiaSageMakerSTTService

# The endpoint name and region are read from the environment.
stt = NvidiaSageMakerSTTService(
    endpoint_name=os.getenv("SAGEMAKER_ASR_ENDPOINT_NAME"),
    region=os.getenv("AWS_REGION", "us-west-2"),
    settings=NvidiaSageMakerSTTService.Settings(
        language="en-US",
    ),
)
```

### Notes

* **AWS SageMaker deployment required**: This service requires a deployed SageMaker endpoint running NVIDIA Nemotron ASR NIM. See the [deployment example](https://github.com/pipecat-ai/pipecat-examples/tree/main/deployment/aws-sagemaker-nvidia) for setup instructions.
* **AWS credentials**: Requires `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables for SageMaker authentication.
* **Environment variables**: `SAGEMAKER_ASR_ENDPOINT_NAME` for the endpoint name.
* **Automatic reconnection**: The service automatically reconnects on error to maintain continuous transcription.
* **VAD-aware**: Produces interim results during speech and commits final transcriptions when VAD detects the user has stopped speaking.
* **Metrics support**: This service supports metrics generation (`can_generate_metrics()` returns `True`).
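The environment variables named in the notes above can be verified at startup, before the service tries to connect. This is a minimal sketch under the assumption that all three values are supplied through the environment. `missing_sagemaker_env` is a hypothetical helper, not a Pipecat or AWS API.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the notes in this section.
REQUIRED_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "SAGEMAKER_ASR_ENDPOINT_NAME",
)


def missing_sagemaker_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty.

    Hypothetical preflight helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_sagemaker_env()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Reporting every missing name at once saves a round of restarts compared with failing on the first one.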