# NVIDIA Nemotron Speech

> Speech-to-text service implementation using NVIDIA Nemotron Speech

## Overview

NVIDIA Nemotron Speech provides two STT service implementations:

* **`NvidiaSTTService`** -- Real-time streaming transcription using Nemotron ASR Streaming models with interim results and continuous audio processing.
* **`NvidiaSegmentedSTTService`** -- Segmented transcription using Canary models with advanced language support, word boosting, and enterprise-grade accuracy.

<CardGroup cols={2}>
  <Card title="NVIDIA Nemotron Speech STT API Reference" icon="code" href="https://reference-server.pipecat.ai/en/latest/api/pipecat.services.nvidia.stt.html">
    Pipecat's API methods for NVIDIA Nemotron Speech STT integration
  </Card>

  <Card title="Example Implementation" icon="play" href="https://github.com/pipecat-ai/pipecat/blob/main/examples/voice/voice-nvidia.py">
    Complete example with NVIDIA services integration
  </Card>

  <Card title="NVIDIA ASR NIM Documentation" icon="book" href="https://docs.nvidia.com/nim/speech/latest/asr/">
    Official NVIDIA ASR NIM documentation
  </Card>

  <Card title="NVIDIA Developer Portal" icon="microphone" href="https://developer.nvidia.com">
    Access API keys and Nemotron Speech services
  </Card>
</CardGroup>

## Installation

To use NVIDIA Nemotron Speech services, install the required dependency:

```bash theme={null}
uv add "pipecat-ai[nvidia]"
```

## Prerequisites

### NVIDIA Nemotron Speech Setup

Before using NVIDIA Nemotron Speech STT services, you need:

1. **NVIDIA Developer Account** (for cloud deployments): Sign up at [NVIDIA Developer Portal](https://developer.nvidia.com)
2. **API Key** (for cloud deployments): Generate an NVIDIA API key for Nemotron Speech services
3. **Model Selection**: Choose between Nemotron ASR Streaming (streaming) and Canary (segmented) models

For local deployments, you can run NVIDIA ASR NIM locally without an API key. See the [NVIDIA ASR NIM documentation](https://docs.nvidia.com/nim/speech/latest/asr/) for deployment instructions.

### Environment Variables

* `NVIDIA_API_KEY`: Your NVIDIA API key for authentication. Required for the cloud endpoint; not needed for local deployments.
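
The cloud and local cases differ only in the API key, the server address, and SSL. The following is a minimal sketch of choosing between them, not part of Pipecat: `nvidia_stt_connection_kwargs` is a hypothetical helper, and the keyword names it returns (`api_key`, `server`, `use_ssl`) are the constructor parameters documented on this page.

```python
import os
from typing import Optional


def nvidia_stt_connection_kwargs(local_server: Optional[str] = None) -> dict:
    """Hypothetical helper: build connection kwargs for cloud or local use."""
    if local_server:
        # Local NIM deployment: no API key, SSL disabled.
        return {"api_key": None, "server": local_server, "use_ssl": False}

    api_key = os.getenv("NVIDIA_API_KEY")
    if not api_key:
        raise RuntimeError("NVIDIA_API_KEY must be set for the cloud endpoint")

    # Cloud endpoint: documented default server address, SSL enabled.
    return {
        "api_key": api_key,
        "server": "grpc.nvcf.nvidia.com:443",
        "use_ssl": True,
    }
```

The result could then be unpacked into either service, for example `NvidiaSTTService(**nvidia_stt_connection_kwargs("localhost:50051"))`.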

## NvidiaSTTService

Real-time streaming transcription using NVIDIA Nemotron Speech's streaming ASR models.

<ParamField path="api_key" type="str | None" default="None">
  NVIDIA API key for authentication. Required when using the cloud endpoint. Not
  needed for local deployments.
</ParamField>

<ParamField path="server" type="str" default="grpc.nvcf.nvidia.com:443">
  NVIDIA Nemotron Speech server address. For local deployments, pass the local
  address (e.g. `localhost:50051`).
</ParamField>

<ParamField path="model_function_map" type="Mapping[str, str]" default="{&#x22;function_id&#x22;: &#x22;bb0837de-8c7b-481f-9ec8-ef5663e9c1fa&#x22;, &#x22;model_name&#x22;: &#x22;nemotron-asr-streaming&#x22;}">
  Mapping containing `function_id` and `model_name` for the ASR model.
</ParamField>

<ParamField path="sample_rate" type="int" default="None">
  Audio sample rate in Hz. When `None`, uses the pipeline's configured sample
  rate.
</ParamField>

<ParamField path="params" type="NvidiaSTTService.InputParams" default="None" deprecated>
  Additional configuration parameters. *Deprecated in v0.0.105. Use
  `settings=NvidiaSTTService.Settings(...)` instead.*
</ParamField>

<ParamField path="settings" type="NvidiaSTTService.Settings" default="None">
  Runtime-configurable settings. See [Settings](#settings) below.
</ParamField>

<ParamField path="use_ssl" type="bool" default="True">
  Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA
  cloud endpoint. Set to `False` for local deployments.
</ParamField>

<ParamField path="audio_channel_count" type="int" default="1">
  Number of audio channels.
</ParamField>

<ParamField path="start_history" type="int" default="-1">
  VAD start history in frames. Use `-1` for Nemotron Speech default.
</ParamField>

<ParamField path="start_threshold" type="float" default="-1.0">
  VAD start threshold. Use `-1.0` for Nemotron Speech default.
</ParamField>

<ParamField path="stop_history" type="int" default="320">
  VAD stop history in frames. Defaults to `320`; pass `-1` to use the Nemotron
  Speech default instead.
</ParamField>

<ParamField path="stop_threshold" type="float" default="-1.0">
  VAD stop threshold. Use `-1.0` for Nemotron Speech default.
</ParamField>

<ParamField path="stop_history_eou" type="int" default="-1">
  End-of-utterance stop history in frames. Use `-1` for Nemotron Speech default.
</ParamField>

<ParamField path="stop_threshold_eou" type="float" default="-1.0">
  End-of-utterance stop threshold. Use `-1.0` for Nemotron Speech default.
</ParamField>

<ParamField path="custom_configuration" type="str" default="&#x22;&#x22;">
  Custom Nemotron Speech configuration string (e.g.
  `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
</ParamField>

<ParamField path="ttfs_p99_latency" type="float" default="1.0">
  P99 latency from speech end to final transcript in seconds. Override for your
  deployment. See [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark).
</ParamField>
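
The `custom_configuration` example above suggests a comma-separated list of `key:value` pairs. Assuming that format holds in general (it is inferred from the single example on this page, not from a specification), a small hypothetical helper such as `build_custom_configuration` can assemble the string from a dictionary:

```python
def build_custom_configuration(options: dict) -> str:
    """Hypothetical helper: join options into a 'key:value,key:value' string.

    The format is an assumption inferred from the documented example.
    """
    parts = []
    for key, value in options.items():
        # Render booleans in lowercase, matching the documented example.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


custom_configuration = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
```

The resulting string can be passed as `custom_configuration=custom_configuration` when constructing the service.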

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter                  | Type              | Default          | Description                                                              |
| -------------------------- | ----------------- | ---------------- | ------------------------------------------------------------------------ |
| `model`                    | `str`             | `None`           | STT model identifier. *(Inherited from base STT settings.)*              |
| `language`                 | `Language \| str` | `Language.EN_US` | Target language for transcription. *(Inherited from base STT settings.)* |
| `profanity_filter`         | `bool`            | `False`          | Whether to filter profanity from results.                                |
| `automatic_punctuation`    | `bool`            | `True`           | Whether to add automatic punctuation.                                    |
| `verbatim_transcripts`     | `bool`            | `True`           | Whether to return verbatim transcripts.                                  |
| `boosted_lm_words`         | `list[str]`       | `None`           | List of words to boost in the language model.                            |
| `boosted_lm_score`         | `float`           | `4.0`            | Score boost for specified words.                                         |
| `max_alternatives`         | `int`             | `1`              | Maximum number of recognition alternatives.                              |
| `interim_results`          | `bool`            | `True`           | Whether to return interim (partial) results.                             |
| `word_time_offsets`        | `bool`            | `False`          | Whether to include word-level time offsets.                              |
| `speaker_diarization`      | `bool`            | `False`          | Whether to enable speaker diarization.                                   |
| `diarization_max_speakers` | `int`             | `0`              | Maximum number of speakers for diarization.                              |

### Usage

```python theme={null}
import os

from pipecat.services.nvidia.stt import NvidiaSTTService

# Cloud endpoint: reads the API key from the environment.
stt = NvidiaSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
)
```

### Notes

* **Model cannot be changed after initialization**: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
* **Streaming**: Provides real-time interim and final results through continuous audio streaming.
* **Metrics support**: This service supports metrics generation (`can_generate_metrics()` returns `True`).
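
Because the model is fixed at construction time, it can help to check the mapping before passing it to the constructor. The sketch below uses a hypothetical helper, `checked_model_function_map`, and assumes only what this page states: the mapping needs a `function_id` and a `model_name`. The values shown are the documented defaults for the streaming service.

```python
from typing import Mapping

REQUIRED_KEYS = ("function_id", "model_name")


def checked_model_function_map(mapping: Mapping[str, str]) -> dict:
    """Hypothetical helper: ensure both required keys are present and non-empty."""
    missing = [key for key in REQUIRED_KEYS if not mapping.get(key)]
    if missing:
        raise ValueError(f"model_function_map is missing: {', '.join(missing)}")
    return dict(mapping)


# Documented defaults for NvidiaSTTService.
streaming_model = checked_model_function_map(
    {
        "function_id": "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa",
        "model_name": "nemotron-asr-streaming",
    }
)
```

The checked mapping would then be passed as `model_function_map=streaming_model`.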

## NvidiaSegmentedSTTService

Batch/segmented transcription using NVIDIA Nemotron Speech's Canary models. Processes complete audio segments after VAD detects speech boundaries.

<ParamField path="api_key" type="str | None" default="None">
  NVIDIA API key for authentication. Required when using the cloud endpoint. Not
  needed for local deployments.
</ParamField>

<ParamField path="server" type="str" default="grpc.nvcf.nvidia.com:443">
  NVIDIA Nemotron Speech server address. For local deployments, pass the local
  address (e.g. `localhost:50051`).
</ParamField>

<ParamField path="model_function_map" type="Mapping[str, str]" default="{&#x22;function_id&#x22;: &#x22;ee8dc628-76de-4acc-8595-1836e7e857bd&#x22;, &#x22;model_name&#x22;: &#x22;canary-1b-asr&#x22;}">
  Mapping containing `function_id` and `model_name` for the ASR model.
</ParamField>

<ParamField path="sample_rate" type="int" default="None">
  Audio sample rate in Hz. When `None`, uses the pipeline's configured sample
  rate.
</ParamField>

<ParamField path="params" type="NvidiaSegmentedSTTService.InputParams" default="None" deprecated>
  Additional configuration parameters. *Deprecated in v0.0.105. Use
  `settings=NvidiaSegmentedSTTService.Settings(...)` instead.*
</ParamField>

<ParamField path="settings" type="NvidiaSegmentedSTTService.Settings" default="None">
  Runtime-configurable settings. See [Settings](#settings-2) below.
</ParamField>

<ParamField path="use_ssl" type="bool" default="True">
  Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA
  cloud endpoint. Set to `False` for local deployments.
</ParamField>

<ParamField path="custom_configuration" type="str" default="&#x22;&#x22;">
  Custom Nemotron Speech configuration string (e.g.
  `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
</ParamField>

<ParamField path="ttfs_p99_latency" type="float" default="1.0">
  P99 latency from speech end to final transcript in seconds. Override for your
  deployment. See [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark).
</ParamField>
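
To override `ttfs_p99_latency` for your deployment, you need a P99 figure from your own measurements. The following is a minimal sketch of the nearest-rank percentile method over a list of measured latencies in seconds; `p99_latency` is a hypothetical helper, and this is not necessarily how stt-benchmark computes its numbers.

```python
def p99_latency(latencies_s: list) -> float:
    """Hypothetical helper: nearest-rank 99th percentile of latencies (seconds)."""
    if not latencies_s:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_s)
    # Nearest-rank: rank = ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Illustrative samples only, not real measurements.
samples = [0.2] * 98 + [0.9, 1.5]
measured_p99 = p99_latency(samples)
```

The measured value can then be passed as `ttfs_p99_latency=measured_p99`.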

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSegmentedSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter               | Type              | Default          | Description                                                              |
| ----------------------- | ----------------- | ---------------- | ------------------------------------------------------------------------ |
| `model`                 | `str`             | `None`           | STT model identifier. *(Inherited from base STT settings.)*              |
| `language`              | `Language \| str` | `Language.EN_US` | Target language for transcription. *(Inherited from base STT settings.)* |
| `profanity_filter`      | `bool`            | `False`          | Whether to filter profanity from results.                                |
| `automatic_punctuation` | `bool`            | `True`           | Whether to add automatic punctuation.                                    |
| `verbatim_transcripts`  | `bool`            | `False`          | Whether to return verbatim transcripts.                                  |
| `boosted_lm_words`      | `list[str]`       | `None`           | List of words to boost in the language model.                            |
| `boosted_lm_score`      | `float`           | `4.0`            | Score boost for specified words.                                         |
| `max_alternatives`      | `int`             | `1`              | Maximum number of recognition alternatives.                              |
| `word_time_offsets`     | `bool`            | `False`          | Whether to include word-level time offsets.                              |

### Usage

```python theme={null}
import os

from pipecat.services.nvidia.stt import NvidiaSegmentedSTTService
from pipecat.transcriptions.language import Language

stt = NvidiaSegmentedSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    settings=NvidiaSegmentedSTTService.Settings(
        language=Language.ES,
        automatic_punctuation=True,
        # Boost recognition of domain-specific terms.
        boosted_lm_words=["Pipecat", "NVIDIA"],
        boosted_lm_score=6.0,
    ),
)
```

### Notes

* **Model cannot be changed after initialization**: Use the `model_function_map` parameter in the constructor to specify the model and function ID.
* **Segmented processing**: Transcribes each complete audio segment once VAD has detected its boundaries, rather than a continuous stream, which can yield higher accuracy than streaming.
* **Language support**: Supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US). See the [NVIDIA ASR NIM documentation](https://docs.nvidia.com/nim/speech/latest/reference/support-matrix/asr.html#supported-languages-by-model-type) for the complete list.
* **Word boosting**: Use `boosted_lm_words` and `boosted_lm_score` to improve recognition of domain-specific terms.

<Tip>
  The `InputParams` / `params=` pattern is deprecated as of v0.0.105. Use
  `Settings` / `settings=` instead. See the [Service Settings
  guide](/pipecat/fundamentals/service-settings) for migration details.
</Tip>
