# Sarvam

> Speech-to-text service implementation using Sarvam AI's WebSocket-based streaming API

## Overview

`SarvamSTTService` provides real-time speech recognition over Sarvam AI's WebSocket API. It transcribes Indian languages and supports Voice Activity Detection (VAD) and multiple input audio formats.

<CardGroup cols={2}>
  <Card title="Sarvam STT API Reference" icon="code" href="https://reference-server.pipecat.ai/en/latest/api/pipecat.services.sarvam.stt.html">
    Pipecat's API methods for Sarvam STT integration
  </Card>

  <Card title="Example Implementation" icon="play" href="https://github.com/pipecat-ai/pipecat/blob/main/examples/voice/voice-sarvam.py">
    Complete example with interruption handling
  </Card>

  <Card title="Sarvam Documentation" icon="book" href="https://docs.sarvam.ai/api-reference-docs/api-guides-tutorials/speech-to-text/overview">
    Official Sarvam AI STT documentation and features
  </Card>

  <Card title="Sarvam AI Platform" icon="microphone" href="https://dashboard.sarvam.ai/">
    Access API keys and speech models
  </Card>
</CardGroup>

## Installation

To use Sarvam services, install the required dependency:

```bash theme={null}
uv add "pipecat-ai[sarvam]"
```

## Prerequisites

### Sarvam AI Account Setup

Before using Sarvam STT services, you need:

1. **Sarvam AI Account**: Sign up at [Sarvam AI](https://dashboard.sarvam.ai/)
2. **API Key**: Generate an API key from your account dashboard
3. **Model Access**: Access to Saarika (STT) or Saaras (STT-Translate) models, including the `saaras:v3` model with support for multiple modes (transcribe, translate, verbatim, translit, codemix)

### Required Environment Variables

* `SARVAM_API_KEY`: Your Sarvam AI API key for authentication
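
A minimal sketch of reading the key at startup and failing early when it is missing. The helper name `load_sarvam_api_key` is hypothetical and not part of Pipecat; it only wraps the standard library.

```python
import os


def load_sarvam_api_key(env_var: str = "SARVAM_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the bot.")
    return key
```

The returned value can then be passed as `api_key` when constructing `SarvamSTTService`.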

## Configuration

### SarvamSTTService

<ParamField path="api_key" type="str" required>
  Sarvam API key for authentication.
</ParamField>

<ParamField path="model" type="str" default="saaras:v3" deprecated>
  Sarvam model to use. Allowed values: `"saarika:v2.5"` (standard STT),
  `"saaras:v2.5"` (STT-Translate, auto-detects language), `"saaras:v3"`
  (advanced, supports mode and fine-grained VAD). *Deprecated in v0.0.105. Use
  `settings=SarvamSTTService.Settings(...)` instead.*
</ParamField>

<ParamField path="sample_rate" type="int" default="None">
  Audio sample rate in Hz. Defaults to 16000 if not specified.
</ParamField>

<ParamField path="mode" type="Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix']" default="None">
  Mode of operation. Only applicable to models that support it (e.g.,
  `saaras:v3`). Defaults to the model's default mode.
</ParamField>

<ParamField path="input_audio_codec" type="str" default="wav">
  Audio codec/format of the input audio stream.
</ParamField>

<ParamField path="params" type="SarvamSTTService.InputParams" default="None" deprecated>
  Configuration parameters for Sarvam STT service. *Deprecated in v0.0.105. Use
  `settings=SarvamSTTService.Settings(...)` instead.*
</ParamField>

<ParamField path="settings" type="SarvamSTTService.Settings" default="None">
  Runtime-configurable settings for the STT service. See [Settings](#settings)
  below.
</ParamField>

<ParamField path="keepalive_timeout" type="float" default="None">
  Seconds of no audio before sending silence to keep the connection alive.
  `None` disables keepalive.
</ParamField>

<ParamField path="ttfs_p99_latency" type="float" default="SARVAM_TTFS_P99">
  P99 latency from speech end to final transcript in seconds. Override for your
  deployment. See [stt-benchmark](https://github.com/pipecat-ai/stt-benchmark).
</ParamField>

<ParamField path="keepalive_interval" type="float" default="5.0">
  Seconds between idle checks when keepalive is enabled.
</ParamField>
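
The relationship between `keepalive_timeout` and `keepalive_interval` can be pictured as a periodic idle check: at each interval, the time since the last audio is compared with the timeout, and silence is sent once the timeout is exceeded. The sketch below illustrates that decision only. The function names, the 16-bit mono PCM format, and the 100 ms silence duration are assumptions for illustration, not Pipecat's actual implementation.

```python
from typing import Optional


def needs_keepalive(
    last_audio_at: float, now: float, keepalive_timeout: Optional[float]
) -> bool:
    """Decide at an idle check whether silence should be sent.

    `None` disables keepalive, matching the documented `keepalive_timeout` behavior.
    """
    if keepalive_timeout is None:
        return False
    return (now - last_audio_at) >= keepalive_timeout


def silence_pcm16(sample_rate: int = 16000, duration_ms: int = 100) -> bytes:
    """Build mono 16-bit PCM silence (assumed format): two zero bytes per sample."""
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * num_samples
```

With a 10 second timeout and the default 5 second interval, an idle connection would be checked roughly every 5 seconds and receive silence at the first check after 10 seconds without audio.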

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `SarvamSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter                       | Type              | Default | Description                                                                                                                                                                                                                        |
| ------------------------------- | ----------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`                         | `str`             | `None`  | STT model identifier. *(Inherited from base STT settings.)*                                                                                                                                                                        |
| `language`                      | `Language \| str` | `None`  | Target language for transcription. *(Inherited from base STT settings.)* Behavior varies by model: `saarika:v2.5` defaults to "unknown" (auto-detect), `saaras:v2.5` ignores this (auto-detects), `saaras:v3` defaults to "en-IN". |
| `prompt`                        | `str`             | `None`  | Optional prompt to guide transcription/translation style. Only applicable to `saaras:v2.5`.                                                                                                                                        |
| `vad_signals`                   | `bool`            | `None`  | Enable VAD signals in responses. When enabled, the service broadcasts `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` from the server.                                                                                   |
| `high_vad_sensitivity`          | `bool`            | `None`  | Enable high VAD sensitivity for more responsive speech detection.                                                                                                                                                                  |
| `positive_speech_threshold`     | `float`           | `None`  | VAD probability threshold (0.0-1.0) above which a frame is considered speech. Only for `saaras:v3`.                                                                                                                                |
| `negative_speech_threshold`     | `float`           | `None`  | VAD probability threshold (0.0-1.0) below which a frame is considered silence. Only for `saaras:v3`.                                                                                                                               |
| `min_speech_frames`             | `int`             | `None`  | Minimum consecutive speech frames to start a speech segment. Only for `saaras:v3`.                                                                                                                                                 |
| `first_turn_min_speech_frames`  | `int`             | `None`  | Minimum speech frames for the first user turn. Only for `saaras:v3`.                                                                                                                                                               |
| `negative_frames_count`         | `int`             | `None`  | Number of silence frames within the window to end a speech segment. Only for `saaras:v3`.                                                                                                                                          |
| `negative_frames_window`        | `int`             | `None`  | Sliding window size (in frames) for counting negative frames. Only for `saaras:v3`.                                                                                                                                                |
| `start_speech_volume_threshold` | `float`           | `None`  | Volume level (dB) below which audio is too quiet to be speech. Only for `saaras:v3`.                                                                                                                                               |
| `interrupt_min_speech_frames`   | `int`             | `None`  | Minimum speech frames to register a barge-in/interruption. Only for `saaras:v3`.                                                                                                                                                   |
| `pre_speech_pad_frames`         | `int`             | `None`  | Number of audio frames to prepend before detected speech onset. Only for `saaras:v3`.                                                                                                                                              |
| `num_initial_ignored_frames`    | `int`             | `None`  | Number of leading audio frames to skip at connection start. Only for `saaras:v3`.                                                                                                                                                  |

## Usage

### Basic Setup

```python theme={null}
import os

from pipecat.services.sarvam.stt import SarvamSTTService

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
)
```

### With Language and Model Configuration

```python theme={null}
import os

from pipecat.services.sarvam.stt import SarvamSTTService
from pipecat.transcriptions.language import Language

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    mode="transcribe",
    settings=SarvamSTTService.Settings(
        model="saaras:v3",
        language=Language.HI_IN,
        # `prompt` is omitted here: per the Settings table it applies only to saaras:v2.5.
    ),
)
```

### With Server-Side VAD

```python theme={null}
import os

from pipecat.services.sarvam.stt import SarvamSTTService

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    settings=SarvamSTTService.Settings(
        vad_signals=True,
        high_vad_sensitivity=True,
    ),
)
```

## Notes

* **Default model changed**: The default model is now `saaras:v3` (previously `saarika:v2.5`). Applications that relied on the previous default should set `settings=SarvamSTTService.Settings(model="saarika:v2.5")` explicitly.
* **Supported languages**: Bengali (bn-IN), Gujarati (gu-IN), Hindi (hi-IN), Kannada (kn-IN), Malayalam (ml-IN), Marathi (mr-IN), Tamil (ta-IN), Telugu (te-IN), Punjabi (pa-IN), Odia (od-IN), English (en-IN), and Assamese (as-IN).
* **Model-specific parameter validation**: The service validates that parameters are compatible with the selected model. For example, `prompt` is only supported with `saaras:v2.5`, `language` is not supported with `saaras:v2.5` (which auto-detects language), and the fine-grained VAD parameters are only supported with `saaras:v3`.
* **Fine-grained VAD tuning (saaras:v3 only)**: The `saaras:v3` model exposes 10 server-side VAD tuning parameters covering speech detection thresholds, frame-count controls, pre-speech padding, interruption sensitivity, and initial-frame skipping.
* **VAD modes**: When `vad_signals=False` (default), the service relies on Pipecat's local VAD and flushes the server buffer on `VADUserStoppedSpeakingFrame`. When `vad_signals=True`, the service uses Sarvam's server-side VAD and broadcasts speaking frames from the server.
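
The model-specific rules in the notes above can be sketched as a small pre-flight check to run before constructing the service. This is an illustration of the documented constraints, not the service's own validation code. The function `incompatible_settings` and the constant names are hypothetical.

```python
from typing import Any, Dict, List

# Rules transcribed from the notes above; the data structure itself is hypothetical.
PROMPT_MODELS = {"saaras:v2.5"}
NO_LANGUAGE_MODELS = {"saaras:v2.5"}
FINE_VAD_MODELS = {"saaras:v3"}
FINE_VAD_PARAMS = {
    "positive_speech_threshold",
    "negative_speech_threshold",
    "min_speech_frames",
    "first_turn_min_speech_frames",
    "negative_frames_count",
    "negative_frames_window",
    "start_speech_volume_threshold",
    "interrupt_min_speech_frames",
    "pre_speech_pad_frames",
    "num_initial_ignored_frames",
}


def incompatible_settings(model: str, settings: Dict[str, Any]) -> List[str]:
    """Return the sorted names of non-None settings the model does not support."""
    problems: List[str] = []
    for name, value in settings.items():
        if value is None:
            continue  # unset values fall back to defaults and cannot conflict
        if name == "prompt" and model not in PROMPT_MODELS:
            problems.append(name)
        elif name == "language" and model in NO_LANGUAGE_MODELS:
            problems.append(name)
        elif name in FINE_VAD_PARAMS and model not in FINE_VAD_MODELS:
            problems.append(name)
    return sorted(problems)
```

An empty result means the combination is consistent with the notes; any returned names should be removed or the model changed.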

<Tip>
  The `InputParams` / `params=` pattern is deprecated as of v0.0.105. Use
  `Settings` / `settings=` instead. See the [Service Settings
  guide](/pipecat/fundamentals/service-settings) for migration details.
</Tip>

## Event Handlers

In addition to the standard [service connection events](/api-reference/server/events/service-events) (`on_connected`, `on_disconnected`, `on_connection_error`), Sarvam STT provides:

| Event               | Description                         |
| ------------------- | ----------------------------------- |
| `on_speech_started` | Speech detected in the audio stream |
| `on_speech_stopped` | Speech stopped                      |
| `on_utterance_end`  | End of utterance detected           |

```python theme={null}
@stt.event_handler("on_speech_started")
async def on_speech_started(service):
    print("User started speaking")

@stt.event_handler("on_utterance_end")
async def on_utterance_end(service):
    print("Utterance ended")
```
