Overview

SarvamSTTService provides real-time speech recognition using Sarvam AI’s WebSocket API. It supports high-accuracy transcription of Indian languages, Voice Activity Detection (VAD), and multiple audio formats.

Installation

To use Sarvam services, install the required dependency:
pip install "pipecat-ai[sarvam]"

Prerequisites

Sarvam AI Account Setup

Before using Sarvam STT services, you need:
  1. Sarvam AI Account: Sign up at Sarvam AI
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Access to Saarika (STT) or Saaras (STT-Translate) models, including the saaras:v3 model with support for multiple modes (transcribe, translate, verbatim, translit, codemix)

Required Environment Variables

  • SARVAM_API_KEY: Your Sarvam AI API key for authentication

Configuration

SarvamSTTService

api_key
str
required
Sarvam API key for authentication.
model
str
default:"saarika:v2.5"
deprecated
Sarvam model to use. Allowed values: "saarika:v2.5" (standard STT), "saaras:v2.5" (STT-Translate, auto-detects language), "saaras:v3" (advanced, supports mode and prompts). Deprecated in v0.0.105. Use settings=SarvamSTTService.Settings(...) instead.
sample_rate
int
default:"None"
Audio sample rate in Hz. Defaults to 16000 if not specified.
mode
Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix']
default:"None"
Mode of operation. Only applicable to models that support it (e.g., saaras:v3). Defaults to the model’s default mode.
input_audio_codec
str
default:"wav"
Audio codec/format of the input file.
params
SarvamSTTService.InputParams
default:"None"
deprecated
Configuration parameters for Sarvam STT service. Deprecated in v0.0.105. Use settings=SarvamSTTService.Settings(...) instead.
settings
SarvamSTTService.Settings
default:"None"
Runtime-configurable settings for the STT service. See Settings below.
keepalive_timeout
float
default:"None"
Seconds of no audio before sending silence to keep the connection alive. None disables keepalive.
ttfs_p99_latency
float
default:"SARVAM_TTFS_P99"
P99 latency from speech end to final transcript in seconds. Override for your deployment. See stt-benchmark.
keepalive_interval
float
default:"5.0"
Seconds between idle checks when keepalive is enabled.

Settings

Runtime-configurable settings passed via the settings constructor argument using SarvamSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
model
str
default:"None"
STT model identifier. (Inherited from base STT settings.)
language
Language | str
default:"None"
Target language for transcription. (Inherited from base STT settings.) Behavior varies by model: saarika:v2.5 defaults to “unknown” (auto-detect), saaras:v2.5 ignores this (auto-detects), saaras:v3 defaults to “en-IN”.
prompt
str
default:"None"
Optional prompt to guide transcription/translation style. Only applicable to saaras models (v2.5 and v3).
vad_signals
bool
default:"None"
Enable VAD signals in responses. When enabled, the service broadcasts UserStartedSpeakingFrame and UserStoppedSpeakingFrame from the server.
high_vad_sensitivity
bool
default:"None"
Enable high VAD sensitivity for more responsive speech detection.
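
Conceptually, a mid-conversation settings update carries a partial mapping that is overlaid on the current values; keys absent from the update keep their current values. A rough sketch of that merge (illustrative only, not Pipecat’s implementation):

```python
def merge_settings(current: dict, update: dict) -> dict:
    """Overlay non-None update values onto the current settings."""
    merged = dict(current)
    merged.update({k: v for k, v in update.items() if v is not None})
    return merged


current = {"model": "saaras:v3", "language": "en-IN", "prompt": None}
# Only the language changes; model and prompt are untouched.
updated = merge_settings(current, {"language": "hi-IN"})
```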

Usage

Basic Setup

import os

from pipecat.services.sarvam.stt import SarvamSTTService

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
)

With Language and Model Configuration

import os

from pipecat.services.sarvam.stt import SarvamSTTService
from pipecat.transcriptions.language import Language

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    mode="transcribe",
    settings=SarvamSTTService.Settings(
        model="saaras:v3",
        language=Language.HI_IN,
        prompt="Transcribe Hindi conversation about technology.",
    ),
)

With Server-Side VAD

import os

from pipecat.services.sarvam.stt import SarvamSTTService

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    settings=SarvamSTTService.Settings(
        vad_signals=True,
        high_vad_sensitivity=True,
    ),
)

Notes

  • Supported languages: Bengali (bn-IN), Gujarati (gu-IN), Hindi (hi-IN), Kannada (kn-IN), Malayalam (ml-IN), Marathi (mr-IN), Tamil (ta-IN), Telugu (te-IN), Punjabi (pa-IN), Odia (od-IN), English (en-IN), and Assamese (as-IN).
  • Model-specific parameter validation: The service validates that parameters are compatible with the selected model. For example, prompt is not supported with saarika:v2.5, and language is not supported with saaras:v2.5 (which auto-detects language).
  • VAD modes: When vad_signals=False (default), the service relies on Pipecat’s local VAD and flushes the server buffer on VADUserStoppedSpeakingFrame. When vad_signals=True, the service uses Sarvam’s server-side VAD and broadcasts speaking frames from the server.
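
The two VAD modes route “speech stopped” handling differently: local VAD drives the server-buffer flush when vad_signals is off, while the server’s own signals drive speaking frames when it is on. A simplified sketch of that branch (illustrative, with hypothetical event names; not the service’s code):

```python
def on_event(vad_signals: bool, event: str) -> str:
    """Map an incoming event to the action taken under each VAD mode."""
    if not vad_signals and event == "VADUserStoppedSpeakingFrame":
        # Local VAD mode: Pipecat's VAD tells us speech ended, so flush.
        return "flush server buffer"
    if vad_signals and event in ("server_speech_started", "server_speech_stopped"):
        # Server VAD mode: relay the server's detection downstream.
        return "broadcast speaking frame"
    return "ignore"
```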
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.

Event Handlers

In addition to the standard service connection events (on_connected, on_disconnected, on_connection_error), Sarvam STT provides:
  • on_speech_started: Speech detected in the audio stream
  • on_speech_stopped: Speech no longer detected in the audio stream
  • on_utterance_end: End of utterance detected
@stt.event_handler("on_speech_started")
async def on_speech_started(service):
    print("User started speaking")

@stt.event_handler("on_utterance_end")
async def on_utterance_end(service):
    print("Utterance ended")