Overview

SarvamSTTService provides real-time speech recognition using Sarvam AI’s WebSocket API. It supports Indian-language transcription, Voice Activity Detection (VAD), and multiple audio formats.

Installation

To use Sarvam services, install the required dependency:
pip install "pipecat-ai[sarvam]"

Prerequisites

Sarvam AI Account Setup

Before using Sarvam STT services, you need:
  1. Sarvam AI Account: Sign up at Sarvam AI
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Access to Saarika (STT) or Saaras (STT-Translate) models, including the saaras:v3 model with support for multiple modes (transcribe, translate, verbatim, translit, codemix)

Required Environment Variables

  • SARVAM_API_KEY: Your Sarvam AI API key for authentication

Configuration

SarvamSTTService

Parameter | Type | Default | Description
api_key | str | (required) | Sarvam API key for authentication.
model | str | "saarika:v2.5" | Sarvam model to use. Allowed values: "saarika:v2.5" (standard STT), "saaras:v2.5" (STT-Translate, auto-detects language), "saaras:v3" (advanced, supports mode and prompts).
sample_rate | int | None | Audio sample rate in Hz. Defaults to 16000 if not specified.
input_audio_codec | str | "wav" | Audio codec/format of the input audio.
params | InputParams | None | Configuration parameters. See InputParams below.
keepalive_timeout | float | None | Seconds of no audio before sending silence to keep the connection alive. None disables keepalive.
keepalive_interval | float | 5.0 | Seconds between idle checks when keepalive is enabled.
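For example, here is a minimal construction sketch using the parameters above (the keepalive values are illustrative, not recommendations):

import os

from pipecat.services.sarvam import SarvamSTTService

# Illustrative keepalive settings: after 3 seconds with no audio, send
# silence to keep the WebSocket open, checking for idleness every 5 seconds.
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    model="saarika:v2.5",
    sample_rate=16000,
    keepalive_timeout=3.0,
    keepalive_interval=5.0,
)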

InputParams

Parameter | Type | Default | Description
language | Language | None | Target language for transcription. Behavior varies by model: saarika:v2.5 defaults to "unknown" (auto-detect), saaras:v2.5 ignores this (auto-detects), saaras:v3 defaults to "en-IN".
prompt | str | None | Optional prompt to guide transcription/translation style. Only applicable to saaras models (v2.5 and v3).
mode | str | None | Mode of operation for saaras:v3 only. Options: "transcribe", "translate", "verbatim", "translit", "codemix". Defaults to "transcribe" for saaras:v3.
vad_signals | bool | None | Enable VAD signals in responses. When enabled, the service broadcasts UserStartedSpeakingFrame and UserStoppedSpeakingFrame from the server.
high_vad_sensitivity | bool | None | Enable high VAD sensitivity for more responsive speech detection.
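As a sketch of these model-specific constraints, a saaras:v2.5 setup passes a prompt but no language, since that model auto-detects the spoken language:

import os

from pipecat.services.sarvam import SarvamSTTService

# saaras:v2.5 auto-detects the source language, so `language` is omitted;
# the prompt (supported on saaras models) steers the output style.
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    model="saaras:v2.5",
    params=SarvamSTTService.InputParams(
        prompt="Translate informal speech into concise English.",
    ),
)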

Usage

Basic Setup

import os

from pipecat.services.sarvam import SarvamSTTService

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
)

With Language and Model Configuration

import os

from pipecat.services.sarvam import SarvamSTTService
from pipecat.transcriptions.language import Language

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    model="saaras:v3",
    params=SarvamSTTService.InputParams(
        language=Language.HI_IN,
        mode="transcribe",
        prompt="Transcribe Hindi conversation about technology.",
    ),
)

With Server-Side VAD

import os

from pipecat.services.sarvam import SarvamSTTService

# Use Sarvam's built-in VAD instead of Pipecat's
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    params=SarvamSTTService.InputParams(
        vad_signals=True,
        high_vad_sensitivity=True,
    ),
)
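Like other Pipecat STT services, the configured service sits between transport input and downstream processors. A minimal pipeline sketch (the transport and downstream stages are assumed to be set up elsewhere):

from pipecat.pipeline.pipeline import Pipeline

# `transport` is assumed to be created elsewhere; the STT service emits
# transcription frames for whatever processors follow it.
pipeline = Pipeline([
    transport.input(),   # audio from the user
    stt,                 # SarvamSTTService configured above
    # ... downstream processors (context aggregation, LLM, TTS, etc.) ...
    transport.output(),
])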

Notes

  • Supported languages: Bengali (bn-IN), Gujarati (gu-IN), Hindi (hi-IN), Kannada (kn-IN), Malayalam (ml-IN), Marathi (mr-IN), Tamil (ta-IN), Telugu (te-IN), Punjabi (pa-IN), Odia (od-IN), English (en-IN), and Assamese (as-IN).
  • Model-specific parameter validation: The service validates that parameters are compatible with the selected model. For example, prompt is not supported with saarika:v2.5, and language is not supported with saaras:v2.5 (which auto-detects language).
  • VAD modes: When vad_signals=False (default), the service relies on Pipecat’s local VAD and flushes the server buffer on VADUserStoppedSpeakingFrame. When vad_signals=True, the service uses Sarvam’s server-side VAD and broadcasts speaking frames from the server.
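For the local-VAD path described in the last note, the transport must supply a VAD analyzer so that VADUserStoppedSpeakingFrame is produced. A sketch assuming a Daily transport and a Silero analyzer (room_url and token come from your own configuration):

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url,
    token,
    "Sarvam STT bot",
    DailyParams(
        audio_in_enabled=True,
        # Local VAD produces the frames the STT service flushes on
        vad_analyzer=SileroVADAnalyzer(),
    ),
)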

Event Handlers

In addition to the standard service connection events (on_connected, on_disconnected, on_connection_error), Sarvam STT provides:
Event | Description
on_speech_started | Speech detected in the audio stream
on_speech_stopped | Speech stopped
on_utterance_end | End of utterance detected

@stt.event_handler("on_speech_started")
async def on_speech_started(service):
    print("User started speaking")

@stt.event_handler("on_utterance_end")
async def on_utterance_end(service):
    print("Utterance ended")
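The standard connection events can be handled the same way; a sketch, assuming on_connection_error passes the error as its second argument:

@stt.event_handler("on_connection_error")
async def on_connection_error(service, error):
    print(f"Connection error: {error}")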