Overview

GroqSTTService provides speech-to-text capabilities using Groq’s hosted Whisper API. It offers high-accuracy transcription with minimal setup requirements. The service uses Voice Activity Detection (VAD) to process only speech segments, optimizing API usage and improving response time.

Installation

To use GroqSTTService, install the required dependencies:

pip install "pipecat-ai[groq]"

You’ll also need to set your Groq API key as the GROQ_API_KEY environment variable.
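
For example, in your shell (the key value is a placeholder):

export GROQ_API_KEY="your-groq-api-key"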

You can obtain a Groq API key from the Groq Console.

Configuration

Constructor Parameters

model
str
default:"whisper-large-v3-turbo"

Whisper model to use. Currently only “whisper-large-v3-turbo” is available.

api_key
str

Your Groq API key. If not provided, the service falls back to the GROQ_API_KEY environment variable.

base_url
str
default:"https://api.groq.com/openai/v1"

Custom API base URL for Groq API requests.

language
Language
default:"Language.EN"

Language of the audio input. Defaults to English.

prompt
str

Optional text to guide the model’s style or continue a previous segment.

temperature
float
default:0.0

Sampling temperature between 0 and 1. Lower values produce more deterministic transcriptions; higher values produce more varied output.

sample_rate
int

Audio sample rate in Hz. If not provided, uses the pipeline’s sample rate.

Input

The service processes audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • Single channel (mono)
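
As a quick sanity check, a stdlib-only sketch like the following confirms that a recording meets these requirements (the filename is hypothetical):

import wave

# Verify that a local recording matches the service's input requirements.
with wave.open("utterance.wav", "rb") as wf:
    assert wf.getnchannels() == 1, "audio must be mono"
    assert wf.getsampwidth() == 2, "samples must be 16-bit (2 bytes)"
    print(f"{wf.getframerate()} Hz, {wf.getnframes()} frames of PCM audio")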

Output Frames

The service produces two types of frames during transcription:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Detected language (if available)

ErrorFrame

Generated when transcription errors occur, containing error details.
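
As a sketch of how a downstream processor might consume both frame types, assuming Pipecat’s FrameProcessor base class (exact import paths vary between Pipecat versions, and TranscriptLogger is a hypothetical name):

from pipecat.frames.frames import ErrorFrame, Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptLogger(FrameProcessor):
    # Hypothetical processor: log transcriptions and errors, pass frames through.
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            print(f"[{frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, ErrorFrame):
            print(f"Transcription error: {frame.error}")
        await self.push_frame(frame, direction)

Placed after the STT service in a pipeline, this logs each final transcription while letting all frames continue downstream.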

Methods

Set Model

await service.set_model("whisper-large-v3-turbo")

See the STT base class methods for additional functionality.

Language Support

Groq’s Whisper API supports a wide range of languages. The service automatically maps Language enum values to the appropriate Whisper language codes.

Language Code    Description    Whisper Code
Language.AF      Afrikaans      af
Language.AR      Arabic         ar
Language.HY      Armenian       hy
Language.AZ      Azerbaijani    az
Language.BE      Belarusian     be
Language.BS      Bosnian        bs
Language.BG      Bulgarian      bg
Language.CA      Catalan        ca
Language.ZH      Chinese        zh
Language.HR      Croatian       hr
Language.CS      Czech          cs
Language.DA      Danish         da
Language.NL      Dutch          nl
Language.EN      English        en
Language.ET      Estonian       et
Language.FI      Finnish        fi
Language.FR      French         fr
Language.GL      Galician       gl
Language.DE      German         de
Language.EL      Greek          el
Language.HE      Hebrew         he
Language.HI      Hindi          hi
Language.HU      Hungarian      hu
Language.IS      Icelandic      is
Language.ID      Indonesian     id
Language.IT      Italian        it
Language.JA      Japanese       ja
Language.KN      Kannada        kn
Language.KK      Kazakh         kk
Language.KO      Korean         ko
Language.LV      Latvian        lv
Language.LT      Lithuanian     lt
Language.MK      Macedonian     mk
Language.MS      Malay          ms
Language.MR      Marathi        mr
Language.MI      Maori          mi
Language.NE      Nepali         ne
Language.NO      Norwegian      no
Language.FA      Persian        fa
Language.PL      Polish         pl
Language.PT      Portuguese     pt
Language.RO      Romanian       ro
Language.RU      Russian        ru
Language.SR      Serbian        sr
Language.SK      Slovak         sk
Language.SL      Slovenian      sl
Language.ES      Spanish        es
Language.SW      Swahili        sw
Language.SV      Swedish        sv
Language.TL      Tagalog        tl
Language.TA      Tamil          ta
Language.TH      Thai           th
Language.TR      Turkish        tr
Language.UK      Ukrainian      uk
Language.UR      Urdu           ur
Language.VI      Vietnamese     vi
Language.CY      Welsh          cy

Groq’s Whisper implementation supports language variants (like en-US, fr-CA) by mapping them to their base language. For example, Language.EN_US and Language.EN_GB will both map to en.

The service will automatically detect the language if none is specified, but specifying the language typically improves transcription accuracy.

For the most up-to-date list of supported languages, refer to the Groq documentation.

Usage Example

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.groq import GroqSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = GroqSTTService(
    model="whisper-large-v3-turbo",
    api_key=os.getenv("GROQ_API_KEY"),
    language=Language.EN,
    prompt="Transcribe the following conversation",
    temperature=0.0,
)

# Use in pipeline (transport and llm are created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Voice Activity Detection Integration

This service inherits from SegmentedSTTService, which uses Voice Activity Detection (VAD) to identify speech segments for processing. This approach:

  • Processes only actual speech, not silence or background noise
  • Maintains a small audio buffer (default 1 second) to capture speech that occurs slightly before VAD detection
  • Receives UserStartedSpeakingFrame and UserStoppedSpeakingFrame from a VAD component in the pipeline
  • Only sends complete utterances to the API when speech has ended

Ensure your transport includes a VAD component (like SileroVADAnalyzer) to properly detect speech segments.
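
A minimal sketch of that wiring, assuming the Daily transport (pipecat-ai[daily]); import paths and parameter names vary between Pipecat versions, and the room URL is a placeholder:

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder room URL
    token=None,
    bot_name="transcription-bot",
    params=DailyParams(
        audio_in_enabled=True,             # pass microphone audio to the pipeline
        vad_analyzer=SileroVADAnalyzer(),  # emits UserStarted/StoppedSpeakingFrame
    ),
)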

Metrics Support

The service collects the following metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • API response time
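
Metrics collection is typically enabled on the pipeline task rather than on the service itself. A sketch assuming Pipecat’s PipelineParams, where pipeline is the object from the usage example above:

from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(enable_metrics=True),  # emit TTFB and processing metrics
)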

Notes

  • Requires valid Groq API key
  • Uses Groq’s hosted Whisper model
  • Requires VAD component in transport
  • Processes complete utterances, not continuous audio
  • Handles API rate limiting
  • Automatic error handling
  • Thread-safe processing

Error Handling

The service handles common API errors including:

  • Authentication errors
  • Rate limiting
  • Invalid audio format
  • Network connectivity issues
  • API timeouts

Errors are propagated through ErrorFrames with descriptive messages.