# Speech to Text

> Learn how to configure speech recognition to convert user audio into text in your Pipecat pipeline

**Speech to Text (STT)** services are responsible for converting user audio into text transcriptions. They receive audio input from users and provide real-time transcriptions that your bot can process and respond to.

## Pipeline Placement

STT processors must be positioned correctly in your pipeline to receive and process audio frames:

```python theme={null}
pipeline = Pipeline([
    transport.input(),             # Creates InputAudioRawFrames
    stt,                           # Processes audio → creates TranscriptionFrames
    context_aggregator.user(),     # Uses transcriptions for context
    llm,
    tts,
    transport.output(),
])
```

**Placement requirements:**

* **After `transport.input()`**: STT needs `InputAudioRawFrame`s from the transport
* **Before context processing**: Transcriptions must be available for context aggregation
* **Before LLM processing**: Text must be ready for language model input
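The ordering rules above can be expressed as a small check. This is an illustrative sketch only: `check_stt_placement` and the string labels are hypothetical stand-ins and not part of Pipecat, where a real pipeline holds processor objects rather than names.

```python
def check_stt_placement(order: list[str]) -> list[str]:
    """Return a list of placement problems for a pipeline described by name.

    Hypothetical helper for illustration; Pipecat does not provide it.
    """
    position = {name: i for i, name in enumerate(order)}
    if "stt" not in position:
        return ["pipeline has no stt processor"]

    problems = []
    stt = position["stt"]

    # STT must come after the transport input that produces audio frames.
    if position.get("transport.input", len(order)) > stt:
        problems.append("stt must come after transport.input")

    # STT must come before anything that consumes transcriptions.
    for consumer in ("context_aggregator.user", "llm"):
        if consumer in position and position[consumer] < stt:
            problems.append(f"stt must come before {consumer}")

    return problems


good = ["transport.input", "stt", "context_aggregator.user", "llm", "tts", "transport.output"]
bad = ["transport.input", "llm", "stt"]

print(check_stt_placement(good))
print(check_stt_placement(bad))
```

An empty list means the order satisfies all three requirements; each string in a non-empty list names one violated rule.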

## STT Service Types

Pipecat provides two types of STT services based on how they process audio:

### 1. STTService (Streaming)

**How it works:**

* Establishes a WebSocket connection to the STT provider
* Continuously streams audio for real-time transcription
* Lower latency due to persistent connection

### 2. SegmentedSTTService (HTTP-based)

**How it works:**

* Uses local VAD (Voice Activity Detection) to chunk speech
* Sends audio segments to STT service as wav files
* Higher latency due to segmentation and HTTP POST requests
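To make the segmented approach concrete, the sketch below wraps one speech segment of raw PCM in a WAV container using only the Python standard library. It assumes 16-bit mono audio at 16 kHz; `pcm_to_wav_bytes` is a hypothetical helper for illustration, not the function Pipecat uses internally.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM audio in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at 16 kHz mono stands in for a VAD-delimited segment.
segment = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav_bytes(segment)
```

The resulting bytes are what a segmented service would send in the body of an HTTP request. The whole segment must be captured before the request can start, which is one source of the added latency.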

<Note>
  STT services are modular: you can swap one provider for another, or switch
  between streaming and segmented services, by changing only the service you
  construct. The rest of the pipeline stays the same.
</Note>

## Supported STT Services

Pipecat supports a wide range of STT providers to fit different needs and budgets:

<Card title="Supported STT Services" icon="list" href="/api-reference/server/services/supported-services#speech-to-text">
  View the complete list of supported speech-to-text providers
</Card>

Popular options include:

<CardGroup cols={3}>
  <Card title="Deepgram" icon="microphone" href="/api-reference/server/services/stt/deepgram">
    Fast, accurate streaming STT with excellent real-time performance
  </Card>

  <Card title="Speechmatics" icon="waveform" href="/api-reference/server/services/stt/speechmatics">
    Advanced speech recognition with strong accent and dialect handling
  </Card>

  <Card title="AssemblyAI" icon="brain" href="/api-reference/server/services/stt/assemblyai">
    AI-powered transcription with speaker diarization and sentiment analysis
  </Card>

  <Card title="Gladia" icon="sparkles" href="/api-reference/server/services/stt/gladia">
    High-performance STT with multilingual support and custom models
  </Card>

  <Card title="Azure Speech" icon="microsoft" href="/api-reference/server/services/stt/azure">
    Microsoft's enterprise-grade STT service with extensive language support
  </Card>

  <Card title="Google Speech-to-Text" icon="google" href="/api-reference/server/services/stt/google">
    Reliable transcription with strong language model integration
  </Card>
</CardGroup>

## STT Configuration

### Service-Specific Configuration

Each STT service has its own customization options. Refer to specific service documentation for details:

<Card title="Individual STT Services" icon="screwdriver-wrench" href="/api-reference/server/services/supported-services#speech-to-text">
  Explore configuration options for each supported STT provider
</Card>

For example, let's look at configuring the **DeepgramSTTService** using the `LiveOptions` class:

```python theme={null}
import os

from deepgram import LiveOptions
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.transcriptions.language import Language

# Configure using LiveOptions for full control
live_options = LiveOptions(
    model="nova-2",
    language=Language.EN_US,
    interim_results=True,        # Enable interim transcripts
    punctuate=True,              # Add punctuation
    profanity_filter=True,       # Filter profanity
    vad_events=False,            # Use pipeline VAD instead
)

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=live_options,
)
```

### STTService Base Class Configuration

All STT services inherit from the `STTService` base class, which provides common configuration options with sensible defaults:

```python theme={null}
stt = YourSTTService(
    # Service-specific options...
    audio_passthrough=True,      # Pass audio frames downstream (recommended)
    sample_rate=16000,           # Audio sample rate (better set in PipelineParams)
)
```

**Key options:**

* **`audio_passthrough=True`**: Allows audio frames to continue downstream to other processors (like audio recording)
* **`sample_rate`**: Audio sample rate. Best practice is to **set `audio_in_sample_rate` in `PipelineParams`** instead, so every input processor uses the same rate

<Warning>
  Setting `audio_passthrough=False` will stop audio frames from being passed
  downstream, which may break audio recording or other audio-dependent
  processors.
</Warning>
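The effect of `audio_passthrough` can be pictured with a toy model. The sketch below is not Pipecat code: `AudioFrame`, `TextFrame`, and `toy_stt` are hypothetical stand-ins that only mimic which frames a downstream processor would receive.

```python
from dataclasses import dataclass


@dataclass
class AudioFrame:  # hypothetical stand-in for an input audio frame
    audio: bytes


@dataclass
class TextFrame:  # hypothetical stand-in for a transcription frame
    text: str


def toy_stt(frames, transcript: str, audio_passthrough: bool = True):
    """Yield what a downstream processor would see after a toy STT stage."""
    for frame in frames:
        if isinstance(frame, AudioFrame) and not audio_passthrough:
            continue  # audio stops here and never reaches later processors
        yield frame
    yield TextFrame(transcript)


audio = [AudioFrame(b"\x00\x00"), AudioFrame(b"\x01\x00")]

with_passthrough = list(toy_stt(audio, "hello", audio_passthrough=True))
without_passthrough = list(toy_stt(audio, "hello", audio_passthrough=False))
```

With passthrough enabled, a recorder placed after the STT stage still receives both audio frames. With it disabled, only the transcription arrives.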

### Pipeline-Level Audio Configuration

Instead of setting sample rates on individual services, configure them pipeline-wide:

```python theme={null}
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # All input processors use this rate
        audio_out_sample_rate=24000,  # All output processors use this rate
    ),
)
```

This ensures all audio processors use consistent sample rates without manual configuration.

<Tip>
  Always set audio sample rates in `PipelineParams` to avoid mismatches between
  different audio processors. This simplifies configuration and ensures
  consistent audio quality across your pipeline.
</Tip>
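A short calculation shows why mismatches matter. Raw PCM carries no sample-rate metadata, so a processor that assumes the wrong rate misjudges both duration and pitch. The sketch below assumes 16-bit mono audio; `pcm_duration_seconds` is a hypothetical helper for illustration.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Duration of 16-bit PCM audio when interpreted at the given sample rate."""
    bytes_per_second = sample_rate * channels * 2  # 2 bytes per 16-bit sample
    return num_bytes / bytes_per_second


one_second_at_16k = 16000 * 2  # bytes in one second of 16 kHz mono audio

correct = pcm_duration_seconds(one_second_at_16k, 16000)
mismatch = pcm_duration_seconds(one_second_at_16k, 24000)

print(f"interpreted at 16 kHz: {correct:.3f} s")
print(f"interpreted at 24 kHz: {mismatch:.3f} s")
```

The same buffer that represents one second at 16 kHz is treated as roughly two thirds of a second at 24 kHz, so the audio would sound sped up and higher pitched.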

## Multilingual Transcription

Many STT services in Pipecat default to `Language.EN` (English). If you need to transcribe speech in other languages or let the model auto-detect the spoken language, you can enable multilingual support. However, providers implement this differently:

**`language=None`** — Whisper-based services (Groq, OpenAI, local Whisper) and ElevenLabs support automatic language detection when no language is specified:

```python theme={null}
import os

from pipecat.services.groq.stt import GroqSTTService

stt = GroqSTTService(
    api_key=os.getenv("GROQ_API_KEY"),
    settings=GroqSTTService.Settings(
        language=None,  # Auto-detect language
    ),
)
```

**`language="multi"`** — Deepgram uses a special `"multi"` language code to enable multilingual transcription:

```python theme={null}
import os

from pipecat.services.deepgram.stt import DeepgramSTTService

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    settings=DeepgramSTTService.Settings(
        language="multi",  # Enable multilingual mode
    ),
)
```

**Language array** — Google Cloud STT accepts a list of languages for multi-language recognition. See the [Google STT docs](/api-reference/server/services/stt/google) for details.

<Note>
  Some services have additional multilingual features. For example, Soniox
  supports language hints, AssemblyAI offers a dedicated multilingual model, and
  Speechmatics supports bilingual transcription. See individual service docs for
  details.
</Note>

## Best Practices

### Enable Interim Results

When available, enable interim transcripts for better user experience:

```python theme={null}
stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        interim_results=True,
    ),
)
```

**Benefits:**

* Notifies context aggregation that more text is coming
* Prevents premature LLM completions
* Enables interruption detection
* Improves conversation flow
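The first two benefits can be sketched with a small buffer that separates interim from final results. This is a simplified illustration, not Pipecat's aggregator: `Transcript` and `TurnBuffer` are hypothetical names, and the real context aggregator handles more cases.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:  # hypothetical stand-in for a transcription result
    text: str
    is_final: bool


@dataclass
class TurnBuffer:
    """Collects final transcripts and tracks the latest interim one."""

    finals: list[str] = field(default_factory=list)
    pending: str = ""

    def add(self, result: Transcript) -> None:
        if result.is_final:
            self.finals.append(result.text)
            self.pending = ""  # the interim text has been superseded
        else:
            self.pending = result.text

    @property
    def more_text_expected(self) -> bool:
        # An outstanding interim result means the user is still mid-utterance.
        return bool(self.pending)

    def committed_text(self) -> str:
        return " ".join(self.finals)


turn = TurnBuffer()
turn.add(Transcript("what's the", is_final=False))
waiting = turn.more_text_expected

turn.add(Transcript("What's the weather?", is_final=True))
done = not turn.more_text_expected
```

While an interim result is outstanding, the buffer reports that more text is expected, which is the signal that lets a pipeline hold off on sending an incomplete turn to the LLM.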

### Enable Punctuation and Formatting

Use punctuation when available for better LLM comprehension:

```python theme={null}
stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        punctuate=True,        # Adds punctuation
        profanity_filter=True, # Optional content filtering
    )
)
```

**Benefits:**

* Professional-looking transcripts
* Better LLM comprehension
* Eliminates post-processing needs
* Improved context understanding

### Use Local VAD

While many STT services provide Voice Activity Detection, use Pipecat's local Silero VAD for better performance:

```python theme={null}
from pipecat.audio.vad.silero import SileroVADAnalyzer

# Configure in context aggregator
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```

**Advantages:**

* **150-200ms faster** speech detection (no network round trip)
* More responsive conversation flow
* Better interruption handling
* Reduced latency overall

### Tune STT Latency

Each STT service has a measured P99 latency for delivering final transcripts after the user stops speaking. This value is used by turn stop strategies to decide how long to wait before ending the user's turn. If you notice the bot responding too early (cutting off the user) or too late (long pauses), tuning this value can help.

<Card title="STT Latency Tuning" icon="gauge-high" href="/pipecat/fundamentals/stt-latency-tuning">
  Learn about TTFS latency, see default values for every STT service, and how to
  measure and override for your deployment
</Card>
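If you collect your own latency measurements, a P99 value can be computed with the nearest-rank method, as in the sketch below. The sample values are made up for illustration, and `percentile_nearest_rank` is a hypothetical helper rather than a Pipecat function. Other percentile definitions interpolate between samples and can give slightly different results.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical measurements, in seconds, from end of speech to final transcript.
latencies = [0.21, 0.25, 0.19, 0.32, 0.28, 0.24, 0.95, 0.22, 0.27, 0.23]

p50 = percentile_nearest_rank(latencies, 50)
p99 = percentile_nearest_rank(latencies, 99)

print(f"P50: {p50:.2f} s, P99: {p99:.2f} s")
```

A single slow outlier dominates the P99 while barely moving the median, which is why a turn stop strategy that waits for the P99 is conservative: it rarely cuts the user off, at the cost of a longer pause in the typical case.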

## Key Takeaways

* **Pipeline placement matters** - STT must come after transport input, before context processing
* **Service types differ** - streaming services have lower latency than segmented ones
* **Services are modular** - swap providers by changing only the service you construct
* **Best practices improve performance** - use interim results, formatting, and local VAD
* **Configuration affects quality** - proper setup significantly impacts transcription accuracy

## What's Next

Now that you understand speech recognition, let's explore how to manage conversation context and memory in your voice AI bot.

<Card title="Context Management" icon="arrow-right" href="/pipecat/learn/context-management">
  Learn how to handle conversation history and context in your pipeline
</Card>
