# AssemblyAI

> Speech-to-text service implementation using AssemblyAI's real-time transcription API

## Overview

`AssemblyAISTTService` provides real-time speech recognition over AssemblyAI's WebSocket API. It supports interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.

<CardGroup cols={2}>
  <Card title="AssemblyAI STT API Reference" icon="code" href="https://reference-server.pipecat.ai/en/latest/api/pipecat.services.assemblyai.stt.html">
    Pipecat's API methods for AssemblyAI STT integration
  </Card>

  <Card title="Example Implementation" icon="play" href="https://github.com/pipecat-ai/pipecat/blob/main/examples/voice/voice-assemblyai-turn-detection.py">
    Example with AssemblyAI built-in turn detection
  </Card>

  <Card title="Universal-3 Pro Streaming" icon="bolt" href="https://www.assemblyai.com/docs/streaming/universal-3-pro">
    U3 Pro streaming documentation and features
  </Card>

  <Card title="U3 Pro API Reference" icon="book" href="https://www.assemblyai.com/docs/api-reference/streaming-api/universal-3-pro/universal-3-pro">
    Complete U3 Pro streaming API reference
  </Card>

  <Card title="AssemblyAI Console" icon="microphone" href="https://www.assemblyai.com/dashboard/signup">
    Access API keys and transcription features
  </Card>
</CardGroup>

## Installation

To use AssemblyAI services, install the required dependency:

```bash theme={null}
uv add "pipecat-ai[assemblyai]"
```

## Prerequisites

### AssemblyAI Account Setup

Before using AssemblyAI STT services, you need:

1. **AssemblyAI Account**: Sign up at [AssemblyAI Console](https://www.assemblyai.com/dashboard/signup)
2. **API Key**: Generate an API key from your dashboard
3. **Model Selection**: Choose from available transcription models and features

### Required Environment Variables

* `ASSEMBLYAI_API_KEY`: Your AssemblyAI API key for authentication
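
One way to fail fast when the key is missing is to read it through a small helper before constructing the service. This is a minimal sketch: `require_env` is a hypothetical helper for illustration, not part of Pipecat.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes ASSEMBLYAI_API_KEY is exported in your shell):
# api_key = require_env("ASSEMBLYAI_API_KEY")
```

Raising early gives a clearer error than passing `None` to the service and failing later at connection time.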

## Configuration

### AssemblyAISTTService

<ParamField path="api_key" type="str" required>
  AssemblyAI API key for authentication.
</ParamField>

<ParamField path="language" type="Language" default="Language.EN" deprecated>
  Language code for transcription. AssemblyAI currently supports English.
  *Deprecated in v0.0.105. Use `settings=AssemblyAISTTService.Settings(...)`
  instead.*
</ParamField>

<ParamField path="api_endpoint_base_url" type="str" default="wss://streaming.assemblyai.com/v3/ws">
  WebSocket endpoint URL. Override for custom or proxied deployments.
</ParamField>

<ParamField path="sample_rate" type="int" default="16000">
  Audio sample rate in Hz.
</ParamField>

<ParamField path="encoding" type="str" default="pcm_s16le">
  Audio encoding format.
</ParamField>

<ParamField path="connection_params" type="AssemblyAIConnectionParams" default="None" deprecated>
  Connection configuration parameters. *Deprecated in v0.0.105. Use
  `settings=AssemblyAISTTService.Settings(...)` instead. See
  [AssemblyAIConnectionParams](#assemblyaiconnectionparams) below for field
  mapping.*
</ParamField>

<ParamField path="vad_force_turn_endpoint" type="bool" default="True">
  Controls the turn detection mode. When `True` (Pipecat mode, the default),
  the service forces AssemblyAI to return final transcripts as soon as possible
  so that Pipecat's turn detection (e.g., Smart Turn) decides when the user is
  done. When VAD detects that the user has stopped speaking, the service sends
  a `ForceEndpoint` message as a ceiling on the wait for finals, and the STT
  service does not emit `UserStartedSpeakingFrame` or
  `UserStoppedSpeakingFrame`. When `False` (AssemblyAI turn detection mode,
  u3-rt-pro only), AssemblyAI's model controls turn endings using its built-in
  turn detection. AssemblyAI API defaults are used for all parameters unless
  explicitly set, and the STT service emits `UserStartedSpeakingFrame` and
  `UserStoppedSpeakingFrame`.
</ParamField>

<ParamField path="should_interrupt" type="bool" default="True">
  Whether to interrupt the bot when the user starts speaking in AssemblyAI turn
  detection mode (`vad_force_turn_endpoint=False`). Only applies when using
  AssemblyAI's built-in turn detection.
</ParamField>

<ParamField path="speaker_format" type="str | None" default="None">
  Optional format string for speaker labels when diarization is enabled. Use
  `{speaker}` for speaker label and `{text}` for transcript text. Example:
  `"<{speaker}>{text}</{speaker}>"` or `"{speaker}: {text}"`. If None, transcript
  text is not modified.
</ParamField>

<ParamField path="settings" type="AssemblyAISTTService.Settings" default="None">
  Runtime-configurable settings for the STT service. See [Settings](#settings)
  below.
</ParamField>

<ParamField path="ttfs_p99_latency" type="float" default="ASSEMBLYAI_TTFS_P99">
  P99 latency from speech end to final transcript in seconds. Override for your
  deployment.
</ParamField>

### AssemblyAIConnectionParams

<Warning>
  `connection_params` is deprecated as of v0.0.105. Use
  `settings=AssemblyAISTTService.Settings(...)` instead. The `sample_rate` and
  `encoding` fields remain as direct constructor arguments. All other fields
  have moved into Settings — `speech_model` maps to `model`.
</Warning>

Connection-level parameters previously passed via the `connection_params` constructor argument.

| Parameter                                | Type        | Default       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ---------------------------------------- | ----------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample_rate`                            | `int`       | `16000`       | Audio sample rate in Hz.                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `encoding`                               | `Literal`   | `"pcm_s16le"` | Audio encoding format. Options: `"pcm_s16le"`, `"pcm_mulaw"`.                                                                                                                                                                                                                                                                                                                                                                                                                             |
| `end_of_turn_confidence_threshold`       | `float`     | `None`        | Confidence threshold for end-of-turn detection.                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| `min_turn_silence`                       | `int`       | `None`        | Minimum silence duration (ms) when confident about end-of-turn.                                                                                                                                                                                                                                                                                                                                                                                                                           |
| `min_end_of_turn_silence_when_confident` | `int`       | `None`        | **DEPRECATED**. Use `min_turn_silence` instead. Will be removed in a future version.                                                                                                                                                                                                                                                                                                                                                                                                      |
| `max_turn_silence`                       | `int`       | `None`        | Maximum silence duration (ms) before forcing end-of-turn.                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| `keyterms_prompt`                        | `List[str]` | `None`        | List of key terms to guide transcription. Will be JSON serialized before sending.                                                                                                                                                                                                                                                                                                                                                                                                         |
| `prompt`                                 | `str`       | `None`        | **BETA**: Optional text prompt to guide transcription. Only used when `speech_model` is `"u3-rt-pro"`. Cannot be used with `keyterms_prompt`. We suggest starting with no prompt. See [AssemblyAI prompting best practices](https://www.assemblyai.com/docs/speech-to-text/streaming/prompting) for guidance.                                                                                                                                                                             |
| `speech_model`                           | `Literal`   | `"u3-rt-pro"` | Speech model to use. Options: `"universal-streaming-english"`, `"universal-streaming-multilingual"`, `"u3-rt-pro"`. Defaults to `"u3-rt-pro"` if not specified. |
| `language_detection`                     | `bool`      | `None`        | Enable automatic language detection. Only applicable to `universal-streaming-multilingual`. Turn messages include language information.                                                                                                                                                                                                                                                                                                                                                   |
| `format_turns`                           | `bool`      | `True`        | Whether to format transcript turns. Only applicable to `universal-streaming-english` and `universal-streaming-multilingual` models. For `u3-rt-pro`, formatting is automatic and built-in.                                                                                                                                                                                                                                                                                                |
| `speaker_labels`                         | `bool`      | `None`        | Enable speaker diarization. Final transcripts include a speaker field (e.g., "Speaker A", "Speaker B").                                                                                                                                                                                                                                                                                                                                                                                   |
| `vad_threshold`                          | `float`     | `None`        | Voice activity detection (VAD) confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Only applicable to `u3-rt-pro`. Frames with VAD confidence below this value are considered silent. Increase it in noisy environments to reduce false speech detection. When `None` (the default), the parameter is not sent and the API default of 0.3 applies. For best performance with an external VAD (e.g., Silero), align this value with your VAD's activation threshold. |
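
The field mapping described above can be sketched as a small migration helper. This is an illustration under stated assumptions: `split_connection_params` and the two lookup tables are hypothetical names, not part of Pipecat, and the renames are taken from the warning and table above (`speech_model` to `model`, `min_end_of_turn_silence_when_confident` to `min_turn_silence`).

```python
from typing import Any

# Fields that stay on the constructor rather than moving into Settings.
CONSTRUCTOR_FIELDS = {"sample_rate", "encoding"}

# Legacy field name -> Settings field name.
RENAMED_FIELDS = {
    "speech_model": "model",
    "min_end_of_turn_silence_when_confident": "min_turn_silence",
}


def split_connection_params(params: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split legacy connection params into (constructor kwargs, Settings kwargs)."""
    constructor_kwargs: dict[str, Any] = {}
    settings_kwargs: dict[str, Any] = {}
    for name, value in params.items():
        if value is None:
            continue  # unset fields are dropped so defaults apply
        if name in CONSTRUCTOR_FIELDS:
            constructor_kwargs[name] = value
        else:
            settings_kwargs[RENAMED_FIELDS.get(name, name)] = value
    return constructor_kwargs, settings_kwargs


legacy = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "min_end_of_turn_silence_when_confident": 100,
    "prompt": None,
}
constructor_kwargs, settings_kwargs = split_connection_params(legacy)
print(constructor_kwargs)  # {'sample_rate': 16000}
print(settings_kwargs)  # {'model': 'u3-rt-pro', 'min_turn_silence': 100}
```

The resulting dictionaries could then be passed as `AssemblyAISTTService(api_key=..., settings=AssemblyAISTTService.Settings(**settings_kwargs), **constructor_kwargs)`, assuming `Settings` accepts these fields as keyword arguments as in the usage examples on this page.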

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `AssemblyAISTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See [Service Settings](/pipecat/fundamentals/service-settings) for details.

| Parameter                          | Type              | Default       | Description                                                                                |
| ---------------------------------- | ----------------- | ------------- | ------------------------------------------------------------------------------------------ |
| `model`                            | `str`             | `None`        | STT model identifier. *(Inherited from base STT settings.)*                                |
| `language`                         | `Language \| str` | `Language.EN` | Language for speech recognition. *(Inherited from base STT settings.)*                     |
| `formatted_finals`                 | `bool`            | `True`        | Whether to enable transcript formatting.                                                   |
| `word_finalization_max_wait_time`  | `int`             | `None`        | Maximum time to wait for word finalization in milliseconds.                                |
| `end_of_turn_confidence_threshold` | `float`           | `None`        | Confidence threshold for end-of-turn detection.                                            |
| `min_turn_silence`                 | `int`             | `None`        | Minimum silence duration (ms) when confident about end-of-turn.                            |
| `max_turn_silence`                 | `int`             | `None`        | Maximum silence duration (ms) before forcing end-of-turn.                                  |
| `keyterms_prompt`                  | `List[str]`       | `None`        | List of key terms to guide transcription.                                                  |
| `prompt`                           | `str`             | `None`        | Optional text prompt to guide transcription (u3-rt-pro only).                              |
| `language_detection`               | `bool`            | `None`        | Enable automatic language detection.                                                       |
| `format_turns`                     | `bool`            | `True`        | Whether to format transcript turns.                                                        |
| `speaker_labels`                   | `bool`            | `None`        | Enable speaker diarization.                                                                |
| `vad_threshold`                    | `float`           | `None`        | VAD confidence threshold (0.0–1.0) for classifying audio frames as silence.                |
| `domain`                           | `str`             | `None`        | Optional domain for specialized recognition modes (e.g., `"medical-v1"` for Medical Mode). |

## Usage

### Basic Setup

```python theme={null}
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
)
```

### With Custom Settings

```python theme={null}
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        keyterms_prompt=["Pipecat", "AssemblyAI"],
    ),
    vad_force_turn_endpoint=True,
)
```

### With AssemblyAI Built-in Turn Detection

AssemblyAI's u3-rt-pro model supports built-in turn detection for more natural conversation flow:

```python theme={null}
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
    settings=AssemblyAISTTService.Settings(
        # Optional: Tune turn detection timing
        min_turn_silence=100,  # Minimum silence (ms) when confident about end-of-turn
        max_turn_silence=1000,  # Maximum silence (ms) before forcing end-of-turn
    ),
)
```

### With Speaker Diarization

Enable speaker identification for multi-party conversations:

```python theme={null}
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        speaker_labels=True,  # Enable speaker diarization
    ),
    speaker_format="{speaker}: {text}",  # Format transcripts with speaker labels
)
```
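
To preview what a given `speaker_format` string produces, you can apply the same placeholders with Python's `str.format`. This sketch assumes the service substitutes `{speaker}` and `{text}` in this way; `preview_speaker_format` is a hypothetical helper for illustration, not a Pipecat function.

```python
from typing import Optional


def preview_speaker_format(speaker_format: Optional[str], speaker: str, text: str) -> str:
    """Render a transcript as described by a {speaker}/{text} format string."""
    if speaker_format is None:
        return text  # no format configured: transcript text is left unmodified
    return speaker_format.format(speaker=speaker, text=text)


print(preview_speaker_format("{speaker}: {text}", "Speaker A", "Hello there"))
# Speaker A: Hello there
print(preview_speaker_format("<{speaker}>{text}</{speaker}>", "Speaker B", "Hi"))
# <Speaker B>Hi</Speaker B>
print(preview_speaker_format(None, "Speaker A", "Hello there"))
# Hello there
```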

## Notes

* **u3-rt-pro model**: The default model is now `u3-rt-pro`, which provides the best performance and supports built-in turn detection.
* **Turn detection modes**:
  * **Pipecat mode** (`vad_force_turn_endpoint=True`, default): Forces AssemblyAI to return finals ASAP so Pipecat's turn detection (e.g., Smart Turn) decides when the user is done. The service sends a `ForceEndpoint` message when VAD detects the user has stopped speaking.
  * **AssemblyAI mode** (`vad_force_turn_endpoint=False`, u3-rt-pro only): AssemblyAI's model controls turn endings using built-in turn detection. The service emits `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` based on AssemblyAI's detection.
* **Speaker diarization**: Enable `speaker_labels=True` in Settings to automatically identify different speakers. Final transcripts will include a speaker field (e.g., "Speaker A", "Speaker B"). Use the `speaker_format` parameter to format transcripts with speaker labels.
* **Language detection**: When using `universal-streaming-multilingual` with `language_detection=True`, Turn messages include `language_code` and `language_confidence` fields.
* **Prompting**: The `prompt` parameter (u3-rt-pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature, and AssemblyAI recommends testing without a prompt first. It cannot be combined with `keyterms_prompt`.
* **Dynamic settings updates**: You can update `keyterms_prompt`, `prompt`, `min_turn_silence`, and `max_turn_silence` at runtime using `STTUpdateSettingsFrame` without reconnecting.
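
The prompting constraints in the notes above can be checked before building settings. This is a minimal sketch: `check_prompt_settings` is a hypothetical helper, not part of Pipecat, and the service or the AssemblyAI API may perform its own validation.

```python
from typing import Optional


def check_prompt_settings(
    model: str = "u3-rt-pro",
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[list[str]] = None,
) -> None:
    """Raise ValueError for combinations the notes describe as unsupported."""
    if prompt is not None and keyterms_prompt:
        raise ValueError("prompt cannot be combined with keyterms_prompt")
    if prompt is not None and model != "u3-rt-pro":
        raise ValueError("prompt is only supported with the u3-rt-pro model")


# Valid: key terms only, or a prompt only on u3-rt-pro.
check_prompt_settings(keyterms_prompt=["Pipecat", "AssemblyAI"])
check_prompt_settings(prompt="Transcribe product names carefully.")
```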

<Tip>
  The `connection_params=` / `InputParams` / `params=` pattern is deprecated as
  of v0.0.105. Use `Settings` / `settings=` instead. See the [Service Settings
  guide](/pipecat/fundamentals/service-settings) for migration details.
</Tip>

## Event Handlers

AssemblyAI STT supports the standard [service connection events](/api-reference/server/events/service-events), plus turn-level events for conversation tracking:

| Event             | Description                                                   |
| ----------------- | ------------------------------------------------------------- |
| `on_connected`    | Connected to AssemblyAI WebSocket                             |
| `on_disconnected` | Disconnected from AssemblyAI WebSocket                        |
| `on_end_of_turn`  | End of turn detected (fires after final transcript is pushed) |

```python theme={null}
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to AssemblyAI")

@stt.event_handler("on_end_of_turn")
async def on_end_of_turn(service, transcript):
    print(f"Turn ended: {transcript}")
```

The `on_end_of_turn` event receives `(service, transcript)`, where `transcript` is the final transcript text. Because the event fires after the final transcript is pushed, it provides a reliable hook for end-of-turn logic that doesn't race with `TranscriptionFrame`. It works in both Pipecat and AssemblyAI turn detection modes.
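
As an illustration of end-of-turn logic, the sketch below collects one transcript per completed turn. `TurnLog` is a hypothetical class, not part of Pipecat; the handler is called directly here so the example runs without a live service.

```python
import asyncio


class TurnLog:
    """Collects final transcripts, one entry per completed turn."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    async def on_end_of_turn(self, service, transcript: str) -> None:
        text = transcript.strip()
        if text:  # skip turns with no transcript text
            self.turns.append(text)


turn_log = TurnLog()

# With a real service, the handler could be wired up using the decorator shown above:
#   @stt.event_handler("on_end_of_turn")
#   async def on_end_of_turn(service, transcript):
#       await turn_log.on_end_of_turn(service, transcript)

# Stand-alone demonstration: invoke the handler directly with no service.
asyncio.run(turn_log.on_end_of_turn(None, "  Hello, world.  "))
print(turn_log.turns)  # ['Hello, world.']
```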
