Overview

AssemblyAISTTService provides real-time speech-to-text capabilities using AssemblyAI’s WebSocket API. It supports streaming transcription with both interim and final results.

Installation

To use AssemblyAISTTService, install the required dependencies:

pip install "pipecat-ai[assemblyai]"

You’ll also need to set up your AssemblyAI API key as an environment variable: ASSEMBLYAI_API_KEY.

You can obtain an AssemblyAI API key by signing up at AssemblyAI.

Configuration

Constructor Parameters

api_key
str
required

Your AssemblyAI API key.

connection_params
AssemblyAIConnectionParams

Connection parameters for the AssemblyAI WebSocket connection. See below for details.

vad_force_turn_endpoint
bool
default:"True"

When true, sends a ForceEndpoint event to AssemblyAI when a UserStoppedSpeakingFrame is received. Requires a VAD (Voice Activity Detection) processor in the pipeline to generate these frames.

language
Language
default:"Language.EN"

Language for transcription. AssemblyAI Streaming currently supports English only.

api_endpoint_base_url
str

Base URL for the WebSocket API endpoint.
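
As a quick sketch, a service configured with these constructor parameters might look like the following (values beyond api_key are shown for illustration only):

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    # Send a ForceEndpoint event when a UserStoppedSpeakingFrame arrives
    vad_force_turn_endpoint=True,
)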

Connection Parameters

sample_rate
int
default:"16000"

The sample rate of the audio stream, in Hz.

encoding
str
default:"pcm_s16le"

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw

formatted_finals
bool
default:"True"

Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.

word_finalization_max_wait_time
int

The maximum time, in milliseconds, to wait for a word to be finalized.

end_of_turn_confidence_threshold
float

The confidence threshold to use when determining if the end of a turn has been reached.

min_end_of_turn_silence_when_confident
int

The minimum amount of silence, in milliseconds, required to detect end of turn when the end-of-turn confidence threshold has been met.

max_turn_silence
int

The maximum amount of silence, in milliseconds, allowed in a turn before end of turn is triggered.
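
As a sketch, these connection parameters can be bundled and passed via the connection_params constructor argument. The import path pipecat.services.assemblyai.models and the specific values used here are assumptions for illustration; check your installed version:

import os

from pipecat.services.assemblyai.models import AssemblyAIConnectionParams  # path assumed
from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        sample_rate=16000,
        encoding="pcm_s16le",
        formatted_finals=True,
        end_of_turn_confidence_threshold=0.7,        # example value
        min_end_of_turn_silence_when_confident=160,  # ms, example value
        max_turn_silence=2400,                       # ms, example value
    ),
)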

Input

The service processes raw audio data with the following requirements:

  • PCM audio format
  • 16-bit depth
  • 16kHz sample rate (default)
  • Single channel (mono)
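
To make these requirements concrete, 20 ms of audio at the default settings works out to 16000 × 0.02 = 320 samples, or 640 bytes of 16-bit mono PCM:

SAMPLE_RATE = 16000   # Hz (default)
BYTES_PER_SAMPLE = 2  # 16-bit depth
NUM_CHANNELS = 1      # mono

chunk_ms = 20
num_samples = SAMPLE_RATE * chunk_ms // 1000                 # 320 samples
chunk_bytes = num_samples * BYTES_PER_SAMPLE * NUM_CHANNELS  # 640 bytes
silence = b"\x00\x00" * num_samples  # one 20 ms chunk of PCM silence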

Output Frames

The service produces two types of frames during transcription:

TranscriptionFrame

Generated for final transcriptions, containing:

text
string

Transcribed text

user_id
string

User identifier

timestamp
string

ISO 8601 formatted timestamp

language
Language

Transcription language

InterimTranscriptionFrame

Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
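
To give a sense of how these frames are consumed downstream, here is a hypothetical pass-through processor that logs transcripts as they arrive. It is a sketch assuming pipecat's standard frame and processor modules:

from pipecat.frames.frames import InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Hypothetical processor that prints transcription frames."""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TranscriptionFrame):
            print(f"[final {frame.timestamp}] {frame.user_id}: {frame.text}")
        elif isinstance(frame, InterimTranscriptionFrame):
            print(f"[interim] {frame.text}")

        # Always forward the frame so the rest of the pipeline sees it
        await self.push_frame(frame, direction)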

Methods

See the STT base class methods for additional functionality.

Language Support

AssemblyAI Streaming STT currently only supports English.

Usage Example

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Configure service
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Frame Flow

Audio frames flow into the service, which emits InterimTranscriptionFrame objects while the user is speaking and a TranscriptionFrame once the turn is finalized. Errors are reported automatically (see Notes).

Metrics Support

The service collects processing metrics:

  • Time to First Byte (TTFB)
  • Processing duration
  • Connection status
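
A sketch of enabling metrics collection at the task level, assuming the standard PipelineTask and PipelineParams setup from pipecat:

from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,  # emit TTFB and processing-duration metrics
    ),
)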

Notes

  • Currently supports English-only real-time transcription
  • Handles WebSocket connection management
  • Provides both interim and final transcriptions
  • Thread-safe processing with proper event loop handling
  • Automatic error handling and reporting
  • Manages connection lifecycle