AssemblyAI
Speech-to-text service implementation using AssemblyAI’s real-time transcription API
Overview
AssemblyAISTTService provides real-time speech-to-text capabilities using AssemblyAI’s WebSocket API. It supports streaming transcription with both interim and final results.
Installation
To use AssemblyAISTTService, install the required dependencies:
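The install command itself is missing from this page. Assuming the service ships with the pipecat-ai package under an assemblyai extra (an assumption based on common packaging conventions, not confirmed here), setup might look like:

```shell
# Install the package with the AssemblyAI extra
# (package name and extra are assumptions -- check your package index)
pip install "pipecat-ai[assemblyai]"

# Make your AssemblyAI API key available as an environment variable
export ASSEMBLYAI_API_KEY=your-api-key-here
```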
You’ll also need to set your AssemblyAI API key as the ASSEMBLYAI_API_KEY environment variable.
You can obtain an AssemblyAI API key by signing up at AssemblyAI.
Configuration
Constructor Parameters
Your AssemblyAI API key.
Connection parameters for the AssemblyAI WebSocket connection. See below for details.
When true, sends a ForceEndpoint event to AssemblyAI when a UserStoppedSpeakingFrame is received. Requires a VAD (Voice Activity Detection) processor in the pipeline to generate these frames.
Language for transcription. AssemblyAI currently supports English only for streaming transcription.
Base URL for the WebSocket API endpoint.
Connection Parameters
The sample rate of the audio stream
The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw
Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.
The maximum amount of time, in milliseconds, to wait for a word to be finalized.
The confidence threshold to use when determining if the end of a turn has been reached.
The minimum amount of silence required to detect end of turn when confident.
The maximum amount of silence allowed in a turn before end of turn is triggered.
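As an illustration of how connection parameters like the sample rate and encoding might be serialized onto the WebSocket endpoint as a query string, here is a small sketch. The helper name, parameter names, and URL are hypothetical, not the service's actual API:

```python
from urllib.parse import urlencode

def build_ws_url(base_url: str,
                 sample_rate: int = 16000,
                 encoding: str = "pcm_s16le",
                 format_turns: bool = False) -> str:
    """Append connection parameters to a WebSocket endpoint as a query
    string. All names here are illustrative, not the library's API."""
    if encoding not in ("pcm_s16le", "pcm_mulaw"):
        raise ValueError(f"unsupported encoding: {encoding}")
    query = urlencode({
        "sample_rate": sample_rate,
        "encoding": encoding,
        "format_turns": str(format_turns).lower(),
    })
    return f"{base_url}?{query}"
```

With the defaults above, `build_ws_url("wss://example.invalid/ws")` yields a URL requesting 16 kHz pcm_s16le audio with turn formatting disabled.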
Input
The service processes raw audio data with the following requirements:
- PCM audio format
- 16-bit depth
- 16kHz sample rate (default)
- Single channel (mono)
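These requirements imply a fixed byte rate for the audio you stream in. The arithmetic below is a sanity check for sizing audio chunks, not part of the service's API:

```python
def bytes_per_second(sample_rate: int = 16000,
                     bit_depth: int = 16,
                     channels: int = 1) -> int:
    """Raw PCM byte rate for the default input format:
    16 kHz, 16-bit, mono."""
    return sample_rate * (bit_depth // 8) * channels

# At the defaults, the stream is 32,000 bytes per second,
# so a 20 ms audio chunk is 640 bytes.
```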
Output Frames
The service produces two types of frames during transcription:
TranscriptionFrame
Generated for final transcriptions, containing:
- Transcribed text
- User identifier
- ISO 8601 formatted timestamp
- Transcription language
InterimTranscriptionFrame
Generated during ongoing speech, containing the same fields as TranscriptionFrame but with preliminary results.
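A minimal sketch of what these two frame types carry is shown below. The field names and the inheritance relationship are illustrative assumptions, not the library's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class TranscriptionFrame:
    """Final transcription result (field names are illustrative)."""
    text: str        # transcribed text
    user_id: str     # user identifier
    timestamp: str   # ISO 8601 formatted timestamp
    language: str    # transcription language

@dataclass
class InterimTranscriptionFrame(TranscriptionFrame):
    """Preliminary result emitted while the user is still speaking;
    carries the same fields as TranscriptionFrame."""
    pass
```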
Methods
See the STT base class methods for additional functionality.
Language Support
AssemblyAI Streaming STT currently only supports English.
Usage Example
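The example code that belongs here is missing from this page. The sketch below shows roughly how the service might be constructed and placed in a pipeline; the module path, parameter names, and pipeline wiring are assumptions based on the surrounding docs and may not match the actual library, and running it requires valid AssemblyAI credentials:

```python
import os

# Module path and constructor parameters are assumptions -- verify
# against the library source before use.
from pipecat.services.assemblyai import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    # Send a ForceEndpoint event when a UserStoppedSpeakingFrame arrives
    # (requires a VAD processor in the pipeline, per the docs above).
    vad_force_turn_endpoint=True,
)

# The service then sits between the transport's audio input and any
# downstream processors that consume transcription frames, e.g.:
# Pipeline([transport.input(), stt, context_aggregator, ...])
```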
Frame Flow
Metrics Support
The service collects processing metrics:
- Time to First Byte (TTFB)
- Processing duration
- Connection status
Notes
- Currently supports English-only real-time transcription
- Handles WebSocket connection management
- Provides both interim and final transcriptions
- Thread-safe processing with proper event loop handling
- Automatic error handling and reporting
- Manages connection lifecycle