Overview

AzureSTTService provides real-time speech recognition using Azure’s Cognitive Services Speech SDK, with continuous recognition, extensive language coverage, and configurable audio processing.

Installation

To use Azure Speech services, install the required dependency:

pip install "pipecat-ai[azure]"

You’ll also need to set up your Azure credentials as environment variables:

  • AZURE_API_KEY (or AZURE_SPEECH_API_KEY)
  • AZURE_REGION (or AZURE_SPEECH_REGION)

Get your API key and region from the Azure Portal by creating a Speech Services resource.
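Because either variable name is accepted, credential lookup can be sketched as below. This is a minimal illustration; the exact fallback order inside the service is an assumption, so check your installed version for the precise behavior.

```python
import os

def resolve_azure_credentials():
    """Read Azure Speech credentials from either supported variable name.

    The fallback order shown here is an assumption for illustration only.
    """
    api_key = os.getenv("AZURE_API_KEY") or os.getenv("AZURE_SPEECH_API_KEY")
    region = os.getenv("AZURE_REGION") or os.getenv("AZURE_SPEECH_REGION")
    if not api_key or not region:
        raise RuntimeError(
            "Set AZURE_API_KEY/AZURE_REGION (or the *_SPEECH_* variants)"
        )
    return api_key, region
```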

Frames

Input

  • InputAudioRawFrame - Raw PCM audio data (configurable sample rate, mono or stereo)
  • STTUpdateSettingsFrame - Runtime transcription configuration updates
  • STTMuteFrame - Mute audio input for transcription

Output

  • TranscriptionFrame - Final transcription results
  • ErrorFrame - Connection or processing errors
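To illustrate the flow, here is a pure-Python sketch of a downstream consumer that collects final results and surfaces errors. The frame classes are stubbed with dataclasses rather than imported from pipecat, and their fields are an approximation (the real TranscriptionFrame also carries user, timestamp, and language metadata).

```python
from dataclasses import dataclass

@dataclass
class TranscriptionFrame:
    # Stand-in for pipecat's frame, for illustration only.
    text: str

@dataclass
class ErrorFrame:
    # Stand-in for pipecat's error frame.
    error: str

def collect_transcripts(frames):
    """Accumulate final transcription text and collect any errors."""
    transcript, errors = [], []
    for frame in frames:
        if isinstance(frame, TranscriptionFrame):
            transcript.append(frame.text)
        elif isinstance(frame, ErrorFrame):
            errors.append(frame.error)
    return " ".join(transcript), errors
```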

Language Support

Azure Speech STT supports extensive language coverage with regional variants:

Common languages:

  • Language.EN_US - English (US) - en-US
  • Language.ES - Spanish - es-ES
  • Language.FR - French - fr-FR
  • Language.DE - German - de-DE
  • Language.IT - Italian - it-IT
  • Language.JA - Japanese - ja-JP
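If you need the underlying BCP-47 codes (for logging or interop), the list above corresponds to the mapping below. This is a reference sketch only; in practice the service derives the code from the `Language` enum internally.

```python
# BCP-47 codes for the common Language enum values listed above.
AZURE_LANGUAGE_CODES = {
    "EN_US": "en-US",
    "ES": "es-ES",
    "FR": "fr-FR",
    "DE": "de-DE",
    "IT": "it-IT",
    "JA": "ja-JP",
}
```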

Usage Example

Basic Configuration

Initialize the AzureSTTService and use it in a pipeline:

import os

from pipecat.services.azure.stt import AzureSTTService
from pipecat.transcriptions.language import Language

# Basic configuration
stt = AzureSTTService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    language=Language.EN_US,
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Dynamic Configuration

Update settings at runtime by pushing an STTUpdateSettingsFrame to the AzureSTTService:

from pipecat.frames.frames import STTUpdateSettingsFrame

await task.queue_frame(STTUpdateSettingsFrame(
    language=Language.FR,
))

Metrics

The service provides:

  • Time to First Byte (TTFB) - Latency from audio input to first transcription
  • Processing Duration - Total time spent processing audio

Learn how to enable Metrics in your Pipeline.
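One way to enable metrics collection is through the task parameters, sketched below. This assumes pipecat's PipelineParams API; field names may vary across versions, so treat it as a starting point rather than a definitive configuration.

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Enable metrics on the task that wraps the pipeline.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # emit TTFB / processing-duration metrics
        enable_usage_metrics=True,  # emit usage metrics where services report them
    ),
)
```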

Additional Notes

  • Continuous Recognition: Uses Azure’s continuous recognition mode for real-time processing
  • Audio Flexibility: Supports configurable sample rates and both mono/stereo input
  • Resource Management: Automatic cleanup of Azure speech recognizer and audio streams
  • Threading: Thread-safe operation with proper async event loop handling using asyncio.run_coroutine_threadsafe
  • Regional Support: Requires Azure region specification for optimal performance and compliance
  • Connection Management: Handles Azure SDK connection lifecycle with proper start/stop/cancel operations
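The threading note above follows the standard asyncio pattern: SDK callbacks fire on a worker thread, and results must be marshaled onto the event loop with `asyncio.run_coroutine_threadsafe`. A self-contained illustration of that pattern (no Azure SDK involved; the callback name is hypothetical):

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def sdk_callback(text: str):
        # Runs on a non-asyncio thread (like an Azure recognizer callback);
        # run_coroutine_threadsafe safely hands the result to the loop.
        asyncio.run_coroutine_threadsafe(queue.put(text), loop)

    worker = threading.Thread(target=sdk_callback, args=("hello",))
    worker.start()
    # The loop stays free here, so the scheduled coroutine can run.
    text = await asyncio.wait_for(queue.get(), timeout=5)
    worker.join()
    return text

result = asyncio.run(main())
```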