Overview

Azure Cognitive Services provides high-quality text-to-speech synthesis with two implementations:
  • AzureTTSService (WebSocket-based streaming)
  • AzureHttpTTSService (HTTP-based batch synthesis)
AzureTTSService is recommended for real-time applications requiring low latency and streaming capabilities.

Installation

To use Azure services, install the required dependencies:
pip install "pipecat-ai[azure]"
You’ll also need to set up your Azure credentials as environment variables:
  • AZURE_API_KEY (or AZURE_SPEECH_API_KEY)
  • AZURE_REGION (or AZURE_SPEECH_REGION)
Get your API key and region from the Azure Portal under Cognitive Services > Speech.
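
As a minimal sketch of reading these credentials, the helper below checks both names listed above for each setting and fails early if neither is set. The function names `first_env` and `load_azure_credentials` are hypothetical and not part of Pipecat, and the preference for the `AZURE_SPEECH_*` names is an assumption made for illustration.

```python
import os
from typing import Tuple


def first_env(*names: str) -> str:
    """Return the first environment variable in `names` that is set and non-empty."""
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    raise RuntimeError(f"Set one of: {', '.join(names)}")


def load_azure_credentials() -> Tuple[str, str]:
    """Read the Azure Speech API key and region, accepting either variable name."""
    # Assumption: prefer the AZURE_SPEECH_* names, fall back to the shorter ones.
    api_key = first_env("AZURE_SPEECH_API_KEY", "AZURE_API_KEY")
    region = first_env("AZURE_SPEECH_REGION", "AZURE_REGION")
    return api_key, region
```

The returned pair can then be passed as the `api_key` and `region` arguments shown in the usage examples.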

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data (PCM format)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - Azure API or processing errors
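
Because the audio arrives as raw PCM, it has no container header. The sketch below shows one way to wrap collected PCM bytes in a WAV container using only the standard library. It works on plain `bytes` chunks instead of Pipecat frames, the helper name `pcm_chunks_to_wav` is hypothetical, and it assumes 16-bit mono samples at the default 24000 Hz rate.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(
    chunks: Iterable[bytes], sample_rate: int = 24000, num_channels: int = 1
) -> bytes:
    """Wrap raw 16-bit PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(num_channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

In a pipeline, the chunks would be the audio payloads gathered between a TTSStartedFrame and the matching TTSStoppedFrame.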

Service Comparison

Feature    | AzureTTSService (Streaming) | AzureHttpTTSService (HTTP)
Streaming  | ✅ Real-time chunks          | ❌ Single audio block
Latency    | 🚀 Low                       | 📈 Higher
Complexity | ⚠️ WebSocket management      | ✅ Simple HTTP
Connection | WebSocket-based             | HTTP-based

Language Support

Common languages supported include:
  • Language.EN_US - English (US)
  • Language.EN_GB - English (UK)
  • Language.FR - French
  • Language.DE - German
  • Language.ES - Spanish
  • Language.IT - Italian

Supported Sample Rates

Azure supports multiple sample rates with automatic format selection:
  • 8000 Hz: Raw8Khz16BitMonoPcm
  • 16000 Hz: Raw16Khz16BitMonoPcm
  • 22050 Hz: Raw22050Hz16BitMonoPcm
  • 24000 Hz: Raw24Khz16BitMonoPcm (default)
  • 44100 Hz: Raw44100Hz16BitMonoPcm
  • 48000 Hz: Raw48Khz16BitMonoPcm
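
The mapping above can be pictured as a simple lookup table. The sketch below is illustrative only and is not the service's internal code; the names `AZURE_PCM_FORMATS` and `output_format_for` are hypothetical, and raising an error for an unlisted rate is an assumption, since the source does not say how unsupported rates are handled.

```python
from typing import Optional

# Sample rate (Hz) -> Azure raw PCM output format name, as listed above.
AZURE_PCM_FORMATS = {
    8000: "Raw8Khz16BitMonoPcm",
    16000: "Raw16Khz16BitMonoPcm",
    22050: "Raw22050Hz16BitMonoPcm",
    24000: "Raw24Khz16BitMonoPcm",
    44100: "Raw44100Hz16BitMonoPcm",
    48000: "Raw48Khz16BitMonoPcm",
}
DEFAULT_SAMPLE_RATE = 24000


def output_format_for(sample_rate: Optional[int] = None) -> str:
    """Return the format name for a sample rate, using 24000 Hz when none is given."""
    rate = DEFAULT_SAMPLE_RATE if sample_rate is None else sample_rate
    try:
        return AZURE_PCM_FORMATS[rate]
    except KeyError:
        # Assumption: reject rates that are not in the documented list.
        supported = ", ".join(str(r) for r in sorted(AZURE_PCM_FORMATS))
        raise ValueError(f"Unsupported sample rate {rate}; choose one of {supported}")
```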

Usage Example

Initialize the AzureTTSService and use it in a pipeline:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.azure.tts import AzureTTSService
from pipecat.transcriptions.language import Language

# Configure streaming service
tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-JennyNeural",
    params=AzureTTSService.InputParams(
        language=Language.EN_US,
        rate="1.1",
        style="cheerful",
        style_degree="1.5"
    )
)

# Use in pipeline (transport, stt, llm, and context_aggregator are created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

HTTP Service

Initialize the AzureHttpTTSService and use it in a pipeline:
import os

from pipecat.services.azure.tts import AzureHttpTTSService
from pipecat.transcriptions.language import Language

# For simpler, non-streaming use cases
http_tts = AzureHttpTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-AriaNeural",
    params=AzureHttpTTSService.InputParams(
        language=Language.EN_US,
        rate="1.05"
    )
)

SSML Features

Azure TTS supports rich SSML customization through parameters:
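import os

from pipecat.services.azure.tts import AzureTTSService
from pipecat.transcriptions.language import Language

# Advanced SSML configuration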
params = AzureTTSService.InputParams(
    language=Language.EN_US,
    style="cheerful",           # Speaking style
    style_degree="2.0",         # Style intensity (0.01-2.0)
    role="YoungAdultFemale",    # Voice role
    rate="1.2",                 # Speech rate
    pitch="+2st",               # Pitch adjustment
    volume="loud",              # Volume level
    emphasis="strong"           # Text emphasis
)

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region="eastus",
    voice="en-US-JennyNeural",
    params=params
)
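
To make the effect of these parameters concrete, the sketch below builds an SSML document by hand using Azure's `mstts:express-as` and standard `prosody` elements. It is an illustration of the kind of markup involved, not the service's actual implementation; `build_ssml` is a hypothetical helper, and `role` and `emphasis` are left out for brevity.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str,
    language: str = "en-US",
    style: Optional[str] = None,
    style_degree: Optional[str] = None,
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    volume: Optional[str] = None,
) -> str:
    """Build an SSML string, wrapping the text only in the elements that are needed."""
    body = escape(text)  # escape &, <, > so the text is safe inside XML

    # <prosody> carries rate, pitch, and volume; skip it when none are set.
    prosody = {"rate": rate, "pitch": pitch, "volume": volume}
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in prosody.items() if v)
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"

    # <mstts:express-as> carries the speaking style and its intensity.
    if style:
        degree = f" styledegree={quoteattr(style_degree)}" if style_degree else ""
        body = f"<mstts:express-as style={quoteattr(style)}{degree}>{body}</mstts:express-as>"

    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(language)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )
```

For example, `build_ssml("Hello!", "en-US-JennyNeural", style="cheerful", rate="1.2")` nests the text inside a `prosody` element, then an `express-as` element, then the `voice` element.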

Dynamic Configuration

To update settings at runtime, queue a TTSUpdateSettingsFrame for the AzureTTSService:
from pipecat.frames.frames import TTSUpdateSettingsFrame

# The frame carries a `settings` mapping; `task` is your running PipelineTask
await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "en-US-AriaNeural"})
)

Metrics

Both services report the following metrics:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
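
To show what TTFB and processing duration mean in practice, the sketch below times a stand-in audio stream. It does not use Pipecat's metrics machinery; `measure_stream` and `fake_tts` are hypothetical names, and the fake generator only simulates a service that waits briefly before yielding audio chunks.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def measure_stream(chunks: AsyncIterator[bytes]) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, total_seconds, total_bytes) for an async audio stream."""
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    async for chunk in chunks:
        if ttfb is None:
            # Time to first byte: delay until the first chunk arrives.
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    duration = time.perf_counter() - start
    # If the stream was empty, report the full duration as the TTFB.
    return (ttfb if ttfb is not None else duration), duration, total_bytes


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    """Stand-in for a TTS service: short startup delay, then three audio chunks."""
    await asyncio.sleep(0.05)
    for _ in range(3):
        yield b"\x00" * 480
        await asyncio.sleep(0.01)


if __name__ == "__main__":
    text = "Hello from Azure"
    ttfb, duration, size = asyncio.run(measure_stream(fake_tts(text)))
    print(f"TTFB {ttfb:.3f}s, duration {duration:.3f}s, {size} bytes, {len(text)} characters")
```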

Additional Notes

  • Neural Voices: Use neural voices (names ending in “Neural”) for the highest quality
  • Regional Availability: Some voices and features are available only in certain regions
  • Automatic SSML: The service constructs SSML automatically from the parameters you set
  • Audio Format: The output format is selected automatically from the sample rate
  • Voice Matching: Make sure the selected voice matches the specified language
  • Streaming Recommended: Use AzureTTSService for real-time applications requiring low latency
  • Connection Management: The streaming service handles the WebSocket lifecycle automatically
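
The neural-voice and voice-matching notes can be checked with a small heuristic before constructing the service. This sketch assumes voice names follow the `<locale>-<Name>Neural` pattern seen in the examples (such as en-US-JennyNeural) and that the language is given as a code such as "en-US" or "fr". The helper names are hypothetical, and the check is conservative: it would reject a multilingual voice used with a language outside its own locale.

```python
def is_neural_voice(voice: str) -> bool:
    """True if the voice name ends in 'Neural', e.g. 'en-US-JennyNeural'."""
    return voice.endswith("Neural")


def voice_matches_language(voice: str, language_code: str) -> bool:
    """True if the voice's locale prefix matches a code like 'en-US' or 'fr'."""
    voice_parts = voice.lower().split("-")
    lang_parts = language_code.lower().split("-")
    # Compare only as many leading parts as the language code provides,
    # so 'fr' matches 'fr-FR-DeniseNeural' and 'en-GB' does not match 'en-US-...'.
    return voice_parts[: len(lang_parts)] == lang_parts
```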