Azure

Overview

Azure Cognitive Services provides high-quality text-to-speech synthesis with two implementations:

AzureTTSService (WebSocket-based streaming)
AzureHttpTTSService (HTTP-based batch synthesis).

AzureTTSService is recommended for real-time applications requiring low latency and streaming capabilities.

API Reference

Complete API documentation and method details

Azure Speech Docs

Official Azure Speech Services documentation

Example Code

Working example with streaming synthesis

Installation

To use Azure services, install the required dependencies:

pip install "pipecat-ai[azure]"

You’ll also need to set up your Azure credentials as environment variables:

AZURE_API_KEY (or AZURE_SPEECH_API_KEY)
AZURE_REGION (or AZURE_SPEECH_REGION)

Get your API key and region from the Azure Portal under Cognitive Services > Speech.

Frames

Input

TextFrame - Text content to synthesize into speech
TTSSpeakFrame - Text that should be spoken immediately
TTSUpdateSettingsFrame - Runtime configuration updates
LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

TTSStartedFrame - Signals start of synthesis
TTSAudioRawFrame - Generated audio data (PCM format)
TTSStoppedFrame - Signals completion of synthesis
ErrorFrame - Azure API or processing errors

Service Comparison

Feature	AzureTTSService (Streaming)	AzureHttpTTSService (HTTP)
Streaming	✅ Real-time chunks	❌ Single audio block
Latency	🚀 Low	📈 Higher
Complexity	⚠️ WebSocket management	✅ Simple HTTP
Connection	WebSocket-based	HTTP-based

Language Support

View All Supported Languages

Language Code	Description	Service Code
`Language.BG`	Bulgarian	`bg-BG`
`Language.CA`	Catalan	`ca-ES`
`Language.ZH`	Chinese (Simplified)	`zh-CN`
`Language.ZH_TW`	Chinese (Traditional)	`zh-TW`
`Language.CS`	Czech	`cs-CZ`
`Language.DA`	Danish	`da-DK`
`Language.NL`	Dutch (Netherlands)	`nl-NL`
`Language.NL_BE`	Dutch (Belgium)	`nl-BE`
`Language.EN`	English (US)	`en-US`
`Language.EN_US`	English (US)	`en-US`
`Language.EN_AU`	English (Australia)	`en-AU`
`Language.EN_GB`	English (UK)	`en-GB`
`Language.EN_NZ`	English (New Zealand)	`en-NZ`
`Language.EN_IN`	English (India)	`en-IN`
`Language.ET`	Estonian	`et-EE`
`Language.FI`	Finnish	`fi-FI`
`Language.FR`	French (France)	`fr-FR`
`Language.FR_CA`	French (Canada)	`fr-CA`
`Language.DE`	German (Germany)	`de-DE`
`Language.DE_CH`	German (Switzerland)	`de-CH`
`Language.EL`	Greek	`el-GR`
`Language.HI`	Hindi	`hi-IN`
`Language.HU`	Hungarian	`hu-HU`
`Language.ID`	Indonesian	`id-ID`
`Language.IT`	Italian	`it-IT`
`Language.JA`	Japanese	`ja-JP`
`Language.KO`	Korean	`ko-KR`
`Language.LV`	Latvian	`lv-LV`
`Language.LT`	Lithuanian	`lt-LT`
`Language.MS`	Malay	`ms-MY`
`Language.NO`	Norwegian	`nb-NO`
`Language.PL`	Polish	`pl-PL`
`Language.PT`	Portuguese (Portugal)	`pt-PT`
`Language.PT_BR`	Portuguese (Brazil)	`pt-BR`
`Language.RO`	Romanian	`ro-RO`
`Language.RU`	Russian	`ru-RU`
`Language.SK`	Slovak	`sk-SK`
`Language.ES`	Spanish	`es-ES`
`Language.SV`	Swedish	`sv-SE`
`Language.TH`	Thai	`th-TH`
`Language.TR`	Turkish	`tr-TR`
`Language.UK`	Ukrainian	`uk-UA`
`Language.VI`	Vietnamese	`vi-VN`

Common languages supported include:

Language.EN_US - English (US)
Language.EN_GB - English (UK)
Language.FR - French
Language.DE - German
Language.ES - Spanish
Language.IT - Italian

Supported Sample Rates

Azure supports multiple sample rates with automatic format selection:

8000 Hz: Raw8Khz16BitMonoPcm
16000 Hz: Raw16Khz16BitMonoPcm
22050 Hz: Raw22050Hz16BitMonoPcm
24000 Hz: Raw24Khz16BitMonoPcm (default)
44100 Hz: Raw44100Hz16BitMonoPcm
48000 Hz: Raw48Khz16BitMonoPcm

Usage Example

Streaming Service (Recommended)

Initialize the AzureTTSService and use it in a pipeline:

from pipecat.services.azure.tts import AzureTTSService
from pipecat.transcriptions.language import Language
import os

# Configure streaming service
tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-JennyNeural",
    params=AzureTTSService.InputParams(
        language=Language.EN_US,
        rate="1.1",
        style="cheerful",
        style_degree="1.5"
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

HTTP Service

Initialize the AzureHttpTTSService and use it in a pipeline:

from pipecat.services.azure.tts import AzureHttpTTSService

# For simpler, non-streaming use cases
http_tts = AzureHttpTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-AriaNeural",
    params=AzureHttpTTSService.InputParams(
        language=Language.EN_US,
        rate="1.05"
    )
)

SSML Features

Azure TTS supports rich SSML customization through parameters:

# Advanced SSML configuration
params = AzureTTSService.InputParams(
    language=Language.EN_US,
    style="cheerful",           # Speaking style
    style_degree="2.0",         # Style intensity (0.01-2.0)
    role="YoungAdultFemale",    # Voice role
    rate="1.2",                 # Speech rate
    pitch="+2st",               # Pitch adjustment
    volume="loud",              # Volume level
    emphasis="strong"           # Text emphasis
)

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region="eastus",
    voice="en-US-JennyNeural",
    params=params
)

Dynamic Configuration

Make settings updates by pushing a TTSUpdateSettingsFrame for the AzureTTSService:

from pipecat.frames.frames import TTSUpdateSettingsFrame

await task.queue_frame(TTSUpdateSettingsFrame(
    voice="en-US-AriaNeural",
  )
)

Metrics

Both services provide comprehensive metrics:

Time to First Byte (TTFB) - Latency from text input to first audio
Processing Duration - Total synthesis time
Character Usage - Text processed for billing

Learn how to enable Metrics in your Pipeline.

Additional Notes

Neural Voices: Use neural voices (ending in “Neural”) for highest quality
Regional Availability: Some voices and features may be region-specific
SSML Automatic: Service automatically constructs SSML based on parameters
Audio Format: Automatic format selection based on sample rate
Voice Matching: Ensure voice selection matches the specified language
Streaming Recommended: Use AzureTTSService for real-time applications requiring low latency
Connection Management: WebSocket lifecycle handled automatically in streaming service

API Reference

Services

Utilities

Frameworks

Pipeline

Overview

API Reference

Azure Speech Docs

Example Code

Installation

Frames

Input

Output

Service Comparison

Language Support

Supported Sample Rates

Usage Example

Streaming Service (Recommended)

HTTP Service

SSML Features

Dynamic Configuration

Metrics

Additional Notes

API Reference

Services

Utilities

Frameworks

Pipeline

​Overview

API Reference

Azure Speech Docs

Example Code

​Installation

​Frames

​Input

​Output

​Service Comparison

​Language Support

​Supported Sample Rates

​Usage Example

​Streaming Service (Recommended)

​HTTP Service

​SSML Features

​Dynamic Configuration

​Metrics

​Additional Notes

Overview

Installation

Frames

Input

Output

Service Comparison

Language Support

Supported Sample Rates

Usage Example

Streaming Service (Recommended)

HTTP Service

SSML Features

Dynamic Configuration

Metrics

Additional Notes