Overview
Azure Cognitive Services provides high-quality text-to-speech synthesis with two implementations:
- AzureTTSService (WebSocket-based streaming)
- AzureHttpTTSService (HTTP-based batch synthesis)
AzureTTSService is recommended for real-time applications requiring low latency and streaming capabilities.
- API Reference - Complete API documentation and method details
- Azure Speech Docs - Official Azure Speech Services documentation
- Example Code - Working example with streaming synthesis
Installation
To use Azure services, install the required dependencies, then provide your credentials through these environment variables:
- AZURE_API_KEY (or AZURE_SPEECH_API_KEY)
- AZURE_REGION (or AZURE_SPEECH_REGION)
Get your API key and region from the Azure Portal
under Cognitive Services > Speech.
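As a minimal sketch of reading these variables in application code, the helper below checks each primary name and then its alternate. The function name `resolve_azure_credentials` and the lookup order are assumptions made for illustration; they are not part of the service.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_azure_credentials(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_key, region), trying the primary name before the alternate.

    Hypothetical helper: the precedence order is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("AZURE_API_KEY") or env.get("AZURE_SPEECH_API_KEY")
    region = env.get("AZURE_REGION") or env.get("AZURE_SPEECH_REGION")

    # Report every missing value at once instead of failing on the first one.
    missing = [
        name
        for name, value in (("AZURE_API_KEY", api_key), ("AZURE_REGION", region))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing Azure credentials: " + ", ".join(missing))
    return api_key, region
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later during synthesis.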
Frames
Input
- TextFrame - Text content to synthesize into speech
- TTSSpeakFrame - Text that should be spoken immediately
- TTSUpdateSettingsFrame - Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries
Output
- TTSStartedFrame - Signals start of synthesis
- TTSAudioRawFrame - Generated audio data (PCM format)
- TTSStoppedFrame - Signals completion of synthesis
- ErrorFrame - Azure API or processing errors
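To illustrate how a downstream consumer might treat the output frames, the sketch below collects the PCM bytes that arrive between a start and a stop signal. The classes are stand-ins that only borrow the frame names listed above; they are not the framework's real frame types, and `collect_utterance` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, List


# Stand-in classes for illustration only; NOT the framework's real frames.
@dataclass
class TTSStartedFrame:
    pass


@dataclass
class TTSAudioRawFrame:
    audio: bytes


@dataclass
class TTSStoppedFrame:
    pass


@dataclass
class ErrorFrame:
    error: str


def collect_utterance(frames: Iterable[object]) -> bytes:
    """Concatenate PCM bytes seen between a start frame and a stop frame."""
    chunks: List[bytes] = []
    started = False
    for frame in frames:
        if isinstance(frame, ErrorFrame):
            raise RuntimeError(frame.error)
        if isinstance(frame, TTSStartedFrame):
            started = True
        elif isinstance(frame, TTSAudioRawFrame) and started:
            chunks.append(frame.audio)
        elif isinstance(frame, TTSStoppedFrame):
            break  # synthesis finished; ignore anything that follows
    return b"".join(chunks)
```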
Service Comparison
Feature | AzureTTSService (Streaming) | AzureHttpTTSService (HTTP) |
---|---|---|
Streaming | ✅ Real-time chunks | ❌ Single audio block |
Latency | 🚀 Low | 📈 Higher |
Complexity | ⚠️ WebSocket management | ✅ Simple HTTP |
Connection | WebSocket-based | HTTP-based |
Language Support
View All Supported Languages
Language Code | Description | Service Code |
---|---|---|
Language.BG | Bulgarian | bg-BG |
Language.CA | Catalan | ca-ES |
Language.ZH | Chinese (Simplified) | zh-CN |
Language.ZH_TW | Chinese (Traditional) | zh-TW |
Language.CS | Czech | cs-CZ |
Language.DA | Danish | da-DK |
Language.NL | Dutch (Netherlands) | nl-NL |
Language.NL_BE | Dutch (Belgium) | nl-BE |
Language.EN | English (US) | en-US |
Language.EN_US | English (US) | en-US |
Language.EN_AU | English (Australia) | en-AU |
Language.EN_GB | English (UK) | en-GB |
Language.EN_NZ | English (New Zealand) | en-NZ |
Language.EN_IN | English (India) | en-IN |
Language.ET | Estonian | et-EE |
Language.FI | Finnish | fi-FI |
Language.FR | French (France) | fr-FR |
Language.FR_CA | French (Canada) | fr-CA |
Language.DE | German (Germany) | de-DE |
Language.DE_CH | German (Switzerland) | de-CH |
Language.EL | Greek | el-GR |
Language.HI | Hindi | hi-IN |
Language.HU | Hungarian | hu-HU |
Language.ID | Indonesian | id-ID |
Language.IT | Italian | it-IT |
Language.JA | Japanese | ja-JP |
Language.KO | Korean | ko-KR |
Language.LV | Latvian | lv-LV |
Language.LT | Lithuanian | lt-LT |
Language.MS | Malay | ms-MY |
Language.NO | Norwegian | nb-NO |
Language.PL | Polish | pl-PL |
Language.PT | Portuguese (Portugal) | pt-PT |
Language.PT_BR | Portuguese (Brazil) | pt-BR |
Language.RO | Romanian | ro-RO |
Language.RU | Russian | ru-RU |
Language.SK | Slovak | sk-SK |
Language.ES | Spanish | es-ES |
Language.SV | Swedish | sv-SE |
Language.TH | Thai | th-TH |
Language.TR | Turkish | tr-TR |
Language.UK | Ukrainian | uk-UA |
Language.VI | Vietnamese | vi-VN |
Common languages include:
- Language.EN_US - English (US)
- Language.EN_GB - English (UK)
- Language.FR - French
- Language.DE - German
- Language.ES - Spanish
- Language.IT - Italian
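The mapping in the table above amounts to a lookup from a Language member to an Azure locale code. A minimal sketch with a handful of rows copied from the table follows; `SERVICE_CODES` and `to_service_code` are illustrative names, and returning None for an unmapped language is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional

# A few rows copied from the table above, keyed by Language member name.
SERVICE_CODES = {
    "EN": "en-US",
    "EN_GB": "en-GB",
    "FR": "fr-FR",
    "DE": "de-DE",
    "ES": "es-ES",
    "NO": "nb-NO",  # note: the service code differs from the member name
    "ZH": "zh-CN",
}


def to_service_code(language_name: str) -> Optional[str]:
    """Return the Azure locale for a Language member name, or None if unmapped."""
    return SERVICE_CODES.get(language_name.upper())
```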
Supported Sample Rates
Azure supports multiple sample rates with automatic format selection:
- 8000 Hz: Raw8Khz16BitMonoPcm
- 16000 Hz: Raw16Khz16BitMonoPcm
- 22050 Hz: Raw22050Hz16BitMonoPcm
- 24000 Hz: Raw24Khz16BitMonoPcm (default)
- 44100 Hz: Raw44100Hz16BitMonoPcm
- 48000 Hz: Raw48Khz16BitMonoPcm
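The format selection above can be pictured as a table lookup with 24000 Hz as the default. The sketch below uses plain strings for the format names and raises on an unsupported rate; that error behavior is a choice made here, and the real service may handle unknown rates differently. Since every listed format is 16-bit mono PCM, the duration of a chunk follows directly from its byte length.

```python
from typing import Optional

SAMPLE_RATE_FORMATS = {
    8000: "Raw8Khz16BitMonoPcm",
    16000: "Raw16Khz16BitMonoPcm",
    22050: "Raw22050Hz16BitMonoPcm",
    24000: "Raw24Khz16BitMonoPcm",
    44100: "Raw44100Hz16BitMonoPcm",
    48000: "Raw48Khz16BitMonoPcm",
}
DEFAULT_SAMPLE_RATE = 24000


def output_format_for(sample_rate: Optional[int] = None) -> str:
    """Pick the format name for a sample rate (hypothetical helper)."""
    rate = DEFAULT_SAMPLE_RATE if sample_rate is None else sample_rate
    if rate not in SAMPLE_RATE_FORMATS:
        raise ValueError(f"Unsupported sample rate: {rate} Hz")
    return SAMPLE_RATE_FORMATS[rate]


def pcm_duration_seconds(num_bytes: int, sample_rate: int) -> float:
    """Duration of 16-bit mono PCM audio: two bytes per sample, one channel."""
    return num_bytes / (sample_rate * 2)
```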
Usage Example
Streaming Service (Recommended)
Initialize the AzureTTSService and use it in a pipeline.
HTTP Service
Initialize the AzureHttpTTSService and use it in a pipeline.
SSML Features
Azure TTS supports rich SSML customization through parameters.
Dynamic Configuration
Update settings at runtime by pushing a TTSUpdateSettingsFrame to the AzureTTSService.
Metrics
Both services provide comprehensive metrics:
- Time to First Byte (TTFB) - Latency from text input to first audio
- Processing Duration - Total synthesis time
- Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
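To make the first two metrics concrete, here is a rough sketch of how TTFB and processing duration could be measured around any iterable of audio chunks. `measure_synthesis` is a hypothetical helper, not the built-in metrics mechanism; the injectable `clock` parameter exists only so the timing can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_synthesis(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], float, int]:
    """Consume audio chunks and return (ttfb, total_duration, total_bytes).

    ttfb is None when no non-empty chunk was ever received.
    """
    start = clock()
    ttfb: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            # First non-empty chunk marks the time to first byte.
            ttfb = clock() - start
        total_bytes += len(chunk)
    return ttfb, clock() - start, total_bytes
```

Character usage, the third metric, would simply be the length of the text submitted for synthesis.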
Additional Notes
- Neural Voices: Use neural voices (ending in “Neural”) for highest quality
- Regional Availability: Some voices and features may be region-specific
- SSML Automatic: Service automatically constructs SSML based on parameters
- Audio Format: Automatic format selection based on sample rate
- Voice Matching: Ensure voice selection matches the specified language
- Streaming Recommended: Use AzureTTSService for real-time applications requiring low latency
- Connection Management: WebSocket lifecycle handled automatically in streaming service
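The SSML Automatic note above says the service builds SSML from parameters. As a rough illustration of what such construction might look like, the sketch below wraps text in voice, prosody, and express-as elements using Azure's SSML vocabulary. `build_ssml` and its parameter names are assumptions for this example; the markup the service actually generates may differ.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str,
    language: str = "en-US",
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    style: Optional[str] = None,
) -> str:
    """Build an Azure-style SSML document (hypothetical helper)."""
    body = escape(text)  # escape &, <, > so the text cannot break the XML

    # Only emit <prosody> when at least one prosody attribute is set.
    prosody = [(k, v) for k, v in (("rate", rate), ("pitch", pitch)) if v]
    if prosody:
        attrs = " ".join(f"{k}={quoteattr(v)}" for k, v in prosody)
        body = f"<prosody {attrs}>{body}</prosody>"

    # Speaking styles use the mstts extension namespace.
    if style:
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"

    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(language)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )
```

For example, `build_ssml("Hello", "en-US-JennyNeural", rate="1.1", style="cheerful")` nests the text inside a prosody element, then an express-as element, inside the chosen voice.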