Overview

AWS Polly provides cloud-based text-to-speech synthesis with standard, neural, and generative engines. The service supports a wide range of languages, SSML, and voice customization through prosody controls for pitch, rate, and volume.

Installation

To use AWS Polly services, install the required dependencies:
pip install "pipecat-ai[aws]"
You’ll also need to set up your AWS credentials as environment variables:
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN (if using temporary credentials)
  • AWS_REGION (defaults to “us-east-1”)
You can create access keys in the AWS Console, or configure credentials with the AWS CLI.
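Before starting a pipeline, it can help to check that the variables listed above are actually present. The following is a minimal sketch, not part of Pipecat: `load_aws_settings` and `REQUIRED_VARS` are hypothetical names used only for illustration, and the sketch assumes the access key ID and secret access key are both required while the session token and region are optional.

```python
import os

# Assumption: these two variables are required; the other two are optional.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_settings(env=os.environ):
    """Collect AWS settings from a mapping, raising if a required one is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing AWS environment variables: " + ", ".join(missing)
        )
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Only needed for temporary credentials; None when unset or empty.
        "aws_session_token": env.get("AWS_SESSION_TOKEN") or None,
        # Falls back to the documented default region.
        "region": env.get("AWS_REGION") or "us-east-1",
    }
```

Failing early with a clear message is usually easier to debug than an authentication error raised later during synthesis.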

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data (PCM, resampled from 16 kHz to the configured sample rate)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - AWS API or processing errors

Language Support

Common languages supported include:
  • Language.EN - English (US)
  • Language.ES - Spanish
  • Language.FR - French
  • Language.DE - German
  • Language.IT - Italian
  • Language.JA - Japanese

Usage Example

Basic Configuration

Initialize the AWSPollyTTSService and use it in a pipeline:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.aws.tts import AWSPollyTTSService
from pipecat.transcriptions.language import Language

tts = AWSPollyTTSService(
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    api_key=os.getenv("AWS_SECRET_ACCESS_KEY"),  # the secret access key is passed as api_key
    region="us-west-2",
    voice_id="Joanna",
    params=AWSPollyTTSService.InputParams(
        engine="neural",
        language=Language.EN,
        rate="+10%",
        volume="loud"
    )
)

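# Use in a pipeline (transport, stt, llm, and context_aggregator
# are assumed to be created elsewhere in your application)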
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Dynamic Configuration

Update settings at runtime by pushing a TTSUpdateSettingsFrame to the AWSPollyTTSService:
from pipecat.frames.frames import TTSUpdateSettingsFrame

await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "your-new-voice-id"})
)

SSML Features

The service automatically constructs SSML from the configured parameters for advanced speech control:
# Prosody controls (engine-dependent)
service = AWSPollyTTSService(
    voice_id="Joanna",
    params=AWSPollyTTSService.InputParams(
        engine="standard",   # Full prosody support
        rate="slow",         # SSML rate values
        pitch="low",         # Pitch adjustment
        volume="loud"        # Volume control
    )
)

# Lexicon support for custom pronunciations
service = AWSPollyTTSService(
    voice_id="Joanna",
    params=AWSPollyTTSService.InputParams(
        lexicon_names=["custom-pronunciations"]
    )
)
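To make the SSML wrapping more concrete, here is a minimal sketch of how prosody parameters could be turned into an SSML document. It is an illustration only, not the service's actual implementation: `build_ssml` is a hypothetical helper, and the sketch assumes a single `<prosody>` element inside `<speak>` with unset parameters simply omitted.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate=None, pitch=None, volume=None):
    """Wrap text in <speak>, adding a <prosody> element for any set parameter."""
    attrs = []
    for name, value in (("rate", rate), ("pitch", pitch), ("volume", volume)):
        if value:
            # quoteattr adds the surrounding quotes and escapes the value.
            attrs.append(f"{name}={quoteattr(value)}")

    # Escape &, <, and > so the input text cannot break the XML.
    body = escape(text)
    if attrs:
        body = f"<prosody {' '.join(attrs)}>{body}</prosody>"
    return f"<speak>{body}</speak>"


print(build_ssml("Hello there", rate="slow", volume="loud"))
```

Escaping the text matters because characters such as `&` or `<` in LLM output would otherwise produce invalid SSML.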

Metrics

The service reports the following metrics:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
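As a rough illustration of what these three metrics measure, the sketch below times a stream of audio chunks for a given input text. It does not use Pipecat's metrics machinery: `measure_synthesis` is a hypothetical helper, and the audio stream is assumed to be any iterable of `bytes` chunks.

```python
import time


def measure_synthesis(text, chunks, clock=time.monotonic):
    """Consume audio chunks and return TTFB, total duration, and character count."""
    start = clock()
    ttfb = None
    audio_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            # Time to first byte: delay until the first audio chunk arrives.
            ttfb = clock() - start
        audio_bytes += len(chunk)
    return {
        "ttfb": ttfb,  # None if no audio was produced
        "processing_duration": clock() - start,
        "characters": len(text),  # text length, the basis for usage billing
        "audio_bytes": audio_bytes,
    }
```

The `clock` parameter is injectable so the function can be exercised with a fake clock instead of real delays.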

Additional Notes

  • Engine Selection: Use generative for the highest quality, neural for a balance of quality and latency, and standard for the lowest latency
  • Region Requirements: The generative engine is only available in select regions (us-west-2, us-east-1, etc.)
  • Audio Format: The service outputs PCM audio resampled from 16 kHz to your specified sample rate
  • Credential Management: Credentials can be supplied through environment variables or passed directly to the constructor
  • Automatic SSML: The service automatically wraps text in the appropriate SSML tags based on the configured parameters
  • Prosody Limitations: The generative engine supports only rate adjustment, not pitch or volume
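The prosody limitations above can be sketched as a small filter that drops settings an engine does not support. This is a hypothetical helper, not the service's own logic: `SUPPORTED_PROSODY` and `filter_prosody` are names invented for illustration, and the table encodes only what this page states (full prosody support for standard, rate only for generative). Engines not listed here, such as neural, are passed through unchanged, so check the AWS documentation for their limits.

```python
# Assumption: only the engines described on this page are listed.
SUPPORTED_PROSODY = {
    "standard": {"rate", "pitch", "volume"},
    "generative": {"rate"},
}


def filter_prosody(engine, settings):
    """Return the prosody settings that are set and allowed for the engine."""
    allowed = SUPPORTED_PROSODY.get(engine)
    result = {}
    for name, value in settings.items():
        if value is None:
            continue  # skip unset parameters
        if allowed is not None and name not in allowed:
            continue  # skip parameters the engine does not support
        result[name] = value
    return result


requested = {"rate": "slow", "pitch": "low", "volume": "loud"}
print(filter_prosody("generative", requested))
```

Filtering before building SSML avoids sending tags that an engine would ignore or reject.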