Overview

MiniMax’s T2A (Text-to-Audio) API provides high-quality text-to-speech synthesis with streaming capabilities, emotional voice control, and support for multiple languages. The service offers various models optimized for different use cases, from low-latency to high-definition audio quality.

Installation

To use MiniMax services, no additional dependencies are required beyond the base installation:
pip install "pipecat-ai"
You’ll need MiniMax API credentials:
  • MINIMAX_API_KEY
  • MINIMAX_GROUP_ID
Get your API credentials from the MiniMax Platform.
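Because both credentials are read from the environment, it can help to fail early when one is missing. The following is a minimal sketch using only the standard library; the helper name `load_minimax_credentials` is hypothetical and is not part of Pipecat or MiniMax.

```python
import os

# The two environment variables named in this section.
REQUIRED_VARS = ("MINIMAX_API_KEY", "MINIMAX_GROUP_ID")


def load_minimax_credentials(env=None):
    """Return both credentials, raising early if either is missing or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["MINIMAX_API_KEY"],
        "group_id": env["MINIMAX_GROUP_ID"],
    }
```

The returned dictionary uses the same keyword names (`api_key`, `group_id`) that the service constructor takes in the usage example.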

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data chunks (streaming PCM)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - API or processing errors

Model Comparison

Model | Quality | Latency | Features
speech-02-hd | Highest | Higher | Superior rhythm and stability
speech-02-turbo | High | Lower | Enhanced multilingual capabilities
speech-01-hd | High | Medium | Rich voices with expressive emotions
speech-01-turbo | Good | Lowest | Regular updates, fast response
Refer to the MiniMax documentation for up-to-date model information.
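The trade-offs in the comparison can be captured as a small lookup, which may be convenient when the model is chosen from configuration. This is only a sketch: the priority labels and the `choose_model` function are hypothetical, and the mapping simply restates the comparison above rather than any MiniMax recommendation.

```python
# Hypothetical priority labels mapped to the model that the comparison
# above lists as strongest for that property.
MODEL_BY_PRIORITY = {
    "quality": "speech-02-hd",          # highest quality, higher latency
    "multilingual": "speech-02-turbo",  # enhanced multilingual capabilities
    "expressive": "speech-01-hd",       # rich voices with expressive emotions
    "latency": "speech-01-turbo",       # lowest latency
}


def choose_model(priority):
    """Return a model name for a priority label, or raise ValueError."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"Unknown priority {priority!r}; expected one of: {options}")
```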

Voice Selection

MiniMax offers diverse voice personalities:
Voice ID | Description | Tone
Wise_Woman | Mature female voice | Authoritative, knowledgeable
Friendly_Person | Warm, approachable | Conversational, welcoming
Patient_Man | Calm male voice | Steady, reassuring
Lively_Girl | Young female voice | Energetic, enthusiastic
Deep_Voice_Man | Rich male voice | Professional, commanding
Calm_Woman | Serene female voice | Peaceful, soothing
Elegant_Man | Sophisticated male | Refined, articulate
See the MiniMax documentation for the complete list of available voices.

Supported Sample Rates

MiniMax supports multiple sample rates for different quality levels:
  • 8000 Hz
  • 16000 Hz
  • 22050 Hz
  • 24000 Hz
  • 32000 Hz
  • 44100 Hz
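To make the quality and bandwidth trade-off concrete, the sketch below validates a rate against the list above and estimates the raw audio data rate. It assumes 16-bit mono PCM output, which is an assumption of this example and not something stated above; the function names are hypothetical.

```python
# Sample rates listed in this section.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 32000, 44100)


def validate_sample_rate(rate):
    """Return the rate unchanged if it is in the supported list."""
    if rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"Unsupported sample rate: {rate}")
    return rate


def pcm_bytes_per_second(rate, channels=1, bytes_per_sample=2):
    """Estimate the raw PCM data rate (assumes 16-bit mono by default)."""
    return validate_sample_rate(rate) * channels * bytes_per_sample
```

Under these assumptions, 24000 Hz audio needs 48000 bytes per second, three times the 16000 bytes per second needed at 8000 Hz.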

Language Support

Common languages supported include:
  • Language.EN - English
  • Language.ZH - Chinese (Mandarin)
  • Language.ES - Spanish
  • Language.FR - French
  • Language.DE - German
  • Language.JA - Japanese

Usage Example

Basic Configuration

Initialize the MiniMaxHttpTTSService and use it in a pipeline:
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.minimax.tts import MiniMaxHttpTTSService
from pipecat.transcriptions.language import Language

# `session` is an aiohttp.ClientSession created by your application.
# `transport`, `stt`, `llm`, and `context_aggregator` are assumed to be
# configured elsewhere.

# Configure service
tts = MiniMaxHttpTTSService(
    api_key=os.getenv("MINIMAX_API_KEY"),
    group_id=os.getenv("MINIMAX_GROUP_ID"),
    aiohttp_session=session,
    params=MiniMaxHttpTTSService.InputParams(
        language=Language.EN,
    ),
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])

Dynamic Configuration

Update settings at runtime by pushing a TTSUpdateSettingsFrame to the MiniMaxHttpTTSService:
from pipecat.frames.frames import TTSUpdateSettingsFrame

# The frame carries a settings mapping; the "voice" key selects the voice ID
await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "new_voice"})
)

Metrics

The service reports the following metrics:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
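As an illustration of what these three metrics measure, the sketch below times a stand-in streaming generator. It does not use Pipecat's metrics machinery: `fake_tts_stream` and `measure` are hypothetical names, and the chunk sizes and delays are arbitrary.

```python
import asyncio
import time


async def fake_tts_stream(text, chunk_delay=0.01):
    """Stand-in for a streaming TTS call: yields three silent PCM chunks."""
    for _ in range(3):
        await asyncio.sleep(chunk_delay)
        yield b"\x00\x00" * 160  # 160 samples of 16-bit silence


async def measure(text):
    """Collect TTFB, total processing time, and character count for one request."""
    start = time.perf_counter()
    ttfb = None
    audio_bytes = 0
    async for chunk in fake_tts_stream(text):
        if ttfb is None:
            # Time to first byte: from request start to the first audio chunk.
            ttfb = time.perf_counter() - start
        audio_bytes += len(chunk)
    processing = time.perf_counter() - start
    return {
        "ttfb": ttfb,
        "processing": processing,
        "characters": len(text),
        "audio_bytes": audio_bytes,
    }


metrics = asyncio.run(measure("Hello from MiniMax"))
```

TTFB is recorded once, when the first chunk arrives, so it can never exceed the total processing time; the character count depends only on the input text.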

Additional Notes

  • HTTP Session Required: You must provide an aiohttp.ClientSession for API communication
  • Emotional AI: Advanced emotional expression capabilities with voice-specific optimizations
  • Text Normalization: Optional English normalization for better number and abbreviation handling
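One way to satisfy the HTTP session requirement is to open the session in an async context manager and build the service inside it, so the session is closed once the pipeline finishes. The sketch below uses only the constructor arguments shown in the usage example; the function name `run_with_minimax_tts` is hypothetical, and the pipeline itself is left out.

```python
import os


async def run_with_minimax_tts():
    # Imported inside the function so this sketch loads without the packages
    # installed; in an application these imports belong at module level.
    import aiohttp
    from pipecat.services.minimax.tts import MiniMaxHttpTTSService
    from pipecat.transcriptions.language import Language

    # One session for the lifetime of the service, closed on exit.
    async with aiohttp.ClientSession() as session:
        tts = MiniMaxHttpTTSService(
            api_key=os.getenv("MINIMAX_API_KEY"),
            group_id=os.getenv("MINIMAX_GROUP_ID"),
            aiohttp_session=session,
            params=MiniMaxHttpTTSService.InputParams(language=Language.EN),
        )
        # Build and run the pipeline with `tts` here, while the session
        # is still open.
```

Call it with `asyncio.run(run_with_minimax_tts())` from a script entry point.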