Overview
OpenAI’s TTS API provides high-quality text-to-speech synthesis with multiple voice models, including traditional TTS models and advanced GPT-based models. The service outputs 24kHz PCM audio with streaming capabilities for real-time applications.
- API Reference: Complete API documentation and method details
- OpenAI TTS Docs: Official OpenAI text-to-speech API documentation
- Example Code: Working example with voice customization
Installation
To use OpenAI services, install the required dependencies and set your OpenAI API key in the OPENAI_API_KEY environment variable.
Get your API key from the OpenAI Platform.
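As a minimal sketch of the key setup, the check below reads OPENAI_API_KEY from the environment and fails early with a readable message when it is missing. The helper name `load_openai_api_key` is hypothetical and is not part of any library.

```python
import os


def load_openai_api_key(env=None):
    """Return the OpenAI API key or raise a clear error if it is missing.

    Hypothetical helper for illustration only. `env` defaults to os.environ
    and can be replaced by a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the pipeline"
        )
    return key
```

Calling this once at startup surfaces a missing key before any audio request is attempted.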
Frames
Input
- TextFrame - Text content to synthesize into speech
- TTSSpeakFrame - Text that should be spoken immediately
- TTSUpdateSettingsFrame - Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries
Output
- TTSStartedFrame - Signals start of synthesis
- TTSAudioRawFrame - Generated audio data (24kHz PCM, mono)
- TTSStoppedFrame - Signals completion of synthesis
- ErrorFrame - API or processing errors
Models
Model | Description | Best For |
---|---|---|
gpt-4o-mini-tts | Latest GPT-based TTS model | Faster generation, improved prosody, recommended for most use cases |
tts-1 | Original TTS model | Standard quality speech synthesis |
tts-1-hd | High-definition TTS model | Premium quality speech with higher fidelity |
Voice Options
OpenAI provides multiple voice personalities:
Voice | Description | Characteristics |
---|---|---|
alloy | Balanced, neutral | Professional, clear |
echo | Calm, measured | Thoughtful, deliberate |
fable | Warm, engaging | Storytelling, expressive |
onyx | Deep, authoritative | Commanding, confident |
nova | Bright, energetic | Enthusiastic, friendly |
shimmer | Soft, gentle | Soothing, approachable |
ash | Mature, sophisticated | Experienced, wise |
ballad | Smooth, melodic | Musical, flowing |
coral | Vibrant, lively | Dynamic, spirited |
sage | Wise, contemplative | Reflective, knowledgeable |
verse | Poetic, rhythmic | Artistic, expressive |
Usage Example
Basic Configuration
Initialize OpenAITTSService and use it in a pipeline.
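The original example for this step is not reproduced here, so the following is only a sketch. It assumes this page describes Pipecat's OpenAITTSService, that the import paths `pipecat.services.openai.tts` and `pipecat.pipeline.pipeline` match your installed version, and that the constructor accepts `api_key`, `voice`, and `model` keyword arguments. The function names and the `upstream` and `downstream` parameters are hypothetical.

```python
import os


def build_tts_service(voice="nova", model="gpt-4o-mini-tts"):
    """Create the TTS service (sketch; requires Pipecat to be installed)."""
    # Assumption: import path and keyword names match your Pipecat version.
    from pipecat.services.openai.tts import OpenAITTSService

    return OpenAITTSService(
        api_key=os.environ["OPENAI_API_KEY"],
        voice=voice,
        model=model,
    )


def build_pipeline(upstream, tts, downstream):
    """Place the TTS service between upstream and downstream processors.

    `upstream` would typically end with the LLM that produces text frames and
    `downstream` would start with the audio output; both are placeholders.
    """
    # Assumption: Pipeline takes a flat list of processors.
    from pipecat.pipeline.pipeline import Pipeline

    return Pipeline([*upstream, tts, *downstream])
```

The imports are deferred into the functions so the module can be loaded and inspected even where Pipecat is not installed.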
Dynamic Voice Changes
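One way to change the voice at runtime is to send a TTSUpdateSettingsFrame, which the Frames section lists as the input for runtime configuration updates. The sketch below assumes Pipecat's `TTSUpdateSettingsFrame` accepts a `settings` mapping with a `voice` key and that the pipeline task exposes an async `queue_frame` method. The helper names are hypothetical; verify the frame's fields against your installed version.

```python
# Voice names taken from the Voice Options table.
SUPPORTED_VOICES = {
    "alloy", "ash", "ballad", "coral", "echo", "fable",
    "nova", "onyx", "sage", "shimmer", "verse",
}


def voice_settings(voice):
    """Validate a voice name and return the settings mapping for an update."""
    if voice not in SUPPORTED_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    return {"voice": voice}


async def switch_voice(task, voice):
    """Queue a settings update on a running pipeline task (sketch)."""
    # Assumption: frame name and `settings` keyword match your Pipecat version.
    from pipecat.frames.frames import TTSUpdateSettingsFrame

    await task.queue_frame(TTSUpdateSettingsFrame(settings=voice_settings(voice)))
```

Validating the name before queuing the frame keeps a typo from reaching the API as a failed synthesis request.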
Audio Specifications
Sample Rate
- Fixed Rate: 24kHz (24,000 Hz)
- Format: 16-bit PCM
- Channels: Mono (1 channel)
- Streaming: Chunked delivery for low latency
OpenAI TTS only outputs at 24kHz. Ensure your pipeline sample rate matches to
avoid audio issues.
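The figures above imply 48,000 bytes per second of audio (24,000 samples × 2 bytes × 1 channel). As an illustration using only the Python standard library, the sketch below computes the duration of a raw PCM buffer and wraps it in a WAV container for inspection. The function names are hypothetical.

```python
import io
import wave

SAMPLE_RATE = 24_000  # Hz, fixed by the service
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHANNELS = 1          # mono


def pcm_duration_seconds(num_bytes):
    """Duration of a raw PCM buffer in the format described above."""
    return num_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


def pcm_to_wav_bytes(pcm):
    """Wrap raw PCM bytes in a WAV header so audio tools can open them."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Saving a short capture this way is a quick check for a sample rate mismatch: audio that plays too fast or too slow points to a pipeline configured for a rate other than 24kHz.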
Advanced Features
Voice Instructions (GPT Models)
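At the API level, instructions travel as the `instructions` field of the request body sent to OpenAI's speech endpoint, and OpenAI documents that field as unsupported by tts-1 and tts-1-hd. The sketch below builds such a body; it illustrates the request shape and is not the service's internal code. The function name and the example instruction text are hypothetical.

```python
# Models that accept the `instructions` field, per the Models table above.
INSTRUCTION_MODELS = {"gpt-4o-mini-tts"}


def build_speech_payload(text, voice="coral", model="gpt-4o-mini-tts",
                         instructions=None):
    """Build the JSON body for a speech request, guarding `instructions`."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "pcm",  # raw 24kHz 16-bit mono samples
    }
    if instructions:
        if model not in INSTRUCTION_MODELS:
            raise ValueError(f"{model} does not support voice instructions")
        payload["instructions"] = instructions
    return payload


payload = build_speech_payload(
    "Your order has shipped.",
    instructions="Speak in a calm, reassuring tone.",
)
```

Rejecting instructions for the older models up front avoids silently losing the requested speaking style.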
Custom Endpoints
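A custom endpoint generally means pointing requests at a different base URL that exposes an OpenAI-compatible speech route. The sketch below builds, but does not send, such a request with the standard library. The function name is hypothetical, and whether a given server implements the `/audio/speech` route is an assumption to verify.

```python
import json
import urllib.request

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def build_speech_request(api_key, payload, base_url=DEFAULT_BASE_URL):
    """Prepare a POST request for the speech route under `base_url`."""
    url = base_url.rstrip("/") + "/audio/speech"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Stripping the trailing slash keeps the final URL well-formed whether or not the configured base URL ends with one.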
Metrics
The service provides comprehensive metrics:
- Time to First Byte (TTFB) - Latency from text input to first audio
- Processing Duration - Total synthesis time
- Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
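To make the three metrics concrete, the sketch below measures them over any iterable of audio chunks. It illustrates what each number means, not how the service computes them internally. The function name and the injectable `clock` parameter are hypothetical.

```python
import time


def measure_stream(text, chunks, clock=time.perf_counter):
    """Consume audio chunks and report TTFB, total duration, and characters."""
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            # Time from submitting the text to receiving the first audio.
            ttfb = clock() - start
        total_bytes += len(chunk)
    return {
        "ttfb": ttfb,                  # None if no audio arrived
        "processing": clock() - start, # total synthesis time
        "characters": len(text),       # text processed, the billing unit
        "bytes": total_bytes,
    }
```

Passing a fake `clock` makes the measurement deterministic in tests, while the default uses a monotonic high-resolution timer.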
Additional Notes
- Sample Rate Constraint: OpenAI TTS always outputs at 24kHz - ensure pipeline compatibility
- Streaming Optimized: Audio chunks delivered as generated for low-latency playback
- Voice Quality: GPT-based models (gpt-4o-mini-tts) offer better prosody and naturalness than the tts-1 models
- Instructions Support: GPT models accept behavioral instructions for voice customization
- Error Handling: API and processing errors are reported as ErrorFrame with detailed error messages
- Thread Safety: Safe for concurrent use in multi-threaded applications
- Cost Efficiency: Character-based billing with usage metrics tracking