Cartesia
Text-to-speech services using Cartesia’s WebSocket and HTTP APIs
Overview
Cartesia provides two TTS service implementations:
- CartesiaTTSService: WebSocket-based service with word-level timestamps and streaming
- CartesiaHttpTTSService: HTTP-based service for simpler, non-streaming synthesis
Installation
To use Cartesia services, install the required dependencies:
You’ll also need to set your Cartesia API key as an environment variable: CARTESIA_API_KEY.
You can obtain a Cartesia API key by signing up at Cartesia.
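As a minimal sketch of reading the key at startup, the helper below fails early when the variable is missing. The function name load_cartesia_api_key is hypothetical and not part of any library; only the CARTESIA_API_KEY variable name comes from this page.

```python
import os


def load_cartesia_api_key(env=os.environ):
    """Return the Cartesia API key from the given mapping, or raise if it is missing.

    Hypothetical helper for illustration; `env` defaults to the process environment.
    """
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

Passing a plain dict as env makes the helper easy to exercise in tests without touching the real environment.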
Choosing a Cartesia service
Cartesia has two supported services:
- CartesiaTTSService, which is a WebSocket-based implementation
- CartesiaHttpTTSService, which is an HTTP-based implementation
CartesiaTTSService
The CartesiaTTSService is recommended for real-time streaming and interactive applications. It offers:
- Streaming audio in chunks
- Word-level timestamps
- Text frame generation aligned with audio playback
- Sophisticated interruption handling
- Continuous session management through a persistent WebSocket connection
- Non-blocking operation that allows other frames to be processed while audio is being generated
CartesiaHttpTTSService
The CartesiaHttpTTSService is the simpler option, suitable for non-interactive use cases. It:
- Processes the entire text in one request
- Returns audio in a single frame
- Has a simpler implementation and fewer moving parts
- May be more suitable for batch processing
- Blocks during the HTTP request, preventing other frames from being processed until the audio is fully generated
Both services support usage metrics and start/stop frame events, but they differ in how they deliver audio and in how well they support interaction. Choose the WebSocket-based service if you need real-time responsiveness, or the HTTP service if you prefer simplicity and can tolerate the blocking behavior.
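The difference between blocking and non-blocking operation can be illustrated with a small asyncio simulation. This is only a model of the behavior described above and makes no Cartesia calls; streaming_tts, blocking_tts, other_work, and run are hypothetical stand-ins, and time.sleep is used purely to mimic a call that holds the event loop.

```python
import asyncio
import time


async def streaming_tts(text, log):
    # Stand-in for the WebSocket-style service: awaits between chunks,
    # so other tasks can run while audio is being produced.
    for i, _word in enumerate(text.split()):
        await asyncio.sleep(0.01)  # simulated wait for the next audio chunk
        log.append(f"audio chunk {i}")


async def blocking_tts(text, log):
    # Stand-in for the HTTP-style service as described on this page:
    # nothing else runs until the whole audio is ready.
    time.sleep(0.03)  # deliberately blocks the event loop
    log.append("audio (single frame)")


async def other_work(log):
    # Stand-in for other frames moving through the application.
    for i in range(3):
        await asyncio.sleep(0.01)
        log.append(f"other frame {i}")


async def run(tts):
    # Run one TTS stand-in concurrently with the other work and return the event log.
    log = []
    await asyncio.gather(tts("hello there world", log), other_work(log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run(streaming_tts)))
    print(asyncio.run(run(blocking_tts)))
```

In the streaming run the "other frame" entries are interleaved with the audio chunks, while in the blocking run the single audio entry is logged before any other frame is processed.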
Input Parameters
Both services use the same input parameter structure:
The language to use for synthesis. See the Language Support section for available options.
Controls the speech rate.
Can be specified as either:
- String options: "slowest", "slow", "normal", "fast", "fastest"
- Float value: Between -1.0 (slowest) and 1.0 (fastest), where 0.0 is normal speed
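The two accepted forms of the speech rate can be checked before a request is made. The sketch below is a hypothetical client-side validator (validate_speed and SPEED_NAMES are not library names); it only encodes the string options and the float range listed above.

```python
SPEED_NAMES = ("slowest", "slow", "normal", "fast", "fastest")


def validate_speed(speed):
    """Return the speed if it matches one of the documented forms, else raise.

    Strings must be one of SPEED_NAMES; numbers must lie in [-1.0, 1.0].
    """
    if isinstance(speed, str):
        if speed not in SPEED_NAMES:
            raise ValueError(f"unknown speed name: {speed!r}")
        return speed
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(speed, bool) or not isinstance(speed, (int, float)):
        raise TypeError("speed must be a string or a number")
    if not -1.0 <= speed <= 1.0:
        raise ValueError("numeric speed must be between -1.0 and 1.0")
    return float(speed)
```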
List of emotion controls to apply.
Each emotion can be specified as:
- Simple emotion: "anger", "positivity", "surprise", "sadness", "curiosity"
- Emotion with level: "emotion:level" where level can be "lowest", "low", "high", "highest"
Example: ["positivity:high", "curiosity"]
Note: Emotion controls are additive and their effects may vary by voice and content.
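To make the "emotion:level" format concrete, here is a minimal sketch that splits each entry into its parts and rejects values not listed above. The names parse_emotion, EMOTIONS, and LEVELS are hypothetical and exist only for this illustration.

```python
EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
LEVELS = {"lowest", "low", "high", "highest"}


def parse_emotion(spec):
    """Split 'emotion' or 'emotion:level' into (emotion, level), with level None if absent."""
    name, sep, level = spec.partition(":")
    if name not in EMOTIONS:
        raise ValueError(f"unknown emotion: {name!r}")
    if sep and level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return name, (level if sep else None)
```

Applied to the example list above, [parse_emotion(s) for s in ["positivity:high", "curiosity"]] yields [("positivity", "high"), ("curiosity", None)].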
CartesiaTTSService
WebSocket-based implementation supporting real-time streaming and word timestamps.
Constructor Parameters
Cartesia API key
Voice identifier
API version
WebSocket endpoint URL
Model identifier
Output audio sample rate in Hz
Audio encoding format
Audio container format
Modifies text provided to the TTS. Learn more about the available filters.
CartesiaHttpTTSService
HTTP-based implementation for simpler synthesis requirements.
Constructor Parameters
Cartesia API key
Voice identifier
Model identifier
API base URL
Output audio sample rate in Hz
Audio encoding format
Audio container format
Modifies text provided to the TTS. Learn more about the available filters.
Output Frames
Control Frames
Signals start of synthesis
Signals completion of synthesis
Audio Frames
Contains generated audio data
Error Frames
Contains error information
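A consumer of these frames typically collects audio between the start and stop signals and surfaces any error. The sketch below uses local stand-in classes (SynthesisStarted, SynthesisStopped, AudioChunk, SynthesisError); these names are assumptions for illustration and are not the library's frame types.

```python
from dataclasses import dataclass


@dataclass
class SynthesisStarted:  # stand-in for the frame that signals start of synthesis
    pass


@dataclass
class SynthesisStopped:  # stand-in for the frame that signals completion
    pass


@dataclass
class AudioChunk:  # stand-in for a frame carrying generated audio data
    audio: bytes


@dataclass
class SynthesisError:  # stand-in for a frame carrying error information
    error: str


def collect_audio(frames):
    """Concatenate audio received between a start and a stop frame.

    Raises RuntimeError on an error frame or if no stop frame arrives.
    """
    started = False
    chunks = []
    for frame in frames:
        if isinstance(frame, SynthesisStarted):
            started = True
            chunks.clear()
        elif isinstance(frame, AudioChunk) and started:
            chunks.append(frame.audio)
        elif isinstance(frame, SynthesisError):
            raise RuntimeError(frame.error)
        elif isinstance(frame, SynthesisStopped):
            return b"".join(chunks)
    raise RuntimeError("stream ended without a stop frame")
```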
Methods
See the TTS base class methods for additional functionality.
Language Support
Supports multiple languages through the Language enum:
Language Code | Description | Service Code |
---|---|---|
Language.DE | German | de |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.JA | Japanese | ja |
Language.PT | Portuguese | pt |
Language.ZH | Chinese (Mandarin) | zh |
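The table can be read as a lookup from enum member to service code. The sketch below defines a local stand-in Language enum, not the library's own, so the snippet runs on its own; the IT member is a hypothetical entry not listed in the table above and is there only to show the fallback for an unmapped language.

```python
from enum import Enum


class Language(Enum):
    # Local stand-in for illustration; the real enum lives in the library.
    DE = "de"
    EN = "en"
    ES = "es"
    FR = "fr"
    JA = "ja"
    PT = "pt"
    ZH = "zh"
    IT = "it"  # hypothetical member not listed in the table above


# Mapping taken row by row from the table above.
SERVICE_CODES = {
    Language.DE: "de",
    Language.EN: "en",
    Language.ES: "es",
    Language.FR: "fr",
    Language.JA: "ja",
    Language.PT: "pt",
    Language.ZH: "zh",
}


def to_service_code(language):
    """Return the service code for a language, or None if the table has no entry."""
    return SERVICE_CODES.get(language)
```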