Overview

ElevenLabsTTSService provides high-quality text-to-speech synthesis using ElevenLabs’ WebSocket API. It supports real-time streaming, word-level timing, and various voice customization options.

Installation

To use ElevenLabsTTSService, install the required dependencies:

pip install "pipecat-ai[elevenlabs]"

You’ll also need to set up your ElevenLabs API key as an environment variable: ELEVENLABS_API_KEY.

You can obtain an ElevenLabs API key by signing up at ElevenLabs.
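As a minimal sketch, the key can be read from the environment at startup instead of being hard-coded. The helper name load_elevenlabs_api_key is hypothetical and not part of Pipecat; only the ELEVENLABS_API_KEY variable name comes from this page.

```python
import os


def load_elevenlabs_api_key(env=os.environ):
    """Return the ElevenLabs API key, or raise if it is not set.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    api_key = env.get("ELEVENLABS_API_KEY")
    if not api_key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    return api_key
```

The returned value can then be passed as the api_key constructor parameter described below. Failing early keeps a missing key from surfacing later as a connection error.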

Configuration

Constructor Parameters

api_key
str
required

ElevenLabs API key

voice_id
str
required

Voice identifier

model
str
default: "eleven_turbo_v2_5"

Model identifier

url
str
default: "wss://api.elevenlabs.io"

API endpoint URL

output_format
ElevenLabsOutputFormat
default: "pcm_24000"

Audio output format. Supported values:

  • "pcm_16000"
  • "pcm_22050"
  • "pcm_24000"
  • "pcm_44100"
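Each supported format name encodes its sample rate after the underscore. The sketch below shows one way to recover that rate, for example to configure an audio sink. The function name is an assumption made for illustration and may not match any helper in Pipecat.

```python
def sample_rate_from_output_format(output_format: str) -> int:
    """Parse the sample rate in Hz out of a "pcm_<rate>" format name.

    Hypothetical helper; only the PCM formats listed on this page are handled.
    """
    codec, _, rate = output_format.partition("_")
    if codec != "pcm" or not rate.isdigit():
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return int(rate)
```

With the default output_format of "pcm_24000", this yields a rate of 24000 Hz.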

text_filter
BaseTextFilter
default: None

Modifies text provided to the TTS. Learn more about the available filters.

Input Parameters

class InputParams(BaseModel):
    language: Optional[Language] = Language.EN
    # The remaining fields are optional and default to None when omitted
    optimize_streaming_latency: Optional[str] = None
    stability: Optional[float] = None
    similarity_boost: Optional[float] = None
    style: Optional[float] = None
    use_speaker_boost: Optional[bool] = None

Voice Settings

Voice characteristics can be configured using:

stability
float

Voice stability (requires similarity_boost)

similarity_boost
float

Voice similarity boost (requires stability)

style
float

Style intensity (requires stability and similarity_boost)

use_speaker_boost
bool

Enable speaker boost (requires stability and similarity_boost)
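The dependencies between these settings can be sketched as a small function. This is an illustration under stated assumptions, not the service's actual code: the name build_voice_settings is hypothetical, and the sketch assumes that settings whose prerequisites are missing are silently left out, which this page does not confirm.

```python
from typing import Optional


def build_voice_settings(
    stability: Optional[float] = None,
    similarity_boost: Optional[float] = None,
    style: Optional[float] = None,
    use_speaker_boost: Optional[bool] = None,
) -> dict:
    """Return a voice-settings dict that honors the documented requirements."""
    settings: dict = {}
    # stability and similarity_boost require each other, and the other two
    # settings require both, so nothing is emitted unless both are present.
    if stability is None or similarity_boost is None:
        return settings
    settings["stability"] = stability
    settings["similarity_boost"] = similarity_boost
    if style is not None:
        settings["style"] = style
    if use_speaker_boost is not None:
        settings["use_speaker_boost"] = use_speaker_boost
    return settings
```

Under this assumption, passing only style produces an empty dict, so it is worth setting stability and similarity_boost whenever style or use_speaker_boost is used.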

Output Frames

Control Frames

TTSStartedFrame
Frame

Signals start of synthesis

TTSStoppedFrame
Frame

Signals completion of synthesis

Audio Frames

TTSAudioRawFrame
Frame

Contains generated audio data:

  • PCM encoded audio
  • Configured sample rate
  • Mono channel
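Because the audio is raw mono PCM at a known sample rate, the playback duration of a chunk follows directly from its byte length. The sketch below assumes 16-bit (2-byte) samples, which this page does not state explicitly, and the function name is hypothetical.

```python
def pcm_duration_seconds(
    audio: bytes,
    sample_rate: int,
    num_channels: int = 1,
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM samples
) -> float:
    """Return the playback duration of a raw PCM buffer in seconds."""
    bytes_per_frame = num_channels * bytes_per_sample
    return len(audio) / (bytes_per_frame * sample_rate)
```

For instance, 48,000 bytes of mono audio at 24000 Hz correspond to one second of speech under the 16-bit assumption.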

Usage Examples

Basic Usage

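# Imports (module paths may differ between Pipecat versions)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transcriptions.language import Language

# Configure service
tts_service = ElevenLabsTTSService(
    api_key="your-api-key",
    voice_id="voice-id",
    output_format="pcm_24000",
    params=ElevenLabsTTSService.InputParams(
        language=Language.EN
    )
)

# Use in pipeline (text_input and audio_output are placeholders
# for your own upstream and downstream processors)
pipeline = Pipeline([
    text_input,
    tts_service,
    audio_output
])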

With Voice Settings

# Configure with voice customization
service = ElevenLabsTTSService(
    api_key="your-api-key",
    voice_id="voice-id",
    params=ElevenLabsTTSService.InputParams(
        stability=0.7,
        similarity_boost=0.8,
        style=0.5,
        use_speaker_boost=True
    )
)

Methods

See the TTS base class methods for additional functionality.

Language Support

ElevenLabs supports the following languages and their variants:

Language Code     Description              Service Code
Language.BG       Bulgarian                bg
Language.ZH       Chinese                  zh
Language.CS       Czech                    cs
Language.DA       Danish                   da
Language.NL       Dutch                    nl
Language.EN       English                  en
Language.EN_US    English (US)             en
Language.EN_AU    English (Australia)      en
Language.EN_GB    English (UK)             en
Language.EN_NZ    English (New Zealand)    en
Language.EN_IN    English (India)          en
Language.FI       Finnish                  fi
Language.FR       French                   fr
Language.FR_CA    French (Canada)          fr
Language.DE       German                   de
Language.DE_CH    German (Swiss)           de
Language.EL       Greek                    el
Language.HI       Hindi                    hi
Language.HU       Hungarian                hu
Language.ID       Indonesian               id
Language.IT       Italian                  it
Language.JA       Japanese                 ja
Language.KO       Korean                   ko
Language.MS       Malay                    ms
Language.NO       Norwegian                no
Language.PL       Polish                   pl
Language.PT       Portuguese               pt-PT
Language.PT_BR    Portuguese (Brazil)      pt-BR
Language.RO       Romanian                 ro
Language.RU       Russian                  ru
Language.SK       Slovak                   sk
Language.ES       Spanish                  es
Language.SV       Swedish                  sv
Language.TR       Turkish                  tr
Language.UK       Ukrainian                uk
Language.VI       Vietnamese               vi

Note: Language support may vary based on the selected model.
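The table suggests that regional variants mostly collapse to their base language code, with Portuguese as the exception. A minimal sketch of that lookup follows. It assumes language values are hyphenated tags such as "en-US", uses only a subset of the table, and the names SERVICE_CODES and to_service_code are hypothetical.

```python
from typing import Optional

# Subset of the table above, keyed by language tag (assumed format).
SERVICE_CODES = {
    "en": "en",
    "fr": "fr",
    "de": "de",
    "pt": "pt-PT",
    "pt-BR": "pt-BR",
}


def to_service_code(language_tag: str) -> Optional[str]:
    """Map a language tag to a service code, falling back to the base language."""
    if language_tag in SERVICE_CODES:
        return SERVICE_CODES[language_tag]
    # No exact match: retry with the part before the region suffix.
    base = language_tag.split("-")[0]
    return SERVICE_CODES.get(base)
```

Checking the exact tag first is what lets "pt-BR" keep its own code while "en-US" and "fr-CA" fall back to "en" and "fr". Unknown languages return None so the caller can decide how to handle them.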

Usage Example

# Configure service with specific language
service = ElevenLabsTTSService(
    api_key="your-api-key",
    voice_id="voice-id",
    params=ElevenLabsTTSService.InputParams(
        language=Language.FR  # French
    )
)

Word Timing

The service provides word-level timing information:

# Word timing calculation
word_times = calculate_word_times(
    alignment_info,
    cumulative_time
)
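To illustrate how word times might be derived from character-level alignment, here is a self-contained sketch. It is not the library's calculate_word_times. It assumes the alignment data is a dict with "chars" and "charStartTimesMs" lists, and that cumulative_time is the offset in seconds of the current audio chunk within the utterance. Both assumptions should be checked against the actual response format.

```python
def word_start_times(alignment: dict, cumulative_time: float = 0.0) -> list:
    """Return (word, start_seconds) pairs from character-level alignment.

    Hypothetical sketch: words are split on single spaces, and each word
    starts when its first character starts.
    """
    words = []
    current = ""
    start = None
    for char, start_ms in zip(alignment["chars"], alignment["charStartTimesMs"]):
        if char == " ":
            # A space closes the word collected so far, if any.
            if current:
                words.append((current, start))
            current, start = "", None
        else:
            if start is None:
                start = cumulative_time + start_ms / 1000.0
            current += char
    # Flush the final word when the text does not end with a space.
    if current:
        words.append((current, start))
    return words
```

Adding cumulative_time to every start keeps timestamps monotonic across chunks, which matters when text and audio are synchronized over a long utterance.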

Frame Flow

Features

Sentence Aggregation

  • Aggregates sentences for better audio quality
  • Maintains natural speech flow
  • Reduces artifacts
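Sentence aggregation can be pictured as buffering incoming text until a sentence boundary appears, then sending the whole sentence for synthesis. The sketch below is a deliberately simple illustration, not the service's implementation. The class name is hypothetical, and the boundary rule (trailing ".", "!" or "?") would split abbreviations such as "Dr." too early.

```python
import re
from typing import Optional

# Naive boundary rule: the buffer ends with ., ! or ? (plus optional whitespace).
_SENTENCE_END = re.compile(r"[.!?]\s*$")


class SentenceAggregator:
    """Collect text fragments and release them one full sentence at a time."""

    def __init__(self) -> None:
        self._buffer = ""

    def push(self, text: str) -> Optional[str]:
        """Add a fragment; return a complete sentence or None if still partial."""
        self._buffer += text
        if _SENTENCE_END.search(self._buffer):
            sentence, self._buffer = self._buffer.strip(), ""
            return sentence
        return None
```

Synthesizing full sentences gives the model enough context for natural prosody, at the cost of waiting for the sentence to finish before audio starts.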

Word Timing

  • Provides word-level timestamps
  • Enables text-audio synchronization
  • Supports interruption handling

Connection Management

  • WebSocket-based streaming
  • Automatic reconnection
  • Keepalive handling
  • Clean disconnection
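This page does not describe how automatic reconnection is implemented. As a generic illustration of the pattern, the sketch below retries a connection attempt with exponential backoff. The connect argument is a hypothetical coroutine function standing in for whatever opens the WebSocket, and the attempt count and delays are arbitrary.

```python
import asyncio


async def connect_with_retry(connect, max_attempts: int = 3, base_delay: float = 0.5):
    """Await connect() until it succeeds or max_attempts is exhausted.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)
```

Backing off between attempts avoids hammering the endpoint during an outage, and re-raising the final error lets the caller decide whether to stop the pipeline.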

Notes

  • Supports real-time streaming
  • Provides word-level timing
  • Handles interruptions gracefully
  • Maintains WebSocket connection
  • Includes metrics collection
  • Supports voice customization
  • Thread-safe processing
  • Automatic language mapping