Overview

Fish Audio provides real-time text-to-speech synthesis through a WebSocket-based streaming API. The service offers custom voice models, prosody controls, and multiple audio formats, and is designed for low-latency conversational AI applications.

Installation

To use Fish Audio services, install the required dependencies:
pip install "pipecat-ai[fish]"
You’ll also need to set your Fish Audio API key in the FISH_API_KEY environment variable.
You can get an API key from the Fish Audio Console.
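
Because the service is configured from FISH_API_KEY, it can help to fail fast when the variable is missing instead of discovering the problem at connection time. The following is a minimal sketch; the helper name load_fish_api_key is hypothetical and not part of pipecat.

```python
import os


def load_fish_api_key(env=None) -> str:
    """Return the Fish Audio API key, or raise if it is not configured.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("FISH_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FISH_API_KEY is not set; export it before starting the bot.")
    return key
```

The returned value can then be passed as the api_key argument when constructing the service.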

Frames

Input

  • TextFrame - Text content to synthesize into speech
  • TTSSpeakFrame - Text that should be spoken immediately
  • TTSUpdateSettingsFrame - Runtime configuration updates
  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

  • TTSStartedFrame - Signals start of synthesis
  • TTSAudioRawFrame - Generated audio data chunks (streaming)
  • TTSStoppedFrame - Signals completion of synthesis
  • ErrorFrame - API or processing errors

Sample Rate Options

Supported sample rates for different quality levels:
  • 8000 Hz - Phone quality
  • 16000 Hz - Standard quality
  • 24000 Hz - High quality (recommended)
  • 44100 Hz - CD quality
  • 48000 Hz - Professional quality
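
The sample rate also determines how much raw audio data flows through the pipeline. As a rough sketch, the helper below estimates the size of a PCM buffer for a given duration. It assumes 16-bit mono PCM, which this page does not state explicitly; the name pcm_bytes is hypothetical.

```python
SUPPORTED_SAMPLE_RATES = (8000, 16000, 24000, 44100, 48000)


def pcm_bytes(sample_rate: int, duration_ms: int,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Estimate the size in bytes of raw PCM audio.

    Assumes 16-bit (2-byte) mono samples by default.
    """
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"Unsupported sample rate: {sample_rate}")
    return sample_rate * duration_ms * bytes_per_sample * channels // 1000


print(pcm_bytes(24000, 1000))  # one second at 24000 Hz -> 48000 bytes
print(pcm_bytes(24000, 20))    # a 20 ms chunk at 24000 Hz -> 960 bytes
```

Under these assumptions, moving from 24000 Hz to 48000 Hz doubles the bandwidth for the same audio duration.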

Language Support

Fish Audio currently supports:
  Language Code    Description    Service Code
  Language.EN      English        en
  Language.JA      Japanese       ja
  Language.ZH      Chinese        zh
Fish Audio is expanding language support. Check the official documentation for the latest available languages.
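
To check whether a language is supported before building the service, the mapping listed above can be expressed as a small lookup. This is an illustrative sketch only: FISH_LANGUAGE_CODES and fish_language_code are hypothetical names, and the keys are plain member names rather than pipecat's Language enum so the example runs without pipecat installed.

```python
from typing import Optional

# Language member name -> Fish Audio service code, as listed on this page.
FISH_LANGUAGE_CODES = {"EN": "en", "JA": "ja", "ZH": "zh"}


def fish_language_code(language_name: str) -> Optional[str]:
    """Return the service code for a language member name, or None if unsupported."""
    return FISH_LANGUAGE_CODES.get(language_name.upper())


print(fish_language_code("EN"))  # en
print(fish_language_code("FR"))  # None
```

Returning None rather than raising lets the caller decide whether to fall back to a default language or report an error.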

Latency Modes

Choose the appropriate latency mode for your application:
  Mode        Description                   Best For
  normal      Standard latency (default)    General applications
  balanced    Balanced quality/speed        Real-time conversations

Usage Example

Basic Configuration

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.fish.tts import FishAudioTTSService
from pipecat.transcriptions.language import Language

# Configure service with custom voice
tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    reference_id="4ce7e917cedd4bc2bb2e6ff3a46acaa1",  # Voice model ID
    model_id="speech-1.5",
    output_format="pcm",
    sample_rate=24000,
    params=FishAudioTTSService.InputParams(
        language=Language.EN,
        latency="normal",
        prosody_speed=1.0,
        prosody_volume=0,
    ),
)

# Use in pipeline. The transport, stt, llm, and context_aggregator objects
# are assumed to be created elsewhere in your application.
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),
])

Advanced Prosody Control

# Custom prosody settings
tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    reference_id="your-voice-model-id",
    params=FishAudioTTSService.InputParams(
        language=Language.EN,
        latency="balanced",      # Balance quality vs speed
        prosody_speed=1.2,       # 20% faster speech
        prosody_volume=3,        # +3dB volume boost
        normalize=True           # Normalize audio output
    )
)
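
To get a feel for what these prosody values mean, the standard decibel and rate conversions can be worked out directly. This sketch assumes prosody_volume is a decibel offset and prosody_speed is a rate multiplier, as the comments above suggest; how Fish Audio applies them internally is not stated on this page, and both function names are hypothetical.

```python
def db_to_amplitude_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier (10^(dB/20))."""
    return 10 ** (db / 20)


def speaking_time(baseline_seconds: float, prosody_speed: float) -> float:
    """Estimate utterance duration after applying a speed multiplier."""
    if prosody_speed <= 0:
        raise ValueError("prosody_speed must be positive")
    return baseline_seconds / prosody_speed


print(round(db_to_amplitude_gain(3), 3))  # 1.413
print(round(speaking_time(10.0, 1.2), 2))  # 8.33
```

Under these assumptions, a +3 dB boost raises the amplitude by roughly 41%, and a speed of 1.2 shortens a 10-second utterance to about 8.3 seconds.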

Dynamic Configuration

Update settings at runtime by pushing a TTSUpdateSettingsFrame:
from pipecat.frames.frames import TTSUpdateSettingsFrame

# Updates are passed as a settings mapping
await task.queue_frame(
    TTSUpdateSettingsFrame(
        settings={"reference_id": "new-voice-model-id"}  # Change voice model
    )
)

Metrics

The service reports the following metrics:
  • Time to First Byte (TTFB) - Latency from text input to first audio
  • Processing Duration - Total synthesis time
  • Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
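
To make the metric definitions concrete, the sketch below measures time to first byte and total processing duration over a generic asynchronous stream of audio chunks. It is illustrative only and does not use pipecat's metrics system; measure_stream and fake_tts_stream are hypothetical names, and the fake stream stands in for a real TTS connection.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_stream(
    chunks: AsyncIterator[bytes],
) -> Tuple[Optional[float], float, int]:
    """Return (ttfb_seconds, total_seconds, total_bytes) for a chunk stream."""
    start = time.monotonic()
    ttfb = None
    total_bytes = 0
    async for chunk in chunks:
        if ttfb is None:
            # First chunk received: record time to first byte.
            ttfb = time.monotonic() - start
        total_bytes += len(chunk)
    return ttfb, time.monotonic() - start, total_bytes


async def fake_tts_stream() -> AsyncIterator[bytes]:
    """Stand-in for a TTS stream: three 960-byte chunks with small delays."""
    for _ in range(3):
        await asyncio.sleep(0.01)
        yield b"\x00" * 960


ttfb, total, size = asyncio.run(measure_stream(fake_tts_stream()))
print(f"TTFB: {ttfb:.3f}s, total: {total:.3f}s, bytes: {size}")
```

Character usage is simpler to track: it corresponds to the length of the text sent for synthesis.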

Additional Notes

  • WebSocket Streaming: Real-time audio generation with automatic chunking
  • Interruption Handling: Built-in support for conversation interruptions
  • Custom Voice Models: Use your own trained voice models via reference IDs
  • Audio Buffering: Efficient streaming with configurable buffer sizes
  • Connection Management: Automatic reconnection on connection failures
  • Format Flexibility: Multiple audio formats for different deployment scenarios
  • Prosody Control: Fine-tune speech characteristics including speed and volume