This guide provides an overview of the audio capabilities OpenAI offers through its APIs. We’ll also link to Pipecat sample code.

Two Ways to Build Voice-to-Voice

You can build voice-to-voice applications in two ways:

  1. The cascaded models approach, using separate models for transcription, the LLM, and voice generation.

In Pipecat code, a cascaded pipeline looks like this. Here’s a single-file example that uses a cascaded pipeline. (See below for an overview of Pipecat core concepts.)

pipeline = Pipeline(
    [
        transport.input(),  # audio input from the user
        speech_to_text,  # transcription
        context_aggregator.user(),  # add the transcript to the conversation context
        llm,  # text-to-text inference
        text_to_speech,  # voice generation
        context_aggregator.assistant(),  # add the response to the conversation context
        transport.output(),  # audio output to the user
    ]
)
  2. Using a single speech-to-speech model. This is conceptually much simpler. Note, though, that most applications also need to implement things like function calling, retrieval-augmented generation, context management, and integration with existing systems, so the core pipeline is only part of an app’s complexity.

Here’s a speech-to-speech pipeline in Pipecat code. And here’s a single-file example that uses the OpenAI Realtime API.

pipeline = Pipeline(
    [
        transport.input(),  # audio input from the user
        context_aggregator.user(),  # add user input to the conversation context
        speech_to_speech_llm,  # audio-to-audio inference
        context_aggregator.assistant(),  # add the response to the conversation context
        transport.output(),  # audio output to the user
    ]
)
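
To run either pipeline, you wrap it in a task and hand it to a runner. Here’s a minimal sketch using Pipecat’s core classes; it assumes the pipeline variable from one of the examples above and an async entry point:

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    # Wrap the pipeline in a task, then run it until the session ends.
    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)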

Which approach should you choose?

  • The cascaded models approach is preferable if you are implementing a complex workflow and need the best possible instruction following performance and function calling reliability. The gpt-4o model operating in text-to-text mode has the strongest instruction following and function calling performance.
  • The speech-to-speech approach offers better audio understanding and human-like voice output. If your application is primarily free-form, open-ended conversation, these attributes might be more important than instruction following and function calling performance. Note also that gpt-4o-audio-preview and the OpenAI Realtime API are currently beta products.

OpenAI Audio Models and APIs

Transcription API

  • Models: gpt-4o-transcribe, gpt-4o-mini-transcribe
  • Pipecat service: OpenAISTTService (reference docs)
  • OpenAI endpoint: /v1/audio/transcriptions (docs)
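
For the cascaded pipeline above, the speech_to_text element can be an OpenAISTTService. A minimal sketch (the model value is illustrative, and the import path may differ across Pipecat versions):

import os

from pipecat.services.openai import OpenAISTTService

speech_to_text = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
)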

Chat Completions API

  • Models: gpt-4o, gpt-4o-mini, gpt-4o-audio-preview
  • Pipecat service: OpenAILLMService (reference docs)
  • OpenAI endpoint: /v1/chat/completions (docs)
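
The llm element in the cascaded pipeline maps to this service. A minimal sketch:

import os

from pipecat.services.openai import OpenAILLMService

llm = OpenAILLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o",
)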

Realtime API

  • Models: gpt-4o-realtime-preview, gpt-4o-mini-realtime-preview
  • Pipecat service: OpenAIRealtimeBetaLLMService (reference docs)
  • OpenAI docs (overview)
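
The speech_to_speech_llm element in the speech-to-speech pipeline maps to this service. A sketch; the import path and constructor arguments reflect recent Pipecat releases and may change while the API is in beta:

import os

from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

speech_to_speech_llm = OpenAIRealtimeBetaLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
)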

Speech API

  • Models: gpt-4o-mini-tts
  • Pipecat service: OpenAITTSService (reference docs)
  • OpenAI endpoint: /v1/audio/speech (docs)
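
And the text_to_speech element in the cascaded pipeline maps to this service. A minimal sketch (the voice and model values are illustrative):

import os

from pipecat.services.openai import OpenAITTSService

text_to_speech = OpenAITTSService(
    api_key=os.getenv("OPENAI_API_KEY"),
    voice="alloy",
    model="gpt-4o-mini-tts",
)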

Sample code and starter kits

If you have a code example or starter kit you would like this doc to link to, please let us know. We can add examples that help people get started with the OpenAI audio models and APIs.

Single-file examples

OpenAI + Twilio + Pipecat Cloud

This starter kit is a complete telephone voice agent that can talk about the NCAA March Madness basketball tournaments and look up real-time game information using function calls.

The starter kit includes two bot configurations: cascaded model and speech-to-speech. The code can be packaged for deployment to Pipecat Cloud, a commercial platform for Pipecat agent hosting.
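
Function calling in a Pipecat bot works by registering a handler with the LLM service. Here’s a hypothetical sketch of a game-lookup tool in the spirit of the starter kit; the function name, arguments, and lookup logic are illustrative, and the handler signature has changed across Pipecat versions, so check the reference docs for the version you’re using:

# Hypothetical handler; replace the placeholder lookup with a real data source.
async def fetch_game_info(function_name, tool_call_id, args, llm, context, result_callback):
    # args is a dict parsed from the model's tool call, e.g. {"team": "UConn"}
    info = {"team": args["team"], "status": "unknown"}  # placeholder result
    await result_callback(info)

llm.register_function("fetch_game_info", fetch_game_info)

The tool itself is declared to the model in the standard OpenAI tools schema, passed in through the LLM context.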