This guide provides an overview of the audio capabilities OpenAI offers through its APIs. We’ll also link to Pipecat sample code.

Two Ways to Build Voice-to-Voice

You can build voice-to-voice applications in two ways:

  1. The cascaded models approach, using separate models for transcription, the LLM, and voice generation.

In Pipecat code, a cascaded pipeline looks like this. Here’s a single-file example that uses a cascaded pipeline. (See below for an overview of Pipecat core concepts.)

pipeline = Pipeline(
    [
        transport.input(),  # audio input from the user
        speech_to_text,  # transcription
        context_aggregator.user(),  # add the transcript to the conversation context
        llm,  # text-to-text inference
        text_to_speech,  # voice generation
        context_aggregator.assistant(),  # add the response to the conversation context
        transport.output(),  # audio output to the user
    ]
)
  2. Using a single speech-to-speech model. This is conceptually much simpler. Note, though, that most applications also need to implement things like function calling, retrieval-augmented generation, context management, and integration with existing systems, so the core pipeline is only part of an app’s complexity.

Here’s a speech-to-speech pipeline in Pipecat code. And here’s a single-file example that uses the OpenAI Realtime API.

pipeline = Pipeline(
    [
        transport.input(),  # audio input from the user
        context_aggregator.user(),  # add user input to the conversation context
        speech_to_speech_llm,  # audio-to-audio inference
        context_aggregator.assistant(),  # add the response to the conversation context
        transport.output(),  # audio output to the user
    ]
)
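
To run either pipeline, you wrap it in a task and hand it to a runner. Here’s a minimal sketch using Pipecat’s core classes; it assumes the pipeline variable from one of the examples above and an async entry point:

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    # Wrap the pipeline in a task, then run it until the session ends.
    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)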

Which approach should you choose?

  • The cascaded models approach is preferable if you are implementing a complex workflow and need the best possible instruction following performance and function calling reliability. The gpt-4o model operating in text-to-text mode has the strongest instruction following and function calling performance.
  • The speech-to-speech approach offers better audio understanding and human-like voice output. If your application is primarily free-form, open-ended conversation, these attributes might be more important than instruction following and function calling performance. Note also that gpt-4o-audio-preview and the OpenAI Realtime API are currently beta products.

OpenAI Audio Models and APIs

Transcription API

  • Models: gpt-4o-transcribe, gpt-4o-mini-transcribe
  • Pipecat service: OpenAISTTService (reference docs)
  • OpenAI endpoint: /v1/audio/transcriptions (docs)
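
For the cascaded pipeline above, the speech_to_text element can be an OpenAISTTService. A minimal sketch (the model value is illustrative, and the import path may differ across Pipecat versions):

import os

from pipecat.services.openai import OpenAISTTService

speech_to_text = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
)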

Chat Completions API

  • Models: gpt-4o, gpt-4o-mini, gpt-4o-audio-preview
  • Pipecat service: OpenAILLMService (reference docs)
  • OpenAI endpoint: /v1/chat/completions (docs)
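
The llm element in the cascaded pipeline maps to this service. A minimal sketch:

import os

from pipecat.services.openai import OpenAILLMService

llm = OpenAILLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o",
)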

Realtime API

  • Models: gpt-4o-realtime-preview, gpt-4o-mini-realtime-preview
  • Pipecat service: OpenAIRealtimeBetaLLMService (reference docs)
  • OpenAI docs (overview)
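
The speech_to_speech_llm element in the speech-to-speech pipeline maps to this service. A sketch; the import path and constructor arguments reflect recent Pipecat releases and may change while the API is in beta:

import os

from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

speech_to_speech_llm = OpenAIRealtimeBetaLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
)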

Speech API

  • Models: gpt-4o-mini-tts
  • Pipecat service: OpenAITTSService (reference docs)
  • OpenAI endpoint: /v1/audio/speech (docs)
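
And the text_to_speech element in the cascaded pipeline maps to this service. A minimal sketch (the voice and model values are illustrative):

import os

from pipecat.services.openai import OpenAITTSService

text_to_speech = OpenAITTSService(
    api_key=os.getenv("OPENAI_API_KEY"),
    voice="alloy",
    model="gpt-4o-mini-tts",
)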

Sample code and starter kits

If you have a code example or starter kit you would like this doc to link to, please let us know. We can add examples that help people get started with the OpenAI audio models and APIs.

Single-file examples

OpenAI + Twilio + Pipecat Cloud

This starter kit is a complete telephone voice agent that can talk about the NCAA March Madness basketball tournaments and look up real-time game information using function calls.

The starter kit includes two bot configurations: cascaded model and speech-to-speech. The code can be packaged for deployment to Pipecat Cloud, a commercial platform for Pipecat agent hosting.
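
Function calling in a Pipecat bot works by registering a handler with the LLM service. Here’s a hypothetical sketch of a game-lookup tool in the spirit of the starter kit; the function name, arguments, and lookup logic are illustrative, and the handler signature has changed across Pipecat versions, so check the reference docs for the version you’re using:

# Hypothetical handler; replace the placeholder lookup with a real data source.
async def fetch_game_info(function_name, tool_call_id, args, llm, context, result_callback):
    # args is a dict parsed from the model's tool call, e.g. {"team": "UConn"}
    info = {"team": args["team"], "status": "unknown"}  # placeholder result
    await result_callback(info)

llm.register_function("fetch_game_info", fetch_game_info)

The tool itself is declared to the model in the standard OpenAI tools schema, passed in through the LLM context.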