Dial-in: Twilio (Media Streams)
Call your Pipecat bot over websockets using Twilio
This guide walks through creating a voice AI agent that users can reach by dialing a Twilio phone number. We’ll integrate Twilio Media Streams via WebSockets with a Pipecat pipeline to create a fully functional voice AI experience.
Things you’ll need
- An active Twilio account with at least one phone number
- A public-facing server or a tunneling service like ngrok
- API keys for speech-to-text, text-to-speech, and LLM services
Complete example code on GitHub
The full source code for this example is available in the Pipecat repository.
Architecture Overview
When a user dials your Twilio phone number:
- Twilio calls your server’s endpoint, which returns TwiML that establishes a WebSocket connection
- Twilio sends real-time audio data over the WebSocket, and your server processes it using the Pipecat pipeline
- Your Pipecat agent runs the incoming audio through its pipeline and sends the generated audio back to Twilio
- Twilio plays this audio to the caller in real-time
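To make the flow above more concrete, here is a minimal sketch of how a server might interpret the JSON messages Twilio sends over the WebSocket. The event names (connected, start, media, stop) and field names follow Twilio's Media Streams message format; the function name handle_twilio_message and the state dictionary are hypothetical and only for illustration. In the actual example, Pipecat's serializer does this work for you.

```python
import base64
import json
from typing import Optional


def handle_twilio_message(raw: str, state: dict) -> Optional[bytes]:
    """Process one Media Streams message (hypothetical helper).

    Updates `state` with call identifiers and returns the raw audio bytes
    for "media" events, or None for every other event.
    """
    message = json.loads(raw)
    event = message.get("event")

    if event == "start":
        # The start event carries the identifiers needed later on.
        state["stream_sid"] = message["start"]["streamSid"]
        state["call_sid"] = message["start"]["callSid"]
    elif event == "media":
        # Audio arrives base64-encoded inside the JSON envelope.
        return base64.b64decode(message["media"]["payload"])
    elif event == "stop":
        state["stopped"] = True

    # "connected" and unknown events need no action in this sketch.
    return None
```

The stream SID is needed to address audio sent back to Twilio, and the call SID identifies the call in Twilio's REST API.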
Server Implementation
The server needs to:
- Serve TwiML in response to Twilio’s HTTP requests
- Handle WebSocket connections
- Process audio with the Pipecat pipeline
Let’s look at the key components:
FastAPI Server Setup
This server has two main endpoints:
- POST /: Returns TwiML instructions to Twilio
- WebSocket /ws: Handles the WebSocket connection for real-time audio
TwiML Configuration
The TwiML tells Twilio to establish a WebSocket connection with your server.
Replace <your server url> with your server’s publicly accessible domain. The Pause element keeps the call alive for a maximum of 40 seconds. Adjust this value based on your expected conversation length.
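The TwiML described above can be sketched as follows. This is an illustrative way to generate the document with Python's standard library, assuming the WebSocket endpoint lives at /ws as in this guide; the function name build_twiml and its parameters are hypothetical, and the example project may simply ship a static XML template instead.

```python
import xml.etree.ElementTree as ET


def build_twiml(server_host: str, pause_seconds: int = 40) -> str:
    """Build TwiML that connects the call to a WebSocket stream.

    `server_host` is your publicly accessible domain (no scheme), for
    example an ngrok hostname. `pause_seconds` sets the Pause length.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # Stream points Twilio at the server's WebSocket endpoint.
    ET.SubElement(connect, "Stream", url=f"wss://{server_host}/ws")
    # Pause follows Connect, matching the layout described in the text.
    ET.SubElement(response, "Pause", length=str(pause_seconds))

    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_twiml("your-server.com"))
```

Your POST / endpoint would return this string with an XML content type.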
Pipecat Bot Implementation
The run_bot function creates and connects all the components in the Pipecat pipeline.
Key Technical Considerations
Audio Format and Sample Rate
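Twilio Media Streams carries 8kHz mono audio, encoded as 8-bit μ-law on the wire. The TwilioFrameSerializer converts it to and from 16-bit PCM for the pipeline. Make sure your pipeline’s sample rates are configured to match.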
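To illustrate the format, here is a small standalone sketch of decoding μ-law (G.711) bytes into 16-bit PCM samples and computing how much audio a payload holds at 8kHz. The function names are hypothetical; the serializer handles this conversion for you, so this is for understanding the data, not something you need to add to your bot.

```python
import struct

SAMPLE_RATE_HZ = 8000  # Twilio Media Streams sample rate
ULAW_BIAS = 0x84       # Bias constant from the G.711 mu-law definition


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one mu-law byte into a signed 16-bit PCM sample."""
    u = ~u & 0xFF                # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode a mu-law payload into little-endian 16-bit PCM bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


def duration_ms(payload: bytes) -> float:
    """Duration of a mu-law payload: one byte per sample at 8kHz."""
    return 1000.0 * len(payload) / SAMPLE_RATE_HZ
```

Since each μ-law byte is one sample, a 160-byte payload holds 20 ms of audio and decodes to 320 bytes of PCM.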
Serialization and Call Control
The TwilioFrameSerializer handles the protocol specifics for communicating with Twilio’s Media Streams.
When you provide the account_sid and auth_token to the TwilioFrameSerializer, it will automatically end the call via Twilio’s REST API when the pipeline ends. This ensures clean call termination when your bot finishes its conversation.
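To give a sense of the protocol specifics involved, here is a sketch of the two outbound messages a server typically sends to Twilio: a media message carrying audio, and a clear message that discards audio Twilio has buffered (useful when the caller interrupts the bot). The message shapes follow Twilio's Media Streams format; the helper names are hypothetical, and this is not the TwilioFrameSerializer's actual code.

```python
import base64
import json


def media_message(stream_sid: str, ulaw_audio: bytes) -> str:
    """Wrap mu-law audio bytes in a Media Streams "media" message."""
    return json.dumps(
        {
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(ulaw_audio).decode("ascii")},
        }
    )


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to drop any audio it has queued for playback."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Both messages are addressed by the stream SID that Twilio provides in its start event.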
Voice Activity Detection
The SileroVAD analyzer helps determine when a user has finished speaking.
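To show the general idea of end-of-speech detection, here is a deliberately simple toy sketch. It is not how Silero works: Silero is a neural model, while this example uses a plain energy threshold. The function names, the threshold, and the stop_frames parameter are all assumptions chosen for illustration.

```python
import math
import struct
from typing import Iterable, Optional


def rms(pcm16: bytes) -> float:
    """Root-mean-square energy of little-endian 16-bit PCM audio."""
    count = len(pcm16) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def find_end_of_speech(
    frames: Iterable[bytes], threshold: float = 500.0, stop_frames: int = 10
) -> Optional[int]:
    """Return the index of the frame where the speaker is judged done.

    Speech starts at the first frame at or above `threshold`. It ends once
    `stop_frames` consecutive quieter frames follow. Returns None if that
    never happens.
    """
    speaking = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run >= stop_frames:
                return index
    return None
```

The trade-off in any such detector is the stop delay: a shorter delay makes the bot respond faster but risks cutting off callers who pause mid-sentence.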
Configuring Twilio
To set up your Twilio phone number to use your server:
- Purchase a phone number in your Twilio account
- Navigate to the Phone Numbers section and select your number
- Under “Voice & Fax”, set the webhook for “A Call Comes In” to your server’s URL (e.g., https://your-server.com/)
- Make sure the request type is set to HTTP POST
- Save your changes
If you’re using ngrok for local development, your webhook URL will look like https://abc123.ngrok.io/. Remember to update your TwiML template with the correct WebSocket URL as well.
Testing Your Implementation
Local Testing Without Phone Calls
The example includes a test client that can simulate phone calls without actually using Twilio.
The -t flag puts the server in testing mode, and the test client creates virtual clients that communicate with your server as if they were Twilio Media Streams.
Using the Phone
To test with an actual phone call:
- Make sure your server is running and accessible via the internet
- Configure Twilio as described above
- Dial your Twilio phone number from any phone
- You should hear your AI agent respond!
Ending a Conversation
There are two primary ways to end a conversation:
- Automatic termination: If you provided Twilio credentials to the TwilioFrameSerializer, the call will be ended automatically when your pipeline ends.
- Manual termination: You can also end the call explicitly through Twilio’s REST API. This is useful for implementing custom hang-up logic.
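For manual termination, here is a minimal sketch that uses only the standard library. It builds a request to Twilio's REST API that updates the call's status to completed, which hangs up the call. The function name build_hangup_request is hypothetical, and the call SID is assumed to come from the start event of the media stream. In practice you may prefer Twilio's official Python helper library.

```python
import base64
import urllib.parse
import urllib.request


def build_hangup_request(
    account_sid: str, auth_token: str, call_sid: str
) -> urllib.request.Request:
    """Build (but do not send) a request that ends an in-progress call."""
    url = (
        "https://api.twilio.com/2010-04-01/"
        f"Accounts/{account_sid}/Calls/{call_sid}.json"
    )
    # Setting Status to "completed" tells Twilio to hang up the call.
    body = urllib.parse.urlencode({"Status": "completed"}).encode("ascii")
    # Twilio's REST API uses HTTP Basic auth with the SID and auth token.
    credentials = base64.b64encode(
        f"{account_sid}:{auth_token}".encode("utf-8")
    ).decode("ascii")

    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# To actually end the call, send the request:
#   with urllib.request.urlopen(build_hangup_request(sid, token, call_sid)):
#       pass
```

Keeping request construction separate from sending makes the hang-up logic easy to unit test without touching the network.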
Scaling Considerations
For production deployments, consider:
- Multiple concurrent calls: Each bot instance should run in its own process to handle concurrent calls efficiently. The example server can spawn individual bot processes for each call.
- Error handling: Add robust error handling for network issues, service outages, etc.
- Logging and monitoring: Implement detailed logging to track call quality and agent performance.
- Security: Add authentication to your endpoints and use environment variables for all sensitive credentials.
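As one example of the security point above, Twilio signs its webhook requests with an X-Twilio-Signature header. You can verify it before returning TwiML. The sketch below follows Twilio's documented scheme for form-encoded POST requests as we understand it: an HMAC-SHA1 over the full URL followed by the POST parameters sorted by name, keyed with your auth token, then base64-encoded. The function names are hypothetical. For production use, Twilio's official helper library provides a RequestValidator class that covers more cases.

```python
import base64
import hashlib
import hmac
from typing import Mapping


def compute_twilio_signature(
    auth_token: str, url: str, params: Mapping[str, str]
) -> str:
    """Compute the expected signature for a form-encoded POST webhook."""
    # Append each parameter name and value, sorted by name, to the URL.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(
    auth_token: str, url: str, params: Mapping[str, str], signature_header: str
) -> bool:
    """Compare the header to the expected value in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The url argument must be the exact public URL Twilio called, which matters when your server sits behind ngrok or a reverse proxy. Load the auth token from an environment variable rather than hard-coding it.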