Modal is well-suited for Pipecat deployments because it handles container orchestration, scaling, and cold starts efficiently. This makes it a good choice for production Pipecat bots that need reliable performance.

This guide walks through the Modal example included in the Pipecat repository, which follows the same deployment pattern.

Modal example

View the complete Modal deployment example in our GitHub repository

Install the Modal CLI

Set up Modal: Follow Modal’s official instructions for creating an account and setting up the CLI.
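If you haven’t used Modal before, the setup generally comes down to installing the package and authenticating. Check Modal’s docs for the current commands; at the time of writing it looks like:

    pip install modal
    modal setup  # opens a browser window to authenticate your account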

Deploy a self-serve LLM

  1. Deploy Modal’s OpenAI-compatible LLM service:

    git clone https://github.com/modal-labs/modal-examples
    cd modal-examples
    modal deploy 06_gpu_and_ml/llm-serving/vllm_inference.py
    

Refer to Modal’s guide and example for Deploying an OpenAI-compatible LLM service with vLLM for more details.

  2. Take note of the endpoint URL from the previous step, which will look like:
    https://{your-workspace}--example-vllm-openai-compatible-serve.modal.run
    
    You’ll need this for the bot_vllm.py file in the next section.
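Before wiring the endpoint into Pipecat, you can sanity-check it with the standard OpenAI client, since the service is OpenAI-compatible. A minimal sketch; the model id and API key below are assumptions, so check the vLLM example’s source for the exact values it serves and expects:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://{your-workspace}--example-vllm-openai-compatible-serve.modal.run/v1",
        api_key="super-secret-key",  # assumed placeholder; use whatever key your deployment expects
    )

    # Model id is an assumption based on the example's default Llama-3.1 model
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)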

The default Modal LLM example uses Llama-3.1 and will shut down after 15 minutes of inactivity. Cold starts take 5-10 minutes. To prepare the service, we recommend visiting the /docs endpoint (https://<Modal workspace>--example-vllm-openai-compatible-serve.modal.run/docs) for your deployed LLM and waiting for it to fully load before connecting your client.
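If you’d rather script this warm-up than watch the /docs page by hand, a small polling helper works. This is a sketch using the requests library, with the URL pattern from above:

    import time
    import requests

    def wait_for_llm(docs_url: str, timeout_s: int = 900, interval_s: int = 15) -> None:
        """Poll the vLLM service's /docs page until it responds, allowing for a 5-10 minute cold start."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                if requests.get(docs_url, timeout=10).status_code == 200:
                    print("LLM service is ready")
                    return
            except requests.RequestException:
                pass  # service still booting; keep polling
            time.sleep(interval_s)
        raise TimeoutError(f"LLM service not ready after {timeout_s}s")

    wait_for_llm("https://{your-workspace}--example-vllm-openai-compatible-serve.modal.run/docs")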

Deploy FastAPI App and Pipecat pipeline to Modal

  1. Set up environment variables:

    cd server
    cp env.example .env
    # Modify .env to provide your service API keys

Alternatively, you can configure your Modal app to use secrets.
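If you go the secrets route, you create the secret once from the CLI and attach it to the Modal Function. The secret and variable names below are hypothetical, so match them to the keys in env.example:

    import modal

    app = modal.App("pipecat-modal")

    # Created once via the CLI (hypothetical name and keys):
    #   modal secret create pipecat-secrets OPENAI_API_KEY=... DAILY_API_KEY=...
    @app.function(secrets=[modal.Secret.from_name("pipecat-secrets")])
    def bot_runner():
        import os

        # Secret values are exposed as environment variables inside the container
        api_key = os.environ["OPENAI_API_KEY"]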

  2. Update the modal_url in server/src/bot_vllm.py to point to the URL you received from the self-serve LLM deployment in the previous step (a sketch of this change appears after these steps).

  3. From within the server directory, test the app locally:

    modal serve app.py
  4. Deploy to production:

    modal deploy app.py

  5. Note the endpoint URL produced by this deployment. It will look like:

    https://{your-workspace}--pipecat-modal-fastapi-app.modal.run

You’ll need this URL for the client’s app.js configuration mentioned in its README.
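For the modal_url change in step 2, the edit amounts to pointing Pipecat’s OpenAI-compatible LLM service at your vLLM endpoint. A hedged sketch: the import path, parameter names, and model id below follow current Pipecat releases and the vLLM example’s defaults, so check bot_vllm.py itself for the exact structure:

    from pipecat.services.openai.llm import OpenAILLMService

    # URL from the self-serve LLM deployment step
    modal_url = "https://{your-workspace}--example-vllm-openai-compatible-serve.modal.run"

    llm = OpenAILLMService(
        base_url=f"{modal_url}/v1",  # vLLM serves an OpenAI-compatible API under /v1
        api_key="super-secret-key",  # assumed placeholder; use your deployment's key
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model id
    )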

Launch your bots on Modal

Option 1: Connect via direct link

Click the URL displayed after the serve or deploy step to launch an agent and be redirected to a Daily room, where you can talk with the launched bot. This uses the OpenAI pipeline.

Option 2: Connect via an RTVI Client

Follow the instructions provided in the client folder’s README for building and running a custom client that connects to your Modal endpoint. The provided client includes a dropdown for choosing which bot pipeline to run.

Monitor your deployment

On your Modal dashboard, you should see two Apps listed under Live Apps:

  1. example-vllm-openai-compatible: This App contains the containers and logs used to run your self-hosted LLM. There will be just one App Function listed: serve. Click on this function to view logs for your LLM.
  2. pipecat-modal: This App contains the containers and logs used to run your connect endpoints and Pipecat pipelines. It will list two App Functions:
    1. fastapi_app: This function runs the endpoints that your client interacts with to initiate a new pipeline (/, /connect, /status). Click on this function to see logs for each endpoint hit (an example request appears after this list).
    2. bot_runner: This function handles launching and running a bot pipeline. Click on this function to get a list of all pipeline runs and access each run’s logs.
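If you want to exercise the /connect endpoint outside of a client, a request like the following should start a pipeline. The request and response shapes here are assumptions, so check fastapi_app’s source for the real signature:

    import requests

    resp = requests.post(
        "https://{your-workspace}--pipecat-modal-fastapi-app.modal.run/connect"
    )
    # The response typically carries what the client needs to join,
    # e.g. a Daily room URL and token
    print(resp.json())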
Deployment notes:

  • In most other Pipecat examples, we use Popen to launch the pipeline process from the /connect endpoint. In this example, we use a Modal function instead. This allows us to run the pipelines using a separately defined Modal image and to run each pipeline in an isolated container (see the sketch below).
  • For the FastAPI app and most common Pipecat pipeline containers, a default debian_slim CPU-only image should be all that’s required. GPU containers are only needed for self-hosted services such as the vLLM deployment above.
  • To minimize pipeline cold starts and reduce latency for users, set min_containers=1 on the Modal Function that launches the pipeline to ensure at least one warm instance of your function is always available.
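Putting those notes together, the launcher Function might look roughly like this. The image contents, function names, and arguments are illustrative, not the example’s exact code:

    import modal

    app = modal.App("pipecat-modal")

    # A CPU-only image is enough for most Pipecat pipelines; packages are illustrative
    bot_image = modal.Image.debian_slim().pip_install("pipecat-ai")

    @app.function(image=bot_image, min_containers=1)  # keep one warm instance to cut cold starts
    def bot_runner(room_url: str, token: str):
        # Run a single Pipecat pipeline in this isolated container (details elided)
        ...

    # From the /connect endpoint, launch the pipeline in its own container
    # instead of Popen-ing a subprocess:
    #   bot_runner.spawn(room_url, token)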

Next steps

Explore Modal's LLM Examples

For next steps on running a self-hosted LLM and reducing latency, check out all of Modal’s LLM examples