Overview

NimLLMService provides access to NVIDIA’s NIM language models through an OpenAI-compatible interface. It inherits from OpenAILLMService and supports streaming responses, function calling, and context management, with special handling for NVIDIA’s incremental token reporting.
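
A minimal construction looks like this (the model name matches the full example later on this page):

import os

from pipecat.services.nim.llm import NimLLMService

llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    model="nvidia/llama-3.1-nemotron-70b-instruct",
)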

Installation

To use NVIDIA NIM services, install the required dependencies:

pip install "pipecat-ai[nim]"

You’ll also need to set up your NVIDIA API key as an environment variable: NVIDIA_API_KEY.
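
For example, in your shell:

export NVIDIA_API_KEY=your-api-key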

Get your API key from NVIDIA Build.

Frames

Input

  • OpenAILLMContextFrame - Conversation context and history
  • LLMMessagesFrame - Direct message list
  • VisionImageRawFrame - Images for vision processing
  • LLMUpdateSettingsFrame - Runtime parameter updates

Output

  • LLMFullResponseStartFrame / LLMFullResponseEndFrame - Response boundaries
  • LLMTextFrame - Streamed completion chunks
  • FunctionCallInProgressFrame / FunctionCallResultFrame - Function call lifecycle
  • ErrorFrame - API or processing errors
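
If you need to observe these frames directly (for example, to log streamed completion chunks), a small custom processor can sit downstream of the LLM in the pipeline. Here is a minimal sketch; the LLMChunkLogger name is illustrative, not part of Pipecat:

from pipecat.frames.frames import Frame, LLMTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class LLMChunkLogger(FrameProcessor):
    # Illustrative processor: logs each streamed completion chunk
    # while passing every frame through unchanged.
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, LLMTextFrame):
            print(f"LLM chunk: {frame.text!r}")
        await self.push_frame(frame, direction)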

Function Calling

Function Calling Guide

Learn how to implement function calling with standardized schemas, register handlers, manage context properly, and control execution flow in your conversational AI applications.

Context Management

Context Management Guide

Learn how to manage conversation context, handle message history, and integrate context aggregators for consistent conversational experiences.

Usage Example

import os

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.frames.frames import TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.nim.llm import NimLLMService

# Configure NVIDIA NIM service
llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    params=NimLLMService.InputParams(
        temperature=0.7,
        max_tokens=1000
    )
)

# Define function for tool calling
weather_function = FunctionSchema(
    name="get_current_weather",
    description="Get current weather information",
    properties={
        "location": {
            "type": "string",
            "description": "City and state, e.g. San Francisco, CA"
        },
        "format": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature unit to use"
        }
    },
    required=["location", "format"]
)

tools = ToolsSchema(standard_tools=[weather_function])

# Create context optimized for voice
context = OpenAILLMContext(
    messages=[
        {
            "role": "system",
            "content": """You are a helpful assistant optimized for voice interactions.
            Keep responses concise and avoid special characters for better speech synthesis."""
        }
    ],
    tools=tools
)

# Create context aggregators
context_aggregator = llm.create_context_aggregator(context)

# Register a handler for the weather function
async def fetch_weather(params):
    location = params.arguments["location"]
    # A real handler would look up the weather for `location`;
    # here we return a canned result for illustration.
    await params.result_callback({"conditions": "sunny", "temperature": "75°F"})

llm.register_function("get_current_weather", fetch_weather)

# Optional: Add function call feedback
@llm.event_handler("on_function_calls_started")
async def on_function_calls_started(service, function_calls):
    await tts.queue_frame(TTSSpeakFrame("Let me check on that."))

# Use in pipeline (transport, stt, and tts are assumed to be created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])
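
To run the example, wrap the pipeline in a PipelineTask and hand it to a PipelineRunner. A minimal sketch; the allow_interruptions setting is a typical choice for voice pipelines, not a requirement:

import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

async def main():
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)

asyncio.run(main())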

Metrics

The service includes specialized token usage tracking adapted to NIM’s incremental reporting:

  • Time to First Byte (TTFB) - Response latency measurement
  • Processing Duration - Total request processing time
  • Token Usage - Tracks tokens used per request, compatible with NIM’s incremental reporting

Enable with:

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,
        enable_usage_metrics=True
    )
)

Additional Notes

  • OpenAI Compatibility: Full compatibility with OpenAI API features and parameters
  • NVIDIA Optimization: Hardware-accelerated inference on NVIDIA infrastructure
  • Token Reporting: NIM reports running token totals on each streamed chunk, whereas OpenAI reports usage once at the end of the stream; the service reconciles the two so usage metrics stay accurate
  • Model Variety: Access to Nemotron and other NVIDIA-optimized model variants
