NVIDIA NIM
LLM service implementation using NVIDIA's NIM (NVIDIA Inference Microservices) API with an OpenAI-compatible interface
Overview
NimLLMService provides access to NVIDIA's NIM language models through an OpenAI-compatible interface. It inherits from OpenAILLMService and supports streaming responses, function calling, and context management, with special handling for NVIDIA's incremental token reporting.
Installation
To use NimLLMService, install the required dependencies. You'll also need to set your NVIDIA NIM API key as an environment variable: NIM_API_KEY
Configuration
Constructor Parameters
- api_key: Your NVIDIA NIM API key
- model: Model identifier
- base_url: NVIDIA NIM API endpoint
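As a rough illustration, construction might look like the sketch below. The import path and the endpoint URL shown are assumptions (they can differ by Pipecat version and should be checked against NVIDIA's documentation):

```python
import os

from pipecat.services.nim.llm import NimLLMService  # import path may differ by Pipecat version

llm = NimLLMService(
    api_key=os.getenv("NIM_API_KEY"),
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    # base_url is optional; this endpoint is an assumption, not a documented default.
    base_url="https://integrate.api.nvidia.com/v1",
)
```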
Input Parameters
Inherits OpenAI-compatible parameters:
- frequency_penalty: Reduces the likelihood of repeating tokens based on their frequency. Range: [-2.0, 2.0]
- max_tokens: Maximum number of tokens to generate. Must be greater than or equal to 1
- presence_penalty: Reduces the likelihood of repeating any tokens that have already appeared. Range: [-2.0, 2.0]
- temperature: Controls randomness in the output. Range: [0.0, 2.0]
- top_p: Controls diversity via nucleus sampling. Range: [0.0, 1.0]
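These can typically be supplied through an InputParams-style container, as on the parent OpenAI service; the exact container name is an assumption here:

```python
llm = NimLLMService(
    api_key="your-nim-api-key",
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    params=NimLLMService.InputParams(  # assumed to be inherited from OpenAILLMService
        temperature=0.7,
        top_p=0.9,
        max_tokens=1024,
        frequency_penalty=0.3,
        presence_penalty=0.0,
    ),
)
```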
Usage Example
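A minimal sketch, assuming Pipecat's usual context-aggregator pattern; import paths and helper names may differ across Pipecat versions:

```python
import os

from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.nim.llm import NimLLMService

# Configure the service (the model choice is just an example).
llm = NimLLMService(
    api_key=os.getenv("NIM_API_KEY"),
    model="nvidia/llama-3.1-nemotron-70b-instruct",
)

# Create a conversation context with a system prompt.
context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful assistant."}]
)

# The aggregator pair collects user and assistant messages around the LLM
# so the context stays up to date as the pipeline runs.
context_aggregator = llm.create_context_aggregator(context)

# The service and aggregators are then placed into a Pipeline, e.g.:
# Pipeline([..., context_aggregator.user(), llm, context_aggregator.assistant(), ...])
```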
Methods
See the LLM base class methods for additional functionality.
Function Calling
Supports OpenAI-compatible function calling:
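As a sketch, registering a tool could look like this. The tool schema follows the generic OpenAI tools format, and the handler signature shown is one style Pipecat has used, so verify both against your version:

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Hypothetical weather tool described in the OpenAI tools format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Attach the tools to the conversation context.
context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You can look up the weather."}],
    tools=tools,
)

# Handler invoked when the model requests the tool. The positional signature
# below is an assumption; newer Pipecat releases pass a single params object.
async def get_current_weather(function_name, tool_call_id, args, llm, context, result_callback):
    await result_callback({"conditions": "sunny", "temperature_c": 22})

# llm is the NimLLMService instance from the usage example above.
llm.register_function("get_current_weather", get_current_weather)
```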
Available Models
NVIDIA NIM provides access to various models:
| Model Name | Description |
| --- | --- |
| nvidia/llama-3.1-nemotron-70b-instruct | Llama 3.1 70B Nemotron instruct |
| nvidia/llama-3.1-nemotron-13b-instruct | Llama 3.1 13B Nemotron instruct |
| nvidia/llama-3.1-nemotron-8b-instruct | Llama 3.1 8B Nemotron instruct |
See NVIDIA’s NIM console for a complete list of supported models.
Token Usage Handling
NimLLMService includes special handling for token usage metrics:
- Accumulates incremental token updates from NIM
- Records prompt tokens on first appearance
- Tracks completion tokens as they increase
- Reports final totals at the end of processing
This ensures compatibility with OpenAI’s token reporting format while maintaining accurate metrics.
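A simplified sketch of that accumulation logic follows (illustrative only; the names and structure are not the actual Pipecat implementation):

```python
class TokenUsageAccumulator:
    """Tracks NIM's incremental usage updates and reports OpenAI-style totals."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def update(self, usage: dict):
        # Prompt tokens are recorded once, the first time they appear.
        if self.prompt_tokens == 0 and usage.get("prompt_tokens"):
            self.prompt_tokens = usage["prompt_tokens"]
        # Completion tokens arrive as a running total; keep the largest value seen.
        if usage.get("completion_tokens", 0) > self.completion_tokens:
            self.completion_tokens = usage["completion_tokens"]

    def totals(self) -> dict:
        # Final totals reported once the stream finishes, matching OpenAI's format.
        return {
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
            "total_tokens": self.prompt_tokens + self.completion_tokens,
        }


# Example: three incremental usage updates from a streamed response.
acc = TokenUsageAccumulator()
for chunk_usage in [
    {"prompt_tokens": 12, "completion_tokens": 3},
    {"completion_tokens": 9},
    {"completion_tokens": 15},
]:
    acc.update(chunk_usage)

print(acc.totals())  # {'prompt_tokens': 12, 'completion_tokens': 15, 'total_tokens': 27}
```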
Frame Flow
Inherits the OpenAI LLM Service frame flow.
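In broad strokes (frame names vary slightly across Pipecat versions, so treat this as a summary rather than a definitive list): a context or messages frame arrives at the service, which emits a full-response start frame, a stream of text frames as tokens are generated, and a full-response end frame; errors surface as error frames.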
Metrics Support
The service collects standard LLM metrics:
- Token usage (prompt and completion)
- Processing duration
- Time to First Byte (TTFB)
- Function call metrics
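To actually collect these in a pipeline, metrics usually have to be enabled on the pipeline task; the sketch below assumes Pipecat's PipelineParams flags:

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineParams, PipelineTask

# llm and context_aggregator come from the usage example above; this pipeline is illustrative.
pipeline = Pipeline([context_aggregator.user(), llm, context_aggregator.assistant()])

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # TTFB and processing-duration metrics
        enable_usage_metrics=True,  # prompt/completion token usage
    ),
)
```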
Notes
- OpenAI-compatible interface
- Supports streaming responses
- Handles function calling
- Manages conversation context
- Custom token usage tracking for NIM’s incremental reporting
- Thread-safe processing
- Automatic error handling