
Overview

SmolVlmService runs HuggingFace SmolVLM vision-language models locally to describe images flowing through a Pipecat pipeline. The compact, instruction-tuned models run entirely on your own hardware — no external API calls, no per-image billing, and no data leaving your infrastructure. It accepts UserImageRawFrame input and emits a VisionTextFrame containing the generated description.

Source Repository

Source code, examples, and issues for the SmolVLM integration

PyPI Package

The pipecat-smolvlm package on PyPI

SmolVLM Models

Browse the SmolVLM checkpoints on HuggingFace

Installation

This is a community-maintained package distributed separately from pipecat-ai:
# Core install
pip install pipecat-smolvlm

# With Flash Attention 2 (CUDA only — significant speedup)
pip install "pipecat-smolvlm[flash-attn]"

# Intel XPU support
pip install "pipecat-smolvlm[xpu]"
Model weights are downloaded from HuggingFace on first use and cached in ~/.cache/huggingface/.
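
To check whether a checkpoint is already cached before the first run (for example on an offline machine), something like the following may help. It is a minimal sketch that assumes the standard huggingface_hub cache layout (a hub/ directory holding models--{org}--{name}/snapshots/ folders) and the HF_HUB_CACHE, HF_HOME, and XDG_CACHE_HOME environment variables. The helper names are hypothetical and are not part of pipecat-smolvlm.

```python
import os
from pathlib import Path
from typing import Optional


def hub_cache_dir() -> Path:
    """Resolve the HuggingFace hub cache directory (assumed standard layout)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    if "HF_HOME" in os.environ:
        return Path(os.environ["HF_HOME"]) / "hub"
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "huggingface" / "hub"


def model_cache_folder_name(model_id: str) -> str:
    """Map 'org/name' to the folder name used inside the hub cache."""
    return "models--" + model_id.replace("/", "--")


def is_model_cached(model_id: str, cache_dir: Optional[Path] = None) -> bool:
    """Return True if at least one snapshot of model_id exists in the cache."""
    root = cache_dir if cache_dir is not None else hub_cache_dir()
    snapshots = root / model_cache_folder_name(model_id) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    print(is_model_cached("HuggingFaceTB/SmolVLM-256M-Instruct"))
```

A snapshot folder can exist while a download is still incomplete, so treat a True result as a hint, not a guarantee.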

Prerequisites

No accounts or API keys are required — inference runs locally. The service automatically selects the best available device (Intel XPU, CUDA, Apple MPS, then CPU). Pass use_cpu=True to force CPU execution.
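
The device priority described above can be illustrated with a short sketch. It is not the package's implementation: pick_device and probe_torch_devices are hypothetical helpers, and the probe assumes PyTorch is the backend and degrades to an empty set when it is not installed.

```python
from typing import Set

# Priority order as documented: Intel XPU, CUDA, Apple MPS, then CPU.
DEVICE_PRIORITY = ("xpu", "cuda", "mps")


def pick_device(available: Set[str], use_cpu: bool = False) -> str:
    """Return the highest-priority available device, or "cpu" as the fallback."""
    if use_cpu:
        return "cpu"
    for name in DEVICE_PRIORITY:
        if name in available:
            return name
    return "cpu"


def probe_torch_devices() -> Set[str]:
    """Best-effort probe of accelerator backends via PyTorch, if installed."""
    try:
        import torch
    except ImportError:
        return set()
    found: Set[str] = set()
    xpu = getattr(torch, "xpu", None)  # absent in older PyTorch builds
    if xpu is not None and xpu.is_available():
        found.add("xpu")
    if torch.cuda.is_available():
        found.add("cuda")
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        found.add("mps")
    return found


if __name__ == "__main__":
    print(pick_device(probe_torch_devices()))
```

Running this before starting a pipeline is a quick way to see which device the documented priority order would be expected to favor on a given machine.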

Configuration

model (str, default: None): (Deprecated) HuggingFace model identifier. Prefer settings=SmolVlmService.Settings(model=...).
use_cpu (bool, default: False): Force CPU execution even when a GPU is available. Useful for reproducibility or constrained environments.
settings (SmolVlmService.Settings, default: None): Runtime-configurable generation settings. See Settings below.

Settings

Generation settings are supplied through the constructor's settings argument as a SmolVlmService.Settings(...) instance.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "HuggingFaceTB/SmolVLM-256M-Instruct" | HuggingFace model ID or local path. |
| max_new_tokens | int | 500 | Maximum tokens to generate per image. |
| default_prompt | str | "Describe the given image." | Fallback prompt when the frame carries no text. |
| temperature | float \| None | None | Sampling temperature. None uses greedy decoding. |
| do_sample | bool | False | Enable sampling without a fixed temperature. |
Available parameters and defaults are defined by the integration. See the source repository for the authoritative, up-to-date list.
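
To make the interaction between temperature and do_sample concrete, here is one plausible way such settings could translate into keyword arguments for a transformers-style generate() call. This is a sketch under assumptions: SketchSettings is a hypothetical stand-in that mirrors the documented defaults, and the mapping rule (a set temperature implies sampling) is a guess at the behavior, not the integration's actual code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchSettings:
    """Hypothetical mirror of the documented defaults; not the package's class."""
    model: str = "HuggingFaceTB/SmolVLM-256M-Instruct"
    max_new_tokens: int = 500
    default_prompt: str = "Describe the given image."
    temperature: Optional[float] = None
    do_sample: bool = False


def generation_kwargs(s: SketchSettings) -> Dict[str, Any]:
    """Build generate()-style kwargs (assumed mapping, see lead-in)."""
    kwargs: Dict[str, Any] = {"max_new_tokens": s.max_new_tokens}
    if s.temperature is not None:
        # Assumption: an explicit temperature turns sampling on.
        kwargs["do_sample"] = True
        kwargs["temperature"] = s.temperature
    else:
        # No temperature: greedy unless do_sample was requested explicitly.
        kwargs["do_sample"] = s.do_sample
    return kwargs


if __name__ == "__main__":
    print(generation_kwargs(SketchSettings()))
    print(generation_kwargs(SketchSettings(temperature=0.3, do_sample=True)))
```

With the defaults, the sketch yields greedy decoding capped at 500 new tokens, which matches the table above.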

Usage

from pipecat_smolvlm import SmolVlmService

# Default: SmolVLM-256M-Instruct, auto device detection
service = SmolVlmService()

# Custom model and generation settings
service = SmolVlmService(
    settings=SmolVlmService.Settings(
        model="HuggingFaceTB/SmolVLM-500M-Instruct",
        max_new_tokens=256,
        default_prompt="What objects can you see?",
        temperature=0.3,
        do_sample=True,
    )
)

# Force CPU (no GPU required)
service = SmolVlmService(use_cpu=True)
Drop SmolVlmService into a Pipeline wherever a vision service is expected:
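from pipecat.pipeline.pipeline import Pipeline

# transport, image_capture_processor, vision_to_speech, and tts_service
# are assumed to be created elsewhere in your application.
pipeline = Pipeline([
    transport.input(),
    image_capture_processor,   # converts camera frames → UserImageRawFrame
    SmolVlmService(),
    vision_to_speech,          # VisionTextFrame → TextFrame
    tts_service,
    transport.output(),
])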
To use a custom per-image prompt, set the text field on the incoming UserImageRawFrame; the service falls back to settings.default_prompt when it is empty.
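
The fallback rule can be expressed in a few lines. This is an illustrative sketch only: resolve_prompt is a hypothetical helper, and it treats None and the empty string as "empty"; how the package handles whitespace-only text is not stated, so the sketch passes such text through unchanged.

```python
from typing import Optional

# Documented default value of Settings.default_prompt.
DEFAULT_PROMPT = "Describe the given image."


def resolve_prompt(frame_text: Optional[str], default_prompt: str = DEFAULT_PROMPT) -> str:
    """Use the frame's text when it is non-empty, otherwise the default prompt."""
    return frame_text if frame_text else default_prompt


if __name__ == "__main__":
    print(resolve_prompt("How many people are in the picture?"))
    print(resolve_prompt(None))
```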

Compatibility

The package targets pipecat-ai>=0.0.100. Check the source repository for the latest tested version and changelog.