SmolVLM

Overview

SmolVlmService runs HuggingFace SmolVLM vision-language models locally to describe images flowing through a Pipecat pipeline. The compact, instruction-tuned models run entirely on your own hardware: no external API calls, no per-image billing, and no data leaving your infrastructure. The service accepts UserImageRawFrame input and emits a VisionTextFrame containing the generated description.

Source Repository

Source code, examples, and issues for the SmolVLM integration

PyPI Package

The pipecat-smolvlm package on PyPI

SmolVLM Models

Browse the SmolVLM checkpoints on HuggingFace

Installation

This is a community-maintained package distributed separately from pipecat-ai:

# Core install
pip install pipecat-smolvlm

# With Flash Attention 2 (CUDA only — significant speedup)
pip install "pipecat-smolvlm[flash-attn]"

# Intel XPU support
pip install "pipecat-smolvlm[xpu]"

Model weights are downloaded from HuggingFace on first use and cached in ~/.cache/huggingface/.
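If you need to know where the weights will land (for example, to pre-populate a container image), the cache root can be worked out from the environment. The sketch below follows the Hugging Face convention that `HF_HOME` overrides the default and that `XDG_CACHE_HOME`, when set, replaces `~/.cache`. The helper name `hf_cache_root` is hypothetical and not part of pipecat-smolvlm.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hf_cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory Hugging Face libraries use as their cache root.

    Hypothetical helper for illustration; it only inspects environment
    variables and never touches the filesystem.
    """
    env = os.environ if env is None else env
    # HF_HOME, when set, takes precedence over everything else.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser()
    # Otherwise fall back to $XDG_CACHE_HOME/huggingface or ~/.cache/huggingface.
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg).expanduser() if xdg else Path.home() / ".cache"
    return base / "huggingface"


if __name__ == "__main__":
    print(hf_cache_root())
```

Setting `HF_HOME` before starting your application is usually the simplest way to move the cache to a larger disk or a shared volume.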

Prerequisites

No accounts or API keys are required — inference runs locally. The service automatically selects the best available device (Intel XPU, CUDA, Apple MPS, then CPU). Pass use_cpu=True to force CPU execution.
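The selection order described above can be sketched as a small pure function. This is an illustration of the stated priority (Intel XPU, CUDA, Apple MPS, then CPU), not the package's actual implementation; the function name `pick_device` and its keyword arguments are assumptions made for the example.

```python
def pick_device(
    *,
    xpu_available: bool,
    cuda_available: bool,
    mps_available: bool,
    use_cpu: bool = False,
) -> str:
    """Return a device name following the documented priority order.

    Hypothetical helper: availability flags are passed in so the logic can be
    read and tested without PyTorch installed.
    """
    if use_cpu:
        # use_cpu=True overrides any accelerator that might be present.
        return "cpu"
    candidates = (
        ("xpu", xpu_available),
        ("cuda", cuda_available),
        ("mps", mps_available),
    )
    for name, available in candidates:
        if available:
            return name
    # No accelerator found: fall back to CPU.
    return "cpu"


if __name__ == "__main__":
    print(pick_device(xpu_available=False, cuda_available=True, mps_available=False))
```

In a PyTorch environment the flags would typically come from `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and, on builds that include it, `torch.xpu.is_available()`.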

Configuration

Parameter	Type	Default	Description
`model`	`str`	`None`	(Deprecated) HuggingFace model identifier. Prefer `settings=SmolVlmService.Settings(model=...)`.
`use_cpu`	`bool`	`False`	Force CPU execution even when a GPU is available. Useful for reproducibility or constrained environments.
`settings`	`SmolVlmService.Settings`	`None`	Runtime-configurable generation settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using SmolVlmService.Settings(...).

Parameter	Type	Default	Description
`model`	`str`	`"HuggingFaceTB/SmolVLM-256M-Instruct"`	HuggingFace model ID or local path.
`max_new_tokens`	`int`	`500`	Maximum tokens to generate per image.
`default_prompt`	`str`	`"Describe the given image."`	Fallback prompt when the frame carries no text.
`temperature`	`float | None`	`None`	Sampling temperature. `None` uses greedy decoding.
`do_sample`	`bool`	`False`	Enable sampling without a fixed temperature.

Available parameters and defaults are defined by the integration. See the source repository for the authoritative, up-to-date list.
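To make the interaction between `temperature` and `do_sample` concrete, here is a minimal sketch of how such settings might be turned into keyword arguments for a Transformers-style `generate()` call. `SketchSettings` is a stand-in dataclass that mirrors the table above, not the package's own `Settings` class, and the rule that a non-`None` temperature switches sampling on is an assumption; check the source repository for the real behavior.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchSettings:
    """Stand-in mirroring the documented defaults (not the real Settings class)."""

    model: str = "HuggingFaceTB/SmolVLM-256M-Instruct"
    max_new_tokens: int = 500
    default_prompt: str = "Describe the given image."
    temperature: Optional[float] = None
    do_sample: bool = False


def generation_kwargs(settings: SketchSettings) -> Dict[str, Any]:
    """Build generate()-style kwargs from the settings.

    Assumption: sampling is enabled when do_sample is True or when a
    temperature is given; otherwise decoding is greedy.
    """
    sampling = settings.do_sample or settings.temperature is not None
    kwargs: Dict[str, Any] = {
        "max_new_tokens": settings.max_new_tokens,
        "do_sample": sampling,
    }
    if settings.temperature is not None:
        # Only pass temperature when it was explicitly configured.
        kwargs["temperature"] = settings.temperature
    return kwargs


if __name__ == "__main__":
    print(generation_kwargs(SketchSettings()))
    print(generation_kwargs(SketchSettings(temperature=0.3, max_new_tokens=256)))
```

With the defaults this yields greedy decoding capped at 500 new tokens, which is the deterministic choice for reproducible descriptions.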

Usage

from pipecat_smolvlm import SmolVlmService

# Default: SmolVLM-256M-Instruct, auto device detection
service = SmolVlmService()

# Custom model and generation settings
service = SmolVlmService(
    settings=SmolVlmService.Settings(
        model="HuggingFaceTB/SmolVLM-500M-Instruct",
        max_new_tokens=256,
        default_prompt="What objects can you see?",
        temperature=0.3,
        do_sample=True,
    )
)

# Force CPU (no GPU required)
service = SmolVlmService(use_cpu=True)

Drop SmolVlmService into a Pipeline wherever a vision service is expected:

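from pipecat.pipeline.pipeline import Pipeline
from pipecat_smolvlm import SmolVlmService

# transport, image_capture_processor, vision_to_speech, and tts_service
# are placeholders for objects created elsewhere in your application.
pipeline = Pipeline([
    transport.input(),
    image_capture_processor,   # converts camera frames → UserImageRawFrame
    SmolVlmService(),
    vision_to_speech,          # VisionTextFrame → TextFrame
    tts_service,
    transport.output(),
])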

To use a custom per-image prompt, set the text field on the incoming UserImageRawFrame; the service falls back to settings.default_prompt when it is empty.
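The fallback rule can be sketched in a few lines. The function name `resolve_prompt` is hypothetical, and treating a whitespace-only string as empty is an assumption rather than documented behavior.

```python
from typing import Optional


def resolve_prompt(
    frame_text: Optional[str],
    default_prompt: str = "Describe the given image.",
) -> str:
    """Pick the prompt for one image.

    Hypothetical helper: uses the frame's text when present, otherwise the
    configured default prompt.
    """
    text = (frame_text or "").strip()
    return text if text else default_prompt


if __name__ == "__main__":
    print(resolve_prompt("How many people are in the picture?"))
    print(resolve_prompt(None))
```

This keeps per-image prompts optional: upstream processors only need to set the text field when a specific question should be asked about that frame.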

Compatibility

The package targets pipecat-ai>=0.0.100. Check the source repository for the latest tested version and changelog.
