VisionImageFrameAggregator
Frame processor that combines text prompts with images for vision model processing
Overview
VisionImageFrameAggregator is a processor that pairs text prompts with images to create vision processing requests. It waits for consecutive text and image frames, combining them into a single vision frame for processing by multimodal models.
Constructor
The processor maintains internal state to track the most recent text prompt.
Input Frames
Text Prompt
TextFrame (Frame): Contains the text prompt or question about the image
Image Data
InputImageRawFrame (Frame): Contains the image to be analyzed, including:
- Raw image data
- Image dimensions
- Format information
Output Frames
VisionImageRawFrame (Frame): Combined frame containing:
- Text prompt
- Image data
- Image dimensions
- Format information
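To make the frame contents concrete, here is a minimal sketch using stand-in dataclasses rather than the library's own classes. The field names (text, image, size, format) are assumptions chosen to mirror the descriptions above and may not match the real frame definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TextFrame:
    """Stand-in for the text prompt frame (field name is an assumption)."""
    text: str


@dataclass
class InputImageRawFrame:
    """Stand-in for the image input frame (field names are assumptions)."""
    image: bytes            # raw image data
    size: Tuple[int, int]   # image dimensions as (width, height)
    format: str             # format information, e.g. "RGB"


@dataclass
class VisionImageRawFrame:
    """Stand-in for the combined output frame: prompt plus image fields."""
    text: str
    image: bytes
    size: Tuple[int, int]
    format: str


# A 2x2 RGB image is 2 * 2 * 3 = 12 bytes of pixel data.
prompt = TextFrame(text="What is in this picture?")
picture = InputImageRawFrame(image=b"\x00" * 12, size=(2, 2), format="RGB")

# The output frame carries the prompt together with every image field.
combined = VisionImageRawFrame(
    text=prompt.text,
    image=picture.image,
    size=picture.size,
    format=picture.format,
)
```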
Processing Pattern
The aggregator follows a specific sequence:
- Receives TextFrame → stores prompt
- Receives InputImageRawFrame → combines with stored prompt
- Outputs VisionImageRawFrame
- Resets stored prompt
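The sequence above can be sketched as a small state machine. This is an illustration only, not the library's implementation: the frame classes are stand-in dataclasses, VisionAggregatorSketch and its process method are hypothetical names, and dropping an image that arrives with no stored prompt is an assumption, since that case is not specified here.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class TextFrame:
    text: str


@dataclass
class InputImageRawFrame:
    image: bytes
    size: Tuple[int, int]
    format: str


@dataclass
class VisionImageRawFrame:
    text: str
    image: bytes
    size: Tuple[int, int]
    format: str


class VisionAggregatorSketch:
    """Hypothetical stand-in that mirrors the documented sequence."""

    def __init__(self) -> None:
        # Internal state: the most recent text prompt, if any.
        self._prompt: Optional[str] = None

    def process(self, frame: Any) -> List[Any]:
        """Return the frames that would be pushed downstream for this input."""
        if isinstance(frame, TextFrame):
            # Store the prompt; a newer prompt replaces an unmatched older one.
            self._prompt = frame.text
            return []
        if isinstance(frame, InputImageRawFrame):
            if self._prompt is None:
                # Assumption: an image with no stored prompt is dropped.
                return []
            combined = VisionImageRawFrame(
                text=self._prompt,
                image=frame.image,
                size=frame.size,
                format=frame.format,
            )
            self._prompt = None  # reset state after output
            return [combined]
        # Any other frame type passes through unchanged.
        return [frame]


aggregator = VisionAggregatorSketch()
after_text = aggregator.process(TextFrame("Describe this image"))
after_image = aggregator.process(InputImageRawFrame(b"\x00" * 3, (1, 1), "RGB"))
```

Returning a list from process keeps the sketch synchronous and easy to test; a real frame processor would push frames to the next pipeline stage instead.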
Usage Examples
Basic Usage
Pipeline Integration
Frame Flow
Example Sequence
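One possible sequence, traced with a stand-in generator rather than the real processor. The frame types here are hypothetical named tuples, OtherFrame is an invented placeholder for any unrelated frame, and the aggregate function is an assumption that only imitates the documented behavior: two prompts arrive, the second replaces the first, an unrelated frame passes through, and the image is paired with the latest prompt.

```python
from typing import Any, Iterable, Iterator, NamedTuple, Optional, Tuple


class TextFrame(NamedTuple):
    text: str


class InputImageRawFrame(NamedTuple):
    image: bytes
    size: Tuple[int, int]
    format: str


class VisionImageRawFrame(NamedTuple):
    text: str
    image: bytes
    size: Tuple[int, int]
    format: str


class OtherFrame(NamedTuple):
    """Hypothetical placeholder for any frame the aggregator does not handle."""
    name: str


def aggregate(frames: Iterable[Any]) -> Iterator[Any]:
    """Yield the frames that would leave the aggregator, in order."""
    prompt: Optional[str] = None
    for frame in frames:
        if isinstance(frame, TextFrame):
            prompt = frame.text  # the newest prompt wins
        elif isinstance(frame, InputImageRawFrame):
            if prompt is not None:
                yield VisionImageRawFrame(prompt, frame.image, frame.size, frame.format)
                prompt = None  # reset after output
            # Assumption: an image with no stored prompt produces no output.
        else:
            yield frame  # unrelated frames pass through unchanged


sequence = [
    TextFrame("What color is this?"),   # stored, then replaced
    TextFrame("What shape is this?"),   # replaces the earlier prompt
    OtherFrame("heartbeat"),            # passes through
    InputImageRawFrame(b"\xff\x00\x00", (1, 1), "RGB"),
]
outputs = list(aggregate(sequence))
```

In this trace, outputs holds the pass-through frame followed by a single VisionImageRawFrame carrying the second prompt; the first prompt never reaches the output.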
Notes
- Text prompts must precede their corresponding images
- Only the most recent text prompt is stored
- Unmatched text prompts are replaced by newer ones
- Non-matching frames are passed through unchanged
- State is automatically reset after output
- Thread-safe for pipeline processing