UserResponseAggregator
Frame processor for aggregating user speech transcriptions into complete responses
Overview
UserResponseAggregator is a synchronous frame processor that combines multiple transcription frames between speech start and end events into a single complete response. It handles both interim and final transcriptions to build complete user utterances.
Class Definition
Input Frames
The processor handles several frame types:
- UserStartedSpeakingFrame (SystemFrame): Signals the start of user speech and begins aggregation
- TranscriptionFrame (TextFrame): Contains final transcription segments to be aggregated
- InterimTranscriptionFrame (TextFrame): Contains preliminary transcription segments (tracked but not aggregated)
- UserStoppedSpeakingFrame (SystemFrame): Signals the end of user speech and triggers final aggregation
Output Frames
- TextFrame (Frame): Contains the complete aggregated user response
Behavior
Aggregation Rules
The processor follows these patterns:
Processing Flow
- UserStartedSpeakingFrame initiates aggregation
- InterimTranscriptionFrames are tracked but not aggregated
- TranscriptionFrames are aggregated with space separation
- UserStoppedSpeakingFrame triggers output if aggregation is complete
- Final TextFrame contains the complete utterance
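The flow above can be modeled in a few lines of plain Python. This is a minimal sketch, not the library's implementation: the frame classes are stand-ins that only mirror the names used on this page, `SketchAggregator` and its `process` method are hypothetical, and "complete" is assumed to mean "at least one final segment and no interim result still pending". Forwarding of the start and stop frames, and the delayed-transcription case mentioned in the notes, are deliberately left out.

```python
from dataclasses import dataclass


# Stand-in frame types for illustration only; they are not the real classes.
@dataclass
class Frame:
    pass


class SystemFrame(Frame):
    pass


@dataclass
class TextFrame(Frame):
    text: str = ""


class TranscriptionFrame(TextFrame):
    pass


class InterimTranscriptionFrame(TextFrame):
    pass


class UserStartedSpeakingFrame(SystemFrame):
    pass


class UserStoppedSpeakingFrame(SystemFrame):
    pass


class SketchAggregator:
    """Hypothetical, simplified model of the processing flow."""

    def __init__(self):
        self.aggregating = False
        self.segments = []
        self.last_interim = None

    def process(self, frame):
        """Return the list of frames this step would emit downstream."""
        if isinstance(frame, UserStartedSpeakingFrame):
            # Start of speech: begin a fresh aggregation.
            self.aggregating = True
            self.segments = []
            self.last_interim = None
            return []
        if isinstance(frame, InterimTranscriptionFrame):
            # Interim text is remembered but never added to the result.
            self.last_interim = frame.text
            return []
        if isinstance(frame, TranscriptionFrame):
            if self.aggregating:
                self.segments.append(frame.text.strip())
            self.last_interim = None  # the final text supersedes the interim
            return []
        if isinstance(frame, UserStoppedSpeakingFrame):
            self.aggregating = False
            # Assumed meaning of "complete": final text present, no interim pending.
            if self.segments and self.last_interim is None:
                result = TextFrame(" ".join(self.segments))
                self.segments = []
                return [result]
            return []
        return [frame]  # anything else passes through untouched


agg = SketchAggregator()
frames = [
    UserStartedSpeakingFrame(),
    InterimTranscriptionFrame("hel"),
    TranscriptionFrame("Hello there,"),
    InterimTranscriptionFrame("how are"),
    TranscriptionFrame("how are you?"),
    UserStoppedSpeakingFrame(),
]
emitted = [out for frame in frames for out in agg.process(frame)]
print(emitted[0].text)  # Hello there, how are you?
```

Returning a list from `process` keeps the sketch synchronous and easy to test; a real pipeline processor would push frames downstream instead.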
Usage Example
Frame Flow
Notes
- Maintains proper text spacing between segments
- Handles delayed transcriptions after speech end
- Tracks interim results for state management
- Passes through unhandled frame types
- Thread-safe for pipeline processing
- Cleans up state on reset or cancellation
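Several of these notes describe state handling, which the following sketch tries to make concrete. It is an illustration under stated assumptions, not the library's code: `LateTranscriptSketch`, its attributes, and `OtherFrame` are hypothetical, and the frame classes are bare stand-ins named after the ones on this page. The sketch shows one plausible way to wait for a final transcription that arrives after the stop event, pass unhandled frames through, and clear state once a response is emitted. Thread safety is not modeled.

```python
from dataclasses import dataclass


# Bare stand-ins for illustration; they are not the real frame classes.
@dataclass
class TextFrame:
    text: str = ""


class TranscriptionFrame(TextFrame): ...
class InterimTranscriptionFrame(TextFrame): ...
class UserStartedSpeakingFrame: ...
class UserStoppedSpeakingFrame: ...
class OtherFrame: ...  # hypothetical frame type the aggregator does not handle


class LateTranscriptSketch:
    """Hypothetical model of delayed-transcription handling and reset."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear all per-utterance state.
        self.aggregation = ""
        self.aggregating = False
        self.seen_end = False
        self.seen_interim = False

    def process(self, frame):
        out, send = [], False
        if isinstance(frame, UserStartedSpeakingFrame):
            self.reset()
            self.aggregating = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self.seen_end = True
            # Keep waiting if an interim result is pending or no final text exists yet.
            self.aggregating = self.seen_interim or not self.aggregation
            send = not self.aggregating
        elif isinstance(frame, InterimTranscriptionFrame):
            self.seen_interim = True  # tracked for state only
        elif isinstance(frame, TranscriptionFrame):
            if self.aggregating:
                self.aggregation = f"{self.aggregation} {frame.text}".strip()
                send = self.seen_end  # late final text completes the utterance
            self.seen_interim = False
        else:
            out.append(frame)  # unhandled frame types pass through
        if send:
            out.append(TextFrame(self.aggregation))
            self.reset()
        return out


sketch = LateTranscriptSketch()
sketch.process(UserStartedSpeakingFrame())
sketch.process(InterimTranscriptionFrame("what ti"))
sketch.process(UserStoppedSpeakingFrame())  # nothing emitted yet
late = sketch.process(TranscriptionFrame("what time is it"))
print(late[0].text)  # what time is it
```

In this sketch a final transcription that arrives outside any utterance is dropped; whether the real processor does the same is not stated on this page.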