Documentation Index
Fetch the complete documentation index at: https://docs.pipecat.ai/llms.txt
Use this file to discover all available pages before exploring further.
What is a UIWorker?
When you put a voice agent in front of an app, talking isn’t enough — the agent needs to see what the user sees and act on the screen: read the page, point at things, fill in fields, click buttons. AUIWorker is the server-side agent that makes this possible. It voice-enables a client UI by connecting an LLM to whatever the user is looking at.
The connection is two-way, over the RTVI UI channel:
- Client → server. The client streams the screen to the worker as accessibility snapshots, and forwards the user’s UI interactions as events.
- Server → client. The worker drives the page back — scrolling, highlighting, selecting text, filling inputs, clicking, or running app-defined commands — and can surface long-running work as progress cards.
UIWorker is the screen half of a voice/UI split: a voice agent owns the conversation, and the UIWorker owns the screen. Each is a separate LLM with its own focused context. The worker auto-injects the latest screen state into its context before every turn, so the conversational voice LLM never has to carry a giant accessibility tree:
- The voice agent converses and decides what’s worth saying.
- The UIWorker reasons over the current page and acts on it.
The two directions map to RTVI UI messages: the client sends
ui-snapshot and
ui-event; the worker sends ui-command and ui-job-group. You rarely touch
these directly — PipelineWorker wires the channel up automatically when RTVI
is enabled (the default). See The RTVI
Standard for the wire protocol and the
UIWorker API reference for the
class.The two-way interface
AUIWorker gives an LLM a handful of capabilities, split across the two directions of the interface. Everything below works out of the box on any UIWorker subclass.
What the worker sees (client → server)
The screen, as a snapshot. The client sends an accessibility snapshot of the page whenever it changes. The worker renders the latest one as a<ui_state> block and — with auto_inject_ui_state on (the default) — injects it into the LLM context before every turn, so the model always reasons over what’s currently on screen. Each element carries a stable ref the worker uses to act on it:
<selection> block so the LLM can resolve deictic references like “this paragraph” or “what I selected”.
User interactions, as events. The client dispatches app-defined events (a button click, a custom gesture) with sendUIEvent(name, payload). Route them to handlers with @ui_event(name); each runs in its own task:
What the worker does (server → client)
Drives the page. The worker acts on the screen by sending UI commands. The built-in helpers cover the common actions, andsend_command(name, payload) sends any app-defined command:
| Helper | Effect |
|---|---|
scroll_to(ref) | Bring an element into view |
highlight(ref) | Briefly flash an element |
select_text(ref) | Select an element’s text (pointing / deixis) |
click(ref) | Click a checkbox, radio, or button |
set_input_value(ref, value) | Fill a text input or textarea |
send_command(name, payload) | Any app-defined command (e.g. "add_note") |
@pipecat-ai/client-react; apps can override them or define their own command names.
Answers back. A UIWorker answers via a built-in single-flight respond job: a requester dispatches job("ui", name="respond", payload={"query": ...}), the worker runs one screen-grounded LLM turn, and a @tool ends it by calling respond_to_job(). That call chooses how the answer reaches the user:
respond_to_job(text, tts_speak=True)— speaktextverbatim through the requester’s TTS.respond_to_job(text)— return{"answer": text}for the requester’s voice LLM to phrase.respond_to_job()— complete the turn silently (the worker acted, but said nothing).
ui_job_group / start_ui_job_group fan it out to peer workers and surface it to the client as a cancellable progress card, streaming each worker’s updates as they arrive:
By default a
UIWorker is stateless: it clears its context at the start of each
respond job, so every turn sees only the current <ui_state> and query. Set
keep_history=True to accumulate history across turns — useful for multi-turn
references like “can we add a note for that?” — at the cost of more tokens.Hello world
The smallestUIWorker ties the interface together: a delegate that answers questions about the page. The voice agent forwards screen-relevant utterances to it; the worker reads the screen (client → server) and speaks the reply (server → client).
The worker needs only an LLM and one @tool that ends the turn with respond_to_job():
respond job to the worker and speaks back whatever it returns:
UIWorker comes online to receive snapshots and jobs as soon as its pipeline starts:
Snapshot
The client streams the current screen as a
ui-snapshot. PipelineWorker
broadcasts it on the bus; the UIWorker stores the latest one.Route
The user speaks. The voice LLM calls
answer_about_screen, which dispatches
a respond job to the UIWorker.Ground
The worker’s
respond job runs one LLM turn with the latest <ui_state>
auto-injected, so its answer is grounded in what’s on screen.Patterns
How the worker gets triggered — and who speaks — falls into two patterns.Delegation
In the hello-world example, the voice agent is the gatekeeper: it decides which turns involve the screen and routes those to theUIWorker, then voices the result. This is the delegation pattern, and it’s the common one. The voice LLM stays small and screen-unaware; the worker owns all screen reasoning.
Most apps don’t need a custom tool per action. ReplyToolMixin provides a single bundled reply tool — a required spoken answer plus optional scroll_to, highlight, select_text, fills, and click — covering pointing, reading, and form apps:
select_text to point at “this paragraph”, fills + click to complete a form — and unused fields stay null.
Delegation also scales to background work. A @tool can fan out to peer workers with start_ui_job_group, which surfaces a cancellable progress card on the client and returns immediately so the voice agent isn’t blocked:
Parallel handling
Sometimes the screen should react to every user turn, not only the ones the voice agent chooses to delegate. In the parallel handling pattern, both agents receive each turn and act in parallel: the voice agent converses while theUIWorker updates the screen, independently.
The key difference: there’s no tool call routing work to the worker. Instead, the voice pipeline’s user aggregator fires on_user_turn_stopped once per turn, and that handler dispatches the transcript to the worker as a respond job. Because it runs in its own task, the voice LLM (running from the same turn) and the worker act concurrently:
@tool completes the job with respond_to_job() and no answer, so nothing it does reaches TTS. The separate voice layer owns speech:
Choosing a pattern
| Delegation | Parallel handling | |
|---|---|---|
| How the worker is triggered | The voice LLM calls a tool | The voice pipeline’s on_user_turn_stopped event, every turn |
| Who speaks | Often the worker (tts_speak=True), or the voice LLM phrases the worker’s result | The voice agent; the worker acts silently |
| Voice LLM’s role | Gatekeeper — decides what’s screen-relevant | Converses; reads shared state but never mutates the UI |
| Best when | The page matters only for some turns | Every turn should drive the UI and speech is incidental |
What’s next
UIWorker API Reference
Full reference for the
UIWorker class, UI commands, job groups, and
ReplyToolMixin.