Components
VisionAgent
Python SDK that orchestrates tasks. It sends screenshots to the LLM, receives actions, and executes them via Agent OS.
Agent OS
Device driver running locally. Provides screen capture, mouse/keyboard control, and multi-display support.
LLM
Understands the UI from screenshots, plans actions, and proceeds step by step. Configurable: use AskUI’s default or bring your own.
Execution Flow
- Screenshot → Agent OS captures screen
- Understanding → LLM identifies UI elements
- Planning → LLM determines next action
- Execution → Agent OS performs click/type
- Loop → Repeat until task complete
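
The flow above can be sketched as a small control loop. This is an illustrative sketch only, not the AskUI SDK: `FakeAgentOS`, `FakeLLM`, `Action`, and `run_task` are hypothetical stand-ins, the "screenshot" is a plain dict rather than an image, and the "LLM" is a few hard-coded rules. It shows only how capture, planning, execution, and the loop fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the UI element the action applies to
    text: str = ""     # text to enter, used only by "type"


@dataclass
class FakeAgentOS:
    """Hypothetical stand-in for the local device driver."""
    screen: dict = field(
        default_factory=lambda: {"username": "", "submitted": False}
    )
    log: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real driver would return pixels; a copy of a dict stands in here.
        return dict(self.screen)

    def execute(self, action: Action) -> None:
        # Record the action, then apply its effect to the simulated screen.
        self.log.append((action.kind, action.target))
        if action.kind == "type":
            self.screen[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.screen["submitted"] = True


class FakeLLM:
    """Hypothetical stand-in for the model: fixed rules, no real inference."""

    def next_action(self, screenshot: dict, goal: str) -> Action:
        # Understanding + planning: inspect the screen, pick one next step.
        if not screenshot["username"]:
            return Action("type", "username", text="alice")
        if not screenshot["submitted"]:
            return Action("click", "submit")
        return Action("done")


def run_task(goal: str, agent_os: FakeAgentOS, llm: FakeLLM,
             max_steps: int = 10) -> int:
    """Run the screenshot -> plan -> execute loop; return iterations used."""
    for step in range(1, max_steps + 1):
        shot = agent_os.screenshot()            # Screenshot
        action = llm.next_action(shot, goal)    # Understanding and planning
        if action.kind == "done":               # Task complete: stop looping
            return step
        agent_os.execute(action)                # Execution
    raise RuntimeError("task not completed within max_steps")
```

Calling `run_task("log in", FakeAgentOS(), FakeLLM())` walks through one "type" and one "click" before the stand-in model reports completion. The `max_steps` cap is a design choice of this sketch: it keeps the loop from running indefinitely if the model never signals that the task is done.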