cs.CV

Building GUI agents by learning from tutorial videos at scale

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

May 14, 2026

Training GUI agents currently relies on expensive hand-labeled data from narrow domains, which limits their real-world applicability. Video2GUI automatically extracts structured interaction trajectories from tutorial videos using a coarse-to-fine filtering approach. Processing 500 million video metadata entries yields WildGUI, a dataset of 12 million trajectories spanning more than 1,500 applications and websites. Pre-training multimodal models (Qwen2.5-VL, MiMo-VL) on this dataset produces 5-20% improvements on GUI grounding and action benchmarks, matching existing state-of-the-art results. The authors will release both the dataset and the pipeline.
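The summary does not describe the filtering stages in detail, so the following is only a minimal sketch of what a coarse-to-fine filter over video metadata could look like: a cheap metadata check runs on every entry, and a costlier check runs only on the survivors. The `VideoMeta` fields, the keyword list, the duration bounds, and the `fine_filter` callback are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VideoMeta:
    # Hypothetical metadata record; the real schema is not given in the source.
    video_id: str
    title: str
    duration_s: int


# Assumed keywords and duration bounds, chosen purely for illustration.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough")
MIN_DURATION_S = 30
MAX_DURATION_S = 3600


def coarse_filter(meta: VideoMeta) -> bool:
    """Cheap stage: look only at metadata, never at video content."""
    title = meta.title.lower()
    has_keyword = any(keyword in title for keyword in GUI_KEYWORDS)
    return has_keyword and MIN_DURATION_S <= meta.duration_s <= MAX_DURATION_S


def coarse_to_fine(
    entries: Iterable[VideoMeta],
    fine_filter: Callable[[VideoMeta], bool],
) -> List[VideoMeta]:
    """Run the expensive fine_filter only on entries that pass coarse_filter."""
    kept = []
    for meta in entries:
        if not coarse_filter(meta):
            continue  # rejected cheaply; the expensive stage is skipped
        if fine_filter(meta):
            kept.append(meta)
    return kept


if __name__ == "__main__":
    sample = [
        VideoMeta("a", "How to crop an image in GIMP", 300),
        VideoMeta("b", "Cat compilation", 300),
        VideoMeta("c", "Excel tutorial", 10),
    ]
    # Stand-in for a costly content check (for example, a model inspecting frames).
    survivors = coarse_to_fine(sample, fine_filter=lambda meta: True)
    print([meta.video_id for meta in survivors])
```

The point of the ordering is cost: at the scale of hundreds of millions of entries, any per-video analysis is affordable only if most entries are discarded by the metadata-only stage first.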
Published as "Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining", arXiv:2605.14747.