Building GUI agents by learning from tutorial videos at scale

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

Training GUI agents currently relies on expensive hand-labeled data from narrow domains, which limits the agents' real-world applicability. Video2GUI addresses this by automatically extracting structured interaction trajectories from tutorial videos using a coarse-to-fine filtering approach. Applied to 500 million video metadata entries, it yields WildGUI, a dataset of 12 million trajectories spanning more than 1,500 applications and websites. Pre-training multimodal models (Qwen2.5-VL, MiMo-VL) on WildGUI improves performance on GUI grounding and action benchmarks by 5-20%, matching existing state-of-the-art results. The authors will release both the dataset and the pipeline.
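
The summary does not spell out the individual filtering stages, so the following is only a minimal sketch of what a generic coarse-to-fine filter over video metadata could look like, not the authors' pipeline. The idea it illustrates is that a cheap rule-based check runs on every entry, and a more expensive scorer runs only on the survivors. All names here (VideoMeta, TUTORIAL_HINTS, coarse_filter, coarse_to_fine, the duration bounds, and the threshold) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class VideoMeta:
    """Hypothetical metadata record for one video."""
    video_id: str
    title: str
    duration_s: int


# Hypothetical keyword hints; a real pipeline would likely use richer signals.
TUTORIAL_HINTS = ("tutorial", "how to", "walkthrough")


def coarse_filter(meta: VideoMeta) -> bool:
    """Cheap stage: keyword match on the title plus a duration sanity check."""
    title = meta.title.lower()
    has_hint = any(hint in title for hint in TUTORIAL_HINTS)
    return has_hint and 30 <= meta.duration_s <= 3600


def coarse_to_fine(
    entries: Iterable[VideoMeta],
    fine_score: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> Iterator[VideoMeta]:
    """Yield entries that pass the cheap filter and then the costly scorer.

    fine_score stands in for an expensive model call (an assumption);
    it is only invoked on entries that survive the coarse stage.
    """
    for meta in entries:
        if not coarse_filter(meta):
            continue  # rejected cheaply, the expensive scorer is never called
        if fine_score(meta) >= threshold:
            yield meta
```

Because coarse_to_fine is a generator, it can stream over a very large metadata collection without holding it in memory, and the ordering of the two stages keeps the number of expensive scorer calls proportional to the coarse survivors rather than to the full input.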