cs.CV

Teaching GUI agents to remember what matters

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

May 18, 2026

GUI agents struggle with multi-step tasks because they either store full screenshot histories, which overwhelms the model, or rely on text-only summaries, which lose visual detail. MementoGUI adds a learned memory module called MementoCore that selectively preserves task-relevant interface events as textual summaries paired with cropped visual regions, along with an episodic memory for retrieving past trajectories. The system works as a plug-in adapter, so the base GUI agent does not need to be retrained. The authors curated a training dataset from computer-use trajectories, created MementoGUI-Bench for evaluation, and tested on GUI-Odyssey and MM-Mind2Web, reporting consistent improvements over history-replay and text-only baselines.
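To make the two kinds of memory described above concrete, here is a minimal Python sketch. It is not the paper's implementation: every name in it (MemoryEntry, WorkingMemory, Episode, retrieve, token_overlap) is hypothetical, the relevance score is assumed to come from some external scorer standing in for the learned MementoCore module, and word-overlap similarity is a simple stand-in for whatever retrieval method the authors actually use.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


@dataclass
class MemoryEntry:
    step: int
    summary: str                # short textual description of the interface event
    crop: Optional[Box] = None  # screenshot region worth keeping, if any


@dataclass
class WorkingMemory:
    capacity: int = 8
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, entry: MemoryEntry, relevance: float, threshold: float = 0.5) -> bool:
        """Keep the entry only if its (externally supplied) relevance clears the threshold."""
        if relevance < threshold:
            return False
        self.entries.append(entry)
        # Drop the oldest entries once the memory is over capacity.
        if len(self.entries) > self.capacity:
            self.entries = self.entries[-self.capacity:]
        return True


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets; a stand-in for a learned retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


@dataclass
class Episode:
    task: str
    steps: List[MemoryEntry]


def retrieve(episodes: List[Episode], task: str, k: int = 1) -> List[Episode]:
    """Return the k stored episodes whose task text best matches the query."""
    ranked = sorted(episodes, key=lambda e: token_overlap(e.task, task), reverse=True)
    return ranked[:k]
```

In this sketch the agent's prompt would be built from the short list in WorkingMemory.entries rather than from the full screenshot history, and retrieve would supply a similar past trajectory as additional context. Keeping the memory outside the agent mirrors the plug-in framing of the summary, since nothing here touches the base model's weights.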
Published as "MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents" (arXiv:2605.18652)