Teaching GUI agents to remember what matters
Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo
May 18, 2026
GUI agents struggle with multi-step tasks because they either store full screenshot histories, which overwhelms the model, or rely on text-only summaries, which lose visual details. MementoGUI adds a learned memory module called MementoCore that selectively preserves task-relevant interface events as textual summaries paired with cropped visual regions, along with an episodic memory for retrieving past trajectories. The system works as a plug-in adapter, so the base GUI agent does not need to be retrained. The authors curated a training dataset from computer-use trajectories, created MementoGUI-Bench for evaluation, and tested on GUI-Odyssey and MM-Mind2Web, reporting consistent improvements over history-replay and text-only baselines.
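To make the two memory types concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a token-overlap score as a stand-in for the learned selection in MementoCore, and every name in it (`WorkingMemory`, `EpisodicMemory`, `overlap`, `capacity`, `threshold`) is hypothetical. Crops are represented only by their bounding boxes, so no image library is needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase tokens.

    Assumption: a crude stand-in for a learned relevance scorer.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryEntry:
    step: int
    summary: str                # textual summary of the interface event
    crop: Optional[Box] = None  # region of the screenshot worth keeping


@dataclass
class WorkingMemory:
    """Keeps only task-relevant events, bounded by `capacity`."""
    task: str
    capacity: int = 3
    threshold: float = 0.1
    entries: List[MemoryEntry] = field(default_factory=list)

    def observe(self, step: int, summary: str, crop: Optional[Box] = None) -> bool:
        """Store the event if it looks relevant; return whether it was accepted."""
        if overlap(self.task, summary) < self.threshold:
            return False
        self.entries.append(MemoryEntry(step, summary, crop))
        if len(self.entries) > self.capacity:
            # Evict the least relevant entry rather than the oldest one.
            worst = min(self.entries, key=lambda e: overlap(self.task, e.summary))
            self.entries.remove(worst)
        return True

    def as_context(self) -> str:
        """Render the retained events as compact text for the agent prompt."""
        lines = []
        for e in self.entries:
            suffix = f" (crop={e.crop})" if e.crop else ""
            lines.append(f"[step {e.step}] {e.summary}{suffix}")
        return "\n".join(lines)


@dataclass
class EpisodicMemory:
    """Stores past (task, action list) trajectories for retrieval."""
    episodes: List[Tuple[str, List[str]]] = field(default_factory=list)

    def add(self, task: str, actions: List[str]) -> None:
        self.episodes.append((task, actions))

    def retrieve(self, task: str, k: int = 1) -> List[Tuple[str, List[str]]]:
        """Return the k stored trajectories whose task text is most similar."""
        ranked = sorted(self.episodes, key=lambda ep: overlap(task, ep[0]), reverse=True)
        return ranked[:k]
```

In this sketch the base agent is untouched: it would simply receive `as_context()` and any retrieved trajectory as extra input, which mirrors the plug-in framing of the summary. A real system would replace `overlap` with a trained model and store actual image crops.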