Teaching GUI agents to remember what matters

GUI agents struggle with multi-step tasks because they either store full screenshot histories, which overwhelms the model, or rely on text-only summaries, which lose visual details. MementoGUI adds a learned memory module called MementoCore that selectively preserves task-relevant interface events, storing each as a textual summary paired with a cropped visual region. It also adds an episodic memory for retrieving past trajectories. The system works as a plug-in adapter, so the base GUI agent does not need to be retrained. The authors curated a training dataset from computer-use trajectories, created MementoGUI-Bench for evaluation, and tested on GUI-Odyssey and MM-Mind2Web, where they report consistent improvements over history-replay and text-only baselines.
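
To make the two kinds of memory a little more concrete, here is a minimal Python sketch of the general idea. It is not the authors' implementation. Every name in it (MemoryEntry, WorkingMemory, EpisodicMemory, the relevance score, the capacity and threshold values) is hypothetical. In MementoGUI the decision about what to keep is learned, whereas this sketch takes a relevance number as input and uses simple word overlap for retrieval.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical crop format: (left, top, right, bottom) in screen pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class MemoryEntry:
    step: int                        # index of the agent step that produced this event
    summary: str                     # short textual description of the event
    crop_box: Optional[BBox] = None  # screen region worth keeping as an image crop
    relevance: float = 0.0           # assumed to come from a learned scorer


@dataclass
class WorkingMemory:
    """Keeps only high-relevance events instead of the full screenshot history."""
    capacity: int = 4
    threshold: float = 0.5
    entries: List[MemoryEntry] = field(default_factory=list)

    def observe(self, entry: MemoryEntry) -> bool:
        """Store the entry if it is relevant enough; return True if it was accepted."""
        if entry.relevance < self.threshold:
            return False
        self.entries.append(entry)
        if len(self.entries) > self.capacity:
            # Over capacity: drop the least relevant stored entry.
            weakest = min(self.entries, key=lambda e: e.relevance)
            self.entries.remove(weakest)
        return True

    def as_prompt(self) -> str:
        """Render the kept events in step order for the agent's context."""
        lines = []
        for e in sorted(self.entries, key=lambda e: e.step):
            crop = f" [crop {e.crop_box}]" if e.crop_box else ""
            lines.append(f"step {e.step}: {e.summary}{crop}")
        return "\n".join(lines)


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a stand-in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class EpisodicMemory:
    """Stores past (task, steps) trajectories and retrieves similar ones."""
    episodes: List[Tuple[str, List[str]]] = field(default_factory=list)

    def add(self, task: str, steps: List[str]) -> None:
        self.episodes.append((task, steps))

    def retrieve(self, task: str, k: int = 1) -> List[Tuple[str, List[str]]]:
        """Return the k stored episodes whose task text is most similar."""
        ranked = sorted(self.episodes, key=lambda ep: jaccard(task, ep[0]), reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    memory = WorkingMemory(capacity=2)
    memory.observe(MemoryEntry(1, "opened checkout page", (0, 0, 400, 120), 0.9))
    memory.observe(MemoryEntry(2, "scrolled down", None, 0.2))  # below threshold
    print(memory.as_prompt())
```

In this sketch the base agent is left untouched: the text from as_prompt(), together with the image crops it refers to, would be added to the agent's input, which is the sense in which such a module can act as a plug-in. The fixed threshold and the word-overlap retriever are placeholders for the learned components.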