How well do language models remember during real tasks?

Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

Language models need working memory to handle long tasks such as coding or web navigation, but existing benchmarks test retention only in chat settings. MemGym isolates memory performance from reasoning and tool use across five real-world environments spanning coding, web browsing, and research, so that researchers can rank memory strategies fairly. The team also built fast evaluation pipelines, including a lightweight reward model for coding tasks, which make it tractable to benchmark memory at scale.
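
To make the idea of isolating memory concrete, here is a minimal Python sketch of one way such a comparison could be set up: the agent logic is held fixed and only the memory strategy is swapped, so any score difference is attributable to memory. This is an illustration, not MemGym's actual API. The names `keep_all`, `keep_last_two`, `run_episode`, `rank_strategies`, and the toy `ENVIRONMENTS` data are all hypothetical.

```python
from statistics import mean

# Hypothetical memory strategies: each maps the running history to what is retained.
def keep_all(history):
    return list(history)

def keep_last_two(history):
    return history[-2:]

def run_episode(memory, episode):
    """Fixed toy agent: succeeds only if the needed fact is still retained at the end."""
    history = []
    for observation in episode["observations"]:
        history.append(observation)
        history = memory(history)  # the memory strategy decides what survives
    return 1.0 if episode["needed"] in history else 0.0

# Toy stand-in data (assumed for illustration, not real benchmark tasks).
ENVIRONMENTS = {
    "coding": [
        {"observations": ["api_key=X", "log1", "log2", "log3"], "needed": "api_key=X"},
        {"observations": ["a", "b", "target"], "needed": "target"},
    ],
    "web": [
        {"observations": ["price=9", "ad", "ad2"], "needed": "price=9"},
    ],
}

def rank_strategies(strategies, environments):
    """Score each strategy per environment, macro-average, and sort best first."""
    results = []
    for name, memory in strategies.items():
        per_env = [mean(run_episode(memory, ep) for ep in episodes)
                   for episodes in environments.values()]
        results.append((name, mean(per_env)))
    return sorted(results, key=lambda item: item[1], reverse=True)

ranking = rank_strategies(
    {"keep_all": keep_all, "keep_last_two": keep_last_two}, ENVIRONMENTS
)
print(ranking)
```

Because the agent and the tasks are identical across runs, the ranking reflects only what each strategy retained. Averaging per environment before averaging overall keeps an environment with many episodes from dominating the score.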