cs.CL

How well do language models remember during real tasks?

Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

May 20, 2026

Language models need working memory to handle long tasks such as coding or web navigation, but existing benchmarks test retention only in chat settings. MemGym isolates memory performance from reasoning and tool use across five real-world environments spanning coding, web browsing, and research, letting researchers rank memory strategies fairly. The team built fast evaluation pipelines, including a lightweight reward model for coding tasks, which makes benchmarking memory at scale tractable.
Published as "MemGym: a Long-Horizon Memory Environment for LLM Agents" (arXiv:2605.20833)