Do language agents actually learn from past tasks?

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

Language agents solve tasks one at a time but rarely learn from them. This work introduces AgentCL, a benchmark with controlled task streams designed so that earlier solutions genuinely transfer to later ones, plus MemProbe, a memory system that stores and filters agent insights. Experiments on coding and research tasks show that existing memory designs barely improve performance on naive streams, but controlled streams expose their weaknesses. The results highlight that agents need fundamentally better ways to balance learning new things against retaining old ones.