cs.CL

Can transformers learn to reuse their own memories?

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

May 26, 2026

The Latent Recurrent Transformer (LRT) reuses hidden states from previous tokens as memory for the next token, without requiring extra model components or breaking the standard attention interface. A novel training approach, interleaved parallel training, enables efficient pretraining at scale by processing position subsets in parallel while sharing recurrent supervision. Across multiple model sizes, LRT achieves lower language-modeling loss and stronger in-context learning performance at matched compute budgets.
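To make the idea of reusing a previous token's hidden state concrete, here is a minimal toy sketch in NumPy. It is one possible reading of the summary above, not the paper's architecture: the single attention layer, the additive injection through a hypothetical matrix `Wr`, and the function name `latent_recurrent_forward` are all assumptions made for illustration. The attention call itself is left unchanged, which is the sense in which the sketch keeps a standard attention interface.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, keys, values):
    # Single-query scaled dot-product attention over positions seen so far.
    scores = keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ values

def latent_recurrent_forward(emb, Wq, Wk, Wv, Wr):
    """Toy one-layer forward pass with latent recurrence (hypothetical).

    At position t the layer input is the token embedding plus a projection
    (Wr, an assumed parameter) of the hidden state produced at position t-1.
    """
    T, d = emb.shape
    hidden = np.zeros((T, d))
    keys, values = [], []
    prev = np.zeros(d)  # no memory is available before the first token
    for t in range(T):
        x = emb[t] + prev @ Wr  # inject the previous token's hidden state
        keys.append(x @ Wk)
        values.append(x @ Wv)
        # Residual connection around ordinary causal attention.
        h = x + causal_attention(x @ Wq, np.stack(keys), np.stack(values))
        hidden[t] = h
        prev = h
    return hidden

rng = np.random.default_rng(0)
T, d = 5, 8
emb = rng.normal(size=(T, d))
Wq, Wk, Wv, Wr = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
hidden = latent_recurrent_forward(emb, Wq, Wk, Wv, Wr)
print(hidden.shape)  # (5, 8)
```

Setting `Wr` to zeros reduces this sketch to an ordinary causal attention layer. The explicit loop over positions also shows why a naive version is sequential at training time; according to the summary, interleaved parallel training is what addresses that cost, and this sketch does not attempt to reproduce it.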
Published as "Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior" (arXiv:2605.26797).