Can transformers learn to reuse their own memories?

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

The Latent Recurrent Transformer (LRT) reuses hidden states from previous tokens as memory for the next token, without requiring extra model components or breaking the standard attention interface. A novel training approach, interleaved parallel training, enables efficient pretraining at scale by processing position subsets in parallel while sharing recurrent supervision. Across multiple model sizes, LRT achieves lower language-modeling loss and stronger in-context learning performance at matched compute budgets.
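
To make the idea of reusing hidden states as memory concrete, the toy sketch below processes tokens left to right and lets each token attend over the hidden states already computed for earlier tokens. It is only an illustration of the general idea under strong assumptions, not the architecture described here: the single attention step, the tanh update, and the names `attend` and `latent_recurrent_decode` are all hypothetical, and the sequential loop does not attempt to model interleaved parallel training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over memory vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, mem)) / scale for mem in memory]
    weights = softmax(scores)
    return [
        sum(w * mem[i] for w, mem in zip(weights, memory))
        for i in range(len(query))
    ]


def latent_recurrent_decode(embeddings):
    """Toy recurrence (assumed, not from the source): each token attends over
    the hidden states of earlier tokens plus its own embedding."""
    hidden_states = []
    for x in embeddings:
        # Memory = previously computed hidden states, plus the current input.
        memory = hidden_states + [x]
        context = attend(x, memory)
        # Hypothetical update rule: residual sum followed by tanh.
        h = [math.tanh(xi + ci) for xi, ci in zip(x, context)]
        hidden_states.append(h)
    return hidden_states


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for step, h in enumerate(latent_recurrent_decode(tokens)):
        print(step, [round(v, 4) for v in h])
```

Because each hidden state depends only on earlier positions, the sketch stays causal: changing a later token leaves the earlier hidden states unchanged.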