Training language models 100× more efficiently with brain-inspired architecture

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

Standard language model pretraining demands massive compute and internet-scale data, which locks foundational research behind expensive infrastructure. HRM-Text replaces the Transformer's flat attention with a hierarchical recurrent architecture that separates fast execution from slow strategic planning, loosely mirroring how brains process information across timescales. Trained only on instruction-response pairs with a custom objective, a 1B-parameter model achieves competitive performance on MMLU, ARC-C, and GSM8K while using 100–900× fewer tokens and 96–432× less compute than standard 2–7B-parameter baselines. The results suggest that careful co-design of architecture and training objective can substantially narrow the efficiency gap.
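
To make the idea of separating a fast module from a slow one concrete, here is a toy sketch of a two-timescale recurrence. It is not the HRM-Text architecture, which this summary does not specify in detail. The function name `two_timescale_rollout`, the scalar states, the random weights, and the `slow_period` parameter are all assumptions chosen for illustration. The only point it shows is that a fast state can update at every step while a slow state updates once every few steps and conditions the fast one.

```python
import math
import random


def two_timescale_rollout(inputs, slow_period=4, seed=0):
    """Toy scalar recurrence with a fast state and a slow state.

    Illustrative assumption only: this is not the HRM-Text model.
    """
    rng = random.Random(seed)
    # Hypothetical scalar weights, randomly initialised for illustration.
    w = {name: rng.uniform(-1.0, 1.0) for name in ("ff", "fs", "fx", "ss", "sf")}

    fast, slow = 0.0, 0.0
    slow_updates = 0
    fast_trace = []

    for step, x in enumerate(inputs, start=1):
        # Fast module: updated at every step from its own state,
        # the current slow state, and the input.
        fast = math.tanh(w["ff"] * fast + w["fs"] * slow + w["fx"] * x)
        fast_trace.append(fast)

        # Slow module: updated only once per `slow_period` fast steps,
        # summarising what the fast module has computed so far.
        if step % slow_period == 0:
            slow = math.tanh(w["ss"] * slow + w["sf"] * fast)
            slow_updates += 1

    return fast_trace, slow, slow_updates


trace, slow_state, n_slow = two_timescale_rollout(
    [0.1 * i for i in range(8)], slow_period=4
)
print(len(trace), n_slow)  # 8 fast updates, 2 slow updates
```

In this sketch the slow state changes only at steps 4 and 8, so the fast state runs several steps under a fixed slow context before that context is refreshed.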