How to speed up long-context LLMs without losing accuracy
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei
June 4, 2026
Long-context LLM inference is slow because attending to millions of tokens is expensive. This work shares a single routing index across decoder layers, so the decision about which tokens matter is computed once rather than repeatedly, while keeping the fine-grained, token-level selection that token-sparse methods provide. At a 128K-token context, it delivers a 7.6× decoding speedup without sacrificing answer quality, outperforming both coarse block-sparse approaches and costly per-layer routing.
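To make the idea of a shared routing index concrete, here is a minimal NumPy sketch of one decoding step. It is an illustration under simplifying assumptions, not the paper's implementation: the names `route_once`, `sparse_attention`, `decode_step`, and `index_keys` are hypothetical, routing is plain dot-product top-k, each layer is single-head, and the hidden state is used directly as the query with no learned projections.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def route_once(query, index_keys, k):
    """Score every cached token once and return the indices of the top-k.

    index_keys is a hypothetical per-token routing representation of
    shape (n_tokens, d); the scoring rule here is an assumption.
    """
    scores = index_keys @ query                 # (n_tokens,)
    topk = np.argpartition(scores, -k)[-k:]     # unordered top-k indices
    return np.sort(topk)                        # sorted for contiguous reads


def sparse_attention(query, keys, values, token_idx):
    """Single-head attention restricted to the selected tokens."""
    k_sel = keys[token_idx]                     # (k, d)
    v_sel = values[token_idx]                   # (k, d)
    weights = softmax(k_sel @ query / np.sqrt(query.shape[-1]))
    return weights @ v_sel                      # (d,)


def decode_step(hidden, layers, index_keys, k):
    """Run one decoding step, reusing a single token selection in every layer.

    layers is a list of (keys, values) caches, one pair per decoder layer.
    """
    token_idx = route_once(hidden, index_keys, k)   # routing computed once
    for keys, values in layers:
        # Each layer attends only to the shared selection (residual update).
        hidden = hidden + sparse_attention(hidden, keys, values, token_idx)
    return hidden, token_idx
```

In this sketch the routing cost is paid once per step instead of once per layer, and the selection is made per token rather than per block, which is the trade-off the summary describes. How the actual method builds and scores its index is not specified here.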