cs.AI

How to speed up long-context LLMs without losing accuracy

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

June 4, 2026

Long-context LLM inference is slow because every decoding step must attend over a context that can run to millions of tokens. This work shares a single routing index across decoder layers, computing which tokens matter once rather than separately in every layer, while keeping the fine-grained, token-level selection that token-sparse methods provide. On 128K-token context windows, it delivers a 7.6× decoding speedup without sacrificing answer quality, outperforming both coarse block-sparse attention and costlier per-layer routing approaches.
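To make the idea of a shared routing index concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names (`build_shared_index`, `sparse_attention`, `decode_step`), the dot-product scoring, and the top-k selection rule are all assumptions chosen for illustration. It shows only the structural point from the summary above: token selection runs once per decoding step, and every layer reuses the resulting indices with its own keys and values.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def build_shared_index(route_query, route_keys, k):
    """Score every cached token once and return the top-k token positions.

    Assumption: routing uses a plain dot product and top-k selection.
    The paper's actual scoring rule may differ.
    """
    scores = [dot(key, route_query) for key in route_keys]
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # keep positions in sequence order


def sparse_attention(query, keys, values, token_idx):
    """Softmax attention restricted to the selected token positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(keys[i], query) * scale for i in token_idx]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    dim = len(values[0])
    return [
        sum(e * values[i][d] for e, i in zip(exps, token_idx)) / total
        for d in range(dim)
    ]


def decode_step(layer_queries, layer_kv, route_query, route_keys, k):
    """One decoding step: index once, then reuse the index in every layer."""
    token_idx = build_shared_index(route_query, route_keys, k)  # computed once
    outputs = [
        sparse_attention(q, keys, values, token_idx)
        for q, (keys, values) in zip(layer_queries, layer_kv)
    ]
    return token_idx, outputs
```

In this sketch the routing cost is paid once per step regardless of depth, whereas a per-layer routing scheme would call something like `build_shared_index` inside the loop over layers. Selection stays at token granularity rather than whole blocks, which is the property the summary attributes to token-sparse methods.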
Published as "You Only Index Once: Cross-Layer Sparse Attention with Shared Routing" (arXiv:2606.06467)