How to speed up long-context LLMs without losing accuracy

Long-context LLM inference is slow because attending over up to millions of tokens at every decoding step is expensive. This work shares a single routing index across decoder layers, so the choice of which tokens matter is computed once rather than separately in every layer, while keeping the fine-grained selection that token-level sparse methods provide. On 128K-token context windows, it delivers a 7.6× decoding speedup without sacrificing answer quality, outperforming both coarse block-sparse approaches and costly per-layer routing approaches.
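To make the idea of a shared routing index concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the method from this work: the function names (`build_routing_index`, `sparse_attention`, `decode_step`), the dot-product top-k routing rule, the choice to route against the first layer's keys, and the toy sizes are all hypothetical. The only point it shows is that token selection runs once per decoding step and every layer then attends over the same selected positions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_routing_index(query, keys, k):
    """Pick the k token positions whose keys score highest against the query."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, index):
    """Softmax attention computed only over the positions listed in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, index)) / total
        for d in range(len(values[0]))
    ]


def decode_step(hidden, layer_caches, k):
    """Run one decoding step through every layer with a single shared index."""
    # Assumption of this sketch: route once, against the first layer's keys.
    index = build_routing_index(hidden, layer_caches[0][0], k)
    for keys, values in layer_caches:
        # Every layer reuses `index`; no per-layer routing is performed.
        attended = sparse_attention(hidden, keys, values, index)
        hidden = [h + a for h, a in zip(hidden, attended)]  # residual update
    return index, hidden


# Toy setup: 4 layers, 64 cached tokens, 8-dimensional vectors, keep 8 tokens.
rng = random.Random(0)
num_tokens, dim, num_layers, k = 64, 8, 4, 8


def random_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


layer_caches = [
    (random_matrix(num_tokens, dim), random_matrix(num_tokens, dim))
    for _ in range(num_layers)
]
hidden0 = [rng.gauss(0, 1) for _ in range(dim)]
index, hidden = decode_step(hidden0, layer_caches, k)
```

In this toy version the routing pass touches all 64 cached keys once, and each of the 4 layers then reads only 8 positions. A per-layer routing scheme would instead repeat the scoring pass inside the loop, which is the cost that sharing the index is meant to avoid.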