Making long-context LLMs sparse in just hundreds of training steps
Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Lan Tao, Lin Qu, Yuan Yao, Xiaoxing Ma
May 16, 2026
Long-context inference in LLMs suffers from quadratic attention complexity. RTPurbo exploits three key observations: only some attention heads need full context, long-range retrieval works in low dimensions, and token budgets should be query-dependent. The method keeps full KV caches only for retrieval heads and uses a lightweight 16-dimensional indexer for sparse attention. After minimal fine-tuning, it delivers up to 9.36× prefill speedup and 2.01× decode speedup on long-context benchmarks with negligible accuracy loss, without requiring expensive native sparse pretraining.
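The indexer and the query-dependent token budget can be illustrated with a minimal sketch. This is not the paper's implementation: the summary states only that the indexer is 16-dimensional and that budgets depend on the query. The selection rule used here (keep the smallest set of tokens whose indexer softmax mass reaches a threshold), the function names `select_tokens` and `sparse_attention`, and the `mass` parameter are all assumptions made for illustration. Retrieval-head identification and fine-tuning are not shown.

```python
import math

INDEX_DIM = 16  # indexer width from the summary; everything else here is assumed


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_tokens(q_idx, k_idx, mass=0.95):
    """Pick the smallest key set whose indexer softmax mass is >= mass.

    Hypothetical rule: a peaked score distribution reaches the threshold
    with few tokens, a flat one needs many, so the budget follows the query.
    """
    probs = softmax([dot(q_idx, k) / math.sqrt(len(q_idx)) for k in k_idx])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return sorted(chosen)


def sparse_attention(q, keys, values, positions):
    """Ordinary softmax attention, restricted to the selected positions."""
    weights = softmax([dot(q, keys[i]) / math.sqrt(len(q)) for i in positions])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions))
            for d in range(dim)]


# Toy data: 8 cached tokens whose indexer keys are distinct unit vectors.
n_tokens = 8
k_idx = [[1.0 if d == i else 0.0 for d in range(INDEX_DIM)] for i in range(n_tokens)]
keys = [[float(i), 1.0] for i in range(n_tokens)]
values = [[float(i), float(-i)] for i in range(n_tokens)]

peaked_q = [40.0 if d == 3 else 0.0 for d in range(INDEX_DIM)]  # points at token 3
flat_q = [0.0] * INDEX_DIM                                      # no preference

peaked_sel = select_tokens(peaked_q, k_idx)
flat_sel = select_tokens(flat_q, k_idx)
print(len(peaked_sel), len(flat_sel))
print(sparse_attention([1.0, 0.0], keys, values, peaked_sel))
```

In this toy setup the peaked query keeps a single token while the flat query keeps all eight, which is the behavior a query-dependent budget is meant to capture. A fixed top-k rule would spend the same budget on both.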