cs.AI

Making long-context LLMs sparse in just hundreds of training steps

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Lan Tao, Lin Qu, Yuan Yao, Xiaoxing Ma

May 16, 2026

Long-context LLM inference is bottlenecked by attention, whose cost grows quadratically with sequence length. RTPurbo builds on three observations: only some attention heads need the full context, long-range retrieval works in a low-dimensional space, and the token budget should depend on the query. Accordingly, the method keeps full KV caches only for retrieval heads and uses a lightweight 16-dimensional indexer for sparse attention. After minimal fine-tuning, it delivers up to 9.36× prefill speedup and 2.01× decode speedup on long-context benchmarks with negligible accuracy loss, without requiring expensive native sparse pretraining.
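To make the indexer idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical low-dimensional projection of queries and keys (`q_idx`, `k_idx`) and a cumulative-probability rule (`mass`) as one possible way to obtain a query-dependent token budget; the summary does not say how RTPurbo actually sets the budget. All function and parameter names are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_tokens(q_idx, k_idx, mass=0.9):
    """Return indices of the smallest token set whose indexer probability
    reaches `mass`. The threshold rule is an assumption of this sketch."""
    probs = softmax([dot(q_idx, k) for k in k_idx])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return sorted(chosen)

def sparse_attention(q, keys, values, chosen):
    """Scaled dot-product attention restricted to the chosen token indices."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in chosen])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, chosen))
            for d in range(dim)]
```

Under this rule, a query whose indexer scores are sharply peaked selects only a few tokens, while a query with flat scores selects many, so the budget varies per query rather than being a fixed top-k.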
Published as "Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps" (arXiv:2605.16928)