Making long-context LLMs sparse in just hundreds of training steps

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Lan Tao, Lin Qu, Yuan Yao, Xiaoxing Ma

Long-context inference in LLMs is bottlenecked by the quadratic cost of attention. RTPurbo exploits three observations: only some attention heads (the retrieval heads) need the full context, long-range retrieval works in a low-dimensional space, and the token budget should depend on the query. Accordingly, the method keeps full KV caches only for retrieval heads and uses a lightweight 16-dimensional indexer for sparse attention. After fine-tuning for only hundreds of steps, it delivers up to 9.36× prefill speedup and 2.01× decode speedup on long-context benchmarks with negligible accuracy loss, without requiring expensive native sparse pretraining.
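
To make the indexer idea concrete, the following is a minimal sketch, not the paper's implementation. It scores keys in a low-dimensional space, picks a query-dependent number of tokens, and then runs ordinary softmax attention over only those tokens. The projection matrices `Wq` and `Wk`, the cumulative-mass rule controlled by `mass`, and all function names are illustrative assumptions; the abstract states only that the indexer is 16-dimensional and that budgets are query-dependent. The sketch does not model the split between retrieval heads and other heads.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(x, W):
    # x has length d; W is d rows by r columns; the result has length r.
    return [dot(x, col) for col in zip(*W)]

def sparse_attend(q, K, V, Wq, Wk, mass=0.95):
    """Attend from one query to the tokens chosen by a low-dim indexer.

    Wq and Wk are hypothetical indexer projections (r = 16 in the paper's
    setting); `mass` is an assumed selection rule, not taken from the source.
    """
    # Cheap relevance scores computed in the r-dimensional indexer space.
    q_idx = project(q, Wq)
    r = len(q_idx)
    probs = softmax([dot(project(k, Wk), q_idx) / math.sqrt(r) for k in K])

    # Query-dependent budget: smallest token set whose indexer mass reaches `mass`.
    order = sorted(range(len(K)), key=lambda i: -probs[i])
    keep, covered = [], 0.0
    for i in order:
        keep.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    keep.sort()

    # Full-dimensional softmax attention restricted to the selected tokens.
    w = softmax([dot(K[i], q) / math.sqrt(len(q)) for i in keep])
    out = [sum(wi * V[i][j] for wi, i in zip(w, keep)) for j in range(len(V[0]))]
    return out, keep
```

Under this assumed rule, a query whose indexer scores are sharply peaked attends to very few tokens, while a query with flat scores falls back to attending to nearly everything, which is one simple way to realize a budget that varies per query.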