Smarter sparse attention that learns which tokens matter

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

Hierarchical attention methods speed up transformers by first filtering the context down to relevant token blocks, then applying full attention only to the selected blocks. These methods have two problems: they assume every query needs the same number of tokens, and the filtering stage blocks gradient flow. DashAttention replaces rigid top-k selection with an adaptive sparse transformation that picks a variable number of blocks per query while remaining fully differentiable. On long-context LLM tasks, it matches full attention at 75% sparsity and outperforms prior methods such as NSA and InfLLMv2, and its Triton implementation offers 3× speedups.
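
To make the contrast between fixed top-k and adaptive block selection concrete, here is a minimal NumPy sketch. It assumes sparsemax as the adaptive sparse transformation; the summary does not name the specific transformation DashAttention uses, so this is an illustration of the general idea rather than the authors' method. The function names (`sparsemax`, `topk_blocks`, `adaptive_blocks`) and the block scores are hypothetical.

```python
import numpy as np

def sparsemax(z):
    """Project a 1-D score vector onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so the number
    of surviving entries depends on the scores themselves.
    (Assumed stand-in for the adaptive sparse transformation.)
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # True for entries kept in the support
    k_max = k[support][-1]                   # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max    # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def topk_blocks(scores, k):
    """Rigid baseline: always keep exactly k blocks, whatever the scores."""
    return np.sort(np.argsort(scores)[::-1][:k])

def adaptive_blocks(scores):
    """Keep every block that receives nonzero sparsemax weight."""
    return np.flatnonzero(sparsemax(scores) > 0)

# Hypothetical query-to-block relevance scores for two queries.
peaked = np.array([4.0, 0.5, 0.2, 0.1])   # one block clearly dominates
flat = np.array([1.0, 0.9, 0.8, 0.1])     # three blocks are comparably relevant

print(topk_blocks(peaked, 2), topk_blocks(flat, 2))    # both queries get 2 blocks
print(adaptive_blocks(peaked), adaptive_blocks(flat))  # 1 block vs 3 blocks
```

With top-k, both queries receive two blocks regardless of how their scores are distributed. With the sparsemax stand-in, the peaked query keeps one block and the flat query keeps three. The surviving weights are a piecewise-linear function of the scores, so gradients can flow back through the selected blocks, whereas a hard top-k index selection provides no gradient to the scores.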