Smarter sparse attention that learns which tokens matter
Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso
May 18, 2026
Hierarchical attention methods speed up transformers by first filtering the context down to relevant token blocks, then applying full attention only within those blocks. They have two weaknesses: they assume every query needs the same number of tokens, and the filtering stage blocks gradient flow. DashAttention replaces rigid top-k selection with an adaptive sparse transformation that selects a variable number of blocks per query while remaining fully differentiable. On long-context LLM tasks, it matches full attention at 75% sparsity and outperforms prior methods such as NSA and InfLLMv2, and its Triton implementation delivers 3× speedups.
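To make the contrast between fixed top-k and adaptive block selection concrete, here is a minimal pure-Python sketch. It uses sparsemax as a stand-in for the adaptive sparse transformation; the summary does not name the exact transformation DashAttention uses, so this choice, along with the function names `select_topk` and `select_adaptive` and the toy block scores, is an assumption made for illustration only.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.

    Stand-in for an adaptive sparse transformation (an assumption,
    not necessarily the one used by DashAttention).
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support condition holds for a prefix of the sorted scores;
        # the last j that satisfies it determines the threshold tau.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def select_topk(scores, k):
    """Rigid selection: always keep exactly k blocks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def select_adaptive(scores):
    """Adaptive selection: keep every block with nonzero sparsemax weight."""
    weights = sparsemax(scores)
    return [i for i, w in enumerate(weights) if w > 0.0]


# Hypothetical query-to-block relevance scores for two queries.
peaked = [4.0, 1.0, 0.5, 0.2]   # one block clearly dominates
flat = [1.0, 0.9, 0.8, 0.1]     # several blocks are similarly relevant

print(select_topk(peaked, 2), select_topk(flat, 2))   # [0, 1] [0, 1]
print(select_adaptive(peaked), select_adaptive(flat))  # [0] [0, 1, 2]
```

With top-k, both queries receive two blocks regardless of how their scores are distributed. With the sparsemax stand-in, the peaked query keeps a single block and the flat query keeps three, and the nonzero weights are a piecewise-linear function of the scores, so gradients can reach the scores of every selected block.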