How to compute attention without moving data around so much?
Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias
May 22, 2026
Computing attention in large language models involves shuttling enormous matrices between fast and slow memory. Existing methods such as FlashAttention require data transfers that scale quadratically with sequence length. This work instead uses approximate attention techniques to achieve nearly linear scaling, and the authors prove that their algorithms match information-theoretic lower bounds. The approach covers most practical parameter regimes and could enable faster inference on long sequences.
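To give a feel for what "quadratic in sequence length" means here, the following is a rough sketch rather than anything taken from the paper. It assumes the commonly cited asymptotic estimate for FlashAttention's slow-memory traffic, on the order of n²d²/M elements for sequence length n, head dimension d, and fast-memory size M, with constant factors dropped. For comparison it computes the 4nd elements needed just to read Q, K, V and write the output once. The function names and the example sizes are hypothetical, and the second quantity is a trivial baseline, not the paper's bound.

```python
def tiled_attention_transfers(n: int, d: int, m: int) -> float:
    """Order-of-magnitude slow-memory traffic for exact tiled attention.

    Assumption: the n^2 * d^2 / m estimate with constants dropped,
    where m is the fast-memory capacity measured in elements.
    """
    return (n * n * d * d) / m


def touch_once_transfers(n: int, d: int) -> int:
    """Elements moved when Q, K, V are each read once and the output is written once."""
    return 4 * n * d


if __name__ == "__main__":
    d, m = 64, 65536  # hypothetical head dimension and fast-memory size
    for n in (4096, 8192, 16384):
        exact = tiled_attention_transfers(n, d, m)
        floor = touch_once_transfers(n, d)
        print(f"n={n}: tiled ~{exact:.3g} elements, touch-once {floor} elements")
```

Under this model, doubling the sequence length quadruples the estimated traffic of the tiled method but only doubles the touch-once baseline, which is the gap that a nearly linear method would aim to close.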