cs.LG

How to compute attention without moving data around so much?

Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias

May 22, 2026

Computing attention in large language models involves shuttling enormous matrices between fast and slow memory. Existing methods such as FlashAttention require data transfers that scale quadratically with sequence length. This work instead uses approximate attention techniques to achieve nearly linear scaling, and the authors prove that their algorithms match information-theoretic lower bounds. The approach covers most practical parameter regimes and could enable faster inference on long sequences.
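
To make the quadratic transfer cost of exact tiled attention concrete, here is a minimal sketch that counts elements moved between slow and fast memory under a simplified two-level memory model. The function name `tiled_attention_io`, the four-tiles-in-fast-memory assumption, and the example sizes are all illustrative assumptions; this is not FlashAttention's actual schedule and not the algorithm from the paper.

```python
import math

def tiled_attention_io(n, d, m):
    """Count elements transferred for a simple tiled exact-attention schedule.

    Assumed (hypothetical) model: fast memory holds m elements, and one Q tile,
    one K tile, one V tile, and one output tile of `rows` x d each are resident
    at the same time.
    """
    rows = max(1, m // (4 * d))      # tile height so that four tiles fit in fast memory
    q_tiles = math.ceil(n / rows)    # number of Q tiles processed in the outer loop
    q_reads = n * d                  # each Q row is loaded once
    kv_reads = q_tiles * 2 * n * d   # all of K and V are streamed once per Q tile
    out_writes = n * d               # each output row is written once
    return q_reads + kv_reads + out_writes

# Illustrative sizes only: head dimension 64, fast memory of 65536 elements.
for n in (1024, 2048, 4096):
    print(n, tiled_attention_io(n, d=64, m=65536))
```

In this model the dominant term is the repeated streaming of K and V, which grows with the square of the sequence length n, so doubling n roughly quadruples the transfer count once n is large relative to the tile height.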
Published as "Approaching I/O-optimality for Approximate Attention", arXiv:2605.23751