How can we compute attention without moving so much data around?

Computing attention in large language models involves shuttling enormous matrices between fast and slow memory. Existing exact methods such as FlashAttention require data transfers that scale quadratically with sequence length. This work instead uses approximate attention techniques to achieve nearly linear scaling, and proves that its algorithms match information-theoretic lower bounds. The approach covers most practical parameter regimes and could enable faster inference on long sequences.
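
To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the work summarized above. It assumes the asymptotic estimate from the FlashAttention analysis, on the order of n^2 d^2 / M element transfers for sequence length n, head dimension d, and fast-memory size M, with constants dropped. The function names and the example sizes are hypothetical and chosen only for illustration.

```python
def input_size(n: int, d: int) -> int:
    """Elements in Q, K and V: any method must read at least this much."""
    return 3 * n * d


def flash_attention_io_estimate(n: int, d: int, m: int) -> float:
    """Assumed asymptotic transfer count n^2 * d^2 / M (constants dropped).

    n: sequence length, d: head dimension, m: fast-memory size in elements.
    This is an order-of-magnitude model, not a measurement.
    """
    return (n * n * d * d) / m


if __name__ == "__main__":
    d, m = 64, 65536  # hypothetical head dimension and fast-memory size
    for n in (1024, 2048, 4096, 8192):
        est = flash_attention_io_estimate(n, d, m)
        # Doubling n doubles the input size but quadruples the estimate.
        print(f"n={n:5d}  input={input_size(n, d):8d}  estimate={est:12.0f}")
```

Under this model, each doubling of the sequence length quadruples the estimated traffic while the input itself only doubles. That gap is what a nearly linear method would close.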