How to shrink video AI's memory footprint by 93%

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

Generating minute-long videos with diffusion models hits a memory wall: the cache of attention keys and values grows with sequence length. VideoMLA replaces the standard per-head cache with a shared low-rank bottleneck, cutting per-token memory by 92.7%. Surprisingly, this works even though video attention, unlike language-model attention, is not inherently low-rank: the compression forces the model to adapt during training rather than exploiting existing spectral structure. On VBench, it matches baseline quality at short horizons and achieves the best score at long horizons, with a 1.23× throughput gain on a single B200 GPU.
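To make the per-token saving concrete, here is a minimal sketch of the arithmetic behind swapping a per-head key/value cache for one shared latent vector. The function names and the dimensions (16 heads, head size 128, latent size 512) are illustrative assumptions, not VideoMLA's actual configuration, so the printed figure is not the paper's 92.7%.

```python
def kv_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Elements cached per token, per layer, by standard multi-head attention."""
    # One key vector and one value vector are stored for every head.
    return 2 * n_heads * head_dim


def latent_cache_per_token(latent_dim: int) -> int:
    """Elements cached per token, per layer, when all heads share one latent."""
    # Keys and values are re-projected from this single vector at attention time.
    return latent_dim


def reduction(n_heads: int, head_dim: int, latent_dim: int) -> float:
    """Fraction of per-token cache memory saved by the shared latent."""
    full = kv_cache_per_token(n_heads, head_dim)
    compressed = latent_cache_per_token(latent_dim)
    return 1.0 - compressed / full


# Hypothetical dimensions, chosen only to illustrate the calculation.
saved = reduction(n_heads=16, head_dim=128, latent_dim=512)
print(f"{saved:.1%}")  # prints 87.5%
```

The saving depends only on the ratio of the latent width to the combined key and value width, so it is the same at every sequence length; the absolute memory saved grows with the number of cached tokens.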