How to make video AI models 100× faster without losing accuracy?
Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles
May 29, 2026
Video models choke on long sequences because the cost of self-attention scales quadratically with frame count. StateKV replaces full self-attention with a lightweight recurrent state that tracks important cross-frame context, then uses a separate cache for decoding. Tested on three benchmarks across seven pretrained models, it matches full-attention accuracy while cutting prefill cost dramatically, and lets you run larger models within the same compute budget.
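To get a feel for why the quadratic term matters, here is a rough back-of-the-envelope count of query-key pairs during prefill. It is only an illustration of the scaling argument, not StateKV's actual algorithm: the function names, the `state_tokens` parameter, and the assumption that each frame attends only to its own tokens plus a fixed-size state are all hypothetical.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the video."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def recurrent_state_pairs(num_frames: int, tokens_per_frame: int, state_tokens: int) -> int:
    """Query-key pairs under an ASSUMED design: each frame attends only to
    its own tokens plus a fixed-size recurrent state of `state_tokens` entries."""
    per_frame = tokens_per_frame * (tokens_per_frame + state_tokens)
    return num_frames * per_frame


def pair_ratio(num_frames: int, tokens_per_frame: int, state_tokens: int) -> float:
    """How many times more pairs full attention computes than the assumed recurrent scheme."""
    full = full_attention_pairs(num_frames, tokens_per_frame)
    recurrent = recurrent_state_pairs(num_frames, tokens_per_frame, state_tokens)
    return full / recurrent
```

Under these assumptions, doubling the number of frames quadruples the full-attention count but only doubles the recurrent count, so the gap widens as videos get longer. Real speedups also depend on memory traffic, the MLP layers, and implementation details that this count ignores.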