cs.CV

How to make video AI models 100× faster without losing accuracy?

Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles

May 29, 2026

Video models choke on long sequences because the cost of self-attention grows quadratically with the number of frames. StateKV replaces full self-attention with a lightweight recurrent state that tracks important cross-frame context, then uses a separate cache for decoding. Tested on three benchmarks across seven pretrained models, it matches full-attention accuracy while cutting prefill cost dramatically, and lets you run larger models within the same compute budget.
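To make the scaling argument concrete, here is a rough counting sketch. It is not the StateKV implementation: it only assumes, for illustration, that each frame attends to its own tokens plus a fixed-size state, and it counts the query-key pairs that would be scored during prefill. The function names and the `tokens_per_frame` and `state_size` values are hypothetical.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs scored when every video token attends to every other."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def recurrent_state_pairs(num_frames: int, tokens_per_frame: int, state_size: int) -> int:
    """Query-key pairs when each frame sees only itself plus a fixed-size state.

    Assumption for illustration only: the state holds `state_size` slots and
    does not grow with the number of frames.
    """
    per_frame = tokens_per_frame * (tokens_per_frame + state_size)
    return num_frames * per_frame


if __name__ == "__main__":
    # Hypothetical sizes; real models differ.
    tokens_per_frame, state_size = 196, 64
    for frames in (8, 64, 512):
        full = full_attention_pairs(frames, tokens_per_frame)
        rec = recurrent_state_pairs(frames, tokens_per_frame, state_size)
        print(f"{frames} frames: full={full}, recurrent={rec}, ratio={full / rec:.1f}")
```

Under these assumptions, doubling the frame count quadruples the full-attention count but only doubles the recurrent one, so the gap widens as videos get longer.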
Published as "Linear Scaling Video VLMs for Long Video Understanding", arXiv:2605.31598