cs.CV

Why state-space models beat transformers at vision understanding when speed matters

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

May 29, 2026

Zamba2-VL combines Mamba2 state-space layers with sparse transformer blocks to build competitive vision-language models at 1.2B to 7B parameters. On image understanding, reasoning, OCR, and grounding benchmarks, it matches Qwen3-VL and Molmo2 in quality while achieving roughly 10× lower time-to-first-token, the latency that matters for real-time interaction. The efficiency gap widens at smaller scales, making these models practical for on-device deployment.
Published as "Zamba2-VL Technical Report", arXiv:2606.00390