Why state-space models beat transformers at vision understanding when speed matters
Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge
May 29, 2026
Zamba2-VL combines Mamba2 state-space layers with sparse transformer blocks to build competitive vision-language models at 1.2B to 7B parameters. On image understanding, reasoning, OCR, and grounding benchmarks, it matches the quality of Qwen3-VL and Molmo2 while achieving roughly 10× lower time-to-first-token, the latency that matters most for real-time interaction. The efficiency gap widens at smaller scales, making these models practical for on-device deployment.
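Time-to-first-token is the delay between submitting a request and receiving the first generated token, so it can be measured by timing a streaming generator up to its first yield. The sketch below is illustrative only: `time_to_first_token` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that sleeps to mimic prompt and image processing rather than any real Zamba2-VL, Qwen3-VL, or Molmo2 interface.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_token(generate: Callable[[], Iterable[str]]) -> Tuple[str, float]:
    """Return the first token and the seconds elapsed before it arrived."""
    start = time.perf_counter()
    stream = iter(generate())
    first = next(stream)  # blocks until the first token is produced
    return first, time.perf_counter() - start


def fake_model(prefill_seconds: float) -> Callable[[], Iterator[str]]:
    """Hypothetical stand-in for a streaming model; sleeps to mimic prefill."""
    def generate() -> Iterator[str]:
        time.sleep(prefill_seconds)  # simulated image and prompt processing
        yield "A"
        yield " cat"
    return generate


if __name__ == "__main__":
    token, ttft = time_to_first_token(fake_model(0.05))
    print(f"first token {token!r} after {ttft:.3f}s")
```

With a real model, `generate` would wrap whatever streaming call the serving stack exposes. Comparing two models fairly would also require identical inputs, hardware, and warm-up runs, none of which this sketch handles.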