Why state-space models beat transformers at vision understanding when speed matters

Zamba2-VL combines Mamba2 state-space layers with sparse transformer blocks to build competitive vision-language models at 1.2B to 7B parameters. On image understanding, reasoning, OCR, and grounding benchmarks, it matches the quality of Qwen3-VL and Molmo2 while achieving roughly 10× lower time-to-first-token (TTFT), the latency that matters most for real-time interaction. The efficiency gap widens at smaller scales, which makes these models practical for on-device deployment.
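To make the latency metric concrete, here is a minimal Python sketch of how time-to-first-token could be measured for any streaming model. It is not the benchmark harness behind the numbers above. Both `measure_ttft` and `fake_model_stream` are hypothetical names introduced for illustration, and the sleep durations are made-up placeholders, not measured values.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response."""
    start = clock()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time until the first token arrives; dominated by prefill
            # (encoding the image and prompt) rather than by decoding.
            ttft = clock() - start
        count += 1
    total = clock() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_model_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: sleeps to mimic prefill, then decode steps."""
    time.sleep(prefill_s)  # placeholder for prefill work
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # placeholder for one decode step
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_model_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms, tokens: {n}")
```

Because the generator does no work until it is first iterated, the timer starts before the simulated prefill, so the reported TTFT includes it. With a real model, the stream would be whatever token iterator the serving stack exposes.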