cs.CV

Teaching vision models which geometry matters for each pixel

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

May 21, 2026

Current vision-language models treat geometric information as a one-size-fits-all signal, but spatial reasoning requires different geometric evidence depending on where a pixel sits in the scene. GeoWeaver addresses this by matching each visual token to the most relevant geometric abstractions in a learned geometry bank before passing the token to the language model. On spatial reasoning benchmarks, this token-adaptive approach outperforms models that add geometry at a later stage, suggesting that geometric grounding should be foundational rather than auxiliary.
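To make the token-to-bank matching idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `ground_token`, the parameters `bank` and `top_k`, the scaled dot-product scoring, the top-k selection, and the residual fusion are all assumptions chosen for illustration, and the bank here is a hand-written list rather than a learned one.

```python
import math

def ground_token(token, bank, top_k=2):
    """Fuse one visual token with its top_k most similar bank entries.

    Hypothetical helper: scoring and fusion rules are illustrative
    assumptions, not the method described in the paper.
    """
    # Scaled dot-product similarity between the token and every bank entry.
    scale = math.sqrt(len(token))
    scores = [sum(t * g for t, g in zip(token, entry)) / scale for entry in bank]

    # Keep only the indices of the top_k best-matching entries for this token.
    chosen = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the selected scores (max-subtracted for numerical stability).
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the chosen entries, added residually to the token.
    fused = list(token)
    for w, i in zip(weights, chosen):
        for d, g in enumerate(bank[i]):
            fused[d] += w * g
    return fused, chosen, weights

# Toy geometry bank with three entries of dimension 4 (made-up values).
bank = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
token_a = [2.0, 0.1, 0.0, 0.0]  # most similar to bank entry 0
token_b = [0.0, 0.0, 3.0, 0.5]  # most similar to bank entry 2

fused_a, chosen_a, weights_a = ground_token(token_a, bank)
fused_b, chosen_b, weights_b = ground_token(token_b, bank)
```

In this toy setup the two tokens select different bank entries, which is the token-adaptive behavior the summary describes: each token receives its own mix of geometric evidence before anything reaches the language model.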
Published as "GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning", arXiv:2605.22558