Teaching vision models which geometry matters for each pixel
Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang
May 21, 2026
Current vision-language models treat geometric information as a one-size-fits-all signal, but spatial reasoning requires different geometric evidence depending on where a pixel sits in the scene. GeoWeaver addresses this by matching each visual token to the most relevant geometric abstractions in a learned geometry bank before the tokens are passed to the language model. On spatial reasoning benchmarks, this token-adaptive approach outperforms models that add geometry at a later stage, suggesting that geometric grounding should be foundational rather than auxiliary.
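The summary does not say how tokens are matched to the geometry bank, so the following is only a minimal sketch of one way token-adaptive selection could work, not the authors' implementation. It assumes scaled dot-product similarity, a hard top-k choice per token, and a residual add for fusion; the names `weave_geometry`, `tokens`, `bank`, and `top_k` are hypothetical.

```python
import numpy as np

def weave_geometry(tokens, bank, top_k=2):
    """Illustrative token-adaptive geometry retrieval (all choices are assumptions).

    tokens: (N, D) visual token features
    bank:   (M, D) geometry bank entries (learned in a real model)
    Returns the fused tokens (N, D) and per-token bank weights (N, M).
    """
    # Similarity between every token and every bank entry, scaled by sqrt(D).
    scores = tokens @ bank.T / np.sqrt(tokens.shape[1])

    # Per-token threshold: the k-th largest score in each row.
    kth = np.partition(scores, -top_k, axis=1)[:, -top_k][:, None]

    # Mask out everything below the threshold, then softmax over what remains.
    masked = np.where(scores >= kth, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each token receives its own mixture of bank entries (residual fusion).
    fused = tokens + weights @ bank
    return fused, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 visual tokens, 8-dim features
bank = rng.normal(size=(4, 8))     # 4 geometry bank entries
fused, weights = weave_geometry(tokens, bank, top_k=2)
print(fused.shape, weights.shape)
```

The point of the sketch is that each row of `weights` is different, so two tokens from different parts of the scene can draw on different bank entries, in contrast to adding a single shared geometry signal to every token.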