Teaching vision models which geometry matters for each pixel

Current vision-language models treat geometric information as a one-size-fits-all signal, but spatial reasoning requires different geometric evidence depending on where a pixel sits in the scene. GeoWeaver addresses this by matching each visual token to the most relevant geometric abstractions in a learned geometry bank before the tokens are passed to the language model. On spatial reasoning benchmarks, this token-adaptive approach outperforms models that add geometry as a later add-on, suggesting that geometric grounding should be foundational rather than auxiliary.
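
To make the idea of token-adaptive matching concrete, here is a minimal sketch of one way a visual token could be paired with entries from a geometry bank. It is not GeoWeaver's actual implementation: the function name `route_tokens_to_geometry`, the scaled dot-product scoring, the `top_k` selection, and the residual fusion are all assumptions chosen for illustration, and the bank here is random rather than learned.

```python
import numpy as np


def route_tokens_to_geometry(tokens, bank, top_k=2):
    """Fuse each visual token with its top_k best-matching bank entries.

    Hypothetical illustration only; the scoring and fusion rules are assumed.
    tokens: (n, d) array of visual token embeddings.
    bank:   (K, d) array standing in for a learned geometry bank.
    """
    n, d = tokens.shape
    # Scaled dot-product similarity between every token and every bank entry.
    scores = tokens @ bank.T / np.sqrt(d)                      # (n, K)
    # Indices of the top_k highest-scoring bank entries for each token.
    top_idx = np.argsort(scores, axis=1)[:, -top_k:]           # (n, top_k)
    top_scores = np.take_along_axis(scores, top_idx, axis=1)   # (n, top_k)
    # Softmax over the selected entries only, so each token gets its own mix.
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)
    weights = np.exp(top_scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of the selected entries, added to the token as a residual.
    selected = bank[top_idx]                                   # (n, top_k, d)
    geometry = (weights[..., None] * selected).sum(axis=1)     # (n, d)
    return tokens + geometry, top_idx, weights


# Toy usage: 6 visual tokens, a bank of 4 entries, embedding size 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
bank = rng.normal(size=(4, 8))
fused, top_idx, weights = route_tokens_to_geometry(tokens, bank, top_k=2)
```

In this sketch each token ends up with its own set of bank indices and mixing weights, which is the contrast with a one-size-fits-all signal where every token would receive the same geometric features.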