Why discrete labels break vision-language models at reasoning?
Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman
June 4, 2026
Vision-language models fail at compositional reasoning, that is, at understanding how objects relate to each other. Injecting discrete scene graph labels backfires because the text labels do not align with continuous visual features. HyperVis bypasses this bottleneck by computing dense visual relations via cross-attention and projecting them onto hyperbolic space, where hierarchies emerge naturally from the geometry. Used as a training regularizer and inference encoder, it recovers baseline performance on GQA (61.03%) and significantly improves compositional scoring on SugarCrepe (79.94%, +6.25pp). The learned curvature stabilizes at κ=4.0, much stronger than in prior hyperbolic VLMs, which suggests that visual features genuinely benefit from curved space.
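To make the projection step concrete, here is a minimal sketch of mapping a Euclidean feature vector into a Poincaré ball. It is an illustration only, not the paper's implementation. It assumes the Poincaré ball model, assumes that κ=4.0 means a ball of curvature -κ (so the ball has radius 1/√κ = 0.5), and uses the standard exponential map at the origin. The names `expmap0`, `dist0`, and `relation` are hypothetical.

```python
import math

def expmap0(v, c=4.0):
    """Exponential map at the origin: Euclidean vector -> Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def dist0(x, c=4.0):
    """Hyperbolic distance from the origin to a point x inside the ball."""
    norm = math.sqrt(sum(t * t for t in x))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)

# Hypothetical 2-D relation feature with Euclidean norm 0.5 (assumption for illustration).
relation = [0.3, -0.4]
point = expmap0(relation)

# The mapped point always lies strictly inside the ball of radius 1/sqrt(c).
radius = 1.0 / math.sqrt(4.0)
point_norm = math.sqrt(sum(t * t for t in point))
```

Under this convention, the hyperbolic distance from the origin equals twice the Euclidean norm of the input vector. Features with larger norms therefore land closer to the boundary, which is the property that lets general concepts sit near the origin and specific ones near the edge.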