cs.CV

Why do discrete labels break vision-language models at reasoning?

Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

June 4, 2026

Vision-language models (VLMs) fail at compositional reasoning, that is, at understanding how objects relate to each other. Injecting discrete scene graph labels backfires because the text labels do not align with continuous visual features. HyperVis bypasses this bottleneck by computing dense visual relations via cross-attention and projecting them onto hyperbolic space (the Lorentz hyperboloid), where hierarchies emerge naturally from the geometry. Used as both a training regularizer and an inference encoder, it recovers baseline performance on GQA (61.03%) and significantly improves compositional scoring on SugarCrepe (79.94%, +6.25pp). The learned curvature stabilizes at κ=4.0, much stronger than in prior hyperbolic VLMs, which the authors take as confirmation that visual features genuinely need curved space.
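To make the projection step concrete, here is a minimal NumPy sketch of mapping Euclidean relation features onto a Lorentz hyperboloid. It is an illustration under stated assumptions, not the HyperVis implementation: the function names (`lorentz_inner`, `expmap0`, `lorentz_distance`) are hypothetical, the features are assumed to be lifted with the exponential map at the origin, and κ is assumed to denote the magnitude of a negative curvature (points satisfy ⟨x, x⟩ = −1/κ). The summary does not state which convention the paper uses.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product: -x0*y0 + sum_i xi*yi
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def expmap0(u, kappa):
    # Assumption: Euclidean features u (..., d) are treated as tangent
    # vectors at the hyperboloid origin and lifted to (..., d + 1).
    sqrt_k = np.sqrt(kappa)
    norm = np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), 1e-9)
    time = np.cosh(sqrt_k * norm) / sqrt_k                 # time-like coordinate
    space = np.sinh(sqrt_k * norm) * u / (sqrt_k * norm)   # space-like coordinates
    return np.concatenate([time, space], axis=-1)

def lorentz_distance(x, y, kappa):
    # Geodesic distance; the clamp guards arccosh against rounding below 1.
    arg = np.maximum(-kappa * lorentz_inner(x, y), 1.0)
    return np.arccosh(arg) / np.sqrt(kappa)

# Toy relation features (hypothetical values), lifted with kappa = 4.0
relations = np.array([[0.3, -0.4, 0.0],
                      [0.6, 0.0, 0.8]])
points = expmap0(relations, kappa=4.0)
origin = expmap0(np.zeros((1, 3)), kappa=4.0)
print(lorentz_inner(points, points))                 # each entry is close to -1/kappa
print(lorentz_distance(origin, points, kappa=4.0))   # matches the Euclidean norms of the inputs
```

Under this convention, distance from the origin equals the norm of the input feature, so features with larger norm land farther out on the hyperboloid, where volume grows exponentially. That growth is what lets tree-like hierarchies embed with low distortion.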
Published as "HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning" (arXiv:2606.06100)