Teaching vision models to reason across multiple viewpoints

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

Multimodal language models struggle to reason consistently about objects and scenes viewed from multiple angles. CrossView Suite addresses three concrete gaps: the lack of training data, the absence of evaluation benchmarks, and the lack of explicit mechanisms for matching objects across views. The suite includes CrossViewSet (1.6M annotated samples), CrossViewBench (a scene-disjoint evaluation benchmark), and CrossViewer (a three-stage framework that combines spatial region tokenization, cross-view object alignment, and multi-view feature fusion). Experiments confirm that large-scale training data, systematic evaluation, and explicit alignment all significantly improve cross-view spatial reasoning. Code and models are released.
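
To make "scene-disjoint" concrete: every sample from a given scene falls entirely on one side of the train/evaluation boundary, so a model cannot score well by memorizing scenes it saw during training. The abstract does not describe how CrossViewBench was actually partitioned, so the following is only a minimal sketch of the general idea. The function name `scene_disjoint_split`, the `scene_id` field, and the `eval_fraction` parameter are all hypothetical.

```python
import random
from collections import defaultdict


def scene_disjoint_split(samples, eval_fraction=0.2, seed=0):
    """Split samples so that no scene contributes to both partitions.

    Hypothetical helper: assumes each sample is a dict with a "scene_id" key.
    """
    # Group samples by the scene they were captured in.
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)

    # Shuffle scenes (not samples) deterministically, then hold some out.
    scene_ids = sorted(by_scene)
    random.Random(seed).shuffle(scene_ids)
    n_eval = max(1, round(len(scene_ids) * eval_fraction))
    eval_scenes = set(scene_ids[:n_eval])

    train = [s for sid in scene_ids if sid not in eval_scenes for s in by_scene[sid]]
    evaluation = [s for sid in scene_ids if sid in eval_scenes for s in by_scene[sid]]
    return train, evaluation


# Toy data: 5 scenes with 4 views each.
samples = [{"scene_id": f"scene_{i % 5}", "view": i} for i in range(20)]
train, evaluation = scene_disjoint_split(samples)
```

Splitting at the scene level rather than the sample level is the design choice that matters here: a random per-sample split would place different views of the same scene in both partitions and leak scene layout into evaluation.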