Teaching video AI where the camera is looking

Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie

Video language models typically treat each frame as an isolated image, missing how camera movement relates observations across time. This work adds lightweight camera pose estimation to a video multimodal large language model (MLLM), using learnable per-frame tokens and a pose regression head. The model gains 4.5–6.5% on spatial reasoning benchmarks such as VSI-Bench and generalizes across eight video QA tasks, while achieving state-of-the-art streaming pose estimation on ScanNet. Even pseudo-labeled poses derived from unlabeled video improve general video understanding, suggesting that camera geometry is fundamental to reasoning about physical scenes.
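
To make the "per-frame tokens plus pose regression head" idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. Everything named in it is an assumption made for illustration: the hidden size, the single shared camera-token embedding inserted once after each frame, the 7-number pose parameterization (3 translation values plus a unit quaternion), and especially `causal_mean_backbone`, a toy stand-in for the language model that only preserves the property that each position sees earlier tokens and nothing later.

```python
import math
import random

HIDDEN = 8    # hypothetical hidden size, chosen only for this sketch
POSE_DIM = 7  # assumed pose layout: 3 translation values + 4 quaternion values


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def interleave_camera_tokens(frames, camera_token):
    """Insert one copy of the camera token after each frame's visual tokens.

    Returns the flat token sequence and the position of each camera token.
    """
    sequence, cam_positions = [], []
    for frame_tokens in frames:
        sequence.extend(frame_tokens)
        cam_positions.append(len(sequence))
        sequence.append(list(camera_token))
    return sequence, cam_positions


def causal_mean_backbone(sequence):
    """Toy stand-in for the language model (an assumption, not the real model).

    Each output is the mean of all tokens up to and including that position,
    so a position never depends on later frames.
    """
    hidden, running = [], [0.0] * len(sequence[0])
    for count, token in enumerate(sequence, start=1):
        running = [r + t for r, t in zip(running, token)]
        hidden.append([r / count for r in running])
    return hidden


def pose_head(hidden_state, weight, bias):
    """Regress (translation, unit quaternion) from one camera-token state."""
    out = linear(hidden_state, weight, bias)
    translation, quat = out[:3], out[3:]
    norm = math.sqrt(sum(c * c for c in quat)) or 1.0  # guard against zero norm
    return translation, [c / norm for c in quat]


def estimate_poses(frames, camera_token, weight, bias):
    """Run the sketch end to end: one pose estimate per input frame."""
    sequence, cam_positions = interleave_camera_tokens(frames, camera_token)
    hidden = causal_mean_backbone(sequence)
    return [pose_head(hidden[p], weight, bias) for p in cam_positions]


rng = random.Random(0)
# Three frames with four visual tokens each (random placeholders for real features).
frames = [[[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(4)]
          for _ in range(3)]
camera_token = [rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]  # would be learned
W = [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN)] for _ in range(POSE_DIM)]
b = [0.0] * POSE_DIM

poses = estimate_poses(frames, camera_token, W, b)
for index, (translation, quat) in enumerate(poses):
    print(index, [round(v, 3) for v in translation], [round(v, 3) for v in quat])
```

Because the stand-in backbone is causal, the estimate for a frame does not change when later frames arrive, which is the property a streaming pose estimator needs. In a trained system the camera token and head weights would be learned against ground-truth or pseudo-labeled poses; here they are random, so the printed values carry no meaning.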