Can one model estimate consistent geometry from any video length?
Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen
May 28, 2026
ViGeo recovers 3D geometry (depth, surface normals, and point maps) from video with a single unified model. Its key mechanism, dynamic chunking attention, lets the same model run in both causal (streaming) and bidirectional (offline) contexts and switch between them at test time. A separate data refinement step supplies cleaned, temporally consistent training targets in place of noisy raw annotations. Trained only on public data, it outperforms specialized methods across streaming, full-length, and long-video benchmarks.
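To make the causal-versus-bidirectional idea concrete, here is a minimal sketch of a chunk-wise attention mask. It is an illustration under stated assumptions, not the paper's implementation: the summary does not describe how dynamic chunking attention is built, so the function name `chunked_attention_mask`, the `chunk_size` parameter, and the rule "bidirectional inside a chunk, causal across chunks" are all hypothetical.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Hypothetical rule (an assumption, not taken from the paper):
    frames are grouped into consecutive chunks of `chunk_size`; a frame sees
    every frame in its own chunk and in all earlier chunks.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Chunk index of every frame, e.g. [0, 0, 1, 1] for 4 frames and chunk_size=2.
    chunk_ids = np.arange(num_frames) // chunk_size
    # Query frame i (rows) may attend to key frame j (columns) when j's chunk
    # is not later than i's chunk.
    return chunk_ids[:, None] >= chunk_ids[None, :]


# chunk_size=1 reduces to a standard causal (streaming) mask.
streaming_mask = chunked_attention_mask(4, 1)
# chunk_size >= num_frames reduces to full bidirectional (offline) attention.
offline_mask = chunked_attention_mask(4, 4)
# Intermediate sizes sit between the two regimes.
mixed_mask = chunked_attention_mask(4, 2)
```

Under this assumed rule, a single parameter chosen at test time moves the same attention layer between the streaming and offline regimes, which is the behaviour the summary attributes to the model.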