cs.CV

Can one model estimate consistent geometry from any video length?

Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen

May 28, 2026

ViGeo recovers 3D geometry (depth, surface normals, point maps) from video with a single unified model. The trick: dynamic chunking attention lets the same model operate in both causal (streaming) and bidirectional (offline) contexts, with the mode chosen at test time. A separate data refinement step produces cleaned, temporally consistent targets, so the model is trained on those rather than on noisy raw annotations. Trained only on public data, it outperforms specialized methods across streaming, full-length, and long-video benchmarks.
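
To make the causal versus bidirectional distinction concrete, here is a minimal sketch of a chunk-wise attention mask over frames. It is an illustration under stated assumptions, not the paper's implementation: the function name `chunked_attention_mask`, the `chunk_size` parameter, and the rule that a frame may attend to every frame in its own chunk and in earlier chunks are all hypothetical choices made for this example.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Hypothetical rule (an assumption, not taken from the paper): frames are
    grouped into consecutive chunks of `chunk_size`; attention is
    bidirectional inside a chunk and causal across chunks.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Chunk index of every frame, e.g. [0, 0, 1, 1] for chunk_size=2.
    chunk_ids = np.arange(num_frames) // chunk_size
    # Row i (query) may see column j (key) if j's chunk is not in the future.
    return chunk_ids[None, :] <= chunk_ids[:, None]


if __name__ == "__main__":
    # chunk_size=1 gives a purely causal (streaming) mask.
    print(chunked_attention_mask(4, 1).astype(int))
    # chunk_size=num_frames gives a fully bidirectional (offline) mask.
    print(chunked_attention_mask(4, 4).astype(int))
    # Intermediate sizes mix the two regimes.
    print(chunked_attention_mask(4, 2).astype(int))
```

Under this assumed rule, changing only `chunk_size` at inference moves the same mask construction between the streaming and offline extremes, which is the kind of test-time switch the summary describes.
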
Published as "Towards Consistent Video Geometry Estimation" (arXiv:2605.30060)