Can one model estimate consistent geometry from any video length?

Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen

ViGeo recovers 3D geometry (depth, surface normals, and point maps) from video with a single unified model. Its key mechanism is dynamic chunking attention, which lets the same model operate in both causal (streaming) and bidirectional (offline) contexts and switch between them at test time. A separate data refinement step trains the model on cleaned, temporally consistent targets rather than noisy raw annotations. Trained only on public data, it outperforms specialized methods across streaming, full-length, and long-video benchmarks.
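
To make the causal versus bidirectional distinction concrete, here is a minimal sketch of a frame-level attention mask built over chunks. It is an illustration under assumptions, not ViGeo's actual implementation: the summary above does not describe how chunks are formed or masked. The function name `chunked_attention_mask` and its parameters `num_frames`, `chunk_size`, and `causal` are hypothetical.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk_size: int, causal: bool) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Hypothetical helper for illustration only; the chunking scheme is an assumption.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")

    if not causal:
        # Offline mode: every frame attends to every other frame.
        return np.ones((num_frames, num_frames), dtype=bool)

    # Assign each frame to a chunk: frames 0..chunk_size-1 -> chunk 0, and so on.
    chunk_ids = np.arange(num_frames) // chunk_size

    # Streaming mode: a frame attends to its own chunk (in both directions)
    # and to all earlier chunks, but never to later chunks.
    return chunk_ids[:, None] >= chunk_ids[None, :]


# Four frames in chunks of two, streaming mode.
mask = chunked_attention_mask(num_frames=4, chunk_size=2, causal=True)
print(mask.astype(int))
```

In this sketch, `chunk_size=1` with `causal=True` reduces to an ordinary lower-triangular causal mask, while `causal=False` gives full bidirectional attention. Choosing the chunk size and mode per call is one plausible way a single set of weights could serve both settings.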