Scaling reconstruction models with efficient architecture and massive data
Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht
May 14, 2026
VGGT-Ω improves upon prior feed-forward 3D reconstruction models and shows that reconstruction quality scales predictably with model and data size. Its key innovation is an efficient architecture that replaces global attention with register-based information aggregation, which reduces training memory to 30% of its predecessor's and enables training on 15× more labeled data and unlabeled video. The architecture is also simplified: a single dense prediction head is trained with multi-task supervision, and expensive high-resolution convolutions are removed. On the Sintel benchmark, the model achieves a 77% improvement in camera estimation accuracy over prior work. The learned representations also transfer to vision-language-action models, suggesting that reconstruction serves as a scalable proxy task for spatial understanding. Code and project details are available.
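To give a rough sense of why replacing global attention with register-based aggregation can save memory, the sketch below counts attention-score entries under two schemes. The summary does not describe the actual mechanism, so the register layout here is purely an assumption for illustration: each frame attends only to its own tokens plus a few per-frame register tokens, and the registers of all frames then attend to one another. The function names, the parameter `registers_per_frame`, and the example sizes are all hypothetical, and the count is not meant to reproduce the 30% memory figure reported above.

```python
def global_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Score-matrix entries when every token attends to every token in all frames."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def register_attention_pairs(
    num_frames: int, tokens_per_frame: int, registers_per_frame: int
) -> int:
    """Score-matrix entries under an ASSUMED register scheme (not the paper's).

    Step 1: within each frame, patch tokens and that frame's registers
            attend to one another (frame-local attention).
    Step 2: the registers of all frames attend to one another, which is
            the only place cross-frame information is exchanged.
    """
    per_frame = tokens_per_frame + registers_per_frame
    frame_local = num_frames * per_frame * per_frame
    all_registers = num_frames * registers_per_frame
    cross_frame = all_registers * all_registers
    return frame_local + cross_frame


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    frames, tokens, registers = 32, 1024, 4
    dense = global_attention_pairs(frames, tokens)
    sparse = register_attention_pairs(frames, tokens, registers)
    print(f"global attention entries:   {dense}")
    print(f"register-based entries:     {sparse}")
    print(f"ratio (register / global):  {sparse / dense:.4f}")
```

Under this assumed layout the global count grows quadratically with the number of frames, while the register count grows linearly in the frame-local term and quadratically only in the much smaller number of registers.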