Editing 3D scenes from text in a single forward pass

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

Most 3D scene editing methods edit individual 2D views separately and stitch them back into 3D, which causes blurry textures and geometry mismatches across viewpoints. VGGT-Edit instead works natively in 3D: it injects text instructions aligned to the scene's spatial poses, then uses a residual transformation head to directly predict geometric displacements while leaving the background intact. The authors also release DeltaScene, a large-scale dataset built with an automated 3D-agreement filtering pipeline to provide clean ground truth. VGGT-Edit substantially outperforms 2D-lifting baselines on multi-view consistency and detail sharpness. The work is aimed at researchers and practitioners building interactive 3D applications.
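The residual idea can be illustrated with a toy sketch: rather than regenerating the scene, an edit is expressed as a per-point displacement added to the existing geometry, so points with no displacement stay exactly where they were. The function name `apply_residual_edit`, the explicit `edit_mask`, and the hand-written displacements are assumptions made for illustration only; the summary does not say how VGGT-Edit parameterizes its displacements or how it separates the edited region from the background.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def apply_residual_edit(
    points: List[Point],
    displacements: List[Point],
    edit_mask: List[float],
) -> List[Point]:
    """Return points + mask * displacement, point by point.

    Hypothetical helper: the mask (1.0 = edited, 0.0 = background) is an
    assumption used here to make "background stays intact" explicit.
    """
    if not (len(points) == len(displacements) == len(edit_mask)):
        raise ValueError("points, displacements and edit_mask must have equal length")
    edited = []
    for (x, y, z), (dx, dy, dz), m in zip(points, displacements, edit_mask):
        # A zero mask value cancels the displacement, leaving the point unchanged.
        edited.append((x + m * dx, y + m * dy, z + m * dz))
    return edited


# Toy scene: three 3D points, of which only the first belongs to the edited object.
scene = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
delta = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]  # made-up displacements
mask = [1.0, 0.0, 0.0]

edited_scene = apply_residual_edit(scene, delta, mask)
```

In this sketch the first point shifts by 0.5 along x while the other two are returned unchanged, which is the property a residual formulation is meant to give: the edit is a change applied on top of the geometry, not a replacement of it.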