Turning a single satellite photo into a detailed street-level 3D scene

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

Reconstructing walkable 3D environments from overhead satellite imagery is hampered by the extreme viewpoint shift between satellite and street-level perspectives. Sat3DGen tackles this with a geometry-first approach: it adds novel geometric constraints and a perspective-view training strategy to a feed-forward image-to-3D framework, directly targeting the sparse and inconsistent supervision that causes prior methods to produce unstable geometry. Evaluated on a new benchmark that pairs VIGOR-OOD with high-resolution digital surface model (DSM) data, the method reduces RMSE from 6.76 m to 5.20 m and lowers FID from about 40 to 19 without dedicated image-quality modules. The code is publicly released, and the pipeline supports downstream tasks including semantic-map-to-3D synthesis, multi-camera video generation, and unsupervised DSM estimation.
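To make the reported geometry numbers concrete, the following is a minimal sketch of how an RMSE between a predicted height grid and a reference DSM could be computed, together with the relative reduction implied by the two RMSE figures above. It is an illustration only: the function name `dsm_rmse`, the optional validity mask, and the toy grids are assumptions, and the benchmark's actual evaluation protocol (alignment, masking, resolution) is not described in this summary.

```python
import math
from typing import Optional, Sequence


def dsm_rmse(
    predicted: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
    valid: Optional[Sequence[Sequence[bool]]] = None,
) -> float:
    """Root-mean-square height error (metres) between two aligned height grids.

    Hypothetical helper: assumes both grids are already co-registered and
    share the same shape. `valid` optionally marks cells that have a
    usable reference height.
    """
    squared_error_sum = 0.0
    count = 0
    for r, (pred_row, ref_row) in enumerate(zip(predicted, reference)):
        for c, (p, g) in enumerate(zip(pred_row, ref_row)):
            if valid is not None and not valid[r][c]:
                continue  # skip cells without a usable reference height
            squared_error_sum += (p - g) ** 2
            count += 1
    if count == 0:
        raise ValueError("no valid cells to evaluate")
    return math.sqrt(squared_error_sum / count)


# Toy 2x2 grids (made-up heights in metres, not benchmark data).
predicted = [[10.0, 12.0], [8.0, 20.0]]
reference = [[13.0, 12.0], [4.0, 99.0]]
valid = [[True, True], [True, False]]  # last cell is excluded
rmse = dsm_rmse(predicted, reference, valid)

# Relative RMSE reduction implied by the figures reported in the abstract.
relative_reduction = (6.76 - 5.20) / 6.76
```

With the reported values, `relative_reduction` comes out at roughly 0.23, i.e. the RMSE drops by about 23% relative to the 6.76 m baseline.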