cs.CV

How to reconstruct moving scenes from video without per-frame geometry?

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

May 29, 2026

Reconstructing dynamic scenes from single-camera video typically involves predicting 3D geometry separately for each frame, which produces duplicated structures and a weak model of motion. C4G instead learns a compact set of reusable Gaussian tokens conditioned on timestamp, so each token can aggregate temporal context and shift its position over time, yielding globally coherent motion. A diffusion-based rendering enhancer recovers fine details, and the same framework extends to 4D feature fields for point tracking, all without requiring camera poses or per-scene optimization.
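To make the contrast with per-frame prediction concrete, here is a minimal sketch of the general idea of one shared set of Gaussians whose centers move as a function of the timestamp. It is not the C4G architecture: the polynomial motion model, the function name `gaussian_means_at`, and the arrays `base_means` and `motion_coeffs` are all assumptions made for illustration, and the values are random rather than learned.

```python
import numpy as np

def gaussian_means_at(base_means, motion_coeffs, t):
    """Return the centers of a shared set of Gaussians at timestamp t.

    base_means:    (N, 3) center of each Gaussian at t = 0.
    motion_coeffs: (N, K, 3) per-Gaussian polynomial motion coefficients
                   (an assumed stand-in for a learned, time-conditioned model).
    t:             scalar timestamp, e.g. normalized to [0, 1].
    """
    num_terms = motion_coeffs.shape[1]
    powers = t ** np.arange(1, num_terms + 1)               # (K,): t, t^2, ...
    offsets = np.einsum("k,nkd->nd", powers, motion_coeffs)  # (N, 3)
    return base_means + offsets

# Hypothetical toy values; a real system would predict these from video.
rng = np.random.default_rng(0)
num_gaussians, num_terms = 4, 2
base_means = rng.normal(size=(num_gaussians, 3))
motion_coeffs = 0.1 * rng.normal(size=(num_gaussians, num_terms, 3))

# The same N Gaussians are queried at every timestamp, so the scene is
# represented by N primitives in total rather than N per frame.
timestamps = np.linspace(0.0, 1.0, 5)
trajectory = np.stack(
    [gaussian_means_at(base_means, motion_coeffs, t) for t in timestamps]
)  # shape: (num_timestamps, N, 3)
```

Because each Gaussian keeps its identity across time, its path through `trajectory` is a point track by construction, which is one way to see why a shared, time-conditioned representation can lend itself to tracking.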
Published as "Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction" (arXiv:2605.31595).