How to reconstruct moving scenes from video without per-frame geometry?

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

Reconstructing dynamic scenes from single-camera video typically requires predicting 3D geometry separately for each frame, which produces duplicated structures and a poor model of motion. C4G instead learns a compact set of reusable Gaussian tokens conditioned on the timestamp, so that each token aggregates temporal context and shifts position over time, yielding globally coherent motion. A diffusion-based rendering enhancer captures fine details, and the same framework extends to 4D feature fields for point tracking, all without requiring camera poses or per-scene optimization.
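
As a rough illustration of the idea of a fixed set of Gaussian tokens whose positions depend on the timestamp, the following toy sketch may help. It is not the C4G architecture: the class name `TimeConditionedGaussians`, the sinusoidal time embedding, the single linear offset layer, and all sizes are assumptions made for illustration, and the weights are random rather than learned.

```python
import numpy as np


def time_embedding(t, num_freqs=4):
    """Sinusoidal embedding of a scalar timestamp t (assumed to lie in [0, 1])."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])


class TimeConditionedGaussians:
    """Hypothetical toy model: one shared token set, decoded to 3D centers at time t."""

    def __init__(self, num_tokens=8, token_dim=16, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        self.num_freqs = num_freqs
        # One token per Gaussian, shared across all frames (random here, learned in practice).
        self.tokens = rng.normal(size=(num_tokens, token_dim))
        # Time-independent base center for each Gaussian.
        self.base_means = rng.normal(size=(num_tokens, 3))
        # Illustrative linear map from [token, time embedding] to a 3D offset.
        in_dim = token_dim + 2 * num_freqs
        self.w_offset = 0.1 * rng.normal(size=(in_dim, 3))

    def means_at(self, t):
        """Return the (num_tokens, 3) Gaussian centers at timestamp t."""
        emb = time_embedding(t, self.num_freqs)
        emb_tiled = np.broadcast_to(emb, (self.tokens.shape[0], emb.shape[0]))
        features = np.concatenate([self.tokens, emb_tiled], axis=1)
        # Bounded offset: each token shifts its own center as t changes.
        return self.base_means + np.tanh(features @ self.w_offset)


model = TimeConditionedGaussians()
means_t0 = model.means_at(0.0)
means_t1 = model.means_at(0.5)
print(means_t0.shape, means_t1.shape)
```

The point of the sketch is that the number of Gaussians stays fixed across timestamps; only their centers move, in contrast to predicting a separate set of geometry for every frame.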