Can one model generate all types of moving 3D shapes from video?

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

MORPHOS generates animated 3D objects from video in multiple output formats (meshes, point clouds, and neural fields) using T-SLAT, a unified temporal representation that jointly encodes shape and appearance. Prior methods are either locked to a single format or break when geometry changes. MORPHOS instead uses causal attention to build frames sequentially, which keeps the output temporally consistent while handling topological shifts. It outperforms existing approaches on appearance quality and generalizes across all three major 3D representations.
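
To make the causal-attention idea concrete, here is a minimal sketch of frame-level causal masking in plain Python. It is an illustration under stated assumptions, not the MORPHOS implementation: the summary does not describe the token layout, so the sketch assumes each frame contributes a fixed number of tokens, and the names `frame_causal_mask` and `masked_attention` are hypothetical. Each token may attend to tokens in its own frame and in earlier frames, never in later ones.

```python
import math


def frame_causal_mask(num_frames, tokens_per_frame):
    """Return mask[i][j], True when token i may attend to token j.

    Assumption: tokens are ordered frame by frame, with the same
    number of tokens in every frame.
    """
    frame_of = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    return [[fi >= fj for fj in frame_of] for fi in frame_of]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Only positions the mask allows take part in the softmax.
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(out_dim)
        ])
    return outputs


# Three frames of two tokens each, with made-up 2-D token features.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8], [0.9, 0.1]]
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
outputs = masked_attention(tokens, tokens, tokens, mask)
```

Because the mask blocks attention to later frames, editing the tokens of the last frame leaves the outputs for the first two frames unchanged. That property is what lets frames be produced one after another.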