Teaching robots to see by learning efficient 3D representations

Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

Current 3D perception systems for robot manipulation force a trade-off: implicit representations are expressive but lack explicit structure, while explicit representations preserve geometry but are limited by resolution. This work proposes structural latent points, a hybrid representation that learns compact geometric summaries by inserting a variational autoencoder into a point-cloud model, capturing coarse shape and semantic cues without encoding precise geometry. Evaluated on RLBench, ManiSkill2, and real robots, the method improves task success, sample efficiency, and robustness to viewpoint changes over existing baselines.
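
To make the idea of a variational bottleneck over point features concrete, the following is a minimal sketch of the two generic ingredients of any variational autoencoder: the reparameterized sample and the KL penalty toward a standard normal prior. It is not the authors' implementation. The function names (`reparameterize`, `kl_to_standard_normal`, `latent_points`) and the assumption that an upstream point-cloud encoder emits one `(mu, log_var)` pair per latent point are hypothetical and chosen only for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for a single latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def latent_points(point_stats, seed=0):
    """Turn per-point (mu, log_var) pairs into sampled latents and a mean KL term.

    point_stats is assumed (hypothetically) to come from a point-cloud encoder
    that predicts a diagonal Gaussian for each latent point.
    """
    rng = random.Random(seed)
    latents = [reparameterize(mu, lv, rng) for mu, lv in point_stats]
    kl = sum(kl_to_standard_normal(mu, lv) for mu, lv in point_stats)
    return latents, kl / len(point_stats)
```

In a sketch like this, the KL term is what pushes the latents toward a compact summary: it penalizes any information the encoder stores beyond what the prior already provides, which is one plausible reading of how such a bottleneck could keep coarse shape and semantic cues while discarding precise geometry.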