Teaching robots to see in 3D: combining language models with point clouds for precise manipulation

Most current robot control models pair 2D images with language understanding, so they miss the 3D spatial information that precise manipulation requires. PointACT fuses 3D point cloud data directly into the action decoding process, using a multi-scale attention mechanism that lets the robot capture fine geometric detail and overall scene structure at the same time. On the standard LIBERO and RLBench benchmarks, it improves task success by 10% over existing vision-language-action models, with even larger gains when the model is trained from scratch rather than fine-tuned from a pretrained backbone.
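To make the idea of multi-scale attention over a point cloud concrete, here is a minimal NumPy sketch. It is not PointACT's actual architecture, which this summary does not specify. It assumes that "scales" can be approximated by averaging point features inside voxels of different sizes, and that action query tokens read from each scale through plain single-head cross-attention with no learned projections. The function names (`voxel_pool`, `cross_attention`, `multi_scale_fuse`) and the voxel sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, tokens):
    """Single-head scaled dot-product cross-attention.

    Illustrative only: keys and values are the tokens themselves,
    with no learned projection matrices.
    """
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)  # (num_queries, num_tokens)
    weights = softmax(scores, axis=-1)
    return weights @ tokens                   # (num_queries, d)


def voxel_pool(points, feats, voxel_size):
    """Average the features of all points that fall in the same voxel.

    A small voxel_size keeps fine geometry; a large one summarizes
    the overall scene layout. (Assumed stand-in for a real encoder.)
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_vox = int(inverse.max()) + 1
    pooled = np.zeros((n_vox, feats.shape[1]))
    np.add.at(pooled, inverse, feats)         # sum features per voxel
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    return pooled / counts


def multi_scale_fuse(action_queries, points, feats, voxel_sizes=(0.05, 0.2)):
    """Add one cross-attention read-out per scale to the action queries."""
    fused = action_queries.copy()
    for vs in voxel_sizes:
        tokens = voxel_pool(points, feats, vs)
        fused = fused + cross_attention(action_queries, tokens)
    return fused


# Usage with random stand-in data: 256 points, 16-dim features, 4 action queries.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(256, 3))
feats = rng.normal(size=(256, 16))
queries = rng.normal(size=(4, 16))
fused = multi_scale_fuse(queries, points, feats)
print(fused.shape)  # (4, 16)
```

In this sketch the fine scale (small voxels) leaves most points as separate tokens, while the coarse scale (large voxels) merges neighbors into region summaries. A real model would likely use a learned point encoder and learned attention projections instead.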