cs.CV

Teaching robots to see 3D: combining language models with point clouds for precise manipulation

Shizhe Chen, Paul Pacaud, Cordelia Schmid

May 20, 2026

Current robot control models rely mostly on 2D images paired with language understanding, so they miss the 3D spatial information needed for precise manipulation. PointACT fuses 3D point cloud data directly into the action decoding process, using a multi-scale attention mechanism that lets the robot capture both fine geometric details and overall scene structure. On standard benchmarks (LIBERO and RLBench), it improves task success by 10% over existing vision-language-action models, with even larger gains when trained from scratch rather than fine-tuned from pretrained backbones.
Published as "PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction" (arXiv:2605.21414).
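
To make the idea of multi-scale point-action interaction more concrete, here is a minimal NumPy sketch of action tokens cross-attending to point features at several resolutions. It is an illustration under stated assumptions, not the paper's architecture: the function names (`cross_attention`, `pool_points`, `multi_scale_point_action`), the pooling factors, the residual sum over scales, and the random inputs are all hypothetical, and the averaging of consecutive features is only a stand-in for a real 3D downsampling step.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries (A, D), keys/values (N, D) -> (A, D).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def pool_points(features, factor):
    # Hypothetical coarsening step: average consecutive groups of `factor`
    # point features. A real system would group points by 3D proximity.
    n = (features.shape[0] // factor) * factor
    return features[:n].reshape(-1, factor, features.shape[1]).mean(axis=1)


def multi_scale_point_action(action_tokens, point_features, factors=(1, 4, 16)):
    # Assumed fusion rule: each action token attends to the point features
    # at every scale, and the results are added to the token residually.
    fused = action_tokens.copy()
    for factor in factors:
        scale_features = pool_points(point_features, factor)
        fused = fused + cross_attention(action_tokens, scale_features, scale_features)
    return fused


rng = np.random.default_rng(0)
action_tokens = rng.normal(size=(8, 32))      # 8 action queries, 32-dim
point_features = rng.normal(size=(1024, 32))  # 1024 per-point features, 32-dim
fused = multi_scale_point_action(action_tokens, point_features)
print(fused.shape)  # (8, 32)
```

In this sketch, the factor-1 scale exposes every point feature to the action tokens (fine geometry), while the larger factors expose progressively averaged features (coarser scene structure).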