Teaching robots to understand what objects are for and how to use them

Robots need to understand not just where to grasp an object but also how to manipulate it, a challenge called affordance understanding. This work builds AFUN, a foundation model that takes an RGB-D image and a task description and outputs both a functional mask (where to interact) and a 3D motion curve (how to interact). By combining heterogeneous data from robots, humans, simulations, and real scans into one schema, it generalizes across environments and objects without task-specific tuning, and it outperforms existing methods by large margins on affordance segmentation and contact prediction.
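
To make the idea of a single schema concrete, here is a minimal sketch of what one record might look like. It is not AFUN's actual data format: the class name AffordanceSample, every field name, the source labels, and the helper curve_length are assumptions introduced purely for illustration. The only parts taken from the summary above are the pairing of a task description with a functional mask and a 3D motion curve, and the four kinds of data source.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]
ALLOWED_SOURCES = {"robot", "human", "simulation", "scan"}  # assumed labels


@dataclass
class AffordanceSample:
    """One record in a hypothetical unified schema; all field names are assumptions."""
    source: str            # which kind of data the record came from
    task: str              # natural-language task description
    height: int            # RGB-D image height in pixels
    width: int             # RGB-D image width in pixels
    mask: List[List[int]]  # binary functional mask: 1 = where to interact
    curve: List[Point3D]   # 3D motion curve as ordered waypoints: how to interact

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if len(self.mask) != self.height or any(len(r) != self.width for r in self.mask):
            raise ValueError("mask does not match image size")
        if any(v not in (0, 1) for row in self.mask for v in row):
            raise ValueError("mask must be binary")
        if len(self.curve) < 2:
            raise ValueError("curve needs at least two waypoints")


def curve_length(curve: List[Point3D]) -> float:
    """Total length of the polyline through the waypoints."""
    return sum(math.dist(a, b) for a, b in zip(curve, curve[1:]))


# A toy record: a 2x3 mask and a three-waypoint curve (values are made up).
sample = AffordanceSample(
    source="human",
    task="open the drawer",
    height=2,
    width=3,
    mask=[[0, 1, 1], [0, 0, 0]],
    curve=[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)],
)
sample.validate()
print(curve_length(sample.curve))
```

The point of the sketch is that records from very different sources can share one shape, so a single model can train on all of them. How the real schema encodes masks and curves, and in what units, is not stated in the summary.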