Teaching robots to learn from human preferences, not commands

Robot learning via reinforcement learning struggles with unsafe, inefficient exploration in real-world deployment. OHP-RL reframes human interventions during training not as behavioral demonstrations to copy, but as preference signals indicating when autonomy should be guided. The method uses a state-dependent preference gate that learns when and how much human feedback should influence the policy, allowing robots to benefit from intermittent, imperfect guidance while maintaining stable learning. Evaluated on three contact-rich manipulation tasks with a Franka robot, OHP-RL achieved higher success rates and faster convergence than prior methods while requiring substantially fewer human interventions.
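
To make the idea of a state-dependent preference gate concrete, here is a minimal Python sketch. It is not OHP-RL's actual formulation, which this summary does not give. Everything in it is an assumption made for illustration: the function names (preference_gate, preference_loss, gated_loss), the linear-sigmoid form of the gate, the Bradley-Terry style preference term, and the way the two losses are added together.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_gate(state, weights, bias):
    """Hypothetical gate: maps state features to a weight in [0, 1].

    Assumed form (not from the source): a sigmoid over a linear function of
    the state, so the influence of human feedback can vary from state to state.
    """
    return sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)


def preference_loss(score_human: float, score_policy: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(score_human - score_policy)).

    Small when the human's action is scored above the policy's own action.
    Written as a softplus so extreme score gaps do not overflow.
    """
    z = score_policy - score_human
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def gated_loss(rl_loss, state, score_human, score_policy, weights, bias, intervened):
    """Assumed combination: the RL loss plus a gated preference term.

    The preference term applies only on steps where a human intervened.
    """
    if not intervened:
        return rl_loss
    gate = preference_gate(state, weights, bias)
    return rl_loss + gate * preference_loss(score_human, score_policy)


# Usage with made-up numbers: one step with an intervention, one without.
state = [0.2, -1.0]
weights, bias = [0.5, 0.1], 0.0
with_human = gated_loss(1.0, state, 2.0, 0.5, weights, bias, intervened=True)
without_human = gated_loss(1.0, state, 2.0, 0.5, weights, bias, intervened=False)
```

In this sketch a gate value near 0 leaves the ordinary RL objective untouched, while a value near 1 lets the human's preferred action pull strongly on the policy. A learned gate of this kind is one way imperfect or intermittent interventions could be down-weighted in states where they are unhelpful.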