Can RL learn from preference comparisons instead of reward numbers?
Nikola Pavlovic, Sattar Vakili, Qing Zhao
May 22, 2026
Learning from human preferences (which policy is better) rather than numeric rewards is common in practice but theoretically understudied. This work provides the first rigorous analysis of preference-only reinforcement learning in episodic kernel MDPs. The method builds confidence sets around value estimates from binary preference labels and achieves sublinear regret: cumulative regret grows more slowly than the number of episodes, so the average per-episode gap to optimal performance provably shrinks to zero as episodes accumulate.
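To make the idea of a confidence set built from binary preference labels concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: it assumes a logistic (Bradley-Terry style) link between a hidden value gap and the preference probability, and it uses a plain Hoeffding interval in place of the kernel-based confidence sets the summary mentions. The function names `sample_preference` and `preference_interval` and the values 1.0 and 0.2 are hypothetical, chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic link; assumed here, not taken from the paper."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_preference(value_a, value_b, rng):
    """Return 1 if option A is preferred to option B, else 0.

    The learner never sees value_a or value_b, only this binary label.
    """
    return 1 if rng.random() < sigmoid(value_a - value_b) else 0


def preference_interval(labels, delta=0.05):
    """Hoeffding interval for P(A preferred to B), valid with prob. >= 1 - delta."""
    n = len(labels)
    mean = sum(labels) / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


rng = random.Random(0)
# Hypothetical hidden values for two options; only comparisons are observed.
for n in (10, 100, 1000, 10000):
    labels = [sample_preference(1.0, 0.2, rng) for _ in range(n)]
    low, high = preference_interval(labels)
    print(f"n={n:5d}  interval=({low:.3f}, {high:.3f})")
```

The interval half-width shrinks at rate 1/sqrt(n), which gives a rough intuition for why accumulating comparisons lets a learner rule out worse options with increasing confidence, even though no numeric reward is ever observed.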