Can RL learn from preference comparisons instead of reward numbers?

Learning from human preferences (comparisons of which policy is better) rather than numeric rewards is common in practice but theoretically understudied. This work provides the first rigorous analysis of preference-only reinforcement learning in episodic kernel MDPs. The method builds confidence sets around value estimates from binary preference labels and achieves sublinear regret: cumulative regret grows more slowly than the number of episodes, so the average per-episode gap to optimal performance shrinks toward zero as episodes accumulate.
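
To make the idea of a confidence set built from binary preference labels concrete, here is a minimal Python sketch. It is not the method analyzed in this work. It assumes a Bradley-Terry (logistic) preference model and uses a plain Hoeffding interval on a single preference probability, and every function name in it is hypothetical.

```python
import math
import random
from typing import List, Tuple


def preference_prob(value_a: float, value_b: float) -> float:
    """Assumed Bradley-Terry (logistic) link: probability that A is preferred to B."""
    return 1.0 / (1.0 + math.exp(-(value_a - value_b)))


def sample_labels(p: float, n: int, rng: random.Random) -> List[int]:
    """Draw n binary preference labels; 1 means 'A preferred', 0 means 'B preferred'."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def hoeffding_interval(labels: List[int], delta: float = 0.05) -> Tuple[float, float]:
    """Two-sided Hoeffding interval for the preference probability.

    With probability at least 1 - delta, the true probability lies within
    sqrt(log(2 / delta) / (2 n)) of the empirical mean of the labels.
    """
    n = len(labels)
    p_hat = sum(labels) / n
    width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # Clip to [0, 1], since the target is a probability.
    return max(0.0, p_hat - width), min(1.0, p_hat + width)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative values only: policy A is assumed to be worth 0.5 more than policy B.
    p_true = preference_prob(1.0, 0.5)
    for n in (10, 100, 1000, 10000):
        low, high = hoeffding_interval(sample_labels(p_true, n, rng))
        print(f"n={n:>5}  interval=[{low:.3f}, {high:.3f}]")
```

The interval half-width shrinks at rate 1/sqrt(n), so more comparisons pin down the preference probability more tightly. Tightening of this kind is what allows a confidence-set algorithm to stop exploring clearly worse options, although the confidence sets used in the kernel MDP setting are more involved than this single-parameter toy.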