Beyond scalar rewards: multi-dimensional preference learning for language models
Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi
May 18, 2026
Current LLM alignment is split between online RL, which requires programmable rewards such as those available for math and code, and preference optimization, which handles open-ended tasks but lacks continuous exploration. This work replaces the scalar reward model with a General Preference Model that represents quality along k independent dimensions using skew-symmetric subspaces. General Preference Reinforcement Learning (GPRL) computes an advantage for each dimension and scales each one independently, which prevents any single axis from dominating the policy update. A built-in drift monitor detects and corrects single-axis exploitation during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench over extended training runs.
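To make the two central ideas concrete, here is a minimal sketch, not the authors' implementation. It assumes each response is embedded as a vector of length 2k, that each of the k dimensions is a 2-D subspace scored with the skew-symmetric block [[0, -1], [1, 0]], and that "independent scaling" means standardizing each dimension by its own mean and standard deviation within a sampled group. The abstract does not specify any of these details, and the names `skew_scores`, `per_dimension_advantages`, and `combine` are hypothetical.

```python
import math


def skew_scores(u, v):
    """Per-dimension preference of response u over response v.

    Assumption: embeddings have length 2k, and dimension d is scored as
    u_d^T R v_d with the skew-symmetric block R = [[0, -1], [1, 0]],
    which gives score(u, v) = -score(v, u) on every dimension.
    """
    assert len(u) == len(v) and len(u) % 2 == 0
    return [u[i + 1] * v[i] - u[i] * v[i + 1] for i in range(0, len(u), 2)]


def per_dimension_advantages(scores, eps=1e-8):
    """Standardize each preference dimension separately across a group.

    scores[i][d] is the score of sample i on dimension d. Each column is
    centered and divided by its own standard deviation (an assumed form
    of independent scaling), so a large-magnitude axis cannot swamp the
    others.
    """
    n, k = len(scores), len(scores[0])
    adv = [[0.0] * k for _ in range(n)]
    for d in range(k):
        col = [row[d] for row in scores]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i in range(n):
            adv[i][d] = (col[i] - mean) / (std + eps)
    return adv


def combine(adv):
    """Average the scaled per-dimension advantages into one value per sample."""
    return [sum(row) / len(row) for row in adv]


# Toy group of three samples: dimension 0 has a much larger numeric range
# than dimension 1, and the two dimensions rank the samples in opposite order.
group_scores = [[10.0, 0.1], [0.0, 0.3], [5.0, 0.2]]
naive = [sum(row) for row in group_scores]  # raw sum follows dimension 0
balanced = combine(per_dimension_advantages(group_scores))
```

In this toy group the raw sum ranks the first sample highest purely because dimension 0 has the larger scale, whereas the per-dimension version gives the two opposing axes equal weight, so they cancel. How GPRL actually aggregates the dimensions, and how its drift monitor intervenes, is described in the paper rather than here.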