Beyond scalar rewards: multi-dimensional preference learning for language models

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

Current LLM alignment splits between online RL (which requires programmable rewards, as in math and code) and preference optimization (which handles open-ended tasks but lacks continuous exploration). This work replaces scalar reward models with a General Preference Model that represents quality across k independent dimensions using skew-symmetric subspaces. General Preference Reinforcement Learning (GPRL) computes per-dimension advantages and scales each dimension independently, preventing any single axis from dominating the policy update. A built-in drift monitor detects and corrects single-axis exploitation during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench over extended training runs.
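
As a rough illustration of the two ideas named above, the following NumPy sketch scores responses in k two-dimensional skew-symmetric subspaces and then normalizes advantages one dimension at a time. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the 2x2 rotation block, the averaging over a sampled group of responses, the mean/standard-deviation scaling, and the function names `subspace_scores` and `per_dimension_advantages` are all hypothetical choices made for illustration.

```python
import numpy as np


def subspace_scores(v_a, v_b):
    """Per-dimension preference scores of response a over response b.

    ASSUMPTION: each embedding has length 2k and is read as k blocks of
    size 2. Block d is scored as a_d^T R b_d with R = [[0, -1], [1, 0]],
    so every score is skew-symmetric: score(a, b) == -score(b, a).
    """
    a = np.asarray(v_a, dtype=float).reshape(-1, 2)
    b = np.asarray(v_b, dtype=float).reshape(-1, 2)
    return a[:, 1] * b[:, 0] - a[:, 0] * b[:, 1]


def per_dimension_advantages(embeddings, eps=1e-8):
    """Group-relative advantages, scaled independently per dimension.

    ASSUMPTION: `embeddings` has shape (n, 2k), one row per response
    sampled for the same prompt. The reward of response i in dimension d
    is its mean score against the other n - 1 responses. Each dimension
    is then centered and divided by its own standard deviation, so a
    dimension with large raw scores cannot dominate the combined signal.
    """
    v = np.asarray(embeddings, dtype=float)
    n = v.shape[0]
    blocks = v.reshape(n, -1, 2)  # shape (n, k, 2)

    # scores[i, j, d]: score of response i over response j in dimension d.
    scores = (
        blocks[:, None, :, 1] * blocks[None, :, :, 0]
        - blocks[:, None, :, 0] * blocks[None, :, :, 1]
    )

    # Self-comparisons score exactly zero, so summing over j and dividing
    # by n - 1 gives the mean over the other responses.
    rewards = scores.sum(axis=1) / (n - 1)  # shape (n, k)

    centered = rewards - rewards.mean(axis=0, keepdims=True)
    scaled = centered / (rewards.std(axis=0, keepdims=True) + eps)

    # ASSUMPTION: the per-dimension advantages are combined by a plain mean.
    return scaled, scaled.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group = rng.normal(size=(4, 6))  # 4 responses, k = 3 dimensions
    per_dim, combined = per_dimension_advantages(group)
    print(per_dim.shape, combined.shape)
```

Under these assumptions, inflating the raw scores of one dimension (for example by multiplying its embedding block by a constant) leaves the scaled advantages unchanged, which is one simple way to read "preventing any single axis from dominating the policy update".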