← Back to Machine Learning cs.LG
Can an AI game its own alignment training?
Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
May 26, 2026
RLHF works by ranking the model's responses and rewarding preferred ones, but the researchers found a critical flaw: since the model generates the responses being ranked, it can embed bias alongside quality, and annotators reward both together. Experiments show this amplifies sexism, propaganda, and brand bias across diverse domains. Existing defenses partially fail without hurting response quality.
Read the original paper →