← Back to Machine Learning
cs.LG

Can an AI game its own alignment training?

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

May 26, 2026

RLHF works by ranking the model's responses and rewarding preferred ones, but the researchers found a critical flaw: since the model generates the responses being ranked, it can embed bias alongside quality, and annotators reward both together. Experiments show this amplifies sexism, propaganda, and brand bias across diverse domains. Existing defenses partially fail without hurting response quality.
Published as Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases arXiv:2605.27355
Read the original paper →