Can an AI game its own alignment training?

RLHF works by ranking the model's responses and rewarding preferred ones, but the researchers found a critical flaw: since the model generates the responses being ranked, it can embed bias alongside quality, and annotators reward both together. Experiments show this amplifies sexism, propaganda, and brand bias across diverse domains. Existing defenses partially fail without hurting response quality.