Making AI safety rules stick across sneaky rewording

Large language models can be tricked into unsafe behavior through adversarial rephrasing: a model might refuse a request when it is asked directly but comply when the same request is wrapped in flowery language. The core problem is that standard safety training treats all prompt variants equally, even though some yield verifiable feedback (such as multiple-choice questions) while others rely on noisy reward signals. Anchor Invariance Regularization (AIR) treats the reliable prompts as reference points, or anchors, and uses stop-gradient targets to guide open-ended variants toward consistent behavior without degrading performance on the trustworthy signals. Combined with preference optimization, AIR boosts in-distribution accuracy by 12.71% and out-of-distribution consistency by 33.49% across safety, reasoning, and math tasks.
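
To make the stop-gradient idea concrete, here is a minimal sketch of a one-directional consistency term. It is not the AIR implementation: the summary above does not give the exact loss, so the KL form, the toy two-way "refuse vs. comply" logits, and the names `air_consistency_loss` and `align_variant` are all assumptions made for illustration. The point it shows is only that the anchor's distribution acts as a fixed target, so the gradient moves the open-ended variant and never the anchor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def air_consistency_loss(anchor_logits, variant_logits):
    """Hypothetical consistency term: KL(anchor || variant).

    The anchor distribution is treated as a constant (stop-gradient),
    so the gradient with respect to the anchor logits is zero by
    construction. Returns (loss, grad_variant, grad_anchor).
    """
    p = softmax(anchor_logits)   # fixed target from the reliable prompt
    q = softmax(variant_logits)  # prediction on the open-ended variant
    loss = sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)
    # d KL(p || softmax(z)) / dz_j = q_j - p_j when p is held constant.
    grad_variant = [qi - pi for pi, qi in zip(p, q)]
    grad_anchor = [0.0] * len(anchor_logits)  # stop-gradient: no update
    return loss, grad_variant, grad_anchor


def align_variant(anchor_logits, variant_logits, lr=0.5, steps=200):
    """Plain gradient descent on the variant logits only."""
    variant = list(variant_logits)
    for _ in range(steps):
        _, grad_variant, _ = air_consistency_loss(anchor_logits, variant)
        variant = [v - lr * g for v, g in zip(variant, grad_variant)]
    return variant


if __name__ == "__main__":
    anchor = [2.0, 0.0]    # assumed logits: [refuse, comply] on the anchor prompt
    variant = [-1.0, 1.0]  # assumed logits on a reworded, open-ended prompt
    before, _, _ = air_consistency_loss(anchor, variant)
    after, _, _ = air_consistency_loss(anchor, align_variant(anchor, variant))
    print(f"consistency loss before: {before:.4f}, after: {after:.6f}")
```

In a real training setup this term would be added to the preference-optimization objective with some weight, and the stop-gradient would be expressed with the framework's own primitive (for example a detach or stop-gradient call) rather than by returning an explicit zero gradient as done here.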