← Back to Artificial Intelligence cs.AI
Making AI safety rules stick across sneaky rewording
Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang
May 20, 2026
Large language models can be tricked into unsafe behavior through adversarial rephrasing—a model might refuse a request directly but comply when it's wrapped in flowery language. The core problem: standard safety training treats all prompt variants equally, but some yield verifiable feedback (like multiple-choice) while others rely on noisy reward signals. Anchor Invariance Regularization (AIR) treats reliable prompts as reference points and uses stop-gradient targets to guide open-ended variants toward consistent behavior without degrading performance on trustworthy signals. Combined with preference optimization, AIR boosts in-distribution accuracy by 12.71% and out-of-distribution consistency by 33.49% across safety, reasoning, and math tasks.
Read the original paper →