cs.AI

Making AI safety rules stick across sneaky rewording

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

May 20, 2026

Large language models can be tricked into unsafe behavior through adversarial rephrasing: a model might refuse a request when asked directly but comply when the same request is wrapped in flowery language. The core problem is that standard safety training treats all prompt variants equally, even though some yield verifiable feedback (such as multiple-choice questions) while others rely on noisy reward signals. Anchor Invariance Regularization (AIR) treats the reliable prompts as reference points (anchors) and uses stop-gradient targets to guide open-ended variants toward consistent behavior, without degrading performance on the trustworthy signals. Combined with preference optimization, AIR boosts in-distribution accuracy by 12.71% and out-of-distribution consistency by 33.49% across safety, reasoning, and math tasks.
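
To make the stop-gradient idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not state the exact loss, so the KL form, the function names (`log_softmax`, `air_penalty`), and the two-way "refuse vs. comply" logits are all assumptions chosen for illustration. The point it shows is the asymmetry: the anchor's distribution is treated as a fixed target, so only the open-ended variant receives a gradient.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def air_penalty(anchor_logits, variant_logits):
    """Hypothetical invariance penalty: KL(anchor || variant).

    The anchor is treated as a constant (stop-gradient) target, so the
    function returns a zero gradient for it and the analytic gradient
    softmax(variant) - softmax(anchor) for the variant logits.
    """
    log_p = log_softmax(anchor_logits)    # target, no gradient flows here
    log_q = log_softmax(variant_logits)   # prediction being regularized
    p = [math.exp(v) for v in log_p]
    q = [math.exp(v) for v in log_q]
    loss = sum(pi * (lp - lq) for pi, lp, lq in zip(p, log_p, log_q))
    grad_variant = [qi - pi for pi, qi in zip(p, q)]
    grad_anchor = [0.0] * len(anchor_logits)  # stop-gradient: anchor untouched
    return loss, grad_variant, grad_anchor

# Assumed toy setup: index 0 = refuse, index 1 = comply.
anchor = [2.0, 0.0]    # verifiable prompt: model mostly refuses
variant = [0.0, 1.0]   # reworded prompt: model leans toward complying

loss_before, g_variant, g_anchor = air_penalty(anchor, variant)

# One plain gradient step on the variant logits only (learning rate is arbitrary).
lr = 0.5
variant_updated = [v - lr * g for v, g in zip(variant, g_variant)]
loss_after, _, _ = air_penalty(anchor, variant_updated)
print(loss_before, loss_after)
```

In this sketch the update moves the variant's behavior toward the anchor's while leaving the anchor's logits alone, which mirrors the stated goal of not degrading performance on the trustworthy signals. In an autodiff framework the same effect would typically come from detaching the anchor's output before computing the penalty.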
Published as "Towards Context-Invariant Safety Alignment for Large Language Models", arXiv:2605.20994