Why aligned AI still needs an off switch
Yige Li, Yunhao Feng, Jun Sun
May 26, 2026
Alignment training makes AI systems *want* to behave safely, but that does not guarantee they will stop or change course when a human tells them to, especially when instructions conflict or the system has tool access. The researchers introduce ControlBench, a benchmark that exposes controllability failures in agent tasks, and show that current safeguards often fail to provide persistent runtime control. They propose an architectural framework with explicit control planes and intervention pathways as a complement to alignment, not a replacement for it.