cs.AI

Why aligned AI still needs an off switch

Yige Li, Yunhao Feng, Jun Sun

May 26, 2026

Alignment training makes AI systems *want* to behave safely, but that does not guarantee they will actually stop or change course when a human tells them to, especially under conflicting instructions or when they have tool access. The authors introduce ControlBench, a benchmark that exposes controllability failures in agentic tasks, and show that current safeguards often fail to provide persistent runtime control. As a complement to alignment, they propose an architectural framework with explicit control planes and intervention pathways.
Published as "Position: AI Safety Requires Effective Controllability" (arXiv:2605.27117).