Why aligned AI still needs an off switch

Alignment training makes AI systems *want* to behave safely, but that doesn't guarantee they'll actually stop or change course when a human says so, especially when instructions conflict or the system has access to tools. Researchers introduce ControlBench, a benchmark that exposes controllability failures in agent tasks, and show that current safeguards often fail to provide persistent runtime control. They propose an architectural framework with explicit control planes and intervention pathways as a complement to alignment.
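
To make the idea of a control plane concrete, here is a minimal Python sketch. It is not the researchers' framework: `ControlPlane`, `Signal`, and `run_agent` are hypothetical names invented for illustration. It assumes only that the operator's intervention channel sits outside the agent's own decision-making, is checked before every step, and that a stop, once issued, persists.

```python
from enum import Enum
from typing import Callable, List, Optional


class Signal(Enum):
    RUN = "run"
    PAUSE = "pause"
    STOP = "stop"


class ControlPlane:
    """Hypothetical operator-owned channel (an assumption, not the paper's API).

    In a real system the agent would have no write access to this object;
    here that separation is only a convention.
    """

    def __init__(self) -> None:
        self._signal = Signal.RUN

    @property
    def signal(self) -> Signal:
        return self._signal

    def intervene(self, signal: Signal) -> None:
        # STOP is sticky: once issued, later signals cannot undo it.
        if self._signal is not Signal.STOP:
            self._signal = signal


def run_agent(
    steps: List[Callable[[], str]],
    control: ControlPlane,
    on_step: Optional[Callable[[int], None]] = None,
) -> List[str]:
    """Run steps in order, consulting the control plane before each one."""
    completed: List[str] = []
    for step in steps:
        # The check lives in the loop, not in the model's policy, so it
        # applies regardless of what the agent "wants" to do next.
        if control.signal is not Signal.RUN:
            break
        completed.append(step())
        if on_step is not None:
            on_step(len(completed))
    return completed


# Usage: the operator halts the run after the second step.
control = ControlPlane()
steps = [lambda: "search", lambda: "draft", lambda: "send"]


def operator(n_done: int) -> None:
    if n_done == 2:
        control.intervene(Signal.STOP)


result = run_agent(steps, control, on_step=operator)
control.intervene(Signal.RUN)  # ignored: the stop persists
```

In this run the third step never executes, and the later attempt to resume is ignored. That is the sense of "persistent runtime control" the sketch is meant to illustrate: the off switch works because of where the check sits in the architecture, not because the agent agrees with it.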