Why one safety layer isn't enough for AI agents

S. Bensalem, Y. Dong, M. Franzle, X. Huang, J. Kroger, D. Nickovic, A. Nouri, R. Roy, C. Wu

Current safety approaches for language model agents try to enforce every guarantee at a single point, which the authors argue is structurally impossible. Different safety concerns (whether the agent's goals are legal, whether the world matches expectations, whether actions are physically feasible) become knowable only at different moments in execution. The authors propose a three-layer, contract-based architecture in which each layer independently certifies one concern and passes probabilistic guarantees to the next, which yields compositional safety bounds. Three hard problems remain: learning these bounds from messy real-world data, handling systems that drift away from the conditions they were trained on, and extending the approach to multi-agent scenarios.
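
To make the idea of composing per-layer probabilistic guarantees concrete, here is a minimal sketch. It is not the authors' derivation: the summary does not say how their bounds are combined, so this uses two standard probability facts instead (the chain rule and the union bound). The `LayerContract` class, the function names, and the epsilon values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class LayerContract:
    """Hypothetical contract: the layer's guarantee fails with probability
    at most `epsilon`, given that all earlier layers' guarantees held."""
    name: str
    epsilon: float


def chain_bound(contracts):
    """Lower bound on P(all guarantees hold) via the chain rule.

    Assumes each epsilon is conditional on the earlier layers holding.
    """
    return prod(1.0 - c.epsilon for c in contracts)


def union_bound(contracts):
    """Looser lower bound: 1 minus the sum of epsilons, floored at 0.

    Also valid when each epsilon bounds a marginal failure probability.
    """
    return max(0.0, 1.0 - sum(c.epsilon for c in contracts))


# Illustrative numbers only; they are not taken from the paper.
layers = [
    LayerContract("goal legality", 0.01),
    LayerContract("world matches expectations", 0.02),
    LayerContract("action feasibility", 0.05),
]

print(f"chain-rule bound: {chain_bound(layers):.5f}")
print(f"union bound:      {union_bound(layers):.5f}")
```

Under these assumptions the chain-rule bound is never weaker than the union bound, and both degrade as layers are added or as any single epsilon grows. That is one way to see why a badly estimated per-layer bound would weaken the end-to-end guarantee.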