Can AI agents be tricked into harmful acts gradually?
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi
May 21, 2026
AI safety testing has focused on what models say: toxic outputs, bias, harmful completions. But when AI agents control tools in real environments, the risk lies in what they do. This benchmark presents "Boiling the Frog" scenarios, in which benign workspace edits gradually escalate toward a harmful request, to test whether models notice and resist. Across nine models, 44% of attacks succeeded overall, and Gemini Flash reached 93% susceptibility in loss-of-control scenarios. The benchmark is released with an EU AI Act risk taxonomy.