Why AI agents fail: A systematic diagnosis framework
Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev
May 14, 2026
Current AI agent evaluations report success or failure without explaining root causes or localizing errors within complex multi-step processes. This work introduces a framework that combines top-down, agent-level diagnosis with bottom-up, span-level analysis: long task traces are broken into independently assessed spans to pinpoint specific failure points. On the TRAIL, GAIA, and SWE-Bench benchmarks, the approach achieves up to a 3.5× improvement in localization accuracy and 12.5× in joint localization-categorization over prior methods. Crucially, the same frontier model is several times more accurate when operating within this framework than as a single monolithic judge, indicating that evaluation methodology, rather than model capability, is the limiting factor in understanding agent behavior.
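To make the two-level idea concrete, the following is a minimal Python sketch of span-level assessment followed by an agent-level summary. It is not the authors' implementation: the names `Span`, `Finding`, `diagnose`, `toy_span_judge`, and `toy_agent_judge` are hypothetical, the rule-based judges stand in for what would presumably be LLM calls, and the way the two levels are combined here (agent-level judge reads the span-level findings) is an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Span:
    """One step of an agent trace (hypothetical structure)."""
    index: int
    kind: str      # e.g. "tool_call" or "llm_output"
    content: str


@dataclass
class Finding:
    """A localized failure: which span, and what category of error."""
    span_index: int
    category: str


# Judges are plain callables so a model-backed judge could be swapped in.
SpanJudge = Callable[[Span], Optional[str]]
AgentJudge = Callable[[List[Span], List[Finding]], str]


def diagnose(
    trace: List[Span], span_judge: SpanJudge, agent_judge: AgentJudge
) -> Tuple[List[Finding], str]:
    """Bottom-up: judge each span in isolation. Top-down: summarize the run."""
    findings: List[Finding] = []
    for span in trace:
        category = span_judge(span)  # sees only this span, not the whole trace
        if category is not None:
            findings.append(Finding(span.index, category))
    summary = agent_judge(trace, findings)
    return findings, summary


def toy_span_judge(span: Span) -> Optional[str]:
    """Rule-based stand-in for a span-level judge (assumption, not the paper's)."""
    if "Traceback" in span.content:
        return "tool_error"
    if span.kind == "llm_output" and "I assume" in span.content:
        return "unsupported_assumption"
    return None


def toy_agent_judge(trace: List[Span], findings: List[Finding]) -> str:
    """Stand-in for an agent-level judge: report the earliest localized failure."""
    if not findings:
        return "no failure localized"
    first = min(findings, key=lambda f: f.span_index)
    return f"earliest failure at span {first.span_index}: {first.category}"


trace = [
    Span(0, "llm_output", "Plan: search for the file, then patch it."),
    Span(1, "tool_call", "Traceback (most recent call last): FileNotFoundError"),
    Span(2, "llm_output", "I assume the patch was applied."),
]
findings, summary = diagnose(trace, toy_span_judge, toy_agent_judge)
```

The point of the design is that each span is judged without the rest of the trace in view, so a long trace becomes many small assessments rather than one monolithic judgment; the agent-level step then reasons over a short list of findings instead of the raw trace.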