Why AI agents fail: A systematic diagnosis framework

Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

Current evaluation of AI agents reports success or failure without explaining root causes or localizing errors within complex multi-step processes. This work introduces a framework that combines top-down, agent-level diagnosis with bottom-up, span-level analysis, breaking long task traces into independent assessments to identify specific failure points. On the TRAIL, GAIA, and SWE-Bench benchmarks, the approach achieves improvements of up to 3.5× in localization accuracy and 12.5× in joint localization-categorization accuracy over prior methods. Crucially, the same frontier model is several times more accurate when operating within this framework than as a single monolithic judge, demonstrating that evaluation methodology, rather than model capability, is the limiting factor for understanding agent behavior.
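To make the span-level idea concrete, the following is a minimal sketch of judging each span of a trace independently and then rolling the findings up into an agent-level summary. It is not the authors' implementation: the `Span` and `Finding` types, the category names, and the keyword-based `toy_span_judge` (a stand-in for a model call) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    """One step of an agent trace (hypothetical structure)."""
    span_id: str
    text: str


@dataclass
class Finding:
    """A localized failure: which span failed and how it is categorized."""
    span_id: str
    category: str


def toy_span_judge(span: Span) -> Optional[str]:
    """Hypothetical stand-in for a model-based judge.

    Returns an illustrative error category, or None if the span looks fine.
    A real judge would call a language model; keyword rules are an assumption.
    """
    lowered = span.text.lower()
    if "traceback" in lowered:
        return "tool_error"
    if "i cannot find" in lowered:
        return "retrieval_failure"
    return None


def diagnose_trace(
    spans: list[Span], judge: Callable[[Span], Optional[str]]
) -> list[Finding]:
    """Bottom-up pass: assess every span on its own, with no shared context."""
    findings = []
    for span in spans:
        category = judge(span)
        if category is not None:
            findings.append(Finding(span.span_id, category))
    return findings


def summarize(findings: list[Finding]) -> str:
    """Top-down roll-up: report the earliest localized failure, if any."""
    if not findings:
        return "no failure localized"
    first = findings[0]
    return f"first failure at {first.span_id} ({first.category})"


trace = [
    Span("s1", "Plan: search the docs"),
    Span("s2", "Traceback (most recent call last): KeyError"),
    Span("s3", "Final answer: 42"),
]
findings = diagnose_trace(trace, toy_span_judge)
print(summarize(findings))
```

Because each span is judged in isolation, the judge never has to hold the whole trace in context at once; the trade-off in this sketch is that failures visible only across several spans would need the separate agent-level pass to catch.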