How to catch agent failures before they work reliably?

Early agentic systems fail due to structural integration problems that mask task-level errors, making traditional error detection useless. Researchers developed a three-dimensional monitoring approach (quality/suitability/efficiency across within-run/cross-run/structural scopes) using variance patterns to triage failures by severity. On 220 synthetic runs, within-run monitors caught deterministic stage defects, cross-run monitors exposed stochastic integration issues, and structural monitors detected gaps with perfect consistency—while injected task errors remained invisible. The method routes 97% of findings to automated systems, leaving only 2% for human review.