Debugging multi-agent AI workflows through intelligent bottleneck detection

Multi-agent workflows, in which several specialized AI agents each handle part of a task, often outperform single-prompt approaches but are hard to debug because a faulty intermediate output can fail silently. PROTEA addresses this with a test-driven interface that executes workflows, scores the output of each step against configurable rubrics, and visualizes bottlenecks on a workflow graph. Its key innovation is backward node evaluation: it infers what each agent's output should have been from the final answer and the workflow structure, then flags discrepancies with what the agent actually produced. Developers can edit prompts and immediately see how the changes propagate through the system. On two production workflows, the reported gains were a 19.6-percentage-point increase in accuracy and an improvement in Hit@5 from 0.30 to 0.38. A user study with six experienced developers confirmed the tool's value for localizing failures and for iterative refinement.
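
To make the idea of backward node evaluation more concrete, the following is a minimal sketch, not PROTEA's implementation. It assumes a workflow run is available as a list of per-node outputs, and that some procedure can infer the expected output of a node from the final answer. That procedure is represented by a hypothetical `infer_expected` callable (here stubbed with a lookup table). The names `NodeRun`, `jaccard`, `find_bottlenecks`, and the threshold value are illustrative assumptions, and the token-overlap rubric is a toy stand-in for a configurable rubric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeRun:
    name: str    # agent / step name (hypothetical structure)
    output: str  # what the agent actually produced


def jaccard(actual: str, expected: str) -> float:
    """Toy rubric: token-set overlap in [0, 1]."""
    a, b = set(actual.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_bottlenecks(
    runs: List[NodeRun],
    infer_expected: Callable[[str, str], str],
    final_answer: str,
    rubric: Callable[[str, str], float] = jaccard,
    threshold: float = 0.7,
) -> Dict[str, float]:
    """Return the scores of nodes whose output diverges from the
    back-inferred expectation (score below the threshold)."""
    flagged: Dict[str, float] = {}
    for run in runs:
        # Assumption: the expected output is inferred from the final answer.
        expected = infer_expected(run.name, final_answer)
        score = rubric(run.output, expected)
        if score < threshold:
            flagged[run.name] = score
    return flagged


# Usage with made-up data: the lookup table stands in for the inference step.
runs = [
    NodeRun("retriever", "paris france capital"),
    NodeRun("summarizer", "berlin is the capital"),
]
expected_by_node = {
    "retriever": "paris france capital city",
    "summarizer": "paris is the capital",
}
flagged = find_bottlenecks(
    runs, lambda name, _final: expected_by_node[name], final_answer="Paris"
)
print(flagged)  # only the summarizer falls below the threshold
```

In this toy run the retriever scores 0.75 and passes, while the summarizer scores 0.6 and is flagged, which is the kind of per-node signal a workflow graph could then highlight.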