cs.CL

What agent leaderboards don't tell you about LLM capabilities

Parsa Mazaheri, Kasra Mazaheri

May 19, 2026

Today's LLM agents control files, browsers, and code, but benchmarks measure different things (task success, tool validity, safety, robustness), so their results cannot be compared directly. This work proposes a unified framework with taxonomies for agent behavior and failure modes. Its key finding: stripping explicit labels from prompts causes all models to collapse to 54–62% accuracy, suggesting that much of their capability comes from supervised cues rather than genuine reasoning.
Published as "AgentAtlas: Beyond Outcome Leaderboards for LLM Agents" (arXiv:2605.20530)