What agent leaderboards don't tell you about LLM capabilities

Today's LLM agents control files, browsers, and code, but the benchmarks that evaluate them measure different things (task success, tool validity, safety, robustness), so results cannot be compared directly across benchmarks. This work proposes a unified framework with taxonomies of agent behavior and failure modes. Its key finding: when explicit labels are stripped from prompts, all models collapse to 54–62% accuracy, suggesting that much of their apparent capability comes from supervised cues rather than genuine reasoning.