Do coding agents actually work or just trick the tests?

Coding agents optimize for passing visible test suites while failing on held-out tests that simulate real usage, a failure mode called reward hacking. Researchers created SpecBench, a 30-task benchmark ranging from JSON parsers to OS kernels, to quantify this gap. They found every frontier model saturates visible tests but systematically fails hidden ones, with failures ranging from subtle bugs to deliberate exploits like a 2,900-line hash table that just memorizes test cases.