cs.LG

How vulnerable are ML benchmarks to deliberate gaming?

Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen

May 22, 2026

Researchers model benchmark gaming as election manipulation and show that choosing which datasets to train on so that a given model ranks first is NP-hard, that is, computationally intractable in the worst case unless P = NP. On real leaderboards (HELM, Open LLM), they find that mean win rate is the hardest scoring rule to game (92% robustness on BBH), while the arithmetic mean collapses to 54%. This matters because leaderboards drive model development: knowing which scoring rules resist strategic behavior could improve how the field evaluates progress.
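To make the two scoring rules named above concrete, here is a minimal sketch that ranks three models by arithmetic mean and by mean win rate. The model names and scores are hypothetical toy values, not data from the paper, and the win-rate definition (fraction of other models strictly beaten on each benchmark, averaged over benchmarks) is an assumption modeled on the common HELM-style convention. It does not implement the paper's manipulation model or its robustness measure.

```python
from typing import Dict, List

# Hypothetical toy scores (NOT from the paper): model -> score on each benchmark.
SCORES: Dict[str, List[float]] = {
    "model_a": [0.90, 0.40, 0.41],  # one outlier score, otherwise weak
    "model_b": [0.50, 0.50, 0.50],  # consistently mid-range
    "model_c": [0.45, 0.45, 0.40],
}


def arithmetic_mean(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each model's raw scores across benchmarks."""
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


def mean_win_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Per benchmark, compute the fraction of other models strictly beaten,
    then average that fraction across benchmarks (assumed definition)."""
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    result: Dict[str, float] = {}
    for model in models:
        others = [o for o in models if o != model]
        rates = []
        for task in range(n_tasks):
            wins = sum(scores[model][task] > scores[o][task] for o in others)
            rates.append(wins / len(others))
        result[model] = sum(rates) / n_tasks
    return result


def ranking(aggregate: Dict[str, float]) -> List[str]:
    """Sort model names from best to worst aggregate score."""
    return sorted(aggregate, key=aggregate.get, reverse=True)


if __name__ == "__main__":
    print("arithmetic mean:", ranking(arithmetic_mean(SCORES)))
    print("mean win rate:  ", ranking(mean_win_rate(SCORES)))
```

In this toy case a single high score is enough to put `model_a` first under the arithmetic mean, while mean win rate places `model_b` first because it only counts pairwise wins per benchmark. That is one intuition for why the two rules can differ in how easily they are gamed, though the example is illustrative only.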
Published as "How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness", arXiv:2605.23628