How vulnerable are ML benchmarks to deliberate gaming?
Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen
May 22, 2026
Researchers model benchmark gaming as election manipulation and show that choosing which datasets to train on to top-rank a model is NP-hard, that is, computationally intractable in the worst case. On real leaderboards (HELM, Open LLM), they found that mean win rate is the hardest scoring rule to game (92% robustness on BBH), while the arithmetic mean collapses to 54%. This matters because leaderboards drive model development: understanding which scoring rules resist strategic behavior could improve how the field evaluates progress.
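To make the two scoring rules concrete, here is a minimal Python sketch. It is not the paper's code. The model names and scores are invented, and mean win rate is assumed to mean the average, over benchmarks, of the fraction of other models that a model strictly beats. The helper `winning_subsets` is hypothetical: it treats one simple form of manipulation, choosing which benchmarks are counted, as a brute-force search, which may differ from the paper's exact formulation.

```python
from itertools import combinations

# Hypothetical scores (made-up numbers): model -> accuracy on benchmarks 0, 1, 2.
SCORES = {
    "model_a": [0.90, 0.40, 0.42],
    "model_b": [0.60, 0.55, 0.50],
    "model_c": [0.50, 0.50, 0.45],
}


def arithmetic_mean(scores, benchmarks):
    """Average raw score over the selected benchmark indices."""
    return {m: sum(s[b] for b in benchmarks) / len(benchmarks)
            for m, s in scores.items()}


def mean_win_rate(scores, benchmarks):
    """Average, over benchmarks, of the fraction of other models strictly beaten."""
    result = {}
    for m in scores:
        others = [o for o in scores if o != m]
        rates = [sum(scores[m][b] > scores[o][b] for o in others) / len(others)
                 for b in benchmarks]
        result[m] = sum(rates) / len(rates)
    return result


def winner(aggregate):
    """Top-ranked model under an aggregate score (first model wins ties)."""
    return max(aggregate, key=aggregate.get)


def winning_subsets(scores, target, rule):
    """Brute force: every non-empty benchmark subset on which `target` ranks first.

    Enumerates 2^n - 1 subsets, so it is only feasible for tiny n.
    """
    n = len(next(iter(scores.values())))
    return [subset
            for r in range(1, n + 1)
            for subset in combinations(range(n), r)
            if winner(rule(scores, subset)) == target]


all_benchmarks = (0, 1, 2)
print(winner(arithmetic_mean(SCORES, all_benchmarks)))
print(winner(mean_win_rate(SCORES, all_benchmarks)))
print(winning_subsets(SCORES, "model_a", arithmetic_mean))
print(winning_subsets(SCORES, "model_a", mean_win_rate))
```

In this toy data, one large margin on a single benchmark is enough to carry `model_a` under the arithmetic mean, whereas mean win rate counts only how many rivals are beaten on each benchmark. As a result, fewer benchmark subsets put `model_a` on top under mean win rate.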