How vulnerable are ML benchmarks to deliberate gaming?

Researchers model benchmark gaming as election manipulation and show that choosing which datasets to train on so that a given model ranks first is NP-hard, meaning no efficient algorithm is known for the worst case. Measuring real leaderboards (HELM, Open LLM), they found that mean win rate is the hardest aggregation rule to game (92% robustness on BBH), while the arithmetic mean falls to 54%. This matters because leaderboards drive model development; understanding which scoring rules resist strategic behavior could improve how the field evaluates progress.
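
To make the contrast between the two scoring rules concrete, here is a minimal sketch in Python. It is not the study's code or experimental setup: the model names and scores are invented for illustration, and mean win rate is assumed to mean the fraction of other models a model beats on each benchmark, averaged over benchmarks. The helper names (`arithmetic_mean`, `mean_win_rate`, `ranking`) are hypothetical.

```python
from typing import Dict, List

# Hypothetical per-benchmark scores (illustrative only, not from the study).
# model_c is weak on two benchmarks but has one inflated score.
SCORES: Dict[str, List[float]] = {
    "model_a": [0.60, 0.62, 0.61],
    "model_b": [0.58, 0.60, 0.59],
    "model_c": [0.45, 0.45, 0.99],
}


def arithmetic_mean(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each model's raw scores across benchmarks."""
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


def mean_win_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Assumed definition: on each benchmark, the fraction of other models
    that a model strictly beats, averaged over all benchmarks."""
    models = list(scores)
    n_bench = len(next(iter(scores.values())))
    result: Dict[str, float] = {}
    for model in models:
        others = [o for o in models if o != model]
        per_bench = []
        for b in range(n_bench):
            wins = sum(scores[model][b] > scores[o][b] for o in others)
            per_bench.append(wins / len(others))
        result[model] = sum(per_bench) / n_bench
    return result


def ranking(aggregate: Dict[str, float]) -> List[str]:
    """Order models from best to worst by aggregate score."""
    return sorted(aggregate, key=aggregate.get, reverse=True)


if __name__ == "__main__":
    print("arithmetic mean:", ranking(arithmetic_mean(SCORES)))
    print("mean win rate:  ", ranking(mean_win_rate(SCORES)))
```

With these made-up numbers, the arithmetic mean puts model_c first because a single inflated score lifts its average, while mean win rate keeps model_a first because it counts only how many rivals a model beats on each benchmark. This shows one intuition for why a rank-based rule can be harder to game. It does not reproduce the reported robustness figures.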