How can language models be ranked without labeled data or trustworthy benchmarks?

Benchmarks leak into training data, making standard scores unreliable. CoEval sidesteps this by generating fresh, contamination-free tasks from task descriptions alone and by using a diverse panel of models as judges instead of human annotators. On four tasks, its model rankings reached a 0.86 correlation with ground-truth rankings at a total cost of $5.89, cheap enough to regenerate the evaluation for each release.
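To make the ranking-and-correlation step concrete, here is a minimal sketch of how scores from a panel of judges might be combined into a ranking and compared against a reference ranking. It is an illustration under stated assumptions, not CoEval's actual pipeline: the judge and model names, the scores, the mean-score aggregation, and the choice of Spearman rank correlation are all hypothetical, since the text does not specify which aggregation or correlation measure was used.

```python
from statistics import mean

def rank_models(scores_by_judge):
    """Average each model's score across judges and return models best-first.

    Mean aggregation is an assumption made for illustration only.
    """
    models = sorted({m for judge in scores_by_judge.values() for m in judge})
    avg = {m: mean(judge[m] for judge in scores_by_judge.values()) for m in models}
    return sorted(models, key=lambda m: avg[m], reverse=True)

def spearman(ranking_a, ranking_b):
    """Spearman rank correlation between two tie-free rankings of the same items."""
    n = len(ranking_a)
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    d2 = sum((i - pos_b[m]) ** 2 for i, m in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical scores: judge -> {model -> score on freshly generated tasks}.
scores_by_judge = {
    "judge_a": {"model_x": 8.0, "model_y": 6.5, "model_z": 7.0, "model_w": 4.0},
    "judge_b": {"model_x": 7.5, "model_y": 7.0, "model_z": 6.0, "model_w": 5.0},
    "judge_c": {"model_x": 9.0, "model_y": 6.0, "model_z": 7.0, "model_w": 3.5},
}

# Hypothetical reference ranking standing in for ground truth.
reference = ["model_x", "model_y", "model_z", "model_w"]

panel_ranking = rank_models(scores_by_judge)
rho = spearman(panel_ranking, reference)
print(panel_ranking, round(rho, 2))
```

In this toy data the panel swaps two adjacent models relative to the reference, so the correlation falls below 1. Averaging over several judges is one simple way to keep a single judge's bias from dominating the ranking.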