How to rank language models without labeled data or trustworthy benchmarks?
Alexander Apartsin, Yehudit Aperstein
June 2, 2026
Benchmarks leak into training data, which makes standard scores unreliable. CoEval sidesteps this by generating fresh, contamination-free tasks from task descriptions alone and by scoring models with a diverse panel of model judges instead of human annotators. On four tasks, the resulting model rankings reached a 0.86 correlation with ground truth at a total cost of $5.89, cheap enough to regenerate for every release.
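To make the evaluation step concrete, here is a minimal sketch of how a panel ranking could be compared against a reference ranking. It is not CoEval's implementation: the summary does not say how judge scores are aggregated or which correlation statistic is reported, so the mean over judges and Spearman rank correlation are assumptions, and every judge name, model name, and score below is a made-up placeholder.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def panel_scores(judge_scores):
    """Average each model's score across all judges (assumed aggregation rule)."""
    models = next(iter(judge_scores.values())).keys()
    return {m: mean(scores[m] for scores in judge_scores.values()) for m in models}


# Hypothetical judge-by-model scores; none of these numbers come from the paper.
judge_scores = {
    "judge_a": {"model_1": 0.90, "model_2": 0.60, "model_3": 0.40, "model_4": 0.70},
    "judge_b": {"model_1": 0.80, "model_2": 0.70, "model_3": 0.30, "model_4": 0.60},
    "judge_c": {"model_1": 0.85, "model_2": 0.50, "model_3": 0.50, "model_4": 0.60},
}
# Hypothetical ground-truth scores for the same models.
reference = {"model_1": 0.95, "model_2": 0.75, "model_3": 0.35, "model_4": 0.70}

panel = panel_scores(judge_scores)
models = sorted(panel)
rho = spearman([panel[m] for m in models], [reference[m] for m in models])
ranking = sorted(panel, key=panel.get, reverse=True)

print("panel ranking:", ranking)
print("rank correlation with reference:", round(rho, 2))
```

In this toy data the panel swaps two mid-ranked models relative to the reference, so the correlation falls below 1. A rank-based statistic is used here because the goal described above is recovering the ordering of models, not their absolute scores.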