cs.CL

How to rank language models without labeled data or trustworthy benchmarks?

Alexander Apartsin, Yehudit Aperstein

June 2, 2026

Benchmarks leak into training data, which makes standard benchmark scores unreliable. CoEval sidesteps this by generating fresh, contamination-free tasks from task descriptions alone and by using a diverse panel of model judges in place of human annotators. On four tasks, the approach produced model rankings that correlate at 0.86 with ground truth, at a total cost of $5.89, which is cheap enough to regenerate the evaluation for every release.
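To make the two moving parts concrete (aggregating a panel of judges into a ranking, then checking that ranking against a reference), here is a minimal sketch. It is not the paper's implementation: the summary does not say how judge scores are aggregated or which correlation statistic is reported, so this assumes a plain mean over judges and Spearman's rank correlation. The judge names, model names, scores, and reference ranking are all made up for illustration.

```python
from statistics import mean


def panel_ranking(scores):
    """Rank models best-first by their mean score across all judges.

    `scores` maps judge name -> {model name: score}. Averaging is an
    assumed aggregation rule, not necessarily the one CoEval uses.
    """
    models = sorted({m for per_judge in scores.values() for m in per_judge})
    avg = {
        m: mean(per_judge[m] for per_judge in scores.values() if m in per_judge)
        for m in models
    }
    # Sort by descending mean score; break exact ties alphabetically.
    return sorted(models, key=lambda m: (-avg[m], m))


def spearman(rank_a, rank_b):
    """Spearman's rho between two tie-free rankings of the same items."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b) or len(set(rank_a)) != n:
        raise ValueError("rankings must list the same distinct items")
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d_squared = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical judge panel scoring four hypothetical models on a 1-10 scale.
scores = {
    "judge_a": {"model_x": 8, "model_y": 6, "model_z": 7, "model_w": 3},
    "judge_b": {"model_x": 7, "model_y": 7, "model_z": 6, "model_w": 4},
    "judge_c": {"model_x": 9, "model_y": 5, "model_z": 6, "model_w": 2},
}
# Hypothetical ground-truth ordering, best first.
reference = ["model_x", "model_y", "model_z", "model_w"]

predicted = panel_ranking(scores)
rho = spearman(predicted, reference)
print(predicted, rho)
```

In this toy data the panel swaps the two middle models relative to the reference, which gives a rho of 0.8. With only a handful of models a single swap moves the statistic a lot, so a correlation reported over a small model pool should be read with that in mind.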
Published as "CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks" (arXiv:2606.03650).