cs.CL

Predicting LLM capabilities without waiting for full training

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

May 18, 2026

Deciding which model architecture or training data to use requires downstream evaluations that are expensive, slow, and uninformative early in training. The authors constructed proxy metrics from token-level statistics (entropy, top-k accuracy, and the rank of the expert token) computed on expert-written solutions, then tested them in three scenarios: selecting among model families (ρ = 0.81, versus 0.36 for cross-entropy loss), ranking pretraining corpora, and forecasting final accuracy from early training stages. The proxies consistently outperformed loss-based baselines. For data selection they gave reliable performance predictions at 10,000× lower compute cost, and for forecasting they tracked accuracy trajectories across an 18× compute span.
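To make the three token-level statistics concrete, here is a minimal sketch of how they might be computed for one expert-written solution. It is an illustration and not the paper's implementation: the function name `token_stats`, the averaging over positions, the tie-handling rule, and the toy logits are all assumptions, and the paper's exact definitions may differ.

```python
import math


def token_stats(logits, expert_ids, k=5):
    """Average entropy, top-k accuracy, and expert-token rank over positions.

    logits: one list of vocabulary scores per position (hypothetical input;
            in practice these would come from a model's forward pass).
    expert_ids: the token the expert solution actually used at each position.
    """
    if not logits or len(logits) != len(expert_ids):
        raise ValueError("need one expert token id per non-empty logits row")

    entropies, hits, ranks = [], [], []
    for row, target in zip(logits, expert_ids):
        # Numerically stable softmax over the vocabulary.
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        probs = [e / z for e in exps]

        # Shannon entropy (in nats) of the predictive distribution.
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))

        # Rank of the expert token: 1 means it is the model's top choice.
        # Assumption: ties are resolved in the expert token's favour.
        rank = 1 + sum(1 for x in row if x > row[target])
        ranks.append(rank)

        # Top-k accuracy: is the expert token among the k highest scores?
        hits.append(1.0 if rank <= k else 0.0)

    n = len(ranks)
    return {
        "entropy": sum(entropies) / n,
        "top_k_acc": sum(hits) / n,
        "mean_rank": sum(ranks) / n,
    }


# Toy example: two positions, vocabulary of three tokens.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
toy_expert_ids = [0, 2]
stats = token_stats(toy_logits, toy_expert_ids, k=1)
```

In the toy example the expert token is the top choice at the first position and the second choice at the other, so with `k=1` the top-k accuracy is 0.5 and the mean rank is 1.5. How such statistics are combined into a single proxy score is not described in this summary.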
Published as "Forecasting Downstream Performance of LLMs With Proxy Metrics", arXiv:2605.18607