Predicting LLM capabilities without waiting for full training

Choosing a model architecture or training dataset usually depends on downstream evaluations that are expensive, slow, and uninformative early in training. The team constructed proxy metrics from token-level statistics (entropy, top-k accuracy, expert token rank) computed on expert-written solutions, then tested them in three scenarios: selecting among model families (Rho = 0.81 vs. 0.36 for cross-entropy loss), ranking pretraining corpora, and forecasting final accuracy from early training stages. The proxies consistently outperformed loss-based baselines: they predicted performance reliably at 10,000× lower compute cost for data selection, and they tracked accuracy trajectories across an 18× compute span.
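
To make the token-level statistics concrete, here is a minimal sketch of how entropy, top-k accuracy, and expert token rank could be computed from a model's next-token logits on one expert-written solution. It is an illustration under stated assumptions, not the team's implementation: the function names `token_proxy_stats` and `spearman_rho`, the choice of `k`, the averaging over positions, and the reading of the reported Rho as a Spearman rank correlation are all assumptions made here.

```python
import numpy as np

def token_proxy_stats(logits, expert_tokens, k=5):
    """Hypothetical helper: summarize token-level statistics for one solution.

    logits:        (T, V) array of next-token logits at each of T positions.
    expert_tokens: (T,) integer array, the token the expert actually wrote.
    """
    logits = np.asarray(logits, dtype=np.float64)
    expert_tokens = np.asarray(expert_tokens)
    positions = np.arange(len(expert_tokens))

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)

    # Entropy of the predictive distribution at each position (in nats).
    entropy = -(probs * log_probs).sum(axis=1)

    # Rank of the expert token: 1 means it is the model's top choice.
    expert_logit = logits[positions, expert_tokens]
    rank = 1 + (logits > expert_logit[:, None]).sum(axis=1)

    return {
        "mean_entropy": float(entropy.mean()),
        "top_k_accuracy": float((rank <= k).mean()),
        "mean_expert_rank": float(rank.mean()),
        # Loss-based baseline, for comparison with the proxies.
        "cross_entropy": float(-log_probs[positions, expert_tokens].mean()),
    }

def spearman_rho(x, y):
    """Spearman rank correlation; assumes no tied values in x or y."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Under these assumptions, one would compute a proxy score per candidate model (or per pretraining corpus), then call `spearman_rho(proxy_scores, downstream_accuracies)` to check how well the proxy ordering matches the ordering given by the expensive downstream evaluation.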