cs.CL

Why bigger models organize their thinking differently

Weilun Xu

May 16, 2026

This work studies how language model representations are organized geometrically, using a new metric called Subspace PGA that measures whether a layer's distance structure aligns with the prediction readout matrix. Across seven Pythia models (70M–6.9B parameters) and additional cross-family models, intermediate layers show significant predictive organization (z-scores of 9–24), but late-layer behavior diverges with scale: models with hidden dimension ≤1024 progressively lose this organization even as their loss improves, while larger models (hidden dimension ≥2048) maintain it throughout. The paper traces the difference to a capacity bottleneck: small models mask predictive information under dominant directions rather than destroying it. Neither spectral metrics nor standard loss analysis reveals this distinction, which suggests that scale determines not just performance but the geometry through which models solve prediction.
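The masking idea can be illustrated with a toy example. The sketch below is not Subspace PGA and does not reproduce anything from the paper. It assumes a hypothetical readout matrix `w`, synthetic Gaussian "hidden states" `h`, and a simple stand-in score (`distance_alignment`, the Pearson correlation between pairwise distances in the full space and in the readout space). It shows only that a single dominant direction can hide distance structure that is still present once that direction is projected out.

```python
import numpy as np


def pairwise_dists(x):
    """Euclidean distances for every unordered pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(x), k=1)
    return d[iu]


def distance_alignment(h, w):
    """Toy stand-in score (an assumption, not the paper's metric):
    correlation between distances in h and distances in h @ w."""
    return np.corrcoef(pairwise_dists(h), pairwise_dists(h @ w))[0, 1]


rng = np.random.default_rng(0)
n, d, k = 200, 32, 4

# Synthetic hidden states and a hypothetical readout that reads the first k coordinates.
h = rng.normal(size=(n, d))
w = np.zeros((d, k))
w[:k, :k] = np.eye(k)

# Add one high-variance direction that the readout ignores.
u = np.zeros(d)
u[-1] = 1.0
h_masked = h + 50.0 * rng.normal(size=(n, 1)) * u

# Project out the top principal direction of the masked states.
hc = h_masked - h_masked.mean(axis=0)
_, _, vt = np.linalg.svd(hc, full_matrices=False)
top = vt[0]
h_recovered = hc - np.outer(hc @ top, top)

a_clean = distance_alignment(h, w)
a_masked = distance_alignment(h_masked, w)
a_recovered = distance_alignment(h_recovered, w)

print(f"clean:     {a_clean:.3f}")
print(f"masked:    {a_masked:.3f}")
print(f"recovered: {a_recovered:.3f}")
```

In this toy setting the masked score drops well below the clean score and rises again after the dominant direction is removed, which mirrors the "masked rather than destroyed" reading in the summary above. The sizes, the scale factor of 50, and the scoring function are all illustrative choices.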
Published as "Scale Determines Whether Language Models Organize Representation Geometry for Prediction", arXiv:2605.17084