Why bigger models organize their thinking differently

This work investigates how language model representations are organized geometrically, using a new metric called Subspace PGA, which measures whether a layer's distance structure aligns with the prediction readout matrix. Across seven Pythia models (70M–6.9B parameters) and additional cross-family models, the authors find that intermediate layers show significant predictive organization (z-scores of 9–24), but late-layer behavior diverges with scale: models with hidden dimension ≤1024 progressively lose this organization at late layers even as their loss improves, while larger models (hidden dimension ≥2048) maintain it throughout. The authors trace the difference to a capacity bottleneck: small models mask predictive information under dominant directions rather than destroying it outright. This distinction is invisible to spectral metrics and standard loss analysis, suggesting that scale determines not only performance but also the geometry through which models solve prediction.
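To make the idea of "distance structure aligned with the readout" concrete, here is a minimal sketch of one way such a score could be computed. It is not the authors' definition of Subspace PGA, which this summary does not spell out. Everything in it is an assumption made for illustration: the function names, the use of the top-k right singular vectors of the unembedding matrix as the "readout subspace", Pearson correlation between pairwise distances as the alignment measure, random k-dimensional subspaces as the null distribution for the z-score, and the defaults for `k` and `n_null`.

```python
import numpy as np


def pairwise_dists(x):
    """Euclidean distances between all distinct row pairs, as a flat vector."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(x.shape[0], k=1)
    return d[upper]


def alignment(h, basis):
    """Pearson correlation between full-space distances and distances
    measured after projecting onto the columns of `basis` (shape d x k)."""
    return np.corrcoef(pairwise_dists(h), pairwise_dists(h @ basis))[0, 1]


def subspace_alignment_z(h, w_unembed, k=8, n_null=200, seed=0):
    """Hypothetical alignment z-score (an illustration, not Subspace PGA).

    h         : (n_tokens, d) hidden states from one layer
    w_unembed : (vocab, d) readout / unembedding matrix
    """
    rng = np.random.default_rng(seed)

    # Assumption: the top-k right singular vectors of the readout matrix
    # span the directions that most affect the predicted logits.
    _, _, vt = np.linalg.svd(w_unembed, full_matrices=False)
    observed = alignment(h, vt[:k].T)

    # Null distribution: the same score for random k-dimensional subspaces,
    # each given by an orthonormal basis from a QR decomposition.
    null = np.empty(n_null)
    for i in range(n_null):
        q, _ = np.linalg.qr(rng.standard_normal((h.shape[1], k)))
        null[i] = alignment(h, q)

    return (observed - null.mean()) / null.std()
```

Under these assumptions, a layer whose largest-variance directions coincide with the readout subspace scores well above the random-subspace baseline. A layer whose dominant directions lie elsewhere scores below it, even when the readout-relevant coordinates are still present in the hidden states. That second case is a toy version of the masking described above: the information is not destroyed, but full-space distances no longer reflect it.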