Why bigger AI models sometimes get worse: a physics perspective

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

Large language models sometimes perform worse when scaled up or quantized, a puzzle that existing scaling laws can't explain. This work reframes LLM training as information transmission through a noisy channel (Shannon's model), where model size plays the role of bandwidth and training tokens play the role of signal power. The theory predicts a fundamental capacity limit: pushing either dimension without maintaining the signal-to-noise ratio amplifies noise and triggers U-shaped performance degradation. Tested on Pythia and OLMo2 across quantization, fine-tuning, and noise injection, the resulting Shannon Scaling Law outperforms prior approaches and extrapolates accurately to unseen model sizes.
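
For intuition about the channel analogy, the sketch below evaluates the classical Shannon-Hartley capacity, C = B * log2(1 + S / (N0 * B)), as bandwidth grows while signal power stays fixed. This is the textbook formula only, not the paper's fitted scaling law. The function names and the numeric values of `SIGNAL_POWER` and `NOISE_DENSITY` are illustrative assumptions, and the mapping of bandwidth to model size and signal power to training tokens is taken loosely from the summary above.

```python
import math

# Illustrative constants (assumptions, not values from the paper).
SIGNAL_POWER = 1.0     # stands in for "training tokens" in the analogy
NOISE_DENSITY = 1e-3   # noise power per unit bandwidth (N0)


def signal_to_noise(bandwidth: float) -> float:
    """SNR = S / (N0 * B): total noise power grows with bandwidth."""
    return SIGNAL_POWER / (NOISE_DENSITY * bandwidth)


def shannon_capacity(bandwidth: float) -> float:
    """Classical Shannon-Hartley capacity, B * log2(1 + SNR), in bits per second."""
    return bandwidth * math.log2(1.0 + signal_to_noise(bandwidth))


# With S and N0 fixed, capacity can never exceed S / (N0 * ln 2),
# no matter how much bandwidth is added.
CAPACITY_LIMIT = SIGNAL_POWER / (NOISE_DENSITY * math.log(2))

# Bandwidth stands in for "model size" in the analogy.
BANDWIDTHS = [1e1, 1e2, 1e3, 1e4, 1e5]
CAPACITIES = [shannon_capacity(b) for b in BANDWIDTHS]
SNRS = [signal_to_noise(b) for b in BANDWIDTHS]

if __name__ == "__main__":
    for b, c, s in zip(BANDWIDTHS, CAPACITIES, SNRS):
        print(f"B={b:>9.0f}  SNR={s:10.4f}  C={c:10.2f}  limit={CAPACITY_LIMIT:.2f}")
```

In this classical setting, widening the channel at fixed signal power lowers the signal-to-noise ratio and capacity saturates rather than growing without bound. The U-shaped degradation described above is a claim of the paper's own formulation and is not reproduced by this sketch.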