How description length tames bloated symbolic regression
Gabriel Kronberger, Fabricio Olivetti de Franca, Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira
May 21, 2026
Genetic programming for symbolic regression tends to produce bloated, overfitted equations, especially when the data are noisy. This work applies description length (DL), an information-theoretic criterion, as a principled way to select compact models that generalize better. Across noisy synthetic and real-world datasets, DL-based post-selection outperformed standard heuristics such as AIC and BIC. However, using DL directly as the fitness function caused premature convergence; the approach that worked best was a multi-objective search over accuracy and compactness, followed by DL-based selection among the resulting models.
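To make the post-selection step concrete, here is a minimal sketch of scoring candidates from an accuracy/compactness front and picking one per criterion. It assumes Gaussian noise with the variance profiled out, and the candidate numbers are invented for illustration. AIC and BIC are the standard formulas; `simple_dl` is a deliberately simplified two-part code (misfit plus structure cost plus parameter cost) and is not the DL formulation used in the paper. The names `Candidate`, `simple_dl`, `FRONT`, and `N_SYMBOLS` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    nodes: int    # size of the expression tree
    params: int   # number of fitted numeric constants
    rss: float    # residual sum of squares on the training data


def max_log_likelihood(rss: float, n: int) -> float:
    """Gaussian log-likelihood at the MLE, with the noise variance profiled out."""
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)


def aic(c: Candidate, n: int) -> float:
    return 2.0 * c.params - 2.0 * max_log_likelihood(c.rss, n)


def bic(c: Candidate, n: int) -> float:
    return c.params * math.log(n) - 2.0 * max_log_likelihood(c.rss, n)


def simple_dl(c: Candidate, n: int, n_symbols: int) -> float:
    """Simplified stand-in for description length (an assumption, not the paper's formula).

    misfit cost      : negative log-likelihood
    structure cost   : each tree node picks one of n_symbols symbols
    parameter cost   : (p / 2) * log(n), as in BIC
    """
    misfit = -max_log_likelihood(c.rss, n)
    structure = c.nodes * math.log(n_symbols)
    parameters = 0.5 * c.params * math.log(n)
    return misfit + structure + parameters


def select(candidates, score):
    """Return the candidate with the lowest score."""
    return min(candidates, key=score)


# Hypothetical front: a larger tree buys a slightly lower error.
N = 100          # number of data points (made up)
N_SYMBOLS = 10   # size of the operator/terminal set (made up)
FRONT = [
    Candidate("A", nodes=3, params=1, rss=40.0),
    Candidate("B", nodes=7, params=2, rss=10.0),
    Candidate("C", nodes=25, params=8, rss=8.0),
]

best_aic = select(FRONT, lambda c: aic(c, N))
best_bic = select(FRONT, lambda c: bic(c, N))
best_dl = select(FRONT, lambda c: simple_dl(c, N, N_SYMBOLS))

print("AIC picks:", best_aic.name)
print("BIC picks:", best_bic.name)
print("DL  picks:", best_dl.name)
```

With these made-up numbers, AIC accepts the 25-node model for its small gain in fit, while the structure term in `simple_dl` makes that model far more expensive than the 7-node one. The example only illustrates the mechanics of post-selection; it does not reproduce the paper's comparison between criteria.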