stat.ML

How description length tames bloated symbolic regression

Gabriel Kronberger, Fabricio Olivetti de Franca, Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira

May 21, 2026

Genetic programming for symbolic regression tends to produce bloated, overfitted equations, especially on noisy data. This work applies description length (DL), an information-theoretic criterion, as a principled way to select compact models that generalize better. Across noisy synthetic and real-world datasets, DL post-selection outperformed standard heuristics such as the Akaike and Bayesian information criteria (AIC and BIC). Using DL directly as a fitness function, however, caused premature convergence; the best results came from a multi-objective search over accuracy and compactness, followed by DL-based selection from the resulting candidates.
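To make the post-selection step concrete, here is a minimal Python sketch of picking one model from a set of candidates that a multi-objective search might return. It is an illustration under stated assumptions, not the paper's method: the `Candidate` class, the `simple_dl` score, the alphabet size `N_SYMBOLS`, and all numbers in `front` are hypothetical. In particular, `simple_dl` is a simplified stand-in (negative log-likelihood plus a cost for the expression structure plus a cost for the fitted constants); the paper's actual DL formulation may differ. The AIC and BIC formulas are the standard ones.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    expr: str      # human-readable expression
    nll: float     # negative log-likelihood at the fitted parameters (nats)
    n_params: int  # number of fitted numeric constants
    n_nodes: int   # number of nodes in the expression tree


def aic(c: Candidate) -> float:
    # Standard AIC: 2p - 2 log L
    return 2 * c.n_params + 2 * c.nll


def bic(c: Candidate, n_samples: int) -> float:
    # Standard BIC: p ln N - 2 log L
    return c.n_params * math.log(n_samples) + 2 * c.nll


def simple_dl(c: Candidate, n_samples: int, n_symbols: int) -> float:
    # ASSUMPTION: simplified description length, not the paper's formula.
    # Cost of the data given the model, plus the cost of writing down the
    # tree (one symbol per node), plus a coarse cost per fitted constant.
    structure_cost = c.n_nodes * math.log(n_symbols)
    parameter_cost = 0.5 * c.n_params * math.log(n_samples)
    return c.nll + structure_cost + parameter_cost


def select(front, score):
    # Post-selection: return the candidate with the lowest score.
    return min(front, key=score)


# Hypothetical candidates; every number here is made up for illustration.
N_SAMPLES = 100
N_SYMBOLS = 8
front = [
    Candidate("a*x", nll=160.0, n_params=1, n_nodes=3),
    Candidate("a*x + b*sin(x)", nll=120.0, n_params=2, n_nodes=8),
    Candidate("a*x + b*sin(x) + c*exp(d*x)/(e + x*x)",
              nll=116.0, n_params=5, n_nodes=21),
]

best_by_aic = select(front, aic)
best_by_bic = select(front, lambda c: bic(c, N_SAMPLES))
best_by_dl = select(front, lambda c: simple_dl(c, N_SAMPLES, N_SYMBOLS))
```

In this toy setup the structure term makes the largest expression expensive even though it fits slightly better, which is the kind of trade-off a description-length criterion is meant to capture. The search itself is left untouched; the score is applied only once, to the finished set of candidates.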
Published as "Guiding Multi-Objective Genetic Programming with Description Length Improves Symbolic Regression Solutions", arXiv:2605.22374