How description length tames bloated symbolic regression

Gabriel Kronberger, Fabricio Olivetti de Franca, Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira

Genetic programming for symbolic regression tends to produce bloated, overfitted equations, especially on noisy data. This work applies description length (DL), an information-theoretic criterion, as a principled way to select compact models that generalize better. Across noisy synthetic and real-world datasets, DL-based post-selection outperformed standard criteria such as AIC and BIC. Using DL directly as the fitness function, however, caused premature convergence. The best results came from a multi-objective search over accuracy and compactness, followed by DL-based selection among the resulting candidates.
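
To make the post-selection step concrete, the sketch below scores a small set of candidate equations with AIC, BIC, and a simplified two-part description length, then picks the minimizer of each. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the toy numbers, and the `simplified_dl` formula (Gaussian negative log-likelihood, plus a structure cost of `n_nodes * log(n_symbols)`, plus a BIC-style parameter cost) are all hypothetical stand-ins for the more detailed DL used in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one equation returned by the search."""
    name: str
    n_nodes: int      # number of nodes in the expression tree
    n_symbols: int    # number of distinct operators/variables used
    n_params: int     # number of fitted numeric constants
    sse: float        # sum of squared errors on the training data


def gaussian_nll(sse: float, n: int) -> float:
    """Negative log-likelihood under Gaussian noise, with the noise
    variance set to its maximum-likelihood value sse / n."""
    return 0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)


def aic(c: Candidate, n: int) -> float:
    return 2.0 * gaussian_nll(c.sse, n) + 2.0 * c.n_params


def bic(c: Candidate, n: int) -> float:
    return 2.0 * gaussian_nll(c.sse, n) + c.n_params * math.log(n)


def simplified_dl(c: Candidate, n: int) -> float:
    """Assumed, simplified description length (in nats):
    data cost + structure cost + parameter cost."""
    structure_cost = c.n_nodes * math.log(c.n_symbols)
    parameter_cost = 0.5 * c.n_params * math.log(n)
    return gaussian_nll(c.sse, n) + structure_cost + parameter_cost


def select(candidates, criterion, n: int) -> Candidate:
    """Return the candidate that minimizes the given criterion."""
    return min(candidates, key=lambda c: criterion(c, n))


# Toy candidates with made-up values, standing in for the output of a
# multi-objective (accuracy vs. compactness) search on n = 100 points.
n_points = 100
front = [
    Candidate("linear", n_nodes=5, n_symbols=3, n_params=2, sse=50.0),
    Candidate("mid", n_nodes=9, n_symbols=4, n_params=3, sse=10.0),
    Candidate("bloated", n_nodes=31, n_symbols=6, n_params=8, sse=8.5),
]

for label, criterion in [("AIC", aic), ("BIC", bic), ("DL", simplified_dl)]:
    print(label, "selects", select(front, criterion, n_points).name)
```

The structure term is what separates this score from AIC and BIC: those criteria only count fitted constants, so a long expression with few constants is barely penalized, whereas a description-length score also charges for every node in the equation.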