Research & Papers

Description Length and Bayes Factor beat overfitting in genetic programming

Data-efficient model selection outperforms AIC/BIC for symbolic regression on noisy data

Deep Dive

Symbolic regression with genetic programming (GPSR) often overfits and produces bloated expressions, especially when noise is present. In a new arXiv paper, Kronberger et al. evaluate Description Length (DL) and Fractional Bayes Factor (FBF) as principled, data-efficient alternatives to heuristics like AIC and BIC. They implement DL using a Fisher-information-based parameter encoding and test three strategies: multi-objective search for accuracy and length followed by DL/FBF selection; multi-objective search with DL as a direct objective; and single-objective optimization using DL/FBF as fitness.
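To make the idea of a Fisher-information-based parameter encoding concrete, here is a minimal sketch. The formula is an assumption, not quoted from the paper: it is the standard minimum-description-length argument in which each fitted parameter is stored to a precision of sqrt(12 / I_ii), where I_ii is the diagonal Fisher information, plus a sign. The function name `parameter_codelength` and the example values are hypothetical, and the authors' exact encoding may differ.

```python
import math


def parameter_codelength(theta, fisher_diag):
    """Approximate cost, in nats, of encoding fitted parameters.

    Assumed form (not taken from the paper): parameter i is stored to
    precision sqrt(12 / I_ii) plus a sign, which costs
    0.5*log(I_ii) + log|theta_i| - 0.5*log(3) nats.
    """
    if len(theta) != len(fisher_diag):
        raise ValueError("theta and fisher_diag must have the same length")
    total = 0.0
    for value, info in zip(theta, fisher_diag):
        if value == 0.0 or info <= 0.0:
            raise ValueError("need non-zero parameters and positive information")
        total += 0.5 * math.log(info) + math.log(abs(value)) - 0.5 * math.log(3.0)
    return total


# Same parameter value, but the data constrain it tightly in one case
# (large Fisher information) and loosely in the other.
sharp = parameter_codelength([2.0], [300.0])
loose = parameter_codelength([2.0], [3.0])
```

Under this assumed form, a parameter that the data pin down precisely costs more nats than one that is only loosely determined, so the penalty depends on how much information each constant carries and not only on how many constants there are.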

The results show that DL/FBF post-selection consistently improves test performance across noisy synthetic benchmarks and real-world regression problems, beating plain AIC/BIC baselines. Notably, BIC performs similarly once it is combined with the same function-complexity penalty. Using DL/FBF directly as the fitness function in single-objective GPSR, however, often causes premature convergence to overly simple models. The authors provide practical guidance for using DL/FBF as model-selection tools, which is useful for practitioners seeking compact, interpretable models.
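The post-selection step can be sketched as follows. This is an illustration, not the authors' implementation: it scores each model on a Pareto front with BIC under a Gaussian noise assumption and adds a hypothetical structure penalty of `length * log(n_symbols)` nats as a stand-in for a function-complexity term. The candidate models, their residual sums of squares, and the names `gaussian_bic` and `select_model` are made up for the example.

```python
import math


def gaussian_bic(rss, n, n_params):
    """BIC for a least-squares fit with Gaussian noise and MLE variance."""
    sigma2 = rss / n
    neg2_loglik = n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return neg2_loglik + n_params * math.log(n)


def select_model(front, n, n_symbols):
    """Return the front member with the lowest penalized score.

    The structure term is a hypothetical stand-in for a function-complexity
    penalty; the factor 2 puts it on the same -2*log-likelihood scale as BIC.
    """
    def score(model):
        structure = 2.0 * model["length"] * math.log(n_symbols)
        return gaussian_bic(model["rss"], n, model["n_params"]) + structure

    return min(front, key=score)


# Made-up Pareto front: training error falls as expressions grow.
front = [
    {"expr": "c0*x", "rss": 40.0, "n_params": 1, "length": 3},
    {"expr": "c0*x + c1*sin(x)", "rss": 10.0, "n_params": 2, "length": 8},
    {"expr": "large expression", "rss": 9.5, "n_params": 6, "length": 25},
]
best = select_model(front, n=100, n_symbols=8)
lowest_rss = min(front, key=lambda m: m["rss"])
```

In this toy front the selected model is the mid-sized expression, while the lowest training error belongs to the largest one. Because the search itself still optimizes only accuracy and length, the criterion is applied once at the end and does not steer the population toward trivial models.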

Key Points
  • DL/FBF post-selection after multi-objective GPSR outperforms AIC/BIC baselines on test data
  • Fisher-information-based parameter encoding gives a principled measure of model complexity
  • Using DL/FBF as direct fitness in single-objective GPSR often causes premature convergence to overly simple models; both work best as selection criteria applied after the search

Why It Matters

The paper offers a practical recipe for building simpler, more accurate symbolic regression models from noisy real-world data.