A Benchmarking Framework for Model Datasets
New framework tackles the reproducibility crisis by systematically measuring dataset quality across modeling formats
A team of researchers from the software engineering community has proposed a benchmarking framework designed specifically for datasets of software models, addressing a critical gap in AI and model-driven engineering (MDE) research. Philipp-Lorenz Glaser, Lola Burgueño, and Dominik Bork argue that the datasets used to train or evaluate machine learning techniques for modeling support are typically collected ad hoc, without guarantees of quality or suitability for specific tasks. This practice, detailed in their arXiv preprint (2603.05250), leads to limited comparability between studies, obscured dataset quality, weak reproducibility, and potential bias, fundamentally undermining empirical and LLM-based research in the field.
The proposed solution is a Benchmark Platform for MDE that provides a unified infrastructure for systematically assessing and comparing datasets of software models. The framework treats datasets as first-class artifacts, moving beyond benchmarking model performance to benchmarking the datasets themselves: a dataset's quality, representativeness, and task suitability are measured against defined criteria and metrics, and the platform is designed to work across different modeling languages and formats. This represents a foundational shift toward more rigorous, transparent, and reproducible AI research in software engineering, in which the quality of the input data is recognized as being just as critical as the algorithms that process it.
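The preprint describes the platform at a conceptual level; as a rough illustration of what benchmarking a dataset as a first-class artifact could look like, the sketch below computes a few simple dataset-level metrics over a directory of model files. The DatasetReport structure, the specific metrics, and the ./models path are assumptions made for this example, not the authors' actual platform or API.

```python
# Illustrative sketch only: metric names and structures are hypothetical,
# not taken from the Benchmark Platform for MDE described in the preprint.
from collections import Counter
from dataclasses import dataclass
from hashlib import sha256
from pathlib import Path


@dataclass
class DatasetReport:
    """Dataset-level metrics, treating the dataset itself as the benchmarked artifact."""
    size: int                         # number of model files in the dataset
    format_coverage: dict[str, int]   # models per serialization format (e.g. .ecore, .uml, .xmi)
    duplicate_ratio: float            # share of byte-identical models, a simple quality signal


def benchmark_dataset(root: Path) -> DatasetReport:
    """Scan a directory of software models and compute basic quality metrics."""
    files = [p for p in root.rglob("*") if p.is_file()]
    formats = Counter(p.suffix.lower() for p in files)
    hashes = Counter(sha256(p.read_bytes()).hexdigest() for p in files)
    duplicates = sum(count - 1 for count in hashes.values() if count > 1)
    return DatasetReport(
        size=len(files),
        format_coverage=dict(formats),
        duplicate_ratio=duplicates / len(files) if files else 0.0,
    )


if __name__ == "__main__":
    # "./models" is a placeholder path for a locally collected model dataset.
    print(benchmark_dataset(Path("./models")))
```

A full platform would naturally go further, for example with language-aware parsing and task-suitability checks, but the basic idea is the same: the dataset, not a trained model, is the object being measured and compared.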
- Addresses the ad hoc collection of software model datasets used for training ML techniques, which limits study comparability.
- Proposes a unified Benchmark Platform for MDE to assess datasets across languages/formats with defined metrics.
- Treats datasets as first-class artifacts to improve reproducibility and reduce bias in AI for software engineering.
Why It Matters
Improves reproducibility and reduces bias in AI research for software engineering by making the quality and task suitability of training and evaluation datasets measurable and comparable.