BioDefect dataset boosts defect detection in bioinformatics software by roughly 30% to 38%
First dedicated dataset for bioinformatics defect detection improves F1 scores by about 30% to 38%.
Software engineering researchers have released BioDefect, the first dataset designed specifically for defect detection in bioinformatics software. It addresses a clear gap: general-purpose defect detection is well studied, but no prior work focused on the particular challenges of bioinformatics code, which often combines complex algorithms with sensitive genomic data. BioDefect includes complete source code repositories rather than isolated snippets, so each defect keeps its full surrounding context and experiments can more closely mirror real-world debugging. The dataset was also curated to address label inconsistency and data leakage, two common problems that undermine the reliability of defect detection experiments. The aim is better quality assurance for bioinformatics software, where undetected bugs can lead to flawed analyses and erroneous scientific conclusions.
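To make the two curation problems concrete, here is a minimal sketch of how duplicates across splits (leakage) and conflicting labels might be flagged. It is not the BioDefect authors' procedure: the function names, the whitespace-stripping normalization, and the toy samples are all assumptions chosen for illustration.

```python
import hashlib

def fingerprint(code: str) -> str:
    """Hash a snippet after removing all whitespace (a crude, assumed normalization)."""
    normalized = "".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_leaks(train, test):
    """Return test samples whose fingerprint also appears in the training split."""
    seen = {fingerprint(code) for code, _ in train}
    return [(code, label) for code, label in test if fingerprint(code) in seen]

def find_label_conflicts(samples):
    """Return fingerprints that occur with more than one label."""
    labels = {}
    for code, label in samples:
        labels.setdefault(fingerprint(code), set()).add(label)
    return {fp for fp, seen in labels.items() if len(seen) > 1}

# Toy (code, label) pairs, invented for this example; 1 = defective, 0 = clean.
train = [("x = a / b", 1), ("y = a + b", 0)]
test = [("x  =  a / b", 1), ("z = a * b", 0)]
mixed = train + [("y = a+b", 1)]

leaks = find_leaks(train, test)          # the reformatted copy of "x = a / b"
conflicts = find_label_conflicts(mixed)  # "y = a + b" carries both labels
```

A real pipeline would likely need a stronger notion of similarity than exact hashes, since near-duplicate code can leak across splits without matching character for character.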
The researchers validated BioDefect through systematic evaluations on nine language models (LMs), including DeepSeek-R1. By controlling for model-related factors, they isolated the effect of the dataset itself. Compared with existing general-purpose defect detection datasets, BioDefect delivered an average F1-score improvement of 29.61% to 38.04% across all models. The size of the gain suggests that domain-specific training data is crucial for strong performance in niche domains such as bioinformatics. The authors position BioDefect as a new benchmark for defect detection in the field, with the goal of more reliable software in genomics, proteomics, and other life sciences areas. The study is published on arXiv.
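For readers unfamiliar with the metric, the sketch below computes F1 from confusion-matrix counts and shows two ways an "improvement" can be expressed. The counts are invented for illustration and are not results from the study; this summary does not say whether the reported 29.61% to 38.04% figures are relative gains or percentage-point differences, so both are shown.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one model trained on two different datasets.
baseline = f1_score(tp=40, fp=30, fn=30)  # general-purpose dataset
domain = f1_score(tp=55, fp=20, fn=15)    # domain-specific dataset

# Difference in percentage points versus gain relative to the baseline.
absolute_gain = (domain - baseline) * 100
relative_gain = (domain - baseline) / baseline * 100

print(f"baseline F1 = {baseline:.4f}, domain F1 = {domain:.4f}")
print(f"absolute gain = {absolute_gain:.2f} points, relative gain = {relative_gain:.2f}%")
```

Because the baseline F1 is below 1, the relative gain is always larger than the percentage-point gain, which is why it matters which of the two a headline number refers to.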
- BioDefect is the first dataset for defect detection in bioinformatics software, including complete source code repositories.
- It addresses label inconsistency and data leakage through careful curation, improving data quality and experimental reliability.
- Tested on nine language models including DeepSeek-R1, it improved F1 scores by 29.61% to 38.04% over existing datasets.
Why It Matters
Bioinformatics software bugs can lead to flawed research; this dataset could significantly improve software reliability in life sciences.