DRAGON: Robust Classification for Very Large Collections of Software Repositories
This new model can finally organize the messy, undocumented codebases developers hate.
Researchers have unveiled DRAGON, a model for classifying very large collections of software repositories. It does not depend on README files: it can classify a repository from lightweight signals alone, such as the file and directory names available from version control. DRAGON raises the F1@5 score from 54.8% to 60.8%, a gain of 6 percentage points that beats the state of the art. Its performance degrades by only 6% when READMEs are missing, making it robust for real-world use. The team also released the largest open dataset for this task: 825,000 repositories from Software Heritage.
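For readers unfamiliar with the F1@5 metric, the sketch below shows one common way to compute F1@k for multi-label classification: take the model's top k predicted labels, count how many are correct, and combine precision and recall. This is an illustrative assumption, not necessarily the exact definition used in the DRAGON evaluation. The function names and example labels are hypothetical.

```python
def f1_at_k(ranked_labels, true_labels, k=5):
    """F1@k for a single repository (one common definition, assumed here).

    ranked_labels: predicted labels, best first, assumed to be unique.
    true_labels:   the ground-truth labels for the repository.
    """
    top_k = ranked_labels[:k]
    truth = set(true_labels)
    if not top_k or not truth:
        return 0.0

    hits = sum(1 for label in top_k if label in truth)
    if hits == 0:
        return 0.0

    precision = hits / len(top_k)   # share of the top-k predictions that are correct
    recall = hits / len(truth)      # share of the true labels that were recovered
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(predictions, references, k=5):
    """Average F1@k over a collection of repositories."""
    scores = [f1_at_k(p, r, k) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: 2 of the top 5 predictions match 3 true labels.
predicted = ["python", "web", "cli", "machine-learning", "database"]
actual = ["web", "machine-learning", "testing"]
score = f1_at_k(predicted, actual, k=5)  # precision 2/5, recall 2/3
```

Under this definition, a reported figure such as 60.8% would be the average of the per-repository scores over the whole test set.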
Why It Matters
DRAGON makes it practical to organize and discover undocumented code at scale, unlocking value from massive, messy software archives.