Developer Tools

GitHub repo study: READMEs, Actions, and LLMs dominate 10K projects

After analyzing 10,000 repos spanning 10 years, researchers find GitHub's standards are reshaping open source.

Deep Dive

A new empirical study posted to arXiv (arXiv:2605.16701) examines the contents of 10,000 GitHub repositories spanning ten years. The researchers, Andre Hora, João Eduardo Montandon, and Diego Elias Costa, analyzed files, directories, and extensions to map how the platform's ecosystem has evolved. Their findings reveal a dramatic standardization: README.md, .gitignore, and LICENSE files are now near-ubiquitous, cementing GitHub's own conventions as de facto open-source norms.

Beyond boilerplate files, the study highlights the rise of GitHub Actions as the dominant CI/CD platform, overtaking earlier tools. Configuration formats have shifted: TOML, YAML, and JSON have grown significantly, while XML has declined. Dockerfiles are increasingly common, reflecting the spread of containerization. Notably, emerging content related to LLMs and generative AI (e.g., .github/ workflows) signals new trends. The authors discuss the implications: open source is not only evolving organically but also being guided by GitHub's standards, with clear winners and losers among technologies. The study offers a benchmark for future software repository mining research.
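To make the kind of measurement described above concrete, here is a minimal Python sketch of how one repository's file listing could be checked for the artifacts the article names. It is an illustration under stated assumptions, not the authors' pipeline: the function name summarize_repo, the two constant tuples, and the example paths are all hypothetical, and detecting GitHub Actions by looking for YAML files under .github/workflows is a common heuristic rather than the study's documented method.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: artifact and extension lists are taken from the files the
# article mentions; the study itself may track a different set.
STANDARD_ARTIFACTS = ("README.md", ".gitignore", "LICENSE")
CONFIG_EXTENSIONS = (".toml", ".yaml", ".yml", ".json", ".xml")


def summarize_repo(paths):
    """Summarize one repository from its file paths (relative to the repo root)."""
    parsed = [PurePosixPath(p) for p in paths]

    # Names of files that sit directly in the repository root.
    root_names = {p.name for p in parsed if len(p.parts) == 1}

    # Count only the configuration-format extensions of interest.
    extensions = Counter(
        p.suffix.lower() for p in parsed if p.suffix.lower() in CONFIG_EXTENSIONS
    )

    return {
        "artifacts": {name: name in root_names for name in STANDARD_ARTIFACTS},
        # Heuristic: a YAML file under .github/workflows suggests GitHub Actions.
        "uses_github_actions": any(
            p.parts[:2] == (".github", "workflows")
            and p.suffix.lower() in (".yml", ".yaml")
            for p in parsed
        ),
        "has_dockerfile": any(p.name == "Dockerfile" for p in parsed),
        "config_extensions": dict(extensions),
    }


# Hypothetical file listing for a single repository.
example_paths = [
    "README.md",
    ".gitignore",
    "pyproject.toml",
    ".github/workflows/ci.yml",
    "docker/Dockerfile",
    "src/app/settings.json",
]
summary = summarize_repo(example_paths)
print(summary)
```

Running a function like this over snapshots of many repositories taken in different years, then aggregating the results, would give the sort of trend lines the study reports, although the real analysis presumably handles many more file types and edge cases.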

Key Points
  • Standard artifacts (README.md, .gitignore, LICENSE) now appear in nearly all 10K repos, showing GitHub's influence on open-source norms.
  • GitHub Actions has become the dominant CI/CD platform, replacing earlier tools like Travis CI and Jenkins.
  • The configuration formats TOML, YAML, and JSON grew 3x over a decade while XML declined; Dockerfiles surged, and LLM-related content has begun to appear.

Why It Matters

Open-source development is being homogenized by GitHub's defaults, a shift that affects tooling, hiring, and how projects are structured.