Developer Tools

New dataset captures AI coding agent configurations from 4,738 repos

15,591 configuration artifacts across Claude Code, OpenAI Codex, and more

Deep Dive

A team led by Matthias Galster has published the first large-scale dataset of configuration artifacts for agentic AI coding tools, addressing a gap in software engineering research. The team started from 40,585 actively maintained GitHub repositories and used GPT-5.2 to filter them down to 36,710 engineered software projects. The dataset catalogs 15,591 configuration artifacts from 4,738 repositories, covering five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, and Gemini) and eight configuration mechanisms, including Context Files, Skills, Rules, and Hooks. It also includes 18,167 full configuration files and 148,519 AI-co-authored commits. Everything is publicly available on Zenodo under CC BY 4.0, alongside an interactive website for browsing.
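To make the idea of cataloging artifacts by tool and mechanism concrete, here is a minimal sketch in Python. It is not the dataset's actual detection logic, which the summary above does not describe. The path conventions (CLAUDE.md, AGENTS.md, GEMINI.md, .github/copilot-instructions.md, .cursor/rules, .claude/skills) are commonly used by these tools, but the mapping to tool and mechanism labels, and the function names classify and count_artifacts, are assumptions made for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def classify(path):
    """Map a repo-relative file path to a (tool, mechanism) pair, or None.

    Illustrative rules only; the dataset's real rules may differ.
    """
    p = PurePosixPath(path)
    name, parts = p.name, p.parts
    if name == "CLAUDE.md":
        return ("Claude Code", "Context Files")
    if name == "AGENTS.md":
        # Simplification: AGENTS.md is read by several tools, not only Codex.
        return ("OpenAI Codex", "Context Files")
    if name == "GEMINI.md":
        return ("Gemini", "Context Files")
    if parts[-2:] == (".github", "copilot-instructions.md"):
        return ("GitHub Copilot", "Context Files")
    if name == ".cursorrules" or parts[:2] == (".cursor", "rules"):
        return ("Cursor", "Rules")
    if parts[:2] == (".claude", "skills") and name == "SKILL.md":
        return ("Claude Code", "Skills")
    return None


def count_artifacts(paths):
    """Count recognized artifacts per (tool, mechanism) across file paths."""
    labels = (classify(p) for p in paths)
    return Counter(label for label in labels if label is not None)


if __name__ == "__main__":
    example_paths = [
        "CLAUDE.md",
        "docs/AGENTS.md",
        ".github/copilot-instructions.md",
        ".cursor/rules/style.mdc",
        ".claude/skills/release/SKILL.md",
        "src/main.py",
    ]
    for (tool, mechanism), n in sorted(count_artifacts(example_paths).items()):
        print(f"{tool}: {mechanism} = {n}")
```

Path matching alone would not cover every mechanism. Hooks, for example, are typically declared inside a settings file, so detecting them would require parsing file contents rather than file names.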

The dataset enables researchers to study how developers steer agentic AI through repository-level configurations, a practice known as context engineering. By analyzing real-world usage, teams can examine adoption trends, common pitfalls, and the instructions developers actually write for tools like Claude Code and OpenAI Codex. The construction pipeline is also open, so others can replicate or extend the dataset. The work, presented at AIware 2026, gives researchers a basis for studying human-AI collaboration in coding and how agent behavior is configured at scale.
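For readers who want to extend the dataset, one step such a pipeline needs is recognizing AI-co-authored commits. The sketch below is one possible approach, not the published pipeline: it looks for the standard Git "Co-authored-by" trailer in a commit message and checks the co-author identity against a hypothetical list of agent name markers (AGENT_MARKERS). Both the marker list and the function name ai_coauthors are assumptions for illustration.

```python
import re

# Hypothetical substrings used to recognize agent identities in trailers.
# A real pipeline would need a curated list of exact names and emails.
AGENT_MARKERS = ("claude", "copilot", "codex", "cursor", "gemini")

# Matches lines such as: Co-authored-by: Name <email@example.com>
TRAILER = re.compile(
    r"^co-authored-by:\s*(?P<name>[^<]*)<(?P<email>[^>]*)>\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def ai_coauthors(message):
    """Return the co-author names in a commit message that look like AI agents.

    Substring matching is crude: a human whose name contains a marker
    would be misclassified.
    """
    found = []
    for match in TRAILER.finditer(message):
        name = match.group("name").strip()
        identity = f"{name} {match.group('email')}".lower()
        if any(marker in identity for marker in AGENT_MARKERS):
            found.append(name)
    return found


if __name__ == "__main__":
    commit_message = (
        "Fix parser edge case\n"
        "\n"
        "Co-Authored-By: Claude <noreply@anthropic.com>\n"
        "Co-authored-by: Jane Doe <jane@example.com>\n"
    )
    print(ai_coauthors(commit_message))
```

Trailers are only one signal. Commits authored directly by a bot account, or agent-written commits with no trailer at all, would need separate handling.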

Key Points
  • Dataset covers 4,738 repositories across 5 agentic AI coding tools: Claude Code, GitHub Copilot, OpenAI Codex, Cursor, and Gemini.
  • Includes 15,591 configuration artifacts spanning 8 mechanisms like Context Files, Skills, Rules, and Hooks.
  • Comes with 148,519 AI-co-authored commits and an interactive website for browsing; the dataset is released under CC BY 4.0.

Why It Matters

This dataset gives researchers and teams real-world evidence of how AI coding agents are configured in software projects, which they can use to inform their own configurations.