Bhardwaj et al. find ML dataset docs lack critical reflexivity
Structured documentation such as datasheets does little to prompt self-reflection in dataset creation.
A team of researchers led by Eshta Bhardwaj, Ciara Zogheib, and Christoph Becker at the University of Toronto has published a study evaluating whether structured documentation frameworks (such as datasheets, data statements, and dataset nutrition labels) actually promote reflexivity in dataset development. Reflexivity, the practice of critically examining one's own assumptions and biases during dataset creation, is a goal that framework creators often cite. The paper uses a mixed-methods approach, combining thematic analysis with corpus-assisted discourse analysis, to compare the frameworks against established reflexivity literature from the FAccT community.
The empirical results are stark: both the framework guidelines and their real-world published responses show minimal engagement with major themes of reflexivity. The authors developed a codebook of essential reflexivity topics and recommend actionable strategies, including a set of extended datasheet questions. These additions aim to push dataset developers toward deeper self-examination of their ethical choices at every stage, from problem formulation to data processing and reuse. The findings highlight a critical gap between the stated goals of structured documentation and its actual implementation, and urge the ML community to take reflexivity seriously as a tool for responsible AI development.
- Structured documentation frameworks (datasheets, data statements, nutrition labels) fail to operationalize reflexivity concepts from FAccT literature
- Mixed-methods analysis combined thematic coding with corpus-assisted discourse analysis of both framework templates and published responses
- Proposes a codebook of reflexivity topics and 11 new or revised datasheet questions to embed critical self-reflection into dataset development
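To make the corpus-assisted side of the method concrete, here is a minimal Python sketch of how one might count codebook terms in a piece of dataset documentation. It is only an illustration under stated assumptions: the themes and terms in `CODEBOOK`, the function name `theme_counts`, and the sample sentence are all hypothetical and are not taken from the paper's codebook or analysis pipeline.

```python
import re
from collections import Counter

# Hypothetical codebook: theme -> indicator terms. These entries are
# illustrative placeholders, NOT the codebook developed in the paper.
CODEBOOK = {
    "positionality": ["positionality", "standpoint", "our background"],
    "assumptions": ["assume", "assumption", "bias"],
}


def theme_counts(text, codebook=CODEBOOK):
    """Count case-insensitive occurrences of each theme's terms in text."""
    lowered = text.lower()
    counts = Counter()
    for theme, terms in codebook.items():
        for term in terms:
            # Word boundary at the start only, so "assume" also
            # matches inflected forms such as "assumed".
            pattern = r"\b" + re.escape(term)
            counts[theme] += len(re.findall(pattern, lowered))
    return counts


# Hypothetical documentation snippet, used only to exercise the function.
sample = "The data was scraped from public forums. We assumed users were adults."
for theme, n in theme_counts(sample).items():
    print(theme, n)
```

A theme with a count of zero across many documents would be a candidate for closer qualitative reading. Raw term counts like these are only a starting point; they cannot replace the thematic coding the study pairs them with.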
Why It Matters
For ML practitioners and dataset creators, these findings expose a blind spot in responsible AI documentation practices that must be addressed.