[P] Made a dataset but don't know what to do with it
A weekend project to collect NTSB and FAA final reports has created a unique, open-source aviation safety dataset.
A Reddit user's weekend curiosity about aviation safety has produced a potentially valuable resource for the AI and data science community. User AbdullahKhanSherwani set out to find an open-source dataset containing the text of final reports from major air crashes, like those from the NTSB (National Transportation Safety Board) or FAA (Federal Aviation Administration). Finding none, they began collecting and cleaning these official investigation documents, which detail causes, contributing factors, and safety recommendations. The project has now reached a crossroads: a structured dataset exists, but its most impactful application is unclear.
The creator has turned to the r/MachineLearning community to brainstorm practical AI implementations. Primary suggestions include building a RAG (Retrieval-Augmented Generation) system, which would allow aviation engineers, investigators, or journalists to query the corpus of past incidents for similar failures or recommendations. Other proposed uses involve fine-tuning a model for automated report summarization, sentiment analysis on safety culture, or temporal pattern recognition to identify recurring technical or human-factor issues across decades of aviation history. The project highlights a common gap in AI: the infrastructure (data) is built before the specific use case is fully defined.
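To make the RAG suggestion concrete, here is a minimal sketch of the retrieval half only, using plain TF-IDF and cosine similarity from the Python standard library. It is an illustration under stated assumptions, not the creator's implementation: the report snippets, the `REPORTS` dictionary, and every function name are hypothetical placeholders, and a real system would likely use an embedding model and the actual report text.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder snippets for illustration only.
# These are NOT real NTSB/FAA report text.
REPORTS = {
    "report_a": "Engine failure after fuel starvation caused by incorrect fuel selector position.",
    "report_b": "Runway excursion during landing in heavy rain with hydroplaning on a wet runway.",
    "report_c": "Loss of control following pitot tube icing and unreliable airspeed indications.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector, ignoring terms absent from the index."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + freq)) + 1.0
        for term, freq in doc_freq.items()
    }
    vectors = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, idf, vectors, top_k=2):
    """Return the top_k (doc_id, score) pairs most similar to the query."""
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


idf, vectors = build_index(REPORTS)
results = retrieve("icing and airspeed problems", idf, vectors)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In a full RAG pipeline, the retrieved passages would then be passed to a language model as context for answering the user's question; that generation step is omitted here.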
- User created what may be the first open dataset of cleaned NTSB/FAA air crash final reports.
- The project is now in a 'solution seeking a problem' phase, with the creator crowdsourcing AI use cases.
- Top community suggestion is to build a RAG system for querying historical incident patterns and safety recommendations.
Why It Matters
The project demonstrates the value of niche public datasets for AI-assisted aviation safety research, and it shows how often data gets built before a specific use case is defined.