Research & Papers

[D] How do you track data lineage in your ML pipelines? Most teams I've talked to do it manually (or not at all)

Open-source tool automatically tracks data lineage across pandas/numpy operations with zero configuration

Deep Dive

PhD researcher Kishan Raj built AutoLineage, an open-source Python library that automatically tracks data lineage in ML pipelines. It uses function hooking to intercept pandas and NumPy I/O operations, so reads and writes are recorded without manual logging. From those records it generates visual lineage graphs and produces compliance reports for regulations such as the EU AI Act. Adding 'import autolineage' to existing code is enough to start tracking data transformations, reads and writes, and model training dependencies.
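To make the function-hooking idea concrete, here is a minimal sketch of the general technique, not AutoLineage's actual implementation. Every name in it (`hook`, `LINEAGE`, `lineage_edges`, and the stand-in `fake_io` object) is hypothetical. The stand-in takes the place of a real I/O library so the example runs with the standard library alone; a real tool would presumably install similar wrappers on functions such as `pandas.read_csv` when the tracking module is imported, and hooks on methods would also have to account for `self`.

```python
import functools
import types

# Lineage log: one (operation, function name, path) tuple per intercepted call.
LINEAGE = []


def hook(owner, name, operation):
    """Replace owner.<name> with a wrapper that records the path argument."""
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(path, *args, **kwargs):
        LINEAGE.append((operation, name, str(path)))
        return original(path, *args, **kwargs)

    setattr(owner, name, wrapper)
    return original  # returned so the caller can restore it later


# Hypothetical stand-ins for a library's read and write functions.
def _read_csv(path):
    return [["a", "b"], ["1", "2"]]


def _to_csv(path, rows):
    return len(rows)


fake_io = types.SimpleNamespace(read_csv=_read_csv, to_csv=_to_csv)

# Installing the hooks is the step a tracking library could run at import time.
hook(fake_io, "read_csv", "read")
hook(fake_io, "to_csv", "write")

# Unmodified "user code": it calls the same names as before the hooks existed.
rows = fake_io.read_csv("raw.csv")
fake_io.to_csv("clean.csv", rows)


def lineage_edges(log):
    """Link every earlier read to each later write as (source, output)."""
    edges, seen_reads = [], []
    for operation, _name, path in log:
        if operation == "read":
            seen_reads.append(path)
        else:
            edges.extend((source, path) for source in seen_reads)
    return edges


print(lineage_edges(LINEAGE))
```

Linking every earlier read to every later write is a deliberately crude heuristic that over-connects files in longer scripts. A real tracker would need to follow which in-memory objects actually flow from a read into a write.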

Why It Matters

AutoLineage addresses gaps in ML reproducibility and regulatory compliance as AI systems face increasing documentation requirements.