Developer Tools

End-to-end lineage with DVC and Amazon SageMaker AI MLflow apps

New architecture combines DVC, SageMaker, and MLflow to solve model traceability gaps in regulated industries.

Deep Dive

AWS has partnered with DVC (Data Version Control) on an architecture for machine learning lineage tracking, a recurring pain point for production ML teams. It integrates three tools: DVC for versioning datasets and tying each version to a Git commit, Amazon SageMaker for scalable processing and training, and SageMaker MLflow Apps for experiment tracking and the model registry. Together they address a common problem: teams cannot tell which exact data version trained a production model, or reproduce a model that was deployed months earlier.

The solution creates a complete traceability chain: Production Model → MLflow Run → DVC commit → exact dataset in Amazon S3. DVC versions data through lightweight .dvc metafiles committed to Git while the data itself lives in S3, which sidesteps Git's poor handling of large files. During each training run, SageMaker MLflow Apps record the hash of the Git commit that holds those metafiles (data_git_commit_id), creating an auditable link between a model and its training data. This matters most in regulated industries such as healthcare and finance, where audits demand precise data-to-model linkage and the ability to exclude specific records from future training.
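The chain described above can be walked backwards during an audit. The sketch below is a minimal illustration, not the companion notebooks' code: it assumes the commit hash was logged as an MLflow run parameter named data_git_commit_id (the source gives the name but not whether it is a parameter or a tag), and it accepts any client object that behaves like mlflow.tracking.MlflowClient. The LineageRecord class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LineageRecord:
    """One resolved link from a registered model version to its data commit."""
    model_name: str
    model_version: str
    run_id: str
    data_git_commit_id: str


def trace_model_to_data(client, model_name, model_version):
    """Walk Production Model -> MLflow Run -> Git commit for one model version.

    `client` is expected to expose get_model_version() and get_run() the way
    mlflow.tracking.MlflowClient does.
    """
    # Registered model version -> the run that produced it.
    mv = client.get_model_version(model_name, model_version)
    run = client.get_run(mv.run_id)

    # Run -> the Git commit that pins the .dvc metafiles.
    # Assumption: the hash was logged as a run parameter.
    commit = run.data.params.get("data_git_commit_id")
    if commit is None:
        raise LookupError(f"run {mv.run_id} has no data_git_commit_id parameter")

    return LineageRecord(model_name, str(model_version), mv.run_id, commit)
```

With a real tracking server this would be called as trace_model_to_data(MlflowClient(), "churn-model", "3"), where the model name and version are placeholders. Checking out the returned commit and running dvc pull then restores the dataset from S3.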

The architecture supports both dataset-level and record-level lineage through deployable patterns that teams can implement in their own AWS accounts using companion notebooks. In the workflow, SageMaker Processing jobs preprocess the data and version it with DVC; Training jobs then pull a specific dataset version and log the run, together with that version's commit hash, to MLflow. This end-to-end approach turns what were previously multi-day investigations through scattered logs and S3 buckets into automated, auditable workflows.
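The training side of this workflow can be sketched as two steps: restore the dataset as of a given commit, then record that commit on the MLflow run. This is an illustration under stated assumptions rather than the published implementation. The function names, the workdir default, and the idea of cloning the repository inside the job are hypothetical; the git and dvc commands and the mlflow calls are standard ones. The SHA check assumes a repository using Git's default SHA-1 object format.

```python
import re
import subprocess

FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def dataset_fetch_commands(repo_url, commit_sha, target, workdir="repo"):
    """Return (argv, cwd) pairs that restore `target` as of `commit_sha`."""
    # Require a full SHA so the run is pinned to an immutable commit,
    # not to a branch name that can move later.
    if not FULL_SHA.match(commit_sha):
        raise ValueError("expected a full 40-character Git commit SHA")
    return [
        (["git", "clone", repo_url, workdir], None),
        (["git", "checkout", commit_sha], workdir),
        # dvc pull downloads the data referenced by the .dvc metafiles
        # at this commit from the configured remote (S3 in this setup).
        (["dvc", "pull", target], workdir),
    ]


def fetch_dataset(repo_url, commit_sha, target, workdir="repo"):
    """Execute the fetch commands; raises CalledProcessError on failure."""
    for argv, cwd in dataset_fetch_commands(repo_url, commit_sha, target, workdir):
        subprocess.run(argv, cwd=cwd, check=True)


def log_lineage(commit_sha, tracking_uri, extra_params=None):
    """Start an MLflow run and record the data commit on it."""
    import mlflow  # imported lazily so the helpers above work without it

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run() as run:
        mlflow.log_param("data_git_commit_id", commit_sha)
        if extra_params:
            mlflow.log_params(extra_params)
        return run.info.run_id
```

Keeping the command list separate from its execution makes the pinning logic easy to unit test without network access. The tracking URI is whatever your MLflow environment expects; its exact form for SageMaker MLflow Apps is not given in the source.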

Key Points
  • Combines DVC's data versioning with SageMaker MLflow Apps to create a Production Model → MLflow Run → DVC commit → S3 dataset traceability chain
  • Meets audit requirements in regulated industries, where linking deployed models to their exact training data is mandatory
  • Uses DVC's lightweight .dvc metafiles in Git with the actual data in S3, avoiding Git's poor handling of large files and the 100 MB per-file limit enforced by hosts such as GitHub
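To make the metafile point concrete, the sketch below reads a single-output .dvc file and derives where the corresponding object would sit in the remote. It rests on two assumptions that should be checked against your DVC version: that the metafile lists md5 and path under outs, and that the remote uses the files/md5/<first two characters>/<rest> layout of DVC 3.x (older releases used a different layout). The function names, bucket, and hash value are illustrative. For an authoritative answer, DVC's own dvc.api.get_url is the better tool.

```python
import re


def parse_dvc_metafile(text):
    """Extract the md5 and path of the first output in a .dvc metafile.

    A deliberately small regex-based reader for simple, single-output
    metafiles; a real implementation would use a YAML parser.
    """
    md5 = re.search(r"^\s*-?\s*md5:\s*(\S+)\s*$", text, re.MULTILINE)
    path = re.search(r"^\s*-?\s*path:\s*(\S+)\s*$", text, re.MULTILINE)
    if md5 is None or path is None:
        raise ValueError("metafile has no md5/path entry")
    return {"md5": md5.group(1), "path": path.group(1)}


def remote_object_url(remote_url, md5):
    """Build the object URL for a hash, assuming the DVC 3.x remote layout."""
    return f"{remote_url.rstrip('/')}/files/md5/{md5[:2]}/{md5[2:]}"


# Illustrative metafile content; the hash and sizes are made up.
EXAMPLE_METAFILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  size: 14445097
  hash: md5
  path: data/train.csv
"""

if __name__ == "__main__":
    info = parse_dvc_metafile(EXAMPLE_METAFILE)
    print(info["path"], "->", remote_object_url("s3://my-bucket/dvc", info["md5"]))
```

The metafile is a few lines of text regardless of dataset size, which is why it can live in Git while the data stays in S3.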

Why It Matters

Replaces multi-day investigations into model provenance with an automated, auditable trail, and supports compliance in regulated industries such as healthcare and finance.