Databricks & AWS unite for governed LLM fine-tuning with SageMaker AI
Fine-tune Ministral-3-3B-Instruct on governed data without bypassing Unity Catalog's permissions.
Databricks and AWS have released a reference architecture for fine-tuning large language models (LLMs) with Amazon SageMaker AI while preserving Unity Catalog governance. The core challenge is that SageMaker Training jobs, when reading from Amazon S3, often bypass Unity Catalog's authorization model, which creates audit gaps and compliance risks. The architecture addresses this by preprocessing the training data with Apache Spark on Amazon EMR Serverless, which accesses the data through Unity Catalog's Open REST APIs. The OAuth credentials (client ID and secret) required for programmatic access are stored in AWS Secrets Manager and used to authenticate the Spark sessions.
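The credential flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the reference architecture's actual code: the secret layout (`client_id` and `client_secret` keys), the `/oidc/v1/token` client-credentials endpoint with the `all-apis` scope, and the `uri`/`token` Spark catalog options are all assumptions, and the catalog implementation class is left as a parameter because the source does not name it.

```python
import base64
import json
import urllib.parse
import urllib.request


def parse_oauth_secret(secret_string: str) -> dict:
    """Extract the OAuth client ID and secret from a Secrets Manager payload."""
    # Assumed secret layout: {"client_id": "...", "client_secret": "..."}
    secret = json.loads(secret_string)
    missing = {"client_id", "client_secret"} - secret.keys()
    if missing:
        raise KeyError(f"secret is missing keys: {sorted(missing)}")
    return {"client_id": secret["client_id"],
            "client_secret": secret["client_secret"]}


def load_oauth_secret(secret_id: str, region: str) -> dict:
    """Fetch and parse the secret from AWS Secrets Manager."""
    import boto3  # imported lazily so the pure helpers run without AWS access

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return parse_oauth_secret(response["SecretString"])


def build_token_request(workspace_url: str, creds: dict) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    pair = f"{creds['client_id']}:{creds['client_secret']}".encode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}  # assumed scope
    ).encode()
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/oidc/v1/token",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": "Basic " + base64.b64encode(pair).decode(),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def spark_catalog_conf(catalog: str, catalog_impl: str, uri: str, token: str) -> dict:
    """Return Spark settings that point a named catalog at a REST endpoint."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: catalog_impl,      # hypothetical: supplied by the caller
        f"{prefix}.uri": uri,      # assumed option name
        f"{prefix}.token": token,  # assumed option name
    }
```

In a Spark job, each key/value pair returned by `spark_catalog_conf` would be passed to `SparkSession.builder.config(key, value)` before the session is created, so the client secret itself never has to appear in the job definition.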
After preprocessing, the cleaned dataset is written to a new table in Unity Catalog. SageMaker AI then retrieves the Ministral-3-3B-Instruct model from Hugging Face, fine-tunes it against that governed table, and stores the output artifacts in a Unity Catalog-managed S3 bucket. Finally, the model is registered in Unity Catalog with external lineage tracking, which connects the source data to the trained model. This integration lets enterprises keep their existing AWS services (SageMaker, EMR, S3) while enforcing consistent permissions, so regulated industries can use these ML tools without weakening compliance controls.
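As a rough sketch of the fine-tuning step, the function below assembles a request for the SageMaker `CreateTrainingJob` API (via boto3's `create_training_job`). The source does not describe how the training container reads the governed table, so this example only passes the table name and model ID as hyperparameters and deliberately omits a raw S3 input channel; the hyperparameter names, instance type, volume size, and runtime limit are illustrative assumptions, and every argument value is a placeholder supplied by the caller.

```python
def build_training_job_request(
    job_name: str,
    role_arn: str,
    image_uri: str,
    governed_table: str,
    output_s3_uri: str,
    model_id: str,
    instance_type: str = "ml.g5.2xlarge",  # assumed instance type
    max_runtime_s: int = 4 * 3600,         # assumed runtime limit
) -> dict:
    """Assemble keyword arguments for boto3's sagemaker create_training_job."""
    # Unity Catalog tables use a three-level name: catalog.schema.table
    if governed_table.count(".") != 2:
        raise ValueError("expected a table name of the form catalog.schema.table")
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # Hyperparameter values must be strings; these key names are hypothetical
        # and would be interpreted by the training script.
        "HyperParameters": {
            "model_id": model_id,
            "governed_table": governed_table,
        },
        # Output location; in the described architecture this would be the
        # Unity Catalog-managed bucket.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,  # assumed volume size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


def submit_training_job(request: dict, region: str) -> str:
    """Submit the request and return the training job ARN."""
    import boto3  # imported lazily so the builder runs without AWS access

    client = boto3.client("sagemaker", region_name=region)
    return client.create_training_job(**request)["TrainingJobArn"]
```

Keeping the request builder separate from the submit call makes it possible to review or log the exact job definition, including which governed table it targets, before anything runs.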
- Uses Amazon EMR Serverless with Apache Spark to preprocess data while respecting Unity Catalog's access controls
- Fine-tunes the Ministral-3-3B-Instruct model from Hugging Face via SageMaker AI Training jobs
- Stores OAuth 2.0 credentials in AWS Secrets Manager for programmatic access to Unity Catalog APIs
Why It Matters
Bridges Databricks governance with AWS AI services, letting regulated enterprises fine-tune LLMs without compliance exposure.