Databricks & AWS unite for governed LLM fine-tuning with SageMaker AI

AWS Machine Learning Blog May 14, 2026

⚡ Fine-tune Ministral-3-3B-Instruct on governed data without breaking Unity Catalog's permissions.

Deep Dive

Databricks and AWS have released a reference architecture for fine-tuning large language models (LLMs) with Amazon SageMaker AI while keeping Unity Catalog's governance intact. The core challenge is that SageMaker Training jobs that read directly from Amazon S3 often bypass Unity Catalog's authorization model, which creates audit gaps and compliance risks. The architecture addresses this by running the preprocessing step on Amazon EMR Serverless, where Apache Spark reads the training data through Unity Catalog's Open REST APIs rather than from raw S3 paths. The OAuth credentials for programmatic access (a client ID and secret) are stored in AWS Secrets Manager and used to authenticate the Spark sessions.
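As a rough illustration of the credential flow described above, the following Python sketch reads an OAuth client ID and secret from AWS Secrets Manager and turns them into Spark catalog settings. It rests on several assumptions that the summary does not confirm: that the secret is a JSON document with `client_id` and `client_secret` keys, that Spark reaches Unity Catalog through an Iceberg REST catalog endpoint, and that the endpoint and token URLs are supplied by the caller. The function names are hypothetical.

```python
import json


def load_oauth_secret(secret_id: str, region: str) -> dict:
    """Fetch the OAuth client ID/secret JSON from AWS Secrets Manager."""
    # boto3 is imported lazily so the pure helper below can be used without AWS access.
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def unity_catalog_spark_conf(
    secret: dict, catalog: str, catalog_uri: str, token_uri: str
) -> dict:
    """Build Spark settings for an Iceberg REST catalog using OAuth client credentials.

    Assumption: the secret holds 'client_id' and 'client_secret' keys, and
    catalog_uri / token_uri are placeholders for the workspace's real endpoints.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": catalog_uri,
        f"{prefix}.oauth2-server-uri": token_uri,
        # Iceberg's REST client expects the credential as "<id>:<secret>".
        f"{prefix}.credential": f"{secret['client_id']}:{secret['client_secret']}",
        f"{prefix}.warehouse": catalog,
    }
```

Each resulting key/value pair could then be passed to `SparkSession.builder.config(key, value)` before the session is created. Keeping the secret lookup separate from the configuration builder means the credentials never need to appear in job definitions or source code.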

After preprocessing, a new table is created in Unity Catalog containing the clean dataset. SageMaker AI then retrieves the Ministral-3-3B-Instruct model from Hugging Face, fine-tunes it against that governed table, and stores the output artifacts back into a Unity Catalog-managed S3 bucket. Finally, the model is registered in Unity Catalog with external lineage tracking, connecting source data to the trained model. This integration lets enterprises keep their existing AWS services (SageMaker, EMR, S3) while enforcing consistent permissions, enabling regulated industries to use best-in-class ML tools without sacrificing compliance.
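The fine-tuning step might be launched along the lines of the sketch below, which assembles a request for the SageMaker `create_training_job` API. This is not the article's implementation: the hyperparameter names `model_id` and `uc_table` are hypothetical conventions that a custom training script would have to interpret, the instance type, volume size, and runtime limit are placeholder defaults, and the container image, role, and S3 output location must be supplied by the caller. The table is passed by name rather than as an S3 input channel, on the assumption that the training script reads it through Unity Catalog.

```python
def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    output_s3_uri: str,
    model_id: str,
    uc_table: str,
    instance_type: str = "ml.g5.2xlarge",  # placeholder default, not from the article
) -> dict:
    """Assemble keyword arguments for SageMaker's create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Hyperparameter values must be strings; the key names are hypothetical.
        "HyperParameters": {"model_id": model_id, "uc_table": uc_table},
        # Artifacts are written to a bucket that Unity Catalog manages.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 6 * 3600},
    }


def submit_training_job(request: dict, region: str) -> str:
    """Submit the request and return the training job ARN."""
    import boto3  # lazy import so the builder above works without AWS access

    sagemaker = boto3.client("sagemaker", region_name=region)
    return sagemaker.create_training_job(**request)["TrainingJobArn"]
```

Building the request as plain data before submitting it makes the job definition easy to review and log, which fits the audit-oriented goals of the architecture.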

Key Points

Uses Amazon EMR Serverless with Apache Spark to preprocess data while respecting Unity Catalog's access controls
Fine-tunes the Ministral-3-3B-Instruct model from Hugging Face via SageMaker AI Training jobs
OAuth 2.0 credentials stored in AWS Secrets Manager enable programmatic access to Unity Catalog APIs

Why It Matters

Bridges Databricks governance with AWS AI services, letting regulated enterprises fine-tune LLMs without compliance exposure.
