Research & Papers

EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

A new diffusion model tackles imbalanced datasets by creating realistic fraudulent transaction samples.

Deep Dive

Researchers En-Ya Kuo and Sebastien Motsch have introduced EmDT (Embedding Diffusion Transformer), a novel diffusion model specifically designed to generate synthetic tabular data for fraud detection. The core innovation addresses a persistent problem in the field: imbalanced datasets where fraudulent transactions are rare, causing classifiers like XGBoost to be biased toward the majority class. EmDT tackles this by first using UMAP, a dimensionality reduction technique, to cluster and identify distinct patterns within the existing fraudulent data. It then trains a Transformer-based denoising network, equipped with sinusoidal positional embeddings, to learn the complex relationships between different data features (like transaction amount, location, time) throughout a diffusion process. This allows it to generate new, realistic samples that mimic the statistical properties of real fraud.

Once the synthetic fraudulent data is generated, it is combined with the original dataset to train a standard, high-performing classifier such as XGBoost. Experiments conducted on a real-world credit card fraud detection dataset show that EmDT significantly improves the performance of the downstream classifier compared to traditional oversampling methods (like SMOTE) and other generative approaches. Crucially, the model maintains a strong level of privacy protection for the original data and successfully preserves the intricate correlations between features, which is essential for maintaining the integrity and usefulness of the synthetic data for training accurate models.

Key Points
  • EmDT uses UMAP clustering and a Transformer denoising network to generate realistic synthetic fraud data.
  • The model significantly improved classifier performance on a credit card fraud dataset versus existing methods.
  • It maintains data privacy and preserves critical feature correlations present in the original tabular data.

Why It Matters

This provides a more effective way to train fraud detection systems on rare events, potentially saving financial institutions billions.