ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning
A new RL pipeline boosts data utility by 20% in low-data scenarios by focusing on predictive signals.
A research team led by Xiaofeng Lin has introduced ReTabSyn, a novel pipeline for generating synthetic tabular data using reinforcement learning. The core innovation addresses a key weakness in existing deep generative models: their struggle to learn complex data distributions in low-data, imbalanced settings. Instead of striving to replicate the full joint probability distribution of the data—an often inefficient goal—ReTabSyn's training objective directs the generator to prioritize learning the conditional distribution P(y|X). This means the model focuses on preserving the feature correlations most critical for accurate prediction of the target variable, making synthetic data generation far more data-efficient.
The system works by providing direct, reinforced feedback to the generator during training, specifically rewarding it for preserving these vital predictive signals. The researchers fine-tuned a language-model-based generator using this RL approach. In rigorous benchmarks simulating real-world challenges like small sample sizes, severe class imbalance, and distribution shift, ReTabSyn consistently outperformed current state-of-the-art baselines. The synthetic data it produces leads to stronger downstream model performance, effectively mitigating data scarcity and privacy concerns.
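The exact reward ReTabSyn uses is not detailed in this summary, but a common way to score whether synthetic rows preserve the predictive signal P(y|X) is a "train on synthetic, test on real" (TSTR) check: fit a simple classifier on generated rows and reward the generator with its accuracy on held-out real rows. The sketch below is an illustrative stand-in using a nearest-centroid classifier, not the authors' implementation:

```python
import numpy as np

def tstr_reward(synth_X, synth_y, real_X, real_y):
    """Reward = accuracy on real data of a classifier fit on synthetic data.

    Uses a nearest-centroid classifier purely for illustration; any
    downstream model could stand in for it.
    """
    classes = np.unique(synth_y)
    # One centroid per class, estimated from the synthetic rows.
    centroids = np.stack([synth_X[synth_y == c].mean(axis=0) for c in classes])
    # Distance from every real row to every class centroid.
    dists = np.linalg.norm(real_X[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == real_y).mean())
```

A high reward indicates the generator kept the feature-target correlations that matter for prediction, even if it distorted other parts of the joint distribution.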
Furthermore, the reinforcement learning framework offers significant extensibility. It can be readily adapted to incorporate expert-specified constraints on the generated observations, allowing for greater control over the synthetic data's properties. This makes ReTabSyn not just a tool for data augmentation, but a flexible platform for creating tailored datasets that adhere to specific business rules or statistical requirements, opening new avenues for secure and efficient data sharing in sensitive domains.
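The summary does not specify how expert constraints enter the framework, but one minimal way to fold rules into an RL reward is to subtract a penalty proportional to the fraction of generated rows that violate them. The function and the example age rule below are illustrative assumptions, not the authors' API:

```python
def constrained_reward(base_reward, rows, constraints, penalty=1.0):
    """Subtract a penalty for the share of rows violating any expert rule.

    base_reward: utility score for the batch (e.g. a TSTR accuracy).
    constraints: list of predicates, each taking a row and returning bool.
    """
    violating = sum(1 for row in rows if not all(c(row) for c in constraints))
    return base_reward - penalty * violating / len(rows)

# Hypothetical business rule: 'age' (column 0) must lie in [18, 100].
age_rule = lambda row: 18 <= row[0] <= 100
```

Because the penalty is just another reward term, rules like this can be added or removed without touching the generator itself, which is what makes the framework easy to extend.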
- Focuses on conditional distribution P(y|X) instead of full joint distribution for greater data efficiency
- Uses RL to provide direct feedback, preserving critical feature correlations in low-data scenarios
- Outperforms SOTA baselines on benchmarks with small samples, class imbalance, and distribution shift
Why It Matters
Enables creation of high-utility synthetic data for ML training when real data is scarce, imbalanced, or privacy-sensitive.