EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
New method generates arbitrarily many synthetic samples from private sources without additional privacy cost.
A research team including Amin Banayeeanzade, Qingchuan Yang, and seven other collaborators has introduced EPSVec, an approach to generating synthetic data from private datasets under differential privacy guarantees. The method targets known limitations of current private text generation techniques, which tend to require large private corpora and substantial computational resources, and which often produce poor-quality output in low-data scenarios. EPSVec decouples the privacy budget from the generation process: the privacy cost is paid once, no matter how many samples are generated afterwards. This lets organizations create synthetic versions of sensitive data (like medical records, financial documents, or proprietary research) while keeping the risk of exposing the original information formally bounded.
The core idea is the 'dataset vector': a direction in the activation space of a large language model that captures the distributional gap between private data and public priors. Extracting such a vector requires access to the model's internal activations, so the approach applies to open-weight models such as Llama 3 rather than to API-only models. EPSVec extracts and sanitizes these steering vectors just once using differential privacy mechanisms, then performs standard decoding to generate as many synthetic samples as needed. Because the generation step only reuses the already-sanitized vectors and never touches the private data again, it incurs no additional privacy cost. This one-time sanitization, combined with fixed-shot prompting and pretrained base models, is reported to achieve better distributional alignment and downstream utility than existing baselines. The method reduces computational overhead and maintains strong fidelity even with small private datasets, which could open up sensitive data for AI training, research collaboration, and regulatory compliance across healthcare, finance, and technology sectors.
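The extract-once, sanitize-once idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the dataset vector is a simple difference of mean activations, uses the classical Gaussian mechanism for sanitization, and stands in random arrays for real model activations. All function names, the clipping bound, and the steering strength are hypothetical.

```python
import numpy as np


def clip_rows(x, max_norm):
    """Scale each row so its L2 norm is at most max_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return x * scale


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical Gaussian mechanism assumes 0 < epsilon < 1")
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon


def private_dataset_vector(private_acts, public_acts, max_norm, epsilon, delta, rng):
    """Assumed form of a dataset vector: noisy private mean minus public mean."""
    n = private_acts.shape[0]
    clipped = clip_rows(private_acts, max_norm)
    private_mean = clipped.mean(axis=0)
    # Replacing one private record moves the clipped mean by at most 2 * max_norm / n.
    sigma = gaussian_sigma(2.0 * max_norm / n, epsilon, delta)
    noisy_mean = private_mean + rng.normal(0.0, sigma, size=private_mean.shape)
    # Public activations are not protected, so their mean needs no noise.
    return noisy_mean - public_acts.mean(axis=0)


def steer(hidden, vector, strength):
    """Shift a hidden state along the sanitized vector during decoding."""
    return hidden + strength * vector


rng = np.random.default_rng(0)
private = rng.normal(1.0, 1.0, size=(200, 16))   # stand-in for private activations
public = rng.normal(0.0, 1.0, size=(1000, 16))   # stand-in for public activations

# The privacy budget is spent here, exactly once.
vec = private_dataset_vector(private, public, max_norm=1.0,
                             epsilon=0.5, delta=1e-5, rng=rng)

# Every later use of `vec` is post-processing and costs no extra privacy.
steered = steer(np.zeros(16), vec, strength=2.0)
```

In this sketch the private activations are read only inside `private_dataset_vector`. Calling `steer` any number of times afterwards reuses the sanitized vector, which is why the sample count does not affect the privacy budget.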
- Uses 'dataset vectors' in LLM activation space to capture distributional gaps between private and public data
- One-time privacy sanitization enables unlimited synthetic data generation without additional privacy cost
- Outperforms existing methods in low-data regimes while reducing computational overhead
Why It Matters
Enables secure use of sensitive data for AI development and research collaboration while maintaining privacy compliance.