CCVFM: Coreset-Based Flow Matching Cuts Sampling Steps
Instead of starting from noise, CCVFM starts from a compressed data summary.
Flow matching models typically start from pure Gaussian noise and learn to transport it to the target data distribution. Learning that full transport is expensive, especially for high-dimensional, multimodal data. Coreset-Induced Conditional Velocity Flow Matching (CCVFM) instead first compresses the target dataset into a small set of weighted atoms using an entropic Sinkhorn coreset. The atoms are then lifted into a closed-form Gaussian mixture that serves as a near-optimal source distribution for the flow. The inner flow no longer has to learn the entire transformation from noise; it only corrects the residual between the surrogate source and the true target, which reduces both the complexity of the learned model and the number of sampling steps required.
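The source construction can be sketched in a few lines of NumPy. This is an illustration only, not the authors' implementation: it assumes the coreset atoms and weights are already computed, uses a single isotropic bandwidth `sigma` for every mixture component, and pairs source and target samples with the standard linear-path flow matching target. The names `sample_gmm_source`, `flow_matching_pair`, and `sigma` are hypothetical.

```python
import numpy as np

def sample_gmm_source(atoms, weights, sigma, n, rng):
    """Draw n samples from sum_i weights[i] * N(atoms[i], sigma^2 I).

    Assumption: isotropic components with a shared bandwidth sigma.
    Returns the samples and the index of the atom each one came from.
    """
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # guard against rounding drift
    idx = rng.choice(len(atoms), size=n, p=weights)  # pick a component
    noise = rng.standard_normal((n, atoms.shape[1]))
    return atoms[idx] + sigma * noise, idx

def flow_matching_pair(x0, x1, t):
    """Linear interpolation path and its velocity target.

    x_t = (1 - t) * x0 + t * x1, with regression target x1 - x0.
    """
    t = np.asarray(t, dtype=float)[:, None]  # broadcast over features
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

# Toy example: two atoms standing in for a coreset of 2-D data.
rng = np.random.default_rng(0)
atoms = np.array([[-2.0, 0.0], [2.0, 0.0]])
weights = np.array([0.5, 0.5])
x0, idx = sample_gmm_source(atoms, weights, sigma=0.1, n=4, rng=rng)
```

Because each source sample already sits within roughly `sigma` of an atom, the velocity target `x1 - x0` is much shorter than it would be for a sample drawn from standard Gaussian noise, provided the paired data point lies near that atom. That is the intuition behind the residual correction described above.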
CCVFM comes with theoretical guarantees: under explicit compression assumptions, the transport cost between the surrogate and the target is bounded by the Wasserstein gap, whereas the conventional noise-to-data cost is subject to a lower bound that scales with dimension. Empirically, on standard benchmarks such as CIFAR-10 and ImageNet-32, CCVFM matches or exceeds baseline flow models with matched architectures while using fewer sampling steps. The surrogate source encodes the multimodal structure captured by the coreset, so the flow starts closer to the target manifold. This work opens the door to data-informed priors in flow-based generative models, potentially reducing computational costs in image, video, and 3D generation.
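To see why the two costs can behave so differently, here is a simplified version of the argument using only standard facts about the 2-Wasserstein distance. It is a sketch under assumptions, not necessarily the bound proved for CCVFM: the target $\mu$ on $\mathbb{R}^d$ has finite second moment $m_2 = \mathbb{E}_{Y\sim\mu}\|Y\|^2$, the coreset is $\nu_0 = \sum_i w_i \delta_{a_i}$, and the surrogate is the isotropic mixture $\nu_\sigma = \sum_i w_i\,\mathcal{N}(a_i, \sigma^2 I_d)$ with a hypothetical shared bandwidth $\sigma$.

```latex
% Surrogate-to-target: triangle inequality, then couple each Gaussian
% component with its own atom (expected squared distance \sigma^2 d).
W_2(\nu_\sigma, \mu)
  \le W_2(\nu_\sigma, \nu_0) + W_2(\nu_0, \mu)
  \le \sigma\sqrt{d} + W_2(\nu_0, \mu)

% Noise-to-target: reverse triangle inequality in L^2 for any coupling
% (X, Y) with X \sim \mathcal{N}(0, I_d), using \mathbb{E}\|X\|^2 = d.
W_2\bigl(\mathcal{N}(0, I_d), \mu\bigr)
  \ge \bigl|\sqrt{d} - \sqrt{m_2}\bigr|
```

In this simplified setting the surrogate cost is controlled by the coreset's Wasserstein gap plus a term that shrinks with $\sigma$, while the noise-to-data cost cannot fall below $|\sqrt{d} - \sqrt{m_2}|$, which grows with dimension whenever the data's second moment does not track $d$.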
- Uses an entropic Sinkhorn coreset to compress target data into weighted atoms for a Gaussian mixture source.
- Replaces full noise-to-data learning with a lighter surrogate-to-target residual correction flow.
- Achieves competitive few-step generation on MNIST, CIFAR-10, ImageNet-32, and CelebA-HQ with matched architectures.
Why It Matters
CCVFM demonstrates that better initial source distributions can significantly reduce the computational burden of generative flow models.