Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance
New method improves FID on ImageNet-256 without heavy pretrained encoders
Weitao Du's new paper, Frequency-Forcing, rethinks how flow-matching models generate images. Traditional models transport noise to data uniformly, with no ordering across frequency scales; Frequency-Forcing instead imposes an explicit generation order: build coarse, low-frequency structure first, then add fine details. The approach draws on two prior works: K-Flow, which enforces a hard frequency constraint by reinterpreting frequency scaling as flow time, and Latent Forcing, which provides soft ordering via an auxiliary semantic latent stream with an asynchronous time schedule.
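The "asynchronous time schedule" idea can be sketched concretely: the auxiliary stream follows a warped clock that reaches clean data before the pixel stream does, so coarse structure is settled while fine detail is still being transported. This is a minimal illustration, not the paper's exact schedule; the warp function and the `alpha` parameter are assumptions for the sake of the example.

```python
import numpy as np

def warped_time(t, alpha=0.7):
    """Hypothetical auxiliary-stream clock: tau(t) = min(t / alpha, 1).

    The auxiliary (low-frequency) stream reaches tau = 1 (clean data)
    at t = alpha, earlier than the pixel stream, which uses t directly.
    """
    return np.minimum(t / alpha, 1.0)

def linear_interpolant(x0, x1, t):
    """Standard flow-matching interpolant between noise x0 and data x1."""
    return (1.0 - t) * x0 + t * x1

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
data = rng.standard_normal(4)

t = 0.7
# At t = 0.7 the pixel stream is still mid-trajectory...
x_pixel = linear_interpolant(noise, data, t)
# ...while the auxiliary stream has already matured to clean data,
# so it can guide the pixel stream's remaining transport.
x_aux = linear_interpolant(noise, data, warped_time(t))
```

The pixel flow can then condition on `x_aux` at each step, which is what makes the ordering "soft": it is guidance rather than a hard constraint on the pixel trajectory.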
Frequency-Forcing combines the best of both: it realizes K-Flow's frequency ordering through Latent Forcing's soft mechanism. A standard pixel flow is guided by an auxiliary low-frequency stream that matures earlier in time, but unlike Latent Forcing's reliance on heavy pretrained encoders like DINO, Frequency-Forcing derives its scratchpad from the data itself via a lightweight learnable wavelet packet transform. This self-forcing signal avoids external dependencies and learns a basis better adapted to data statistics than fixed bases. On ImageNet-256, the method consistently improves FID scores over strong baselines and naturally composes with semantic streams for additional gains.
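To make the scratchpad idea concrete, here is a minimal sketch of extracting a low-frequency signal with a wavelet packet decomposition. The paper's transform is learnable; this sketch uses fixed Haar low-pass filtering (a natural initialization for such filters) and plain numpy, so the function names and the block-averaging formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def haar_lowpass_2d(img):
    """One level of a fixed Haar decomposition: return the low-frequency
    (LL) band by averaging each 2x2 block. In a learnable wavelet packet
    transform, these analysis filters would be trained parameters.
    """
    h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0, "image sides must be even"
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def wavelet_packet_lowpass(img, levels=2):
    """Iterate the low-pass split to obtain a coarse 'scratchpad' image
    that carries only low-frequency structure."""
    out = img
    for _ in range(levels):
        out = haar_lowpass_2d(out)
    return out

# Usage: a 4x4 image reduced to its coarse structure.
img = np.arange(16, dtype=float).reshape(4, 4)
coarse = wavelet_packet_lowpass(img, levels=1)
```

Because the scratchpad is computed from the image itself, the auxiliary stream needs no external encoder; a learnable version can additionally adapt the filter bank to the data distribution rather than committing to the fixed Haar basis used here.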
- Frequency-Forcing uses a lightweight learnable wavelet packet transform as a self-forcing signal, avoiding heavy pretrained encoders like DINO
- On ImageNet-256, it consistently improves FID scores over both pixel- and latent-space baselines
- The method composes with semantic streams for further performance gains, showing versatility in scale-ordered generation
Why It Matters
By deriving its guidance signal from the data itself, Frequency-Forcing delivers high-quality image generation without heavy pretrained encoders, reducing both external dependencies and the computational cost of training generative models.