Tiny Inference-Time Scaling with Latent Verifiers
New latent-space verifier cuts compute by 51% (FLOPs) while improving image-quality scores by 2.7%.
A research team from the University of Modena and Reggio Emilia has introduced VHS (Verifier on Hidden States), a method that sharply reduces the computational cost of improving AI-generated images. The paper, "Tiny Inference-Time Scaling with Latent Verifiers," targets a bottleneck in current image-generation pipelines: Multimodal Large Language Models (MLLMs) used as verifiers to score candidate outputs and select the best one. Before an MLLM can score a candidate, the image must be decoded from the generator's latent space to pixel space and then re-encoded by the MLLM's vision encoder, a round trip of redundant operations that consumes significant time and resources.
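To make the bottleneck concrete, here is a minimal sketch of the baseline best-of-N loop the paper describes. All names and shapes (`generate_latent`, `vae_decode`, `mllm_score`, the 4x32x32 latent) are hypothetical stand-ins for illustration, not the paper's models:

```python
import numpy as np

# All functions below are toy stand-ins, sized only for illustration;
# the real generator, VAE decoder, and MLLM verifier are large networks.

def generate_latent(seed: int) -> np.ndarray:
    """Stand-in for one DiT sampling run: returns a 4x32x32 latent."""
    return np.random.default_rng(seed).normal(size=(4, 32, 32))

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in for VAE decoding: latent -> pixel image (8x upsample).
    This latent-to-pixel step is part of the round trip VHS eliminates."""
    return np.repeat(np.repeat(latent[:3], 8, axis=1), 8, axis=2)

def mllm_score(image: np.ndarray, prompt: str) -> float:
    """Stand-in for the MLLM verifier, which re-encodes the pixel image
    through its vision encoder before scoring (placeholder scoring rule)."""
    return float(np.abs(image).mean())

def best_of_n(prompt: str, n: int = 4) -> tuple[np.ndarray, float]:
    """Baseline inference-time scaling: sample N candidates, decode each
    latent to pixels, score with the MLLM, and keep the best candidate."""
    candidates = [generate_latent(i) for i in range(n)]
    scores = [mllm_score(vae_decode(z), prompt) for z in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

best_latent, best_score = best_of_n("a red cube on a blue sphere")
```

Note that every candidate pays for a full VAE decode plus an MLLM forward pass, even though all but one candidate is then discarded.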
VHS instead operates directly on the intermediate hidden representations of Diffusion Transformer (DiT) generators, eliminating the pixel-space round trip entirely. This makes inference-time scaling far more efficient, which is especially valuable under tight compute budgets and with small numbers of candidate images per prompt. The researchers report that, compared to standard MLLM verifiers, VHS reduces joint generation-and-verification time by 63.3%, compute (FLOPs) by 51%, and VRAM usage by 14.5%, while also improving output quality.
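A sketch of the latent-verifier idea, under stated assumptions: the verifier head below (a mean-pooled linear projection), the hidden width, and the token count are all illustrative inventions, not the paper's actual VHS architecture. What the sketch does capture is the structural change: scoring happens on hidden states the generator already produced, with no VAE decode and no MLLM forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64   # hypothetical DiT hidden width
TOKENS = 16   # hypothetical number of latent tokens

def dit_hidden_states(seed: int) -> np.ndarray:
    """Stand-in for intermediate DiT activations captured during sampling;
    in the VHS setting these already exist as a by-product of generation."""
    return np.random.default_rng(seed).normal(size=(TOKENS, HIDDEN))

class LatentVerifier:
    """A tiny pooled linear head over hidden states. This sketches the
    *idea* of verifying in latent space; it is not the paper's design."""
    def __init__(self, hidden: int = HIDDEN) -> None:
        self.w = rng.normal(size=hidden) / np.sqrt(hidden)

    def score(self, h: np.ndarray) -> float:
        # Mean-pool over tokens, then project to a scalar quality score.
        return float(np.tanh(h.mean(axis=0) @ self.w))

def best_of_n_latent(n: int = 4) -> tuple[int, list[float]]:
    """Best-of-N selection without decoding to pixels: score each
    candidate directly from its cached hidden states."""
    verifier = LatentVerifier()
    scores = [verifier.score(dit_hidden_states(i)) for i in range(n)]
    return int(np.argmax(scores)), scores
```

Because the verifier is small and its input is already resident in GPU memory, per-candidate verification cost becomes a tiny fraction of generation cost, which is what enables the reported time, FLOPs, and VRAM savings.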
Remarkably, the efficiency gain does not come at the cost of quality: at the same inference-time budget, the team reports a +2.7% improvement on the GenEval benchmark, so the method saves resources and enhances outputs at once. This makes high-quality image generation more accessible and sustainable, particularly for applications that require rapid iteration or operate under tight computational constraints.
Key Takeaways
- VHS reduces joint generation-and-verification time by 63.3% compared to MLLM verifiers
- The method cuts compute FLOPs by 51% and VRAM usage by 14.5% while improving GenEval scores by 2.7%
- Operates directly on DiT hidden representations, eliminating costly pixel-space decoding and re-encoding
Why It Matters
Makes high-quality AI image generation dramatically more efficient and accessible, reducing both cost and environmental impact.