VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generation
63x more efficient than LlamaGen, beating diffusion models at resolution-agnostic generation.
Researchers from Arizona State University and Huawei have unveiled VibeToken, a novel resolution-agnostic autoregressive (AR) image synthesis approach that challenges the dominance of diffusion models. At its core, VibeToken is a 1D Transformer-based image tokenizer that dynamically encodes images into a user-controllable sequence of 32-256 tokens, achieving a state-of-the-art trade-off between efficiency and generation quality. This enables generation at arbitrary resolutions and aspect ratios without retraining, something fixed-resolution AR models like LlamaGen cannot do.
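To make the idea of a resolution-agnostic 1D tokenizer concrete, here is a minimal sketch of how an image of any size could be compressed into a user-chosen number of latent tokens. The module names, the cross-attention design, and the "keep the first k learned queries" mechanism are illustrative assumptions, not VibeToken's actual architecture:

```python
# Illustrative sketch only -- not the authors' code or API.
import torch
import torch.nn as nn

class Latent1DTokenizer(nn.Module):
    def __init__(self, dim=256, max_tokens=256, patch=16):
        super().__init__()
        # Patchify works at any resolution/aspect ratio.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # A fixed bank of learned 1D latent queries; the user picks how many to use.
        self.queries = nn.Parameter(torch.randn(max_tokens, dim))
        # Queries cross-attend to the patch features (single layer for brevity).
        self.encoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def encode(self, image, num_tokens=64):
        # image: (B, 3, H, W) at any H, W divisible by the patch size
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        q = self.queries[:num_tokens].unsqueeze(0).expand(image.size(0), -1, -1)
        return self.encoder(q, feats)                                # (B, num_tokens, dim)

tok = Latent1DTokenizer()
z = tok.encode(torch.randn(1, 3, 512, 768), num_tokens=64)  # any resolution in, 64 tokens out
print(z.shape)  # torch.Size([1, 64, 256])
```

The point of the sketch is the decoupling: the latent sequence length is set by the user (here 32-256 in the paper's range), not by the image's spatial size.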
Building on VibeToken, the team introduces VibeToken-Gen, a class-conditioned AR generator that synthesizes 1024x1024 images using only 64 tokens, achieving a 3.94 gFID score. In comparison, a state-of-the-art diffusion-based alternative requires 1,024 tokens and attains 5.87 gFID. VibeToken-Gen maintains a constant 179G FLOPs regardless of resolution, a 63.4x efficiency improvement over LlamaGen's 11T FLOPs at 1024x1024. Accepted at CVPR'26, this work could accelerate production deployment of AR visual generative models.
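The reason generator compute stays flat with resolution is that the AR transformer only ever predicts a fixed-length token sequence; the target H x W is handled downstream by the decoder. The rough sketch below (again an assumption-laden illustration, not the paper's implementation; causal masking and the pixel decoder are omitted) shows a class-conditioned sampling loop whose cost depends only on the token budget:

```python
# Illustrative sketch only -- not VibeToken-Gen's actual implementation.
import torch
import torch.nn as nn

class TinyARGenerator(nn.Module):
    def __init__(self, vocab=4096, dim=256, num_classes=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.cls_emb = nn.Embedding(num_classes, dim)           # class conditioning
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def sample(self, class_id, num_tokens=64):
        seq = self.cls_emb(torch.tensor([[class_id]]))           # (1, 1, dim) start token
        out = []
        for _ in range(num_tokens):                              # cost scales with num_tokens only
            h = self.blocks(seq)                                 # causal mask omitted for brevity
            next_tok = self.head(h[:, -1]).argmax(-1)            # greedy next-token choice
            out.append(next_tok)
            seq = torch.cat([seq, self.tok_emb(next_tok)[:, None]], dim=1)
        return torch.stack(out, dim=1)                           # (1, num_tokens) token ids

tokens = TinyARGenerator().sample(class_id=207, num_tokens=64)
# A separate decoder (not shown) would map these 64 tokens to the target
# resolution, so the transformer FLOPs above are the same at 256x256 or 1024x1024.
```

With a fixed 64-token budget, raising the output resolution changes only the decoding step, which is where the reported constant 179G generator FLOPs come from.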
- VibeToken encodes images into 32-256 tokens, enabling dynamic resolution generation
- VibeToken-Gen achieves 3.94 gFID at 1024x1024 using 64 tokens, beating diffusion's 5.87 gFID with 1,024 tokens
- Constant 179G FLOPs vs. LlamaGen's 11T FLOPs (63.4x more efficient) at same resolution
Why It Matters
VibeToken could make AR image generation practical for production, slashing compute costs while matching diffusion quality.