VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generation
63x more efficient than LlamaGen, beating diffusion models at resolution-agnostic generation.
Researchers from Arizona State University and Huawei have unveiled VibeToken, a novel resolution-agnostic autoregressive (AR) image synthesis approach that challenges the dominance of diffusion models. At its core, VibeToken is a 1D Transformer-based image tokenizer that dynamically encodes images into a user-controllable sequence of 32-256 tokens, achieving a state-of-the-art trade-off between efficiency and generation quality. This enables generation at arbitrary resolutions and aspect ratios without retraining, something fixed-resolution AR models like LlamaGen cannot do.
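To make the idea of a resolution-agnostic 1D tokenizer concrete, here is a minimal sketch of how an image of any size could be compressed into a user-chosen number of latent tokens. The module names, the cross-attention design, and the "keep the first k learned queries" mechanism are illustrative assumptions, not VibeToken's actual architecture:

```python
# Illustrative sketch only -- not the authors' code or API.
import torch
import torch.nn as nn

class Latent1DTokenizer(nn.Module):
    def __init__(self, dim=256, max_tokens=256, patch=16):
        super().__init__()
        # Patchify works at any resolution/aspect ratio.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # A fixed bank of learned 1D latent queries; the user picks how many to use.
        self.queries = nn.Parameter(torch.randn(max_tokens, dim))
        # Queries cross-attend to the patch features (single layer for brevity).
        self.encoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def encode(self, image, num_tokens=64):
        # image: (B, 3, H, W) at any H, W divisible by the patch size
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        q = self.queries[:num_tokens].unsqueeze(0).expand(image.size(0), -1, -1)
        return self.encoder(q, feats)                                # (B, num_tokens, dim)

tok = Latent1DTokenizer()
z = tok.encode(torch.randn(1, 3, 512, 768), num_tokens=64)  # any resolution in, 64 tokens out
print(z.shape)  # torch.Size([1, 64, 256])
```

The point of the sketch is the decoupling: the latent sequence length is set by the user (here 32-256 in the paper's range), not by the image's spatial size.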
Building on VibeToken, the team introduces VibeToken-Gen, a class-conditioned AR generator that synthesizes 1024x1024 images using only 64 tokens, achieving a 3.94 gFID score. In comparison, a state-of-the-art diffusion-based alternative requires 1,024 tokens and attains 5.87 gFID. VibeToken-Gen maintains a constant 179G FLOPs regardless of resolution, a 63.4x efficiency improvement over LlamaGen's 11T FLOPs at 1024x1024. Accepted at CVPR'26, this work could accelerate production deployment of AR visual generative models.
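The reason generator compute stays flat with resolution is that the AR transformer only ever predicts a fixed-length token sequence; the target H x W is handled downstream by the decoder. The rough sketch below (again an assumption-laden illustration, not the paper's implementation; causal masking and the pixel decoder are omitted) shows a class-conditioned sampling loop whose cost depends only on the token budget:

```python
# Illustrative sketch only -- not VibeToken-Gen's actual implementation.
import torch
import torch.nn as nn

class TinyARGenerator(nn.Module):
    def __init__(self, vocab=4096, dim=256, num_classes=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.cls_emb = nn.Embedding(num_classes, dim)           # class conditioning
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def sample(self, class_id, num_tokens=64):
        seq = self.cls_emb(torch.tensor([[class_id]]))           # (1, 1, dim) start token
        out = []
        for _ in range(num_tokens):                              # cost scales with num_tokens only
            h = self.blocks(seq)                                 # causal mask omitted for brevity
            next_tok = self.head(h[:, -1]).argmax(-1)            # greedy next-token choice
            out.append(next_tok)
            seq = torch.cat([seq, self.tok_emb(next_tok)[:, None]], dim=1)
        return torch.stack(out, dim=1)                           # (1, num_tokens) token ids

tokens = TinyARGenerator().sample(class_id=207, num_tokens=64)
# A separate decoder (not shown) would map these 64 tokens to the target
# resolution, so the transformer FLOPs above are the same at 256x256 or 1024x1024.
```

With a fixed 64-token budget, raising the output resolution changes only the decoding step, which is where the reported constant 179G generator FLOPs come from.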
- VibeToken encodes images into 32-256 tokens, enabling dynamic resolution generation
- VibeToken-Gen achieves 3.94 gFID at 1024x1024 using 64 tokens, beating diffusion's 5.87 gFID with 1,024 tokens
- Constant 179G FLOPs vs. LlamaGen's 11T FLOPs (63.4x more efficient) at same resolution
Why It Matters
VibeToken could make AR image generation practical for production, slashing compute costs while matching diffusion quality.