ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
A 5-billion-parameter autoencoder that handles any resolution, with no GAN losses and no quality degradation.
ViTok-v2, developed by a team of 12 researchers, scales Vision Transformer (ViT) autoencoders to 5 billion parameters, the largest image autoencoder to date. It addresses two core limitations of prior tokenizers: resolution dependency and unstable adversarial training. NaFlex, a native-resolution mechanism, lets the model generalize across resolutions and aspect ratios without degradation, while a DINOv3-based perceptual loss replaces both LPIPS and GAN losses, enabling stable training at any scale. The model was trained on approximately 2 billion images, an unusually large corpus for tokenizer pretraining.
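The exact form of the DINOv3 perceptual loss is not spelled out here, but the general recipe for a frozen-encoder perceptual loss is simple: compare reconstruction and target in the feature space of a frozen vision backbone. A minimal PyTorch sketch under that assumption (the encoder interface and the normalized-feature L2 distance are illustrative choices, not the paper's confirmed details):

```python
import torch
import torch.nn.functional as F
from torch import nn

class FrozenEncoderPerceptualLoss(nn.Module):
    """Perceptual loss in the feature space of a frozen vision encoder.

    `encoder` is assumed to map images (B, 3, H, W) to features (B, N, D);
    a frozen DINOv3 backbone would play this role in ViTok-v2's setup.
    """

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # the backbone stays frozen

    def forward(self, recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feat_target = self.encoder(target)
        feat_recon = self.encoder(recon)  # gradients flow back to the autoencoder
        # Distance between unit-normalized features; the paper's exact
        # distance metric is an assumption here.
        feat_recon = F.normalize(feat_recon, dim=-1)
        feat_target = F.normalize(feat_target, dim=-1)
        return (feat_recon - feat_target).pow(2).mean()
```

Because this is a plain feature-space distance rather than a discriminator, there is no adversarial min-max game to destabilize training as the autoencoder grows.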
In evaluation, ViTok-v2 matches or exceeds state-of-the-art reconstruction quality at 256×256 resolution and outperforms all baselines at 512×512 and above. The paper further demonstrates that joint scaling of both the autoencoder and a flow matching generator pushes the reconstruction-generation Pareto frontier, meaning better tokenization directly enables higher-quality generated images. This work establishes a new baseline for native-resolution image tokenization, with implications for generative models that rely on high-fidelity latent representations.
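For the generator side of that pairing, the standard linear-path flow matching objective trains a velocity network on interpolations between noise and data. A minimal sketch operating on tokenizer latents, where `velocity_net` is a hypothetical stand-in for the paper's generator:

```python
import torch

def flow_matching_loss(velocity_net, latents: torch.Tensor) -> torch.Tensor:
    """One training step of linear-path flow matching on tokenizer latents.

    latents: clean latents from the autoencoder, shape (B, ...).
    velocity_net(x_t, t) is assumed to predict a velocity of the same shape.
    """
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.rand(b, device=latents.device)
    t_exp = t.view(b, *([1] * (latents.dim() - 1)))
    # Linear interpolation from noise (t=0) to data (t=1).
    x_t = (1.0 - t_exp) * noise + t_exp * latents
    target_velocity = latents - noise  # constant along the linear path
    pred = velocity_net(x_t, t)
    return (pred - target_velocity).pow(2).mean()
```

Higher-fidelity latents give this objective a better target distribution to model, which is why tokenizer quality shows up directly in generation quality.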
- First image autoencoder scaled to 5 billion parameters, trained on 2 billion images.
- Introduces NaFlex for native resolution and aspect ratio support (sketched after this list), plus a DINOv3 perceptual loss that eliminates the need for GAN losses.
- Outperforms all baselines at 512p+ resolutions and advances the reconstruction-generation trade-off when paired with flow matching.
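The crux of native-resolution support is turning an image of arbitrary shape into a variable-length token sequence that can still be batched. A rough sketch of that idea: rescale to fit a token budget, snap to the patch grid, then pad with a mask. The 16-pixel patch size and token budget below are illustrative assumptions, not ViTok-v2's actual NaFlex configuration:

```python
import torch
import torch.nn.functional as F

def patchify_native(image: torch.Tensor, patch: int = 16, max_tokens: int = 1024):
    """Turn one image of arbitrary H x W into a padded patch sequence + mask.

    image: (3, H, W). Returns (max_tokens, 3*patch*patch) tokens and a
    boolean mask marking real (non-padding) tokens.
    """
    _, h, w = image.shape
    # Scale so the patch grid fits the token budget while preserving the
    # aspect ratio, then snap each side down to a multiple of the patch size.
    scale = min(1.0, (max_tokens * patch * patch / (h * w)) ** 0.5)
    nh = max(patch, int(h * scale) // patch * patch)
    nw = max(patch, int(w * scale) // patch * patch)
    image = F.interpolate(image[None], size=(nh, nw), mode="bilinear",
                          align_corners=False)[0]
    # Unfold into non-overlapping patches: (n_patches, 3*patch*patch).
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)
    n = patches.shape[0]
    tokens = torch.zeros(max_tokens, 3 * patch * patch,
                         device=image.device, dtype=image.dtype)
    tokens[:n] = patches
    mask = torch.zeros(max_tokens, dtype=torch.bool, device=image.device)
    mask[:n] = True  # attention should operate only on real tokens
    return tokens, mask
```

Padding plus masking lets images of different shapes share a batch, which is what allows a single model to generalize across resolutions and aspect ratios instead of being trained at one fixed size.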
Why It Matters
Enables higher-fidelity image tokenization at any resolution, directly benefiting generative models built on diffusion or flow matching.