SenseTime's SenseNova-U1 ditches VAEs for 32x pixel compression
No more VAE blur: SenseNova-U1 compresses images 32x while preserving pixel-level detail.
SenseTime has released a technical report on SenseNova-U1, a vision model that abandons traditional VAEs and visual encoders. Instead, it uses a VAE-free visual interface: a 2-layer convolution that achieves 32x compression, followed by an MLP head that directly predicts pixels. To handle varying resolutions, the model introduces Dynamic Noise Scale (DNS), which keeps the signal-to-noise ratio consistent from 512px up to 2048px. This approach avoids the detail loss and text blurring commonly associated with VAEs in models like Stable Diffusion or FLUX.
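To make the compression figure concrete, here is a back-of-the-envelope sketch of the token arithmetic. It assumes that "32x" means 32x downsampling per image side, so each token covers a 32x32 pixel patch, and that images are RGB. The brief does not spell out either point, and the helper names `token_grid` and `values_per_token` are hypothetical.

```python
def token_grid(height, width, factor=32):
    """Rows and columns of visual tokens for an image, assuming each
    token covers a factor x factor pixel patch (an assumption here)."""
    if height % factor or width % factor:
        raise ValueError("image sides must be multiples of the factor")
    return height // factor, width // factor


def values_per_token(factor=32, channels=3):
    """Pixel values a per-token MLP head would have to predict."""
    return factor * factor * channels


for side in (512, 1024, 2048):
    rows, cols = token_grid(side, side)
    print(f"{side}px -> {rows}x{cols} = {rows * cols} tokens, "
          f"{values_per_token()} values per token")
```

Under these assumptions a 512px image becomes 256 tokens and a 2048px image 4,096 tokens, with each token decoding to 3,072 pixel values.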
SenseNova-U1 employs a native Mixture-of-Transformers (MoT) architecture: a unified backbone in which the Understanding and Generation streams share self-attention but use separate FFN and Norm layers, with each token routed to its stream by token type. Training combines autoregressive and flow-matching losses in a 6-stage pipeline that includes warm-up, SFT, and 8-step distillation. Deployment uses LightLLM/LightX2V for independent parallel scheduling. Two variants are available: a dense 8B-MoT and an A3B-MoT (an MoE with 30B total and 3B active parameters).
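The routing idea can be illustrated with a minimal NumPy sketch of one MoT-style block. This is not the report's implementation. It assumes single-head, unmasked attention, a ReLU FFN, and a pre-norm layout, and every name in it (`mot_block`, `route`, `init_params`) is hypothetical. What it shows is the structure described above: one set of attention weights for all tokens, and two sets of Norm and FFN weights selected per token.

```python
import numpy as np


def layer_norm(x, gain, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def init_params(dim, hidden, rng, scale=0.1):
    def stream():  # one private Norm/FFN set per stream
        return {"norm1": np.ones(dim), "norm2": np.ones(dim),
                "w_in": rng.normal(0, scale, (dim, hidden)),
                "w_out": rng.normal(0, scale, (hidden, dim))}
    attn = {n: rng.normal(0, scale, (dim, dim)) for n in ("wq", "wk", "wv", "wo")}
    # streams[0] = understanding, streams[1] = generation
    return {"attn": attn, "streams": [stream(), stream()]}


def route(x, is_gen, fn_und, fn_gen):
    """Apply fn_und to understanding tokens and fn_gen to generation tokens."""
    out = np.empty_like(x)
    out[~is_gen] = fn_und(x[~is_gen])
    out[is_gen] = fn_gen(x[is_gen])
    return out


def mot_block(x, is_gen, params):
    und, gen = params["streams"]
    a = params["attn"]
    # Stream-specific norm, then attention shared across ALL tokens.
    h = route(x, is_gen, lambda t: layer_norm(t, und["norm1"]),
              lambda t: layer_norm(t, gen["norm1"]))
    q, k, v = h @ a["wq"], h @ a["wk"], h @ a["wv"]
    weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    x = x + weights @ v @ a["wo"]
    # Stream-specific norm and FFN, chosen per token by type.
    h = route(x, is_gen, lambda t: layer_norm(t, und["norm2"]),
              lambda t: layer_norm(t, gen["norm2"]))
    ffn = lambda p: (lambda t: np.maximum(t @ p["w_in"], 0.0) @ p["w_out"])
    return x + route(h, is_gen, ffn(und), ffn(gen))


rng = np.random.default_rng(0)
params = init_params(dim=8, hidden=16, rng=rng)
x = rng.normal(size=(6, 8))                      # 6 tokens, 8 features
is_gen = np.array([False, False, False, True, True, True])
out = mot_block(x, is_gen, params)
print(out.shape)
```

Because attention is shared, understanding and generation tokens can still read from each other. Only the per-token transforms are kept separate.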
- VAE-free pixel compression: 2-layer conv achieves 32x compression, MLP head predicts pixels directly
- Dynamic Noise Scale (DNS) maintains consistent SNR across resolutions from 512px to 2048px
- Mixture-of-Transformers with shared self-attention but decoupled FFN/Norm for understanding vs generation streams
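The brief does not give the DNS formula, so the following is only an illustration of why a resolution-dependent noise scale can keep SNR consistent. It uses a standard argument from the diffusion literature rather than anything taken from the report. Averaging independent noise over a k x k block shrinks its standard deviation by a factor of k, so a higher-resolution image needs proportionally stronger noise to look equally noisy at a fixed reference resolution. The linear rule in `noise_scale` and the 512px reference are assumptions, and both function names are hypothetical.

```python
import numpy as np


def noise_scale(side, base_side=512, base_scale=1.0):
    """Hypothetical rule: scale noise linearly with image side length."""
    return base_scale * side / base_side


def pooled_noise_std(side, base_side=512, seed=0):
    """Std of the scaled noise after average-pooling down to base_side."""
    rng = np.random.default_rng(seed)
    k = side // base_side
    noise = noise_scale(side, base_side) * rng.standard_normal((side, side))
    pooled = noise.reshape(base_side, k, base_side, k).mean(axis=(1, 3))
    return float(pooled.std())


for side in (512, 1024, 2048):
    print(side, noise_scale(side), round(pooled_noise_std(side), 2))
```

With this rule the pooled noise level stays close to the 512px baseline at every resolution, which is the kind of consistency the DNS bullet describes.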
Why It Matters
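If SenseNova-U1's VAE-free interface holds up, it removes the reconstruction step blamed for lost detail and blurred text in latent-space generators such as Stable Diffusion and FLUX, while letting a single backbone handle both image understanding and generation.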