SenseTime's SenseNova-U1 ditches VAEs for 32x pixel compression

r/StableDiffusion May 13, 2026

⚡No more VAE blur: SenseNova-U1 compresses images 32x while preserving pixel-level detail.

Deep Dive

SenseTime has released a technical report on SenseNova-U1, a vision model that drops traditional VAEs and visual encoders. In their place it uses a VAE-free visual interface: a 2-layer convolution compresses the image 32x, and an MLP head then predicts pixels directly. To handle varying resolutions, the model introduces Dynamic Noise Scale (DNS), which keeps the signal-to-noise ratio consistent from 512px up to 2048px. The design targets the detail loss and text blurring commonly associated with VAEs in models like Stable Diffusion or FLUX.
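To get a feel for what 32x compression means for sequence length, here is a rough back-of-the-envelope sketch. It assumes that "32x" refers to downsampling along each spatial side, so that every token covers a 32x32 RGB patch; the report may define the ratio differently, and the function names are purely illustrative.

```python
# Assumption (not confirmed by the source): "32x compression" is read as
# 32x downsampling per spatial side, i.e. one token per 32x32 RGB patch.
DOWNSAMPLE = 32
CHANNELS = 3


def token_grid(height, width, downsample=DOWNSAMPLE):
    """Return (rows, cols, tokens) for an image after spatial downsampling."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsample factor")
    rows, cols = height // downsample, width // downsample
    return rows, cols, rows * cols


def pixels_per_token(downsample=DOWNSAMPLE, channels=CHANNELS):
    """Number of pixel values a per-token MLP head would have to predict."""
    return downsample * downsample * channels


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        rows, cols, tokens = token_grid(side, side)
        print(f"{side}px -> {rows}x{cols} grid, {tokens} tokens, "
              f"{pixels_per_token()} values per token")
```

Under this reading, a 512px image becomes 256 tokens and a 2048px image becomes 4096 tokens, with each token responsible for 3072 pixel values.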

SenseNova-U1 uses a native Mixture-of-Transformers (MoT) architecture: a unified backbone in which the Understanding and Generation streams share self-attention but have separate FFN and Norm layers, with each token routed by its type. Training combines autoregressive and flow matching losses across a 6-stage pipeline that includes warm-up, SFT, and 8-step distillation. Deployment uses LightLLM/LightX2V for independent parallel scheduling. Two variants are available: a dense 8B-MoT and an A3B-MoT (MoE with 30B total parameters, 3B active).
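The shared-attention, split-FFN idea can be sketched in a few lines of plain Python. This is a toy illustration, not SenseNova-U1's implementation: the attention uses identity projections, the FFN is a scaled ReLU, the norm placement is simplified, and the `STREAMS` parameters are hypothetical values chosen only to make the two streams behave differently.

```python
import math


def layer_norm(x, gain):
    """Normalize a vector to zero mean and unit variance, then scale by gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + 1e-6) for v in x]


def shared_attention(tokens):
    """Single-head self-attention (identity projections) over ALL tokens."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


# Hypothetical per-stream parameters: each stream owns its Norm gain and FFN.
STREAMS = {
    "understanding": {"gain": 1.0, "ffn_scale": 0.5},
    "generation": {"gain": 2.0, "ffn_scale": -0.5},
}


def ffn(x, scale):
    """Toy position-wise FFN: a scaled ReLU."""
    return [scale * max(v, 0.0) for v in x]


def mot_block(tokens, token_types):
    """Attention is shared; Norm and FFN are selected by each token's type."""
    attended = shared_attention(tokens)
    out = []
    for x, a, kind in zip(tokens, attended, token_types):
        params = STREAMS[kind]
        h = [xi + ai for xi, ai in zip(x, a)]            # residual after attention
        normed = layer_norm(h, params["gain"])           # stream-specific Norm
        update = ffn(normed, params["ffn_scale"])        # stream-specific FFN
        out.append([hi + ui for hi, ui in zip(h, update)])
    return out


if __name__ == "__main__":
    toks = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
    print(mot_block(toks, ["understanding", "generation"]))
```

Feeding the same vector in as an understanding token and as a generation token yields different outputs, even though both attended to the same shared context. That is the point of decoupling only the FFN and Norm layers.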

Key Points

VAE-free pixel compression: 2-layer conv achieves 32x compression, MLP head predicts pixels directly
Dynamic Noise Scale (DNS) maintains consistent SNR across resolutions from 512px to 2048px
Mixture-of-Transformers with shared self-attention but decoupled FFN/Norm for understanding vs generation streams
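
The second key point, resolution-consistent SNR, can be motivated with a small calculation. The summary does not give the DNS formula, so the scaling rule below is a hypothetical stand-in and not the report's method. It rests on one standard fact: averaging k×k independent noise samples divides the noise standard deviation by k, so a fixed per-pixel noise level destroys less coarse structure at higher resolutions.

```python
import math

BASE_RES = 512  # reference resolution (an assumption for this sketch)


def noise_scale(resolution, base=BASE_RES):
    """Hypothetical rule: grow the noise std linearly with image side length."""
    return resolution / base


def effective_noise_std(resolution, sigma, base=BASE_RES):
    """Std of i.i.d. Gaussian noise after average-pooling down to `base`.

    Pooling by k per side averages k*k independent samples, which divides
    the standard deviation by k.
    """
    k = resolution // base
    return sigma / k


def snr_db(signal_std, noise_std):
    """Signal-to-noise ratio in decibels for the given standard deviations."""
    return 20 * math.log10(signal_std / noise_std)


if __name__ == "__main__":
    for res in (512, 1024, 2048):
        fixed = effective_noise_std(res, 1.0)
        scaled = effective_noise_std(res, noise_scale(res))
        print(f"{res}px: fixed-noise SNR {snr_db(1.0, fixed):.1f} dB, "
              f"scaled-noise SNR {snr_db(1.0, scaled):.1f} dB")
```

With a fixed noise level, the coarse-scale SNR in this toy model rises as resolution grows; scaling the noise with side length holds it constant. Whatever rule DNS actually uses, this is the kind of drift it would need to counteract.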

Why It Matters

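If SenseNova-U1's VAE-free approach holds up in practice, it could set a new standard for high-fidelity image generation and understanding, since it removes the VAE stage associated with detail loss and blurred text.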
