Forget about VAEs? SenseNova's NEO-unify achieves 31.56 PSNR without an encoder – native image generation is coming.
A new 2B-parameter model generates images directly from pixels, matching top VAE performance without the separate component.
SenseNova, the AI research arm of SenseTime, has introduced a potentially paradigm-shifting architecture called NEO-unify. This native unified model departs from the industry-standard "Frankenstein" approach of chaining separate models: CLIP for text understanding, a variational autoencoder (VAE) for compression, and a diffusion model for generation. Instead, NEO-unify is a single, integrated 2-billion-parameter model that operates directly in pixel space. This fundamental shift addresses long-standing pain points in AI image generation, such as the loss of fine detail and the artifacts introduced when an image passes through a separate VAE's encode-decode cycle.
In its technical preview, NEO-unify demonstrates compelling performance that validates its novel approach. It achieves a Peak Signal-to-Noise Ratio (PSNR) of 31.56 for image reconstruction; PSNR measures reconstruction fidelity in decibels, with higher scores indicating output closer to the original. This score is remarkably close to the 32.65 PSNR of FLUX's highly regarded standalone VAE, yet NEO-unify accomplishes it without any VAE at all. Furthermore, its native understanding of pixels translates to robust image editing capabilities, evidenced by a solid ImgEdit score of 3.32. The developers have confirmed plans for an open-source release, positioning NEO-unify not just as a research paper but as a foundational technology that could influence the next wave of open-source image models, potentially serving as the blueprint for a future Stable Diffusion 4.0.
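For readers unfamiliar with the metric, PSNR is a simple function of the mean squared error between the original and reconstructed images. The sketch below (a generic illustration, not SenseNova's evaluation code; the toy images and noise level are assumptions for demonstration) shows how the score is computed for 8-bit images:

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equal-shape images.

    PSNR = 10 * log10(MAX^2 / MSE), where MAX is the peak pixel value
    (255 for 8-bit images) and MSE is the mean squared pixel error.
    """
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no error at all
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy example: a random 8-bit image and a slightly perturbed "reconstruction".
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noise = rng.integers(-5, 6, size=img.shape)
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR: {psnr(img, noisy):.2f} dB")
```

With noise uniform in [-5, 5] the MSE is roughly 10, giving a PSNR near 38 dB; scores in the low 30s, like NEO-unify's 31.56, correspond to reconstructions that are visually very close to the source.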
- Eliminates the standard VAE encoder, working natively on pixels to prevent artifacts and detail loss.
- Achieves a 31.56 PSNR score, rivaling top-performing standalone VAEs like FLUX's (32.65 PSNR).
- A 2-billion-parameter model with strong image editing (3.32 ImgEdit score) and a planned open-source release.
Why It Matters
It could replace complex, multi-model pipelines with a single, more efficient architecture, setting a new standard for open-source image generation.