DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
Researchers combine implicit neural representations and diffusion models to compress video at <0.05 bits per pixel.
A research team from ETH Zurich and Disney Research has unveiled DiV-INR, a video compression framework designed for the extreme low-bitrate regime (below 0.05 bits per pixel). The system marries two advanced AI techniques: Implicit Neural Representations (INRs), which provide a highly compact mathematical description of a video, and pre-trained video diffusion models, which act as a rich generative prior learned from massive datasets. This hybrid approach lets the model encode video-specific information with minimal parameter overhead while leveraging the diffusion model's ability to generate realistic detail.
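To see why an INR is so compact, consider a rough illustration (not the paper's actual architecture): an INR is a small coordinate network whose weights *are* the representation. It maps normalized (x, y, t) coordinates to pixel or latent values, so the storage cost scales with the parameter count, not the pixel count. A minimal NumPy sketch:

```python
import numpy as np

# Minimal coordinate-MLP sketch of an INR (illustrative only; DiV-INR's
# actual conditioner architecture and sizes are assumptions here).
rng = np.random.default_rng(0)

# Tiny 2-layer MLP: (x, y, t) -> 3-channel value. Its few hundred weights
# are the entire video representation; "decoding" means evaluating it
# at every spatio-temporal coordinate.
W1 = rng.standard_normal((3, 64)) * 0.1
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 3)) * 0.1
b2 = np.zeros(3)

def inr(coords):
    """coords: (N, 3) array of normalized (x, y, t) in [0, 1]."""
    h = np.tanh(coords @ W1 + b1)   # hidden features
    return h @ W2 + b2              # predicted RGB (or latent) values

# "Decode" a 4x4 patch of the frame at t = 0.5 by querying the network.
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
coords = np.stack([xs.ravel(), ys.ravel(), np.full(16, 0.5)], axis=1)
out = inr(coords)
print(out.shape)                                  # (16, 3)
print(sum(w.size for w in (W1, b1, W2, b2)))      # 451 parameters total
```

In a real codec the network would be fit to the video and its quantized weights transmitted; here it only illustrates the coordinate-in, value-out interface.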
In practice, DiV-INR replaces traditional, bit-heavy intra-coded keyframes with compact neural representations. These INR-based conditioners are trained to estimate latent features that guide the diffusion process during video reconstruction. The team jointly optimizes the INR weights and parameter-efficient adapters for the diffusion model, enabling the system to learn reliable conditioning signals. Benchmarks on the UVG, MCL-JCV, and JVET Class-B datasets show substantial perceptual-quality improvements, with gains of up to 0.214 in BD-LPIPS and 91.14 in BD-FID over the HEVC standard, while also outperforming VVC and previous neural codecs.
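The parameter-efficient-adapter idea can be sketched as a low-rank update, a common scheme for adapting frozen backbones (the paper's exact adapter design is not detailed here, so the LoRA-style form and all sizes below are assumptions): the frozen diffusion weight W is augmented with a trainable product A·B of small rank, so only a tiny fraction of the parameters is tuned alongside the INR.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen pre-trained weight from the diffusion backbone (illustrative size).
d_in, d_out, rank = 512, 512, 4
W_frozen = rng.standard_normal((d_in, d_out)) * 0.02

# Trainable low-rank adapter (LoRA-style; an assumption, not the paper's spec).
A = rng.standard_normal((d_in, rank)) * 0.01
B = np.zeros((rank, d_out))   # zero-init, so training starts exactly at W_frozen

def adapted_forward(x):
    # Effective weight is W_frozen + A @ B; only A and B would receive gradients.
    return x @ W_frozen + (x @ A) @ B

full_params = W_frozen.size
adapter_params = A.size + B.size
print(adapter_params, full_params)        # 4096 vs 262144
print(f"trainable fraction: {adapter_params / full_params:.2%}")  # 1.56%
```

The point is the ratio: tuning the adapters touches well under 2% of the layer's parameters, which is what makes joint training with the per-video INR cheap.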
An intriguing analysis reveals that the model reconstructs video in a semantic-to-visual hierarchy: it first composes the scene layout and identifies objects, then refines textural detail. This coarse-to-fine progression is key to achieving perceptually faithful compression under such extreme data constraints. The work represents a significant step toward making high-quality video streaming feasible in bandwidth-starved environments, from rural internet to space communications, by fundamentally rethinking how visual information is encoded and reconstructed.
- Hybrid AI Architecture: Combines Implicit Neural Representations (INRs) for compact encoding with pre-trained video diffusion models as generative priors for reconstruction.
- Extreme Efficiency: Targets bitrates below 0.05 bits per pixel (bpp), enabling video transmission where traditional codecs fail.
- Superior Perceptual Quality: Outperforms HEVC, VVC, and prior neural codecs, with improvements up to 0.214 BD-LPIPS and 91.14 BD-FID on standard benchmarks.
Why It Matters
Enables high-quality video streaming and communication in severely bandwidth-constrained environments, from remote areas to mobile networks.