Study finds no video quality model accurate enough for diffusion-based super-resolution

arXiv eess.IV May 26, 2026

⚡ CNN-based models like LPIPS and DISTS outperform conventional metrics, but none can replace human evaluation.

Deep Dive

A new study from Benjamin Herb, Steve Göring, Alexander Raake, and Rakesh Rao Ramachandra Rao (accepted at QoMEX 2026) investigates whether existing video quality models can reliably assess diffusion-based video super-resolution (VSR) outputs. The team compared six upscaling methods—traditional Lanczos, Rhea, SCST, DOVE, SeedVR2, and Starlight Mini—on both compressed (AV1, DCVC-RT) and uncompressed low-resolution videos, displayed on a UHD-4K screen. They evaluated a wide range of full-reference and no-reference quality models, focusing on per-sequence performance.

The key finding: CNN-based full-reference models like LPIPS, DISTS, and CVQA-FR significantly outperformed both conventional full-reference models (e.g., PSNR, SSIM) and all tested no-reference models in correlation with human ratings. However, none reached the accuracy needed to replace subjective testing. Most models overestimated SCST's overly sharp results, while VMAF failed primarily due to spatial inconsistencies introduced by Starlight Mini. The researchers conclude that current video quality models are not yet reliable for evaluating diffusion-based VSR, and they have released all videos, ratings, and model scores as open data to support further research.
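To make "correlation with human ratings" concrete, here is a minimal sketch of how a quality model's scores can be compared against subjective ratings with Spearman rank correlation. It is an illustration only, not the study's evaluation code: the article does not say which correlation measures the authors used, and every number below is a made-up placeholder. The helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical and written in plain Python so the block runs without extra packages.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data (NOT from the study): one value per upscaled sequence.
mos = [1.8, 2.5, 3.1, 3.9, 4.4]                   # mean opinion scores
higher_is_better = [24.0, 27.5, 26.0, 31.0, 33.5]  # e.g. a PSNR-like score
lower_is_better = [0.62, 0.48, 0.40, 0.25, 0.15]   # e.g. an LPIPS-like distance

print(round(spearman(mos, higher_is_better), 3))
print(round(spearman(mos, lower_is_better), 3))
```

Distance-style models such as LPIPS and DISTS assign lower values to better quality, so a strong result for them shows up as a correlation close to -1; comparisons across models usually use the absolute value. A model that overestimates one method's outputs, as reported for SCST, would rank those sequences higher than viewers did and pull the correlation toward zero.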

Key Points

CNN-based full-reference models (LPIPS, DISTS, CVQA-FR) showed highest correlation with human perception, outperforming traditional metrics like PSNR, SSIM, and VMAF.
Most tested models overestimated the quality of SCST's overly sharp outputs, and VMAF failed mainly because of spatial inconsistencies introduced by Starlight Mini.
None of the 20+ quality models achieved sufficient accuracy to replace subjective testing for diffusion-based video super-resolution.

Why It Matters

As diffusion-based VSR becomes common, the study exposes a critical gap: automated quality metrics cannot yet reliably evaluate its outputs, so subjective testing remains necessary.
