Study finds no video quality model accurate enough for diffusion-based super-resolution
CNN-based models like LPIPS and DISTS outperform conventional metrics, but none replace human evaluation.
A new study from Benjamin Herb, Steve Göring, Alexander Raake, and Rakesh Rao Ramachandra Rao (accepted at QoMEX 2026) investigates whether existing video quality models can reliably assess diffusion-based video super-resolution (VSR) outputs. The team compared six upscaling methods (traditional Lanczos, Rhea, SCST, DOVE, SeedVR2, and Starlight Mini) on both compressed (AV1, DCVC-RT) and uncompressed low-resolution videos, displayed on a UHD-4K screen. They then evaluated a wide range of full-reference and no-reference quality models against the resulting human ratings, focusing on per-sequence performance.
The key finding: CNN-based full-reference models like LPIPS, DISTS, and CVQA-FR significantly outperformed both conventional full-reference models (e.g., PSNR, SSIM) and all tested no-reference models in correlation with human ratings. However, none reached the accuracy needed to replace subjective testing. Most models overestimated SCST's overly sharp results, while VMAF failed primarily due to spatial inconsistencies introduced by Starlight Mini. The researchers conclude that current video quality models are not yet reliable for evaluating diffusion-based VSR, and they have released all videos, ratings, and model scores as open data to support further research.
- CNN-based full-reference models (LPIPS, DISTS, CVQA-FR) showed the highest correlation with human ratings, outperforming traditional metrics like PSNR, SSIM, and VMAF.
- Most models overestimated the quality of SCST's overly sharp outputs, and VMAF failed primarily because of spatial inconsistencies introduced by Starlight Mini.
- None of the 20+ quality models achieved sufficient accuracy to replace subjective testing for diffusion-based video super-resolution.
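To make "correlation with human ratings" concrete, here is a minimal sketch of how a quality model's per-sequence scores can be compared against mean opinion scores (MOS) using Pearson and Spearman correlation. It is not the authors' evaluation code. The function names and all numbers are hypothetical and chosen only for illustration; none of them come from the study's released data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL data: one entry per upscaled sequence (not from the study).
mos = [4.2, 3.1, 2.5, 3.8, 1.9, 4.6]                # human ratings, higher is better
distance_scores = [0.12, 0.30, 0.41, 0.35, 0.55, 0.08]  # LPIPS-style distance, lower is better

plcc = pearson(mos, distance_scores)
srocc = spearman(mos, distance_scores)

# Distance-style metrics correlate negatively with MOS, so compare magnitudes.
print(f"|PLCC| = {abs(plcc):.3f}, |SROCC| = {abs(srocc):.3f}")
```

A model that tracked human perception perfectly would reach a magnitude of 1.0 on both measures. A failure like the one reported for SCST would show up as sequences whose model score is much better than their MOS, which drags the correlation down.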
Why It Matters
As diffusion-based VSR becomes common, the study reveals a critical gap: automated quality metrics can't yet accurately evaluate its outputs, so human testing remains necessary.