Context- and Pixel-aware Large Language Model for Video Quality Assessment
A new multimodal LLM detects compression artifacts that other models miss and explains why a video looks bad.
Video quality assessment has long been split between traditional models that detect pixel-level distortions but lack context, and modern multimodal LLMs that understand scenes but miss subtle defects. CP-LLM bridges this gap with a novel architecture featuring dual vision encoders: one for high-level video context (objects, motion, narrative), another for low-level perceptual quality (blocking, ringing, noise). A shared language decoder then fuses both signals to output a robust quality score alongside an interpretable textual explanation.
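To make the dual-encoder idea concrete, here is a minimal PyTorch-style sketch of how pooled features from a semantic encoder and a low-level distortion encoder could be fused into a single quality prediction. This is an illustrative assumption, not the authors' implementation: the module names, feature dimensions, and the simple regression head stand in for CP-LLM's shared language decoder, which additionally generates the textual explanation.

```python
import torch
import torch.nn as nn


class DualEncoderQualityHead(nn.Module):
    """Hypothetical fusion head over two vision encoders (illustrative only)."""

    def __init__(self, context_dim=768, pixel_dim=256, hidden_dim=512):
        super().__init__()
        # Project each encoder's pooled features into a shared space.
        self.context_proj = nn.Linear(context_dim, hidden_dim)
        self.pixel_proj = nn.Linear(pixel_dim, hidden_dim)
        # Fuse and regress a scalar score; CP-LLM instead feeds fused tokens
        # to a language decoder that emits a score plus an explanation.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, context_feats, pixel_feats):
        # context_feats: (batch, context_dim) pooled semantic features
        # pixel_feats:   (batch, pixel_dim) pooled distortion features
        fused = torch.cat(
            [self.context_proj(context_feats), self.pixel_proj(pixel_feats)],
            dim=-1,
        )
        return self.fusion(fused).squeeze(-1)  # one quality score per video


# Usage with random stand-in features from the two (frozen) vision encoders.
head = DualEncoderQualityHead()
context = torch.randn(4, 768)  # e.g. from a high-level video/scene encoder
pixels = torch.randn(4, 256)   # e.g. from a low-level distortion encoder
scores = head(context, pixels)
print(scores.shape)  # torch.Size([4])
```

Keeping the two feature streams separate until fusion is the design choice that lets pixel-level artifact cues survive alongside the much stronger semantic signal, rather than being washed out inside a single encoder.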
The model's key innovation is its separate treatment of semantic and distortion information, allowing it to remain sensitive to small compression artifacts that would be overlooked by a single-encoder MLLM. Extensive experiments on standard VQA benchmarks show state-of-the-art cross-dataset performance, and CP-LLM also demonstrates superior robustness when tested on adversarial pixel-level distortions. Accepted at ICIP 2026, this work points toward AI systems that can both quantify and articulate video quality, with applications in streaming optimization, content moderation, and video production.
- Dual vision encoders independently process semantic context and pixel-level distortions, then fuse them via a language decoder.
- Achieves state-of-the-art cross-dataset performance on VQA benchmarks with enhanced sensitivity to compression artifacts.
- Outputs both a numeric quality score and a natural language description of why a video is degraded.
Why It Matters
Streaming services and video platforms can now automate quality checks with explainable AI, reducing manual review.