CRAFT pipeline boosts multi-video QA with critic loop and 0.810 recall

arXiv cs.CV May 20, 2026

⚡ UNLI, DeBERTa-v3, and Llama-3.2-3B combine to verify and cite every claim drawn from heterogeneous video archives.

Deep Dive

A team led by Mahesh Bhosale at the University at Buffalo has released CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a new pipeline designed to answer factual questions across multiple news videos while automatically attributing every claim to its source video. Unlike typical video QA systems that treat clips in isolation, CRAFT first selects query-relevant keyframes dynamically, then extracts transcripts using per-video ASR with a multilingual fallback to handle code-switched or non-English audio. The core innovation is a hybrid critic loop that runs three checks on each generated claim: UNLI for temporal entailment, DeBERTa-v3 for cross-claim consistency, and a Llama-3.2-3B adjudicator that flags and repairs errors before final consolidation. A citation-merging step emits each fact only once but with all supporting source identifiers.
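To make the verify, repair, and merge flow described above concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Claim`, `critic_loop`, `merge_citations`, `has_support`, and `drop_claim` are hypothetical, the checks are plain callables standing in for the UNLI, DeBERTa-v3, and Llama-3.2-3B components, and duplicate facts are matched by normalized text only, which is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Claim:
    text: str
    sources: set[str] = field(default_factory=set)  # IDs of supporting videos


# A check sees the candidate claim plus the claims accepted so far.
Check = Callable[[Claim, list[Claim]], bool]
# A repair returns a revised claim, or None to drop it.
Repair = Callable[[Claim], Optional[Claim]]


def critic_loop(claims: list[Claim], checks: list[Check], repair: Repair,
                max_rounds: int = 2) -> list[Claim]:
    """Keep a claim only once every check passes, repairing up to max_rounds times."""
    accepted: list[Claim] = []
    for claim in claims:
        candidate = claim
        for attempt in range(max_rounds + 1):
            if all(check(candidate, accepted) for check in checks):
                accepted.append(candidate)
                break
            if attempt == max_rounds:
                break  # repair budget exhausted: drop the claim
            candidate = repair(candidate)
            if candidate is None:
                break  # the repair step chose to drop the claim
    return accepted


def merge_citations(claims: list[Claim]) -> list[Claim]:
    """Emit each fact once, carrying the union of its supporting source IDs."""
    merged: dict[str, Claim] = {}
    for claim in claims:
        # Assumption: facts are "the same" if their normalized text matches.
        key = " ".join(claim.text.lower().split())
        if key in merged:
            merged[key].sources |= claim.sources
        else:
            merged[key] = Claim(claim.text, set(claim.sources))
    return list(merged.values())


# Toy stand-ins for the real critics: accept only claims that cite a source.
def has_support(claim: Claim, accepted: list[Claim]) -> bool:
    return bool(claim.sources)


def drop_claim(claim: Claim) -> Optional[Claim]:
    return None


if __name__ == "__main__":
    raw = [
        Claim("The bridge reopened on Monday.", {"vid_01"}),
        Claim("the bridge  reopened on monday.", {"vid_07"}),
        Claim("Unsupported rumor.", set()),
    ]
    for fact in merge_citations(critic_loop(raw, [has_support], drop_claim)):
        print(fact.text, sorted(fact.sources))
```

In a real system each check would wrap a model call, and the repair step would prompt the adjudicator to rewrite the claim rather than drop it. The sketch only fixes the control flow: verify, repair within a budget, then merge.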

On the MAGMaR 2026 benchmark, CRAFT achieved the best overall average (0.739) and the best citation F1 (0.635), with a reference recall of 0.810. These scores suggest that it recovers most of the reference evidence and credits it to the correct source videos more accurately than the systems it was compared against. The system also generalized well to a MAGMaR-style conversion of WikiVideo (52 queries), achieving an average of 0.823. Ablation studies confirmed that atomic claim generation, ASR, and the critic loop are the primary drivers of improvement over a vanilla baseline. The code is publicly available, making CRAFT a practical foundation for building trustworthy, source-grounded video question-answering systems.
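For readers unfamiliar with the metric, the sketch below shows how a citation F1 is commonly computed from sets of predicted and reference source IDs. This is a generic set-based definition assumed for illustration, not MAGMaR's official scorer, whose exact matching rules are not described here. The function name `citation_prf` and the video IDs are hypothetical.

```python
def citation_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one claim's cited sources.

    Assumption: a citation counts as correct only if its ID is in the gold set.
    """
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Two of three predicted citations are correct; two of four gold sources are found.
    p, r, f1 = citation_prf({"vid_01", "vid_02", "vid_09"},
                            {"vid_01", "vid_02", "vid_03", "vid_04"})
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a system cannot score well by citing every video (high recall, low precision) or by citing only one safe source (high precision, low recall).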

Key Points

CRAFT's hybrid critic loop uses UNLI, DeBERTa-v3, and Llama-3.2-3B to iteratively verify and repair claims.
Achieves best overall average (0.739) and citation F1 (0.635) on MAGMaR 2026, with reference recall at 0.810.
Generalizes to WikiVideo (0.823 Avg) and includes multilingual ASR fallback for non-English audio.

Why It Matters

CRAFT enables verifiable, source-grounded QA over messy news archives, a capability that is critical for fact-checking and media analysis.
