Researchers find 39% of FOLIO entries have incorrect logic formalizations – LLM-assisted relabeling framework helps fix them

arXiv cs.CL June 03, 2026

⚡ LLMs evaluated on the corrected datasets gained 9 to 22 percentage points in accuracy.

Deep Dive

A new study by Andrea Brunello et al. systematically inspected the validation split of FOLIO and a subset of MALLS test instances. It found that approximately 39% of the FOLIO entries and 36% of the MALLS entries contain incorrect First-Order Logic (FOL) formalizations, meaning the ground-truth labels themselves are wrong. The authors also found ambiguous natural-language (NL) sentences in 16.4% of FOLIO and 48% of MALLS, plus incorrect NLI labels in 8.4% of FOLIO. They released corrected ground truths and evaluated three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) on the cleaned data. Accuracy improved by 9 to 22 percentage points, which shows how strongly annotation errors can distort model evaluation.

Key Points

39% of FOLIO and 36% of MALLS entries had incorrect FOL formalizations.
Correcting ground truths boosted LLM accuracy by 9–22 percentage points across Gemma 4, Qwen3, and GPT-4o-mini.
Their LLM-assisted framework reaches 90% dataset accuracy after relabeling fewer than 24% of instances, versus more than 70% of instances with unguided relabeling.
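
To see why the order in which instances are reviewed matters so much, here is a minimal Python sketch of the underlying arithmetic. It is not the authors' framework. It assumes that every reviewed instance ends up correctly labeled, and it compares two extremes: reviewing instances uniformly at random, and an idealized prioritization that only ever reviews instances that are actually wrong. The function names and the starting accuracy of 0.65 are hypothetical, chosen for illustration rather than taken from the paper.

```python
def random_review_fraction(initial_accuracy: float, target_accuracy: float) -> float:
    """Expected fraction of instances to review, picked uniformly at random,
    so that dataset accuracy reaches the target.

    Assumption (not from the paper): every reviewed instance ends up correct,
    so reviewing a fraction f fixes f * (1 - initial_accuracy) of the dataset.
    """
    for value in (initial_accuracy, target_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    if target_accuracy <= initial_accuracy:
        return 0.0  # already at or above the target
    error_rate = 1.0 - initial_accuracy
    return (target_accuracy - initial_accuracy) / error_rate


def ideal_guided_fraction(initial_accuracy: float, target_accuracy: float) -> float:
    """Lower bound on the review fraction: every reviewed instance was wrong."""
    for value in (initial_accuracy, target_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    return max(0.0, target_accuracy - initial_accuracy)


# Hypothetical starting point: 65% of labels correct, target of 90%.
start, target = 0.65, 0.90
print(f"random review: {random_review_fraction(start, target):.1%} of instances")
print(f"ideal guided:  {ideal_guided_fraction(start, target):.1%} of instances")
```

Under these assumptions, random review wastes most of its effort on labels that were already correct, which is the gap a guided framework tries to close. A real prioritization method would land somewhere between the two values.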

Why It Matters

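Label errors on this scale distort evaluation in neurosymbolic AI. Corrected ground truths make model comparisons fairer, and guided relabeling reaches 90% dataset accuracy by relabeling fewer than 24% of instances instead of more than 70%.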
