Researchers find 39% of FOLIO validation entries have incorrect logic formalizations; LLM-assisted relabeling framework helps fix them
LLMs tested on the corrected datasets saw accuracy gains of 9 to 22 percentage points.
A new study by Andrea Brunello et al. systematically inspected the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect first-order logic (FOL) formalizations in their ground-truth labels. The authors also found ambiguous natural-language (NL) sentences in 16.4% of FOLIO and 48% of MALLS entries, plus incorrect natural language inference (NLI) labels in 8.4% of FOLIO entries. They released corrected ground truths and evaluated three SOTA LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) on the cleaned data, observing accuracy improvements of 9 to 22 percentage points, which shows how strongly annotation errors can distort model evaluation.
- 39% of FOLIO and 36% of MALLS entries had incorrect FOL formalizations.
- Correcting ground truths boosted LLM accuracy by 9–22 percentage points across Gemma 4, Qwen3, and GPT-4o-mini.
- The authors' LLM-assisted framework reaches 90% dataset accuracy after relabeling fewer than 24% of instances, whereas unguided relabeling needs more than 70%.
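
To see why guided relabeling can need so much less effort than unguided relabeling, here is a back-of-the-envelope sketch in Python. It is a toy model, not the authors' framework: the function `relabel_fraction`, the `hit_rate` parameter, and the input values (a 30% error rate and a 90% hit rate) are all hypothetical and are not figures from the study. It assumes that every wrong label an annotator inspects gets fixed, that no correct label is broken, and that the hit rate stays constant.

```python
def relabel_fraction(error_rate: float, target_accuracy: float, hit_rate: float) -> float:
    """Fraction of a dataset to inspect so that label accuracy reaches the target.

    hit_rate is the share of inspected instances that are actually wrong.
    For unguided (uniform random) inspection it equals error_rate on average;
    a guided ranking that surfaces likely errors first has a higher hit_rate.
    Toy assumptions: inspected wrong labels are always fixed, correct labels
    are never broken, and hit_rate stays constant throughout.
    """
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    # Share of the dataset whose labels must be corrected to hit the target.
    errors_to_fix = max(0.0, error_rate - (1.0 - target_accuracy))
    # Each inspected instance fixes an error with probability hit_rate.
    return min(1.0, errors_to_fix / hit_rate)


# Hypothetical inputs for illustration only (not taken from the study).
error_rate = 0.30
target = 0.90

unguided = relabel_fraction(error_rate, target, hit_rate=error_rate)
guided = relabel_fraction(error_rate, target, hit_rate=0.90)
print(f"unguided: {unguided:.1%} of instances, guided: {guided:.1%} of instances")
```

In this toy setting, random inspection has to cover about two thirds of the dataset, while a ranking that is right nine times out of ten needs to cover under a quarter. The gap comes entirely from how often an inspected instance turns out to be a real error.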
Why It Matters
Fixes broken benchmarks in neurosymbolic AI, ensuring fair evaluation and saving 75% of relabeling effort.