Research & Papers

Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

A new decoding layer fixes LLMs' logical contradictions and reduces 'Unknown' answers with just 4-5 extra calls.

Deep Dive

A team of researchers has introduced a new method called Consistency-Guided Decoding with Proof-Driven Disambiguation (CGD-PD) to fix critical logical reasoning failures in large language models (LLMs). The work targets three-way logical question answering, where models must label a hypothesis as True, False, or Unknown based on a set of premises. The authors identified two persistent failure modes: negation inconsistency, where a model gives contradictory answers to a statement and its direct negation, and epistemic 'Unknown,' where a model defaults to 'Unknown' even when the premises logically entail a definitive answer.
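To pin down what "contradictory" means here, the consistency condition implied by the three-way label set can be stated in a few lines. This is an illustrative sketch, assuming string labels and a helper name of our own choosing, not the paper's API:

```python
# Negation consistency over the label set {True, False, Unknown}:
# the label for not-H must be the logical flip of the label for H.
FLIP = {"True": "False", "False": "True", "Unknown": "Unknown"}

def negation_consistent(label_h: str, label_not_h: str) -> bool:
    """True iff the model's labels for H and not-H are logically compatible."""
    return FLIP[label_h] == label_not_h

assert not negation_consistent("True", "True")    # failure mode 1: contradiction
assert negation_consistent("Unknown", "Unknown")  # failure mode 2 passes this check,
# since abstaining on both H and not-H is consistent; it is still wrong whenever
# the premises entail a definitive answer, which is why a separate
# disambiguation step is needed.
```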

CGD-PD acts as a lightweight layer applied during inference. It first queries a model on both a hypothesis and its mechanically negated form, then projects the results onto a negation-consistent decision. If the outcome is still 'Unknown,' it triggers a proof-driven disambiguation step. This step uses targeted binary entailment probes—asking the model simpler yes/no questions derived from the original logic—to resolve the ambiguity. Remarkably, this entire process requires an average of only 4 to 5 additional calls to the underlying LLM.
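To make the control flow concrete, here is a minimal Python sketch of that loop. The helper names (ask_label, ask_entailment, negate) are hypothetical stand-ins for the underlying LLM prompts, and the projection rule below is one plausible reading of "projects the results onto a negation-consistent decision"; the paper's exact rule may differ.

```python
# Sketch of the CGD-PD inference loop under the assumptions stated above.
FLIP = {"True": "False", "False": "True", "Unknown": "Unknown"}

def ask_label(premises: str, hypothesis: str) -> str:
    """One LLM call returning a three-way label: 'True', 'False', or 'Unknown'."""
    raise NotImplementedError

def ask_entailment(premises: str, statement: str) -> bool:
    """One LLM call answering a binary probe: do the premises entail statement?"""
    raise NotImplementedError

def negate(hypothesis: str) -> str:
    """Mechanical negation of the hypothesis text."""
    return f"It is not the case that {hypothesis}"

def cgd_pd(premises: str, hypothesis: str) -> str:
    # Step 1: label both the hypothesis and its negation (2 calls).
    ans_pos = ask_label(premises, hypothesis)
    ans_neg = ask_label(premises, negate(hypothesis))

    # Step 2: project onto a negation-consistent decision.
    if FLIP[ans_neg] == ans_pos:
        decision = ans_pos              # already consistent
    elif ans_neg == "Unknown":
        decision = ans_pos              # trust the definitive side
    elif ans_pos == "Unknown":
        decision = FLIP[ans_neg]
    else:
        decision = "Unknown"            # hard contradiction: fall through

    # Step 3: proof-driven disambiguation via binary entailment probes,
    # triggered only when the projected decision is still 'Unknown'.
    if decision == "Unknown":
        if ask_entailment(premises, hypothesis):
            decision = "True"
        elif ask_entailment(premises, negate(hypothesis)):
            decision = "False"

    return decision
```

Note the call budget: two label queries plus at most two entailment probes in this sketch, which is in the same ballpark as the paper's reported average of 4 to 5 additional calls.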

On the first-order logic subsets of the FOLIO benchmark, CGD-PD delivered consistent accuracy gains across several frontier LLMs, with relative improvements of up to 16% over each base model's vanilla performance. The method also cut the number of unhelpful 'Unknown' predictions, pushing models toward definitive, logically sound conclusions. Because it operates entirely at inference time, it makes LLMs more reliable on complex, structured reasoning tasks without expensive retraining.

Key Points
  • Fixes two key LLM logic failures: negation inconsistency and epistemic 'Unknown' predictions.
  • Adds an average of only 4-5 extra model calls per query across the consistency check and disambiguation steps, keeping the method highly efficient.
  • Boosts accuracy by up to 16% (relative) on FOLIO's first-order logic subsets and reduces ambiguous 'Unknown' outputs.

Why It Matters

This makes LLMs more reliable for critical reasoning in law, finance, and research without costly model retraining.