Research & Papers

Conformal Aggregation Achieves 90% Accuracy by Knowing When to Abstain

New method replaces majority voting with risk-controlled abstention, lifting GSM8K accuracy from 82% under majority voting to 90.1% on the problems it answers.

Deep Dive

A new research paper by Yu Gu, Zijun Yu, Vahid Partovi Nia, and Masoud Asgharian tackles the central challenge of aggregation uncertainty in chain-of-thought (CoT) reasoning. Instead of relying on majority voting over multiple sampled reasoning paths—which can produce confidently incorrect answers—the authors propose a conformal aggregation procedure. Their approach uses weighted score aggregation and calibrates an abstention rule via conformal risk control, providing finite-sample statistical guarantees on the confident-error rate (the probability that the system answers and is wrong). This makes the model safer by allowing it to abstain when uncertainty is high, rather than outputting a plausible but incorrect response.
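A minimal sketch of the two steps described above, under stated assumptions rather than as the authors' implementation: each sampled reasoning path is assumed to come with a positive weight, the aggregate score is taken to be the winning answer's share of total weight, and the threshold is calibrated with the generic conformal risk control recipe on a labeled calibration set. The names `aggregate` and `calibrate_threshold` are hypothetical.

```python
from collections import defaultdict


def aggregate(paths):
    """Weighted score aggregation over sampled reasoning paths.

    paths: non-empty list of (answer, weight) pairs with positive weights
           (how weights are obtained is an assumption left open here).
    Returns the top answer and its share of the total weight.
    """
    totals = defaultdict(float)
    for answer, weight in paths:
        totals[answer] += weight
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def calibrate_threshold(scores, correct, alpha):
    """Pick an abstention threshold by conformal risk control.

    The loss for one example is 1 if the system answers (score >= tau)
    and is wrong, else 0. It can only shrink as tau grows, so we return
    the smallest candidate tau whose inflated empirical risk,
    (number of confident errors + 1) / (n + 1), is at most alpha.
    Returns infinity (always abstain) if no candidate qualifies.
    """
    n = len(scores)
    candidates = sorted(set(scores)) + [float("inf")]
    for tau in candidates:
        confident_errors = sum(
            1 for s, c in zip(scores, correct) if s >= tau and not c
        )
        if (confident_errors + 1) / (n + 1) <= alpha:
            return tau
    return float("inf")
```

At inference time the system would answer with the aggregated answer only when its score is at least the calibrated threshold, and abstain otherwise. With a very small calibration set or a very strict `alpha`, the rule falls back to always abstaining, which is the conservative outcome.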

The method is evaluated across four benchmarks and four open-source models. On GSM8K, selective accuracy (accuracy on the problems the system chooses to answer) reaches 90.1% while abstaining on fewer than 5% of problems, compared with 82% accuracy for majority voting. The paper also derives closed-form expressions that predict accuracy gains from calibration data alone. Because the technique operates entirely at inference time and requires no retraining, it can be applied to existing CoT-based systems. The key insight is score separability: when it holds, abstention provably improves selective accuracy, which makes the method a practical upgrade for systems that rely on chain-of-thought reasoning.
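The reported quantities measure different things, and a short sketch makes the distinction concrete. It assumes each evaluated problem has an aggregate score and a correctness label, and that the system answers whenever the score is at least a threshold `tau`; the function name `selective_metrics` is hypothetical.

```python
def selective_metrics(scores, correct, tau):
    """Summarize a thresholded abstention rule on labeled data.

    Returns (selective_accuracy, abstention_rate, confident_error_rate):
      - selective accuracy: fraction correct among answered problems
      - abstention rate: fraction of all problems left unanswered
      - confident-error rate: fraction of all problems that were
        answered and wrong
    """
    n = len(scores)
    answered = [c for s, c in zip(scores, correct) if s >= tau]
    abstention_rate = 1 - len(answered) / n
    # Selective accuracy is undefined when nothing is answered.
    selective_accuracy = (
        sum(answered) / len(answered) if answered else float("nan")
    )
    confident_error_rate = sum(1 for c in answered if not c) / n
    return selective_accuracy, abstention_rate, confident_error_rate
```

Selective accuracy is normalized by the number of answered problems, while the confident-error rate is normalized by all problems, so the two do not generally sum to one.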

Key Points
  • Replaces majority voting with weighted score aggregation and a conformal risk control abstention rule.
  • Provides finite-sample guarantees on the confident-error rate (the probability that the system answers and is wrong).
  • Achieves 90.1% selective accuracy on GSM8K while abstaining on fewer than 5% of problems, vs. 82% accuracy for majority voting.

Why It Matters

Makes AI reasoning safer by letting models abstain when uncertainty is high instead of returning a plausible but incorrect answer, reducing costly errors.