Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
A new confidence-aware framework reduces token usage by up to 80% in chain-of-thought reasoning.
A research team from UMass Amherst and collaborating institutions has developed a framework called "confidence-aware self-consistency" that substantially improves the efficiency of chain-of-thought (CoT) reasoning in large language models (LLMs). Standard self-consistency generates many reasoning paths and aggregates their answers by majority vote, which is computationally expensive. The new approach instead analyzes linguistic and numeric features from a *single* completed reasoning trajectory to estimate the model's confidence, then adaptively decides whether to commit to that answer or sample additional paths.
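To make the decision rule concrete, here is a minimal sketch of how such an adaptive loop might be wired up in Python. The names `generate_cot`, `extract_features`, `confidence_model`, and the threshold `tau` are illustrative stand-ins rather than the paper's actual interfaces, and the confidence scorer is assumed to expose a scikit-learn-style `predict_proba`.

```python
from collections import Counter

def answer_with_adaptive_sampling(question, generate_cot, extract_features,
                                  confidence_model, tau=0.9, k=10):
    """Sketch of confidence-aware self-consistency.

    generate_cot(question) -> (reasoning_text, answer) samples one
    chain-of-thought; extract_features and confidence_model stand in
    for the paper's single-trajectory feature extractor and learned
    confidence scorer (interfaces assumed, not from the paper).
    """
    # Step 1: sample a single chain-of-thought trajectory.
    reasoning, answer = generate_cot(question)

    # Step 2: estimate confidence from that one trajectory alone.
    features = extract_features(reasoning, answer)
    confidence = confidence_model.predict_proba([features])[0][1]

    # Step 3: commit if confident; otherwise fall back to standard
    # self-consistency over k sampled paths with majority voting.
    if confidence >= tau:
        return answer                      # cheap path: one trajectory
    votes = Counter([answer])
    for _ in range(k - 1):
        _, a = generate_cot(question)
        votes[a] += 1
    return votes.most_common(1)[0][0]      # majority vote
```

Raising `tau` trades tokens for accuracy: more questions fall back to full self-consistency, approaching the multi-path baseline at the cost of extra sampling.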
The framework's confidence estimator was trained on sentence-level features extracted from intermediate reasoning states in the MedQA medical question-answering dataset. Remarkably, it generalized to benchmarks such as MathQA, MedMCQA, and MMLU without any additional task-specific fine-tuning. Experiments show the method maintains accuracy comparable to costly multi-path baselines while using up to 80% fewer tokens, indicating that a single reasoning path carries rich signals about its own uncertainty and enabling a simple yet effective way to balance accuracy and computational cost.
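The paper's exact feature set is not reproduced in this summary, but a toy extractor over a single trajectory, compatible with the sketch above, might combine linguistic and numeric signals like the following. `HEDGE_WORDS` and the four features are assumptions for illustration only.

```python
import re

# Hypothetical hedging cues; the paper's actual linguistic
# features are not reproduced here.
HEDGE_WORDS = {"maybe", "possibly", "however", "unclear", "assume"}

def extract_features(reasoning: str, answer: str) -> list[float]:
    """Toy sentence-level feature vector for one reasoning trajectory."""
    sentences = [s for s in re.split(r"[.!?]\s+", reasoning) if s]
    tokens = reasoning.lower().split()
    numbers = re.findall(r"\d+(?:\.\d+)?", reasoning)
    return [
        len(sentences),                                      # trace length
        sum(t.strip(",;:") in HEDGE_WORDS for t in tokens),  # hedging cues
        len(numbers),                                        # numeric density
        float(answer.lower() in reasoning.lower()),          # answer restated?
    ]
```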
The work highlights a significant shift from brute-force sampling to intelligent, confidence-driven computation. By learning when to trust a single answer and when to seek consensus, the method paves the way for more sustainable and scalable deployment of advanced LLM reasoning in real-world applications where inference cost is a critical constraint.
- Cuts token usage by up to 80% compared to standard self-consistency methods while maintaining comparable accuracy.
- Analyzes features from a single reasoning path to decide adaptively between single-path and multi-path reasoning.
- Trained on MedQA but generalizes to MathQA, MedMCQA, and MMLU without additional fine-tuning, showing strong transferability.
Why It Matters
Enables cost-effective deployment of complex LLM reasoning for applications like medical QA, coding, and research, where accuracy and efficiency are both critical.