Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Study shows optimal temperature varies by prompting strategy, challenging standard practices.
A new arXiv preprint from researchers Mousa Salah and Amgad Muneer systematically investigates how temperature settings affect prompting strategies in extended reasoning large language models. The study challenges the common practice of using T=0 for reasoning tasks by evaluating chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using xAI's Grok-4.1 model on 39 challenging mathematical problems from the AMO-Bench benchmark.
Key findings reveal that zero-shot prompting peaks at moderate temperatures, reaching 59% accuracy at both T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes (T=0.0 and T=1.0). Most strikingly, the benefit of extended reasoning, in which models perform explicit test-time computation, grows from 6x at T=0.0 to 14.3x at T=1.0, suggesting that temperature optimization is crucial for maximizing reasoning capabilities.
The research demonstrates that temperature and prompting strategy must be optimized jointly rather than independently. This has practical implications for developers and researchers working with advanced reasoning models, indicating that default settings may be leaving significant performance gains on the table. The study provides concrete guidance for configuring extended reasoning systems to achieve optimal results across different problem types and complexity levels.
- Zero-shot prompting peaks at 59% accuracy at moderate temperatures (T=0.4-0.7)
- Extended reasoning benefits increase from 6x at T=0.0 to 14.3x at T=1.0
- Chain-of-thought performs best at temperature extremes, not at T=0 as commonly assumed
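The joint sweep behind these findings can be sketched as a small grid search over (strategy, temperature) pairs. In this sketch, `evaluate` is a placeholder scorer: only the 59% zero-shot peak at moderate temperatures reflects the paper's reported numbers, and the other values are illustrative stand-ins; a real run would call the model API and score answers against the benchmark.

```python
from itertools import product

# The two prompting strategies and four temperature settings from the study.
STRATEGIES = ["zero-shot", "chain-of-thought"]
TEMPERATURES = [0.0, 0.4, 0.7, 1.0]

def evaluate(strategy: str, temperature: float) -> float:
    """Placeholder accuracy; replace with real benchmark runs.

    Illustrative values only: the 0.59 zero-shot peak at moderate
    temperatures mirrors the paper; the rest are made-up stand-ins
    chosen to reproduce the qualitative shape of the findings.
    """
    if strategy == "zero-shot":
        return 0.59 if temperature in (0.4, 0.7) else 0.45
    # Chain-of-thought: best at the temperature extremes.
    return 0.50 if temperature in (0.0, 1.0) else 0.42

def best_configuration():
    # Optimize temperature and strategy jointly, not independently:
    # score every cell of the grid, then take the argmax.
    grid = product(STRATEGIES, TEMPERATURES)
    scores = {cfg: evaluate(*cfg) for cfg in grid}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = best_configuration()
print(best)
```

The point of the structure, echoing the paper's conclusion, is that fixing temperature first (e.g. defaulting to T=0) and then choosing a prompt can miss the jointly optimal cell of the grid.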
Why It Matters
Optimizing temperature with prompting strategy can dramatically improve reasoning performance in AI systems.