Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
Researchers use the free energy principle to let LLMs teach themselves to reason without human labels
FREIA tackles a core problem in unsupervised RL for LLMs: existing methods fail to adapt as the model's reasoning evolves, often misdirecting policy optimization without ground-truth supervision. The algorithm introduces two key innovations. First, the Free Energy-Driven Reward (FER) dynamically adjusts rewards to maintain a balance between consensus and exploration, inspired by the Free Energy Principle. Second, Adaptive Advantage Shaping (AAS) fine-tunes learning signals based on the statistical properties of sampled rewards, avoiding overfitting to noisy feedback.
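The paper's exact formulation of FER isn't reproduced here, but a minimal sketch can make the idea concrete, assuming FER resembles a self-consistency (majority-agreement) reward modulated by the entropy of the group's sampled answers. The function name `free_energy_reward`, the `beta` weight, and the confidence blending below are illustrative assumptions, not the paper's method:

```python
from collections import Counter
import math

def free_energy_reward(answers, beta=1.0):
    """Illustrative FER-style reward for a group of sampled answers.

    Consensus term: probability mass the group puts on each answer.
    Exploration term: bonus for minority answers, weighted up when the
    group's answer distribution has high entropy, i.e. the model is
    still uncertain and the majority vote shouldn't be trusted yet.
    """
    n = len(answers)
    probs = {a: c / n for a, c in Counter(answers).items()}
    # Entropy of the empirical answer distribution, normalized to [0, 1].
    entropy = -sum(p * math.log(p) for p in probs.values())
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    confidence = 1.0 - entropy / max_entropy  # 1 = unanimous, 0 = uniform
    # Blend: trust consensus when confident, reward exploration otherwise.
    return [
        confidence * probs[a] + beta * (1.0 - confidence) * (1.0 - probs[a])
        for a in answers
    ]

# When the group is unanimous, confidence = 1 and every sample receives
# the full consensus reward; as disagreement grows, reward mass shifts
# toward minority answers, keeping exploration alive.
```

The point of the blend is the dynamic adjustment the paper describes: the same answer distribution is rewarded differently depending on how settled the group is, rather than always chasing the current majority.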
Evaluated across nine datasets spanning mathematical, logical, and commonsense reasoning tasks, FREIA consistently outperformed other unsupervised RL baselines. Notably, on math reasoning using the DeepSeek-R1-Distill-Qwen-1.5B model, it improved Pass@1 accuracy by 0.5 to 3.5 percentage points. The paper, accepted at ACL 2026, demonstrates that LLMs can self-improve in a reward-free setting, reducing reliance on expensive human annotations and opening the door to more autonomous model refinement.
- FREIA uses a Free Energy-Driven Reward (FER) to balance exploration and consensus during unsupervised RL training
- Adaptive Advantage Shaping (AAS) adjusts learning signals based on reward statistics, preventing misdirected policy updates (see the sketch after this list)
- Achieved a 0.5–3.5 percentage-point Pass@1 improvement on math reasoning over unsupervised RL baselines with DeepSeek-R1-Distill-Qwen-1.5B, evaluated across nine datasets
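As a rough illustration of what statistics-aware advantage shaping might look like, the sketch below starts from GRPO-style group normalization and damps the result when the within-group reward spread is small. The helper name `shaped_advantages` and the `min_std` threshold are assumptions for illustration, not the paper's AAS:

```python
import numpy as np

def shaped_advantages(rewards, min_std=0.1):
    """Illustrative AAS-style shaping on group-normalized advantages.

    Plain group normalization (as in GRPO) divides by the reward std,
    which amplifies tiny, likely-noisy reward differences. Here the
    advantages are shrunk whenever the within-group reward spread falls
    below `min_std`, so near-tied rewards yield weak gradients instead
    of overfitting the policy to noise.
    """
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    adv = (r - r.mean()) / (std + 1e-8)  # baseline GRPO-style advantage
    damp = min(1.0, std / min_std)       # shrink when spread is small
    return damp * adv
```

The design choice mirrors the bullet above: the learning signal is scaled by a statistic of the sampled rewards, so uninformative groups contribute little to the policy update.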
Why It Matters
FREIA lets LLMs improve their own reasoning without human labels, cutting annotation costs and making unsupervised training more autonomous and scalable.