Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
Researchers use the free energy principle to let LLMs teach themselves to reason without human labels
FREIA tackles a core problem in unsupervised RL for LLMs: existing methods fail to adapt as the model's reasoning evolves, often misdirecting policy optimization without ground-truth supervision. The algorithm introduces two key innovations. First, the Free Energy-Driven Reward (FER) dynamically adjusts rewards to maintain a balance between consensus and exploration, inspired by the Free Energy Principle. Second, Adaptive Advantage Shaping (AAS) fine-tunes learning signals based on the statistical properties of sampled rewards, avoiding overfitting to noisy feedback.
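The paper's exact formulation of FER isn't reproduced here, but a minimal sketch can make the idea concrete, assuming FER resembles a self-consistency (majority-agreement) reward modulated by the entropy of the group's sampled answers. The function name `free_energy_reward`, the `beta` weight, and the confidence blending below are illustrative assumptions, not the paper's method:

```python
from collections import Counter
import math

def free_energy_reward(answers, beta=1.0):
    """Illustrative FER-style reward for a group of sampled answers.

    Consensus term: probability mass the group puts on each answer.
    Exploration term: bonus for minority answers, weighted up when the
    group's answer distribution has high entropy, i.e. the model is
    still uncertain and the majority vote shouldn't be trusted yet.
    """
    n = len(answers)
    probs = {a: c / n for a, c in Counter(answers).items()}
    # Entropy of the empirical answer distribution, normalized to [0, 1].
    entropy = -sum(p * math.log(p) for p in probs.values())
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    confidence = 1.0 - entropy / max_entropy  # 1 = unanimous, 0 = uniform
    # Blend: trust consensus when confident, reward exploration otherwise.
    return [
        confidence * probs[a] + beta * (1.0 - confidence) * (1.0 - probs[a])
        for a in answers
    ]

# When the group is unanimous, confidence = 1 and every sample receives
# the full consensus reward; as disagreement grows, reward mass shifts
# toward minority answers, keeping exploration alive.
```

The point of the blend is the dynamic adjustment the paper describes: the same answer distribution is rewarded differently depending on how settled the group is, rather than always chasing the current majority.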
Evaluated across nine datasets spanning mathematical, logical, and commonsense reasoning tasks, FREIA consistently outperformed other unsupervised RL baselines. Notably, on math reasoning using the DeepSeek-R1-Distill-Qwen-1.5B model, it improved Pass@1 accuracy by 0.5 to 3.5 percentage points. The paper, accepted at ACL 2026, demonstrates that LLMs can self-improve in a reward-free setting, reducing reliance on expensive human annotations and opening the door to more autonomous model refinement.
- FREIA uses a Free Energy-Driven Reward (FER) to balance exploration and consensus during unsupervised RL training
- Adaptive Advantage Shaping (AAS) adjusts learning signals based on reward statistics, preventing misdirected policy updates (see the sketch after this list)
- Achieved a 0.5–3.5 percentage-point Pass@1 improvement on math reasoning over unsupervised RL baselines with DeepSeek-R1-Distill-Qwen-1.5B, evaluated across nine datasets
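As a rough illustration of what statistics-aware advantage shaping might look like, the sketch below starts from GRPO-style group normalization and damps the result when the within-group reward spread is small. The helper name `shaped_advantages` and the `min_std` threshold are assumptions for illustration, not the paper's AAS:

```python
import numpy as np

def shaped_advantages(rewards, min_std=0.1):
    """Illustrative AAS-style shaping on group-normalized advantages.

    Plain group normalization (as in GRPO) divides by the reward std,
    which amplifies tiny, likely-noisy reward differences. Here the
    advantages are shrunk whenever the within-group reward spread falls
    below `min_std`, so near-tied rewards yield weak gradients instead
    of overfitting the policy to noise.
    """
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    adv = (r - r.mean()) / (std + 1e-8)  # baseline GRPO-style advantage
    damp = min(1.0, std / min_std)       # shrink when spread is small
    return damp * adv
```

The design choice mirrors the bullet above: the learning signal is scaled by a statistic of the sampled rewards, so uninformative groups contribute little to the policy update.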
Why It Matters
FREIA lets LLMs improve their own reasoning without human labels, cutting annotation costs and making unsupervised training more autonomous and scalable.