Research & Papers

New paper redefines LLM post-training: elicitation vs. creation

arXiv cs.AI May 12, 2026

⚡ SFT and RL both just reweight existing behaviors, a new paper argues.

Deep Dive

A new preprint from Li and Liu argues that the prevailing dichotomy between SFT (imitation) and RL (discovery) in LLM post-training oversimplifies what actually happens. They introduce the concept of 'accessible support': the set of behaviors a model can produce under finite budgets. Post-training that merely reweights behaviors already within this support is capability elicitation; post-training that changes the support itself is capability creation. Using a free-energy framework, they show that both SFT and RL reweight a pretrained reference distribution: demonstrations define the low-energy states for SFT, and rewards define them for RL. As long as the update stays close to the base model, the effect is local reweighting, not genuine capability creation.
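The reweighting picture can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes the standard Gibbs form p(y) ∝ p_ref(y) · exp(−E(y)/T), and the behaviors, energies, and the names `reweight` and `support` are all hypothetical. It shows that, whatever the energy signal, a behavior with zero reference probability stays at zero.

```python
import math

def reweight(p_ref, energy, temperature=1.0):
    """Return p(y) proportional to p_ref(y) * exp(-energy[y] / temperature)."""
    weights = {y: p * math.exp(-energy[y] / temperature) for y, p in p_ref.items()}
    z = sum(weights.values())  # normalizing constant
    return {y: w / z for y, w in weights.items()}

def support(dist):
    """Behaviors with strictly positive probability."""
    return {y for y, p in dist.items() if p > 0}

# Toy reference distribution over four behaviors; "d" lies outside the support.
p_ref = {"a": 0.70, "b": 0.25, "c": 0.05, "d": 0.0}

# Hypothetical energies (lower = more favored).
# SFT-style: low energy on demonstrated behavior "b".
sft_energy = {"a": 2.0, "b": 0.0, "c": 1.0, "d": 0.0}
# RL-style: energy taken as negative reward, favoring "c".
rl_energy = {"a": 3.0, "b": 1.0, "c": 0.0, "d": 0.0}

p_sft = reweight(p_ref, sft_energy)
p_rl = reweight(p_ref, rl_energy)

print("SFT-style:", p_sft)
print("RL-style: ", p_rl)
print("support unchanged:", support(p_sft) == support(p_rl) == support(p_ref))
```

In this toy setting both updates move probability mass among "a", "b", and "c", but neither can make "d" appear, even though "d" is assigned the most favorable energy in both cases.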

The paper shifts the central question from 'SFT vs. RL' to whether training expands the model's reachable behavioral space through mechanisms like search, tool use, interaction, or absorbing new information. This clarifies why some RL-heavy finetuning (e.g., chain-of-thought reasoning) can unlock new capabilities: it effectively expands the accessible support. Conversely, pure SFT on existing data only elicits what the model could already do. The authors ground their argument in free-energy principles from statistical mechanics, drawing a formal link between model training and thermodynamic systems. This theoretical lens could guide practitioners in choosing when to invest in elicitation (cheaper, safer) versus creation (more powerful but riskier).
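To illustrate what "reachable under a finite budget" might mean, here is a minimal sketch. It is an assumption made for illustration, not the paper's formal definition: a behavior counts as accessible if at least one of `budget` independent samples is likely to show it. The threshold, the distribution, and the function names are hypothetical.

```python
def hit_probability(p, budget):
    """Chance that at least one of `budget` independent samples shows the behavior."""
    return 1.0 - (1.0 - p) ** budget

def accessible_support(dist, budget, threshold=0.5):
    """Behaviors likely (>= threshold) to appear at least once within the budget."""
    return {y for y, p in dist.items() if hit_probability(p, budget) >= threshold}

# Hypothetical per-sample probabilities for four behaviors.
dist = {"common": 0.9, "rare": 0.0999, "remote": 0.0001, "absent": 0.0}

small = accessible_support(dist, budget=16)
large = accessible_support(dist, budget=10_000)

print("budget 16:    ", sorted(small))
print("budget 10000: ", sorted(large))
```

Under these assumptions, a larger sampling budget brings "remote" within reach, but no budget reaches "absent". In the article's terms, making a behavior like "absent" available would take a change to the support itself, for example through search, tool use, or new information.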

Key Points

Paper introduces 'accessible support'—the set of behaviors a model can produce under finite compute budgets—as the key distinction.
Both SFT and RL are recast as reweighting a pretrained reference distribution, differing only in the signals used (demonstration vs. reward).
Capability creation requires expanding the support via search, tool use, or new information, not just reweighting within current reach.

Why It Matters

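The distinction gives AI labs a sharper way to choose finetuning strategies: if a target behavior already lies within the model's accessible support, cheaper elicitation may suffice, and compute for genuine capability expansion can be reserved for cases that need it.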
