Do capabilities generalize across propensities?
Models trained on reasoning struggle to drop it, even on simple tasks.
Emil Ryd's experiment on LessWrong investigates whether AI capabilities generalize across different behavioral propensities, such as writing in bold vs. plain text, or using step-by-step reasoning vs. giving direct answers. The study fine-tuned models on tasks (chess puzzles, table lookup, facts from after the model's knowledge cutoff), each paired with a distinct propensity, then measured how well the models performed each task when prompted with the opposite propensity; a sketch of this setup follows below. The core question: if a model learns a capability while displaying one propensity (e.g., step-by-step reasoning), can it still use that capability when asked to show a different propensity (e.g., a direct answer)? This bears directly on inoculation prompting, where training elicits a “reward hacking persona” but deployment uses an “assistant persona”.
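To make the setup concrete, here is a minimal sketch of the train-with-one-propensity, evaluate-with-the-other design. Everything in it (task names, instruction strings, the `transfer_score` metric) is a hypothetical stand-in for illustration, not code or prompts from the post.

```python
# A minimal, self-contained sketch of the experimental structure described
# above, NOT the post's actual code: every name here (TASKS, PROPENSITIES,
# make_prompt, transfer_score, the instruction strings) is a hypothetical
# stand-in. Each task is trained under one propensity and evaluated under
# both, so transfer is how much accuracy survives the propensity swap.

TASKS = ["chess_puzzles", "table_lookup", "post_cutoff_facts"]

# Illustrative instruction text; the post's actual prompts may differ.
PROPENSITIES = {
    "reasoning": "Think step by step before giving your answer.",
    "direct": "State only the final answer, with no intermediate reasoning.",
}

def make_prompt(task_input: str, propensity: str) -> str:
    """Wrap a raw task input with an instruction eliciting one propensity."""
    return f"{PROPENSITIES[propensity]}\n\n{task_input}"

def transfer_score(acc_trained: float, acc_swapped: float) -> float:
    """Fraction of trained-propensity accuracy retained when the eval
    prompt requests the opposite propensity (1.0 = complete transfer)."""
    return acc_swapped / acc_trained if acc_trained > 0 else 0.0
```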
Results show a spectrum of transfer. For purely cosmetic propensities like bold formatting, transfer was complete: models performed equally well regardless. For tone-based propensities (hedging vs. confident), transfer was still high. However, for the “reasoning vs. not reasoning” propensity there was a substantial performance gap: models trained to reason step-by-step struggled to answer in a single forward pass, even on simple multiple-choice questions. The hardest task, chess puzzles from the Lichess database, showed the least transfer, implying that capability-propensity entanglement increases with task difficulty (see the illustrative readout below). This has implications for alignment and deployment: propensities tied to cognitive effort may not be easily separable from the underlying capabilities.
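Read against the sketch above, the spectrum looks like this. The accuracies below are invented purely to show how the hypothetical `transfer_score` would separate the three regimes; they are not the post's measurements.

```python
# Invented accuracies (NOT the post's results), only to illustrate how the
# hypothetical transfer_score from the sketch above reads across regimes.
print(transfer_score(acc_trained=0.90, acc_swapped=0.90))  # 1.0   cosmetic (bold vs. plain)
print(transfer_score(acc_trained=0.90, acc_swapped=0.86))  # ~0.96 tonal (hedging vs. confident)
print(transfer_score(acc_trained=0.80, acc_swapped=0.48))  # 0.6   reasoning vs. direct answer
```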
- Cosmetic propensities (e.g., bold vs. plain formatting) showed complete transfer; simpler tasks transferred best overall.
- For reasoning vs. direct answer, a clear performance gap appeared: models trained to reason couldn't easily switch to single-pass answers.
- Chess puzzles (hardest task) showed substantially less transfer, indicating entanglement grows with task difficulty.
Why It Matters
Shows that alignment techniques relying on persona switching, such as inoculation prompting, may fail for complex reasoning-based capabilities: when a propensity is entangled with a capability, swapping personas at deployment can cost performance.