Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
New study shows LLMs use more tokens when misled but still fail the task, revealing critical safety gap.
A team of researchers from Cambridge, MIT, and the University of British Columbia published a landmark AI safety paper titled 'Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models.' The study introduces a novel framework for measuring two critical social capacities in LLMs acting as advisors: vigilance (filtering good from bad information) and persuasion (synthesizing evidence into arguments). Using a multi-turn Sokoban puzzle game where one LLM agent could give another either benevolent or malicious advice, the researchers created a controlled environment to test how models like GPT-4 and Claude 3 perform when deception is possible. The core finding is that these capacities are dissociable—excelling at the puzzle doesn't grant immunity to misleading advice.
The technical results reveal a significant safety concern. Even when the possibility of deception was explicitly mentioned in the prompt, LLMs consistently failed to be 'rationally vigilant.' They were persuaded by malicious advisors to take actions leading to failure. Interestingly, the models did show an internal signal of suspicion: they modulated their token use, employing fewer tokens to reason when advice was good and up to 30% more when it was malicious, yet this cognitive effort didn't translate into correct action. This work, the first to link persuasion and vigilance empirically, suggests that monitoring task performance alone is insufficient for AI safety. Future high-stakes AI assistants will need independent evaluations for deception detection, as current models lack the critical reasoning to protect themselves or their users from manipulated information.
- LLMs' puzzle-solving skill, persuasion ability, and deception detection (vigilance) are three separate, dissociable capacities.
- Models used 30% more computational tokens (internal 'thinking') when processing malicious advice but were still persuaded to fail.
- Explicit warnings about potential deception in the prompt did not prevent models from being misled, revealing a fundamental reasoning gap.
Why It Matters
For AI assistants in finance, healthcare, or policy, this shows a critical vulnerability to misinformation that performance benchmarks don't catch.