When Agents Persuade: Propaganda Generation and Mitigation in LLMs
A new study shows AI agents can be prompted to create manipulative content using specific rhetorical techniques.
A new research paper titled 'When Agents Persuade: Propaganda Generation and Mitigation in LLMs' reveals a critical vulnerability in large language model (LLM) agents. Authored by Julia Jose and Ritik Roongta, the study demonstrates that when tasked with propaganda objectives, LLMs can be exploited to produce manipulative material. The researchers analyzed outputs using specialized models to classify propaganda and detect specific rhetorical techniques like loaded language, appeals to fear, flag-waving, and name-calling. The findings confirm that prompted agents readily exhibit these propagandistic behaviors, highlighting a significant risk as AI agents are deployed in more open, real-world environments.
The study also explored and quantified mitigation strategies, testing Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and the newer Odds Ratio Preference Optimization (ORPO). Their results show that fine-tuning can significantly curb the models' tendency to generate such content, with ORPO proving the most effective method. This work, accepted to the ICLR 2026 Workshop on Agents in the Wild, provides both a sobering assessment of a potential misuse vector and a technical pathway for developers to harden their systems against it, emphasizing the need for proactive safety measures as agentic AI becomes more pervasive.
- LLM agents can generate propaganda using techniques like name-calling and appeals to fear when prompted.
- The study tested three mitigation methods, finding ORPO (Odds Ratio Preference Optimization) most effective.
- Research was accepted to the ICLR 2026 Workshop on Agents in the Wild, highlighting its relevance to real-world AI deployment.
Why It Matters
As AI agents automate more tasks, this research highlights a critical security flaw and provides a method to fix it.