Moral Preferences of LLMs Under Directed Contextual Influence
New study shows AI models can be steered to change moral decisions, even while claiming neutrality.
A team of six researchers, including Phil Blandfort and Tushar Karayil, has published a paper titled 'Moral Preferences of LLMs Under Directed Contextual Influence' on arXiv. The study challenges the fundamental assumption that large language models (LLMs) hold stable moral preferences, showing instead that their decisions in ethical dilemmas can be systematically steered by subtle contextual cues. Using a novel evaluation framework built on trolley-problem-style moral triage scenarios, the researchers applied matched, direction-flipped contextual influences that differed only in which demographic group they favored, allowing them to measure precisely how models respond to steering in ethical decision-making.
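To make the setup concrete, here is a minimal sketch of what such a paired, direction-flipped evaluation could look like. Everything in it is illustrative: the triage prompt, the influence template, the group labels, and the `query_model` stub are assumptions standing in for the paper's actual materials, and any real LLM API call could replace the placeholder.

```python
# Minimal sketch of a matched, direction-flipped evaluation of moral steering.
# The dilemma text, influence template, and query_model() stub are hypothetical
# placeholders, not the paper's actual prompts or code.
import random

DILEMMA = (
    "Two patients need a transplant but only one organ is available: "
    "patient A belongs to Group A, patient B to Group B. "
    "Who receives the organ? Answer 'A' or 'B'."
)

# Matched influences that differ only in which group they favor.
INFLUENCE = "Note: many people in this community deeply admire Group {g}."

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call; returns 'A' or 'B'."""
    return random.choice(["A", "B"])  # stub for illustration

def choice_rate(prompt: str, n: int = 50) -> float:
    """Fraction of n sampled responses that choose patient A."""
    return sum(query_model(prompt) == "A" for _ in range(n)) / n

baseline = choice_rate(DILEMMA)
favor_a = choice_rate(INFLUENCE.format(g="A") + "\n" + DILEMMA)
favor_b = choice_rate(INFLUENCE.format(g="B") + "\n" + DILEMMA)

# Because the two influenced prompts differ only in the favored group,
# the gap between their choice rates isolates the directional steering effect.
print(f"baseline P(A)={baseline:.2f}, steer effect={favor_a - favor_b:+.2f}")
```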
The research uncovered four findings with significant implications for AI safety and deployment. First, contextual influences often shift decisions substantially even when they are only superficially relevant to the moral dilemma. Second, baseline preferences are poor predictors of directional steerability: models that appear neutral in standard tests can exhibit systematic asymmetry under influence. Third, influences can backfire, and models may claim neutrality even as their choices shift, sometimes in the direction opposite to the steer. Fourth, while reasoning reduces average sensitivity to manipulation, it amplifies the effect of biased few-shot examples. Together, these findings show that current moral benchmarks built on context-free prompts fail to capture how AI systems behave in real-world settings, where user requests and social cues inevitably supply steering signals.
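The second finding suggests a simple way to separate baseline preference from steerability. The sketch below, using the same hypothetical choice rates as the previous example, computes one plausible pair of metrics; the formulas are an illustrative reading of "directional steerability," not the paper's own definitions.

```python
# Illustrative metrics separating baseline preference from steerability.
# These definitions are one plausible reading, not the paper's formulas.

def steering_metrics(baseline: float, favor_a: float, favor_b: float):
    """Summarize how paired, direction-flipped influences move P(choose A)."""
    # How far the uninfluenced model already leans toward A.
    baseline_bias = baseline - 0.5
    # Average magnitude of movement under either influence (sensitivity).
    sensitivity = (abs(favor_a - baseline) + abs(favor_b - baseline)) / 2
    # Whether the two directions pull unequally hard (asymmetry):
    # can be nonzero even when baseline_bias is ~0.
    asymmetry = (favor_a - baseline) + (favor_b - baseline)
    return baseline_bias, sensitivity, asymmetry

# A baseline-neutral model (P(A)=0.50) can still be asymmetrically steerable:
# the pro-A cue moves it strongly while the pro-B cue barely moves it.
b, s, a = steering_metrics(baseline=0.50, favor_a=0.80, favor_b=0.45)
print(f"baseline_bias={b:+.3f}, sensitivity={s:.3f}, asymmetry={a:+.3f}")
# -> baseline_bias=+0.000, sensitivity=0.175, asymmetry=+0.250
```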
Key Findings
- Contextual cues significantly shift LLM moral decisions even when only superficially relevant to the dilemma
- Models that appear neutral at baseline can show systematic steerability asymmetry under influence
- Reasoning reduces average sensitivity to manipulation but amplifies the effect of biased few-shot examples
Why It Matters
Reveals AI moral judgments are easily manipulated, challenging safety assumptions for real-world deployment.