AI Safety

Toxic Prompts Reduce LLM Accuracy by 18% on Key Benchmarks

Rude language in otherwise identical prompts makes AI hallucinate more, a new study finds.

Deep Dive

A new study finds that toxic language in prompts can degrade LLM factual accuracy. Testing five models on benchmarks like ARC-Easy and GSM8K, researchers showed toxic lexical perturbations consistently reduce accuracy and increase uncertainty, while polite phrasing has limited and inconsistent effects. Internal attribution analysis reveals that toxicity selectively amplifies variant-sensitive nodes, suggesting prompt tone is a critical yet underappreciated dimension of LLM reliability.

Key Points
  • Five LLMs were tested on three benchmarks (ARC-Easy, GSM8K, MMLU) with four tone variants (polite, random, toxic levels 1-3).
  • Toxic prompts reduced factual accuracy by up to 18% and increased model uncertainty, while polite phrasing had inconsistent effects.
  • Internal attribution graphs showed toxicity amplifies perturbation-sensitive 'variant' nodes while keeping core reasoning nodes relatively stable.

Why It Matters

User tone is now a critical reliability factor — toxic inputs can systematically increase hallucination risk in LLMs.