Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting
A new study finds prompting LLMs with pragmatic theory can improve their calibration to human social judgments.
A new research paper by Roland Mühlenbernd tackles a critical question in AI alignment: do large language models (LLMs) understand social meaning as humans do, not just in pattern but in precise degree? The study introduces two novel, calibration-focused metrics to answer it. The Effect Size Ratio (ESR) measures whether a model correctly ranks the *structure* of social inferences (e.g., which phrasing implies more knowledge). The Calibration Deviation Score (CDS) measures whether the model matches the exact *magnitude* of those human judgments, revealing whether an LLM overstates or understates social cues.
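The paper's exact formulas are not reproduced in this summary. The sketch below illustrates one plausible reading of the two metrics, assuming ESR is the ratio of the model's effect size (Cohen's d) to the human effect size and CDS is the mean absolute deviation between model and human ratings on a shared scale; all function names and the toy data are hypothetical.

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference between ratings in two conditions
    (e.g., precise vs. round number phrasings)."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return float((a.mean() - b.mean()) / pooled_sd)

def effect_size_ratio(model_a, model_b, human_a, human_b) -> float:
    """Assumed ESR: model effect size over human effect size.
    Same sign means the model ranks the conditions the way humans do;
    values above 1 would indicate an exaggerated inference."""
    return (cohens_d(np.asarray(model_a), np.asarray(model_b)) /
            cohens_d(np.asarray(human_a), np.asarray(human_b)))

def calibration_deviation_score(model_ratings, human_ratings) -> float:
    """Assumed CDS: mean absolute gap between model and human judgment
    magnitudes on a shared rating scale (0 = perfectly calibrated)."""
    m = np.asarray(model_ratings, dtype=float)
    h = np.asarray(human_ratings, dtype=float)
    return float(np.mean(np.abs(m - h)))

# Toy data: speaker-knowledge ratings (1-7 scale) for precise vs. round phrasings.
human_precise, human_round = np.array([6.0, 5.0, 6.0, 5.5]), np.array([4.0, 4.5, 4.0, 3.5])
model_precise, model_round = np.array([6.8, 6.5, 6.9, 6.6]), np.array([3.2, 3.0, 3.1, 3.4])

print(effect_size_ratio(model_precise, model_round, human_precise, human_round))
print(calibration_deviation_score(model_precise, human_precise))
```

On this toy data the model ranks precise above round just as humans do (structure preserved) but with a larger gap, yielding an ESR above 1, the exaggeration pattern the study reports.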
Applying these metrics in a case study on numerical precision (the classic pragmatic inference that a precise figure such as "203" signals more speaker knowledge than a round "200"), the study evaluated three top-tier LLMs and found nuanced results. All tested models reproduced the qualitative structure of human social reasoning, but they differed significantly in magnitude calibration, often exaggerating the strength of inferences. The paper then tested prompting strategies derived from pragmatic theory. Prompting an LLM to consider a speaker's knowledge state and motives most consistently improved its calibration toward human benchmarks, whereas prompting it to attend to linguistic alternatives often amplified its existing exaggerations.
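The paper's prompt wording is not quoted here; the fragments below are a hypothetical sketch of how the two pragmatic components and their combination might be phrased as system-prompt additions.

```python
# Hypothetical prompt fragments; the study's actual wording may differ.
SPEAKER_MODEL_PROMPT = (
    "Before judging, consider the speaker's epistemic state: what do they "
    "plausibly know, and what motive do they have for phrasing it this way?"
)
ALTERNATIVES_PROMPT = (
    "Before judging, consider the alternative phrasings the speaker could "
    "have used, and what choosing this one over those alternatives signals."
)
# Combined condition: both pragmatic components in one instruction.
COMBINED_PROMPT = SPEAKER_MODEL_PROMPT + " " + ALTERNATIVES_PROMPT
```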
The most effective intervention combined both pragmatic components, improving all calibration metrics across all models. Despite this progress, the study concludes that fine-grained magnitude calibration remains only partially solved: LLMs demonstrably capture the inferential *logic* of social meaning but variably distort its *strength*. The work provides a rigorous framework for measuring how AI models interpret the nuanced social signals embedded in human language, and a promising, theory-driven path for improving that interpretation.
- Introduces two new metrics (ESR and CDS) to separate the structure and strength of LLMs' social reasoning.
- Finds frontier LLMs capture qualitative social inference patterns but often exaggerate the magnitude by 20-40%.
- Shows prompting based on pragmatic theory (speaker knowledge/motives) is the most effective way to improve calibration.
Why It Matters
When AI is deployed in sensitive social contexts, ensuring it interprets subtle cues the way humans do is critical for trust and safety.