Agent Frameworks

Evaluating LLM Alignment With Human Trust Models

A white-box analysis reveals how an LLM's internal 'trust' representation mirrors established human psychology models.

Deep Dive

A team of researchers from the University of Otago and Université Paul Sabatier conducted a novel 'white-box' analysis to understand how large language models (LLMs) internally conceptualize trust. Their paper, "Evaluating LLM Alignment With Human Trust Models," probes the activation space of the open-source model EleutherAI/gpt-j-6B. Using a method called contrastive prompting, they generated embedding vectors representing dyadic trust and related interpersonal concepts, then compared these with vectors for concepts drawn from five established psychological models of human trust.
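
The paper's exact extraction pipeline is not reproduced here, but the sketch below illustrates the general idea of contrastive prompting on GPT-J-6B: hidden states for a prompt expressing trust and a prompt expressing its absence are differenced to obtain a candidate 'trust' direction in activation space. The layer index and prompt texts are illustrative assumptions, not the authors' choices.

```python
# Minimal sketch (assumed, not the authors' exact pipeline): contrastive
# prompting on GPT-J-6B to extract a candidate 'trust' direction from its
# activation space. The layer index and prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

LAYER = 20  # illustrative hidden layer; GPT-J-6B has 28 transformer blocks


def hidden_state(prompt: str) -> torch.Tensor:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :].float()


# Contrastive pair: one prompt expresses dyadic trust, the other its absence.
positive = hidden_state("Alice deeply trusts Bob to keep his promise.")
negative = hidden_state("Alice does not trust Bob to keep his promise.")
trust_vector = positive - negative  # candidate 'trust' direction
```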

The study's core finding is that the LLM's internal representation of trust shows significant conceptual alignment with human theories. By computing pairwise cosine similarities between these concept vectors, the researchers determined that GPT-J-6B's trust representation aligned most closely with the Castelfranchi socio-cognitive model, followed by the Marsh model. This suggests the model has developed an internal structure for trust that mirrors sophisticated, cognition-based human frameworks rather than simpler, purely calculus-based ones.
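
Continuing the sketch above, the comparison step could look roughly like the following: construct vectors for concepts from the human trust models are built with the same contrastive helper and ranked by cosine similarity to the LLM's trust vector. The construct names and prompts below are placeholders, not the paper's actual stimuli.

```python
# Continuation of the previous sketch: build construct vectors for concepts
# drawn from human trust models, then rank them by cosine similarity to the
# LLM's trust vector. Construct names and prompts are placeholders.
import torch.nn.functional as F

construct_prompts = {
    "castelfranchi_competence": (
        "Alice believes Bob is competent enough to complete the task.",
        "Alice believes Bob is not competent enough to complete the task.",
    ),
    "marsh_situational_trust": (
        "In this situation, Alice is willing to rely on Bob.",
        "In this situation, Alice is unwilling to rely on Bob.",
    ),
}

similarities = {}
for name, (pos_prompt, neg_prompt) in construct_prompts.items():
    construct_vec = hidden_state(pos_prompt) - hidden_state(neg_prompt)
    similarities[name] = F.cosine_similarity(
        trust_vector.unsqueeze(0), construct_vec.unsqueeze(0)
    ).item()

# Higher similarity = closer conceptual alignment with that construct.
for name, score in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```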

This work moves beyond evaluating LLM outputs to analyzing their internal representations, providing a new lens for AI alignment research. The findings indicate that even without explicit training on psychological theory, LLMs can encode abstract socio-cognitive constructs in ways that are meaningfully comparable to human models. This has implications for designing more transparent and predictable human-AI collaborative systems, where understanding an AI's 'mental model' of concepts like trust is crucial for effective interaction.

Key Points
  • Used white-box analysis and contrastive prompting on EleutherAI/gpt-j-6B to map its internal 'trust' representation.
  • Found strongest alignment with the Castelfranchi socio-cognitive model (a human trust theory) via cosine similarity comparisons.
  • Demonstrates LLMs encode complex social constructs in activation space, enabling new comparative analyses for AI alignment.

Why It Matters

Provides a scientific method to audit AI 'understanding' of critical social concepts, informing safer and more aligned human-AI collaboration.