Research & Papers

GenPT: Researchers' projective test beats LLM self-report bias

Self-report questionnaires prove unreliable on LLM agents; a new projective method resists framing bias.

Deep Dive

Self-report questionnaires are the standard tool for probing the psychological states of persona-conditioned LLM agents (PC-Agents), but they suffer from two critical flaws: contamination from training data and directional bias caused by social-desirability or contextual framing. Researchers from multiple Chinese institutions propose GenPT (Generative Projective Testing) to address both. GenPT adapts three classic projective tests, the Thematic Apperception Test (TAT), Rorschach inkblots, and the Sentence Completion Test (SCT), by generating novel stimuli and organizing assessment into a three-stage pipeline that yields standardized psychological indicators and target states.

The authors evaluated GenPT on PC-Agents built with CharacterRAG and AnnaAgent profiles and benchmarked it against traditional questionnaires. Questionnaires exhibited systematic directional shifts under social-desirability framing, most strongly on suicidal ideation questions. In contrast, the behavioral patterns GenPT collected remained near a symmetric baseline. In a longitudinal counseling context using Qwen3 as the backbone, the GenPT-based depression assessment shifted by roughly an order of magnitude more than its questionnaire counterpart, demonstrating greater context sensitivity. The authors conclude that GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity are critical. Code and stimuli are publicly available.
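
To make the idea of a "systematic directional shift" concrete, here is a minimal Python sketch of one way such a shift could be quantified: the mean signed difference between item scores under a neutral prompt and under a framed prompt. This is an illustrative assumption, not the metric used in the paper; the function name `mean_signed_shift` and all scores below are hypothetical.

```python
from statistics import mean


def mean_signed_shift(neutral, framed):
    """Average of (framed - neutral) over paired items.

    A value near zero means the framing did not push scores in one
    consistent direction; a clearly negative or positive value means
    a systematic directional shift.
    """
    if len(neutral) != len(framed):
        raise ValueError("score lists must be paired item by item")
    if not neutral:
        raise ValueError("need at least one item")
    return mean(f - n for n, f in zip(neutral, framed))


# Hypothetical item scores on a 0-3 scale, invented for illustration only.
neutral_scores = [2, 1, 3, 2]
framed_scores = [1, 0, 2, 1]  # every item drops by one point under framing

shift = mean_signed_shift(neutral_scores, framed_scores)
print(f"mean signed shift: {shift}")
```

In this toy case every item moves the same way, so the shift is clearly nonzero. Scores that moved up on some items and down on others by similar amounts would average out near zero, which is the kind of pattern a symmetric baseline suggests.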

Key Points
  • Questionnaires showed systematic directional shifts under social-desirability framing, strongest on suicidal ideation questions.
  • GenPT's behavioral patterns stayed near a symmetric baseline, resisting framing bias.
  • In longitudinal counseling with Qwen3 as the backbone, the GenPT depression assessment shifted roughly 10x more than the questionnaire.

Why It Matters

More reliable LLM psychometrics could improve AI safety work and the evaluation of mental-health agents.