Research & Papers

Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts

New research shows GPT-5 can generate 450 distinct hypotheses from a single citation, but its outputs shift dramatically with prompt wording.

Deep Dive

A new research paper titled 'Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts' explores whether large language models can support deep, interpretative academic analysis by focusing intensely on a single case rather than on broad classification. Researcher Arno Simons used a two-stage pipeline with OpenAI's GPT-5 to analyze footnote 6 from a 1975 sociology paper: first a surface classification of the citation, then a cross-document interpretative reconstruction. The study foregrounds a critical methodological issue, prompt sensitivity, by testing variations in scaffolding and framing through a balanced 2x3 experimental design (six prompt conditions in total).
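
To make the design concrete, here is a minimal Python sketch of how such a two-stage, prompt-variant pipeline could be wired up. It assumes the standard `openai` client; the variant labels, prompt texts, which factor carries two versus three levels, and the 15-runs-per-cell split are illustrative assumptions, not the paper's actual materials.

```python
# Hypothetical sketch of a two-stage, 2x3 prompt-variant pipeline (not the paper's code).
# Assumes the official `openai` Python client; prompts and labels are placeholders.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCAFFOLDS = ["none", "step_list"]                 # 2 scaffolding variants (assumed)
FRAMINGS = ["neutral", "historical", "critical"]  # 3 framing variants (assumed)

def ask(prompt: str) -> str:
    """One model call; model name follows the paper, settings left at defaults."""
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def classify(citation_context: str) -> str:
    """Stage 1: surface classification of the citation's function."""
    return ask("Classify the function of this citation (e.g., supplementary, "
               f"supportive, critical):\n\n{citation_context}")

def reconstruct(citation_context: str, scaffold: str, framing: str) -> str:
    """Stage 2: interpretative reconstruction under one prompt condition."""
    return ask(f"[scaffold={scaffold}] [framing={framing}] Reconstruct, as "
               f"hypotheses, what this citation is doing:\n\n{citation_context}")

def run(citation_context: str, runs_per_cell: int = 15) -> list[dict]:
    """Balanced 2x3 design: 6 cells x 15 runs = 90 reconstructions."""
    label = classify(citation_context)  # stable across runs in the study
    results = []
    for scaffold, framing in product(SCAFFOLDS, FRAMINGS):
        for _ in range(runs_per_cell):
            results.append({
                "label": label,
                "scaffold": scaffold,
                "framing": framing,
                "reconstruction": reconstruct(citation_context, scaffold, framing),
            })
    return results
```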

Across 90 reconstructions, GPT-5 generated a remarkable 450 distinct hypotheses about the citation's meaning, an average of five per reconstruction. Close analysis identified 21 recurring interpretative moves, and linear probability models showed how specific prompt choices systematically shifted their frequency and vocabulary (see the sketch below). While GPT-5's initial surface classification was highly stable (consistently labeling the citation as 'supplementary'), the reconstruction phase revealed significant fragility: scaffolding and examples redistributed the model's attention, sometimes toward strained readings. Compared to a human scholar's 1977 analysis of the same footnote, GPT-5 detected the same textual evidence but more often interpreted it as establishing academic lineage rather than as criticism. The paper outlines both the promise of using LLMs as inspectable, contestable co-analysts in humanities and social science research and the substantial risk that seemingly minor prompt variations can tilt the entire analytical output.
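
The linear-probability-model step is standard enough to sketch: regress a 0/1 indicator for each interpretative move on the prompt-variant factors, so coefficients read as percentage-point shifts in how often a move appears. A minimal version with statsmodels, assuming a tidy table of coded reconstructions (the file and column names are invented for illustration):

```python
# Hypothetical linear probability model over coded reconstructions (not the paper's code).
# Assumes a DataFrame with one row per (reconstruction, move) observation:
#   move_present in {0, 1}; scaffold and framing are categorical prompt factors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("coded_reconstructions.csv")  # placeholder file name

# OLS on a binary outcome = linear probability model; robust standard errors
# are usual practice here, since an LPM is heteroskedastic by construction.
model = smf.ols("move_present ~ C(scaffold) + C(framing)", data=df).fit(cov_type="HC1")
print(model.summary())
```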

Key Points
  • GPT-5 generated 450 distinct interpretative hypotheses from analyzing a single academic citation across 90 reconstructions.
  • Prompt variations in a 2x3 experimental design systematically shifted which of 21 identified interpretative moves the model foregrounded.
  • The model's surface classification was stable, but deep reconstruction outputs proved fragile and sensitive to scaffolding and framing.

Why It Matters

Highlights both the potential and the pitfalls of using advanced AI for scholarly analysis, a setting in which prompt engineering directly shapes research conclusions.