LLMbench: A Comparative Close Reading Workbench for Large Language Models
New browser-based workbench reveals the 'text that could have been otherwise' in LLM responses.
Researcher David M. Berry has introduced LLMbench, a novel browser-based workbench designed for the comparative, qualitative analysis of large language model outputs. Departing from purely quantitative evaluation tools like Google PAIR's LLM Comparator, LLMbench is built for the hermeneutic practices of digital humanities. It places two model responses side-by-side in annotatable panels and provides four core analytical overlays: Probabilities for token-level log-probability inspection, Differences for word-level comparison, Tone for metadiscourse analysis, and Structure for sentence-level parsing. This framework treats AI-generated text as a legitimate research object emerging from a probability distribution—'a text that could have been otherwise.'
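To make this concrete, the Python sketch below illustrates the kind of inspection the Probabilities and Differences overlays make visible: converting per-token log-probabilities into probabilities, and computing a word-level diff between two responses. The input format, function names, and values are hypothetical illustrations under stated assumptions, not LLMbench's actual implementation.

```python
# A minimal sketch of per-token probability inspection and word-level comparison.
# The (token, logprob) pair format is a hypothetical stand-in for whatever
# log-probability data a model API exposes; it is not LLMbench's data model.
import math
import difflib

def inspect_logprobs(tokens_with_logprobs):
    """Turn per-token log-probabilities into probabilities for inspection."""
    for token, logprob in tokens_with_logprobs:
        prob = math.exp(logprob)
        print(f"{token!r:>12}  logprob={logprob:7.3f}  p={prob:.3f}")

def word_level_diff(response_a, response_b):
    """Word-level comparison of two model responses, as in a Differences overlay."""
    a_words, b_words = response_a.split(), response_b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            print(f"{op:>8}: {' '.join(a_words[i1:i2])!r} -> {' '.join(b_words[j1:j2])!r}")

# Example with made-up values:
inspect_logprobs([("The", -0.12), ("model", -1.45), ("hesitates", -3.80)])
word_level_diff("The model answers with confidence.", "The model hesitates before answering.")
```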
The tool's power lies in its five analytical modes: Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence. Together they make the underlying probabilistic mechanics of text generation visually legible. Researchers can explore the 'counterfactual history' of each generated word through visualizations such as continuous heatmaps, entropy sparklines, and three-dimensional probability terrains. By surfacing this granular log-probability data, which Berry argues is currently underused in humanistic AI studies, LLMbench gives scholars a resource for understanding not just what an LLM says, but how and why it says it. In doing so, it bridges the gap between technical AI evaluation and qualitative, critical inquiry.
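As a rough indication of what an entropy sparkline and a token's 'counterfactual history' involve, the sketch below computes Shannon entropy over the top-k alternatives reported at each position and lists the words that could have been chosen instead. The nested-dict input format is a hypothetical assumption, and renormalising over only the returned alternatives is an approximation rather than the tool's documented method.

```python
# A minimal sketch of an entropy sparkline over token positions, assuming a
# hypothetical format in which each position maps candidate tokens to logprobs.
import math

BARS = "▁▂▃▄▅▆▇█"  # one glyph per uncertainty level, low to high

def position_entropy(alternatives):
    """Shannon entropy (bits) over top-k alternatives, renormalised to sum to 1."""
    probs = [math.exp(lp) for lp in alternatives.values()]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_sparkline(per_position_alternatives, max_bits=4.0):
    """One sparkline character per token position: taller bar = more uncertainty."""
    chars = []
    for alts in per_position_alternatives:
        h = min(position_entropy(alts), max_bits)
        chars.append(BARS[round(h / max_bits * (len(BARS) - 1))])
    return "".join(chars)

# Made-up top-k log-probabilities for three generated tokens:
positions = [
    {"The": -0.1, "A": -2.5},                       # near-certain choice
    {"model": -0.9, "system": -1.2, "text": -1.6},  # several live alternatives
    {"hesitates": -0.4, "responds": -1.8},
]
print(entropy_sparkline(positions))
for alts in positions:
    chosen = max(alts, key=alts.get)
    print(f"chosen {chosen!r}; could have been: {[t for t in alts if t != chosen]}")
```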
- Provides four analytical overlays (Probabilities, Differences, Tone, Structure) for side-by-side LLM output comparison.
- Features five modes visualizing token-level probabilities with heatmaps, sparklines, and 3D terrains.
- Designed for digital humanities' qualitative, close-reading practices, unlike quantitative benchmarking tools.
Why It Matters
Enables critical, qualitative study of AI text generation for researchers, moving beyond simple performance metrics.