Getting Claude to rank the inkhaven bloggers
Anthropic's Claude spent $40 in API credits to rank 15+ posts from the rationalist writing residency.
In a follow-up to Alexander Wales' viral experiment, Sean Herrington deployed Anthropic's Claude Opus 4.6 to systematically rank blog posts from the Inkhaven writing residency. Instead of simple pairwise comparisons, Herrington implemented a Bradley-Terry model—a statistical method Claude itself suggested, analogous to the Elo rating system in chess. The AI judged 8 posts at a time across five iterative rounds, burning through $40 in API credits to refine its rankings based on a detailed prompt asking it to predict the taste of a rationalist audience.
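The Bradley-Terry model assigns each item a latent strength and models the probability that item i beats item j as a function of their strength gap, so a global ranking can be fit from batches of pairwise (or groupwise) judgments. As a rough illustration of the idea — not Herrington's actual code, and using made-up judgment counts — here is a minimal fit via the classic minorization-maximization update:

```python
import math

def bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] = number of times item i was judged better than item j.
    Returns log-strengths centered to mean zero (comparable to the
    standard-deviation-style scores in the leaderboard).
    """
    p = [1.0] * n_items  # initial strengths
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            # MM update: total wins of i, divided by a weighted count
            # of comparisons involving i.
            num = sum(wins[i][j] for j in range(n_items) if j != i)
            den = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_items) if j != i
            )
            new_p.append(num / den if den else p[i])
        total = sum(new_p)
        p = [x * n_items / total for x in new_p]  # rescale for stability
    logs = [math.log(x) for x in p]
    mean = sum(logs) / n_items
    return [l - mean for l in logs]

# Hypothetical judgments: post 0 beat post 1 three times, post 2 twice, etc.
wins = [
    [0, 3, 2],
    [1, 0, 2],
    [0, 1, 0],
]
scores = bradley_terry(wins, 3)
print(scores)  # post 0, with the most wins, ranks highest
```

In a setup like Herrington's, each round of 8-post judgments would be decomposed into pairwise outcomes and fed into a fit like this; dividing the centered log-strengths by their standard deviation would then yield σ-style scores of the kind shown on the leaderboard.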
Claude evaluated posts on criteria including insight, craft, honest thinking, distinctive voice, and humor, with explicit instructions not to be 'generous or encouraging.' The final leaderboard of over 15 posts was topped by Natalie Cargill's 'How to invent a disease' with a score of +2.77 standard deviations. The list showcases the eclectic mix of Inkhaven, with runners-up ranging from 'More Legal Systems Very Different From Ours' to 'The phenomenology of being hungry while pregnant.'
The experiment demonstrates a move beyond simple chat interactions, using Claude Opus 4.6 as an analytical engine for a complex, multi-step evaluation task. It also inadvertently tested the model's consistency: when Herrington tried to 'push' his own post from 2nd to 1st place, it instead dropped to 10th. The project highlights how advanced LLMs can be prompted to perform sophisticated comparative judgment at scale, offering a novel lens on content quality within niche communities.
- Used Claude Opus 4.6 with a Bradley-Terry model to rank 15+ Inkhaven posts across 5 iterative rounds
- Top post was 'How to invent a disease' by Natalie Cargill scoring +2.77σ, beating entries on legal systems and AI myths
- Experiment cost $40 in Anthropic API credits and tested the model's consistency against prompt engineering attempts
Why It Matters
Shows how professionals can use LLMs like Claude for complex, multi-step analytical judgment beyond simple Q&A, at a predictable cost.