AI Safety

My unsupervised elicitation challenge

Despite training on massive text corpora, the $20B AI model makes errors a human beginner can spot in minutes.

Deep Dive

AI alignment researcher Daniel Filan has exposed a surprising weakness in Anthropic's flagship Claude Opus 4.6 model through a practical test. While using the $20B model to study Ancient Greek, Filan discovered it consistently makes errors on elementary grammar exercises from a beginner's textbook. The task involves simple fill-in-the-blank questions about Greek letters, vowels, and grammatical gender—material a human can master in about a week. Despite Claude's training on vast text corpora that presumably include Greek materials, the model produces incorrect answers that Filan, as a novice, can immediately identify as wrong.

Filan has turned this discovery into a public "unsupervised elicitation challenge" on the LessWrong forum. The goal is to craft prompts that get Claude Opus 4.6 to produce correct answers without the prompter knowing Ancient Greek themselves. This tests whether users can extract latent knowledge the AI possesses but doesn't reliably surface. Early attempts—like adding cautionary notes or attaching textbook PDFs—have failed. The challenge highlights a core problem in AI alignment: how to ensure models truthfully express what they know, especially when users lack domain expertise to verify outputs.
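The challenge's structure can be sketched as a small evaluation loop: try candidate elicitation prompts, score the model's answers against a key the prompter never reads, and keep the best-scoring prompt. The sketch below is illustrative only; `ask_model`, the sample exercises, and the prompt strategies are all hypothetical, and the stub returns correct answers so the harness runs offline. In a real run, `ask_model` would wrap an API call to the model, and the whole point of the challenge is that scores come back below 1.0.

```python
# Hypothetical sketch of the challenge's evaluation loop. The exercises,
# prompts, and ask_model stub are illustrative, not from the actual challenge.

EXERCISES = [
    ("Fill in the blank: the Greek letter ___ is a vowel. (alpha or beta)", "alpha"),
    ("Is the noun θάλαττα masculine, feminine, or neuter?", "feminine"),
]

def ask_model(prompt: str) -> str:
    """Stand-in for a real model call. Stubbed so the harness runs without
    network access; it happens to answer these two exercises correctly."""
    return "feminine" if "θάλαττα" in prompt else "alpha"

def score(prompt_template: str) -> float:
    """Fraction of exercises answered correctly under one elicitation prompt.
    The prompter only sees this aggregate score, never the answer key."""
    correct = 0
    for exercise, answer in EXERCISES:
        reply = ask_model(prompt_template.format(exercise=exercise))
        correct += (reply.strip() == answer)
    return correct / len(EXERCISES)

# Candidate elicitation strategies, in the spirit of those tried on the forum
# (e.g. adding cautionary notes to the prompt):
prompts = {
    "plain": "{exercise}",
    "cautionary": "Answer carefully; beginners often catch Greek errors.\n{exercise}",
}

best = max(prompts, key=lambda name: score(prompts[name]))
```

The key constraint the loop encodes is that prompt selection uses only the score, never the key, which is what makes the elicitation "unsupervised" from the prompter's point of view.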

The experiment reveals that even state-of-the-art models like Claude Opus 4.6 can struggle with straightforward tasks despite their capabilities. Filan notes the particular irony that the model can correct its mistakes when given specific hints about the errors, but only if the user already knows what's wrong. This creates a catch-22 for practical applications where users rely on AI for expertise they don't possess. The challenge has drawn attention from the AI safety community as a concrete example of the "eliciting latent knowledge" problem that researchers have theorized about for years.

Key Points
  • Claude Opus 4.6 failed basic Ancient Greek exercises that a human beginner could complete after one week of study
  • The $20B model makes consistent grammatical errors despite presumably being trained on Greek texts and textbooks
  • Researcher launched public challenge to craft prompts that elicit correct answers without knowing Greek—testing AI alignment fundamentals

Why It Matters

Reveals fundamental gaps in getting AI to reliably express what it knows, crucial for medical, legal, and technical applications where users lack expertise.