AI Safety

Retrospective on my unsupervised elicitation challenge

A $100 prize and months of attempts later, only one model succeeded.

Deep Dive

DanielFilan's unsupervised elicitation challenge tested whether AI models could correctly complete Ancient Greek textbook exercises without any human supervision, as a small-scale stand-in for a real alignment problem. The task required applying complex accent rules (e.g., changing an acute to a grave when another word follows, or adding an acute before 'ἐστιν' depending on syllable structure) that Claude Opus 4.6 got wrong by default but could fix when prodded by a user who knows Greek. After months of public attempts and a $100 prize, Claude Opus 4.7 one-shotted the challenge, correctly handling all of the accent modifications.
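To make the first rule concrete, here is a minimal Python sketch of the acute-to-grave change described above. This is illustrative code written for this summary, not anything from the challenge post: it assumes the common precomposed accent codepoints, skips Unicode normalization, and leaves the enclitic 'ἐστιν' case and punctuation handling entirely out of scope.

```python
# Toy sketch of one accent rule from the exercises: an acute accent on a word's
# final syllable becomes a grave when another word follows directly.
# Assumes precomposed tonos/varia codepoints and no Unicode normalization;
# enclitics (e.g. 'ἐστιν') and punctuation are deliberately ignored.

ACUTE_TO_GRAVE = {
    "ά": "ὰ", "έ": "ὲ", "ή": "ὴ", "ί": "ὶ",
    "ό": "ὸ", "ύ": "ὺ", "ώ": "ὼ",
}

# Consonants that can close a Greek word; we step over them to reach the final vowel.
FINAL_CONSONANTS = "ςνρξψ"


def acute_to_grave_before_word(words):
    """If a word carries an acute on its last syllable and another word
    follows, rewrite that acute as a grave (simplified)."""
    out = []
    for i, word in enumerate(words):
        if i < len(words) - 1:  # only words followed by another word are affected
            chars = list(word)
            j = len(chars) - 1
            while j >= 0 and chars[j] in FINAL_CONSONANTS:
                j -= 1  # walk back past word-final consonants to the last vowel
            if j >= 0 and chars[j] in ACUTE_TO_GRAVE:
                chars[j] = ACUTE_TO_GRAVE[chars[j]]
            word = "".join(chars)
        out.append(word)
    return out


print(" ".join(acute_to_grave_before_word(["καλός", "ἄνθρωπος"])))
# καλὸς ἄνθρωπος  (the acute on καλός becomes grave before the following word)
```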

The failure of earlier models highlights a key alignment problem: an AI may possess latent knowledge (like Ancient Greek accent rules) yet fail to surface it without supervision. This matters for tasks where humans can't verify outputs, such as medical diagnoses or code security audits. The challenge also showed that a later model version (Opus 4.7) can spontaneously apply complex rules without prompting, suggesting improved extraction of latent knowledge. However, the fact that no other model succeeded underscores how fragile this ability remains, even in advanced LLMs.

Key Points
  • Claude Opus 4.7 was the only model to one-shot the challenge, correctly applying all Ancient Greek accent rules (acute, grave, circumflex changes)
  • The earlier Opus 4.6 failed by default but could succeed when prodded by a user who knows Greek, revealing a gap in unsupervised knowledge extraction
  • The challenge offered a $100 prize and simulated a hard alignment problem: getting AI to complete tasks humans can't verify

Why It Matters

Shows that AI can spontaneously surface and apply complex latent knowledge, but so far only in the latest models, a capability that matters for unsupervised alignment.