AI Safety

OpenAI's AI self-improvement red line criticized as too permissive and unmeasurable

Critics warn the threshold may allow years of progress before triggering any halt.

Deep Dive

OpenAI's Preparedness Framework v2 defines the Critical threshold for AI self-improvement as either of two conditions: a superhuman research-scientist agent (the leading indicator), or a generational model improvement achieved in one-fifth the wall-clock time of equivalent 2024 progress (e.g., going from o1 to o3 in 4 weeks), sustained for several months (the lagging indicator). Critics on LessWrong argue this threshold fires far too late: by their estimate, the lagging indicator could correspond to roughly three years of accumulated progress before it triggers, especially if acceleration continues past 5x. Anthropic, by contrast, uses a 2x threshold.
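The arithmetic behind the "fires too late" objection can be sketched in a few lines. This is an illustration only, not the critics' own calculation: it assumes progress accrues at a constant 5x the 2024 pace, and the durations tried for "several months" are hypothetical, since the framework gives no exact figure.

```python
def equivalent_years(speedup: float, wall_clock_months: float) -> float:
    """Progress accumulated, in years at the 2024 pace, after running
    at `speedup` times that pace for `wall_clock_months` months."""
    return speedup * wall_clock_months / 12.0


if __name__ == "__main__":
    # Hypothetical readings of "several months"; none comes from the framework.
    for months in (4, 6, 8):
        years = equivalent_years(5.0, months)
        print(f"{months} months at 5x = {years:.2f} years of 2024-pace progress")
```

Under these assumptions, six months at 5x already amounts to 2.5 years of 2024-pace progress, and any acceleration above 5x pushes the total higher.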

The framework also contains an escape hatch (Section 4.3) that lets OpenAI lower its safeguards if a competitor releases a comparable model without comparable safeguards, subject to conditions such as public acknowledgment and an internal risk assessment. In addition, neither indicator has a measurable definition: 'generational improvement' has no operational metric, and 'superhuman research-scientist agent' is not tied to any benchmark. That makes it hard to show, or to dispute, that the threshold has been crossed. Proposed fixes include independent evaluation bodies (under current practice, the Self-Improvement category has no external evaluator) and pre-committed concrete thresholds, such as halting when the doubling rate of METR's p50 time horizon reaches a specified acceleration.
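A pre-committed threshold of the kind proposed could look like the following minimal sketch. Everything in it is an assumption made for illustration: the function names, the baseline doubling time, the 2x trigger, and the sample measurements are all hypothetical, and none of it reflects METR's methodology or any lab's actual policy.

```python
import math


def doubling_time_months(t0: float, h0: float, t1: float, h1: float) -> float:
    """Months per doubling of the time horizon between two measurements.

    t0, t1 are measurement dates in months; h0, h1 are time horizons
    in any consistent unit. Returns infinity if the horizon did not grow.
    """
    if h1 <= h0:
        return math.inf
    return (t1 - t0) / math.log2(h1 / h0)


def halt_triggered(measurements, baseline_doubling_months, max_speedup):
    """True if the two most recent measurements imply that doubling is at
    least `max_speedup` times faster than the pre-committed baseline."""
    (t0, h0), (t1, h1) = measurements[-2], measurements[-1]
    current = doubling_time_months(t0, h0, t1, h1)
    return baseline_doubling_months / current >= max_speedup


if __name__ == "__main__":
    # Hypothetical data: (month, time horizon in minutes).
    history = [(0, 60), (6, 120), (8, 480)]
    print(halt_triggered(history[:2], baseline_doubling_months=6, max_speedup=2))
    print(halt_triggered(history, baseline_doubling_months=6, max_speedup=2))
```

The point of the design is that the baseline and the trigger are fixed in advance, so anyone with access to the measurements can check whether the threshold has been crossed.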

Key Points
  • The threshold requires 5x generational acceleration sustained for several months, which critics estimate could amount to roughly three years of accumulated progress before the trigger fires.
  • An escape hatch (Section 4.3) permits lowering safeguards if a competitor releases a comparable model without comparable safeguards.
  • Key terms such as 'generational improvement' and 'superhuman research-scientist agent' lack operational definitions, so there is no agreed way to measure whether the threshold has been crossed.

Why It Matters

Without clear, measurable, and independently evaluated thresholds, rapid AI self-improvement could outpace the safeguards meant to contain it.
