OpenAI's red line for AI self-improvement is fundamentally flawed
Critics warn the threshold may allow years of progress before triggering any halt.
OpenAI's Preparedness Framework v2 defines the Critical threshold for AI self-improvement via two indicators: a leading indicator, the existence of a superhuman research-scientist agent, and a lagging indicator, a model that causes a generational improvement in one-fifth the wall-clock time of equivalent 2024 progress (e.g., going from o1 to o3 in 4 weeks), sustained for several months. Critics on LessWrong argue this threshold fires far too late: by the time the lagging indicator is met, roughly three years of 2024-equivalent progress could already have accumulated, and more still if acceleration continues past 5x. Anthropic's corresponding threshold, by contrast, triggers at 2x.
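To make the scale of that lag concrete, here is a minimal back-of-the-envelope sketch in Python. The window lengths are illustrative assumptions about what "several months" might mean, not figures from the framework; the point is only that a 5x pace sustained over such a window already banks years of 2024-equivalent progress before the criterion is satisfied.

```python
# Back-of-the-envelope: how much 2024-equivalent progress accumulates
# before the "sustained for several months" criterion is met.
# The values below are illustrative assumptions, not OpenAI's numbers.

acceleration = 5.0          # generational improvement at 5x the 2024 pace
sustain_months = [4, 6, 8]  # plausible readings of "several months"

for months in sustain_months:
    # At 5x pace, each calendar month delivers ~5 months of 2024-equivalent progress.
    equivalent_progress_months = acceleration * months
    print(f"{months} calendar months at {acceleration:.0f}x "
          f"~= {equivalent_progress_months / 12:.1f} years of 2024-equivalent progress")

# Illustrative output:
#   4 calendar months at 5x ~= 1.7 years of 2024-equivalent progress
#   6 calendar months at 5x ~= 2.5 years of 2024-equivalent progress
#   8 calendar months at 5x ~= 3.3 years of 2024-equivalent progress
```

This also ignores whatever progress accrues on the way up to 5x, which is part of why critics treat the roughly-three-years figure as conservative if acceleration keeps climbing.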
The framework also contains an escape hatch (Section 4.3) that lets OpenAI lower its safeguards if a competitor releases a comparable model without comparable safeguards, subject to conditions such as public acknowledgment and an internal risk assessment. Both indicators also lack measurable definitions: 'generational improvement' has no operational metric, and 'superhuman research-scientist agent' comes with no benchmark, which makes the threshold effectively unfalsifiable. Proposed fixes include independent evaluation bodies (the Self-Improvement category currently has no external evaluator) and pre-committed concrete thresholds, such as halting when the doubling rate of METR's p50 time horizon reaches a specified acceleration.
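As one way such a pre-committed threshold could be operationalized, the sketch below fits an exponential trend to hypothetical measurements of METR's p50 time-horizon metric and flags a halt when the estimated doubling time drops below a pre-registered value. The data points, the 3-month red line, and the fitting choice are all assumptions for illustration, not anything OpenAI or METR has specified.

```python
# Sketch: a falsifiable, pre-committed trigger based on how fast the
# p50 time-horizon metric is doubling. All numbers are illustrative.
import numpy as np

# Hypothetical observations: (months since baseline, p50 task horizon in minutes).
observations = [
    (0.0, 60.0),
    (3.0, 85.0),
    (6.0, 120.0),
    (9.0, 170.0),
]

# Pre-registered red line: halt if the horizon doubles faster than every N months.
DOUBLING_TIME_RED_LINE_MONTHS = 3.0

months = np.array([t for t, _ in observations])
log2_horizon = np.log2(np.array([h for _, h in observations]))

# Linear fit in log2 space: the slope is doublings per month.
slope, _intercept = np.polyfit(months, log2_horizon, 1)
doubling_time = 1.0 / slope

print(f"Estimated doubling time: {doubling_time:.1f} months")
if doubling_time < DOUBLING_TIME_RED_LINE_MONTHS:
    print("Red line crossed: pre-committed halt and external review.")
else:
    print("Below red line: continue with standard monitoring.")
```

A published rule of this shape could be checked by an outside evaluator against the same data, which is exactly the property critics say the current indicators lack.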
- Threshold requires 5x generational acceleration sustained for months, by which point roughly three years of equivalent progress may already have accumulated.
- Escape hatch permits lowering safeguards if competitor releases comparable model without comparable safeguards.
- Key terms like 'generational improvement' and 'superhuman research-scientist agent' lack operational definitions, making measurement impossible.
Why It Matters
Without clear, measurable, and independent safeguards, rapid AI self-improvement could outpace safety measures.