OpenAI's red line for AI self-improvement is fundamentally flawed
Critics warn the threshold may allow years of progress before triggering any halt.
OpenAI's Preparedness Framework v2 defines the Critical threshold for AI self-improvement via two indicators: a leading indicator, the existence of a superhuman research-scientist agent, and a lagging indicator, a model that causes a generational improvement in one-fifth the wall-clock time of equivalent 2024 progress (e.g., going from o1 to o3 in 4 weeks), sustained for several months. Critics on LessWrong argue this threshold fires far too late: by the time the lagging indicator is met, roughly three years of 2024-equivalent progress could already have accumulated, and more still if acceleration continues past 5x. Anthropic's corresponding threshold, by contrast, triggers at 2x.
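To make the scale of that lag concrete, here is a minimal back-of-the-envelope sketch in Python. The window lengths are illustrative assumptions about what "several months" might mean, not figures from the framework; the point is only that a 5x pace sustained over such a window already banks years of 2024-equivalent progress before the criterion is satisfied.

```python
# Back-of-the-envelope: how much 2024-equivalent progress accumulates
# before the "sustained for several months" criterion is met.
# The values below are illustrative assumptions, not OpenAI's numbers.

acceleration = 5.0          # generational improvement at 5x the 2024 pace
sustain_months = [4, 6, 8]  # plausible readings of "several months"

for months in sustain_months:
    # At 5x pace, each calendar month delivers ~5 months of 2024-equivalent progress.
    equivalent_progress_months = acceleration * months
    print(f"{months} calendar months at {acceleration:.0f}x "
          f"~= {equivalent_progress_months / 12:.1f} years of 2024-equivalent progress")

# Illustrative output:
#   4 calendar months at 5x ~= 1.7 years of 2024-equivalent progress
#   6 calendar months at 5x ~= 2.5 years of 2024-equivalent progress
#   8 calendar months at 5x ~= 3.3 years of 2024-equivalent progress
```

This also ignores whatever progress accrues on the way up to 5x, which is part of why critics treat the roughly-three-years figure as conservative if acceleration keeps climbing.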
The framework also contains an escape hatch (Section 4.3) that lets OpenAI lower its safeguards if a competitor releases a comparable model without comparable safeguards, subject to conditions such as public acknowledgment and an internal risk assessment. Both indicators also lack measurable definitions: 'generational improvement' has no operational metric, and 'superhuman research-scientist agent' comes with no benchmark, which makes the threshold effectively unfalsifiable. Proposed fixes include independent evaluation bodies (the Self-Improvement category currently has no external evaluator) and pre-committed concrete thresholds, such as halting when the doubling rate of METR's p50 time horizon reaches a specified acceleration.
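As one way such a pre-committed threshold could be operationalized, the sketch below fits an exponential trend to hypothetical measurements of METR's p50 time-horizon metric and flags a halt when the estimated doubling time drops below a pre-registered value. The data points, the 3-month red line, and the fitting choice are all assumptions for illustration, not anything OpenAI or METR has specified.

```python
# Sketch: a falsifiable, pre-committed trigger based on how fast the
# p50 time-horizon metric is doubling. All numbers are illustrative.
import numpy as np

# Hypothetical observations: (months since baseline, p50 task horizon in minutes).
observations = [
    (0.0, 60.0),
    (3.0, 85.0),
    (6.0, 120.0),
    (9.0, 170.0),
]

# Pre-registered red line: halt if the horizon doubles faster than every N months.
DOUBLING_TIME_RED_LINE_MONTHS = 3.0

months = np.array([t for t, _ in observations])
log2_horizon = np.log2(np.array([h for _, h in observations]))

# Linear fit in log2 space: the slope is doublings per month.
slope, _intercept = np.polyfit(months, log2_horizon, 1)
doubling_time = 1.0 / slope

print(f"Estimated doubling time: {doubling_time:.1f} months")
if doubling_time < DOUBLING_TIME_RED_LINE_MONTHS:
    print("Red line crossed: pre-committed halt and external review.")
else:
    print("Below red line: continue with standard monitoring.")
```

A published rule of this shape could be checked by an outside evaluator against the same data, which is exactly the property critics say the current indicators lack.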
- Threshold requires 5x generational acceleration sustained for months, by which point roughly three years of equivalent progress may already have accumulated.
- Escape hatch permits lowering safeguards if competitor releases comparable model without comparable safeguards.
- Key terms like 'generational improvement' and 'superhuman research-scientist agent' lack operational definitions, making measurement impossible.
Why It Matters
Without clear, measurable, and independent safeguards, rapid AI self-improvement could outpace safety measures.