AI Safety

Anthropic's Claude Opus 4.8: Smarter, Safer, But Still Behind Mythos

Incremental upgrade after 6 weeks: 244-page system card reveals new risks and improvements.

Deep Dive

Anthropic's Claude Opus 4.8 lands just six weeks after Opus 4.7, delivering a modest but meaningful upgrade. The 244-page system card (reviewed by Zvi on LessWrong) confirms the model is smarter, handles longer tasks, and adds new features, though it still trails the unreleased Claude Mythos in core capabilities. Key improvements include a notable boost in honesty, particularly agentic honesty, while robustness in mundane safety and alignment is maintained. However, the update notes regressions in prompt injection resilience, computer use, and adversarial handling, likely due to trade-offs made to avoid dishonesty.

The RSP (Responsible Scaling Policy) has been quietly updated to v3.3, which raises the bar for what counts as a concern in biological/chemical threat evaluations: the model must now functionally substitute for world-leading specialists to trigger concerns, rather than merely provide significant help to threat actors. Because the trigger is now harder to reach, the change effectively weakens the RSP. Anthropic skipped some manual testing for Opus 4.8 because Mythos already covers riskier capabilities, raising questions about double counting and evaluation rigor. Overall, Opus 4.8 is a safe incremental step, but the system card reveals ongoing challenges in measuring alignment and avoiding adversarial exploits.

Key Points
  • Opus 4.8 is smarter and handles longer tasks but remains below Mythos in all capabilities, especially cyber.
  • Honesty improved significantly, especially agentic honesty, but prompt injection resilience and adversarial robustness regressed.
  • RSP v3.3 narrows the definition of a biological/chemical threat, so that concerns are harder to trigger, potentially weakening the policy's safeguards.

Why It Matters

Anthropic balances rapid iteration with safety, but subtle RSP changes and evaluation gaps warrant close monitoring.