What formal protocols should exist when a model under evaluation is used in the evaluation pipeline?
Experts are calling for formal evaluation protocols after a frontier model was apparently used to help evaluate its own performance.
A significant debate has erupted in the AI safety community following the release of Anthropic's Claude Opus 4.6 System Card. Critics, including researchers Yaniv Golan and Zvi Mowshowitz, have pointed to a methodological flaw: Claude Opus 4.6 itself appears to have been part of the pipeline evaluating its own capabilities. This practice, known as circular evaluation or 'grading your own homework,' creates a conflict of interest and risks producing unreliable, overly optimistic metrics that do not reflect real-world performance or safety.
The core issue, as highlighted in a viral LessWrong post by Kevin O'Shaughnessy, is the lack of formal, industry-wide protocols governing how AI models are benchmarked. When a model can influence its own evaluation, whether by generating test data, scoring its own outputs, or simulating test environments, the integrity of the entire assessment is undermined. The controversy has prompted calls from figures such as Peter Wildeford for strict, transparent evaluation standards that guarantee independent verification, analogous to audit practices in other technical fields.
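To make the concern concrete, here is a minimal Python sketch of an 'LLM-as-judge' loop in which the judge happens to be the model under test. Every name here (Model, evaluate, the model identifier) is a hypothetical illustration, not Anthropic's actual evaluation code; the point is only that nothing in a naive pipeline prevents the circularity.

```python
# Hypothetical sketch of circular evaluation: the judge is the same
# model being evaluated. Names and behavior are illustrative only.
from dataclasses import dataclass

@dataclass
class Model:
    name: str

    def generate(self, prompt: str) -> str:
        # Stand-in for a real model API call.
        return f"[{self.name} answer to: {prompt}]"

    def judge(self, prompt: str, answer: str) -> float:
        # Stand-in for a grading call; a real judge would parse a score
        # from the model's output. Returning 1.0 exaggerates the worry:
        # a self-judging model provides no independent check.
        return 1.0

def evaluate(candidate: Model, judge: Model, prompts: list[str]) -> float:
    if candidate.name == judge.name:
        # The gap critics identified: a naive pipeline never forbids this.
        print("WARNING: circular evaluation -- judge is the model under test")
    scores = [judge.judge(p, candidate.generate(p)) for p in prompts]
    return sum(scores) / len(scores)

opus = Model("claude-opus-4.6")  # hypothetical identifier
print(evaluate(candidate=opus, judge=opus, prompts=["Summarize X", "Prove Y"]))
```

Because the judge call draws on the same weights as the candidate, any systematic bias the model holds toward its own style of answer flows directly into the reported score.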
The discussion is moving beyond criticism toward concrete solutions. Proposed protocols include mandatory disclosure of any model involvement in its own evaluation pipeline, third-party audits for high-stakes model releases, and 'closed-box' testing environments in which the model cannot access or influence test generation. The outcome of this debate will shape how companies such as Anthropic, OpenAI, and Google DeepMind demonstrate their models' capabilities and safety to the public and to regulators. A sketch of what a disclosure protocol could look like in code appears below.
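One way to operationalize the disclosure proposal is a machine-readable evaluation manifest recording every role a model plays in the pipeline, with a validator that flags overlap with the model under test. The schema below (EvalManifest, its field names, the auditor check) is an assumption for illustration, not an existing standard.

```python
# Hypothetical disclosure manifest for an evaluation pipeline.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalManifest:
    model_under_test: str
    test_data_generators: list[str] = field(default_factory=list)
    judges: list[str] = field(default_factory=list)
    environment_simulators: list[str] = field(default_factory=list)
    third_party_auditor: Optional[str] = None

    def conflicts(self) -> list[str]:
        """Return every pipeline role also held by the model under test."""
        roles = {
            "test_data_generator": self.test_data_generators,
            "judge": self.judges,
            "environment_simulator": self.environment_simulators,
        }
        return [role for role, models in roles.items()
                if self.model_under_test in models]

manifest = EvalManifest(
    model_under_test="claude-opus-4.6",        # hypothetical identifier
    test_data_generators=["claude-opus-4.6"],  # the disputed practice
    judges=["independent-grader-v1"],          # hypothetical third party
)

for role in manifest.conflicts():
    print(f"DISCLOSE: model under test also acts as {role}")
if manifest.conflicts() and manifest.third_party_auditor is None:
    print("BLOCK RELEASE: conflicts present and no third-party audit")
```

Under such a scheme, a system card could not silently reuse the model in its own pipeline: any overlap would either be published as a disclosure or block the release pending an independent audit.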
- Anthropic's Claude Opus 4.6 System Card was criticized for using the model in its own evaluation, creating a conflict of interest.
- Researchers Yaniv Golan and Zvi Mowshowitz detailed the risks of circular testing, which can inflate performance metrics.
- The AI community is now actively debating mandatory disclosure rules and third-party audits for future model evaluations.
Why It Matters
Without transparent evaluation standards, companies cannot be trusted to accurately report AI model capabilities, safety, and risks.