Research & Papers

In harmony with gpt-oss

A new study reconstructs OpenAI's undisclosed evaluation setup to reproduce GPT-OSS-20B's published benchmark scores, matching the SWE-bench Verified results to within 0.3 points.

Deep Dive

A new research paper titled 'In harmony with gpt-oss' by Borislav Mavrin details the first successful independent reproduction of OpenAI's published benchmark scores for the GPT-OSS-20B model with tools. The result rests on two key insights. First, the researchers discovered that even when prompted without any explicit tool definitions, GPT-OSS-20B still calls tools from its training distribution at consistently high rates—evidence of a strong learned prior rather than hallucination. This allowed them to reverse-engineer the model's in-distribution tools, which OpenAI's original paper had not disclosed.
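The probing idea described above can be sketched as follows: sample completions from a prompt that declares no tools, then count how often the model still emits a harmony-style tool call. This is an illustrative sketch, not the paper's code—the `to=functions.<name>` call-header pattern follows OpenAI's published harmony format, and the sample strings (including the `apply_patch` tool name) are placeholders standing in for real model outputs.

```python
import re
from collections import Counter

# Harmony tool calls address a recipient like "to=functions.<tool_name>".
TOOL_CALL = re.compile(r"to=functions\.(\w+)")

def tool_call_counts(completions):
    """Count which tool names appear as call targets across sampled completions."""
    counts = Counter()
    for text in completions:
        counts.update(TOOL_CALL.findall(text))
    return counts

# Placeholder samples; in the study these would be many real model completions
# generated without any tool schema in the prompt.
samples = [
    "<|channel|>commentary to=functions.apply_patch<|message|>{...}<|call|>",
    "<|channel|>commentary to=functions.apply_patch<|message|>{...}<|call|>",
    "<|channel|>final<|message|>Done.<|end|>",
]
print(tool_call_counts(samples))
```

A high, concentrated count over a small set of tool names—despite none being declared—is what distinguishes a training-distribution prior from free-form hallucination.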

Second, the team built a native 'harmony' agent harness that encodes messages directly in the model's native format, bypassing the lossy conversion through OpenAI's Chat Completions API. This technical workaround proved critical for accurate performance measurement. The combined approach yielded scores remarkably close to OpenAI's: 60.4% versus 60.7% on SWE-bench Verified HIGH, 53.3% versus 53.2% on MEDIUM, and 91.7% versus 90.4% on AIME25 with tools—within 0.3 points on the SWE-bench tasks, and slightly above OpenAI's reported AIME25 number.
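The core of the "native harmony" idea can be illustrated with a minimal renderer: instead of the role/content JSON of the Chat Completions API, the conversation is serialized directly into harmony's special-token text format. The token names (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`) follow OpenAI's published harmony format; the helper itself is a hypothetical sketch, not the paper's harness.

```python
def render_harmony(messages):
    """Render chat messages as a single harmony-format prompt string."""
    parts = []
    for msg in messages:
        header = msg["role"]
        # Assistant turns carry an explicit channel (e.g. analysis, commentary, final);
        # Chat Completions has no equivalent field, which is one source of lossiness.
        if "channel" in msg:
            header += f"<|channel|>{msg['channel']}"
        parts.append(f"<|start|>{header}<|message|>{msg['content']}<|end|>")
    # Leave an open assistant header so the model begins its next turn.
    parts.append("<|start|>assistant")
    return "".join(parts)

prompt = render_harmony([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```

Because channels, tool-call headers, and reasoning turns have no faithful representation in the Chat Completions schema, round-tripping through that API can silently alter what the model actually sees—which is why encoding natively matters for benchmarking.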

The research addresses a significant transparency gap in AI benchmarking, where companies often publish impressive results without providing the necessary tooling or evaluation frameworks for independent verification. By successfully reproducing these scores, the study not only validates OpenAI's claims but also establishes a methodology for testing other closed AI systems. The harmony agent harness is publicly available, providing other researchers with tools to conduct similar verification studies on proprietary models.

Key Points
  • Reverse-engineered GPT-OSS-20B's in-distribution tools from its tool-calling prior, scoring 60.4% on SWE-bench Verified HIGH vs. OpenAI's 60.7%
  • Built native 'harmony' agent harness bypassing lossy API conversions, achieving 91.7% on AIME25 with tools
  • Provides first independent verification method for closed AI systems, addressing transparency concerns in benchmark reporting
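The gaps between published and reproduced numbers above can be checked directly; the SWE-bench figures differ by at most 0.3 points, while the reproduced AIME25 score actually exceeds the published one:

```python
# Published (OpenAI) vs. independently reproduced scores, in percentage points,
# as reported in the summary above.
scores = {
    "SWE-bench Verified (HIGH)":   (60.7, 60.4),
    "SWE-bench Verified (MEDIUM)": (53.2, 53.3),
    "AIME25 (with tools)":         (90.4, 91.7),
}

for task, (published, reproduced) in scores.items():
    delta = round(reproduced - published, 1)
    print(f"{task}: published {published}, reproduced {reproduced}, delta {delta:+.1f}")
```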

Why It Matters

Establishes crucial methodology for independently verifying AI benchmark claims, increasing accountability in closed-model development.