Sanity-checking “Incompressible Knowledge Probes”
Bojie Li's 'Incompressible Knowledge Probes' paper fails reproducibility checks.
A recent paper by Bojie Li, chief scientist at Pine AI, caused a stir by claiming to reverse-engineer the parameter counts of frontier closed-source models: GPT-5.5 at 9.7T, Claude Opus 4.7 at 4.0T, o1 at 3.5T, and GPT-4o at 720B. The method regressed model performance on a factual-knowledge dataset of varying difficulty against the known parameter counts of open-source models, then extrapolated the fitted line to the closed models. However, independent researchers Benjamin and Lawrence quickly flagged the work as methodologically sloppy. Their audit found the codebase was largely AI-generated (via Claude Code), with redundant variables, silent failures, and a hidden “minimum floor” on scores that directly contradicted the paper’s claims. The dataset itself had serious quality issues: at least 6.8% of hard Wikidata questions and 25.9% of researcher-sourced questions were ambiguous or had wrong gold answers.
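To make the mechanism concrete, here is a minimal sketch of the regress-and-extrapolate step described above: fit probe accuracy against log parameter count for open models, then invert the fit for a closed model’s score. Every number below is an invented placeholder, not data from the paper.

```python
import numpy as np

# (log10 parameter count, probe accuracy) for open-source anchor models;
# placeholder values only, not the paper's measurements.
log_params = np.array([9.9, 10.5, 10.8, 11.2, 11.6])
accuracy = np.array([0.22, 0.31, 0.36, 0.44, 0.52])

# Ordinary least squares fit: accuracy ~ a * log10(params) + b
a, b = np.polyfit(log_params, accuracy, deg=1)

# Goodness of fit on the open models.
pred = a * log_params + b
r2 = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - accuracy.mean()) ** 2)

# The extrapolation step the audit criticizes: invert the line to turn a
# closed model's measured accuracy into a parameter-count estimate.
closed_accuracy = 0.61  # hypothetical measured score
est_log10_params = (closed_accuracy - b) / a
print(f"open-model fit R^2 = {r2:.3f}")
print(f"estimated size ~ 10^{est_log10_params:.2f} parameters")
```

Because the closed models sit beyond the fitted range, small changes in the slope translate into order-of-magnitude swings in the extrapolated estimate.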
The core idea behind the paper, that factual recall capacity correlates with parameter count, is not unreasonable: the regression on open-source models alone showed an R² between 0.78 and 0.92, depending on methodological choices. But Li’s specific parameter estimates for GPT-5.5 and Claude Opus 4.7 are unreliable given these flaws. The hidden floor artificially inflated scores for small models while leaving large models untouched, flattening the fitted line and skewing the regression, so correcting the dataset and removing the floor would shift the extrapolated estimates significantly. The episode serves as a cautionary tale about “vibe-coded” AI research, where flashy results outpace rigorous validation. For tech professionals, it underscores the importance of independent reproducibility checks before trusting viral claims about model size or capability.
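The floor’s effect on the fit can be illustrated with a toy example (this is not the audited code, and the floor value is an assumption): clamping small-model scores from below flattens the low end of the line, which changes both slope and intercept, and therefore the size estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" relationship: accuracy grows linearly in log10(params).
log_params = np.linspace(9.5, 11.6, 12)
true_acc = 0.20 * log_params - 1.75 + rng.normal(0.0, 0.01, size=12)

# Hypothetical hidden minimum floor, as described in the audit.
FLOOR = 0.30
floored_acc = np.maximum(true_acc, FLOOR)

a_clean, b_clean = np.polyfit(log_params, true_acc, 1)
a_floor, b_floor = np.polyfit(log_params, floored_acc, 1)

# The same measured closed-model accuracy now yields two different
# parameter-count estimates, one from each fit.
closed_acc = 0.62
print(f"clean fit:   ~10^{(closed_acc - b_clean) / a_clean:.2f} params")
print(f"floored fit: ~10^{(closed_acc - b_floor) / a_floor:.2f} params")
```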
- Paper claims GPT-5.5 has 9.7T parameters and Claude Opus 4.7 has 4.0T via factual knowledge regression.
- Audit found a hidden score floor in the code, contradicting the paper; 25.9% of researcher-sourced questions were ambiguous or had wrong gold answers (see the sketch after this list).
- Open-source regression is valid (R² 0.78–0.92), but specific closed-model estimates are unreliable.
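On the dataset side, a back-of-envelope calculation (an assumption about the grading, not a result from the audit) shows why mislabeled gold answers matter: if a fraction e of items have wrong or ambiguous answers and models receive no credit on them, measured accuracy is compressed by a factor of (1 − e) before it ever reaches the regression.

```python
# Back-of-envelope sketch: effect of bad gold answers on measured scores.
# Assumes (hypothetically) that models are always marked wrong on items
# whose gold answer is itself wrong or ambiguous.
def measured_accuracy(true_acc: float, bad_gold_rate: float) -> float:
    return true_acc * (1.0 - bad_gold_rate)

# The audit's two reported error rates: 6.8% (hard Wikidata questions)
# and 25.9% (researcher-sourced questions).
for e in (0.068, 0.259):
    print(f"bad-gold rate {e:.1%}: true 0.80 -> measured {measured_accuracy(0.80, e):.3f}")
```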
Why It Matters
Highlights the need for rigorous, independent review before accepting viral AI benchmark claims.