NIST's CAISI Evaluation Ranks DeepSeek V4 as Most Capable PRC Model, Notes 8-Month Lag Behind US Frontier
DeepSeek V4 Pro is impressive but still trails US frontier by 8 months
In April 2026, the Center for AI Standards and Innovation (CAISI) at NIST released an evaluation of DeepSeek V4 Pro, the latest open-weight model from China. Across 16 benchmarks covering five critical domains—cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics—DeepSeek V4 ranked as the most capable PRC model CAISI has ever tested. However, its aggregate capability lags behind leading US frontier models by approximately eight months. For example, DeepSeek V4 achieved an IRT-estimated Elo of 800, compared to 1260 for OpenAI's GPT-5.5 and 999 for Anthropic's Opus 4.6. Notably, DeepSeek's own reported scores suggested parity with models like Opus 4.6 and GPT-5.4, but CAISI's held-out, non-public benchmarks (ARC-AGI-2 semi-private and PortBench) revealed a significant gap. In abstract reasoning (ARC-AGI-2), DeepSeek V4 scored only 46%, far behind Opus 4.6's 63% and GPT-5.5's 79%. In software engineering (PortBench), DeepSeek V4 managed 44%, while GPT-5.5 achieved 78%.
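To put the Elo gap in concrete terms: if CAISI's IRT-estimated Elo scores follow the standard logistic Elo convention (an assumption; the report's exact IRT-to-Elo mapping is not specified here), the rating difference implies a head-to-head expected win rate. A minimal sketch:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected head-to-head score for a model rated r_a against one rated r_b,
    under the standard logistic Elo model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# IRT-estimated Elo scores cited in the article
deepseek_v4 = 800
gpt_5_5 = 1260
opus_4_6 = 999

print(f"DeepSeek V4 vs GPT-5.5:  {elo_win_prob(deepseek_v4, gpt_5_5):.1%}")   # ~6.6%
print(f"DeepSeek V4 vs Opus 4.6: {elo_win_prob(deepseek_v4, opus_4_6):.1%}")  # ~24.1%
```

Under that convention, a 460-point deficit corresponds to winning roughly 1 in 15 head-to-head comparisons against GPT-5.5, which gives a more intuitive feel for the "eight-month lag" than the raw ratings alone.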
Despite the capability lag, DeepSeek V4 demonstrated notable cost efficiency. Compared to the most cost-competitive US reference model (GPT-5.4 mini), DeepSeek V4 was cheaper on five of seven benchmarks, with savings of up to 53%, though it was 41% more expensive on one benchmark. This makes it an attractive option for price-sensitive deployments where cutting-edge performance isn't required. The evaluation also highlighted that DeepSeek V4 is the strongest PRC model to date in cybersecurity challenges (CTF-Archive-Diamond: 32%), though still below GPT-5.5 (71%). The report underscores the importance of independent, rigorous testing to verify self-reported claims, especially as geopolitical competition in AI intensifies. For US professionals, the findings suggest a sustained lead in high-end capability, but growing competition in cost-effective, open-weight models.
- DeepSeek V4 Pro lags behind US frontier models by about 8 months, with an IRT Elo of 800 vs GPT-5.5's 1260.
- CAISI's non-public benchmarks (ARC-AGI-2, PortBench) show DeepSeek V4 scores 46% and 44%, far below GPT-5.5's 79% and 78%.
- DeepSeek V4 is more cost-efficient than GPT-5.4 mini on 5 of 7 benchmarks, with cost savings up to 53%.
Why It Matters
Independent evaluations reveal the real gap between US and PRC AI, crucial for enterprise procurement and national security strategy.