Research & Papers

Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

New benchmark spans 9 engineering fields and 7 intent types, revealing a three-tier performance gap in AI models.

Deep Dive

A consortium of researchers led by MZ Naser has introduced the Engineering Reasoning and Instruction (ERI) Benchmark, a comprehensive new dataset designed to rigorously evaluate the engineering capabilities of large language models and AI agents. Published on arXiv, this taxonomy-driven benchmark spans nine core engineering disciplines—including civil, mechanical, electrical, and aerospace—and systematically crosses 55 subdomains with seven distinct reasoning intents (like explanation, calculation, and design) and three professional difficulty levels. This structured approach creates 57,750 unique records, providing a granular tool to measure how well models perform on tasks ranging from undergraduate fundamentals to professional-grade synthesis and troubleshooting.
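
For a sense of the scale implied by those numbers, the sketch below works through the taxonomy cross product; the even split of 50 records per cell is an inference from the published totals, not a figure confirmed by the paper.

```python
# Back-of-the-envelope check of the ERI taxonomy cross product.
# Assumption: records are spread evenly across taxonomy cells; the paper's
# exact per-cell counts may differ.

SUBDOMAINS = 55        # spread across 9 engineering fields
INTENTS = 7            # e.g. explanation, calculation, design
DIFFICULTY_LEVELS = 3  # undergraduate fundamentals through professional-grade

cells = SUBDOMAINS * INTENTS * DIFFICULTY_LEVELS
total_records = 57_750

print(f"taxonomy cells: {cells}")                                   # 1155
print(f"records per cell (if balanced): {total_records // cells}")  # 50
```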

The initial evaluation of seven LLMs revealed a statistically significant three-tier performance hierarchy. The frontier models GPT-5, Claude Sonnet 4, and DeepSeek V3.1 achieved mean scores above 4.30 on a five-point scale, demonstrating strong competency. However, mid-tier and smaller models showed steep performance degradation on graduate-level questions, exposing a clear capability gap. To address common benchmark contamination issues, the team developed a novel convergent validation protocol using cross-provider independence and multi-judge analysis, empirically bounding the hallucination risk at just 1.7%. Released with the full taxonomy specification and an evaluation harness, ERI enables reproducible testing for instruction tuning, RAG (retrieval-augmented generation) systems, and agentic tool-use workflows, setting a new standard for assessing AI in technical domains.
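
The summary does not spell out how the multi-judge analysis is implemented, but a minimal sketch of the general idea, with hypothetical judge names and an illustrative agreement threshold rather than the paper's actual protocol, might look like this:

```python
from statistics import mean, pstdev

def aggregate_judge_scores(scores_by_judge: dict[str, float]) -> dict:
    """Combine per-judge ratings (1-5 scale) for one benchmark response.

    A simple convergence check: judges from independent providers are
    treated as agreeing when the spread of their scores stays below a
    threshold. Names and the 0.5 threshold are illustrative only.
    """
    scores = list(scores_by_judge.values())
    return {
        "mean_score": round(mean(scores), 2),
        "spread": round(pstdev(scores), 2),
        "converged": pstdev(scores) <= 0.5,
    }

# Hypothetical judges from different providers scoring one answer.
print(aggregate_judge_scores({"judge_a": 4.5, "judge_b": 4.0, "judge_c": 4.5}))
```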

Key Points
  • Dataset contains 57,750 structured records across 9 engineering fields, 55 subdomains, and 7 reasoning intents.
  • Frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) scored >4.3/5, while mid-tier and smaller models degraded sharply on graduate-level questions.
  • Includes a convergent validation protocol that empirically bounds hallucination risk at 1.7% for reliable agent evaluation.

Why It Matters

Provides the first standardized test of whether AI agents can reliably perform complex, professional-grade engineering reasoning and design tasks.