Research & Papers

Collider-Bench tests whether AI agents can reproduce LHC particle physics analyses

No AI agent yet reliably beats a physicist-in-the-loop solution at reproducing collider analyses.

Deep Dive

A new paper from a team that includes David Shih's group introduces Collider-Bench, a benchmark designed to test whether AI agents can faithfully reproduce experimental analyses from the Large Hadron Collider (LHC). Each task asks an agent to take a published LHC analysis, turn it into an executable simulation-and-selection pipeline, and submit predicted collision event yields in specified signal regions. Because the public toolchain only approximates the experimental collaboration's internal software, and because papers inevitably omit implementation details, agents must rely on physical reasoning, domain knowledge, and trial and error to fill the gaps. The benchmark scores fidelity continuously with standard histogram metrics, uses an LLM judge to catch hallucinations and duplications, and tracks the computational cost of each task.
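To make the idea of a histogram-based fidelity score concrete, here is a minimal Python sketch. It is not the paper's scoring code: the summary above does not say which histogram metrics Collider-Bench uses, so the two metrics below (a chi-square-style distance per bin and a shape overlap), the function names, and the example yields are all illustrative assumptions.

```python
def chi2_per_bin(predicted, reference):
    """Chi-square-style distance between two yield histograms, averaged over bins.

    Assumption: the variance in each bin is approximated by the reference
    yield (Poisson-like). Bins with a zero reference yield are skipped.
    """
    if len(predicted) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    terms = [(p - r) ** 2 / r for p, r in zip(predicted, reference) if r > 0]
    if not terms:
        raise ValueError("reference histogram has no populated bins")
    return sum(terms) / len(terms)


def shape_overlap(predicted, reference):
    """Histogram intersection of the two shapes after normalizing each to unit area.

    Returns 1.0 for identical shapes and 0.0 for disjoint ones, regardless
    of the overall normalization of either histogram.
    """
    if len(predicted) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    p_total, r_total = sum(predicted), sum(reference)
    if p_total <= 0 or r_total <= 0:
        return 0.0
    return sum(min(p / p_total, r / r_total) for p, r in zip(predicted, reference))


# Hypothetical yields in three signal regions (made-up numbers).
reference_yields = [100.0, 50.0, 10.0]
agent_yields = [110.0, 45.0, 10.0]

print("chi2 per bin:", chi2_per_bin(agent_yields, reference_yields))
print("shape overlap:", shape_overlap(agent_yields, reference_yields))
```

The two numbers answer different questions: the chi-square-style distance penalizes errors in absolute yields, while the overlap compares only the shape of the distribution across regions. A continuous score of this kind distinguishes a near miss from a badly wrong pipeline, which a pass/fail check could not.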

Evaluating a capability ladder of general-purpose coding agents, the authors find that, on average, no agent reliably beats the physicist-in-the-loop solution. The result underscores how far autonomous AI agents still are from replacing domain expertise in complex scientific workflows. The benchmark ships with an initial set of tasks drawn from LHC searches, a containerized sandbox, and event simulation tools, making it a new testbed for long-horizon, knowledge-intensive tasks in science.

Key Points
  • Benchmark tasks require reproducing LHC analyses by building simulation pipelines from papers and open tools.
  • Agents are scored with histogram metrics for prediction fidelity, and an LLM judge catches hallucinations and duplications.
  • No agent reliably outperforms the physicist-in-the-loop solution, highlighting how hard real scientific research remains for AI.

Why It Matters

Collider-Bench provides a rigorous, real-world benchmark for advancing AI's scientific reasoning and long-horizon tool use.