PhysDox benchmark shows LLMs max out at 53.0 F1 on physical feasibility
LLMs generate fluent protocols but can't spot fatal physical flaws.
A team of researchers from multiple institutions introduced PhysDox, a benchmark designed to evaluate LLMs on physical feasibility auditing of physiological sensing protocols. The benchmark comprises a 683-sample expert-curated Gold set and a 5,000-sample Silver set covering six sensing domains. The task is structured as a two-stage evaluation: severity detection (classifying protocols as valid, minor, or fatal) followed by constraint-level diagnosis of fatal violations. This approach aims to test whether LLMs can go beyond generating fluent text and actually reason about the physical realizability of experimental procedures.
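For readers unfamiliar with the Stage-1 metric, here is a minimal sketch of macro-F1 over the three severity classes named above (valid, minor, fatal). The function name, the label strings, and the 0-100 scaling are assumptions for illustration; the benchmark's own scoring code is not described in this summary.

```python
from typing import Sequence

# Severity classes as described in the article; the exact label strings are assumed.
LABELS = ("valid", "minor", "fatal")


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, scaled to 0-100."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances scores 0 here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(per_class) / len(per_class)


# Toy data, not taken from the benchmark.
gold = ["valid", "minor", "fatal", "fatal"]
pred = ["valid", "valid", "fatal", "valid"]
score = macro_f1(gold, pred)
```

Because every class counts equally, a model that labels most protocols as valid is penalized heavily on the minor and fatal classes even if its raw accuracy looks acceptable.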
Testing six LLMs across four inference strategies yielded a peak Stage-1 macro-F1 of only 53.0, indicating significant room for improvement. End-to-end evaluation of constraint diagnosis suffered from correlated cascade errors, so diagnosis performance that looked strong under oracle conditions collapsed in the full pipeline. Error analysis revealed scaffold bias: models equate procedural completeness with physical validity. Implicit constraints were missed at twice the rate of explicit hardware violations, supported by a strong statistical correlation (ρ=0.81, p<0.01). Further trace analysis of false negatives showed a 54% to 46% split between attention failures and judgment failures, underscoring that protocol auditing requires calibrated feasibility reasoning rather than factual recall or longer rationales.
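The cascade effect can be illustrated with a small sketch that scores constraint diagnosis two ways: with gold severity labels supplied (oracle) and with the model's own Stage-1 predictions gating Stage 2 (end-to-end). The metric definition, function name, and constraint labels below are hypothetical and are not taken from the benchmark.

```python
from typing import Optional, Sequence


def diagnosis_recall(gold_severity: Sequence[str],
                     stage1_pred: Sequence[str],
                     gold_constraint: Sequence[Optional[str]],
                     stage2_pred: Sequence[Optional[str]],
                     oracle: bool) -> float:
    """Share of truly fatal protocols whose violated constraint is named correctly.

    With oracle=True every fatal protocol reaches Stage 2. Otherwise a protocol
    only reaches Stage 2 if Stage 1 also flagged it as fatal.
    """
    fatal_idx = [i for i, s in enumerate(gold_severity) if s == "fatal"]
    if not fatal_idx:
        return 0.0
    hits = 0
    for i in fatal_idx:
        reached = oracle or stage1_pred[i] == "fatal"
        if reached and stage2_pred[i] == gold_constraint[i]:
            hits += 1
    return hits / len(fatal_idx)


# Toy data with made-up constraint names.
gold_severity = ["fatal", "fatal", "fatal", "fatal", "valid"]
stage1_pred = ["fatal", "valid", "valid", "fatal", "valid"]
gold_constraint = ["sampling_rate", "skin_contact", "motion", "power", None]
stage2_pred = ["sampling_rate", "skin_contact", "motion", "bandwidth", None]

oracle_score = diagnosis_recall(gold_severity, stage1_pred,
                                gold_constraint, stage2_pred, oracle=True)
end_to_end_score = diagnosis_recall(gold_severity, stage1_pred,
                                    gold_constraint, stage2_pred, oracle=False)
```

In this toy case the diagnoser is right on three of four fatal protocols under oracle conditions, but only one of those survives the Stage-1 gate, which is the kind of drop the article describes.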
- PhysDox benchmark includes 683 expert-curated Gold samples and 5,000 Silver samples across 6 physiological sensing domains.
- Top LLM achieved only 53.0 macro-F1 in severity detection; constraint diagnosis collapsed due to cascade errors.
- Implicit constraints were missed 2x more often than explicit hardware violations; 54% of false negatives traced to attention failures, 46% to judgment failures.
Why It Matters
As LLMs assist in research design, their inability to verify physical feasibility could lead to faulty experiments.