Robotics

Best defect detector yields worst robot policy in new study

arXiv cs.RO June 10, 2026

⚡Top AUROC metric gave only 13.3% task success vs 90% from a weaker detector.

Deep Dive

A new study by Aarav Bedi (arXiv:2606.10229) reports a disconnect between how well demonstration-curation metrics detect defective episodes and how well the resulting behavior-cloning policies perform. Using the contact-rich LIBERO pick-and-place benchmark, Bedi introduced a controlled structural defect (early gripper release during the carry phase) and tested seven metrics for detecting defective training episodes. The metric with the highest defect-detection AUROC (0.804) produced the worst curated policy, with only 13.3% task success. In contrast, a metric with a substantially lower AUROC (0.638) yielded a policy that nearly matched the oracle trained on ground-truth clean data: 90.0% vs. 93.3% task success.

The paper also uncovers a major confound: five of the seven metrics exploit episode length as a trivial proxy for the defect label, inflating reported AUROCs to near-perfect values. Once episode length is controlled, these detection scores collapse. Across all conditions, the contaminated baseline succeeded on just 3.3% of rollouts, but the two best curation methods closed the gap to within 3 percentage points of the oracle ceiling. Bedi argues that curation methods should be evaluated by the policy they produce, not by the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. The full testbed, metric implementations, and evaluation pipeline are released for community use.
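The length confound can be illustrated with a minimal sketch. This is not the paper's released pipeline: the function names, the bin width, and the toy data below are all assumptions made for illustration. The idea is to compute AUROC over all defective/clean pairs, then again using only pairs of episodes that fall in the same length bin, so that length alone can no longer separate the classes.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Probability that a random defective episode (label 1) scores higher
    than a random clean episode (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one episode of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def length_controlled_auroc(scores, labels, lengths, bin_width=10):
    """AUROC over defective/clean pairs that share a length bin.
    bin_width is a hypothetical choice, not a value from the paper."""
    bins = defaultdict(lambda: ([], []))  # bin -> (clean scores, defective scores)
    for s, y, n in zip(scores, labels, lengths):
        bins[n // bin_width][y].append(s)
    wins, pairs = 0.0, 0
    for neg, pos in bins.values():
        wins += sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        pairs += len(pos) * len(neg)
    if pairs == 0:
        raise ValueError("no length bin contains both classes")
    return wins / pairs


# Toy data (hypothetical): defective episodes are mostly shorter here.
lengths = [100, 102, 104, 61, 63, 65, 101, 62]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
# A hypothetical "metric" that only tracks episode length.
scores = [-n for n in lengths]

raw = auroc(scores, labels)                                   # 0.75
controlled = length_controlled_auroc(scores, labels, lengths)  # 0.5
print(f"raw AUROC: {raw:.2f}, length-controlled AUROC: {controlled:.2f}")
```

On this toy data the length-only score looks informative overall but falls to chance level once comparisons are restricted to episodes of similar length, which is the kind of collapse the study describes.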

Key Points

Highest defect-detection AUROC (0.804) produced only 13.3% task success on LIBERO pick-and-place
Lower AUROC metric (0.638) achieved 90.0% success, within 3.3 percentage points of the oracle's 93.3% ceiling
Five of seven metrics trivially exploit episode length as a confound, inflating reported AUROCs

Why It Matters

The results suggest that roboticists should evaluate curation methods by the task success of the policies they yield, not by defect-detection scores alone.
