Research & Papers

NeurIPS desk-rejected papers using uncalibrated AI detector Pangram

Apparent false positives on track chairs' own papers point to a flawed AI-detection process

Deep Dive

A recent desk rejection from the NeurIPS 2026 Position Paper Track has sparked controversy over the use of uncalibrated AI detection tools. The conference used Pangram, a proprietary AI-text detector, as part of its desk-rejection process for alleged AI-policy violations. The author of the rejected paper argues that the process is circular: the detector output and the author's AI-use attestation are nominally both considered, but if a high detector score is itself the reason the attestation is judged inconsistent, the attestation carries no independent weight and the detector effectively becomes the sole arbiter. More critically, the author says, NeurIPS did not validate Pangram on the actual target distribution of submissions: its tests used synthetic and edited samples, not real papers submitted to the track. When the author ran Pangram on recent papers by the NeurIPS 2026 Position Paper Track chairs, the detector returned scores ranging from 24% to 69% AI content, scores that would have triggered scrutiny under the same policy. If those papers were human-written, the results point to a high false-positive rate and a distribution mismatch, which undermines the fairness of the rejection process.
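To make the calibration point concrete, here is a minimal sketch of what validating a detector on the target distribution could look like: score a sample of documents known to be human-written, count how many cross the flagging threshold, and report the false-positive rate with a confidence interval. The scores and the threshold below are hypothetical placeholders, not Pangram output and not NeurIPS's actual cutoff, and the function names are invented for illustration.

```python
import math

def wilson_interval(flagged, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = flagged / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def false_positive_rate(human_scores, threshold):
    """Count known-human documents whose detector score meets the threshold."""
    flagged = sum(score >= threshold for score in human_scores)
    return flagged, len(human_scores), flagged / len(human_scores)

# HYPOTHETICAL detector scores (fraction "AI") for known-human papers
# drawn from the same venue and track as the real submissions.
human_scores = [0.02, 0.05, 0.24, 0.31, 0.08, 0.69, 0.12, 0.45, 0.03, 0.10]
THRESHOLD = 0.20  # HYPOTHETICAL flagging threshold

flagged, total, fpr = false_positive_rate(human_scores, THRESHOLD)
low, high = wilson_interval(flagged, total)
print(f"flagged {flagged}/{total}, FPR = {fpr:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With a sample this small the interval is very wide, which is itself part of the argument: a handful of synthetic or edited test documents says little about the error rate on real submissions.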

The incident highlights a systemic problem at AI research conferences: using black-box commercial detectors without rigorous, domain-specific validation. The NeurIPS blog post acknowledged a "surprisingly high flagged rate" among submissions, which the author argues is exactly what one would expect from an uncalibrated detector that is measuring noise rather than true AI authorship. The consequences extend beyond one rejected paper. If venues like NeurIPS continue to rely on unvalidated detection tools, they risk rejecting legitimate human-written work while creating perverse incentives for authors to obfuscate their writing. Because Pangram is proprietary, researchers cannot audit how it reaches a score, which makes decisions difficult to appeal. As AI writing tools become ubiquitous, academic bodies need to establish clear, validated protocols that separate genuine misuse from false positives.
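The "surprisingly high flagged rate" argument is a base-rate calculation, and a short sketch may help. The fraction of submissions a detector flags is a mix of true and false positives, so when genuine violations are rare, even a moderate false-positive rate dominates the flagged pool. All three input numbers below are hypothetical and chosen only to illustrate the arithmetic; none comes from NeurIPS or Pangram.

```python
def flagged_rate(prevalence, tpr, fpr):
    """Overall share of submissions flagged: true positives plus false positives."""
    return prevalence * tpr + (1 - prevalence) * fpr

def precision(prevalence, tpr, fpr):
    """Share of flagged submissions that are genuine violations."""
    rate = flagged_rate(prevalence, tpr, fpr)
    return prevalence * tpr / rate if rate > 0 else float("nan")

# HYPOTHETICAL inputs, for illustration only.
prevalence = 0.05  # share of submissions that truly violate the AI policy
tpr = 0.95         # detector catches 95% of true violations
fpr = 0.20         # detector flags 20% of compliant submissions

rate = flagged_rate(prevalence, tpr, fpr)
ppv = precision(prevalence, tpr, fpr)
print(f"flagged rate = {rate:.4f}, precision = {ppv:.2f}")
```

Under these assumed numbers, roughly 24% of all submissions are flagged but only one flag in five is a real violation, which is why a high flagged rate alone cannot distinguish widespread misuse from a miscalibrated detector.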

Key Points
  • NeurIPS desk-rejected a paper using Pangram, a proprietary AI detector, without calibrating it on the actual submission distribution.
  • Pangram returned AI scores of 24% to 69% on recent papers by the track chairs themselves, indicating a high risk of false positives.
  • The process is circular: detector output and author attestation are both nominally considered, but a high detector score is used to discredit the attestation, which makes the unvalidated detector the decisive factor.
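The circularity point can be shown with a toy model. The decision rule below is a simplified reading of the process as the rejected author describes it, not NeurIPS's documented procedure, and the threshold and function names are hypothetical. It checks every combination of inputs and confirms that the attestation never changes the outcome.

```python
from itertools import product

THRESHOLD = 0.5  # HYPOTHETICAL flagging threshold

def decide_circular(score, attests_compliance):
    """Toy rule: the attestation is believed only when the detector agrees."""
    flagged = score >= THRESHOLD
    # Circular step: a high score is itself the evidence against the attestation.
    attestation_believed = attests_compliance and not flagged
    compliant = attestation_believed or not flagged
    return "proceed" if compliant else "desk-reject"

def decide_detector_only(score):
    """Baseline rule that ignores the attestation entirely."""
    return "desk-reject" if score >= THRESHOLD else "proceed"

# Compare the two rules over a grid of scores and both attestation values.
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
same_everywhere = all(
    decide_circular(s, a) == decide_detector_only(s)
    for s, a in product(scores, [True, False])
)
print("attestation never changes the outcome:", same_everywhere)
```

A non-circular process would need some evidence about the attestation that does not come from the detector, such as drafts, revision history, or a human review.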

Why It Matters

Uncalibrated AI detectors at academic conferences risk wrongly rejecting human-written work and erode trust in the integrity of peer review.