AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Across more than 100,000 ratings, 1,614 LLM-written fact-checking notes showed higher cross-partisan consensus than human-written notes.
Stanford researchers Haiwen Li and Michiel A. Bakker have published the first field evaluation of AI-powered fact-checking deployed on a live social media platform. Over three months, they tested a multi-step LLM pipeline that handles text, images, and videos, conducts web searches, and writes contextual notes for X's Community Notes feature. The system generated 1,614 notes on 1,597 tweets, which were compared against 1,332 human-written notes on the same content using 108,169 ratings from 42,521 unique raters.
Direct comparison revealed that LLM-written notes received more positive ratings than human notes from raters across the political spectrum, suggesting a greater potential for achieving cross-partisan consensus. The note-level analysis confirmed this advantage: among raters who evaluated all notes on the same post, LLM notes achieved significantly higher helpfulness scores. These results indicate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while underscoring that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
The study's methodology addressed platform-specific challenges by using two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalized rater exposure across note types. This rigorous approach provides the first empirical evidence that LLM-generated fact-checking can outperform human efforts in authentic social media environments, potentially transforming how misinformation is addressed at scale while maintaining broad acceptance across political divides.
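The note-level analysis described above can be illustrated with a small sketch. This is a hypothetical reconstruction, not the authors' code: the field names, the 0/1 helpfulness encoding, and the exact exposure-equalization rule (counting only raters who rated every note on a post) are all assumptions made for illustration.

```python
# Hypothetical sketch of a note-level comparison with equalized rater
# exposure. Data schema and scoring are assumptions, not the study's code.
from collections import defaultdict

# Each rating is a tuple: (post_id, note_id, note_type, rater_id, helpful)
def note_level_scores(ratings):
    """Mean helpfulness per note type, counting only raters who
    rated every note on a given post (equalized exposure)."""
    notes_on_post = defaultdict(set)   # post -> ids of all notes on it
    rated_by = defaultdict(set)        # (post, rater) -> note ids they rated
    for post, note, _, rater, _ in ratings:
        notes_on_post[post].add(note)
        rated_by[(post, rater)].add(note)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for post, note, ntype, rater, helpful in ratings:
        # Keep the rating only if this rater saw every note on the post,
        # so LLM and human notes are compared on the same audience.
        if rated_by[(post, rater)] == notes_on_post[post]:
            sums[ntype] += helpful
            counts[ntype] += 1
    return {t: sums[t] / counts[t] for t in sums}

# Toy example: one post with one LLM note and one human note.
ratings = [
    ("p1", "n_llm", "llm", "r1", 1), ("p1", "n_hum", "human", "r1", 0),
    ("p1", "n_llm", "llm", "r2", 1), ("p1", "n_hum", "human", "r2", 1),
    ("p1", "n_llm", "llm", "r3", 0),  # r3 skipped the human note -> excluded
]
print(note_level_scores(ratings))  # {'llm': 1.0, 'human': 0.5}
```

Restricting to raters who evaluated all notes on a post removes a selection effect: without it, one note type could look better simply because it was shown to friendlier audiences.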
- First real-world field test of AI fact-checking on X's Community Notes platform, analyzing 1,597 tweets over three months
- LLM-generated notes (1,614) outperformed human notes (1,332) with higher helpfulness scores across 108,169 ratings from 42,521 users
- AI notes achieved greater cross-partisan consensus, receiving more positive ratings from raters across the political spectrum
Why It Matters
Shows that AI can scale high-quality fact-checking that bridges political divides, potentially transforming how platforms combat misinformation.