AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Across more than 100,000 ratings, 1,614 LLM-written fact-checking notes showed higher cross-partisan consensus than human-written notes.
Stanford researchers Haiwen Li and Michiel A. Bakker have published the first field evaluation of AI-powered fact-checking deployed on a live social media platform. Over three months, they tested a multi-step LLM pipeline that handles text, images, and videos, conducts web searches, and writes contextual notes for X's Community Notes feature. The system generated 1,614 notes on 1,597 tweets, which were compared against 1,332 human-written notes on the same content using 108,169 ratings from 42,521 unique raters.
Direct comparison revealed that LLM-written notes received more positive ratings than human notes from raters across the political spectrum, suggesting a greater potential for achieving cross-partisan consensus. The note-level analysis confirmed this advantage: among raters who evaluated all notes on the same post, LLM notes achieved significantly higher helpfulness scores. These results indicate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while underscoring that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
The study's methodology addressed platform-specific challenges by using two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalized rater exposure across note types. This rigorous approach provides the first empirical evidence that LLM-generated fact-checking can outperform human efforts in authentic social media environments, potentially transforming how misinformation is addressed at scale while maintaining broad acceptance across political divides.
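The note-level analysis described above can be illustrated with a small sketch. This is a hypothetical reconstruction, not the authors' code: the field names, the 0/1 helpfulness encoding, and the exact exposure-equalization rule (counting only raters who rated every note on a post) are all assumptions made for illustration.

```python
# Hypothetical sketch of a note-level comparison with equalized rater
# exposure. Data schema and scoring are assumptions, not the study's code.
from collections import defaultdict

# Each rating is a tuple: (post_id, note_id, note_type, rater_id, helpful)
def note_level_scores(ratings):
    """Mean helpfulness per note type, counting only raters who
    rated every note on a given post (equalized exposure)."""
    notes_on_post = defaultdict(set)   # post -> ids of all notes on it
    rated_by = defaultdict(set)        # (post, rater) -> note ids they rated
    for post, note, _, rater, _ in ratings:
        notes_on_post[post].add(note)
        rated_by[(post, rater)].add(note)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for post, note, ntype, rater, helpful in ratings:
        # Keep the rating only if this rater saw every note on the post,
        # so LLM and human notes are compared on the same audience.
        if rated_by[(post, rater)] == notes_on_post[post]:
            sums[ntype] += helpful
            counts[ntype] += 1
    return {t: sums[t] / counts[t] for t in sums}

# Toy example: one post with one LLM note and one human note.
ratings = [
    ("p1", "n_llm", "llm", "r1", 1), ("p1", "n_hum", "human", "r1", 0),
    ("p1", "n_llm", "llm", "r2", 1), ("p1", "n_hum", "human", "r2", 1),
    ("p1", "n_llm", "llm", "r3", 0),  # r3 skipped the human note -> excluded
]
print(note_level_scores(ratings))  # {'llm': 1.0, 'human': 0.5}
```

Restricting to raters who evaluated all notes on a post removes a selection effect: without it, one note type could look better simply because it was shown to friendlier audiences.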
- First real-world field test of AI fact-checking on X's Community Notes platform, analyzing 1,597 tweets over three months
- LLM-generated notes (1,614) outperformed human notes (1,332) with higher helpfulness scores across 108,169 ratings from 42,521 users
- AI notes achieved greater cross-partisan consensus, receiving more positive ratings from raters across the political spectrum
Why It Matters
Shows that AI can scale high-quality fact-checking that bridges political divides, potentially transforming how platforms combat misinformation.