Media & Culture

How many AIs does it take to read a PDF?

Despite AI's advanced capabilities, parsing PDFs remains a 'grand challenge' limiting real-world applications.

Deep Dive

A new investigation by The Verge's Josh Dzieza exposes a critical weakness in today's most advanced AI models: they can't reliably read PDFs. When researchers attempted to analyze 3 million Jeffrey Epstein documents released by the Department of Justice, they found that even state-of-the-art models like Google's Gemini failed to properly extract information from the PDF format. The models would summarize instead of extract, confuse footnotes with body text, or outright hallucinate content.

PDFs present a fundamental challenge because they were designed for visual preservation, not machine readability. Unlike HTML, which represents text logically, a PDF consists of character codes and the coordinates at which to paint them on the page. This breaks down on the multi-column layouts common in academic papers: optical character recognition (OCR) that reads left to right across the full page width runs straight through the column gap and produces an unintelligible jumble.
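To make the multi-column failure concrete, here is a minimal Python sketch. It is an illustration only, not how Reducto or any particular OCR tool works: the `Word` record, the sample page, and the `gutter_x` split point are all hypothetical, and `y` is measured from the top of the page for simplicity.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A word as a PDF-like format might expose it: text plus a position."""
    text: str
    x: float  # left edge, in points (hypothetical units)
    y: float  # distance from the top of the page, in points


def naive_order(words):
    """Read top to bottom, then left to right, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered)


def column_order(words, gutter_x):
    """Read the whole left column first, then the whole right column.

    Assumes exactly two columns separated at a known x position.
    """
    left = [w for w in words if w.x < gutter_x]
    right = [w for w in words if w.x >= gutter_x]
    return naive_order(left) + " " + naive_order(right)


# A made-up two-column page: two lines of text in each column.
PAGE = [
    Word("PDFs", 50, 100), Word("store", 90, 100),
    Word("glyphs", 50, 115), Word("only.", 100, 115),
    Word("Columns", 320, 100), Word("confuse", 380, 100),
    Word("naive", 320, 115), Word("readers.", 360, 115),
]

print(naive_order(PAGE))
# PDFs store Columns confuse glyphs only. naive readers.
print(column_order(PAGE, gutter_x=300))
# PDFs store glyphs only. Columns confuse naive readers.
```

The naive pass interleaves the two columns line by line, which is the kind of jumble described above. Real documents are harder still, since the number and position of columns are not known in advance.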

Companies like Reducto are now specializing in solving this problem. When applied to the Epstein documents, Reducto's technology successfully extracted data from garbled email threads, heavily redacted call logs, and low-quality scans of handwritten flight manifests. This enabled researchers to build searchable applications like Jmail (an Epstein inbox prototype) and Jflights (an interactive flight path visualizer), demonstrating how solving PDF parsing could transform document analysis workflows.

The researcher Pierre-Carl Langlais has jokingly placed 'PDF parsing is solved!' shortly before AGI in his AI development timeline, highlighting how this seemingly mundane problem represents a significant barrier to AI's real-world utility. As Edwin Chen of data company Surge notes, PDF parsing remains one of AI's 'unsexy failures' that limits practical applications despite rapid progress in other areas.

Key Points
  • Google's Gemini and other state-of-the-art models fail at PDF parsing: they summarize instead of extracting, confuse footnotes with body text, and hallucinate content
  • The PDF format wasn't designed for machines; it stores character codes and coordinates instead of logical text structure
  • Specialized companies like Reducto are tackling the problem, and Reducto's extraction from the 3M+ Epstein documents enabled searchable applications such as Jmail and Jflights

Why It Matters

Until AI can parse PDFs reliably, millions of real-world documents remain hard to analyze at scale, which limits legal, research, and investigative applications.