Structural Hallucination in Large Language Models: A Network-Based Evaluation of Knowledge Organization and Citation Integrity
Network analysis shows that LLMs fabricate over 94% of knowledge connections while remaining locally fluent.
A new arXiv paper by Moses Boudourides introduces the critical concept of 'structural hallucination' in Large Language Models (LLMs), revealing systematic distortions in knowledge organization that remain invisible to traditional sentence-level evaluation metrics. While LLMs like GPT-4 and Claude increasingly mediate scholarly information, this research demonstrates they can produce locally fluent statements while completely misrepresenting the underlying conceptual architecture, relational networks, and bibliographic grounding of knowledge domains. The study moves beyond individual fact-checking to analyze how LLMs reconstruct entire knowledge structures, exposing a fundamental limitation in current AI evaluation paradigms.
The research developed a network-based stress test using knowledge graph extraction and similarity analysis across three structured domains: Roget's Thesaurus as a lexical ontology, Wikidata philosophers as a biographical graph, and bibliographic records from this http URL. The results are staggering: macro-averaged F1 scores below 0.05 in lexical reconstruction, hallucination rates exceeding 93% in biographical knowledge, citation omission reaching 91.9%, and node-set Jaccard similarity of just 0.028 with fabrication rates above 94%. These findings demonstrate that structural fidelity cannot be inferred from local fluency alone, necessitating new evaluation instruments for researchers and developers working on retrieval-augmented generation (RAG) systems, academic assistants, and any application that requires coherent knowledge representation rather than merely plausible text generation.
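To make the lexical result concrete: macro-averaged F1 here scores how well a model reconstructs the related-word set for each thesaurus entry, then averages those per-entry scores. The paper's own code is not reproduced in this summary; the snippet below is only a minimal Python sketch of that metric, with hypothetical entries and word sets standing in for Roget's Thesaurus data.

```python
# Minimal sketch (assumed, not the paper's implementation): macro-averaged F1
# for reconstructing the related-word set of each thesaurus entry.
# All entries and word sets below are hypothetical placeholders.

def f1(gold: set, pred: set) -> float:
    """Set-overlap F1 for one entry (0.0 when there is no overlap)."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold_by_entry: dict, pred_by_entry: dict) -> float:
    """Average per-entry F1 over all gold entries (missing entries count as 0)."""
    scores = [f1(gold, pred_by_entry.get(entry, set()))
              for entry, gold in gold_by_entry.items()]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    gold = {
        "happiness": {"joy", "delight", "gladness"},
        "fear": {"dread", "terror", "alarm"},
    }
    llm = {
        "happiness": {"joy", "contentment"},  # partial overlap with gold
        "fear": {"anxiety", "panic"},         # no overlap with gold
    }
    print(f"macro-averaged F1: {macro_f1(gold, llm):.3f}")
```

Because macro-averaging gives every entry equal weight, a model that recovers a few common associations but misses most entries still scores near zero, which is how a reported figure can fall below 0.05.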
- LLMs show 93%+ hallucination rates in reconstructing biographical knowledge graphs from Wikidata
- Citation omission reaches 91.9% when LLMs summarize scholarly bibliographic records
- Node-set Jaccard similarity falls to 0.028, with fabrication above 94%, in lexical ontology reconstruction (these graph-level metrics are sketched below)
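The graph-level numbers in these findings come from set comparisons between a reference knowledge graph and the graph an LLM reproduces. As a rough illustration, assuming plain Python sets rather than the paper's actual pipeline, node-set Jaccard similarity, fabrication rate, and citation-omission rate can be computed as follows; the entities and edges are hypothetical placeholders.

```python
# Minimal sketch (not the paper's code): compare a gold knowledge graph
# against an LLM-reconstructed one using plain Python sets.

def node_jaccard(gold_nodes: set, llm_nodes: set) -> float:
    """Node-set Jaccard similarity: |intersection| / |union|."""
    union = gold_nodes | llm_nodes
    return len(gold_nodes & llm_nodes) / len(union) if union else 1.0

def fabrication_rate(gold_nodes: set, llm_nodes: set) -> float:
    """Share of LLM-produced nodes that do not exist in the gold graph."""
    return len(llm_nodes - gold_nodes) / len(llm_nodes) if llm_nodes else 0.0

def omission_rate(gold_edges: set, llm_edges: set) -> float:
    """Share of gold edges (e.g., citation links) missing from the LLM output."""
    return len(gold_edges - llm_edges) / len(gold_edges) if gold_edges else 0.0

if __name__ == "__main__":
    # Toy gold graph: philosophers and influence edges (illustrative only).
    gold_nodes = {"Kant", "Hegel", "Fichte", "Schelling"}
    gold_edges = {("Kant", "Fichte"), ("Fichte", "Schelling"), ("Schelling", "Hegel")}

    # Toy LLM reconstruction: one real node kept, several fabricated ones.
    llm_nodes = {"Kant", "Descartes", "Leibniz", "Spinoza"}
    llm_edges = {("Kant", "Descartes")}

    print(f"node Jaccard : {node_jaccard(gold_nodes, llm_nodes):.3f}")
    print(f"fabrication  : {fabrication_rate(gold_nodes, llm_nodes):.3f}")
    print(f"edge omission: {omission_rate(gold_edges, llm_edges):.3f}")
```

On this toy data, one shared node out of seven in the union gives a Jaccard of about 0.14 and a fabrication rate of 0.75; the reported 0.028 and 94%+ figures reflect far sparser overlap at the scale of real knowledge graphs.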
Why It Matters
The work helps explain why RAG systems and academic AI tools can fail even when their individual statements are accurate: they distort the underlying knowledge architecture.