A large corpus of lucid and non-lucid dream reports
A new dataset of 55,000 dream reports, including 10,000 labeled as lucid dreams, is now public for AI research.
Researcher Remington Mallett has published a dataset titled 'A large corpus of lucid and non-lucid dream reports' on arXiv. The corpus contains 55,000 individual dream reports contributed by 5,000 users over ten years, scraped from an online forum where people anonymously share dream journals. Crucially, users categorized their own entries, yielding 10,000 reports labeled as lucid dreams (characterized by awareness within the dream), 25,000 as non-lucid, and 2,000 as nightmares. These user-provided labels create a rare, structured resource for a phenomenon that occurs infrequently and is difficult to induce in the laboratory.
After curation, the dataset was validated by analyzing the language patterns within the lucid-labeled reports. The analysis confirmed these reports contain linguistic markers consistent with established characteristics of lucid dreams, lending credibility to the labels. While the entire 55k-report corpus holds broad value for dream science, the labeled subset is particularly powerful. It opens the door for machine learning and natural language processing (NLP) models to systematically analyze, compare, and potentially identify the signatures of lucid dreaming at an unprecedented scale, moving the field beyond small-sample studies.
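To make the validation idea concrete, here is a minimal sketch of one way to compare language patterns between lucid and non-lucid reports. It is not the author's method: the marker words, the `marker_rate` helper, and the two toy reports are all hypothetical and chosen only for illustration, and the summary above does not say which linguistic markers were actually analyzed.

```python
import re
from collections import Counter

# Hypothetical marker words (an assumption for illustration only);
# the source does not list the markers used in the actual validation.
MARKERS = {"realized", "aware", "control", "lucid"}


def marker_rate(reports, markers=MARKERS):
    """Return marker-word occurrences per 1,000 tokens across the reports."""
    # Lowercase each report and split it into simple word tokens.
    tokens = [t for r in reports for t in re.findall(r"[a-z']+", r.lower())]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[m] for m in markers)
    return 1000.0 * hits / len(tokens)


# Toy reports written for this example, not taken from the corpus.
lucid = ["I realized I was dreaming and took control of the flight."]
non_lucid = ["I was late for school and could not find my shoes."]

lucid_rate = marker_rate(lucid)
non_lucid_rate = marker_rate(non_lucid)
print(f"lucid: {lucid_rate:.1f} per 1,000 tokens")
print(f"non-lucid: {non_lucid_rate:.1f} per 1,000 tokens")
```

A higher marker rate in the lucid-labeled group than in the non-lucid group would be consistent with the labels, though a real analysis would need the full corpus, a principled marker list, and a statistical test.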
- Contains 55,000 total dream reports from 5,000 contributors, collected over ten years from public online journals.
- Includes 10,000 user-labeled lucid dream reports, a scarce and valuable category for research.
- Validation shows language patterns in lucid reports align with known phenomenology, lending credibility to the user-provided labels.
Why It Matters
The corpus provides the large-scale, labeled data needed to train AI models that analyze states of consciousness and advance dream science.