Research & Papers

DistilBERT+HRR model detects depression with 94% F1 score from Reddit posts

Combining cognitive linguistic features with transformers achieves 0.94 F1 vs baseline 0.80

Deep Dive

A new research paper from Brian Van Steen, published on arXiv, demonstrates that combining cognitively grounded linguistic features with transformer-based embeddings significantly improves automated depression detection in online text. Using a subset of the Kaggle Reddit Suicide and Depression Detection dataset, the study extracts linguistic markers of cognitive distortions based on Beck's Cognitive Theory of Depression, including first-person pronoun density, absolutist words, and negative emotion markers. These features are encoded as Holographic Reduced Representation (HRR) vectors, concatenated with DistilBERT sentence embeddings, and classified with Logistic Regression.
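
The feature pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the word lists, the 256-dimensional HRR size, the role/filler encoding scheme, and the helper names (cognitive_features, make_codebook, hrr_encode, hybrid_vector) are all assumptions made for the example. Only the use of circular convolution as the HRR binding operator and the 768-dimensional DistilBERT hidden size are standard. A zero vector stands in for the real sentence embedding so the sketch runs without downloading a model.

```python
import re
import numpy as np

# Illustrative word lists (assumption): the paper's actual lexicons are not given here.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
ABSOLUTIST = {"always", "never", "nothing", "everything", "completely", "totally"}
NEGATIVE = {"sad", "hopeless", "worthless", "alone", "tired", "hate"}


def cognitive_features(text):
    """Return the per-token density of each feature family in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "first_person": sum(t in FIRST_PERSON for t in tokens) / n,
        "absolutist": sum(t in ABSOLUTIST for t in tokens) / n,
        "negative": sum(t in NEGATIVE for t in tokens) / n,
    }


def circular_convolution(a, b):
    """HRR binding operator, computed in the frequency domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))


def make_codebook(names, dim, seed=0):
    """Draw one random (role, filler) vector pair per feature name."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)  # variance 1/dim, the usual HRR choice
    return {
        name: (rng.normal(0.0, scale, dim), rng.normal(0.0, scale, dim))
        for name in names
    }


def hrr_encode(features, codebook):
    """Bind each role to its value-scaled filler and superpose the results.

    Scaling the filler by the feature value is an assumption of this sketch.
    """
    dim = len(next(iter(codebook.values()))[0])
    trace = np.zeros(dim)
    for name, value in features.items():
        role, filler = codebook[name]
        trace += circular_convolution(role, value * filler)
    return trace


def hybrid_vector(sentence_embedding, features, codebook):
    """Concatenate a sentence embedding with the HRR feature trace."""
    return np.concatenate([sentence_embedding, hrr_encode(features, codebook)])


feats = cognitive_features("I always feel like nothing I do matters")
codebook = make_codebook(feats.keys(), dim=256)
embedding = np.zeros(768)  # stand-in for a real DistilBERT sentence embedding
hybrid = hybrid_vector(embedding, feats, codebook)
print(feats["first_person"], hybrid.shape)
```

In a full pipeline, the stand-in embedding would be replaced by pooled DistilBERT output, and the resulting hybrid vectors would be passed to a Logistic Regression classifier.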

Results show the hybrid DistilBERT+HRR model achieves a macro F1 score of 0.94 versus 0.80 for the TF-IDF baseline. Under 5-fold cross-validation, F1 rises from 0.83 to 0.92 and AUC from 0.958 to 0.981. This approach bridges cognitive psychology and modern NLP, offering a more interpretable and accurate method for identifying mental health indicators in social media, with potential applications in early screening and clinical support tools.
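
For readers unfamiliar with the metrics, the sketch below shows how a macro F1 score and a 5-fold split are computed in principle. The labels are toy data invented for the example, not results from the paper, and the function names (macro_f1, k_fold_indices) are hypothetical. The fold helper uses plain contiguous folds; the summary does not say whether the study shuffled or stratified its folds.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy labels (1 = depression-related post, 0 = other); not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
folds = list(k_fold_indices(12, k=5))
print(round(score, 4), [len(test) for _, test in folds])
```

Because macro F1 averages the classes equally, a model cannot score well by favoring the majority class, which is one reason the metric is common for screening tasks with imbalanced labels.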

Key Points
  • Hybrid DistilBERT+HRR model achieves macro F1 of 0.94, beating TF-IDF baseline's 0.80
  • Uses Beck's Cognitive Theory to extract linguistic features: first-person pronouns, absolutist words, negative emotion
  • 5-fold cross-validation F1 improved from 0.83 to 0.92; AUC rose from 0.958 to 0.981

Why It Matters

The approach could make AI screening for depression from social media posts both more accurate and more interpretable.