Corpus ID: 49208337

SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions

Authors: Arman Cohan, Bart Desmet, Andrew Yates, Luca Soldaini, Sean MacAvaney, Nazli Goharian
Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without… 

Twitter-STMHD: An Extensive User-Level Database of Multiple Mental Health Disorders
This work builds the Twitter Self-Reported Temporally-Contextual Mental Health Diagnosis Dataset (Twitter-STMHD), a large-scale, user-level dataset grouped into 8 disorder categories and a companion class of control users.
Explaining Models of Mental Health via Clinically Grounded Auxiliary Tasks
Models of mental health based on natural language processing can uncover latent signals of mental health from language. Models that indicate whether an individual is depressed, or has other mental
Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study
It is found that transfer learning is an effective strategy for predicting risk with relatively little labeled data and noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
A Survey of Computational Methods for Online Mental State Assessment on Social Media
A comprehensive analysis of the proposed approaches for online mental state assessment on social media, a structured categorisation of the methods according to their design principles, lessons learnt over the years and a discussion on possible avenues for future research are presented.
Machine learning of language use on Twitter reveals weak and non-specific predictions
Depressed individuals use language differently than healthy controls and it has been proposed that social media posts can be used to identify depression. Much of the evidence behind this claim relies
On the State of Social Media Data for Mental Health Research
This paper introduces an open-source directory of mental health datasets, annotated using a standardized schema to facilitate meta-analysis, and offers an analysis specifically on the state of social media data that exists for conducting mental health research.
CAMS: An Annotated Corpus for Causal Analysis of Mental Health Issues in Social Media Posts
An annotation schema for causal analysis of mental health issues in social media posts (CAMS) is introduced, and a classic Logistic Regression model outperforms the next best (CNN-LSTM) model by 4.9% accuracy.
Adapting Deep Learning Methods for Mental Health Prediction on Social Media
In a binary classification task on predicting if a user suffers from one of nine different disorders, a hierarchical attention network outperforms previously set benchmarks for four of the disorders.
Then and Now: Quantifying the Longitudinal Validity of Self-Disclosed Depression Diagnoses
This work analyzes recent activity from individuals who disclosed a depression diagnosis on social media over five years ago and acquires a new understanding of how presentations of mental health status on social media manifest longitudinally.
Automatic Detection and Classification of Mental Illnesses from General Social Media Texts
The accuracy obtained by the eating disorder classifier is the highest due to the prominent presence of discussions related to calories, diets, recipes, etc., whereas depression had the lowest F1 score, probably because depression is more difficult to identify in linguistic acts.


From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses
A broad range of mental health conditions in Twitter data is examined by identifying self-reported statements of diagnosis, and language differences between ten conditions, with respect to the general population and to each other, are systematically explored.
RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
This work introduces RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis, which is valuable for various computational methods to examine mental health through social media.
Multitask Learning for Mental Health Conditions with Limited Social Media Data
The framework proposed significantly improves over all baselines and single-task models for predicting mental health conditions, with particularly significant gains for conditions with limited data, and establishes for the first time the potential of deep learning in the prediction of mental health from online user-generated text.
Quantifying Mental Health Signals in Twitter
A novel method for gathering data for a range of mental illnesses quickly and cheaply is presented, followed by an analysis focused on four in particular: post-traumatic stress disorder, depression, bipolar disorder, and seasonal affective disorder.
Triaging content severity in online mental health forums
An approach for triaging user content into four severity categories that are defined based on an indication of self‐harm ideation is proposed and it is shown that overall, long‐term users of the forum demonstrate decreased severity of risk over time.
Scalable mental health analysis in the clinical whitespace via natural language processing
This community is introduced to some of the recent advances in using natural language processing and machine learning to provide insight into the mental health of both individuals and populations.
Depression and Self-Harm Risk Assessment in Online Forums
This work introduces a large-scale general forum dataset consisting of users with self-reported depression diagnoses matched with control users, and proposes methods for identifying posts in support communities that may indicate a risk of self-harm, and demonstrates that this approach outperforms strong previously proposed methods.
Feature Studies to Inform the Classification of Depressive Symptoms from Twitter Data for Population Health
It is concluded that simple lexical features and reduced feature sets can produce comparable results to larger feature sets, suggesting there is no consistent count of features for predicting depressive-related tweets.
Predicting Depression via Social Media
It is found that social media contains useful signals for characterizing the onset of depression in individuals, as measured through decrease in social activity, raised negative affect, highly clustered egonetworks, heightened relational and medicinal concerns, and greater expression of religious involvement.
The role of personality, age, and gender in tweeting about mental illness
Language-derived personality and demographic estimates show surprisingly strong performance in distinguishing users that tweet a diagnosis of depression or PTSD from random controls, reaching an area under the receiver operating characteristic curve (AUC) of around 0.8 in all the authors' binary classification tasks.