Detecting Suicidal Ideation on Forums: Proof-of-Concept Study

Background: In 2016, 44,965 people in the United States died by suicide. People with suicidal ideation commonly seek help or leave suicide notes on social media before attempting suicide, and many prefer to express their feelings in longer passages on forums such as Reddit and on blogs. Because these expressive posts follow regular language patterns, potential suicide attempts may be prevented by detecting suicidal posts as they are written.

Objective: This study aims to build a classifier that differentiates suicidal from nonsuicidal forum posts by applying text mining methods to post titles and bodies.

Methods: A total of 508,398 Reddit posts longer than 100 characters, posted between 2008 and 2016 on the SuicideWatch, Depression, Anxiety, and ShowerThoughts subreddits, were downloaded from the publicly available Reddit dataset. Of these, 10,785 posts were randomly selected and 785 were manually annotated as suicidal or nonsuicidal. Features were extracted from post titles and bodies using term frequency-inverse document frequency, linguistic inquiry and word count (LIWC), and sentiment analysis. Logistic regression, random forest, and support vector machine (SVM) classification algorithms were applied to the resulting corpus, and prediction performance was evaluated.

Results: Depending on the data composition, the logistic regression and SVM classifiers correctly identified the suicidality of posts with 80% to 92% accuracy and F1 score, closely followed by random forest, compared with the baseline ZeroR algorithm's 50% accuracy and 66% F1 score.

Conclusions: This study demonstrated that people with suicidal ideation can be detected on online forums with high accuracy. The logistic regression classifier built in this study could potentially be embedded in blogs and forums to decide, in real time, whether to offer online counseling while a suicidal post is being written.


Introduction

Text Mining Methods for Suicidal Ideation Detection
Term Frequency (Bag of Words) Approach
In the bag of words representation, rows correspond to documents, columns correspond to terms (usually words in stemmed form), and cells contain the term frequency of that word in that document. Term frequency has several variations; in the simplest definition, tf(t, d) is the number of occurrences of term t in document d, normalized by the total number of terms in d. A high tf(t, d) indicates that the document is related to that term; if the document contains no other term, the ratio reaches its maximum possible value of 1. For instance, in the hypothetical text represented in Figure 1, a term occurs 3 times in document d1, which contains a total of 130 terms, resulting in a term frequency of 3/130 = 0.023. Using the bag of words approach, terms used frequently in suicidal or nonsuicidal posts can be found and used to discriminate between these two classes. However, some terms, such as "the", are used very frequently in all kinds of documents, adding noise to the statistics and reducing the importance of other terms. To avoid this, inverse document frequency (idf), which reduces the weight of commonly used terms, is typically incorporated in addition to term frequency.
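The normalized term frequency described above can be sketched in a few lines; the term and document contents below are hypothetical, mirroring the 3/130 example:

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Normalized term frequency: occurrences of `term` divided by
    the total number of terms in the document."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical 130-term document in which "alone" occurs 3 times,
# as in the worked example from the text.
doc = ["alone"] * 3 + ["filler"] * 127
tf_alone = term_frequency("alone", doc)  # 3 / 130 ≈ 0.023
```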

Tf-Idf Approach
Inverse document frequency is defined as idf(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing term t. When multiplied with term frequency, it down-weights words that are commonly used across many posts and hence carry little discriminative importance (such as commonly used stop words): tf-idf(t, d) = tf(t, d) × idf(t).
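A minimal pure-Python sketch of this weighting follows; the tiny three-document corpus is invented purely for illustration. The stop word "the" occurs in every document, so its idf, and therefore its tf-idf, collapses to zero:

```python
import math

def tf(term, doc):
    """Normalized term frequency within a single tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """idf(t) = log(N / df(t)): the fewer documents contain the term,
    the larger its weight."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Illustrative toy corpus: "the" appears in all three documents,
# so idf("the") = log(3/3) = 0 and its tf-idf vanishes.
corpus = [
    ["the", "end", "of", "pain"],
    ["the", "happy", "day"],
    ["the", "pain", "hurts"],
]
w_the = tfidf("the", corpus[0], corpus)    # 0.0
w_pain = tfidf("pain", corpus[0], corpus)  # positive weight
```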

LIWC Approach
One of the promising tools, LIWC is a commonly used text analysis program that counts words in psychologically meaningful categories and yields a score for each category: "Word count, 4 summary language variables (analytical thinking, clout, authenticity, and emotional tone), 3 general descriptor categories (words per sentence, percent of target words captured by the dictionary, and percent of words in the text that are longer than six letters), 21 standard linguistic dimensions (e.g., percentage of words in the text that are pronouns, articles, auxiliary verbs, etc.), 41 word categories tapping psychological constructs (e.g., affect, cognition, biological processes, drives), 6 personal concern categories (e.g., work, home, leisure activities), 5 informal language markers (assents, fillers, swear words, netspeak), and 12 punctuation categories (periods, commas, etc)." With these statistics, it is possible to extract the psychological mood or the main concerns of a post's author, which helps in predicting suicidality. One pitfall is that LIWC may not be ideal for real-time prediction because it is not automated: a human must run a desktop application to calculate the scores, which makes the tool difficult to run on different inputs in an automated fashion.
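Because the LIWC lexicon is proprietary, the sketch below only illustrates the style of scoring: counting, per category, the percentage of a text's words that fall in that category's dictionary. The category names and word lists here are hypothetical stand-ins, far smaller than the real lexicon:

```python
# Hypothetical category dictionaries; the real LIWC dictionary is
# proprietary and contains thousands of entries per category.
CATEGORIES = {
    "negemo": {"sad", "hurt", "hopeless", "cry"},
    "social": {"friend", "family", "talk"},
}

def liwc_style_scores(text):
    """Return, per category, the percentage of words in `text`
    that appear in that category's dictionary."""
    words = text.lower().split()
    total = len(words)
    return {
        cat: 100 * sum(w in vocab for w in words) / total
        for cat, vocab in CATEGORIES.items()
    }

scores = liwc_style_scores("i feel hopeless and sad no friend to talk to")
# 2 of 10 words are negemo and 2 of 10 are social → 20.0 each
```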

Sentiment Analysis Approach
Similarly, sentiment analysis is the process of determining the emotional tone behind a series of words using natural language processing (NLP) techniques. This process produces polarity and subjectivity scores for each document, resulting in a 2-column matrix.
The polarity score is a real number between -1 (negative) and 1 (positive), representing the degree to which the document is positive. The subjectivity score, on the other hand, is a real number between 0 (completely factual) and 1 (completely subjective). Selected feature columns and values resulting from these processes are combined and used for prediction modeling. Sentiment features allow detecting the negative mood that is significantly more common among people with suicidal ideation.
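A toy lexicon-based sketch of how polarity and subjectivity scores in these ranges could be produced follows; the lexicon and the averaging scheme are illustrative assumptions, not the actual analyzer used in the study, which would rely on a far larger lexicon and NLP heuristics:

```python
# Toy lexicon mapping words to polarity in [-1, 1]; any word with a
# lexicon entry is treated as subjective. Purely illustrative.
LEXICON = {"happy": 0.8, "good": 0.7, "sad": -0.6, "hopeless": -0.9}

def sentiment(text):
    """Return (polarity, subjectivity) for a document.

    Polarity: mean lexicon score of scored words, in [-1, 1].
    Subjectivity: fraction of words that carry a lexicon score, in [0, 1].
    """
    words = text.lower().split()
    scored = [LEXICON[w] for w in words if w in LEXICON]
    polarity = sum(scored) / len(scored) if scored else 0.0
    subjectivity = len(scored) / len(words) if words else 0.0
    return polarity, subjectivity

polarity, subjectivity = sentiment("i feel sad and hopeless")
```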

Data Collection
Reddit.com, founded in 2005, is one of the most popular forum sites as of 2018.

Results
In addition to accuracy, recall, and precision, the false positive and false negative prediction performances are reported below. The false positive rate is the proportion of posts that were actually nonsuicidal but that the algorithm classified as suicidal. The false negative rate is the proportion of posts that were actually suicidal but that the algorithm missed (classified as nonsuicidal). In the scope of suicide prevention, keeping false negatives low is more important than keeping false positives low; however, a 100% false positive rate would mean allocating resources to every post. It can be seen that LR and SVM perform best (except in Experiment 2, where SVM falls behind LR and RF in terms of FNR).
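The two rates defined above can be computed directly from true and predicted labels; the label vectors below are invented for illustration (1 = suicidal, 0 = nonsuicidal):

```python
def fp_fn_rates(y_true, y_pred):
    """False positive rate = FP / actual negatives;
    false negative rate = FN / actual positives."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return fp / neg, fn / pos

# Illustrative labels: 4 suicidal (1) and 4 nonsuicidal (0) posts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
fpr, fnr = fp_fn_rates(y_true, y_pred)  # fpr = 0.5, fnr = 0.25
```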