Supervised Learning for Suicidal Ideation Detection in Online User Content

Introduction
Suicide is one of the most serious social health problems in modern society. Many factors can lead to suicide: personal issues, such as hopelessness, severe anxiety, schizophrenia, alcoholism, or impulsivity; social factors, like social isolation and overexposure to death; or negative life events, including traumatic experiences, physical illness, affective disorders, and previous suicide attempts. Thousands of people around the world fall victim to suicide every year, making suicide prevention a critical global public health mission.
Suicidal ideation, or suicidal thoughts, refers to people's thoughts of committing suicide and can be regarded as a risk indicator of suicide. Suicidal thoughts range from fleeting thoughts to extensive rumination, detailed planning, role playing, and incomplete attempts. According to a WHO report [1], an estimated 788,000 people worldwide died by suicide in 2015, and a large number of people, especially teenagers, were reported to have suicidal ideation. Thus, one possible approach to preventing suicide effectively is the early detection of suicidal ideation.
With the widespread emergence of mobile Internet technologies and online social networks, there is a growing tendency for people to talk about their suicide intentions in online communities. This online content could be helpful for detecting individuals' intentions and their suicidal ideation. Some people, especially adolescents, choose to post their suicidal thoughts in social networks, ask about how to commit suicide in online communities, and enter into online suicide pacts. The anonymity of online communication also allows people to freely express the pressures and anxiety they suffer in the real world. This online user-generated content provides another possible angle for early suicide detection and prevention.
Previous research on suicide understanding and prevention mainly concentrates on its psychological and clinical aspects [2]. Recently, many studies have turned to natural language processing methods and to classifying questionnaire results via supervised learning, which learns a mapping function from labelled training data [3]. Some of these studies have used the "International Personal Examination Screening Questionnaire" and analysed suicide blogs and posts from social networking websites. However, these studies have their limitations: (1) from both a psychological and a clinical perspective, collecting data and/or recruiting patients is typically expensive, whereas online data may help in understanding thoughts and behaviours; and (2) simple feature sets and classification models are not predictive enough to detect suicidal tendencies.
In this paper, we investigate the problem of suicidal ideation detection on online social websites, with a focus on understanding and detecting suicidal thoughts in online user content. We perform a thorough analysis of the content, the language preferences, and the topic descriptions to understand suicidal thoughts from a data mining perspective. Six different sets of informative features were extracted, and six supervised learning algorithms were compared to detect suicidal ideation within the data. This is a novel application of automatic suicidal intention detection to social content, combining our proposed feature engineering with effective classification models. This paper makes notable contributions to the literature in the following respects: (1) Knowledge discovery: this is a novel application of knowledge discovery and data mining to detect suicidal ideation in online user content. Previous work in this field has been conducted by psychological experts using statistical analysis; our approach reveals knowledge on suicidal ideation from a data analytic perspective. Insights from our analysis reveal that suicidal individuals often use personal pronouns to show their ego. They are more likely to use words expressing negativity, anxiety, and sadness in their dialogue. They are also more likely to choose the present tense to describe their suffering and the future tense to describe their hopelessness and plans for suicide.
(2) Dataset and platform: this paper introduces the Reddit platform and collects a new dataset for suicidal ideation detection. Reddit's SuicideWatch BBS is a new online channel for people with suicidal ideation to express their anxiety and pressures. Social volunteers respond in positive, supportive ways to relieve the depression and hopefully prevent potential suicides. This data source is not only useful for suicide detection but also for studying how to effectively prevent suicide through effective online communication.
(3) Features, models, and benchmarking: rather than using basic models with simple features for suicidal ideation detection, this approach (1) identifies informative features from a number of perspectives, including statistical, syntactic, linguistic, word embedding, and topic features; (2) compares different classifiers from both traditional and deep learning perspectives, namely support vector machine (SVM) [4], Random Forest [5], gradient boosting classification tree (GBDT) [6], XGBoost [7], multilayer feed-forward neural network (MLFFNN) [8], and long short-term memory (LSTM) [9]; and (3) provides benchmarks for suicidal ideation detection on SuicideWatch on Reddit, an active online forum for communication about suicide.
This paper is organised as follows. In Section 2, we review the related works on suicide analysis and detection. We introduce the datasets in Section 3 along with data exploration and knowledge discovery. Section 4 describes the classification and feature extraction methods. Section 5 is the experimental study. We conclude this paper in Section 6.

Related Works
Suicide detection has drawn the attention of many researchers due to an increasing suicide rate in recent years. The reasons for suicide are complicated and attributed to a complex interaction of many factors [10]. The research techniques used to examine suicide also span many fields and methods. For example, clinical methods may examine resting-state heart rate [11] and event-related instigators [12]. Classical methods also include using questionnaires to assess the potential risk of suicide and applying clinician-patient interactions [13].
The goal of text-based suicide classification is to determine whether candidates, through their posts, have suicidal ideation. Such techniques include suicide-related keyword filtering [14,15] and phrase filtering [16].
Machine learning methods, especially supervised learning and natural language processing methods, have also been applied in this field. The main features consist of N-gram features, knowledge-based features, syntactic features, context features, and class-specific features [17]. In addition, word embedding [18] and sentence embedding [19] have been widely applied. Models for cybersuicide detection include regression analysis [20], ANN [21], and CRF [22]. Okhapkina et al. built a dictionary of terms pertaining to suicidal content and introduced term frequency-inverse document frequency (TF-IDF) matrices for messages and a singular value decomposition of those matrices [23]. Mulholland and Quinn extracted vocabulary and syntactic features to build a classifier for suicidal and nonsuicidal lyricists [24]. Huang et al. built a psychological lexicon dictionary and used an SVM classifier to detect cybersuicide [25]. Chattopadhyay [8] proposed a mathematical model using Beck's suicide intent scale and applied a multilayer feed-forward neural network to classify suicide intent. Pestian et al. [26] and Delgado-Gomez et al. [27] compared the performance of different multivariate techniques.

Complexity
The relevant extant research can also be viewed according to the data source.

Suicide Notes.
Suicide notes provide material for natural language processing. Previous approaches have examined suicide notes using content analysis [26], sentiment analysis [17,29], and emotion detection [22]. In the age of cyberspace, suicide notes are now also written in the form of web blogs and can be identified as carrying the potential risk of suicide [14].

Online User Content.
Cash et al. [30], Shepherd et al. [31], and Jashinsky et al. [16] have conducted psychology-based data analysis for content that suggests suicidal tendencies in the MySpace and Twitter social networks. Ren et al. explored accumulated emotional information from online suicide blogs [32]. O'Dea et al. developed automatic suicide detection on Twitter by applying logistic regression and SVM on TF-IDF features [33]. Reddit has also attracted much research interest. Huang and Bashir applied linguistic cues to analyse the reply bias [34]. De Choudhury et al. did many works on suicide-related topics in Reddit including the effect of celebrity suicides on suicide-related content [35] and the transition from mental health illness to suicidal ideation [36].
A questionnaire is a useful tool for collecting data, but it is costly to administer. Suicide notes are useful materials for training a classifier, but the current datasets of suicide notes are quite small. Automatic detection on online user content is therefore a promising way to detect and prevent suicide. Our proposed method investigates a better solution, with effective feature engineering on a larger social dataset than in previous work, and, unlike questionnaires, it can be adapted to real-world applications thanks to its ability to detect suicidality automatically.

Data and Knowledge
We collected suicidal ideation texts from Reddit and Twitter and manually checked all the posts to ensure they were correctly labelled. Our annotation rules and examples of posts appear in Table 1.

Reddit Dataset.
Reddit is a registered online community that aggregates social news and online discussions. It consists of many topic categories, and each area of interest within a topic is called a subreddit.
In this dataset, online user content includes a title and a body of text. To preserve privacy, we replace personal information with a unique ID that identifies each user. We collected posts with potential suicide intentions from a subreddit called "SuicideWatch" (SW) (https://www.reddit.com/r/SuicideWatch/). Posts without suicidal content were sourced from other popular subreddits (https://www.reddit.com/r/all/, https://www.reddit.com/r/popular/). The collection of nonsuicidal data consists entirely of user-generated content; posts from news aggregators and administrators were excluded. To facilitate the study and demonstration, we study a balanced dataset for Reddit and an imbalanced dataset for Twitter in the following subsections.
The Reddit dataset includes 3549 suicidal ideation samples and a number of nonsuicidal texts. In particular, we construct two datasets for Reddit, as shown in Table 2. The first dataset includes two subreddits: one is SuicideWatch, and the other comprises popular posts on Reddit. The second dataset is composed of six subreddits: SuicideWatch and five other hot topics, namely gaming (https://www.reddit.com/r/gaming/), jokes (https://www.reddit.com/r/Jokes/), books (https://www.reddit.com/r/books/), movies (https://www.reddit.com/r/movies/), and AskReddit (https://www.reddit.com/r/AskReddit/). In the second dataset, the combination of SuicideWatch with any other subreddit forms a new balanced subdataset, for example, suicide versus gaming and suicide versus jokes. These two datasets are studied in Sections 5.1 and 5.2, respectively.

Twitter Dataset.
Many online users also talk about suicidal ideation on social networks. However, Twitter is quite different from Reddit: (1) each tweet is limited to 140 characters (this limit is now 280 characters); (2) Twitter users may have social network friends from the real world, while Reddit users are fully anonymous; and (3) the type of communication and interaction differs substantially between social networking websites and online forums.
The Twitter dataset was collected using a keyword filtering technique. Suicidal words and phrases include "suicide," "die," and "end my life." Many of the collected tweets contain suicide-related words but may instead refer to, for example, a suicide movie or an advertisement, without expressing suicidal ideation. Therefore, we manually checked and labelled the collected tweets according to the annotation rules in Table 1. The final Twitter dataset contains 10,288 tweets, of which 594 (around 6%) express suicidal ideation. This imbalanced dataset is studied in Section 5.3.

Data Exploration and Knowledge Discovery.
To understand suicidal individuals, we analysed the words, languages, and topics in online user content.
3.3.1. Word Cloud. Word clouds were used to provide a visual understanding of the data. The users' posts on Reddit and tweets on Twitter with potential suicide risk are shown separately in Figures 1(a) and 1(b). As we can see, suicidal posts frequently use words such as "life," "suicide," and "kill," providing a direct indication of the users' suicidal thoughts. Words expressing feelings or intentions are also frequently used, such as "feel," "want," and "know." For example, some suicidal posts read, "I feel like I have no one left and I want to end it," "I want to end my life," and "I don't know how much of it was psychological trauma." In addition, the dominant words on these two social platforms have different styles due to the posting rules of the platforms. Reddit users tend to compose their posts at length, for instance describing their life events and stories about their friends, while the content on Twitter is much more direct, with expressions like "want kill," "going kill," and "wanna kill"; details are usually not included in tweets.
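Word clouds of this kind are built from term frequencies. A minimal sketch of the underlying frequency count, using three illustrative posts and a hypothetical stop-word list (the real corpus and stop-word list differ):

```python
from collections import Counter
import re

# Hypothetical sample posts; the real input is the collected Reddit/Twitter corpus.
posts = [
    "I feel like I have no one left and I want to end it",
    "I want to end my life",
    "I don't know how much of it was psychological trauma",
]

# Illustrative stop-word list; real pipelines use a fuller one.
STOPWORDS = {"i", "a", "an", "the", "of", "it", "and", "to", "was", "how", "my"}

def term_frequencies(texts):
    """Tokenise, drop stop words, and count term frequencies."""
    tokens = []
    for text in texts:
        tokens += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(tokens)

freq = term_frequencies(posts)
print(freq.most_common(3))
```

A word cloud library would then scale each word's display size by its count in `freq`.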

Language Preferences.
Language preferences provide an overview of the statistical linguistic information in the data. The variables listed in Table 3 were extracted using LIWC 2015 [37]. All these categories are features based on word counts. We calculated the average value of each variable for both suicide-related texts and suicide-free posts. As shown in the table, content with and without suicidality differs considerably on many items.
(i) Users with suicidal ideation use many personal pronouns to show their ego. For example, "I want to end my life."
(ii) They express more negative emotions, like anxiety and sadness. For example, "I was drowning in guilt and depression for several years after."
(iii) As for tense, texts with suicidal ideation tend to use the present and future tenses. They use the present tense to describe their suffering, pain, and depression. For example, "I'm feeling so bad." The future tense is used to describe their hopeless feelings about the future and their suicide intentions. For example, "I'm eventually going to kill myself."
(iv) Both types of posts discuss family and friends and make female or male references.
(v) Unsurprisingly, more words related to death appear in texts about suicide. For example, "kill," "die," "end life," and "suicide."
(vi) Both types of posts contain a similar number of swear words.
One of the findings from Table 3 and Figure 1 is that people with suicidal thoughts tend to show their intentions directly in anonymous online communities when faced with some kind of problem in the real world. Their posts often express negative feelings with strong ego and intention.

Topic Description.
We extracted 10 topics from posts containing suicidal ideation using the latent Dirichlet allocation (LDA) [38] topic modelling method, as shown in Table 4. There is some Internet slang, such as "tx" (thanks), and abbreviations such as "im" (I am) and "n't" (a tokenised form of "not"). In standard natural language processing, personal words like "I," "me," and "you" are stop words and would normally be removed, but we kept them in this exploration because they carry important information. Many personal pronouns therefore appear among the topic words, which is consistent with the results in Table 3.

Methods and Technical Solutions
4.1. Feature Processing. After preprocessing and cleaning the data, we extracted several features, including statistical features, word-based features (e.g., suicidal words and pronouns), TF-IDF, and semantic and syntactic features. Additionally, we used distributed features, training neural networks to embed words into vector representations, along with topic features extracted by LDA [38] as unsupervised features.

Statistical Features.
User-generated posts are varied in length, and some statistical features can be extracted from texts. Some posts use short and simple sentences, while others use complex sentences and long paragraphs.
After segmentation and tokenisation, we captured statistical features of each post. We also counted part-of-speech (POS) tags: common POS tags include nouns, verbs, participles, articles, pronouns, adverbs, and conjunctions, and POS subgroups were also identified to provide more detail about the grammatical properties of the posts. Each post was parsed and tagged, and the number of tags in each category in the title and text body was counted.
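Such surface statistics can be sketched in a few lines. The feature set below (character, word, and sentence counts plus average word length) is illustrative and not necessarily the paper's exact list:

```python
import re

def statistical_features(text):
    """Simple surface statistics of a post; an illustrative subset of features."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

feats = statistical_features("I feel so bad. I want to end my life.")
print(feats)
```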

Linguistic Features: LIWC.
Online users' posts usually contain emotion, relativity, and harassment words. Lexicons are widely applied to extract these features. To analyse the linguistic and emotional features in the data, we used Linguistic Inquiry and Word Count [37] (LIWC 2015, http://liwc.wpengine.com/), proposed and developed by the University of Texas at Austin; this approach was also used in a previous study [34]. The tool contains a powerful built-in dictionary for matching target words in posts when parsing the data. About 90 variables were output. In addition to word count-based features, it extracts features based on emotional tone, cognitive processes, perceptual processes, and many types of abusive words. Specific categories include word count, summary language, general descriptors, linguistic dimensions, psychological constructs, personal concerns, informal language markers, and punctuation.
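LIWC itself is a proprietary dictionary, but the mechanism, per-category word counts normalised as percentages, can be sketched with a tiny illustrative lexicon standing in for LIWC's:

```python
import re

# Tiny illustrative lexicon; LIWC 2015 is a proprietary dictionary
# that outputs roughly 90 variables over many more categories.
LEXICON = {
    "i": ["i", "me", "my", "myself"],
    "negemo": ["sad", "hopeless", "hate", "depressed", "anxious"],
    "death": ["die", "kill", "suicide", "dead"],
}

def liwc_style_counts(text):
    """Per-category word counts normalised by total words, as LIWC-style percentages."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    return {cat: 100.0 * sum(w in vocab for w in words) / total
            for cat, vocab in LEXICON.items()}

scores = liwc_style_counts("I feel hopeless and I want to die")
print(scores)
```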

Word Frequency Features: TF-IDF.
Many kinds of expression are related to suicide. We used TF-IDF to extract these features and to measure the importance of various words in both suicidal and nonsuicidal posts. TF-IDF counts the number of times each word occurs in a document and adds a penalty depending on the frequency of the word in the entire corpus.
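A minimal TF-IDF extraction sketch with scikit-learn, on hypothetical posts (in the paper, a post's title and body are concatenated before vectorising):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical posts standing in for the concatenated title + body texts.
posts = [
    "i want to end my life",
    "i feel hopeless and alone",
    "new trailer for the movie looks great",
]

# Widened token pattern so single-character pronouns like "i" are kept.
vectoriser = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectoriser.fit_transform(posts)
print(tfidf.shape)  # (number of posts, vocabulary size)
```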

Word Embedding Features.
The distributed representation, which preserves the semantic information in texts, is popular and useful for many natural language processing tasks. It embeds words into a vector space. There are several techniques for word embedding; we employed word2vec ([18], https://code.google.com/archive/p/word2vec/) to derive a distributed semantic representation of the words.
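word2vec learns its vectors from (target, context) pairs drawn from a sliding window over each sentence. A minimal pure-Python sketch of generating such pairs (the tokenised sentence and window size are illustrative; the actual embeddings were trained with the word2vec toolkit):

```python
def training_pairs(tokens, window=2):
    """Generate Skip-gram-style (target, context) pairs from a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = training_pairs(["i", "want", "to", "end", "my", "life"], window=1)
print(pairs[:3])
```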
There are two architectures for word2vec word embedding: CBOW and Skip-gram. CBOW predicts the current word from its context, while Skip-gram predicts the surrounding context words given the current word.

Topic Features.
Suicidal posts and nonsuicidal posts talk about different topics, which provides a good way of understanding the two categories. We applied latent Dirichlet allocation (LDA) [38] to reveal latent topics in user posts. Each topic is a probability distribution over words, and each post is a mixture probability of topics.
Given the set of documents and the number of topics, we used LDA to extract the topics and then calculated the probability that each post belongs to each generated topic. Hence, each post is represented by its thematic properties as a probability vector whose length is the number of topics.
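The conversion of posts into topic-probability vectors can be sketched with scikit-learn (a hypothetical four-post corpus and three topics; the paper fits LDA on the full corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical posts; in practice, LDA is fitted on the full corpus first.
docs = [
    "i want to end my life",
    "i feel alone and hopeless",
    "the new movie was really great",
    "we loved the movie last night",
]

counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

# Each post becomes a probability vector over the 3 topics; each row sums to 1.
doc_topics = lda.transform(counts)
print(doc_topics.shape)
```

These rows are exactly the fixed-length topic feature vectors fed to the classifiers.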
(1) Feature Visualisation. To understand the informativeness of these feature sets, we visualise the features of the Reddit dataset in a 2-dimensional space using principal component analysis (PCA) [40] in Figure 2. The results demonstrate that the extracted features can largely separate the points of the different classes. We further validate the effectiveness of our feature sets in Section 5.
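The 2-D projection behind such a figure can be sketched as follows; the random feature matrix here is only a stand-in for the real extracted post features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix standing in for the extracted post features.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))

# Project the 20-dimensional features onto their first two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(features)
print(projected.shape)
```

The two columns of `projected` can then be scatter-plotted, coloured by class label, to produce a figure like Figure 2.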

Classification Models.
Suicidality detection in social content is a typical supervised classification problem. Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ consisting of texts $x_i$ with labels $y_i$, we train a supervised classification model to learn a function from the training pairs of input objects and supervisory signals, where $y_i = 1$ means that the text $x_i$ is a "suicide text" (ST) and $y_i = 0$ means it is a "non-suicide text" (non-ST). Training the classification model amounts to minimising the prediction error on the given training data. The prediction error is represented as a loss function $L(y, F(x))$, where $y$ is the true label and $F(x)$ is the label predicted by the classification model $F$. In summary, the goal of the training algorithm is to obtain an optimal prediction model $F^*$ by solving the optimisation task
$$F^* = \arg\min_{F} \sum_{i=1}^{n} L(y_i, F(x_i)).$$
Different classification methods have different definitions of the loss function and different predefined model structures. We employed both classical supervised learning methods and deep learning methods to solve the suicidal ideation classification task. The structure of our feature extraction method is shown in Figure 3. As mentioned in Section 4.1, the features comprise statistics, POS counts, LIWC features, TF-IDF vectors, and topic probability features. Among these, we applied the POS and LIWC features to both the title and the text body of user posts, and we combined the title and body into one piece of text to extract the topic probability vectors and TF-IDF vectors. All extracted features were input to the classifiers.

Comparison and Analysis on Suicide versus Nonsuicide.
This section compares various classification methods using different combinations of features with 10-fold cross validation (our code is available at https://github.com/shaoxiongji/sw-detection). The specific classification models include support vector machine (SVM) [4], Random Forest [5], gradient boosting classification tree (GBDT) [6], XGBoost [7], and multilayer feed-forward neural network (MLFFNN) [8]. SVM can solve problems that are not linearly separable in the original space by constructing a hyperplane in a high-dimensional space, and it can be adapted to many kinds of classification tasks [41,42]. Random Forest, GBDT, and XGBoost are tree ensemble methods that use decision trees as base classifiers and form a committee that achieves better performance than any single base classifier. MLFFNN takes the different features as input and learns a nonlinear combination of them.
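A 10-fold cross-validated comparison of this kind can be sketched with scikit-learn; the synthetic data below stands in for the extracted feature matrix and binary ST/non-ST labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the extracted feature matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    # 10-fold cross validation, scored by F1 as in Table 5.
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

XGBoost would be compared the same way via its scikit-learn wrapper `XGBClassifier`.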
For comparison, and to address the problem of understanding the semantic meaning and syntactic structure of sentences, deep learning provides powerful performance [43]. We used the long short-term memory (LSTM) [9] network, a state-of-the-art deep neural network. LSTM takes the title and text body of user posts, with word embedding, as its inputs and uses a memory cell to preserve state over long periods, capturing long-term dependencies in long text.
As shown in Table 5, on the whole every method's performance increases as more features are combined. This observation validates the effectiveness and informativeness of our extracted features. However, the contribution each feature makes varies, which leads to fluctuations in the results of individual methods. XGBoost had the best performance of the six methods when taking all groups of features as inputs. Although LSTM requires no feature processing and is renowned for its state-of-the-art performance in many other natural language processing tasks, it did not perform as well as some of the ensemble learning methods given sufficient features in this case. Random Forest, GBDT, XGBoost, and MLFFNN with proper features produced better accuracy and F1 scores than LSTM on our Reddit dataset. Admittedly, deep learning with word embedding is rather convenient and typically achieves adequate results even without complicated feature engineering.
The AUC performance measure for each classifier is the area under the receiver operating characteristic curve with all extracted features. In the last column of Table 5, the AUC tends to increase as more features are combined. The XGBoost method gains the highest AUC of 0.9569, while the other methods have very similar AUC values above 0.9.
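The AUC computation itself can be sketched with scikit-learn, using hypothetical labels and predicted scores (a label of 1 marks a suicidal post):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted scores for six posts.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

# Area under the ROC curve: the probability that a random positive
# is ranked above a random negative (here 8 of 9 pairs are correct).
auc = roc_auc_score(y_true, y_score)
print(round(auc, 4))  # prints 0.8889
```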

Suicide versus Single Subreddit Topics.
To evaluate classification of suicidal posts against other specific online communities, we extended our datasets and experiments to other specific subreddits, including "gaming," "jokes," "books," "movies," and "AskReddit." The results are shown in Figure 4. The features extracted with our approach were very effective for separating suicidal ideation posts from posts in another subreddit domain. In fact, the classification results for the suicidal dataset versus a single subreddit were better than for the suicidal versus nonsuicidal dataset, where the nonsuicidal samples are composed of multiple popular subreddit domains. In these experiments, XGBoost produced the best results on "movies" and "AskReddit" in terms of accuracy and F1 score; LSTM and Random Forest outperformed the other models on "gaming" and "books," respectively.

Experiments on Twitter Dataset.
To evaluate the performance of our processed features and the classification models, we ran another experiment on our Twitter dataset. Tweets differ from Reddit posts in that they lack a long text body, so the experimental setting differs slightly: we exclude the number of paragraphs from the statistical features, as well as the POS and LIWC features of the text body. The rest of the settings are the same as in our previous experiments. Considering the class imbalance in the Twitter data, we adopt undersampling; the results, averaged over the undersampled datasets, are shown in Table 6. The receiver operating characteristic curves of these methods are shown in Figure 5. On this dataset, Random Forest performs better than the other models on most metrics, except precision, where MLFFNN gains a slightly better result.
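Random undersampling of the majority class can be sketched as follows; the toy data (2 suicidal vs 8 nonsuicidal posts) is illustrative, and the paper averages metrics over repeated undersampled sets (here that would mean varying `seed`):

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = minority + rng.sample(majority, len(minority))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Hypothetical imbalanced toy data: 2 suicidal (1) vs 8 nonsuicidal (0) posts.
X = [f"post {i}" for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = undersample(X, y)
print(sum(y_bal), len(y_bal))  # equal class counts after undersampling
```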

Conclusion
The amount of online text keeps growing with the popularisation of social networking services, and suicide prevention remains an important task in modern society. It is therefore essential to develop new methods to detect online texts containing suicidal ideation, in the hope that suicide can be prevented. In this paper, we investigated the problem of suicidality detection in online user-generated content. We argue that most work in this field has been conducted by psychological experts using statistical analysis, which is limited by the cost and privacy issues involved in obtaining data. By collecting and analysing anonymous online data from the active Reddit and Twitter platforms, we provide rich knowledge that can complement the understanding of suicidal ideation and behaviour. Through applying feature processing and classification methods to our carefully built Reddit and Twitter datasets, we evaluated, analysed, and demonstrated that our framework achieves high accuracy in distinguishing suicidal thoughts from normal posts in online user content.
Exploiting more effective feature sets, more complex models, or other factors such as temporal information may further improve the detection of suicidal ideation; these are our future directions. The contribution and impact of this paper are threefold: (1) delivering rich knowledge for understanding suicidal ideation, (2) introducing datasets for the research community to study this significant problem, and (3) proposing informative features and effective models for suicidal ideation detection.

Data Availability
The data used to support the findings of this study are available from the first author upon request (email: shaoxiong.ji@uq.edu.au).