Depression and Suicide Analysis Using Machine Learning and NLP

Depression is a common mental illness that can impair performance and lead to suicidal ideation or suicide attempts. Traditional techniques used by mental health experts can assist in determining an individual's type of depression. In this work, we use machine learning and NLP to predict which posts indicate depression and to measure how accurately this can be done. Our dataset comes from Reddit. Reddit is well suited as a supplement to the traditional public health system because of the speed with which ideas are exchanged on it, its versatility in expressing emotions, and its compatibility with medical terminology. We examined comments and posts about suicidal ideation, and we used NLP to gain a better understanding of the interdisciplinary fields related to suicide. To address the research problem, we considered two help groups for depression and suicidal thoughts, r/depression and r/SuicideWatch, which provided appropriate information for tracking people at risk. The well-known "SuicideWatch" subreddit is commonly used by people who have thoughts of suicide and gives significant signals of suicidal behavior. A brief scan of the posts shows that these subreddits are legitimate online places to seek assistance and that they provide honest text data about people's mental state. We applied four ML algorithms: Logistic Regression, Naïve Bayes, Support Vector Machine (SVM), and Random Forest. We achieved 77.29% accuracy and a 0.77 F1-score with Logistic Regression, 74.35% accuracy and a 0.74 F1-score with Naïve Bayes, 77.12% accuracy and a 0.77 F1-score with Support Vector Machine, and 77.298% accuracy and a 0.77 F1-score with Random Forest.


Introduction
Depressive disorder is a severe and widespread mental illness characterized by an excessive sense of pessimism and despair. In its most severe manifestations, depression can have a significant impact on human performance as well as on human life. The severity of depression varies from person to person; what most cases have in common is a lack of motivation to attempt almost anything, including activities the person once greatly enjoyed. Suicidal ideation and a loss of the desire to go on living are two of the deadliest manifestations of depression. According to psychologists, depression is considered present once its symptoms have persisted for two weeks. Suicide is defined as a fatal self-harming act committed with the intent to die. According to the WHO, suicide is the world's 13th leading cause of death, accounting for five percent of all deaths, with nearly a million people dying by suicide worldwide each year. Suicide affects all age groups, but worldwide, rates clearly increase with age.
There are many reasons people attempt suicide. Victims of sexual or physical assault, or of mental anguish, may attempt suicide even several years after the anguish has passed, because it leaves feelings of guilt and despair that can result in suicide.
3. Use of Substances: Alcohol and drugs also affect a suicidal individual, making them more likely to act on their impulses than they would be if they were sober. Individuals with depression and other psychiatric disorders are at risk of substance and alcohol use disorders. When these factors combine, the dangers multiply.
4. Fear of Loss: When faced with a failure or the anxiety of a loss, a person may decide to attempt suicide. These situations can include a) academic failure; b) bullying, shaming, or humiliation, including cyberbullying; c) financial problems; d) loss of social status.
In this paper, we begin with the difficulties of processing social media text, illustrated with examples of suicidal ideation and similar mental illnesses. In the related research, we discuss prior studies that are relevant to our work. We then provide a thorough analysis of the study used in our approach. Finally, we outline the method of analysis used to address the problems described. We conclude by highlighting the broader scope of our research.
3 multilayer perceptron and support vector machine, among others, get the most precise forecasts of suicidal ideation nearly 100 percent of the time.
7. S. Fodeh et al. proposed a suicide clinical trials may benefit from ML models. They downloaded 12,066 series of posts from 3,873 users using keywords from Jashinsky et al.'s ideation tracking framework via Twitter's Android app. Participants were labeled "HighRisk" or "at risk" according to their use of suicidal ideation terms, and optimization classes were used to recognize implied suicide risk across many users, who were then classified as "HighRisk" or "at risk." The algorithms used included Semantic Analysis, Latent Dirichlet Allocation, and Non-linear Programming.
Text feature extraction, Self-Organizing Map (SOM), Neural Networks

Literature Review
In contrast to all these efforts, we concentrate on the combined task of recognizing useful comments on SuicideWatch (SW) subreddit posts. The objective is to predict helpful comments, and to analyze and interpret the results in order to gain insight into communication techniques for online responses to users' suicidal social media posts.

Methodology
Our methodology has four steps: we collected the data, pre-processed it, trained the models, and finally validated them.

Data collection:
The dataset consists of over 60,000 individual data points, collected from two subreddits and labeled as depression vs. suicide. Several properties make these subreddits well suited to our model: they have active posting, low troll and spam rates, and mostly textual posts with very few images. The small number of memes also makes the data easier to clean. With this research, we try to identify the distinction between the language used by a person experiencing clinical depression and the language used by an individual at risk of suicide, which will be beneficial for counsellors and mental health professionals.

Data pre-processing
We converted the text to lower case, removed punctuation and numbers, tokenized the data, removed stop words, stemmed the tokens, and finally passed the result to the machine learning models.

Lower casing:
Every word is converted to lower case. Words such as 'Depressed' and 'depressed' mean the same thing, but if they are not lower-cased they are represented as two different words in the feature vectors.

Tokenization:
Tokenization is the process of splitting text into smaller units called tokens, which can be of three types: words, characters, and n-grams. We broke our text data down into words.

Stop word removal:
Stop words are words that do not add significant meaning to a sentence. Common stop words include he, his, your, does, do, and if; they can be ignored without losing the meaning of the sentence. By removing such words, we reduce the overall data size without discarding valuable information.
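The three token types named in the tokenization step can be illustrated with a minimal sketch. The function names and the sample sentence are hypothetical and are not taken from the study's code; whitespace splitting assumes punctuation has already been removed.

```python
def word_tokens(text):
    # Split on whitespace; assumes punctuation was removed in an earlier step.
    return text.split()

def char_tokens(text):
    # Every non-space character becomes its own token.
    return [ch for ch in text if not ch.isspace()]

def word_ngrams(tokens, n):
    # Sliding window of n consecutive word tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "i feel so tired"          # made-up example sentence
words = word_tokens(sample)         # word-level tokens
bigrams = word_ngrams(words, 2)     # n-grams with n = 2
chars = char_tokens("so tired")     # character-level tokens
```

Word-level tokens, the choice reported here, keep the vocabulary interpretable; n-grams would additionally capture short phrases at the cost of a larger feature space.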

Stemming:
Stemming is the process of removing a part of the word, or reducing it to its stem or root.
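The pre-processing steps described in this section can be chained as in the following sketch. It is an illustration under stated assumptions, not the study's implementation: the stop-word set is a tiny hypothetical sample, and `toy_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"he", "his", "your", "does", "do", "if", "i", "a", "the", "to", "and", "am"}

def toy_stem(word):
    # Crude suffix stripping for illustration only (not the Porter algorithm).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                  # lower casing
    text = re.sub(r"[^\w\s]", "", text)                  # remove punctuation
    text = re.sub(r"\d+", "", text)                      # remove numbers
    tokens = text.split()                                # word tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [toy_stem(t) for t in tokens]                 # stemming

result = preprocess("He felt Depressed for 2 weeks, and I am feeling depressed.")
print(result)
```

In this sketch 'Depressed' and 'depressed' end up as the same token, which is the effect that lower casing and stemming are meant to achieve before vectorization.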

ML Algorithms Applied
Logistic regression: It predicts a binary outcome based on a set of independent variables. In contrast to linear regression, which yields continuous numeric values, logistic regression uses the logistic function to yield a probability that can be mapped to one of two distinct classes.
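As a minimal sketch of how the logistic function turns a linear score into a class decision, consider the code below. The weights, bias, and 0.5 threshold are hypothetical values chosen for illustration; they are not parameters learned in this study.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: maps any real z into (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    # Linear score followed by the logistic function.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    # Map the probability to one of two classes (1 or 0).
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

weights, bias = [1.0, -1.0], 0.0     # hypothetical, not learned from data
p_high = predict_proba(weights, bias, [2.0, 0.5])   # linear score z = 1.5
p_low = predict_proba(weights, bias, [0.5, 2.0])    # linear score z = -1.5
```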

Naïve Bayes:
It uses Bayes' theorem to classify objects, under the assumption that features are independent given the class. Naïve Bayes is often a practical choice when dealing with a large amount of data because it is a fast and uncomplicated classification algorithm, and it gives good results on NLP tasks such as sentiment analysis.
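The sketch below shows one common variant, multinomial Naïve Bayes with add-one (Laplace) smoothing, written from scratch. Which variant the study used is not stated, so this choice is an assumption, and the four training documents are made-up toy data rather than samples from the dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy corpus, already pre-processed into tokens.
docs = [["feel", "sad", "tired"], ["sad", "lonely"], ["want", "end", "life"], ["end", "pain"]]
labels = ["r/depression", "r/depression", "r/SuicideWatch", "r/SuicideWatch"]
model = train_nb(docs, labels)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.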

Support vector machine:
SVM fits a decision function that maximizes the distance between the classes. This function is determined by the training instances at the edge of each class. The data points lying closest to the decision boundary are the most difficult to classify, and they have a direct impact on the optimum position of the decision surface.
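The idea of maximizing the distance between classes (the margin) can be made concrete with a small sketch. It does not train an SVM; it only compares two hand-picked, hypothetical linear boundaries on made-up 2-D points and reports which points lie closest to the boundary.

```python
import math

def decision_value(w, b, x):
    # Signed score of point x under the linear boundary w.x + b = 0.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    # Smallest signed distance of any point to the boundary (labels are +1 / -1).
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))

def closest_points(w, b, points, labels, tol=1e-9):
    # Points whose distance equals the margin; for the max-margin boundary
    # these are the support vectors.
    norm = math.sqrt(sum(wi * wi for wi in w))
    m = geometric_margin(w, b, points, labels)
    return [x for x, y in zip(points, labels)
            if abs(y * decision_value(w, b, x) / norm - m) <= tol]

points = [(3, 0), (4, 1), (1, 0), (0, 2)]   # made-up data
labels = [1, 1, -1, -1]
margin_a = geometric_margin((1, 0), -2.0, points, labels)   # boundary x = 2
margin_b = geometric_margin((1, 0), -2.5, points, labels)   # boundary x = 2.5
edge_points = closest_points((1, 0), -2.0, points, labels)
```

Of the two candidates, the boundary x = 2 leaves the larger margin, and only the two points nearest to it determine that margin, which is why the instances at the edge of each class matter most.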

Random forest:
It is made up of many decision trees, each of which produces a class prediction. The model's prediction is the class with the most votes. The concept is simple but robust, and it reflects the wisdom of crowds.
Random forest also de-correlates the independently constructed trees by selecting a random subset of the available features (predictors) when building each tree. This procedure reduces the chance that the trees become highly correlated because of one or two extremely powerful predictors.
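The two ingredients described above, majority voting and random feature subsets, can be sketched as follows. The tree votes are hypothetical, and using the square root of the feature count as the subset size is a common heuristic assumed here, not a setting reported in this study.

```python
import math
import random
from collections import Counter

def majority_vote(predictions):
    # The forest's prediction is the class predicted by the most trees.
    return Counter(predictions).most_common(1)[0][0]

def random_feature_subset(n_features, k, rng):
    # Each tree is built from k feature indices drawn without replacement.
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)               # fixed seed so the sketch is repeatable
n_features = 16                      # hypothetical vocabulary size
k = int(math.sqrt(n_features))       # common heuristic: sqrt of feature count
subsets = [random_feature_subset(n_features, k, rng) for _ in range(3)]

# Hypothetical votes from three trees for a single post.
tree_votes = ["r/SuicideWatch", "r/depression", "r/SuicideWatch"]
forest_prediction = majority_vote(tree_votes)
```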

Results and findings
Logistic Regression: Table 2. The results show a precision of 0.80, a recall of 0.79, an F1-score of 0.79, and an accuracy of 0.79. Logistic Regression had the highest accuracy, together with Random Forest, followed by Support Vector Machine and Naïve Bayes.
Naïve Bayes: Table 3. The results show a precision of 0.77, a recall of 0.77, an F1-score of 0.77, a support of 3938, and an accuracy of 0.77120. Support Vector Machine had the second-highest accuracy, higher than Naïve Bayes but lower than Logistic Regression and Random Forest.
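For reference, the reported metrics relate to one another as in the following sketch, which computes them for a single positive class on made-up labels. The figures reported in the tables may be averaged over both classes; that is an assumption, and the sketch does not reproduce the study's evaluation.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1_score(precision, recall), "accuracy": accuracy}

# Made-up labels for eight posts (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

As a consistency check, a precision of 0.80 and a recall of 0.79 give `f1_score(0.80, 0.79)` of about 0.795, which agrees with the F1-score of 0.79 stated for Table 2 when rounded to two places.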

Conclusion
Depression and suicide analysis is a challenging and complex task. In this paper, we attempted to detect depression on the Reddit platform and looked for ways to improve the effectiveness of that detection. Using text classification techniques and NLP, we found a link between depression and language. We looked at how single features and cumulative feature sets performed in measuring depression symptoms under various text classification methods.
The "SuicideWatch" subreddit is used by people who have thoughts of suicide and gives strong signals of suicidal behavior. Reddit is well suited as a supplement to the traditional public health system because of the speed with which ideas are exchanged on it, its versatility in expressing emotions, and its compatibility with medical terminology. The models' effectiveness can be seen in the 77.29% accuracy and 0.77 F1-score of Logistic Regression, the 74.35% accuracy and 0.74 F1-score of Naïve Bayes, the 77.12% accuracy and 0.77 F1-score of Support Vector Machine, and the 77.298% accuracy and 0.77 F1-score of Random Forest.
Although the techniques used in this work perform sufficiently well, more research is required in this area. We expect that this study will help lay the groundwork for new mechanisms for estimating depression and related factors in various fields of health. People suffering from mental illnesses may benefit from being more proactive in their recovery. Identifying immediate suicide risk is a challenging but possibly lifesaving task. The currently accepted indicators do not satisfactorily address suicide risk. Although the presence of suicidal ideation indicates a high risk of suicide, many patients deny having it. Suicidal thoughts, intentions, and information from significant others should be sought and treated as legitimate even after the patient denies them.

Future Work
Our framework can be extended to more significant healthcare problems involving multimodal data. It can even be used in conjunction with smart virtual assistants to lower the risk of self-harm in patients. Some points to improve in our present system are: 1. collaborating with local groups to integrate digital footprints with offline data (e.g., suicide hotlines, counseling); 2. extending our research beyond Reddit to other common public forums.
Deep learning algorithms such as RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and ANN (Artificial Neural Network) can be implemented to improve on this research. A further direction for future research is to take into account other data characteristics such as comments, memes, reposts, and likes. It would also be interesting to investigate whether such conversations affect the course of a user's suicidal ideation over time.