An Explainable Approach Based on Emotion and Sentiment Features for Detecting People with Mental Disorders on Social Networks

Mental disorders are a global problem that widely affects different segments of the population. Diagnosis and treatment are difficult to obtain, as there are not enough specialists on the matter, and mental health is not yet a common topic among the population. The computer science field has proposed some solutions to detect the risk of depression, based on language use and data obtained through social media. These solutions are mainly focused on objective features, such as n-grams and lexicons, which are difficult for experts in the application area to understand. Hence, in this paper, we propose a contrast pattern-based classifier to detect depression by using a new data representation based only on emotion and sentiment features extracted from posts on social media. Our proposed feature representation contains 28 different features, which are more understandable by specialists than other proposed representations. Our feature representation, jointly with a contrast pattern-based classifier, has obtained better classification results than five other combinations of features and classifiers reported in the literature. Our proposal statistically outperformed the Random Forest, Naive Bayes, and AdaBoost classifiers using the parser-tree, VAD (Valence, Arousal, and Dominance) and Topics, and Bag of Words (BOW) representations. It obtained statistically similar results to the logistic regression models using the Ensemble of BOWs and Handcrafted features representations. In all cases, our proposal was able to provide an explanation close to the language of experts, due to the mined contrast patterns.


Introduction
According to the World Health Organization (WHO), an estimated 264 million people around the world suffer from depression [1]. Depression is one of the most troublesome and common mental disorders; it is a principal cause of disability worldwide and has a significant impact on morbidity, as depression can lead to suicide. The number of suicides related to depression reaches up to 800,000 per year worldwide [2].
The number of people suffering from mental disorders, including depression, is continually growing, and this has an impact on human rights, the economy, and society [3]. This is especially true in low-income countries. The exponential growth of the population and the fact that a large group of people is reaching the age at which depression is more likely to appear are factors that contribute to that growth [2]. Other factors, such as political, social, cultural, and economic factors, play a significant role in the development of such disorders. That is why minority groups, such as people suffering from discrimination for their ethnicity, sexual orientation, or gender identification, people suffering from domestic abuse, and people exposed to conflict or high stress, are at higher risk of developing major depressive disorder (MDD) [4].
The main contributions of this paper are the following:
• An explainable model based on contrast patterns achieving better area under the curve (AUC) and F1 scores than other state-of-the-art classifiers designed for detecting depression.
• A set of extracted contrast patterns describing depressive posts in a language close to that of experts in the application area.
The paper is organized as follows. Section 2 presents the previous works in the topic. Section 3 presents the databases, representations, and classifiers used to construct the models used for evaluation. Section 4 presents the results obtained in AUC and F1 measures and the statistical comparison between our model and other state-of-the-art models. Section 5 presents a demonstration on how the contrast patterns extracted can be interpreted. Finally, Section 6 presents our conclusions and future work.

Previous Works
This section provides a general overview of previously conducted research related to the detection of mental health diseases with the use of information extracted from social media. As the use of AI for detecting depressive posts is still a little-studied topic, most of the papers were proposed in workshops. We also reviewed those papers using objective information and sentiment analysis as feature representations for detecting people with depressive behavior on social media.

Workshops on the Detection of Depression on Social Media
There have been multiple workshops on the detection of mental health problems using social media information. One of the best-known workshops is from the CLEF's (Conference and Labs of the Evaluation Forum) Early Risk Prediction on the Internet (eRisk) lab [23], wherein the task for early detection of depression consisted of processing Reddit posts sequentially and detecting depression signs as early as possible.
Multiple solutions were given to this task; the best solutions are summarized in the following subsections.

LIDIC Participation in the CLEF's eRisk Pilot Task
This research group proposed a semantic representation of posts and a method named temporal variation of terms (TVT) [24]. The representation includes an unweighted BOW, a character 3-gram representation, features extracted using the linguistic inquiry and word count (LIWC) tool, and concise semantic analysis. The LIWC tool is "a transparent text analysis program that counts words in psychologically meaningful categories" [25]. This research group used it to extract linguistic markers of depression. Concise semantic analysis is a technique used to represent terms as vectors in a space of concepts close to their category labels. This technique represents posts as the central vector of the vectors representing the individual terms they contain [26]. TVT uses concise semantic analysis to create concept spaces for both the depressed and non-depressed classes. It then classifies the new entries based on those spaces. This technique reported an F1 measure of 0.59 [26].

Dortmund University Participation in the CLEF's eRisk Pilot Task
The Computer Science Group of Dortmund University proposed four different representations of social media posts. The representations included linguistic metadata extracted manually. These data included the following:
• Flesch Reading Ease score.
• Dale-Chall Readability score.
• Gunning Fog Index score.
• Boolean value that represents whether or not the post includes the name of a medication linked to depression.
• Boolean value that represents whether or not the post includes a phrase that implies the diagnosis of depression ("I was diagnosed with depression", "My diagnosis of depression", etc.).
• Boolean value that represents whether or not the term "My therapist" is included in the post.
• Boolean value that represents whether or not the term "My anxiety" is included in the post.
• Boolean value that represents whether or not the term "My depression" is included in the post.
The other three representations were three different BOWs. The first BOW used, as terms, the 200,000 unigrams, bigrams, trigrams, and four-grams with the highest information gain (IG), and its weights are calculated with the following formula:

w_{d,t} = (l_{d,t} × g_t) / n_d,

where l_{d,t} is the local weight of term t in post d, given by the raw frequency of the term in the document; g_t is the global weight of term t, given by IG; and n_d is a normalization factor.
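The weighting scheme above (local weight times global weight, divided by a normalization factor) can be sketched as follows with toy frequencies and IG values; the function and variable names are ours, not from the original system:

```python
import math

def bow_weight(local_tf, info_gain, doc_weights):
    """Weight of one term in one post: local weight (raw term frequency)
    times global weight (information gain), divided by a normalization
    factor; here we use the L2 norm of the post's unnormalized weights,
    since the paper's BOWs use L2/cosine normalization."""
    l2 = math.sqrt(sum(w * w for w in doc_weights))
    return (local_tf * info_gain) / l2 if l2 else 0.0

# Toy post with three terms: raw frequencies and IG scores.
freqs = [3, 1, 2]
gains = [0.8, 0.1, 0.5]
unnormalized = [f * g for f, g in zip(freqs, gains)]
weights = [bow_weight(f, g, unnormalized) for f, g in zip(freqs, gains)]
```

After normalization, the weight vector of each post has unit L2 norm, so the dot product of two post vectors is their cosine similarity.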
The second BOW used, as terms, all the unigrams in the training set, and the augmented term frequency (atf) multiplied by the inverse document frequency, as shown in the following formula:

w_{d,t} = atf_{d,t} × idf_t, with atf_{d,t} = 0.5 + 0.5 × (tf_{d,t} / max_{t′} tf_{d,t′}) and idf_t = log(N / df_t),

where tf_{d,t} is the raw frequency of term t in post d, N is the number of posts, and df_t is the number of posts containing t.
The third BOW used all unigrams in the training set as terms, with the logarithmic term frequency as the local weight and the relevance frequency as the global weight. All the BOWs used the l2-norm (cosine) normalization as the normalization factor. The handcrafted features and all BOWs were used separately as input to different logistic regression classifiers. These classifiers had a class weighting given by 1/(1+w) for the not-depressed class and w/(1+w) for the depressed class, with w being equal to 2, 6, 2, and 4 for each of the classifiers, respectively. The highest reported F1 measure is 0.64, while the lowest is 0.55 [22].

UAM Participation in the CLEF's eRisk Pilot Task
The UAM (Universidad Autónoma Metropolitana, by its acronym in Spanish) research team proposed a graph representation of the posts as in [29], where each term is represented by a node. The edges represent the number of co-occurrences of the pair of nodes they join within a contextual window. The terms the research team used for the graph representation were unigrams and trigrams, with a contextual window of two terms to the right and two to the left.
They made a graph for each class (depressive and non-depressive) and one for each post. Afterward, they compared the individual post graphs with the class graphs, measuring their similarity using containment similarity, value similarity, normalized value similarity, and Dice similarity. Containment similarity refers to sequences of shared nodes and edges. Value similarity refers to how many of the edges of the post graph are contained in the prototype graph. Finally, Dice similarity refers to the number of shared nodes between graphs.
The four features were then fed to a k-nearest neighbors classification algorithm in the Weka platform with k = 1 and Euclidean distance. The highest reported F1 measure is 0.16, while the lowest is 0.08 [30].
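As an illustration of two of these measures, the sketch below computes Dice similarity over node sets and a containment-style score over edge sets. This is a simplified reading of the measures, and the toy graphs are hypothetical:

```python
def dice_similarity(nodes_a, nodes_b):
    """Dice coefficient over shared nodes: 2|A ∩ B| / (|A| + |B|)."""
    if not nodes_a and not nodes_b:
        return 0.0
    return 2 * len(nodes_a & nodes_b) / (len(nodes_a) + len(nodes_b))

def containment_similarity(edges_post, edges_class):
    """Fraction of the post graph's edges contained in the class
    (prototype) graph."""
    if not edges_post:
        return 0.0
    return len(edges_post & edges_class) / len(edges_post)

# Toy post graph vs. a toy class (prototype) graph.
post_nodes = {"feel", "sad", "today"}
class_nodes = {"feel", "sad", "alone", "tired"}
post_edges = {("feel", "sad"), ("sad", "today")}
class_edges = {("feel", "sad"), ("sad", "alone")}
```

Each post would yield one such score per class graph, and those scores become the features fed to the kNN classifier.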

Machine Learning and Mental Disorders
Researchers have used machine learning and AI to detect multiple mental health issues. In this subsection, we present some of the works conducted using not only text, but also physical markers.

Tackling Mental Health by Integrating Unobtrusive Multimodal Sensing
An investigation team from the University of Rochester proposed a model that could link mental states and a set of signals extracted from different sources [17]. These signals included the sentiment analysis of the tweets and tweet replies posted by the users. This feature was extracted with an NLP tool called Sentiment140 [31], which returns the polarity of the text (positive, negative, or neutral). For the sentiment in images, they used the algorithm described in [32]. In addition to the sentiment features of the tweets, they collected further signals using different devices. The team then used these features to feed a logistic classifier to infer the mood of the user. Using the leave-one-subject-out approach, the F1 and AUC measures are shown in Table 1.
In a related work, emotion categories were used to create a sub-emotion lexicon by clustering words by their emotion level. This new lexicon was used to vectorize social media posts and create a BOW using sub-emotions as features. The authors used an SVM as the classifier for their model and obtained an F1 score of 0.61 when using unigrams and 0.63 when using n-grams [20].

Sentiment Analysis and Mental Health
The use of sentiment analysis on social media posts for detecting mental health problems is present in current research. This subsection shows some of the previous works on that matter.

Detecting Depression Using K-Nearest Neighbors (KNN) Classification Technique
Ref. [18] proposed a model for classifying Facebook comments as depressive indicative or not, using features that were divided into three different categories: emotional variables, temporal categories, and standard linguistic dimensions. Table 2 shows the features in more detail.
These features were then fed to different KNN algorithms, both individually and in combination. The best F1 measure, 0.71, was obtained with the Coarse KNN algorithm using the emotional variables.

Depressive Moods of Users Portrayed in Twitter
Ref. [34] extracted 37 sentiment categories using the tool LIWC and examined how the variables were correlated with a user's Center for Epidemiological Studies-Depression (CES-D) score. CES-D is a survey that asks about the frequency of depression-related symptoms that the patient has suffered over the past week. The final score ranges from 0 to 60, with higher scores indicating more severe depression symptoms [34]. A total of 18 sentiment predictors were found to be reasonably correlated with depression. The predictors, example words for each factor, and the coefficients of a multiple regression model for the CES-D score are detailed in Table 3.

Mental Health Computing via Harvesting Social Media Data
An investigation team from Tsinghua University used the features described in Table 3 for their representation. Additionally, they used features such as topic-level features, features related to antidepressants and depression symptoms, and features extracted with the tool bBridge [35]. These features were then fed to a binary classifier and tested on a dataset of 2804 users. The model obtained an F1 measure of 0.85 [19].

Materials and Methods
In this section, we present the databases we used for the comparison, the tools we used for extracting emotion and sentiment features, and the structure of the feature space created. We also describe the state-of-the-art models we reproduced for the comparison. Finally, we summarize the models used for the statistical analysis.

Databases
In this paper, we used five different publicly available databases; Table 4 summarizes their characteristics. The C-SSRS database was extracted by Gaur et al. [36] to distinguish the severity risk of suicide in a user. They asked four psychiatrists to classify Reddit posts into five categories:
• Indicative of suicidal ideation.
• Indicative of suicidal behavior.
• Indicative of an actual attempt of suicide.
• Suicide indicator (it contains references to suicide but in an informative manner).
• Supportive of suicidal people.
We relabeled every post labeled as supportive or a suicide indicator as non-depressive and every post labeled in any other category as depressive.
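This relabeling step amounts to a simple mapping from the five C-SSRS categories to binary labels. The label strings below are illustrative stand-ins for the original annotations:

```python
# Hypothetical short names for the five C-SSRS categories listed above.
NON_DEPRESSIVE = {"supportive", "indicator"}
DEPRESSIVE = {"ideation", "behavior", "attempt"}

def relabel(cssrs_label):
    """Collapse the five C-SSRS severity categories into the binary
    depressive / non-depressive labels used in this paper: supportive
    posts and suicide indicators become non-depressive, everything
    else becomes depressive."""
    return "non-depressive" if cssrs_label in NON_DEPRESSIVE else "depressive"
```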
DDVHSM (depression detection via harvesting social media) is a database extracted by Guangyao Shen et al. [37]. The research team labeled Twitter posts considering the appearance of specific text related to depression diagnosis, such as "I have been diagnosed with depression.", "I am diagnosed with depression.", and similar texts.
The Kaggle database is an open-source database of Twitter posts, annotated individually, as depressive or non-depressive. Since the posts were annotated individually, the number of users is unknown. The labeling was made considering the appearance of the stem "depress" on the text.
Both LOSADA databases were made available in the eRisk lab of the CLEF in 2017 and 2018 [23,40]. As described in [39], to label a user as depressive or not depressive, they searched for mentions of a diagnosis, such as "I have been diagnosed with depression.", "I was diagnosed with depression.", and similar texts. Texts such as "I am depressed." or "I have depression." were not considered, as they do not mention an explicit diagnosis. This labeling did not include individual post labels; we relabeled every post of a depressed user as depressive and every post of a not-depressed user as non-depressive.

Extraction of Sentiment Features
After having the databases with every post labeled individually as depressive or non-depressive, we processed the posts as shown in Figure 1. We removed HTML tags, stopwords, and repeating punctuation. After that, some posts were left blank, so we removed them, as they would give no useful information for the comparison.
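A minimal sketch of this preprocessing, assuming a simplified stopword list (the actual list and tooling used in the paper may differ):

```python
import re

# Tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"the", "a", "an", "is", "to", "and"}

def preprocess(post):
    text = re.sub(r"<[^>]+>", " ", post)        # remove HTML tags
    text = re.sub(r"([!?.,])\1+", r"\1", text)  # collapse repeated punctuation
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

posts = ["<p>I want to be alone...</p>", "<br/>"]
cleaned = [preprocess(p) for p in posts]
# Posts left blank after cleaning carry no information, so drop them.
cleaned = [c for c in cleaned if c]
```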
We then extracted emotion and sentiment features using the tools MeaningCloud [41] and Paralleldots [42]. The description of the features extracted with each tool is found in Tables 5 and 6. We used five different APIs from Paralleldots: sentiment analysis, emotion analysis, sarcasm detection, intent analysis, and abuse analysis. The emotion analysis API uses a model based on Paul Ekman's basic emotions theory [43], replacing disgust and surprise by boredom and excitement [42]. From MeaningCloud, we used the sentiment analysis API.

For each category, the extracted features and their descriptions are the following:
• Sentiment (Negative, Neutral, Positive): each feature has a numeric value representing the probability that the text is negative, neutral, or positive.
• Emotion (Bored, Angry, Sad, Fear, Happy, Excited): each feature has a numeric value representing the probability that the text represents an emotion of boredom, anger, sadness, fear, happiness, or excitement.
• Sarcasm (Sarcastic, Not-sarcastic): each feature has a numeric value representing the probability that the text is sarcastic or not.
• Intent (News, Query, Spam, Marketing, Feedback): each feature has a numeric value representing the probability that the text is news, a query, spam, marketing, or feedback.
• Feedback (Complaint, Suggestion, Appreciation): if the text is feedback, each feature represents the probability of the text being a complaint, a suggestion, or an appreciation text.
• Abuse (Abusive, Hate-speech, Neither): each feature has a numeric value representing the probability that the text is abusive, hate speech, or neither.


Classifier
For the three previously explained representations, we used PBC4cip, a contrast pattern-based classifier that addresses class imbalance problems [44].
A contrast pattern is a pattern that describes a proportion of objects of one class that differs significantly from the other classes. Contrast patterns are a way of making a classifier explainable: they can be interpreted in natural language and provide a model that is easy to understand for human experts in the field of the problem being solved [45]. Contrast patterns are used in tasks such as bot detection [45], masquerader detection [46], image processing [47], medical diagnosis [48,49], and fraud detection [50]. Moreover, they have proven to be more accurate than other state-of-the-art models in certain problems [44,45,51,52].
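To make the idea concrete, a contrast pattern can be viewed as a conjunction of simple conditions on feature values, and its support in a class as the fraction of that class's objects it covers. The pattern and feature vectors below are illustrative, not mined patterns:

```python
# An illustrative pattern: "sadness >= 0.02 AND happiness <= 0.05".
pattern = [("sadness", ">=", 0.02), ("happiness", "<=", 0.05)]

def covers(pattern, obj):
    """True if the object satisfies every condition of the pattern."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(ops[op](obj[feat], th) for feat, op, th in pattern)

def support(pattern, objects):
    """Fraction of the given class's objects covered by the pattern."""
    if not objects:
        return 0.0
    return sum(covers(pattern, o) for o in objects) / len(objects)

depressive = [{"sadness": 0.6, "happiness": 0.01},
              {"sadness": 0.3, "happiness": 0.2}]
non_depressive = [{"sadness": 0.0, "happiness": 0.9}]
```

A pattern with high support in one class and low support in the other contrasts the classes and can be read directly in natural language.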
PBC4cip (pattern-based classifier for class imbalance problems) is a classifier based on contrast patterns designed to deal with class imbalance problems. Its main goal is to avoid the model's bias toward the most supported class by extracting a weight during the training phase. It then uses the weighting obtained in the training phase to balance the classes by rewarding the minority class by its low support and punishing the majority class by its higher support [44].
We used a Random Forest miner, using Twoing as the splitting measure for the decision trees. Due to the time-consuming nature of hyperparameter optimization and the fact that we used multiple representations and databases, we based our decision on extensive prior experimentation, where Twoing was shown to be the recommended measure for building C4.5 decision trees [53].

State-of-the-Art Models
To compare the results with state-of-the-art models fairly and precisely, we reproduced some of the models found in the literature. For every database, we extracted every representation, giving a total of 25 feature spaces. The extracted representations are described in detail in the following subsections.

Ensemble of Handcrafted Features and BOWs
The first representation we obtained is the one described in [16], which is the best representation for the database LOSADA2016. For all databases and for each post, we extracted the features described in Section 2.1.2, that is, the linguistic metadata and the three different BOWs with their corresponding weighting schemes. We then fed every representation individually to a different logistic regression classifier. Each classifier had its own class weighting. For the classifiers fed with the handcrafted features and the second BOW, the weights were non_depressed = 1/3 and depressed = 2/3. For the classifier fed with the first BOW, the weights were non_depressed = 1/7 and depressed = 6/7. For the classifier fed with the third BOW, the weights were non_depressed = 1/5 and depressed = 4/5. These weights were used, due to the imbalanced class distribution, to increase the cost of false negatives, as stated in the paper. The result of this classification was the unweighted mean of the four probabilities calculated by the models.
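The ensemble logic can be sketched as follows. The per-model probabilities below are placeholders, since the actual values come from the trained logistic regression models:

```python
def class_weights(w):
    """Class weights 1/(1+w) for the not-depressed class and
    w/(1+w) for the depressed class."""
    return {"not_depressed": 1 / (1 + w), "depressed": w / (1 + w)}

# One w per classifier (handcrafted features and the three BOWs),
# in the order listed in the paper.
ws = [2, 6, 2, 4]
weights = [class_weights(w) for w in ws]

# Placeholder probabilities of the depressed class, one per model.
model_probs = [0.80, 0.65, 0.70, 0.55]

# Final score: unweighted mean of the four predicted probabilities.
ensemble_prob = sum(model_probs) / len(model_probs)
label = "depressed" if ensemble_prob >= 0.5 else "not_depressed"
```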

Ensemble of BOWs
The second representation uses the same operations defined in Section 2.1.2 to extract three weighted BOWs. For the first BOW, we used the raw term frequency and information gain as local and global weights, respectively. For the second BOW, we used the augmented term frequency and the inverse document frequency. For the third BOW, we used the logarithmic term frequency and the relevance frequency [22]. Each BOW was fed to a different logistic regression classifier, with class weights non_depressed = 1/3 and depressed = 2/3 for the three classifiers. The result was the unweighted mean of the three probabilities calculated by the models.

Parser Tree
The third representation [36] includes features derived from the dependency parse tree and the numbers of pronouns, sentences, and definite articles, which we extracted with the help of the Stanford CoreNLP Natural Language Processing Toolkit [55]. For this representation, we used a Random Forest classifier with no weighting or any extra configuration, since none was specified in the original paper [36].

VAD and Topics
The fourth representation [37] includes the following features: • VAD features: valence, arousal, and dominance features using the Affective Norms for English Words database [56]. The lexicon we used for the extraction of these features is described in [57].
For this representation, we used a Naive Bayes model. We did not add any extra configuration to the model because it was not specified by the original paper [37].

BOW
The fifth and last representation is described in [58]; it consists of a classic tf-idf BOW, that is, a BOW that uses the raw term frequency as the local weight and the inverse document frequency as the global weight. For this representation, we used AdaBoost, since it was the classifier that performed best in [58]. We did not give the model extra configuration, since none was specified in the original paper.
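A minimal sketch of this tf-idf weighting on toy tokenized documents, with the raw term frequency as the local weight and a log-scaled inverse document frequency as the global weight:

```python
import math

# Toy tokenized posts.
docs = [["i", "feel", "sad"], ["i", "feel", "fine"], ["sad", "songs"]]

def idf(term, docs):
    """Inverse document frequency: log(N / df), where df is the
    number of documents containing the term."""
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    """tf-idf weight: raw term frequency times idf."""
    return doc.count(term) * idf(term, docs)
```

Terms appearing in every document get an idf of log(1) = 0, so they carry no weight; rarer terms are weighted up.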

Data Partitioning
For every representation discussed above, we performed a distribution optimally balanced stratified cross validation (DOB-SCV) partitioning [59], using the tool KEEL.
KEEL is an open-source tool for developing experiments. It contains a specific module for imbalanced databases, which is important in this problem because most of the available databases are imbalanced, due to the relative prevalence of depression and the lack of diagnosis of this disorder.

Metrics
The metrics we used to assess the results are the F1 score and the AUC. There are multiple reasons for choosing these metrics for our model evaluation. The F1 score is the harmonic mean of precision and recall, which means that it assesses both measures [60], given the following formulas:

precision = TP / (TP + FP), recall = TP / (TP + FN),

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. F1 is then calculated by the following:

F1 = 2 × (precision × recall) / (precision + recall).

By doing so, the F1 score is an indicator of both quality and robustness. On the other hand, "the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance" [61]. For a crisp classifier, it can be calculated using the following formula:

AUC = (TP / (TP + FN) + TN / (TN + FP)) / 2,

where TN is the number of true negatives. This is important in this specific problem because of the importance of correctly classifying positive instances of depression.
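These two metrics can be computed directly from the confusion-matrix counts; the AUC here is the single-point form, i.e., the mean of sensitivity and specificity:

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_score(tp, fp, tn, fn):
    """Single-point AUC: mean of the true positive rate
    (sensitivity) and the true negative rate (specificity)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

# Toy confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 80, 20
```

Note that F1 ignores true negatives, while the single-point AUC balances performance on both classes, which matters for imbalanced depression data.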

Proposed Representations
The first comparison we performed was between the three different representations proposed in this paper. As seen in Figures 2 and 3, the worst-performing representation is MeaningCloud on both metrics. This representation obtains an average F1 score of 0.5076, with its best performance on the Kaggle database (an average of 0.8459) and its worst on the LOSADA2018 database (an average of 0.1390). For the AUC, the worst performance is again given by the MeaningCloud representation, with its best average performance of 0.8562 on Kaggle and its worst average performance of 0.5 on LOSADA2018, where its predictions were no better than random.
On the other hand, the best F1 performance was achieved by the combined representation, with an average performance of 0.6457 across all the databases and a best performance of 0.9586 on the Kaggle database. The representation that gives the best AUC performance is again the combined representation, with its best average performance of 0.9621 on Kaggle and its worst average performance of 0.7170 on LOSADA2018.
It is important to note that the difference between the results of the combined and Paralleldots representations is not statistically significant in the F1 score, according to a Wilcoxon signed-ranks test [62], which gives a p-value of 0.1416. Nevertheless, there is a significant difference in the AUC metric, with a p-value of 0.02444.
We chose combined as the best representation, given the results, and used it as the point of comparison with other representations and classifiers.
After determining the best representation using PBC4cip, we used that representation to compare it with the results of other proposals. Since we only have results for three of the five databases using the LOSADA2016 and LOSADA2018 representation and classification techniques, the results are divided into two sets. As seen in Figure 5, the same pattern is repeated when using AUC as the performance metric. Nevertheless, using AUC, the Ensemble of BOWs and Handcrafted features with logistic regression performs better on both CSSRS and DDVHSM, with mean performances of 0.7357 and 0.9470, respectively. This contrasts with PBC4cip, which has mean performances of 0.7219 and 0.9375.

Figure 4. A comparison plot of the F1 score for the best proposed representation using PBC4cip, and the available representations from Losada's databases.

Figure 5. A comparison plot of the AUC for the best proposed representation using PBC4cip, and the available representations from Losada's databases.

Other Models
As seen in Figure 6, for both the LOSADA2016 and LOSADA2018 databases, the VAD representation with Naive Bayes classification presents the worst performance in F1 score. On the other hand, for the DDVHSM database, it presents the best performance. For the CSSRS and Kaggle databases, the tf-idf BOW using AdaBoost classification presents the worst performance, but it performs better on the LOSADA2016 and LOSADA2018 databases. For all databases except DDVHSM, the best performance is given by the sentiment-based representation with PBC4cip. Figure 7 shows that, for AUC, VAD features using Naive Bayes classification present the worst performance for LOSADA2016 and LOSADA2018. It also shows that they reach higher performance on the CSSRS, DDVHSM, and Kaggle databases, with the best performance on DDVHSM. The parser tree representation with Random Forest classification presents a similar performance across all databases in comparison with its counterparts.
As we have stated before, for all databases except DDVHSM, the best performance is given by the sentiment-based representation with PBC4cip.


Wilcoxon Test
After obtaining and visually comparing the results, we performed a Wilcoxon signed-ranks test [62] to assess whether or not there is a significant difference between the results of the different classifiers. We used this test because it is recommended in cases where the same models or subjects are assessed under more than one condition, and when what is being assessed is actual numeric scores instead of nominal values [63].
The Wilcoxon signed-rank test has a null hypothesis H0: M1 = M2 and an alternative hypothesis H1: M1 ≠ M2, where M1 and M2 are the datasets being compared. We first subtract the values of one dataset from the other, that is, D = M1 − M2.
After that, we have to rank the absolute values of D in ascending order, that is, the smallest value of |D| is number 1, and the highest value of |D| is number n, where n is the number of values in the datasets.
We then sum the ranks of the positive values of D, obtaining T+, and the ranks of the negative values of D, obtaining T−. We obtain the Wilcoxon statistic using the following formula:

W_stat = min(T+, T−).

We obtain the Wilcoxon critical value W_crit from the Wilcoxon signed-rank test quantiles table. After that, we compare W_crit and W_stat. If W_stat < W_crit, then we reject the null hypothesis, meaning that there is a statistical difference between the datasets. On the other hand, if W_stat > W_crit, we accept the null hypothesis, meaning that there is no statistical difference between the datasets.

Table 7 shows the mean of the scores obtained by each model per database. The values used for the Wilcoxon test were the results obtained per partition for each database. Table 8 shows the results obtained from the test. When comparing the models, we could see that the logistic regression classifier together with the Ensemble of BOWs and Handcrafted features performed better on some databases. Nevertheless, according to the Wilcoxon test, there is no significant difference between them and the sentiment-based representation together with PBC4cip in either F1 score or AUC. The results also show classifiers and representations for which there is a significant difference; in those cases, the difference shows an advantage of using sentiment features with PBC4cip over the other classifiers. Moreover, PBC4cip provided patterns that explain the decisions taken by the classifier. The features the model takes most into consideration to classify a social media post as depressive or not depressive include the following:
• The polarity of the text, especially whether the text is neutral or not.
• The subjectivity of the text.

Table 9 shows examples of the contrast patterns obtained by the model, as well as their support for the depressive and non-depressive classes. These patterns can be interpreted in natural language; this interpretation can be found in Table 10.
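The Wilcoxon signed-rank procedure described at the start of this subsection can be sketched as follows. This is a minimal illustration that assigns average ranks to ties and drops zero differences; it is not the implementation used in our experiments:

```python
def wilcoxon_stat(m1, m2):
    """Wilcoxon signed-rank statistic: rank the absolute differences,
    sum the ranks of positive and negative differences (T+ and T-),
    and return W_stat = min(T+, T-)."""
    d = [a - b for a, b in zip(m1, m2) if a != b]  # drop zero differences
    ordered = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(ordered):
        j = i
        # Extend the block while the next |difference| is tied.
        while j + 1 < len(ordered) and abs(d[ordered[j + 1]]) == abs(d[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block (1-based)
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    t_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    t_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(t_plus, t_minus)
```

W_stat is then compared with the critical value from the quantiles table: the null hypothesis is rejected when W_stat falls below it.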

Interpretation of Patterns
The patterns show that depressive tweets tend to have a higher probability of containing text representing sadness, anger, and boredom, and a lower probability of containing text representing excitement, happiness, or a positive polarity. Moreover, depressive posts can be identified by the lack of excitement and happiness more than by the presence of sadness or anger. The contrast patterns obtained show that for 40% of depressive posts, and 0% of the non-depressive posts, the feelings of excitement and happiness do not exceed 0.05. On the other hand, the levels of sadness, anger, and boredom can vary from 0.02 upwards. Polarity is also important, as positive polarity is linked to 65% of the non-depressive posts and only 20% of the depressive posts. Nevertheless, neutral polarity is linked to both depressive and non-depressive posts. The patterns also show that non-depressive posts contain more sarcasm (a probability higher than 0.52) than depressive posts. In addition, the posts are objective for 26% of the depressive posts, in contrast with 0% of the non-depressive posts.

Table 10. Explanation in natural language of example patterns in Table 9.

ID   Interpretation in Natural Language of Extracted Patterns
P1   Depressive posts have at least a minimum amount of sadness and anger, almost no excitement or happiness, a polarity that is either negative or neutral, and up to a medium-high level of boredom.
P2   Depressive posts have at least a medium-high level of negativity and a medium-low level of abusive content, at most a low level of positivity, and almost no excitement or happiness.
P3   Depressive posts have almost no happiness or hate speech, at most a low level of positivity, a polarity that is either positive or negative, at least a minimum amount of sadness, and at least a medium-low level of complaints.
P4   Depressive posts have almost no intent of marketing, positivity, or happiness, a polarity that is either negative or neutral, and at least a high level of abusive content.
P5   Depressive posts have at least a minimum amount of sadness and complaints, at most a medium-low level of intent of spam and neutrality, at most a medium-low level of query intent, and almost no happiness.
P9   Non-depressive posts have at least a medium-low level of neutrality and a low level of sarcasm, and at most a medium-high level of non-sarcastic content and a low level of negativity.
P10  Non-depressive posts have at most a low level of sadness, negativity, and intent of news or query, a polarity that is either positive or negative, and at most a medium-low level of excitement.
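Operationally, each contrast pattern is a conjunction of threshold conditions over the sentiment and emotion features, which makes its evaluation on a post straightforward. The following is a hedged sketch of that idea; the feature names and thresholds are illustrative (loosely inspired by P1), not the paper's exact pattern items:

```python
def matches(pattern, features):
    """True if a post's feature vector satisfies every item of a contrast pattern.

    pattern:  list of (feature_name, operator, threshold) triples.
    features: dict mapping feature names to values in [0, 1].
    """
    ops = {"<=": lambda x, t: x <= t, ">=": lambda x, t: x >= t}
    return all(ops[op](features[name], t) for name, op, t in pattern)

# Illustrative pattern: some sadness, almost no happiness or excitement.
p1 = [("sadness", ">=", 0.02),
      ("happiness", "<=", 0.05),
      ("excitement", "<=", 0.05)]

post = {"sadness": 0.30, "happiness": 0.01, "excitement": 0.02}
print(matches(p1, post))  # True
```

The support of a pattern for a class (as reported in Table 9) is simply the fraction of that class's posts for which `matches` returns True, which is what makes the patterns directly auditable by a human expert.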

Conclusions and Future Work
Depression detection has become an essential task, as depression poses multiple risks to individuals, society, and the economy. Due to the lack of specialists relative to the number of patients, this task has become difficult, and the problem has escalated to a global scale. Since there is a link between language use and signs of depression, social media posts are used to detect depression. The literature shows that most of these solutions represent text using objective features, such as the number of pronouns, the count of a certain word, or the use of certain phrases inside the text.
However, as far as we know, state-of-the-art proposals do not provide their results in a language close to that of the human expert, which is essential for helping decision makers in the application area. Hence, in this paper, we proposed a new representation of text based on sentiment analysis and emotions, allowing depressive and non-depressive posts to be discriminated in a language close to that of human experts. The aim was to provide experts with information from social media to aid in the diagnosis of users who may not know they have depression. We also proposed an understandable model based on our representation and pattern-based classification, obtaining a model that is both accurate and understandable to human experts.
Our proposed model outperforms most of the five state-of-the-art models for depression detection that we compared against. Additionally, it provides insights into the sentiment features that define a depressive or a non-depressive post, giving more information to an expert or even to the post's author. Moreover, based on the statistical tests over the F1 and AUC metrics, our model statistically outperforms the Random Forest, Naive Bayes, and AdaBoost models using the parser-tree, VAD and Topics, and BOW representations, and obtains statistically similar results to the logistic regression models using the Ensemble of BOWs and Handcrafted features representations. Consequently, we can conclude that our proposed model, together with the representation based on sentiments and emotions, provides classification results for predicting depressive posts that are as good as or better than those of the compared models, while remaining explainable.
In future work, we plan to explore new representations, including features directly related to the medical symptoms of depression, as well as fuzzy pattern classification. The assessment of the patterns and their interpretation by psychology experts is also part of our future work on this topic.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.