ANALYZING IMPORTANT STATISTICAL FEATURES FROM FACIAL BEHAVIOR IN HUMAN DEPRESSION USING XGBOOST

Major Depressive Disorder (MDD) is known as one of the most prevalent mental disorders, whose symptoms can be observed in changes in facial behavior. Previous studies have attempted to build Machine Learning (ML) models to assess depression severity using such features, but few have utilized these models to determine the key facial behaviors for MDD. In this study, we used video data to assess the severity of MDD and to determine important features based on three approaches (XGBoost, Spearman's correlation, and t-test). We employed the Facial Action Coding System (FACS) framework, which allows visual data such as changes in facial behavior to be modeled as time series data. The results show that the XGBoost model obtained the best results when trained using features selected through the t-test statistical method, with 5.387 MAE, 6.266 RMSE, and 0.042 R². The majority of the important features discovered by the three approaches are the first derivatives of the 3D facial landmark coordinates (Features 3D) and Action Units (AU) around the regions of the left eye, right cheek, and lips.


INTRODUCTION
Major Depressive Disorder (MDD) is known as one of the most prevalent mental disorders, affecting physiological and psychological aspects of humans, with sadness or prolonged sorrow among its symptoms. Untreated MDD cases may induce anxiety, feelings of isolation, and even suicidal thoughts, which may later lead to illegal drug use or suicide [1]. MDD can be identified from the symptoms experienced by the patient, one of which is changes in facial behavior. Research by Bodenschatz et al. concluded that depressed subjects show significantly more sad facial expressions than healthy subjects [2].
Usually, MDD is evaluated through self-reporting of the patient's symptoms and by filling in the eight-item Patient Health Questionnaire (PHQ-8). However, the PHQ-8 data is subjective, and a psychologist's diagnosis is influenced by their level of expertise [3]. The advent of Artificial Intelligence (AI) technology has made it possible for AI models to provide more objective results when trained with the appropriate data. AI holds huge potential in medical fields [4], especially for Computer-Aided Diagnosis (CAD) technology.
This technology is intended to assist doctors in making more accurate and precise diagnoses and provide appropriate follow-up actions for patients [3].
In this study, we performed depression severity assessment using the Facial Action Coding System (FACS), which allows visual data such as changes in facial behavior to be modeled as time series data. Song et al. had performed feature selection using Correlation-based Feature Selection (CFS), named "Voted-CFS", whose details are presented in Related Works [5]. However, the important statistical features discovered from the raw FACS data, as well as from their first and second derivatives, were not listed, meaning that questions related to feature importance had yet to be answered. In other words, the explainability of their proposed model was not elaborated. Therefore, our study focuses on the first derivatives, i.e., the changes in each feature per frame.
We used XGBoost, a widely used model for Explainable Artificial Intelligence (XAI), to determine important statistical features from these changes. Then, the results were compared with variables that have a high Spearman's correlation with the PHQ-8 scores, as well as variables that differ significantly between depressed and normal subjects according to a pooled t-test. Features found to be important by Spearman's correlation and the t-test were also fed into XGBoost to test whether the selected first-derivative features allow XGBoost to make better predictions of depression severity. All in all, the main contribution of this study is determining which statistical features of the first derivatives of FACS variables can be considered important for Machine Learning (ML) models in assessing MDD severity, through statistical and XAI methods. To the best of our knowledge, no such research has been conducted. In other words, the findings of this research can serve as the foundation for developing more advanced and accurate ML or deep learning models for video-based MDD severity assessment, although the results of this study still require enhancement before implementation in mental disorder CAD systems.

RELATED WORKS
Ray et al. conducted a study on depression with text, audio, and video analysis [10]. They used Long Short-Term Memory (LSTM) to identify low-level descriptors that focus on pose, gaze, and FAU. The proposed method increased accuracy on the video modality, but it was still inferior to the approaches using audio and text data. Similarly, Yoon et al. used Gaussian Mixture Model (GMM) clustering and Fisher vectors for visual data, while the classification task was performed by a Support Vector Machine (SVM) and neural networks [11]. Their research used a dataset called D-Vlog, a collection of 961 vlog videos from YouTube, and provided higher depression detection performance than models trained with DAIC-WOZ. In addition, Zhang also conducted research on MDD using feature selection, with the purpose of automatically detecting depression from text, audio, and video. That study used the XGBoost model for video-based depression detection to perform both feature selection and detection, and concluded that XGBoost feature selection can achieve the best performance for the video modality [12], similar to the research by Eteng, which focused on audio and video using a random forest classifier, SVM, and XGBoost. There, the best accuracy was obtained by the XGBoost algorithm: 0.82 for 2-bin classification and 0.639 for 3-bin classification [13]. All of these related works show that deep learning is popular in this task and capable of producing remarkable accuracy. However, deep learning models are still classified as 'black boxes', and extra methods are needed to extract feature importance. In addition, the results obtained for video-based depression classification generally still show relatively low accuracy or high RMSE values, possibly due to the presence of noisy variables. Therefore, the research conducted by Song et al. can serve as a baseline by proving that statistical features used only with ML are able to provide performance that is not inferior to deep learning.

RESEARCH METHODOLOGY
3.1. XGBoost. XGBoost is currently the most popular algorithm among Gradient Boosting Methods in many applications [15]. XGBoost is an advanced Gradient Boosted Decision Tree (GBDT) method that can efficiently deal with large-scale problems using very limited computing resources. Since its introduction, XGBoost has won various machine learning competitions such as those on Kaggle and the Knowledge Discovery and Data Mining (KDD) Cup [16,17], and has become a powerful and efficient solution for various classification problems [17].
XGBoost was developed to run roughly ten times faster than other gradient boosting implementations.
XGBoost is a GBDT ensemble approach. In a regression tree, the leaf nodes carry the output scores, whereas the interior nodes carry the values of the test variables. The prediction result is the sum of the scores produced by the K trees, as shown in the equation:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},

where \hat{y}_i represents the model prediction value, \mathcal{F} is the space of Classification and Regression Trees (CART), f_k(x_i) = \omega_{q(x_i)} is the score assigned to sample x_i, each tree's structure is represented by q, the number of trees is represented by K, and each f_k corresponds to an independent tree structure q with its leaf weights \omega. XGBoost is a model with many parameters, meaning that more time is required to tune them. Unlike GBDT, XGBoost adds a regularization term to the objective function to avoid overfitting. The objective function is:

Obj = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k} R(f_k) + C,

where L(y_i, \hat{y}_i) is the loss function, R(f_k) indicates the regularization term of tree f_k, and C is a constant term that can be conveniently eliminated. The regularization term R(f_k) is:

R(f_k) = \alpha H + \frac{1}{2} \eta \sum_{j=1}^{H} w_j^2,

where \alpha indicates the complexity of the leaves, H represents the number of leaves, \eta denotes the penalty parameter, and w_j denotes the output of leaf node j. In particular, a leaf denotes a predicted category following the classification rules and a leaf node denotes an indivisible tree node. In addition, rather than employing only first-order derivatives as in GBDT, XGBoost utilizes a second-order Taylor expansion of the objective. If the mean squared error is utilized as the loss function, the objective at iteration k becomes:

Obj \approx \sum_{i} \left[ p_i f_k(x_i) + \frac{1}{2} q_i f_k^2(x_i) \right] + R(f_k) + C,

where q(x_i) denotes the function that assigns data points to the corresponding leaf, and p_i and q_i denote the first and second derivatives of the loss function, respectively. The final loss value can be calculated by summing the loss values of the leaf nodes, because each sample corresponds to exactly one leaf node of the tree.

As a result, the objective function is often written as follows:

Obj = \sum_{j=1}^{H} \left[ P_j w_j + \frac{1}{2} (Q_j + \eta) w_j^2 \right] + \alpha H,

where P_j = \sum_{i \in I_j} p_i, Q_j = \sum_{i \in I_j} q_i, and I_j denotes all samples in leaf node j. In other words, optimizing the objective function reduces to selecting the minimum of a quadratic function for each leaf weight. Besides, XGBoost also has a better ability to overcome overfitting problems.

3.2. Spearman's correlation coefficient.
Spearman's correlation coefficient, also known as Spearman's rank correlation coefficient, measures the degree of association between two variables based on the ranks (ordering) of their values. Spearman's correlation is a powerful way to measure the monotonic association between variables [18]. In this study, we compared every feature from the first derivative with the PHQ-8 score, which we encoded into several categories referred to as the "encoded PHQ-8 Score".
Spearman's correlation is related to Pearson's correlation coefficient [18]. However, Spearman's correlation (ρ) is convenient to use and applies to rank-based data. Spearman's correlation can also detect linear or non-linear monotonic relationships, while Pearson's correlation is only suitable for evaluating linear correlation [19].
Mathematically, Spearman's correlation measures the individual coefficient between two variable columns [18]. Its value ranges from -1 to +1, indicating the strength of the monotonic relationship [20]. The Spearman's correlation coefficient can be expressed as:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},

where d_i is the difference between the ranks of the paired values of the variables x and y, which are two random variables each with n elements.
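The rank formula above can be implemented directly; the following is a minimal pure-Python sketch, valid only when there are no tied ranks.

```python
def spearman_rho(x, y):
    """Spearman's rank correlation via the difference-of-ranks formula
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); assumes no ties."""
    n = len(x)

    def ranks(values):
        # Rank 1 for the smallest value, rank n for the largest
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A perfectly monotonic but non-linear relationship still gives rho = 1
assert spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]) == 1.0
```

Because only ranks enter the formula, the non-linear pairing above still yields a perfect correlation, which is exactly the property that distinguishes Spearman's from Pearson's coefficient.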

3.3. T-test (Pooled t-test).
In this study, we used a voted version of the pooled t-test, referred to as the "voted t-test" for the rest of this paper. The t-test is a parametric statistical test that measures how significantly the means of two groups differ. In this method, we test for significant differences between normal subjects and depressed subjects. The t-test computes p-values and compares them to the threshold of 0.05 (α = 5%) with the following criteria:
• If the p-value is ≥ 0.05, the feature does not differ significantly between normal and depressed subjects.
• If the p-value is < 0.05, the feature differs significantly between normal and depressed subjects [21,22].
P-values are calculated from t-values, which are obtained using the pooled (equal-variance) formula:

t = \frac{mean_1 - mean_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}, \quad s_p^2 = \frac{(n_1 - 1)\,var_1 + (n_2 - 1)\,var_2}{n_1 + n_2 - 2},

where mean_1 and mean_2 are the sample sets' average values, var_1 and var_2 are the sample variances, and n_1 and n_2 are the numbers of records in each sample set. In this study, we drew three sample sets (random state = 0, 1, 2) and, because the data are imbalanced between the normal and depressed categories (more records in the normal category), we undersampled the data, resulting in 30 samples each for the normal and depressed subjects. Hence, the method is referred to as the "voted t-test", as the test was conducted on multiple groups sampled from the original population.

This database provides recordings in the form of audio, video, and psychiatric responses in text form [24]. To evaluate the severity of the patients' depression, interviews were conducted using the PHQ-8. As the standard to differentiate the groups, depressed patients have a PHQ-8 score of 10 or higher.
• Facial Action Units (FAU)
The FAU is an important component in analyzing a person's facial expressions [26].
Each Action Unit (AU) has dots that mark parts of the face. The 30 AU points are divided into two groups, the upper face and lower face AUs, listed in Table 1.
• Gaze
The gaze features are expressed with the axes aligned to the camera and the camera located at (0, 0, 0) in the (X, Y, Z) axes. The output is 4 vectors divided into two groups: the first group consists of two vectors that describe the gaze direction of both eyes, and the second group consists of two vectors that describe the gaze in head coordinate space (meaning the direction of the vectors is indicated based on the gaze, not the head position).

• Pose
This file consists of 6 items: Tx, Ty, and Tz, which represent the head position coordinates in millimeters, and Rx, Ry, and Rz, which represent the head rotation in radians following the Euler angle convention.

Data Preprocessing.
As the output of OpenFace, the software used to extract the FACS features, was stored in separate txt files in the dataset, these files were first combined into a data frame for each patient. Next, we cropped by timestamp (start time, stop time), using the transcript as the cut reference. Then, we exported the result under the same name so that it overwrote the merged CSV file, and we performed the first-derivative step for each patient file, i.e., we computed the difference between every two consecutive rows of the patient data (row 0 and 1, row 1 and 2, row 2 and 3, and so on).
During the first-derivative step, an anomaly was found in the combined file for subject number 432: rows 0 to 139 contained data that could not be processed due to a technical error during the recording of the interview, so those rows were removed.
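The first-derivative step, together with the per-patient aggregation it feeds, can be sketched with pandas as follows. The column names and values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical frame-level FACS data for one patient
frames = pd.DataFrame({
    "timestamp": [0.00, 0.03, 0.07, 0.10],
    "AU01_r":    [0.20, 0.50, 0.40, 0.40],   # an Action Unit intensity
    "X_0":       [1.00, 1.10, 1.05, 1.20],   # a 3D landmark coordinate
})

# First derivative: difference between every two consecutive rows
deriv = frames.drop(columns="timestamp").diff().dropna()

# Per-patient aggregation: mean, std, min, and max of each feature column
agg = deriv.agg(["mean", "std", "min", "max"])
```

Applying the four aggregate statistics to every derivative column across all patients is what produces the wide aggregated table used for model training.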
Next, we performed an aggregation based on the first derivative result file for each patient.
At this step, we calculated the mean, standard deviation, min, and max values of each feature column for each patient. This aggregation stage produced 971 columns, which we then exported to CSV.

(1) XGBoost predictions without feature selection
We used the aggregated data in the train and dev folders, where the train folder served as fitting (training) data for the XGBoost model and the dev folder was used to validate the model's performance throughout the training process. After that, we tested the aggregated results file in the test folder with the trained XGBoost model.
(2) XGBoost predictions based on the features from the t-test
We divided the data into two groups based on the binary categorization of the PHQ-8: normal subjects and depressed subjects. Afterwards, we sampled the normal subjects with random state = 0, 1, 2 and used the voted t-test to select features with p-values < 0.05.
(3) XGBoost predictions based on the features from Spearman's correlation
For Spearman's correlation, we first encoded the PHQ-8 score into an ordinal category. The PHQ-8 score ranges from 0 to 24 points: scores from 0 to 4 are set to 0, representing the normal category; 5 to 9 are set to 1, representing mild depression; 10 to 14 are set to 2, representing moderate depression; 15 to 19 are set to 3, representing moderately severe depression; and 20 to 24 are set to 4, representing severe depression [28]. Then, we computed the correlation between each feature and the encoded PHQ-8 score and used Spearman's correlation to select features with p-values < 0.05.
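The PHQ-8 encoding and the two feature-selection routes described above can be sketched as follows. This is a sketch using scipy on synthetic arrays; the function names, the all-votes acceptance rule, and the data shapes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import spearmanr, ttest_ind

def encode_phq8(score):
    """Encode a PHQ-8 total (0-24) into the five severity bins (0-4)."""
    return min(score // 5, 4)  # 0-4:0, 5-9:1, 10-14:2, 15-19:3, 20-24:4

def voted_ttest_select(X_norm, X_dep, n=30, states=(0, 1, 2), alpha=0.05):
    """Sketch of the 'voted t-test': undersample the majority (normal)
    class with each random state, run a pooled t-test per feature, and
    keep features significant in every vote (acceptance rule assumed)."""
    votes = np.zeros(X_norm.shape[1], dtype=int)
    for s in states:
        rng = np.random.default_rng(s)
        sub = X_norm[rng.choice(len(X_norm), size=n, replace=False)]
        p = ttest_ind(sub, X_dep, equal_var=True).pvalue
        votes += (p < alpha)
    return set(np.where(votes == len(states))[0])

def spearman_select(X, y_encoded, alpha=0.05):
    """Keep features whose Spearman correlation with the encoded
    PHQ-8 score has a p-value below alpha."""
    return [j for j in range(X.shape[1])
            if spearmanr(X[:, j], y_encoded).pvalue < alpha]
```

In the actual pipeline, the surviving feature indices from either route are then used to restrict the columns fed to XGBoost.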

RESULTS AND DISCUSSION
4.1. Feature Selection. In feature selection, we used the first derivative, which means we investigated whether the speed at which these landmarks or features change position differs significantly between depressed and normal subjects. Depressed subjects usually make more sad facial expressions than normal subjects, so the speed at which landmark positions change tends to be lower in depressed subjects [2]. The top 20 features selected based on the p-values obtained from the t-test and Spearman's correlation are listed in the following explanation.
4.2. Training Results. In Table 5, the XGBoost + t-test column represents the test results from training the XGBoost model using the t-test-based feature selection shown in Table 3, while the XGBoost + Spearman's correlation column represents the test results from training the XGBoost model using the Spearman's-correlation-based feature selection shown in Table 4.
According to Table 6, it can be concluded that the statistical t-test is the best of the three methods (t-test, Spearman's correlation, and XGBoost) for selecting the important features. It should be noted that, due to the data imbalance between depressed and normal subjects, the deployed voted t-test used a vote across three random states, i.e., three population pairs of depressed and normal subjects, for feature selection, to minimize the effects of sampling bias. Additionally, training the XGBoost model without feature selection is the least recommended of the three approaches. Therefore, it can be inferred that some of the input features may have introduced noise into the data, which was removed through feature selection.

Feature Importance Results Based on Each Approach. In this section, we visualize the top 20 important features for each approach in Figure 3 and also list them in Table 7. It can be seen that the selected feature space still has high dimensionality relative to the number of samples in the DAIC-WOZ dataset; if the dimensionality is high, the complexity of the models built for depression detection is also high [9].
Additionally, the more features used, the greater the possibility of noise appearing in the data.

CONCLUSION
In this study, selecting FACS features using the t-test and fitting them to XGBoost produced the highest R² value compared to the other methods. The R² value remains relatively small, likely due to limited data exploration; for example, we did not use the FaceHOG feature set as in Rathi et al.'s research [9]. Nonetheless, statistical methods can still inform the assessment of a person's level of depression. Besides, we used the first derivative to validate whether changes in the values of the FACS variables are associated with MDD levels, and found that the most impactful feature group for detecting MDD severity is Features3D, accounting on average for about 80% of the top-20 feature list of each approach. This research also contributes the important features identified by the three approaches, which will be useful for other researchers: by only using features that are relevant to the model, researchers can shorten research time and obtain better performance. However, in this study, the available dataset was still relatively small and the analysis was limited to the first derivatives. In the future, researchers can conduct MDD detection research using the raw data combined with the second derivative, or use the important features obtained here with deep learning, because although conventional ML is still reliable, deep learning still provides better results [29]. Besides, deep learning has produced outstanding performance in recent years thanks to constantly evolving technology, larger datasets, and deeper network architectures [30].