ARPNet: Antidepressant Response Prediction Network for Major Depressive Disorder

Treating patients with major depressive disorder is challenging because it takes several months for antidepressants prescribed for the patients to take effect. This limitation may result in increased risks and treatment costs. To address this limitation, an accurate antidepressant response prediction model is needed. Recently, several studies have proposed models that extract useful features such as neuroimaging biomarkers and genetic variants from patient data, and use them as predictors for predicting the antidepressant responses of patients. However, it is impossible to utilize all the different types of predictors when making a clinical decision on what drugs to prescribe for a patient. Although a machine learning-based antidepressant response prediction model has been proposed to overcome this problem, the model cannot find the most effective antidepressant for a patient. Based on a neural network, we propose an Antidepressant Response Prediction Network (ARPNet) model capturing high-dimensional patterns from useful features. Based on a literature survey and data-driven feature selection, we extract useful features from patient data, and use the features as predictors. In ARPNet, the patient representation layer captures patient features and the antidepressant prescription representation layer captures antidepressant features. Utilizing the patient and antidepressant prescription representation vectors, ARPNet predicts the degree of antidepressant response. The experimental evaluation results demonstrate that our proposed ARPNet model outperforms machine learning-based models in predicting antidepressant response. Moreover, we demonstrate the applicability of ARPNet in downstream applications in use case scenarios.


Introduction
Treatment of major depressive disorder (MDD), which is one of the most common mental illnesses, is challenging. The main reason is that it takes a long time (8-12 months) for the effects of an antidepressant to be seen in a patient. It also takes several months for psychiatrists to find that some drugs are ineffective. As a result, only 11-30% of patients reach clinical remission from initial treatment [1][2][3][4]. This limitation of MDD treatment may result in increased risks and treatment costs.
To address this limitation, an accurate prediction model that can predict antidepressant responses is needed.
Recently, several studies that extract useful predictors for predicting the antidepressant responses from patient data such as genetic and brain imaging information [5,6] have been proposed. For example, neuroimaging biomarkers [7], genetic variants [8][9][10][11][12], and DNA methylation [13] have been proven to be effective in improving prediction performance. However, it is impossible for even an experienced psychiatrist to utilize all predictors to make a clinical decision on what drugs to prescribe for a patient.
To overcome this limitation, a machine learning-based antidepressant response prediction model [14] has been proposed. Machine-learning based approach can be used to detect high-dimensional patterns of all potential predictors extracted from various patient data. The model of Chekroud et al. [14] extracts useful predictors from the data of the multicenter clinical trial on MDD (STAR*D) [15][16][17][18] and predicts the antidepressant "citalopram" response in MDD patients. Using the Elastic network model [19], the model of Chekroud et al. [14] first selects useful predictors such as sociodemographic features and depression severity checklists from patient data for predicting antidepressant responses. The model then predicts whether a patient will reach clinical remission based on the 25 selected predictors.
However, the above antidepressant response prediction model has some limitations. The prediction model is designed for predicting the response of only one antidepressant, "citalopram" (Limitation 1). In addition, the model only predicts whether the patient is responding to the antidepressant. The model cannot measure the degree of antidepressant response (Limitation 2). This model is limited as it cannot be used for finding the most effective drugs for MDD patients among many antidepressant candidates (Limitation 3).
To address these limitations, we propose the Antidepressant Response Prediction Network (ARPNet) model based on a neural network architecture. Our proposed ARPNet model is designed for real-world antidepressant prescription scenarios where the prescription consists of a combination of one or more antidepressants (related to Limitation 1). When an antidepressant combination and patient data including the demographic, genetic, and MRI information of a patient are given, ARPNet can predict the patient's Hamilton Depression Rating Scale (HAM-D) score at the next visit (related to Limitation 2). Our ARPNet model is capable of prescribing the most effective antidepressant combination by predicting not only whether the patient will respond to the prescribed antidepressant combination but also the degree of response (HAM-D score) (related to Limitation 3). Moreover, ARPNet trains patient representation vectors that capture patient characteristics, and antidepressant representation vectors that capture the features of antidepressants. The trained patient and antidepressant representation vectors could be used in downstream applications such as drug discovery for new antidepressant development and patient clustering in clinical decision support systems. The vectors can also be used to find similar patients who are likely to respond to the same drugs.
The contributions of our study are summarized as follows.
• We propose an antidepressant response prediction network model called ARPNet based on a neural network architecture. Unlike the previous models, our proposed model predicts whether the patient will reach clinical remission from depression, and the degree of antidepressant response. By predicting the degree of antidepressant response, our proposed model can prescribe the most effective antidepressants. • ARPNet can be used in a real-world antidepressant prescription scenario where psychiatrists prescribe one or more antidepressants to patients with MDD. ARPNet predicts the antidepressant response to a given combination of antidepressants, which could not be done by previously proposed models.

•
Through the literature-based and data-driven feature selection process, ARPNet extracts useful features from patient data including the demographic, genetic, and MRI information of patients.

•
The trained patient and antidepressant representation vectors of ARPNet could be utilized in various downstream applications such as drug discovery and clinical decision making.
The remainder of this paper is organized as follows. In Section 2, we introduce the data, including the prescription records of patients with MDD, which is used for antidepressant response prediction. We describe the feature selection process which involves extracting useful predictors from patient data. We then describe our proposed ARPNet model which uses selected predictors and the prescription records of patients. In Section 3, we conduct experiments to assess ARPNet, and analyze the experimental results. In Section 4, we conclude the paper.

Materials and Method
In this section, we first describe the MDD patient data used in our study. Next, we introduce our feature selection process, which involves selecting useful features as predictors for improving performance in antidepressant response prediction. Then, we state the antidepressant response prediction task. Last, we introduce our proposed model ARPNet.

Data Description
In this study, we used the data of 121 patients with MDD collected from Korea University Anam Hospital, Seoul, Korea. All patients who are ethnically Korean were examined by trained psychiatrists using a structured clinical questionnaire. The depression severity of the patients was measured using the 17-item Hamilton Depression Rating (HAM-D 17) scale and 21-item Hamilton Depression Rating (HAM-D 21) scale at every visit. In this study, we employed the HAM-D 17 scale for the predictions on the degree of the antidepressant response. The collected MDD patient data include the demographic, genetic, and MRI information of only patients who consented to the use of their data for this study. Some patients did not consent to the use of certain data, thus the data available for each patient may vary. If one patient's ith feature is missing, we randomly chose one of the ith features of all the remaining patients, and used it as the ith feature of the patient. Specifically, 67, 71, 96, and 91 patients consented to the use of their demographic, MRI, and genetic information, respectively.
We confirm that the Institutional Review Board of Korea University Anam Hospital approved this study (ED16162). The statistics are summarized in Table 1.

Feature Selection
We used a two-step feature selection process to extract the most useful features from patient data. The first step involved selecting features identified by a literature survey. The second step involved further refining the features identified in the first step using data-driven methods. We did not perform feature selection on demographic information because it already contains features considered useful by psychiatric experts.

Literature-Based Feature Selection
From the MRI and genetic information, we selected the following three literature-based features. Neuroimaging biomakers. Using MRI information, we conducted a literature survey to identify neuroimaging biomarkers of treatment response in patients with MDD [7]. Among the reported biomarkers, we found 94 features in our MRI information. We selected 62 MRI features excluding 32 features which are not found in more than 30% of patients.
Genetic variants. Several studies [8][9][10][11][12] have been conducted to identify the effects of genetic variants on antidepressant response. The studies report that variation of 27 significant genes such as 5-HTTLPR, BDNF, and SLC6A4 is associated with antidepressant response.
We found additional genes using biomedical literature-based tools such as BEST [20] and VarDrugPub [21]. BEST, which is a biomedical entity search tool, returns biomedical entities frequently mentioned with query terms in abstracts of published articles. We searched for all the antidepressant prescriptions on BEST, and obtained the Top 23 ranked genes. We also found 12 genes related to antidepressants using VarDrugPub, which is a database containing relationships among genomic variants, drugs, and diseases extracted from the literature.
DNA methylation. Genetic variants and epigenetic modification known as DNA methylation [13] are known to affect the treatment for depression. The authors reported that SLC6A4, BDNF, IL11, and MAOA genes and their CpG sites are associated with various antidepressants. We selected 136 CpG sites as DNA methylation features from the four genes.

Data-Driven Feature Selection
We identified and extracted useful features from the literature and filtered unnecessary features using the available APIs. Using Elastic net [19], which is one of the popular regression analysis methods, we selected the more informative features among the extracted features for antidepressant response prediction. The estimates from the elastic net method are defined as follows.
where X is an input vector, Y is a true output value and β is the weight of the input vector. We used Elastic net to select the more informative features for treatment of patients with MDD from each of the three feature groups. We concatenated each feature with a one-hot encoded 14-dimensional antidepressant vector as input X where the number of antidepressants in our patient data is 14.
The difference between HAM-D scores at initial and last visits of patients were used as output Y.
For each feature group, we split the data into five subsets and used one subset as the test set and the remaining subsets as the training set to reduce the risk of overfitting. Then, we conducted 10-fold cross validation on the training set to find the hyper parameters of each Elastic net-based feature selection model. After selecting hyper parameters for each feature selection model, we validated the models on the test set and coefficients of the feature. By averaging the coefficients of the features from the five subsets, we selected 20 features with the 10 highest coefficients and the 10 lowest coefficients for each of the feature groups. The coefficients of the selected features are summarized in Table 2.

Problem Statement
When a patient sees a psychiatrist, the psychiatrist may prescribe an antidepressant to the patient based on the data of the patient. To prescribe the most effective antidepressant, the psychiatrist predicts the antidepressant response of the patient. Antidepressant response prediction is defined with this real-world antidepressant prescription scenario in mind.
Based on data P including the demographic, MRI, and genetic information of a patient, the current HAM-D score H of the patient, and the time interval ∆T until the next visit, antidepressant response prediction models predict the HAM-D score of the patient at the next visit for the given antidepressants A. The demographic information includes height, weight, and sex information of patients. Real-valued information such as height and weight is represented with log-scale. Discriminated information such as sex and occupation is encoded with a one-hot vector. The predicted HAM-D scoreŶ is computed as follows: where θ is a set of parameters of the prediction model. The parameters were trained in the training phase to minimize the loss that is defined in the next subsection. Figure 1 shows the architecture of ARPNet. Our source code that implements ARPNet is available at http://github.com/dmis-lab/arpnet. Our ARPNet model consists of the following three components: a patient representation layer, an antidepressant prescription representation layer, and a prediction layer. The notations used in this paper are summarized in Table 3.  The patient representation layer is designed to capture an abstractive patient feature from patient data. The input X P ∈ R 127+20+20+20+1+1=189 of this layer is constructed as the concatenation of the current HAM-D score H ∈ R and time interval ∆T ∈ R, which are the useful features selected in Section 2.2, as follows.

ARPNet
where V demo ∈ R 127 , V bio ∈ R 20 , V gene ∈ R 20 , and V methyl ∈ R 20 are the representation of demographic information, neuroimaging biomarkers, genetic variants, and DNA methylation, respectively. To extract the abstractive patient feature from interactions between selected features, the concatenated input X p is forwarded to a non-linear layer with a non-linear function such as ReLU, tanh, or sigmoid. In this paper, we use ReLU as the non-linear function.
where V P ∈ R d is the d-dimensional patient representation vector, and W P ∈ R 189×d and b P ∈ R d are the weight matrix and bias term of the patient representation layer, respectively. σ(·) is the ReLU function.
The two major contributions of the computed patient representation vector are summarized as follows. The patient feature representation is automatically trained from patient data to improve the performance in antidepressant response prediction. Moreover, the similarities between patients could be computed by representing patients as real-valued vectors. The similarities between patients are useful in downstream applications such as a clinical decision support system. For example, physicians can find patients who are similar to a given patient using patient representation vectors. The patient representation vectors are trained for antidepressant response prediction. Therefore, patients who have similar vector representations are predicted to respond to the same drugs.

Antidepressant Prescription Representation Layer
A prescription for a patient consists of one or more antidepressants. The prescription representation layer is designed to capture antidepressant features from one antidepressant or the combination of antidepressants. We first associate each antidepressant A ∈ A with a d -dimensional real-valued vector. A is the set of antidepressants in our patient dataset. We denote an antidepressant representation matrix that consists of the d -dimensional antidepressant vectors as E A ∈ R |A|×d . The prescription is represented as a one-hot vector as follows.
x A i = 1, if the ith antidepressant is in the prescription, 0, otherwise.
Finally, the prescription representation is computed as follows.
where E A is the antidepressant representation look-up function, andX A denotes the representations of the drugs in the prescription X A . V A ∈ R d is the sum of the antidepressant representations inX A and is the final antidepressant prescription representation that is utilized in the prediction layer described in the next subsection.

Prediction Layer
The prediction layer is a linear regression approach based on the patient representation and the prescription representation. An input of this layer is constructed by concatenating the patient representation and the prescription representation as follows.
where X ∈ R d+d is the input of the prediction layer. Then, the input X is forwarded to the regression layer as follows.
where W ∈ R (d+d )×1 and b ∈ R 1 are the weight matrix and the bias term of the regression layer, respectively. To optimize the parameters θ of our proposed model, we define the loss function of our model using Mean Square Error (MSE) as follows.
where Y i is a true HAM-D score of the ith prescription record data. The model parameters are randomly initialized with a Gaussian distribution and trained by minimizing the MSE using the Adam optimizer [22] with a learning rate 0.001 and a batch size 16. To address the overfitting problem, we apply dropout [23] and L2 regularization. Based on cross-validation, which is described in the next section, we find the optimal hyperparameter setting of our proposed ARPNet with grid search. We implemented our proposed model using TensorFlow v1.6.0 (https://www.tensorflow.org) and Python v2.7.12 (https://www.python.org). Our experiments were conducted on Ubuntu 16.04.5 with two TITAN X (Pascal) GPUs.

Experiments and Results
In this section, we evaluate our proposed ARPNet model by comparing it with several machine learning-based baseline models. We conducted experiments to answer the following research questions: RQ1. Does ARPNet outperform the baseline models in predicting the degree of antidepressant response?
RQ2. Can ARPNet predict whether a patient reaches clinical remission, which is a task proposed in previous work?
RQ3. Is ARPNet useful for prescribing the most effective antidepressant combination to patients? RQ4. How can we use the trained patient and antidepressant representation vectors in downstream applications?
We describe the experimental setting and report the evaluation results on the prediction of the degree of antidepressant response (Section 3.1) and the prediction of whether a patient reaches clinical remission with the given antidepressant (Section 3.2). We introduce the use case scenarios of our ARPNet model. (Sections 3.3 and 3.4).

Dataset Preparation
The prescription records of patients contain the HAM-D scores of the patients measured at the initial visit and one, four and eight weeks (or six months) after the initial visit. Some prescription records are missing the prescription record for the week in which patients did not visit. The data d = (P, A, H, ∆T) used in this experiment were generated from all pairs of two consecutive visit records (r i and r i+1 ) of a patient when the prescription records R P = {r 1 , r 2 , ..., r |R P | } of the patient are given. ∆T is the interval between the ith and i + 1th visits. Our dataset D consists of 273 data generated from 395 records of 121 patients. The randomly sampled 10% of data were used as the test set, and the remaining data were used as the training set. The statistics of the constructed dataset D are summarized in Table 4.

Baselines
For the qualitative evaluation, we compares the performance of ARPNet with that of the following six machine learning baseline models: Support Vector Machine Regressor with a linear kernel (Linear SVR), Ridge Regressor, Gradient Boosting, Multi-Layer Perceptron (MLP) Regressor, K-Nearest Neighbors Regressor, and Random Forest Regressor. We concatenates the predictor features selected in Section 2.2, one-hot encoded prescribed antidepressants, current HAM-D score of a patient, and the interval between visits as an input of the machine learning baseline models.
To find the optimal hyper-parameters for our proposed ARPNet model and the baseline models, we performed five-fold cross validation on the training set.

Metric
Antidepressant response prediction is a regression task. To assess the performance of the models, we employed Root Mean Square Error (RMSE) and R-squared, which are the most common evaluation metrics used for regression models. RMSE is computed as follows.
R-squared is computed as follows.
where y denotes the average value of actual HAM-D scores in our dataset D test . A lower RMSE indicates that the model achieves better prediction performance. A higher R-squared also indicates better performance.

Results
To reduce the randomness of the machine learning-based models that randomly initialize their parameters, we performed the training and evaluation five times using the optimal hyper-parameters obtained from cross-validation. We then report the mean values of the performance scores.
The left side of Table 5 shows the evaluation results on the prediction of the degree of antidepressant response (Task 1). Our model ARPNet outperforms all the baseline models in terms of RMSE and R-squared. Specifically, ARPNet obtains 27.9%, 12.2%, 25.2%, 14.9%, and 12.8% performance improvements in terms of RMSE over LinearSVR, Ridge, Gradient Boosting, K-Nearst Neighbors, and Random Forest, respectively. ARPNet also obtains 372.9%, 31.5%, 174.0%, 44.4%, and 34.1% performance improvements in terms of R-squared over Support Vector Machine Regressor with a linear kernel (Linear SVR), Ridge Regressor, Gradient Boosting, Multi-Layer Perceptron (MLP) Regressor, K-Nearest Neighbors Regressor, and Random Forest Regressor, respectively. Table 5. Experimental evaluation results. The performance improvement of ARPNet over the baseline with the best performance is reported in the Improv. row. These observations demonstrate that ARPNet is more accurate than the baseline models in predicting the degree of antidepressant response (Answer to RQ1).

Task 2: Prediction of the Clinical Remission of Patients
Prediction of the clinical remission of patients is a binary classification task. We assumed that a patient responded to the prescribed antidepressant if the HAM-D score of the patient measured at their last visit is less than half the HAM-D score measured at the initial visit. A patient's response to an antidepressant is calculated as follows.
where Y P is the label of the patient P and H i is the HAM-D score of the patient at the ith visit. Prediction models predict the clinical remission of patients using Algorithm 1.

Algorithm 1 Prediction of the clinical remissions of patients.
Input: P: Patient, R P : Prescription records {r 1 , r 2 , · · · , r |R P | } of P Output:Ŷ P : Predicted clinical remission of patient P.

Dataset Preparation
Using Equation (14), we constructed the dataset D which contains 121 data from the prescription records of 121 patients. As in Task 1, 10% of the data were randomly sampled and used as the test set, and the remaining data were used as the training set. To find the optimal hyper-parameters for the prediction models, we performed five-fold cross validation on the training set. The statistics of the constructed dataset D are summarized in Table 4.

Metric
To evaluate performance on the prediction of clinical remission, we employed the following five evaluation metrics that are widely used in binary classification tasks: Sensitivity, Specificity, Precision, F1-score, and Accuracy. Sensitivity and Specificity are computed as follows, where TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively. Precision and F1-score are computed as follows, Accuracy is computed as follows, For all the evaluation metrics, the higher is the score, the better is the performance of a model on prediction.

Results
We performed the training and evaluation five times using the optimal hyperparameters to reduce the randomness of the machine learning-based prediction models. We report the average of the five evaluation results for each model. The right side of Table 5 shows the evaluation results on the prediction of the clinical remission of patients (Task 2). Our ARPNet model outperforms all the baseline models on all the evaluation metrics (Answer to RQ2). ARPNet achieves performance improvements of at least 40.0%, 74.4%, 86.0%, and 57.1% in Specificity, Precision, F1-score, and Accuracy, respectively, over the best baseline model. In terms of Sensitivity, ARPNet achieves performance improvements of 100% over all the baselines except for K-Nearst Neighbors Regressor. As in the evaluation result of Ridge Regressor, sensitivity generally shows a negative correlation between specificity and precision. Nevertheless, ARPNet obtains performance improvements in all the evaluation metrics.
We observe that ARPNet achieves a greater performance improvement in Task 2 than in Task 1. Figure 2 shows two representative cases: (1) the patient improves over time; and (2) the condition of the patient worsens after four weeks of treatment. In both cases, ARPNet accurately predicts the outcome of the patient as the predicted HAM-D score of the patient is almost the same as their actual HAM-D score. On the other hand, K-Nearest Neighbors and Gradient Boosting were unable to accurately predict the HAM-D scores of patients. Their prediction performance decreased as the interval between visits increased. As it takes a long time (more than eight weeks) to treat MDD, inaccurate antidepressant response predictions can result in ineffective prescriptions and increased risks.

Use Case Scenario: Prescribing the Most Effective Antidepressant
As ARPNet can predict the degree of the antidepressant response, ARPNet can be used to prescribe the most effective antidepressants. We conducted an additional qualitative evaluation to show the applicability of ARPNet. We computed the predicted HAM-D scores of all the possible antidepressant candidates. The possible antidepressant candidates include 14 single antidepressants and 91 combinations of any two antidepressants. Figure 3 shows the result of a sample patient as an example. In the figure, the tables list the predicted HAM-D score of each antidepressant candidate for Patient A at their visit. The green line indicates the change in the actual HAM-D score of Patient A. The red line indicates the change in the HAM-D score predicted by ARPNet for the prescribed antidepressant MIRTAZAPINE. The blue line denotes the change in the HAM-D score when Patient A at the initial visit is prescribed the antidepressant combination of MIRTAZAPINE and MILNACIPRAN which is predicted to be the most effective. In this phase, it is predicted that MIRTAZAPINE with MILNACIPRAN (Rank 1) is more effective than MIRTAZAPINE alone (Rank 3). At the visit in Week 8, the antidepressant combination of ESCTALOPRAM and AMITRIPTYLINE, which is predicted to be the most effective antidepressant, is prescribed. The cyan line denotes the change in the HAM-D score when the psychiatrist changes the prescription to the most effective antidepressants predicted at the visit in Week 8. As shown in this example, the HAM-D score of the patient decreases faster than the actual HAM-D score by prescribing the most effective antidepressants at every visit. These observations demonstrate that ARPNet can compute the predicted HAM-D scores of all the possible antidepressant candidates to prescribe the most effective antidepressant, which can lead to more effective treatment and reduced costs and risks (Answer to RQ3).

Use Case Scenario: Using Pre-Trained Representation Vectors
Through the patient representation layer, ARPNet trains the patient representation vectors to capture high-dimensional patterns for improving performance in antidepressant response prediction. By representing the patients as real-valued vectors, we can compute the antidepressant response-aware similarity between patients. This antidepressant response-aware similarity can be used for precision prescription in evidence-based medicine [24,25]. Figure 4 shows an example of using patient representation vectors in clinical decision support systems. Patients in the prescription record database are represented as patient representation vectors, using patient data and prescription records. Based on their representation vectors, the patients are clustered using a clustering method that computes the antidepressant response-aware similarity between the patients. When a new patient's data are given, the patient's representation vector is computed using the patient representation layer of ARPNet. Clinical decision support systems compute the similarities between the patient and patients in the cluster in which the patient is likely to be included, and provide psychiatrists with the prescription records of patients very similar to those of the patient for evidence-based medicine. Psychiatrists prescribe the most effective antidepressant to their patient by referring to the antidepressant response records of similar patients.
This use case demonstrates that pretrained patient representation vectors can be utilized as clinical evidence in downstream applications for precision medicine (Answer to RQ4).

Conclusions
In this paper, we propose ARPNet, which is a new antidepressant response prediction network model for major depressive disorder. ARPNet utilizes several useful features from patient data including the demographic, MRI, and genetic information of patients. Based on a literature survey, features such as neuroimaging biomarkers were extracted from MRI information. In addition, genetic variants and DNA methylation were extracted from genetic information using biomedical literature-based tools. To make the extracted features more predictive, we filtered them by a data-driven feature selection process. Based on a neural network architecture, ARPNet predicts not only whether patients will respond to an antidepressant, but also the degree of the antidepressant response. ARPNet considers a real-world antidepressant prescription scenario where psychiatrists prescribe one or more antidepressants to patients with MDD. In the experiments, ARPNet outperformed the machine learning-based baseline models in both the prediction of the degree of antidepressant response and the prediction of clinical remission of patients. Finally, we used real-world use cases to show the applicability of ARPNet in downstream applications. In future work, we plan to further evaluate ARPNet on real patients.