NRLMFβ: Beta-distribution-rescored neighborhood regularized logistic matrix factorization for improving the performance of drug–target interaction prediction

Techniques for predicting interactions between a drug and a target (protein) are useful for strategic drug repositioning. Neighborhood regularized logistic matrix factorization (NRLMF) is one of the state-of-the-art drug–target interaction prediction methods; it is based on a statistical model using the Bernoulli distribution. However, the prediction is not accurate when drug–target interaction pairs have little interaction information (e.g., the sum of the number of ligands for a target and the number of target proteins for a drug). This study aimed to address this issue by proposing NRLMF with beta distribution rescoring (NRLMFβ), an algorithm that improves the score of NRLMF. The NRLMFβ score is equivalent to the original NRLMF score when the concentration of the beta distribution becomes infinity. The beta distribution is the conjugate prior of the Bernoulli distribution and can reflect the amount of interaction information in its shape via Bayesian inference. Therefore, in NRLMFβ, the beta distribution was used to rescore the NRLMF score. In the evaluation experiment, we measured the average values of the area under the receiver operating characteristic curve and the area under the precision–recall curve, with 95% confidence intervals. The performance of NRLMFβ was found to be better than that of NRLMF on the four types of benchmark datasets. Thus, we concluded that NRLMFβ improved the prediction accuracy of NRLMF. The source code is available at https://github.com/akiyamalab/NRLMFb.


Introduction
Improving the accuracy of predicting drug-target interactions is an important task in drug discovery. Research and development expenses have been increasing annually, although the number of approvals for new drugs has remained roughly constant [1]. Thus, increasing the value of existing drugs is necessary. Drug repositioning is a strategy that reuses an existing drug as a remedy for another disease [2]. For example, digoxin has been applied to prostate cancer in addition to heart failure therapy [3]. Previously, drug repositioning was performed serendipitously during basic research and clinical trials [2]; however, it has become more systematic with the development of machine learning and statistical models [4,5]. In particular, drug-target interaction prediction is known as one of the effective approaches [6,7].
Various methods have been proposed for predicting drug-target interactions [7-15]. These methods predict unknown interactions based on known interactions and molecular similarities (e.g., the Tanimoto coefficient of chemical structures and the normalized Smith-Waterman score of amino acid sequences). The underlying assumption is that similar drugs interact with similar proteins, and vice versa. These prediction methods can be divided into two approaches: one uses kernel methods, such as WNNGIP [8], BLMNII [9], LIK [10], and ECkNN [11]; the other uses matrix factorization models, such as MSCMF [14] and NRLMF [15]. To our knowledge, NRLMF is one of the most accurate methods. However, matrix factorization models such as NRLMF cannot satisfactorily predict interactions for drugs and targets with few related interactions. Two approaches are known to deal with this problem. One replaces the latent feature vector of a drug or target protein with no known interactions by a weighted sum of the latent vectors of its most similar drugs or proteins [15,16]. The other adds a bias vector to each pair of drug and target protein [16]. However, with these methods, when no highly similar drug or target protein exists for a drug or protein with few interactions, an appropriate latent feature vector cannot be estimated. Therefore, we considered that directly correcting the score by using the number of interactions of each drug-target pair would be effective.
In this study, we proposed NRLMF with beta distribution rescoring (NRLMFβ), a new drug-target interaction prediction model based on rescoring the NRLMF score. Since NRLMF is based on the Bernoulli distribution, the beta distribution, its conjugate prior, is used (Fig. 1). Our findings might form a basis for improving the accuracy of drug-target interaction prediction models. Herein, we introduce a new feature quantity to quantify the amount of interaction information and describe the rescoring algorithm. Further, in order to confirm the improvement in prediction accuracy, we conducted a comparison experiment between the proposed method NRLMFβ and the conventional method NRLMF, evaluating the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) on a standard benchmark [7].

Materials
Evaluation experiments were performed using a benchmark dataset [7], which is widely used in drug-target interaction prediction [7-9,12,14,15]. The benchmark consists of four datasets targeting proteins of Nuclear Receptor, GPCR, Ion Channel, and Enzyme. Each dataset includes three matrices: an interaction matrix, a drug similarity matrix, and a protein similarity matrix. The interaction matrix is an adjacency matrix of 0s and 1s; an entry takes the value 1 if an interaction has been experimentally confirmed between the drug and the target protein, and 0 otherwise. The interaction information was obtained from the KEGG BRITE [17], BRENDA [18], SuperTarget [19], and DrugBank [20] databases. The statistics of the interaction matrix in each dataset are shown in Table 1. The drug similarity matrix is a real matrix with values ranging from 0 to 1; values close to 1 indicate similar chemical structures. The structure information of drugs was obtained from the KEGG LIGAND [17] database. Notably, drugs with molecular weight below 100 were excluded in order to remove factors such as ions and cofactors. The Tanimoto coefficient, computed by SIMCOMP [21] for each pair of drugs $d$ and $d'$, was used as the drug similarity. The target similarity matrix is likewise a real matrix with values from 0 to 1. Amino acid sequences obtained from the KEGG GENES [17] database were used for the target proteins. The similarity between targets $t$ and $t'$ was calculated as the normalized Smith-Waterman score $S_t(t, t') = \mathrm{SW}(t, t') / \sqrt{\mathrm{SW}(t, t)\,\mathrm{SW}(t', t')}$, where $\mathrm{SW}(\cdot, \cdot)$ is the Smith-Waterman score [22]. These datasets can be obtained at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/.
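The normalization above can be sketched in a few lines of numpy (an illustrative example, not the benchmark's own preprocessing; the raw score matrix is a toy, symmetric, hypothetical one):

```python
import numpy as np

def normalize_sw(sw):
    """Normalize a raw Smith-Waterman score matrix so that
    S[i, j] = SW[i, j] / sqrt(SW[i, i] * SW[j, j])."""
    d = np.sqrt(np.diag(sw))
    return sw / np.outer(d, d)

# Toy 3x3 raw SW score matrix (hypothetical values).
sw = np.array([[10.0, 4.0, 2.0],
               [ 4.0, 8.0, 3.0],
               [ 2.0, 3.0, 6.0]])
S = normalize_sw(sw)
# The diagonal becomes exactly 1; off-diagonal values lie in (0, 1].
```

This yields a similarity matrix whose self-similarities are 1, matching the benchmark's convention of values in [0, 1].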

Problem formalization
Herein, we denote the set of drugs as $D = \{d_1, \ldots, d_{n_d}\}$ and the set of target proteins as $T = \{t_1, \ldots, t_{n_t}\}$, where $n_d$ and $n_t$ are the numbers of drugs and targets, respectively. Known interactions are given as a binary matrix $Y \in \{0, 1\}^{n_d \times n_t}$, where $y_{ij} = 1$ if drug $d_i$ is known to interact with target $t_j$, and $y_{ij} = 0$ otherwise. The problem is to assign high scores to the unknown drug-target pairs that are likely to interact.

NRLMF
NRLMF [15] is a state-of-the-art drug-target interaction prediction method based on a matrix factorization technique.

Interaction probability
The possibility that a drug $d_i$ and target $t_j$ interact is evaluated by the interaction probability $p_{ij}$, calculated from the latent feature vectors of the drug and target. The latent feature vector of drug $d_i$ is $u_i \in \mathbb{R}^r$, and that of target protein $t_j$ is $v_j \in \mathbb{R}^r$, where the hyperparameter $r$ is the dimensionality of the latent feature vectors. $U \in \mathbb{R}^{n_d \times r}$ is the latent feature matrix whose $i$-th row vector is $u_i$, and $V \in \mathbb{R}^{n_t \times r}$ is the latent feature matrix whose $j$-th row vector is $v_j$. The interaction probability between drug $d_i$ and target $t_j$ is defined as follows:

$$p_{ij} = \frac{\exp(u_i v_j^{\top})}{1 + \exp(u_i v_j^{\top})} \qquad (1)$$
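Equation (1) is the logistic (sigmoid) function applied to the inner product of the two latent vectors. A minimal sketch, with hypothetical latent vectors not fitted to any dataset:

```python
import numpy as np

def interaction_prob(u_i, v_j):
    """Logistic interaction probability p_ij = exp(u.v) / (1 + exp(u.v)),
    computed in the numerically stable sigmoid form."""
    z = float(np.dot(u_i, v_j))
    if z >= 0:
        return 1.0 / (1.0 + np.exp(-z))
    e = np.exp(z)
    return e / (1.0 + e)

# Toy r = 3 latent vectors (illustrative values).
u = np.array([0.5, -0.2, 0.1])
v = np.array([0.4, 0.3, -0.1])
p = interaction_prob(u, v)  # a value in (0, 1)
```

The stable two-branch form avoids overflow of `exp` for large-magnitude inner products, which can occur during optimization.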

Prediction model
By using the interaction probability $p_{ij}$, we defined the likelihood of the interaction matrix $Y$ given the latent feature matrices $U, V$ as follows:

$$P(Y \mid U, V) = \prod_{i=1}^{n_d} \prod_{j=1}^{n_t} p_{ij}^{c\,y_{ij}} (1 - p_{ij})^{1 - y_{ij}} \qquad (2)$$

where $c > 0$ is a hyperparameter weighting the observed interactions. We assumed that each latent feature vector $u_i$ follows a multivariate normal distribution with mean $\mathbf{0}_r$ and variance-covariance matrix $\lambda_d^{-1} I_{r \times r}$, so that the latent feature matrix $U$ follows the joint distribution defined as equation (3):

$$P(U \mid \lambda_d) = \prod_{i=1}^{n_d} \mathcal{N}\!\left(u_i \mid \mathbf{0}_r, \lambda_d^{-1} I_{r \times r}\right) \qquad (3)$$

Similarly, we assumed that each $v_j$ follows a multivariate normal distribution with mean $\mathbf{0}_r$ and variance-covariance matrix $\lambda_t^{-1} I_{r \times r}$, so that $V$ follows the joint distribution defined as equation (4):

$$P(V \mid \lambda_t) = \prod_{j=1}^{n_t} \mathcal{N}\!\left(v_j \mid \mathbf{0}_r, \lambda_t^{-1} I_{r \times r}\right) \qquad (4)$$

where $\lambda_d, \lambda_t > 0$ are hyperparameters, and $I_{r \times r}$ is the identity matrix.
From equations (2)-(4), we can obtain the maximum a posteriori (MAP) solution by solving the following minimization problem:

$$\min_{U, V} \sum_{i=1}^{n_d} \sum_{j=1}^{n_t} \left[ (1 + c\,y_{ij} - y_{ij}) \ln\!\left(1 + \exp(u_i v_j^{\top})\right) - c\,y_{ij}\, u_i v_j^{\top} \right] + \frac{\lambda_d}{2} \lVert U \rVert_F^2 + \frac{\lambda_t}{2} \lVert V \rVert_F^2 \qquad (5)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm.

Regularization
We introduced regularization terms that bring the latent feature vectors of highly similar drugs or targets close to each other by using the similarity matrices $S^d$ and $S^t$. Let $N(d_i)$ denote the $K_1$ nearest neighbors of drug $d_i$ determined by the $i$-th row vector of $S^d$, where $K_1$ is a hyperparameter, and let $A = (a_{i\mu})$ be the adjacency matrix with $a_{i\mu} = S^d_{i\mu}$ if $d_\mu \in N(d_i)$ and $a_{i\mu} = 0$ otherwise. The regularization term for the drug latent feature vectors is defined as

$$\frac{1}{2} \sum_{i=1}^{n_d} \sum_{\mu=1}^{n_d} a_{i\mu} \lVert u_i - u_\mu \rVert_2^2 = \frac{1}{2}\,\mathrm{tr}\!\left(U^{\top} L^d U\right), \quad L^d = (D^d + \tilde{D}^d) - (A + A^{\top}) \qquad (6)$$

where $D^d$ and $\tilde{D}^d$ are two diagonal matrices whose diagonal elements are $D^d_{ii} = \sum_{\mu} a_{i\mu}$ and $\tilde{D}^d_{\mu\mu} = \sum_{i} a_{i\mu}$, $\lVert \cdot \rVert_2$ is the Euclidean norm, and $\mathrm{tr}(\cdot)$ is the trace function. Similarly, letting $B = (b_{j\nu})$ be the adjacency matrix built from the $K_1$ nearest neighbors in $S^t$, the regularization term for the target latent feature vectors is defined as

$$\frac{1}{2} \sum_{j=1}^{n_t} \sum_{\nu=1}^{n_t} b_{j\nu} \lVert v_j - v_\nu \rVert_2^2 = \frac{1}{2}\,\mathrm{tr}\!\left(V^{\top} L^t V\right), \quad L^t = (D^t + \tilde{D}^t) - (B + B^{\top}) \qquad (8)$$

where $D^t$ and $\tilde{D}^t$ are two diagonal matrices defined analogously.
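As an illustrative sketch (not the authors' code; all matrices here are random toys), the drug-side regularizer can be computed either as the pairwise sum over neighbors or via the graph Laplacian, and the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n_d, r, K1 = 6, 3, 2

S_d = rng.random((n_d, n_d))          # toy drug similarity matrix
np.fill_diagonal(S_d, 0.0)            # a drug is not its own neighbor
U = rng.standard_normal((n_d, r))     # toy latent feature matrix

# Adjacency A: keep similarities only for each drug's K1 nearest neighbors.
A = np.zeros_like(S_d)
for i in range(n_d):
    nn = np.argsort(S_d[i])[-K1:]
    A[i, nn] = S_d[i, nn]

# Pairwise form: (1/2) * sum_{i,mu} a_{i,mu} * ||u_i - u_mu||^2
pairwise = 0.5 * sum(A[i, m] * np.sum((U[i] - U[m]) ** 2)
                     for i in range(n_d) for m in range(n_d))

# Laplacian form: (1/2) * tr(U^T L U), with L = (D + D~) - (A + A^T)
D = np.diag(A.sum(axis=1))
D_tilde = np.diag(A.sum(axis=0))
L = (D + D_tilde) - (A + A.T)
laplacian = 0.5 * np.trace(U.T @ L @ U)
```

The Laplacian form is what makes the gradient with respect to $U$ cheap to compute during optimization.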

Estimation and scoring
By introducing the regularization terms (6) and (8) into (5), we redefined the minimization problem as follows:

$$\min_{U, V} \sum_{i=1}^{n_d} \sum_{j=1}^{n_t} \left[ (1 + c\,y_{ij} - y_{ij}) \ln\!\left(1 + \exp(u_i v_j^{\top})\right) - c\,y_{ij}\, u_i v_j^{\top} \right] + \frac{\lambda_d}{2} \lVert U \rVert_F^2 + \frac{\lambda_t}{2} \lVert V \rVert_F^2 + \frac{\alpha}{2}\,\mathrm{tr}\!\left(U^{\top} L^d U\right) + \frac{\beta}{2}\,\mathrm{tr}\!\left(V^{\top} L^t V\right) \qquad (10)$$

where $\alpha, \beta > 0$ are hyperparameters. An alternating gradient descent method [23] was used to optimize equation (10). In addition, the AdaGrad algorithm [24], which has a learning-rate hyperparameter, was used to accelerate the optimization. After optimization, the rows of the MAP solution corresponding to negative drugs and targets (those with no known interaction) were modified. For a negative drug $d_i \in D$, let $N^+(d_i)$ denote its $K_2$ nearest neighbors among the positive drugs, where $K_2$ is a hyperparameter. The latent feature vector $\hat{u}_i \in \mathbb{R}^r$ of drug $d_i$, the $i$-th row vector of the estimated latent feature matrix $\hat{U}$, is modified into the vector $\tilde{u}_i \in \mathbb{R}^r$ as follows:

$$\tilde{u}_i = \frac{1}{\sum_{d_\mu \in N^+(d_i)} S^d_{i\mu}} \sum_{d_\mu \in N^+(d_i)} S^d_{i\mu}\, \hat{u}_\mu \qquad (11)$$
Likewise, each negative target $t_j \in T$ is modified into $\tilde{v}_j$. Using the modified latent feature vectors $\tilde{u}_i, \tilde{v}_j \in \mathbb{R}^r$, the score of drug $d_i \in D$ and target $t_j \in T$ is calculated as follows:

$$s_{ij} = \frac{\exp(\tilde{u}_i \tilde{v}_j^{\top})}{1 + \exp(\tilde{u}_i \tilde{v}_j^{\top})} \qquad (12)$$
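The neighbor-based modification and scoring can be sketched as follows (an assumption-laden toy, not the released implementation; the random similarity matrix and latent vectors are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth_negative(U_hat, S_d, is_positive, K2=2):
    """For each drug with no known interaction, replace its latent vector
    with the similarity-weighted average of its K2 most similar positive
    drugs (a sketch of the modification in eq. (11))."""
    U_tilde = U_hat.copy()
    pos = np.flatnonzero(is_positive)
    for i in np.flatnonzero(~is_positive):
        nn = pos[np.argsort(S_d[i, pos])[-K2:]]   # K2 nearest positive drugs
        w = S_d[i, nn]
        U_tilde[i] = (w @ U_hat[nn]) / w.sum()
    return U_tilde

rng = np.random.default_rng(1)
n_d, r = 5, 3
U_hat = rng.standard_normal((n_d, r))
S_d = rng.random((n_d, n_d)); S_d = (S_d + S_d.T) / 2
is_positive = np.array([True, True, True, True, False])  # drug 4 is negative

U_tilde = smooth_negative(U_hat, S_d, is_positive)
score = sigmoid(U_tilde[4] @ U_hat[2])   # example score for pair (d_4, t)
```

Only the rows of drugs (or targets) with no known interactions are touched; positive rows keep their MAP estimates.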

NRLMFβ
NRLMFβ is an algorithm that rescores the NRLMF score as the expected value of a beta distribution whose shape is determined by the interaction information and the NRLMF score. The beta distribution is the conjugate prior of the Bernoulli distribution used in NRLMF and can therefore reflect the amount of interaction information in the NRLMFβ score.

New feature quantity
We defined a new feature quantity

$$\gamma_{ij} = \sum_{j'=1}^{n_t} y_{ij'} + \sum_{i'=1}^{n_d} y_{i'j} \qquad (13)$$

The feature quantity $\gamma_{ij}$ is the sum of the observed values in the $i$-th row and the $j$-th column of the interaction matrix $Y$; it becomes larger as the number of known interaction pairs involving drug $d_i$ or target $t_j$ increases.
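In vectorized form, equation (13) is a row sum plus a column sum (a minimal sketch with a toy interaction matrix):

```python
import numpy as np

def feature_gamma(Y):
    """gamma[i, j] = (known interactions of drug i) + (known interactions
    of target j), i.e. the row sum plus the column sum of Y (eq. (13))."""
    return Y.sum(axis=1, keepdims=True) + Y.sum(axis=0, keepdims=True)

Y = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0]])
gamma = feature_gamma(Y)
# gamma[0, 0] = row 0 sum (1) + column 0 sum (1) = 2
# gamma[2, 2] = row 2 sum (0) + column 2 sum (1) = 1
```

Broadcasting the (n_d, 1) row sums against the (1, n_t) column sums yields the full (n_d, n_t) matrix of feature quantities in one step.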

Re-scoring using the beta distribution
The NRLMF score $s_{ij}$ tends to be too small when there is little interaction information, that is, when the feature quantity $\gamma_{ij}$ is small. To address this, we rescored the score of drug $d_i$ and target $t_j$ by using the beta distribution, defined as follows:

$$\mathrm{Beta}(x;\, a_{ij}, b_{ij}) = \frac{x^{a_{ij}-1}(1-x)^{b_{ij}-1}}{B(a_{ij}, b_{ij})} \qquad (14)$$
where $a_{ij}, b_{ij} > 0$ are parameters determining the shape of the beta distribution, and the condition $a_{ij}, b_{ij} > 1$ was assumed in order to limit the distribution to a bell shape. $B(\cdot, \cdot)$ is the beta function. We defined the relationship between the parameters $a_{ij}, b_{ij}$, the score $s_{ij}$, and the feature quantity $\gamma_{ij}$ by the following equations:

$$s_{ij} = \frac{a_{ij} - 1}{a_{ij} + b_{ij} - 2} \qquad (15)$$

$$a_{ij} + b_{ij} = \beta_1 \gamma_{ij} + \beta_2 \qquad (16)$$

where $\beta_1, \beta_2 > 0$ are hyperparameters, and $\beta_2 > 2$ follows from the abovementioned conditions on the shape of the distribution. Equation (15) is the mode of the beta distribution, and equation (16) is its concentration; notably, the score of NRLMF corresponds to the mode of the beta distribution. From these relational expressions, the NRLMFβ score $\tilde{s}_{ij}$ is defined as the expected value of the beta distribution:

$$\tilde{s}_{ij} = \frac{a_{ij}}{a_{ij} + b_{ij}} = \frac{s_{ij}\left(\beta_1 \gamma_{ij} + \beta_2 - 2\right) + 1}{\beta_1 \gamma_{ij} + \beta_2} \qquad (17)$$
The NRLMFβ score $\tilde{s}_{ij}$ is equivalent to the NRLMF score $s_{ij}$ when the concentration of the beta distribution becomes infinity. To prove this, assume that $\beta_1$ is at infinity and $\beta_2$ is any finite value, so that the concentration $\beta_1 \gamma_{ij} + \beta_2$ also tends to infinity for pairs with $\gamma_{ij} > 0$. The claim is then proved by equation (18):

$$\lim_{\beta_1 \to \infty} \tilde{s}_{ij} = \lim_{\beta_1 \to \infty} \left[ s_{ij} + \frac{1 - 2 s_{ij}}{\beta_1 \gamma_{ij} + \beta_2} \right] = s_{ij} \qquad (18)$$

Algorithm
The calculation procedure of NRLMFβ is shown in Algorithm 1. First, the observed interaction matrix $Y$ and the hyperparameters $\beta_1, \beta_2$ are set, and the NRLMF score $s_{ij}$ is calculated for every $i, j$ by equation (12). For each $i$ and $j$, $\gamma_{ij}$ is calculated from $Y$ by equation (13). From the calculated $\gamma_{ij}$ and the NRLMF score $s_{ij}$, the NRLMFβ score $\tilde{s}_{ij}$ is calculated by equation (17). The algorithm thus rescores the NRLMF score for all drug-protein pairs in the interaction matrix.
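A minimal vectorized sketch of this rescoring step (not the authors' released implementation; the `beta1`/`beta2` values and the toy matrices are illustrative):

```python
import numpy as np

def nrlmfb_rescore(S, Y, beta1=128.0, beta2=8.0):
    """Rescore an NRLMF score matrix S with the beta distribution:
    gamma = row sums + column sums of Y (eq. 13),
    concentration c = beta1 * gamma + beta2 (eq. 16),
    new score = (S * (c - 2) + 1) / c, the beta mean whose mode is S
    (eq. 17). Requires beta2 > 2."""
    gamma = Y.sum(axis=1, keepdims=True) + Y.sum(axis=0, keepdims=True)
    c = beta1 * gamma + beta2
    return (S * (c - 2.0) + 1.0) / c

Y = np.array([[1, 0],
              [0, 0]])
S = np.array([[0.9, 0.2],
              [0.2, 0.1]])
S_new = nrlmfb_rescore(S=S, Y=Y)
# The pair with gamma = 0 (here (1, 1)) is pulled furthest toward 0.5.
```

Note that the rescoring needs only the score matrix and the interaction matrix, so it can be applied as a post-processing step on top of any trained NRLMF model.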

Experimental settings
In order to compare the generalization performance of NRLMF and NRLMFβ, we used AUC and AUPR as the evaluation indices; following Liu et al. [15], three kinds of 10-fold cross-validation scenarios (CVSs), called CVS1, CVS2, and CVS3, were each performed five times. The division into test and training data is exactly the same as for the comparison method NRLMF, because our evaluation uses the script from the original paper of the comparative method with the same random seed. In CVS1, all drug-target pairs in $D \times T$ contained in the interaction matrix $Y$ were randomly divided into ten sets of equal size; the pairs in one set, together with their observed values, were treated as test data $P_1 = \{((d_i, t_j), y_{ij})\}$, and the remaining sets as training data. Each of the ten sets was treated in turn as test data, and the values of the evaluation indices AUC and AUPR were measured. This operation was repeated five times, and the average value of each evaluation index and the 95% confidence interval of the t-distribution were calculated. In CVS2, by contrast, the set $D$ of drugs was randomly divided into ten sets of equal size. For each divided subset $D' \subset D$, the set of pairs $D' \times T$, consisting of the drugs in $D'$ combined with all targets in $T$, together with their observed values, was treated as test data $P_2 = \{((d_i, t_j), y_{ij}) \mid (d_i, t_j) \in D' \times T\}$. As in CVS1, training data were prepared from the remaining subsets, and the evaluation indices were measured. Similarly, in CVS3, the set $T$ of targets was randomly divided into ten sets of equal size; for each subset $T' \subset T$, the pairs in $D \times T'$, together with their observed values, were treated as test data, and the evaluation indices were measured.
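The three split styles can be sketched as follows (a hypothetical helper, not the benchmark script: CVS1 holds out random pairs, CVS2 whole drug rows, CVS3 whole target columns):

```python
import numpy as np

def cvs_folds(n_d, n_t, scenario, n_folds=10, seed=0):
    """Return a list of test-pair index lists, one per fold.
    scenario 1: random drug-target pairs; 2: whole drugs (rows);
    3: whole targets (columns)."""
    rng = np.random.default_rng(seed)
    if scenario == 1:
        pairs = [(i, j) for i in range(n_d) for j in range(n_t)]
        rng.shuffle(pairs)
        chunks = np.array_split(np.arange(len(pairs)), n_folds)
        return [[pairs[k] for k in chunk] for chunk in chunks]
    axis_n = n_d if scenario == 2 else n_t
    idx = rng.permutation(axis_n)
    folds = []
    for chunk in np.array_split(idx, n_folds):
        if scenario == 2:
            folds.append([(i, j) for i in chunk for j in range(n_t)])
        else:
            folds.append([(i, j) for i in range(n_d) for j in chunk])
    return folds

folds = cvs_folds(20, 15, scenario=2)   # 10 folds of whole drug rows
```

Under CVS2 and CVS3 a test drug (or target) never appears in training, which is why these scenarios are harder than CVS1.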

Hyperparameter optimization
For the computational experiment, we cited the NRLMF results from the original paper [15] and measured the performance of NRLMFβ. The search ranges of the NRLMFβ hyperparameters shared with NRLMF were defined based on [15]. For hyperparameter optimization, grid search and Bayesian optimization were combined for all datasets, because previous experimental results [25] showed that Bayesian optimization attains evaluations comparable to grid search; furthermore, by dividing the search range of the hyperparameters, Bayesian optimization runs can be conducted simultaneously, shortening the calculation time. Therefore, the value of $r$ and the values of the grid-searched hyperparameters, including $\beta_1$ and $\beta_2$, were sequentially fixed based on the search ranges defined above, and Bayesian optimization was performed on the search range of the remaining $\lambda_d$ and $\alpha$ (with $\lambda_t$ set equal to $\lambda_d$). Among the solutions obtained by Bayesian optimization, those with the highest average AUC were selected as the optimal hyperparameters (Table S1). Here, we used the Gaussian process mutual information algorithm [26] as the Bayesian optimization method, with its parameter $\delta = 10^{-100}$ according to Ref. [25].
These calculations were performed using supercomputer TSUBAME 3.0 at the Tokyo Institute of Technology. Two Intel Xeon E5-2680 v4 (2.4 GHz, 14 cores) CPUs and 256 GB main memory were installed in the computing node, and calculation was performed simultaneously by using Shared Memory Parallel with seven cores (the maximum number of the q_node option on TSUBAME 3.0).

Results
By comparing the prediction accuracy (i.e., AUC and AUPR) of NRLMF and NRLMFβ, we show that performance is improved by rescoring with the beta distribution.

Performance based on AUC
Table 2 shows the average AUC for each CVS and dataset, with the 95% confidence intervals obtained using the t-distribution, comparing the prediction accuracy of NRLMF and NRLMFβ. For CVS1 and CVS3, the average AUC of NRLMFβ was higher than that of NRLMF on all datasets. Regarding dataset trends, for both CVS1 and CVS3 the improvement was more pronounced the smaller the dataset, indicating that NRLMFβ has a large effect when the amount of data is small. In particular, for the Nuclear Receptor dataset under CVS3, the average AUC of NRLMF was 0.851, whereas that of NRLMFβ was 0.941, an increase of 0.090. We discuss this improvement in section 4.3. Conversely, for CVS2, the average AUC of NRLMFβ was higher than that of NRLMF on all datasets other than Enzyme. For Enzyme, the average AUC of NRLMF was 0.871, whereas that of NRLMFβ was 0.858; that is, the prediction accuracy decreased.

Performance based on AUPR
Table 3 shows the average AUPR for each CVS and dataset, with the 95% confidence intervals obtained using the t-distribution, comparing the prediction accuracy of NRLMF and NRLMFβ. For CVS1 and CVS3, the average AUPR of NRLMFβ was higher than that of NRLMF on all datasets. As with AUC, the improvement was more pronounced the smaller the dataset. In particular, for the Nuclear Receptor dataset under CVS3, the average AUPR of NRLMF was 0.449, whereas that of NRLMFβ was 0.661, an increase of 0.212. Conversely, for CVS2, the average AUPR of NRLMFβ was higher than that of NRLMF on all datasets other than GPCR. For GPCR, the average AUPR of NRLMF was 0.364, whereas that of NRLMFβ was 0.358; that is, the prediction accuracy declined.

Discussion
We investigated the search range of the newly introduced hyperparameters $\beta_1, \beta_2$ and the reason why the performance of NRLMFβ was better than that of NRLMF.

Performance based on external validation
In order to evaluate the generalization performance of the proposed method NRLMFβ, we compared it with the conventional method NRLMF by external validation (nested cross-validation). In the external validation, to determine the optimal hyperparameters, 5-fold cross-validation with CVS1 was performed on the training data, and the hyperparameters maximizing the average AUC were determined by Bayesian optimization [25]. Subsequently, AUC and AUPR were measured on the test data using the determined hyperparameters. In this external validation, 10-fold cross-validation (CVS1) was performed five times using the GPCR dataset used for the cross-validation. As a result, the AUC of NRLMFβ was 0.974±0.003, higher than that of NRLMF, 0.968±0.004. Likewise, the AUPR of NRLMFβ was 0.759±0.016, exceeding that of NRLMF, 0.751±0.015. These results indicate that generalization performance can be obtained even when only amino acid sequences, without three-dimensional structures, are used; if three-dimensional structural information were used, further improving the generalization performance might be possible.

Search range of hyperparameters β1 and β2

Fig. 2 shows, for each pair of CVS and dataset, the heatmap of the average AUPR when the hyperparameters $\beta_1, \beta_2$ of NRLMFβ were each varied from 2 to 512. For the hyperparameters other than $\beta_1, \beta_2$, the optimal solution for each CVS-dataset pair was used, and a grid search over $\beta_1, \beta_2$ was performed with the other hyperparameters fixed (see also Supplemental Information Fig. S1). The frame in each figure is the search range of $\beta_1, \beta_2$ defined in section 2.6; it was created based on the heatmap obtained by averaging the 12 heatmaps shown in Fig. 2. Balancing calculation time against prediction accuracy, we chose the frame of size 4×5 (five grid values for $\beta_1$ and four for $\beta_2$) for which the total value inside the frame was the largest.
Thus, the frame area consists of five consecutive powers of two for $\beta_1$ and four for $\beta_2$. The AUPR varied most for the smallest dataset; because the dataset size was small, the change in the score of an interaction pair was thought to influence the AUPR value markedly. Accordingly, as the dataset size increases through GPCR, Ion Channel, and Enzyme, the change in AUPR with respect to changes in $\beta_1, \beta_2$ appeared to become smaller.

Table 2
The AUC for the 5×10-fold cross-validation scenarios. A 10-fold cross-validation was performed five times, and the average value of AUC and the 95% confidence interval of the t-distribution are shown. The higher average value is shown in bold.

Characteristics of rescoring
Fig. 1 plots the probability density function of the beta distribution and the change in the score, in order to explain how the NRLMF score changes under NRLMFβ rescoring. The rescored score behaves differently depending on whether the original score $s$ is low (Low score: $s < 0.5$) or high (High score: $s > 0.5$). Fig. 1(a) compares the change at different concentrations when the original score is low. When the original score $s$ was 0.200, the improved score $\tilde{s}_1$ at concentration 22 ($a = 5$, $b = 17$) was 0.227, and the improved score $\tilde{s}_2$ at concentration 7 ($a = 2$, $b = 5$) was 0.286. Thus, when the original score $s$ was less than 0.5, the rescored value tended to increase, and the increase grew as the concentration decreased. Conversely, Fig. 1(b) compares the change at different concentrations when the original score is high. When the original score $s$ was 0.800, the improved score $\tilde{s}_1$ at concentration 22 ($a = 17$, $b = 5$) was 0.773, and the improved score $\tilde{s}_2$ at concentration 7 ($a = 5$, $b = 2$) was 0.714. Thus, when the original score $s$ was greater than 0.5, the rescored value decreased, unlike the low-score case, and the decrease grew as the concentration was further reduced.
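These example values can be checked directly from the mode and mean relations of the beta distribution (a sketch; the $(a, b)$ pairs are the ones read off above):

```python
# For a beta distribution with mode s and concentration c = a + b:
#   a = s * (c - 2) + 1,   b = (1 - s) * (c - 2) + 1,   mean = a / (a + b).
def beta_params_from_mode(s, c):
    return s * (c - 2.0) + 1.0, (1.0 - s) * (c - 2.0) + 1.0

def beta_mean(a, b):
    return a / (a + b)

a1, b1 = beta_params_from_mode(0.2, 22)               # -> a = 5,  b = 17
a2, b2 = beta_params_from_mode(0.2, 7)                # -> a = 2,  b = 5
low_22 = beta_mean(a1, b1)                            # 5/22  ~ 0.227
low_7 = beta_mean(a2, b2)                             # 2/7   ~ 0.286
high_22 = beta_mean(*beta_params_from_mode(0.8, 22))  # 17/22 ~ 0.773
high_7 = beta_mean(*beta_params_from_mode(0.8, 7))    # 5/7   ~ 0.714
```

Lower concentration pulls the mean further from the mode toward 0.5, which is exactly the behavior described in Fig. 1.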

Discussion about improvement
Fig. 3 shows scatter plots of the scores and the feature quantity γ for NRLMF and NRLMFβ, in order to assess the evaluation results of CVS3 on the Nuclear Receptor dataset. Here, the scores $s, \tilde{s}$ and the feature quantity γ of each method were calculated by substituting leave-one-out over the targets for the 10-fold division of CVS3. The prediction accuracy (i.e., AUC and AUPR) of NRLMFβ improved significantly compared with that of NRLMF because the scores of the interaction pairs whose feature quantity γ is 0 were improved. In other words, the AUC and AUPR values increased because the scores of interaction pairs below 0.5 in NRLMF were selectively increased by rescoring. This indicates that the prediction accuracy was improved by not excessively lowering the score of pairs with little interaction information (i.e., small feature quantity γ). Fig. 4 shows a histogram of each score in Fig. 3, indicating how the score distribution of NRLMFβ changes relative to NRLMF; in particular, the frequency of interaction pairs with a score of 0.1 or less in NRLMF decreased in NRLMFβ.

Table 3
The AUPR for the 5×10-fold cross-validation scenarios. A 10-fold cross-validation was performed five times, and the average value of AUPR and the 95% confidence interval of the t-distribution are shown. The higher average value is shown in bold.

Conclusion
Since the NRLMF score becomes low for drug-protein pairs with little interaction information, we proposed the rescoring algorithm NRLMFβ. In NRLMFβ, the shape of the beta distribution is determined from the value of the feature quantity γ, which represents the amount of interaction information as defined by equation (13), and from the NRLMF score; the NRLMF score is then recalculated by defining the expected value of that beta distribution as the NRLMFβ score. In the evaluation experiment, in order to compare the generalization performance of NRLMF and NRLMFβ, we performed three cross-validation scenarios, CVS1, CVS2, and CVS3, on the four datasets Nuclear Receptor, GPCR, Ion Channel, and Enzyme and calculated the average values of AUC and AUPR. We confirmed that the prediction accuracy of NRLMFβ was significantly improved compared with that of NRLMF. Thus, we concluded that NRLMFβ improves the prediction accuracy for drug-target pairs with little interaction information. Future studies should focus on incorporating the number of interactions into the prior distributions on the latent feature matrices $U, V$; we expect that this would make it possible to estimate more appropriate parameters and further improve prediction accuracy.