M2PP: a novel computational model for predicting drug-targeted pathogenic proteins

Background Detecting pathogenic proteins is a fundamental way to understand disease mechanisms and to resist disease, which makes pathogenic protein prediction an urgent problem. Genome-wide protein prediction does not necessarily lead to rapid cures, because developing new drugs for a predicted pathogenic protein always requires major expenditures of time and money. To facilitate disease treatment, computational methods are needed that predict pathogenic proteins targeted by existing drugs. Results In this study, we propose a novel computational model, named M2PP, to predict drug-targeted pathogenic proteins. Three types of features were constructed on our heterogeneous network (comprising target proteins, diseases and drugs), based on neighborhood similarity information, drug-inferred information and path information. A random forest regression model was then trained to score unconfirmed target-disease pairs. Five-fold cross-validation was used to evaluate prediction performance, and M2PP compared favorably with other state-of-the-art methods. In addition, M2PP's high-ranked predictions of pathogenic proteins for common diseases were supported by evidence in the public biomedical literature. Conclusions M2PP is an effective and accurate model for predicting drug-targeted pathogenic proteins and could facilitate future biological research.

During the past decades, various prediction methods have been proposed, with varying performance.
Earlier research mainly focused on the protein-protein interaction (PPI) network, whose topological structure was used directly to predict disease-gene associations [4,5]. However, the large number of false positives in PPI networks from public databases limited the prediction accuracy of these methods. Hence, later studies added disease-related clinical data, drawing on GWAS [6][7][8] and gene expression [9][10][11][12][13]. Although these methods were more accurate than those using the PPI network alone, limitations remained. For example, even the comprehensive platform TCGA [14] provides only limited data on uncommon cancers, let alone non-cancer diseases, which greatly restricts the performance of these methods. Unable to overcome the limitations of the data sources, researchers turned to algorithmic improvements, most commonly with machine learning. GCN-MF combined a graph convolutional network with matrix factorization to identify disease-gene associations [15]. Natarajan et al. derived features of diseases and genes for inductive matrix completion [16]. CATAPULT trained a biased support vector machine with features derived from a heterogeneous network [17]. Zeng et al. treated the problem as a recommender system, presenting a probability-based collaborative filtering model to predict pathogenic human genes [18]. Luo et al. developed a method to predict disease-gene associations with multimodal deep learning [19]. Although these algorithmic efforts improved prediction results, most methods still extracted information only from gene and disease data. In such intricate biological networks, using information beyond genes and diseases is essential.
The ultimate objective of predicting pathogenic genes or proteins is to find a breakthrough for disease treatment. When prediction is performed on the whole gene (protein) set, even a successfully predicted novel gene-disease (protein-disease) association still leaves a long path to treating the disease through that gene (protein). There are many reasons; for example, the research and development of new drugs usually takes a long time. Restricting the whole protein set to the drug-targeted protein set is more conducive to disease treatment in clinical research, because for a newly predicted protein-disease association, the drugs that target the protein can serve as candidate treatments for the disease, avoiding the need to develop new drugs. Hence, we propose a method, named M2PP, to predict drug-targeted pathogenic proteins. First, the target, disease and drug sets were collected to construct association networks and similarity networks. Then, features were constructed for each target-disease pair based on neighborhood similarity information, drug-inferred information and path information. Finally, a random forest regression model was trained to score unconfirmed target-disease pairs.

Data collection
We collected drug-targeted single human target proteins from DrugBank [20], restricted to drugs approved by the Food and Drug Administration (FDA) [21]. For these targets, we extracted the diseases with curated associations to them from the Comparative Toxicogenomics Database (CTD) [22]. Three sets (a target set, a disease set and a drug set) were then constructed. Next, we reduced these sets so that every element of one set had an association with both of the other two sets (all associations were from DrugBank and CTD). Finally, we obtained 1002 targets, 1035 diseases and 1095 drugs (Fig. 1a). The target set, disease set and drug set were represented as $T = \{t_1, t_2, \ldots, t_{nT}\}$, $D = \{d_1, d_2, \ldots, d_{nD}\}$ and $M = \{m_1, m_2, \ldots, m_{nM}\}$, respectively.

Network construction
First, we constructed three association networks among the target, disease and drug sets: (1) the target-disease association network, comprising 7342 curated associations from CTD, with adjacency matrix $TDA_{nT \times nD}$; (2) the target-drug interaction network, comprising 38,871 curated interactions from DrugBank and CTD, with adjacency matrix $TDI_{nT \times nM}$; (3) the disease-drug association network, comprising 35,319 curated associations from CTD, with adjacency matrix $DDA_{nD \times nM}$. For target $t_i$ ($1 \le i \le nT$) and disease $d_j$ ($1 \le j \le nD$), $TDA_{i,j} = 1$ if a known association existed between them; otherwise, $TDA_{i,j} = 0$. $TDI$ and $DDA$ were defined analogously.
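The adjacency matrices described above can be illustrated with a minimal sketch. The function name `build_adjacency` and the toy identifiers are hypothetical and serve only to show the 0/1 encoding; they are not taken from the original implementation.

```python
def build_adjacency(rows, cols, pairs):
    """Return a len(rows) x len(cols) 0/1 matrix with 1 where (row, col) is a known pair."""
    row_index = {name: i for i, name in enumerate(rows)}
    col_index = {name: j for j, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in pairs:
        matrix[row_index[r]][col_index[c]] = 1
    return matrix

# Toy data (hypothetical): three targets, two diseases, two known associations.
targets = ["t1", "t2", "t3"]
diseases = ["d1", "d2"]
known_associations = [("t1", "d1"), ("t3", "d2")]

TDA = build_adjacency(targets, diseases, known_associations)
```

The same helper would apply to TDI (targets by drugs) and DDA (diseases by drugs) by changing the row and column sets.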
Then, we constructed the similarity networks: (1) The disease-disease similarity network. We calculated disease semantic similarities based on the Medical Subject Headings (MESH) descriptors [23] by the IDS-SIM algorithm [24] and based on Disease Ontology (DO) [25] by Wang et al.'s method [26], respectively. For a disease-disease pair, the mean of the two similarities was used to construct the semantic similarity matrix $DDS\_S_{nD \times nD}$. Then, we calculated the diseases' topological structure similarity [27], with matrix $DDS\_T_{nD \times nD}$, where $1 \le i, j \le nD$; $TDA_{,i}$ was the $i$th column of $TDA$; and $\alpha'$ was set to 1 according to a previous study [28]. For the two similarity matrices $DDS\_S$ and $DDS\_T$, we proposed an entropy-based integration to obtain the final disease similarity matrix $DDS_{nD \times nD}$. The entropy of row $i$ in a matrix $W_{x \times y}$ was denoted $E^{W}_{i}$. Accordingly, the entropy of disease $d_i$ in $DDS\_S$ and $DDS\_T$ was calculated and denoted $E^{DDS\_S}_{i}$ and $E^{DDS\_T}_{i}$, respectively. All diseases were divided into two subsets, $D\_A$ and $D\_B$, which split the similarity matrix $DDS$ into four parts:

$$DDS = \begin{bmatrix} \text{similarity matrix between } D\_A \text{ and } D\_A & \text{similarity matrix between } D\_A \text{ and } D\_B \\ \text{similarity matrix between } D\_B \text{ and } D\_A & \text{similarity matrix between } D\_B \text{ and } D\_B \end{bmatrix}$$

A low entropy value meant little random information in the similarities, and the upper-left and lower-right parts of $DDS$ were defined on this basis. The similarities between $D\_A$ and $D\_B$ were likewise integrated based on entropy: $D\_A$ was divided into two subsets, $D\_A\_a$ and $D\_A\_b$, and the similarity matrix between $D\_A$ and $D\_B$ was represented as:

$$\text{Similarity matrix between } D\_A \text{ and } D\_B = \begin{bmatrix} DDS\_S_{D\_A\_a,\,D\_B} \\ DDS\_T_{D\_A\_b,\,D\_B} \end{bmatrix}$$

To ensure the symmetry of $DDS$, the similarity matrix between $D\_B$ and $D\_A$ was set to the transpose of the similarity matrix between $D\_A$ and $D\_B$, which yielded the final $DDS$.

Fig. 1 The framework of M2PP. (a) Construct the target set, disease set and drug set; (b) construct heterogeneous networks: the target-disease association network, target-drug interaction network, disease-drug association network, disease-disease similarity network, target-target similarity network and drug-drug topological structure similarity network; (c) construct features for target-disease pairs; (d) train the random forest model and predict association scores for unconfirmed target-disease pairs.

(2) The target-target similarity network. We calculated the similarity of the target proteins' amino acid sequences, obtained from the KEGG database [29], by the Smith-Waterman algorithm [30], and the protein functional similarity by Chen et al.'s method [31]. For a target-target pair, the mean of the two similarities was used to construct the similarity matrix $TTS\_S_{nT \times nT}$. Then, the targets' topological structure similarity matrix $TTS\_T_{nT \times nT}$ was computed, and the target subsets $T\_A$, $T\_B$, $T\_A\_a$ and $T\_A\_b$ were defined. Finally, $TTS\_S$ and $TTS\_T$ were integrated into the final target similarity matrix $TTS_{nT \times nT}$.

(3) The drug-drug topological structure similarity networks. We calculated the drugs' topological structure similarities in the target-drug interaction network and the disease-drug association network, denoted $MMS\_T_{nM \times nM}$ and $MMS\_D_{nM \times nM}$, respectively, where $1 \le i, j \le nM$; $TDI_{,i}$ and $DDA_{,i}$ were the $i$th columns of $TDI$ and $DDA$, respectively; and $\gamma' = 1$, $\delta' = 1$.
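The formula for the topological structure similarity is not reproduced in the text above. One widely used way to derive similarities from columns of a binary association matrix with a normalized bandwidth parameter is the Gaussian interaction profile kernel, and the sketch below assumes that form; it may differ from the exact formula of [27]. The names `gip_similarity` and `bandwidth_prime` are hypothetical, with `bandwidth_prime` standing in for a parameter such as $\alpha'$.

```python
import math

def gip_similarity(profiles, bandwidth_prime=1.0):
    """Gaussian interaction profile kernel between association profiles.

    Assumed form: sim(i, j) = exp(-b * ||p_i - p_j||^2), where
    b = bandwidth_prime / (mean squared norm of all profiles).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all profiles are empty; the bandwidth is undefined")
    bandwidth = bandwidth_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-bandwidth * sq_dist)
    return sim

# Toy TDA with 3 targets (rows) and 3 diseases (columns); values are hypothetical.
TDA = [[1, 0, 1],
       [1, 1, 0],
       [0, 1, 0]]
# A disease profile is a column of TDA.
disease_profiles = [list(col) for col in zip(*TDA)]
DDS_T = gip_similarity(disease_profiles, bandwidth_prime=1.0)
```

Under the same assumption, drug profiles would be taken from the columns of TDI or DDA, and target profiles from the rows of TDA.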
Finally, the heterogeneous network was constructed as shown in Fig. 1b. The characteristics of the data in these networks are summarized in Table 1, where sparsity is the ratio of the number of edges to the network size. Our objective network (the target-disease association network) was the most imbalanced.
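As a rough check of the sparsity of the three association networks, the edge counts and set sizes reported above can be combined as follows. The sketch assumes that "network size" means the number of possible pairs (rows times columns); the values in Table 1 may have been computed differently.

```python
# (number of edges, number of rows, number of columns), taken from the text above.
networks = {
    "TDA": (7342, 1002, 1035),    # target-disease associations
    "TDI": (38871, 1002, 1095),   # target-drug interactions
    "DDA": (35319, 1035, 1095),   # disease-drug associations
}

def sparsity(edges, n_rows, n_cols):
    """Ratio of observed edges to all possible pairs (assumed definition)."""
    return edges / (n_rows * n_cols)

sparsities = {name: sparsity(*spec) for name, spec in networks.items()}
sparsest = min(sparsities, key=sparsities.get)

for name, value in sparsities.items():
    print(f"{name}: {value:.5f}")
```

Under this assumption the target-disease association network has by far the lowest ratio of edges to possible pairs.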

Feature construction for model training to score unconfirmed target-disease pairs
For a target-disease pair $t_i$-$d_j$ ($1 \le i \le nT$, $1 \le j \le nD$), we constructed a 9-dimensional feature vector based on its neighborhood similarity information, drug-inferred information and path information (Fig. 1c). The analysis of these features is summarized in Table 2, including each feature's type, description, content and information source. Each target-disease pair in the training set was treated as a sample: a pair with a known association was a positive sample labelled 1, and a pair without a known association was a negative sample labelled 0. After features were constructed for each sample, the training set was used to train the random forest regression model [32], and the trained model was then used to score the unconfirmed target-disease pairs (Fig. 1d). A higher score represented a higher probability that the unconfirmed pair was associated. The parameters mtry and ntree of the random forest model were set to 3 (the number of features/3) and 500, respectively, according to the default settings of the R package.

Table 2 Analysis of the constructed features

| Type | Description | Feature | Content | Information source | Influence coefficient |
|---|---|---|---|---|---|
| The neighborhood similarity information | Information based on the similarities between the specific disease (target) and its neighborhoods | Fea1 | The average similarity between the specific disease and its neighborhoods which did not have known associations with the specific target | DDS, TDA | 0.58 |
| | | Fea2 | The average similarity between the specific target and its neighborhoods which did not have known associations with the specific disease | TTS, TDA | 0.576 |
| | | Fea3 | The sum of weights for paths connected by the nearest neighborhood of the specific target and the nearest neighborhood of the specific disease | | |
| The drug-inferred information | Information inferred by drugs based on the drug-target-disease mechanism | Fea4 | The maximum quotient of the average weight for the specific disease-drug paths divided by the average weight for the specific target-drug paths | | |
| The path information | Information from paths (length = 2 and length = 3) between the specific target and the specific disease | Fea5 | The average weight of paths from the specific target to the specific disease based on the target-target-disease pattern | TTS, TDA | 0.745 |
| | | Fea6 | The same as above but based on the target-disease-disease pattern | DDS, TDA | 0.722 |
| | | Fea7 | The same as above but based on the target-target-target-disease pattern | TTS, TDA | 0.671 |
| | | Fea8 | The same as above but based on the target-target-disease-disease pattern | TTS, DDS, TDA | 0.654 |
| | | Fea9 | The same as above but based on the target-disease-disease-disease pattern | DDS, TDA | 0.593 |
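The training and scoring step can be sketched as follows. The original work used an R package; this sketch instead uses scikit-learn's `RandomForestRegressor` as a stand-in, mapping ntree to `n_estimators` and mtry to `max_features`. The feature matrix and labels are synthetic placeholders, not data from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the 9-dimensional features of labelled target-disease pairs.
n_samples, n_features = 200, 9
X_train = rng.random((n_samples, n_features))
# Synthetic 0/1 labels (1 = known association, 0 = sampled negative); hypothetical rule.
y_train = (X_train[:, 4] + X_train[:, 5] > 1.0).astype(float)

# ntree = 500 and mtry = 3 (number of features / 3), as stated in the text.
model = RandomForestRegressor(n_estimators=500, max_features=3, random_state=0)
model.fit(X_train, y_train)

# Score unconfirmed pairs; a higher score suggests a more likely association.
X_unconfirmed = rng.random((5, n_features))
scores = model.predict(X_unconfirmed)
print(scores)
```

Because the regression trees average 0/1 labels, the predicted scores fall between 0 and 1 and can be ranked directly.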

Evaluation metric
Fivefold cross-validation (CV) was used to evaluate the performance of the prediction models. The target-disease association network contained 7342 known associations and 1,029,728 unconfirmed pairs. First, the 7342 known target-disease associations and 7342 randomly selected unconfirmed pairs were taken as positive and negative samples, respectively. The remaining 1,022,386 unconfirmed pairs were unlabeled samples. Then, the positive and negative samples were evenly divided into 5 parts, each containing the same number of positive and negative samples. In each fold, four parts were used as the training set, while the remaining part and all unlabeled samples formed the test set. For each test sample, the model produced a score representing the probability that the pair was associated. We calculated the true positive rate (TPR) and false positive rate (FPR), as well as the precision and recall, of these scores under different thresholds to obtain the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). Each fivefold CV yielded five AUROC/AUPR values, whose average was taken as the model's performance in that CV run. To make the results more reliable, we repeated the fivefold CV 5 times and used the mean and standard deviation (SD) of the five average AUROC/AUPR values as the final evaluation metrics.
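The sampling and fold construction described above can be sketched in plain Python. Two assumptions are made here that the text does not state explicitly: unlabeled pairs in the test set are scored as negatives, and AUPR is estimated by average precision. All function names are hypothetical.

```python
import random

def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision, a step-wise estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

def make_folds(positives, unconfirmed, k=5, seed=0):
    """Sample as many negatives as positives, then split both into k parts."""
    rng = random.Random(seed)
    pool = list(unconfirmed)
    rng.shuffle(pool)
    negatives, unlabeled = pool[:len(positives)], pool[len(positives):]
    pos = list(positives)
    rng.shuffle(pos)
    folds = []
    for f in range(k):
        folds.append({
            "train_pos": [p for i, p in enumerate(pos) if i % k != f],
            "train_neg": [n for i, n in enumerate(negatives) if i % k != f],
            "test_pos": pos[f::k],
            # Assumption: unlabeled pairs join the test set as negatives.
            "test_neg": negatives[f::k] + unlabeled,
        })
    return folds

# Toy example (hypothetical pairs): 10 known associations, 40 unconfirmed pairs.
positives = [("t%d" % i, "d0") for i in range(10)]
unconfirmed = [("t%d" % i, "d1") for i in range(40)]
folds = make_folds(positives, unconfirmed)
```

In a full run, a model would be trained on `train_pos` and `train_neg` in each fold, the test pairs would be scored, and `auroc` and `average_precision` would be averaged over the five folds.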

Feature analysis
M2PP achieved a mean AUROC of 0.986 and a mean AUPR of 0.417 over five repetitions of fivefold CV. To assess the influence of each feature on prediction performance, we removed each feature in turn and ran M2PP with the remaining features under the same fold settings. The more the performance dropped after removing a feature, the more effective that feature was. The AUROC and AUPR values obtained after removing each feature are shown as boxplots in Fig. 2, where the mean values are represented by points in the boxes. The mean AUROC/AUPR values with all features were better than those with any feature removed. A paired t-test [33] was performed between the AUROC (AUPR) values obtained with all features and those obtained with each feature removed, to check whether the mean difference in performance differed significantly from zero. All p-values were less than 0.05, as shown in Fig. 2, indicating that using all features performed significantly better than removing any feature. This result demonstrated that each feature was indispensable. To further explore the influence of each feature on prediction performance, we defined an indicator named the influence coefficient:

$$\text{Influence coefficient of } Fea_i = \mathrm{mean}(\mathrm{DifferenceAUROC}_i, \mathrm{DifferenceAUPR}_i) \qquad (29)$$

where $AUROC_{allfeatures}$ and $AUPR_{allfeatures}$ represented the AUROC and AUPR values of five repetitions of fivefold CV using all features, respectively, and $AUROC_{allfeatures \setminus Fea_i}$ and $AUPR_{allfeatures \setminus Fea_i}$ represented the AUROC and AUPR values of five repetitions of fivefold CV with feature $Fea_i$ removed, respectively. The larger the influence coefficient, the more effective the feature. The influence coefficient of each feature is shown in Table 2. Within the neighborhood similarity information type, Fea3 had the largest influence coefficient, because Fea3 mainly used the similarity of the nearest neighborhoods, which was the most valid information in the similarity networks.
In the path information type, Fea5 and Fea6 obtained the highest influence coefficients, because paths of length 2 provided more basic, direct and non-redundant information than paths of length 3. The drug-inferred information type, Fea4, also achieved a good influence coefficient, indicating that drugs indeed play an effective role in predicting target-disease associations through the drug-target-disease mechanism. Hence, our constructed features were effective, reasonable and indispensable for achieving excellent prediction performance.
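The paired t-test used in the ablation can be illustrated with a short sketch. The AUROC values below are made-up placeholders, not results from the study, and the comparison against the two-sided 5% critical value for 4 degrees of freedom (2.776) replaces an explicit p-value computation.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical average AUROC values from five repetitions of fivefold CV.
auroc_all_features = [0.915, 0.916, 0.914, 0.916, 0.914]
auroc_without_one = [0.910, 0.910, 0.910, 0.910, 0.910]

t_value = paired_t_statistic(auroc_all_features, auroc_without_one)

# Two-sided critical value of Student's t at the 0.05 level with 5 - 1 = 4 degrees of freedom.
CRITICAL_T_DF4 = 2.776
significant = abs(t_value) > CRITICAL_T_DF4
print(round(t_value, 2), significant)
```

Pairing is appropriate here because both settings are evaluated under the same fold assignments, so each repetition yields one matched difference.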

Comparison with existing prediction models
M2PP was compared with six state-of-the-art models: RFLDA [34], DDR [35], NEDD [36], IRFMDA [37], GCRFLDA [38] and MFLDA [39]. The first four methods are based on the random forest algorithm, and the last two on graph convolutional matrix completion and matrix factorization, respectively. We performed fivefold CV five times for each model; the mean and SD of the AUROC/AUPR values are shown in Fig. 3a. Each disease belonged to at least one category provided by MESH; for example, the disease "Lymphoma" belonged to three categories: "C04: Neoplasms", "C15: Hemic and Lymphatic Diseases" and "C20: Immune System Diseases". In our network, the diseases covered 24 categories; the number and proportion of diseases in each category are shown in the left panel of Fig. 3b. The proportions of the top 5 categories, "C23: Pathological Conditions, Signs and Symptoms", "C10: Nervous System Diseases", "C04: Neoplasms", "C14: Cardiovascular Diseases" and "C16: Congenital, Hereditary, and Neonatal Diseases and Abnormalities", each exceeded 10%; the UpSet chart in the right panel of Fig. 3b shows the details of the diseases in them. For these five categories, we evaluated the models' prediction performance on their diseases. First, we trained the model with a training set in which the known target-disease associations (excluding diseases in the investigated category) were the positive samples and an equal number of randomly selected unconfirmed target-disease pairs (excluding diseases in the investigated category) were the negative samples. Second, the pairs between all targets and each disease in the investigated category were taken in turn as the test set and scored by the model.
Then, we computed the AUROC and AUPR values for each disease in the investigated category, and took the average AUROC/AUPR value as the prediction performance for that category. The process was repeated 5 times to obtain reliable results. Each model's mean and SD of AUROC/AUPR values for the five categories are shown in Fig. 3c, where M2PP always achieved the best performance. These results indicated the excellent predictive ability of our model.

Case studies
We predicted new pathogenic proteins for five common diseases: lung cancer, breast cancer, colon cancer, leukemia and lymphoma. For each investigated disease, M2PP was trained with a training set in which the known target-disease associations (excluding the investigated disease) were the positive samples and an equal number of randomly selected unconfirmed target-disease pairs (excluding the investigated disease) were the negative samples. M2PP then scored the pairs between all targets and the investigated disease. We repeated the process 5 times, so each pair between a target and the investigated disease had five scores, whose average was taken as the prediction score of the pair. We sorted the prediction scores of all unconfirmed pairs between targets and the investigated disease, and manually searched the public biomedical literature for evidence supporting the top 10 pairs. All top 10 targets were confirmed for lung cancer, breast cancer and colon cancer, nine for leukemia and seven for lymphoma (Table 3). Here, we describe the top-ranked predicted target for each disease. TNF was found to play a key role in inducing resistance to epidermal growth factor receptor inhibition in lung cancer, and concomitant inhibition of epidermal growth factor receptor and TNF was suggested as a potential new treatment strategy for lung cancer patients [40]. IL2 inhibited the growth of breast cancer cells by enhancing the proliferation of natural killer cells [41]. Inhibiting or knocking down MET sensitized colon cancer cells to cetuximab-mediated growth inhibition, implying that targeting MET was a rational strategy for reversing cetuximab resistance in colon cancer [42]. VEGFA was observed to have an additive effect in increasing the risk of leukemia [43].
CHKA possessed oncogenic activity and could be a potential therapeutic target in lymphoma [44]. We also predicted target-disease association scores on the whole network and sorted the scores of all unconfirmed pairs. Seven of the top 10 associations were supported by evidence in the public literature (Table 4). For example, researchers investigated the expression and functions of ALOX5 in breast cancer cells and demonstrated that inhibiting ALOX5 had therapeutic potential in breast cancer [45]. In addition to this literature evidence, we found that in both Tables 3 and 4, the targets and diseases in all successful predictions had co-associated drugs (CDs), that is, drugs associated with both the target and the disease.
This observation further indicated that these high-ranked predicted pairs were reasonable from the perspectives of both computational data and biomedical verification. Other drugs that interact with the predicted target might be candidate therapies for the investigated disease, which remains to be explored in future clinical trials. These results indicated that M2PP can facilitate future biological research.
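The co-associated drugs (CDs) of a predicted pair can be read directly from the two adjacency matrices that involve drugs. The sketch below uses toy matrices and a hypothetical helper name, `co_associated_drugs`, to show the intersection.

```python
def co_associated_drugs(target_idx, disease_idx, TDI, DDA, drugs):
    """Drugs linked to both the target (row of TDI) and the disease (row of DDA)."""
    return [
        drug
        for k, drug in enumerate(drugs)
        if TDI[target_idx][k] == 1 and DDA[disease_idx][k] == 1
    ]

# Toy data (hypothetical): 2 targets, 2 diseases, 3 drugs.
drugs = ["m1", "m2", "m3"]
TDI = [[1, 1, 0],   # target 0 interacts with m1 and m2
       [0, 0, 1]]   # target 1 interacts with m3
DDA = [[0, 1, 1],   # disease 0 is associated with m2 and m3
       [1, 0, 0]]   # disease 1 is associated with m1

shared = co_associated_drugs(0, 0, TDI, DDA, drugs)
none_shared = co_associated_drugs(1, 1, TDI, DDA, drugs)
```

A non-empty result for a predicted target-disease pair corresponds to the situation described above, where existing drugs already connect the target and the disease.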

Conclusion
Predicting drug-targeted pathogenic proteins is crucial for understanding disease mechanisms and implementing disease treatment. In this study, we presented a novel model, M2PP, to predict drug-targeted pathogenic proteins. First, we constructed a heterogeneous network comprising the target-disease association network, target-drug interaction network, disease-drug association network, disease-disease similarity network, target-target similarity network and drug-drug topological structure similarity network. Then, we developed three types of features on the network, based on neighborhood similarity information, drug-inferred information and path information. Finally, we trained a random forest model with these features to score unconfirmed target-disease pairs. In the results, we first analyzed our constructed features in detail. By removing each feature in turn and checking the change in prediction performance, we found that each feature was indispensable. The three feature types obtained average influence coefficients of 0.598 (neighborhood similarity information), 0.671 (drug-inferred information) and 0.677 (path information), respectively. The path information type achieved the highest value mainly because of paths of length 2, which provided more basic, direct and non-redundant information than paths of length 3. In addition, the drug-inferred information type also obtained a good value, indicating that drugs were effective in predicting target-disease associations through the drug-target-disease mechanism. Then, we compared M2PP with several state-of-the-art models, among which M2PP achieved superior performance. According to disease category, we extracted sub-networks from the whole target-disease association network for the top 5 categories to perform the prediction. The results showed that the categories "C23", "C04" and "C14" achieved better performance.
This was because diseases in "C23", "C04" and "C14" had more associations with targets than those in the other two categories, "C10" and "C16". The average degrees of diseases in "C23", "C04" and "C14" were 6.84 (1670 associations/244 diseases), 12.03 (1985/165) and 7.16 (960/134), whereas those in "C10" and "C16" were 5.06 (1057/209) and 2.95 (348/118). Finally, we predicted new target-disease associations using M2PP, and several high-ranked associations were confirmed by evidence in the public literature. These results demonstrated that M2PP was effective and accurate, and it may facilitate biological research in the future.
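The average degrees quoted above follow directly from the association and disease counts per category, as the short check below shows. The dictionary layout is only an illustrative choice.

```python
# (number of target-disease associations, number of diseases) per MESH category,
# taken from the text above.
category_counts = {
    "C23": (1670, 244),
    "C04": (1985, 165),
    "C14": (960, 134),
    "C10": (1057, 209),
    "C16": (348, 118),
}

# Average degree = associations per disease, rounded to two decimals.
average_degree = {
    category: round(associations / diseases, 2)
    for category, (associations, diseases) in category_counts.items()
}

for category, degree in average_degree.items():
    print(category, degree)
```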