Ranking Cancer Proteins by Integrating PPI Network and Protein Expression Profiles

Proteomics, the large-scale analysis of proteins, is contributing greatly to understanding gene function in the postgenomic era. However, disease protein ranking using shotgun proteomics data has not been fully evaluated. In this study, we prioritized disease-related proteins by integrating the protein-protein interaction (PPI) network and protein differential expression profiles from colon and rectal cancer (CRC) or breast cancer (BC) proteomics. We applied Local Ranking (LR) and Global Ranking (GR) methods in network with three kinds of protein sets as a priori knowledge, which were known disease proteins (KDPs) that were collected from the Online Mendelian Inheritance in Man (OMIM) database, differentially expressed proteins (DEPs), and the collection of KDPs and their direct neighborhood with differential expression (eKDPs). The cross-validations showed that GR method outperformed LR method while using eKDPs as the initial training showed significantly higher accuracy compared to using the other two a priori sets. And then we validated the top ranked proteins using RNAi-based loss-of-function screens in the DepMap database. The results showed that 75% of top 20 proteins in CRC are necessary for tumor survival. In summary, the network-based Global Ranking with protein differential expression can efficiently prioritize cancer-related proteins and discover new candidate cancer genes or proteins.


Introduction
Discovering disease genes is important for understanding the mechanisms in physiological and pathological processes of disease. Genetic studies like linkage analysis [1] and association studies [2] can uncover diseases associated with chromosomal regions, which contain hundreds of candidate genes that may be associated with a disease [3]. Conducting experiments to confirm certain disease-related genes is timeconsuming. To maximize efficiency, many computational methods and tools have been proposed to prioritize disease genes.
Since proteins are the basic unit of biological function to link genotypes to phenotypes, CPTAC and other cancer proteomics projects have characterized the proteomic features of human cancers [4][5][6][7][8]. However, how to prioritize cancer protein and discover new candidates based on PPI network and protein expression has not been fully addressed.
In the previous reports, the gene prioritization approach mainly depends on the similarity of characteristics, including sequence similarity [9,10], function annotation [10][11][12][13], gene product [14], and protein domains, between known genes associated with the phenotype of interest and the candidate genes. In biological network, the neighbor genes may be functionally or physically similar and affect the pathway or the phenotype of interest [15]. Oti et al. [16] proposed a direct neighbor-based method to predict disease candidates with known disease loci. This method is based on the principle of "guilt-by-association" [17] and is called Local Ranking (LR) in this work. In addition to the LR method, Köhler et al. [18] proposed random walk method to access the global distance between the candidate genes and the known disease genes, which we called Global Ranking (GR) here. However, lack of knowledge about disease proteins and few known diseaserelated genes limits the development of the GR. To solve this problem, one approach combines the PPI network and the gene expression data to prioritize genes [19] like PINTA 2 BioMed Research International  [20]. Le and Kwon [21] developed a neighbor-favoring weight reinforcement to improve the performance in disease gene prioritization. Recently, protein network-based integration of multiomics data for prioritizing cancer genes has also been proposed [22,23]. However, these strategies have not been tested in proteomics data analysis yet. In this study, by integrating PPI network and differentially expressed protein profiles given by shotgun proteomics, we evaluated LR and GR methods with three different initial protein sets as a priori knowledge to prioritize disease proteins. The three initial protein sets contain known disease proteins (KDPs), differentially expressed proteins (DEPs), and the KDPs with their direct neighbor DEPs (eKDPs). The cross-validations and RNAi perturbation screens were used for performance evaluation of different strategies.

Proteomic Datasets.
To test and compare the two ranking methods with three initial protein sets, two cancer proteome datasets were used. The proteomic data of CRC used in our work include 90 TCGA colon tumor samples and 30 normal colon epithelium samples, and the normalized proteomic expression profile is provided in Zhang's work [4]. The other proteomic dataset contains 77 tumor samples and 3 normal breast tissue samples, and its normalization is presented in Mertins et al. [5] 2.2. Protein-Protein Interaction Network. We constructed protein interaction networks based on the HINT database (http://hint.yulab.org/). HINT [24] (High-quality INTeractomes) is a database of high-quality protein-protein interactions in different organisms and contains two types of interactions: binary physical interactions and cocomplex associations. After removing duplicates and self-linked interactions, we obtained 132032 interactions between 13048 proteins with the average degree of 10.12.

Cancer Protein
List. The 44 known CRC genes were collected from the OMIM [25] knowledgebase (http://www.ncbi .nlm.nih.gov/omim), 39 of which exist in the HINT, as shown in Table 1. Similarly, 28 of 33 known BC genes collected from OMIM exist in HINT, as shown in Table 2.

Differential Expression
Measures. The statistical analysis of protein differential expression was Welch's t-test, and a Bonferroni correction (p < 0.05) was applied, leading to 2455 DEPs in the CRC and 1549 DEPs in BC. We selected the DEPs directly linked with the KDPs in the PPI network to enlarge the initial protein set and made up the third initial protein set eKDPs.

Performance Measurement.
Similarly to the previous disease gene ranking work [19], we employed a leave-one-out cross-validation test to validate the different models. In each iteration, one KDP and the other 99 non-KDP proteins were randomly selected to set up the testing set. All the remaining KDPs were used for training and building the model. A generalization error estimate was obtained by repeating this procedure for each of the KDPs available. Here only the proteins in the HINT PPI network were considered. The performance measurement used was the receiver-operating characteristic (ROC) analysis [26]. "roc" function in R package "pROC" was used to realize the measurement. The ranking scores of these proteins in the testing set given by the GR or LR method were used as predictor. Genome-scale perturbation screens can now be efficiently performed in many cell lines using RNAi to knock down the target genes. Tsherniak et al. systematically analyzed genome-scale loss-function screens performed in 501 cancer cell lines and the results were publicly available on the Cancer Dependency web [27]. In order to validate the top ranked genes/proteins that we newly identified in CRC samples, we investigated their gene-level differential dependency scores (DS) at the web (https://depmap.org/rnai/index) to measure the relationship to tumor survival.

Ranking Strategies.
In the LR method, the protein rankings are solely based on the directly connected protein neighbors, which can be one of the three kinds of initial protein sets (KDPs, DEPs, and eKDPs). The GR method can also use the information of the three kinds of initial protein sets by processing a random network walk on the PPI network.
The LR method sets the number of proteins in the initial protein set as its only standard for evaluating the probability of a candidate protein being associated with a disease. The more proteins in the initial protein set are associated with the candidate proteins, the higher score the candidate protein will get. In the GR method, the random walk restarts at one of the three kinds of initial protein sets. We adopted the heat kernel rank [28] method, which has been applied in gene prioritization previously [29]. ? Figure 1: Workflow of the ranking process of LR and GR with three kinds of initial protein sets. Firstly, we mapped the three kinds of protein sets to the PPI network as initial protein datasets, which consisted of KDPs (red), DEPs (green), and eKDPs. Then, we applied the LR or GR to the network based on the initial protein set (blue).
Given a graph G, processing a random walk on G, the transition probability matrix W is defined as W=D −1 A. A is the adjacency matrix and D is the diagonal matrix. The Laplacian matrix of G is L=I-W. By establishing the diffusion rate and setting a preference vector 0 , the ranking score vector can be got as the following Equation: The similarity matrix of proteins is shown in Then, we can obtain the discrete approximation (see (3)) described in Yang's work [30]: where parameter is the diffusion rate and N is the number of iterations. As the two parameters are set, a random walk through the network is initiated. We set =0.5 and N=3 for the interactions that can make reach a steady state. We initialized the preference vector 0 with binary values and filled the initial protein set with 1 and all the other candidate proteins with 0. Finally, we can obtain the vector , which contained scores for all of the candidate proteins. It should be noted that the KDP's neighbor DEPs were not initialized when this KDP was selected for the testing set in leave-one-out cross-validation.

Comparing LR, GR Methods with Three Different Kinds of
Initial Protein Set. We compared the two ranking methods using local [16] and global interaction information [29] or both. We plotted the ROC curves and computed the AUC values to evaluate LR and GR with three different initial protein sets, respectively [26,31]. Figure 1 shows a workflow of the comparison of the two ranking strategies. The LR method scores the candidate protein by the number of directly linked initial proteins, and the GR method scores the candidate protein by all the initial proteins. As a result, all the candidate proteins on the PPI network can be scored.
In CRC, the AUC values of GR with different initial protein sets were 0.834, 0.782, and 0.735, better than the LR method, whose AUC values were 0.814, 0.777, and 0.683, respectively, as shown in Figure 2. As a result, using the same kind of prior knowledge, the performance of GR was better compared to LR. While using the eKDPs as the initial protein set, the result tended to be significantly better compared to using the other two datasets. We also calculated the AUC value of GR with the initial protein set involving the KDPs and 629 randomly selected DEPs (AUC=0.687). This result suggested that the GR method is suitable for proteome data as well as transcriptome data, and the neighbor DEPs can enlarge the initial protein set, which is helpful for ranking proteins.
The results of BC were the same as the results from CRC. These results showed that both the LR and GR methods can assess the similarity between two proteins in the PPI network for protein ranking. However, the GR method was better than the LR method. After prioritizing the proteins with the TCGA CRC shotgun proteomics, we obtained 6 ranked candidate protein lists for the LR and GR methods with three initial protein sets. We counted the number of KDPs ranked in the top 10, top 20, top 50, top 100, and top 200 proteins in the lists. Figure 3 shows the comparison of the number of top ranked KDPs with different methods. We noted that the results of GR were the same as LR with the same initial protein set. However, while comparing the initial protein sets, eKDPs performed better than the other two. The top 100 ranked proteins resulted from GR method with eKDPs in CRC and BC are provided in Supplementary Table 1.

Annotation of the Top Ranked Proteins in CRC.
As the results shown above, the GR method with eKDPs performed better than the other two. We looked at the top 20 and top 50 ranked proteins/genes identified by the GR method with eKDPs in CRC samples. We found 7 KDPs in the top 20 ranked proteins (37%) and 10 KDPs in the top 50 ranked proteins (20%). Except for the KDPs, using the Cancer Dependency Map [27] we also queried the cancer dependency scores of 20 putative cancer genes/proteins in the top ranked list that we newly identified from the CRC dataset. As shown in Figure 4

Discussion
In this manuscript, we compared the performance of network-based methods using protein expression profiles for cancer protein ranking. We found that the GR method outperformed when the eKDPs were used as the initial protein set. We randomly selected 39 proteins in the PPI network and compared with 39 KDPs for their degrees linking with the differentially expressed nodes. The average link of colon cancer proteins to DEPs was 16.46, which is significantly higher than the average value 5.65 from random controls. This result suggests that the expressions of the close neighbors of cancer proteins are more likely to be changed. We further validated the top ranked proteins/genes in CRC samples based on the dependency score given by the Cancer Dependency Map. 75% of top 20 genes showed relationship with tumor survival. For example, UBC has different functions depending on the Lys residue linked by the ubiquitin. UBC is involved in DNA repair, kinase modification, protein degradation, DNA-damage responses, and other pathways [32]. One study showed Ubiquitin C-terminal hydrolase-L1 (UCHL1) can activate the -catenin/TCF pathway through its deubiquitinating activity to contribute to  CRC progression [33]. Minichromosome maintenance complex component 2 (MCM2) is involved in the initiation of eukaryotic genome replication. It has been reported that MCM2 is a therapeutic target of Trichostatin A in CRC cells [34] and also has been suggested to be used in the early diagnosis of CRC [35]. The Jun Protooncogene (JUN) is involved in activated KRAS-mediated transcriptional activation of USP28 in CRC cells, where it binds to the USP28 promoter [36]. Cullin 3 (CUL3) is the core component of multiple cullin-RING-based BCR (BTB-CUL3-RBX1) E3 ubiquitinprotein ligase complexes, which mediate the ubiquitination and subsequent proteasomal degradation of target proteins. As shown by Wang, CUL3 downregulation rescues folate deprivation-induced MAT II exhaustion and growth arrest in CRC cells [37].

Conclusions
Global network-based ranking is more efficient for proteomics data in cancer protein identification. The networkbased proteome analysis is helpful in ranking disease proteins and discovering new candidate proteins.

Data Availability
The processed proteomic expression profiles supporting this study are from previously reported studies and datasets, which have been cited. The code used to support the findings of this study is available at https://github.com/Mathlida/ protein-ranking.

Supplementary Materials
Supplementary Table 1: the top 100 ranked proteins that resulted from GR method with eKDPs in CRC and BC. The table lists the top 100 ranked proteins that resulted from GR method with eKDPs in colon and rectal cancer and breast cancer. (Supplementary Materials)