Classifying Ten Types of Major Cancers Based on Reverse Phase Protein Array Profiles

Gathering vast data sets of cancer genomes requires more efficient and autonomous procedures to classify cancer types and to discover a few essential genes to distinguish different cancers. Because protein expression is more stable than gene expression, we chose reverse phase protein array (RPPA) data, a powerful and robust antibody-based high-throughput approach for targeted proteomics, to perform our research. In this study, we proposed a computational framework to classify the patient samples into ten major cancer types based on the RPPA data using the SMO (Sequential minimal optimization) method. A careful feature selection procedure was employed to select 23 important proteins from the total of 187 proteins by mRMR (minimum Redundancy Maximum Relevance Feature Selection) and IFS (Incremental Feature Selection) on the training set. By using the 23 proteins, we successfully classified the ten cancer types with an MCC (Matthews Correlation Coefficient) of 0.904 on the training set, evaluated by 10-fold cross-validation, and an MCC of 0.936 on an independent test set. Further analysis of these 23 proteins was performed. Most of these proteins can present the hallmarks of cancer; Chk2, for example, plays an important role in the proliferation of cancer cells. Our analysis of these 23 proteins lends credence to the importance of these genes as indicators of cancer classification. We also believe our methods and findings may shed light on the discoveries of specific biomarkers of different types of cancers.


Introduction
Identifying cancer-specific genes involved in tumorigenesis and cancer progression is one of the major ways to understand the pathophysiologic mechanisms of cancers and to find therapeutic drug targets. Many efforts have been made to identify cancer biomarkers by using gene expression profiles [1]. However, the robustness of microarray-derived biomarkers is very poor [2]; this is in part because the robustness can be easily influenced in gene expression levels by small environmental changes. Without the evaluation of protein expression levels, there would be no way to illustrate causes of tumor proliferation and differentiation. Therefore, better understanding of the translational states of these genomes will bring us a step closer to finding potential drug targets and to illustrating off-target effects in cancer medicine.
Reverse phase protein array (RPPA) is a powerful and robust antibody-based high-throughput approach for targeted proteomics that allows us to quantitatively assess target protein expression in large sample sets [3]. In this process, sample analytes are immobilized in the solid phase, and analyte-specific antibodies are used in the solution phase. Through using secondary tagging and signal amplification to detect bound antibodies, proteins may be measured. Compared with conventional protein quantify methods, such as western blotting or ELISA, the advantages of RPPA include: large-scale quantification of the protein, high sensitivity, and small sample volume requirements [4]. While mass spectrometry, usually used to quantify the numbers of phosphorylation sites or phosphopeptides, requires further protein digestion, peptide fractionation and phosphopeptide enrichment after protein extraction, RPPA can directly quantify the extracted protein [5]. The application of RPPA has been extensively validated for both cell lines and patient samples [6], and it illustrates mechanistic insights behind diseases.
Currently, cancer types are classified by anatomical positions where they are found, such as lung cancer, breast cancer, etc. Whether these names could present their proteomic feature has not been determined until now. Although there have been some methods to find biomarker signatures for specific cancer types, there is still little research being done that considers different types of cancer as a whole in order to identify their similar or distinct proteomic expression patterns and classification features.
In this study, we proposed a computational workflow to successfully use 23 proteins to classify patient samples into ten main cancer types. First, we randomly divided the 3467 samples from ten types of cancers into a training set with 2775 samples and an independent test set with 692 samples. The proportions of each cancer type were similar in the training set and the independent test set. Then, with the training set, all features for distinguishing groups were ranked by the mRMR (minimum Redundancy Maximum Relevance Feature Selection) criteria. With 10-fold cross-validation on the training set, the SMO (Sequential minimal optimization) and the IFS (Incremental Feature Selection) [7] methods were used to choose an optimal feature set. A total of 23 proteins were selected from the training set. Their MCC (Matthews Correlation Coefficient) for the training set was 0.904 evaluated by 10-fold cross validation and their MCC on the independent test set was 0.936. Our methods could provide clinicians with knowledge of key distinct biochemical features of cancer types and could shed some new light on the discoveries of specific biomarkers of different types of cancers.

Materials and Methods Datasets
The RPPA data were downloaded from TCPA (The Cancer Proteome Atlas) database [8] (http://app1.bioinformatics.mdanderson.org/tcpa/_design/basic/download.html under Pan-Cancer 11 RBN), which contained proteomic expression of 3467 cancer patients in 11 cancer types ( Table 1). Because COAD (Colon adenocarcinoma) and READ (Rectum adenocarcinoma) share similar pathologies and were analyzed together in the TCGA (The Cancer Genome Atlas) colon and rectal cancer study [9], we combined the COAD and READ samples together as 'Colon adenocarcinoma and Rectum adenocarcinoma' samples. Therefore, ten cancer types were analyzed in following steps.
Because we did not have a different cohort to do multi-center validation, we randomly divided the 3467 samples into a training set with 2775 samples and an independent test set with 692 samples. The ratio of training samples over test samples was approximately 4:1 and we kept the proportion of each cancer type roughly the same in the training set and the independent test set. The description of the ten cancer types and their sample sizes in are given in Table 1. The training and test data sets are provided in S1 File.
Each sample contained 187 proteins whose expression levels were measured with reverse phase protein array (RPPA). RPPA is a protein array that allows measurement of protein expression levels in a large number of samples simultaneously in a quantitative manner when high-quality antibodies are available [4]. The 187 protein expression levels were considered as 187 features to be used for the cancer type classifications in this study.

Feature selection
The expression levels of 187 proteins may not all contribute equally to the classification. The maximum relevance minimum redundancy (mRMR) method [10][11][12][13] was employed to rank the importance of the 187 features in the training set. The 187 features can be ordered by using this method according to each feature's relevance to the target and according to the redundancy among the features themselves.
Let O denotes the whole set of 187 features, while O s denotes the already-selected feature set which includes m features and O t denotes the to-be-selected feature set which includes n features. The relevance D of the feature f in O t with the cancer classes c can be calculated by: And the redundancy R of the feature f in O t with the already-selected features in O s can be calculated by: To obtain the feature f j in O t with maximum relevance with cancer classes c and minimum redundancy with the already-selected features O s , Equation (1) and Equation (2)  The feature evaluation will continue 187 rounds. After these evaluations, a ranked feature list S by mRMR method can be obtained: :::; f h 0 ; :::; f N 0 g ð 4Þ The feature index h indicates the importance of feature. A feature with a smaller index h indicated that it had a better trade-off between the maximum relevance and the minimum redundancy, and it may contribute more in the classification.
Based on the ranked feature list in the mRMR table, we adopted the Incremental Feature Selection (IFS) method [14,15] to determine the optimal feature set, or one that achieves the best classification performance. To perform this method, features in the mRMR table were added one by one from higher to lower rank.
When another feature had been added, a new feature set was generated. And we get 187 feature sets, and the i-th feature set is: Based on each of the 187 feature sets, the classifiers were built and tested on the training set with 10-fold cross validation. With Matthews Correlation Coefficient (MCC) of 10-fold cross validation calculated on training set, we obtain an IFS table with the number of features and the performance of them. S optimal is the optimal feature set that achieves the highest MCC on training set. At last, the model was build with features from S optimal on training set and elevated on the test set.

Prediction methods
We randomly divided the whole data set into a training set and an independent test set. The training set was further partitioned into 10 equally sized partitions. The 10-fold cross-validation on the training set was applied to select the features and build the prediction model. The constructed prediction model was tested on the independent test set. The framework of model construction and evaluation was shown in Fig 1. We tried the following four machine learning algorithms: SMO (Sequential minimal optimization), IB1 (Nearest Neighbor Algorithm), Dagging, RandomForest (Random Forest), and selected the optimal one to construct the classifier. The brief description of these algorithms was as below.
The SMO method is one of the popular algorithms for training support vector machines (SVM) [16]. It breaks the optimization problem of a SVM into a series of the smallest possible sub-problems, which are then solved analytically [16]. To tackle multi-class problems, pairwise coupling [17] is applied to build the multi-class classifier.
IB1 is a nearest neighbor classifier, in which the normalized Euclidean distance is used to measure the distance of two samples. For a query test sample, the class of a training sample with minimum distance is assigned to the test sample as the predicted result. For more information, please refer to Aha and Kibler's study [18].
Dagging is a meta classifier that combines multiple models derived from a single learning algorithm using disjoint samples from the training dataset and integrates the results of these models by majority voting [19]. Suppose there is a training dataset I containing n samples. k subsets are constructed by randomly taking samples in I without replacement such that each of them contain n 0 samples, where kn 0 n. A selected basic learning algorithm is trained on these k subsets, thereby inducing k classification models M 1 ,M 2 ,. . .,M k . For a query sample, M i (1 i k) provides a predict result and the final predicted result of Dagging is the class with most votes.
Random Forest algorithm was first proposed by Loe Breiman [20]. It is an ensemble predictor consisting of multiply decision trees. Suppose there are n samples in the training set and each sample was represented by M features. Each tree is constructed by randomly selecting N, with replacement, from the training set. At each node, randomly select m features and select the optimized split to grow the tree. After constructing multiply decision trees, the predicted result of a given sample is the class that receives the most votes from these trees.

Matthews Correlation Coefficient (MCC)
MCC [21], a balanced measure even if the classes are of very different sizes, is often used to evaluate the performance of prediction methods on a two-class classification problem. To calculate the MCC, one must count four values: true positives (TP), false positive (FP), true negative (TN) and false negative (FN) [22,23]. Then, the MCC can be computed by However, many problems involve more than two classes, say N classes encoded by 1,2,. . .,N (N > 2). In this case, we can calculate the MCC for class i to partly measure the performance of prediction methods by counting TP, FP, TN and FN as following manners: TP i : the number of samples such that class i is their predicted class and true class; FP i : the number of samples such that class i is their predicted class and class i is not their true class; TN i : the number of samples such that class i is neither their predicted class nor their true class; FN i : the number of samples such that class i is not their predicted class and class i is their true class.
Accordingly, MCC for class i, denoted by MCC i , can be computed by However, these values can't completely measure the performance of prediction methods, the overall MCC in multiclass case is still necessary. Fortunately, Gorodkin [24] has reported the MCC in multiclass case, which was used to evaluate the performance of the prediction methods mentioned in Section "Prediction methods". In parallel, The MCC for each class will also be given as references. Here, we gave the brief description of the overall MCC in multiclass case as below.
Suppose there is a classification problem on n samples, say s 1 ,s 2 ,. . .,s n , and N classes encoded by 1,2,. . .,N. Define a matrix Y with n rows and N columns, where Y ij = 1 if the i-th sample belongs to class j and Y ij = 0 otherwise. For a classification model, its predicted results on the problem can be represented by two matrices X and C, where X has n rows and N columns, ( and C has N rows and N columns, C ij is the number of samples in class i that have been predicted to be class j.
For Matrices X and Y, their covariance function can be calculated by where X k and Y k are the k-th column of matrices X and Y, respectively, X k and Y k are mean value of numbers in X k and Y k , respectively. Then, the MCC in multiclass case can be computed by the following formulation [25]: Like the MCC in two-class case, the MCC in multiclass case ranges between -1 and 1, where 1 indicates the perfect classification, -1 the extreme misclassification.

Results and Discussion
The mRMR and IFS results By using the maximum relevance minimum redundancy (mRMR) method, the 187 features were ranked by importance in the training set. The result of the mRMR table can be found in S2 File.
During the IFS approach, each protein feature was added one by one. The classification MCCs which were obtained by four prediction methods, on the training set evaluated by 10-fold cross validation are presented in S3 File. We depicted the classification MCCs as   RandomForest were 0.985, 0.937, 0.969 and 0.925, indicating SMO can be used to construct an optimal classifier. By carefully checking the predicted results of SMO, it can be seen that by using the top 23 proteins, the MCC reached 0.904 which was the first reach above 0.900. With more proteins, the MCC did not increase by much. Therefore, in this study, we considered the 23 proteins as the optimal feature set and these 23 proteins were regarded as the most important proteins in classifying these ten types of cancers. We evaluated their prediction performance on the independent test set and the MCC was 0.936. The MCC for each cancer type can be found in S3 File.

The selected top 23 proteins for distinguishing cancer types
The selected top 23 proteins are summarized in Table 2. These proteins may play important roles in classifying the ten different cancer types. Most of these proteins have been reported to be related to certain tumors. For example, Claudin-7 has been reported to be over-expressed in breast tumors [26] and down-regulated in head and neck carcinomas [27]. TIGAR is up-regulated in colon tumors [28]. Gene amplification of ESR1 occurs frequently with breast cancer [29]. PREX1 is highly expressed in prostate cancer [30]. Thus, our findings are further corroborated by these previous results. Below, we will discuss the biological significance of the 23 proteins in detail based on gene function, cell pathways and biological functions, which may shed some light on the differences of different cancers in protein expression levels. We mainly discuss these genes in sections according to Robert A. Weinburg's [31]. For some genes that do not apply to cancer's hallmarks, we try to put these genes with similar functions together for discussion (see Fig 3).
Preventing cell death is crucial for cancer development because cancer cells are often resistant to apoptotic signaling caused by DNA damage and other factors. In our results, we found one gene that is related to apoptotic machinery and could be used to distinguish different cancers. Here, we discuss NDRG1, as well as previous findings showing its relationship to cancer. NDRG1 (N-myc downstream regulated gene 1) is a phosphorylated protein [32] that could be activated by the tumor suppressor gene p53 and required for the induction of p53-mediated apoptosis in the colon cancer cell line [33]. Because the NDRG1 protein has a crucial role in inhibiting primary tumor growth, it is well-known as a metastasis suppressor in a number of cancers including colon, prostate and breast cancers [34].
Replicative immortality is an important hallmark of cancer, which is commonly recognized as deregulated cell proliferation. Our findings on several important cell cycle-related genes in selected proteins not only illustrate their importance to the development of cancer, but are also first used as indicators of cancer classification. These cell cycle-related genes are discussed below: Cyclin B1 has a role in the regulation of cell cycle: before entering mitosis, cells flip between G2 and mitosis until there is sufficient accumulation of cyclin B to support CDK1 activity [35]. Misexpressed cyclin B1 in the nucleus has been found in a huge proportion of cells of some neoplasms, and cyclin B1 has been regarded as a potent prognostic factor in human breast carcinoma and squamous cell carcinoma [36]. Cyclin E1, encoded by CCNE1, is one of the members of the cyclin family, which controls cell cycle processes by dramatic periodicity of abundance. Recently, a genome-wide association study found that rs8102137 within the CCNE1 gene is associated with bladder cancer [37]. Meta-analysis also indicates that there is over-expression of this protein with breast cancer [38]. Chk2 (checkpoint kinase 2), as a serine/ threonine protein kinase, could respond to DNA damage in order to maintain genomic integrity [39]. It has been shown that Chk2 plays an important role in the proliferation of cancer cells [40], attracting much attention to make it a possible anti-cancer drug design target [41].
It is clear that invasion is a hallmark of cancer, even if its underlying mechanisms are still an enigma. Until now, the gain and loss of cell-cell attachment proteins are the main reasons of invasion, especially the loss of E-cadherin [31]. In our results, E-cadherin and some polarity-related proteins are found that could be used to distinguish different cancer types. These proteins are discussed below: E-Cadherin, as the type-1 classical cadherin, mediates cell interactions. Tumor progression is often linked with the loss of E-cadherin function, leading to a more motile and invasive phenotype [42]. PREX1 (phosphatidylinositol-3,4,5-trisphosphate-dependent Rac exchange factor) is highly expressed in prostate cancer, indicating a relationship between the cell invasion and its expression [30]. In melanomas, PREX1 over-expression was connected to the activation of ERK-MAPK signaling and required for efficient melanoblast metastasis as well as for migration [43]. Claudin-7, a common transmembrane protein, plays a vital role in the formation and maintenance of the permeability in polarized epithelial cells [44]. The aberrant Claudin-7 expression profile has been found in various tumors, such as highly induced Claudin-7 expression in both primary and metastatic breast tumors, [26] yet it is down-regulated in head and neck carcinomas [27]. These previous studies further supported our findings that Claudin-7 could be used as a biomarker for the differentiation and classification of various tumors. Rab-25, as a member of the Rab family of GTPases, Rab-25 is a constitutively active Rab GTPase that plays a crucial role in apical recycling and transcytosis pathways in polarized epithelial cells. Because loss of cell polarity is an essential hallmark of cancer, Rab-25 related trafficking has an important impact on epithelial cell polarity program in cancer progression [45].
Anomalous cancer cell energy metabolism was first observed by Otto Warbugy in 1930 and has been accepted as a hallmark of cancer. Abnormal fatty-acid synthesis as one type of energy metabolism is found in many cancer cells [46]. Here, several important fatty acid and glycolytic metabolism-related genes are found in the selected 23 proteins: FASN is a key enzyme which is required for de novo synthesis of fatty acid. It has been found that the FASN expression and activity are abnormally elevated in many types of human cancers, which may contribute to cellular resistance to drug-and radiation-induced apoptosis [46]. ACC1 is a rate-limiting enzyme in de novo fatty acids synthesis. It seems to be the limiting enzyme in proliferating cancer cells. ACC1 has been found to be up-regulated in proliferating cancer cell lines such as prostate, breast and liver. Indeed, it has been shown that knock-down of ACC1 by siRNA promotes apoptosis in prostate cancer and breast tumor cells but not in control noncancerous cells, underlining cancer cells' higher reliance on this enzyme than normal tissue [47]. AMPK (AMPactivated protein kinase, encoded by the gene PRKAA1/2) plays a crucial role in sensing available energy and coordinating external growth signals with cellular metabolism [48]. A decrease of AMPK signaling, mostly caused by the loss of function gene STK11, could lead to increased activation of mTOR and a shift toward glycolytic metabolism, which is found in a variety of cancers, including NSCLC [49] and cervical cancer [50].
Abnormal expression of hormone receptors are often shown in sex-related cancers, such as breast cancer and prostate cancer. Three hormone receptors are also reported in the selected proteins: Progestin receptor (PR), as a nuclear steroid receptor, has a high specificity for binding progesterone [51]. It has been shown in literature that PR inhibits the transition from G1 to S in the cell cycle and promote apoptosis in endometrial cancer cells [52]. In the GOG119 phase II trial, an estrogen surrogate named tamoxifen could enhance progestin activity in order to induce PR and cure endometrial patients [53]. Estrogen receptor (ER, activated by the hormone estrogen) is one of the most important therapeutic targets in breast cancers, given that the correlation between ER expression and cellular response to estrogen [54]. It has been reported that gene amplification of ESR1 frequently occur with breast cancer [29]. Androgen receptor (AR; NR3C4) is believed to solely mediate all the biological actions of endogenous, functioning mainly in regulating male development. Due to the strong connection between ARs and prostate cancer, androgen antagonists or androgen deprivation therapy has been applied to impede cancer cell proliferation of patients with androgen-dependent prostate cancer in clinical treatment [55].
Surprisingly, among these 23 selected proteins that are used to distinguish different cancers, α-tubulin and GAPDH are often used as controls in western blot analysis. In the following part, we will discuss known findings about α-tubulin and GAPDH that lend credence to the validity of our findings for their importance to distinguish cancers. For example, both αand βtubulin proteins are responsible for assembling microtubules (MTs, cytoskeletal polymeric structures), and certain posttranslational modifications. The acetylation of α-tubulin (Lys-40) [56] could alter dynamic behavior of MTs, which may lead to changes in biological functions that MTs perform during cell division, migration, and intracellular trafficking. Taking the dynamic parameters into account, MTs provide an attractive target for chemotherapy against rapidly growing tumor cells such as in lymphoma and leukemia, metastatic cancers, and slow growing tumors of the breast, ovary, and lung [57,58]. Over the last decade, GAPDH (glyceraldehyde-3-phosphate dehydrogenase) was considered a housekeeping gene and was as a control for equal loading during the experimental process. However, it has been shown that GAPDH expression varies different types of tissues. Moreover, GAPDH expression varies due to oxygen tension [59], and the expression levels of GAPDH vary in fallopian tube cancers and ovarian cancers [60]. On the basis of GAPDH's predilection for AU-rich elements, it has been shown that GAPDH can bind to the CSF-1 3'UTR that stabilize the mRNA [60]. To summarize, combining all the evidence, tubulin proteins and GAPDH may bring a new perspective on cancer studies, and it is suggested that they are not used as controls in western blot analysis of different types of cancer.
Other selected proteins include phosphatases, transcriptional activators, linker proteins and transferrin receptors: GATA3 is a transcriptional activator with high expression levels [61] and the third most frequently mutated gene in breast cancer [62]. Thus, GATA3 has proved to be a useful immunohistochemical marker to predict tumor recurrence early in the progression of breast cancer. PEA15, as a multifunctional linker protein predominantly expressed in the cells of the nervous system, such as astrocytes [63], controls a variety of cellular processes, such as cell survival, proliferation, migration and adhesion [64]. PEA15 functions in various cancers, concluding glioblastoma, astrocytoma, and mammary, as well as skin cancers. PEA15 can have both anti-(in ovarian carcinoma [65]) and pro-(glioblastoma [66]) tumorigenic functions, depending on its interactions. TFRC is a transferrin receptor. It is a major iron importer in most mammalian cells. It has been shown that TFRC proteins increase in breast, malignant pancreatic cancer, and other cancers [67,68]. PKCα is encoded by PRKCA gene and is a serine-and threonine-specific kinase. This gene is highly expressed in multiple cancers, and the high activation of PKCα has been identified to promote the genesis of breast cancer [69]. The high abundance in serum makes this protein to be a good diagnostic biomarker of lung cancer [70] and gastric carcinoma [71]. TIGAR is a fructose-2-6-bisphosphatase that promotes the production of antioxidant (NADPH) and nucleotide synthesis material (ribose-5-phosphate) and seems to be important for tissue renewal and intestinal tumorigenesis. Up-regulated expression of TIGAR in human colon tumors along with other evidence suggest its importance in the development of cancer and metabolism regulation and may be used as a therapeutic target in diseases such as intestinal cancer [28]. CD20 (Membrane-Spanning 4-Domains, Subfamily A, Member 1, MS4A1) encodes a surface molecule B-lymphocyte during the differentiation of Bcells into plasma cells. Currently, a CD20 monoclonal antibody has been utilized in the treatment of cancer, even though its dosage is still under discussion [72]. GAB2 (GRB2-associatedbinding protein 2) is a docking protein, which mainly interacts with signaling molecules. Research has shown that the oncogenesis of many cancers including gastric, colon, ovarian and breast cancer is related to GAB2 [73,74]. For example, GAB2 can amplify the signal of receptor tyrosine kinases (RTKs), which plays roles in breast cancer development and progression [75].
As shown above, all of the top 23 proteins are closely related to certain types of cancers. Researchers have focused on common features of different cancer types for decades [31]. Admittedly, in theory, the hallmarks of cancer would help us develop drugs to treat all types of cancers as a whole. However, this "one size fits all" cancer treatment has disappointed us due to its treatment-related toxicity and inefficiency. Despite the fact that personalized treatments have been proposed, the theory still stays at a conceptual phase. Thus, having a better understanding of the potential values and the applied ranges of cancer drugs based on different biomarkers may be a more realistic way to treat different types of cancers.

Potential values of our findings
Previous experimental studies in the literature could consolidate our results showing that the selected 23 proteins could be used as biomarkers for certain cancers. They also can explain partially why the combination of these proteins could be used to accurately classify different cancer types. However, to our knowledge, reasons behind the varying expression patterns in different types of cancers have not been found. At least, by using our computational method, one could gain a better understanding of the similarities and differences among different cancers. This could help us identify proteins that could promote the development of cancers and proteins that might not be indispensable for cancer development. Further studies should be performed to determine whether the differential expression patterns of proteins in various cancers are influenced by their original tissues. Those proteins specifically expressed in certain types of cancers could be considered as potential specific cancer targets, which could be used to improve the target efficiency. Therefore, our results may help drug designers obtain a better understanding of the potential targets of drugs by shedding some light on the cancer type-specific biomarker discoveries.
Supporting Information S1 File. The dataset used in this study. There were 3467 cancer patient samples in 10 cancer types, with 187 proteins for each sample. The 3467 samples were randomly divided into 2775 training samples and 692 independent test samples. The first column is the sample ID, the second column is the cancer types whose description can be found in Table 1. The third to the 189th columns were proteins. (XLSX) S2 File. The mRMR table. All the 187 protein features were ranked from the most important to the least by using the mRMR method on training set. The top 23 proteins were regarded as composing the optimal feature set because by using the 23 protein features, the MCC on the training set evaluated by 10-fold cross validation reached 0.904 which was the first reach above 0.900, and with more protein features, the MCC did not increase much.