Proteome Profiling Outperforms Transcriptome Profiling for Coexpression Based Gene Function Prediction*

Coexpression of mRNAs under multiple conditions is commonly used to infer cofunctionality of their gene products despite well-known limitations of this “guilt-by-association” (GBA) approach. Recent advancements in mass spectrometry-based proteomic technologies have enabled global expression profiling at the protein level; however, whether proteome profiling data can outperform transcriptome profiling data for coexpression based gene function prediction has not been systematically investigated. Here, we address this question by constructing and analyzing mRNA and protein coexpression networks for three cancer types with matched mRNA and protein profiling data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Our analyses revealed a marked difference in wiring between the mRNA and protein coexpression networks. Whereas protein coexpression was driven primarily by functional similarity between coexpressed genes, mRNA coexpression was driven by both cofunction and chromosomal colocalization of the genes. Functionally coherent mRNA modules were more likely to have their edges preserved in corresponding protein networks than functionally incoherent mRNA modules. Proteomic data strengthened the link between gene expression and function for at least 75% of Gene Ontology (GO) biological processes and 90% of KEGG pathways. A web application Gene2Net (http://cptac.gene2net.org) developed based on the three protein coexpression networks revealed novel gene-function relationships, such as linking ERBB2 (HER2) to lipid biosynthetic process in breast cancer, identifying PLG as a new gene involved in complement activation, and identifying AEBP1 as a new epithelial-mesenchymal transition (EMT) marker. Our results demonstrate that proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. Proteomics should be integrated if not preferred in gene function and human disease studies.

ristic is the basis for analyzing mRNA profiling data using gene clustering (2), coexpression network analysis (3)(4)(5), and pathway and gene set enrichment analysis (6 -8). However, genes with similar mRNA expression profiles are not necessarily functionally coupled due to reasons such as transcriptional leakage and nonspecific occurrence of cis-regulatory elements in the genome (9 -11). Distinguishing accidental transcriptional covariation from those that are functionally important is a well-known challenge, and strategies such as metaanalysis (12) and evolutionary constraint (10,13) have been developed to address this challenge.
Recent advancements in mass spectrometry-based proteomic technologies have enabled global expression profiling at the protein level, and the concordance between mRNA and protein profiling data has been extensively studied during the past decade (14,15). Although a few publications suggest that gene expression is mostly controlled at the mRNA level (16 -18), many studies have reported a considerable discrepancy between mRNA and protein profiles in human and other model organisms (15, 19 -22). It is not completely clear how much of the reported mRNA-protein discrepancy is due to technological issues and how much is due to underlying biology. Importantly, whether proteome profiling data can outperform transcriptome profiling data for coexpression based gene function prediction is largely unknown.
The deep proteome profiling data sets recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) on the breast (23), colorectal (22), and ovarian (24) tumors that had been transcriptomically profiled by The Cancer Genome Atlas (TCGA) (25)(26)(27) provided a new opportunity to address this question. We constructed gene coexpression networks based on mRNA and protein profiling data sets, respectively, for each of the three cancer types. Comprehensive comparisons between the mRNA and protein coexpression networks constructed for the same cancer type allowed us to systematically investigate the relative utility of mRNA and protein profiling data in predicting gene cofunctionality.

MATERIALS AND METHODS
Protein and mRNA Profiling Data-Breast Cancer-The gene-level proteomics data for breast cancer was downloaded from Mertins et al. (23). An isobaric peptide labeling approach (iTRAQ) was employed to quantify protein levels. Protein quantification was based on iTRAQ reporter ion ratios to the internal standard. Data normalization was performed using a 2-component Gaussian mixture model-based normalization algorithm. The data set contained 9988 genes and 77 samples. Only the 6281 genes without any missing values across all samples were included in this study. The gene-level RNA-Seq data was downloaded from the Firehose website (http://gdac.broadinstitute.org), which was from the Illumina HiSeq 2000 RNA Sequencing Version 2 analysis and was normalized by the RSEM algorithm (28). The RNA-Seq data set included 20501 genes and 1058 samples. The two data sets had 5988 overlapping genes and 77 overlapping samples. Only overlapping samples and genes were included in this study, and this was also true for the other two cancer types.
Colorectal Cancer-The gene-level proteomics data for colorectal cancer was downloaded from Zhang et al. (22). Label-free shotgun proteomics was used to quantify protein levels. Protein quantification was based on spectral counts, which were quantile normalized followed by log-transformation. The data set contained 3899 genes and 90 samples. The gene level RNASeq data normalized by the RSEM algorithm was downloaded from the Firehose website (http:// gdac.broadinstitute.org), which contained 20501 genes and 264 samples. There were 3764 overlapping genes and 87 overlapping samples between the two data sets.
Ovarian Cancer-The gene-level proteomics data for ovarian cancer was downloaded from Zhang et al. (24). Similar to the breast cancer data set, protein quantification was based on iTRAQ reporter ion ratios to the internal standard. Data normalization was performed using a global median centering algorithm. The data set contained 4186 genes across all 174 samples. Only the 3327 genes with low technical variance and without any missing values across all samples were included in this study. The gene-level microarray data was downloaded from the Firehose website (http://gdac.broadinstitute. org), which was from the Agilent 244K platform and was normalized by the lowess normalization method (29). The microarray data set contained 17814 genes and 541 samples. The two data sets had 2988 overlapping genes and 174 overlapping samples.
Identification of Functionally Similar and Dissimilar Gene Pairs-Gene Ontology (GO) based semantic similarity was computed for all gene pairs to identify functionally similar and dissimilar gene pairs.
GO-The GO vocabulary and annotation data were downloaded from the GO website (www.geneontology.org) in November 2014. To ensure high quality of the annotations, we excluded those denoted as IEA (Inferred from Electronic Annotation) or ND (No biological Data available) (30).
Calculation of Similarity Scores for Pairs of GO Terms-The Resnik similarity score (31) was computed to measure the similarity between each pair of GO terms. Specifically, the information content of a term c was defined as IC(c) ϭ -log(p(c)), where p(c) is the number of genes annotated to the term and its descendants divided by the number of all genes annotated to the corresponding root term (i.e. biological process, cellular component, or molecular function). Let P(m,n) represents the set of common ancestor terms of terms m and n, the Resnik similarity score between m and n was calculated as: Sim resnik ͑m, n͒ ϭ max c⑀P͑m, n͒ ͓IC͑c͔͒ Calculation of Similarity Scores for Pairs of Genes-The similarity scores for term pairs describing two genes were combined to calculate the semantic similarity score of the two genes based on the best-match average (BMA) approach (32). Let A 1 and A 2 be the sets of annotation terms for genes G 1 and G 2 , respectively, and #G 1 and #G 2 be the numbers of terms included in A 1 and A 2 , respectively. The BMA score for the two genes was defined as where S͑G 1 , G 2 ͒ ϭ ⌺ m⑀A1 max n⑀A2 ͑sim͑m, n͒͒ and S(G 2, G 1 ) ϭ ⌺ m⑀A2 max n⑀A1 ͑sim͑m, n͒͒. All gene pairs were ranked from the highest BMA score to the lowest BMA score. The top 1%, 5%, 10%, 15%, 20% and 25% gene pairs in the ranked list were selected as candidate gold standard sets of functionally similar gene pairs whereas the bottom 1%, 5%, 10%, 15%, 20% and 25% were selected as candidate gold standard sets of functionally dissimilar gene pairs.
Comparison of Different Methods for Coexpression Network Construction Using mRNA Profiling Data Sets-To select a superior methodology for coexpression network construction, we compared three methods that are widely used for mRNA coexpression network construction, including: the value-based method (33), the K-nearest neighbor method (34), and the ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) method (35). The mRNA profiling data sets for the three cancer types after overlapping with corresponding proteomics data sets as described above were used for method comparison. We only included positive correlations in our analysis because we found that a strong negative correlation between two genes did not necessarily correspond to high functional similarity between the genes (supplemental Fig. S1), which is consistent with previous reports (36,37).
Method Description-In the value-based method (33), spearman's correlation coefficients for all pairs of genes in a data set are calculated, then a correlation threshold T is selected, and gene pairs with correlation coefficients higher than T are connected to construct a coexpression network. In the K-nearest neighbor method (34), after pair-wise spearman's correlation coefficient calculation, for each gene, all other genes are ranked based on their correlation coefficients with the gene, and then a coexpression network is constructed by connecting the K mutual nearest neighbors. The ARACNE method (5) calculates the mutual information (MI) for all gene pairs and then estimates the significant levels for MIs. After filtering out gene pairs with insignificant MIs, the method examines each gene triplet among the significant gene pairs and removes one edge based on the following criterion: If the tolerance is 0, the gene pair with the smallest MI will be removed from the gene triplet. If the tolerance is 1, all gene pairs in the gene triplet will be kept. Following Margolin et al. (35), the tolerance value was set between 0 and 0.15 to provide a reasonable tradeoff between sensitivity and specificity.
Construction of a Consensus Coexpression Network-To increase robustness against errors in data, a bootstrapping procedure (5) was included in all network construction methods to generate consensus coexpression networks. Specifically, tumor samples were randomly sampled from the original data set with replacement and assembled into a new bootstrapped data set containing the same number of samples as the original data set. For each data set, we repeated this process 100 times and generated 100 bootstrapped data sets. Next, each of the three network construction methods was used to generate 100 bootstrapped coexpression networks based on 100 bootstrapped data sets. The edges from the 100 bootstrapped coexpression networks were then combined to calculate a support score for each gene pair based on the following formula: Then, the statistical significance of each gene pair was calculated based on the following formulas suggested in (5) where P(S͉N) and P(D͉N) denote the frequencies of functionally similar (S) and dissimilar (D) gene pairs, respectively, in the coexpression network (N), whereas P(S) and P(D) denote all functionally similar and dissimilar gene pairs, respectively, in our gold standard sets. Selection of Parameters for the Three Methods-To construct a coexpression network from a gene expression matrix, we need to set parameters T, K, and for the value-based method, K-nearest neighbor method and ARACNE method, respectively. We tested different values of the parameters T (from 0.40 to 0.80, step by 0.05), K (from 0.1%ϫD to 1%ϫD, step by 0.1%ϫD, D is the number of genes in the data set) and (0, 5%, 10%, and 15%). A more stringent parameter can usually lead to higher LR of the constructed network (supplemental Fig. S2 and supplemental Table S1). Although higher LR indicates higher functional relevance of the constructed network and is thus preferred, stringent parameters also lead to more isolated nodes and reduced network coverage (supplemental Fig. S3). To balance the tradeoff between functional relevance and coverage, a series of parameters were tested for each method and the most stringent parameters that produced no more than 10% isolated node were selected.
Selection of the Threshold for Identifying Functionally Similar and Dissimilar Gene Pairs-In the section "identification of functionally similar and dissimilar gene pairs", we selected the top and bottom 1%, 5%, 10%, 15%, 20%, and 25% of all ranked gene pairs as the candidate functionally similar and dissimilar gene pairs. Although a more stringent threshold (e.g. top and bottom 1%) could lead to higher discriminant power of the gold-standard data sets, the robustness of LR calculation could be compromised when the number of gene pairs in the gold standard data sets is small. To find a stringent threshold that can produce robust result for each of the cancer type, we calculated the LRs of the 100 bootstrapped networks for each of the cancer type and then computed the coefficient of variations (CVs) of the LRs.
Comparison of the Three Network Construction Methods-For each of the cancer type, consensus mRNA coexpression networks were generated using the three methods with the selected parameters. Based on the functionally similar and dissimilar gene pairs from GO, we calculated the LRs of the consensus networks and selected the method with the highest LR to construct coexpression networks.
Construction of Protein Coexpression Networks-For the proteomics data set from each of the three cancer types, we first generated 100 bootstrapped data sets using the same sets of samples as those in the 100 bootstrapped mRNA data sets (see the section "Construction of a consensus coexpression network"). Coexpression networks for each of the 100 bootstrapped data sets were constructed using the method and the parameter selected based on the mRNA data sets. The protein coexpression consensus network was then constructed based on the method described in the section Construction of a Consensus Coexpression Network.
Edge Level Comparison Between mRNA and Protein Coexpression Networks-Based on the functionally similar and dissimilar gene pairs defined above, we calculated the LRs of mRNA and protein coexpression networks. For each coexpression network, we generated 1000 random networks with the same number of nodes and edges and calculated LRs of these random networks as negative controls.
For each coexpression network, we also built a protein-protein interaction network for genes in the coexpression network based on curated protein-protein interactions from the iRef database (38) to serve as a positive control or benchmark.
To quantify the edge-level similarity between the mRNA and protein networks, we calculated the Dice coefficient scores (39).
To test the effect of sample size on LRs of the constructed networks, we performed down-sampling experiments. For each of the three cancer types, we randomly selected n samples from mRNA or protein data (n was from 10 to 70, with a step increment of 5). Then, we generated the consensus coexpression networks based on the selected samples and calculated LRs based on the functionally similar and dissimilar gene pairs. This process was repeated 100 times.
Module Level Comparison Between mRNA and Protein Coexpression Networks-We used the NetSAM package (40) (http:// bioconductor.org/packages/release/bioc/html/NetSAM.html) to identify hierarchical modules from the mRNA and protein coexpression networks. Then, we performed the following analyses to compare mRNA and protein modules.
Comparison of the Functional Coherence Between mRNA and Protein Modules-We first calculated the p values of enrichment for all GO biological process terms for each module using the hypergeometric test (41) and then adjusted the p values based on the Benjamini and Hochberg method (42). We selected the smallest adjusted p value as the measurement of the functional coherence of the module. Finally, we grouped the modules from each of mRNA and protein networks in each cancer type into three groups: a significant group (adjusted p value Յ 0.01), a marginally significant group (0.01 Ͻadjusted p value Յ 0.15), and an insignificant group (adjusted p value Ͼ 0.15).
Evaluation of the Conservation Level of the mRNA Modules in Corresponding Protein Networks-We first counted the overlapping edges between mRNA and protein networks in each cancer type. Then, we calculated the statistical enrichment of overlapping edges in each mRNA module based on the following hypergeometric test: where M, N, m and k represent the number of edges in the mRNA network, the number of all overlapping edges, the number of edges in a module, and the number of overlapping edges in the module, respectively. Third, the p values were adjusted based on the Benjamini and Hochberg method (42). Finally, the conservation level of a mRNA module in corresponding protein network was measured by -log 10 (adjusted p value).

Comparison of the Cytogenetic Band Coherence Between mRNA and Protein
Modules-The cytogenetic band information was downloaded from BioMart website (http://www.ensembl.org/biomart/, Ensemble Genes 82 and Homo sapiens genes GRCh28.p3). The analyses were performed the same as described for the functional coherent analysis, replacing GO biological process terms with cytogenetic bands.
Visualizing the Impact of Chromosome Colocalization on mRNA and Protein Coexpression-We first calculated the spearman's correlation between each pair of mRNAs or proteins. Then, we ordered the mRNAs or proteins based on their chromosome location and visualized the pair-wise correlation scores in a heat map.
Gene Function Prediction Based on the mRNA and Protein Coexpression Networks-We compared mRNA and protein coexpression networks for their gene function prediction potential for a wide variety of GO biological processes and KEGG pathways. The KEGG pathway data set was downloaded using REST-style KEGG API (http://rest. kegg.jp/link/hsa/pathway). Network-based gene function prediction was performed using the well-established random walk-based network propagation algorithm (43), and prediction performance was evaluated using 5-fold cross validation and quantified on the basis of the area under the receiver operating characteristic curve (AUROC). Because the random walk algorithm can only be applied to connected networks, our analyses were based on the maximum component for each of the six coexpression networks instead of the full networks. Thus, networks in this section refer to the maximum component of the full networks.
Gold-standard Positive and Negative Gene Sets-For a selected network and a selected GO biological process or KEGG pathway term, genes annotated to the term and also included in the network were defined as a positive gene set and other genes in the network constituted the negative gene set for the GO or KEGG term. Only positive gene sets with at least 20 genes and no more than 10% of the total number of genes in the network were included in our study. For each cancer type, only GO and KEGG terms selected by both the mRNA and protein networks were included in the comparative analysis. According, we had 1673, 1236, and 997 GO terms and 125, 84, and 60 KEGG terms as gold-standard for breast, colorectal, and ovarian cancer, respectively (see supplemental Table S2).
Random Walk Analysis-The random walk analysis exploits the global structure of a network by simulating the behavior of a random walker on a network. Given a network with n nodes and a set of k "seed" gene (k Ͼ 0), we used the random walk with restart (RWR) technique (43) to calculate a priority score for each gene in the network based on the steady state probability of the random walker staying at the gene, which is formally defined as the following equation: where the initial vector p 0 of size n was constructed such that an equal probability of 1/k was assigned to the k seed genes, while a probability of 0 was given to all other n-k genes in the network, r is the restart probability (we set r as 0.5 in this paper), W is the columnnormalized adjacency matrix of the network, and p t is a vector of size n where the i-th element holds the probability of being at gene i at time step t.
The final score of a gene in the network was defined by iterating the above equation until ⌺ iϭ1 n ͉p i tϩ1 Ϫ p i t ͉ fell below the predefined threshold of 1 ϫ 10 Ϫ6 , as previously described (43,44). A higher score of a gene represents a closer relationship between the gene and the seed genes.
AUROC Calculation-Prediction performance for each selected GO and KEGG term was evaluated using 5-fold cross validation. We first randomly assigned genes in the gold standard positive set into five equal sized subgroups. We kept one subgroup as the testing group, and then combined genes from the other four subgroups as seed genes and calculated the priority scores of all non-seed genes in the network based on the RWR analysis. We ranked all non-seed genes from the highest score to the lowest score and calculated the rank ratio of each gene by dividing its rank by the number of all non-seed genes. Based on the rank ratios of genes in the testing group, we used the R package pROC (https://cran.r-project.org/web/ packages/pROC/index.html) to calculate an AUROC score. The above analysis was repeated by using each of the other subgroups as the testing group, and then the mean AUROC was calculated across the 5 folds. A mean AUROC of 1 indicates perfect prediction performance of the network for the GO or KEGG term, whereas a mean AUROC of 0.5 suggest that genes annotated to the GO or KEGG term are randomly distributed in the network.
Gene2Net Analysis-We developed Gene2Net, a web-based application that allows users to expand one or multiple genes into a small network and to perform GBA analysis based on the three protein coexpression networks generated from this study. The tool can be accessed from the CPTAC portal in http://cptac.gene2net.org. Briefly, network expansion is based on the random walk analysis described above. GO biological process enrichment analysis for the resulted networks is performed based on hypergeometric test. Gene2Net visualizes expanded networks as interactive node-link diagrams and corresponding enriched GO biological processes in interactive directed acyclic graphs. Using clickable Venn diagrams and sortable heat maps, users can also compare results generated from the three cancer types. All results in Gene2Net can also be downloaded for further analyses. A detailed user manual and a video tutorial are available from the Gene2Net website. For the examples presented in this article, we set the network construction method as Network_Expansion and the number of top ranking neighbors as 10. Table I summarizes the mRNA and protein profiling data sets used in this study. Gene-level, normalized mRNA profiling data for all three cancer types were downloaded from the TCGA Firehose website (http://gdac.broadinstitute.org). Data for breast and colorectal tumors were generated by RNA-Seq whereas those for ovarian tumors were generated by microarray. Microarray data were used for ovarian cancer because RNA-Seq data were not available for most of the samples analyzed by proteomics. Gene-level, normalized proteomic data for the three cancer types were described in three recent publications (22)(23)(24). The three proteomic data sets were generated by different institutes using different proteomic platforms (iTRAQ for breast and ovarian cancers and labelfree for colorectal cancer) and normalized by different algorithms carefully selected and justified by individual publications. Including three methodologically diverse datasets in this study increases the generalizability of our results. For each cancer type, only overlapping samples and overlapping genes with both mRNA and protein abundance data were included in our analysis. Accordingly, breast, colorectal, and ovarian cancer analyses were based on 77 samples and 5988 genes, 87 samples and 3764 genes, and 174 samples and 2988 genes, respectively.

Summary of the mRNA and Protein Profiling Data Sets-
Comparison of Methods for Coexpression Network Construction-Using the mRNA profiling data sets from the three cancer types, we compared three methods that are widely used for mRNA coexpression network construction, including: the value-based method (33), the K-nearest neighbor method (34), and the ARACNE method (35). For each method, the most stringent parameter that produced no more than 10% isolated node in the constructed network (supplemental Fig. S3) was used for this study. The bootstrapping procedure (5) was included in all methods to build consensus networks that are more robust to sampling variability and errors in the data.
To evaluate the functional relevance of the networks constructed by different methods, we derived "gold-standard" by calculating pair-wise semantic similarity for all genes based on GO biological process, molecular function, and cellular component annotations, respectively. For each type of GO annotation, the top 5% of gene pairs with the highest similarity scores were designated as a positive set of functionally similar gene pairs whereas the bottom 5% were designated as a negative set of functionally dissimilar gene pairs. The 5% threshold was selected to maximize the discriminant power of the gold-standard sets while ensuring the robustness of the evaluation results (supplemental Fig. S4).
To compare the functional relevance of the networks constructed by the three methods, we computed the LRs of neighboring genes being functionally similar versus dissimilar using the gold standard sets. As shown in supplemental Fig.  S5, the K-nearest neighbor method resulted in the highest LRs for all cancer types and all GO ontologies, except for the combination of ovarian cancer and biological process, where the value-based method had the highest LR. Accordingly, we selected the K-nearest neighbor method for coexpression network construction in this study and the optimal K was 12, 11, and 9 for breast, colorectal, and ovarian cancer, respectively (supplemental Fig. S3).
mRNA and Protein Coexpression Networks are Wired Very Differently-Based on the K-nearest neighbor method and the optimal Ks, we construct both mRNA and protein coexpression networks for the three cancer types (supplemental Tables  S3-S9). For breast cancer (5988 genes), the mRNA coexpression network had 22,940 edges whereas the protein coexpression network had 19,925 edges. These numbers were 10,713 and 13,110 for colorectal cancer (3764 genes), and 7815 and 6091 for ovarian cancer (2988 genes).
As shown in Fig. 1A, there was very limited overlap in edges between mRNA and protein coexpression networks for all three cancer types (Dice coefficient Ͻ 0.15, pϷ1 for overlapping significance based on the Fisher's exact test). The overlap remained low (Dice coefficient Ͻ 0.25) even when K was increased to 599, 376, and 299 for breast, colorectal, and ovarian cancers, respectively (i.e. up to 10% of all genes in a network were considered as direct neighbors for each gene, supplemental Fig. S6). These results indicate that mRNA and protein coexpression networks are wired very differently. Protein Networks are More Closely Aligned with Function-Both mRNA (blue bars) and protein (red bars) coexpression networks were more likely to connect functionally similar gene pairs than dissimilar gene pairs, with LRs ranging from 3 to 30 (Fig. 1B). These values were significantly higher than those calculated for corresponding random networks with the same numbers of nodes and edges (light blue and light red bars, p Ͻ 2.2e-16). Deviation of the LRs of the random networks from 1 (log likelihood ratio ϭ 0) can be explained by the enrichment of genes quantified by both technologies in certain GO terms.
Protein networks showed higher (1.4-fold to 4.8-fold) LRs than corresponding mRNA networks (Fig. 1B). As depicted in Fig. 1C, for breast and colorectal cancers, common edges shared by both mRNA and protein coexpression networks showed the highest LRs (magenta bars), followed closely by edges specific to protein coexpression networks (red bars). For ovarian cancer, edges specific to protein coexpression networks outscored the common edges. For all three cancer types, edges specific to mRNA coexpression networks (blue bars) showed the lowest LRs. Notably, the LRs of the protein coexpression networks approached those of the benchmark networks constructed based on curated protein-protein interactions from the iRef database (38) (green bars, Fig. 1B). These data quantitatively demonstrate that protein coexpression networks are highly functionally relevant, and they are more closely aligned with function than mRNA coexpression networks.
To evaluate the sample size effect on the functional relevance of the constructed mRNA and protein coexpression FIG. 1. Edge level comparison between mRNA and protein coexpression networks of the three cancer types. A, Edge overlap between mRNA coexpression network (blue) and protein coexpression network (red). B, The likelihood ratios (LRs) calculated for individual networks with gold-standard reference data sets derived from GO biological process (BP), cellular component (CC) and molecular function (MF) annotations, respectively. Blue, light blue, red, light red and green bars represent mRNA coexpression network, mRNA random network, protein coexpression network, protein random network, and protein-protein interaction (PPI) network, respectively. C, The LRs of mRNA specific edges (blue), protein specific edges (red), and common edges (magenta). networks, we further performed down-sampling analysis. As shown in Fig. 2, protein networks had higher LRs than corresponding mRNA networks for all the sample sizes tested, ranging from 10 to 70. Both mRNA and protein coexpression networks had increased LRs with larger sample sizes, but a bigger sample size-dependent LR increase was observed among protein coexpression networks. For all three cancer types, the largest LR increase was found when the sample size went from 10 to 20. Therefore, we recommend a minimal sample size of 20 for protein coexpression network construction.
Protein Network Modules are More Functionally Homogeneous-We next studied the functional homogeneity of network modules in the coexpression networks. Biological networks usually have a hierarchical modular organization (45) and network modules with tightly connected components are considered as functional blocks of the cell (46). We used the NetSAM algorithm (40) to reveal hierarchical modular architectures of the coexpression networks and then performed enrichment analysis to evaluate the functional coherence of the identified modules (supplemental Tables S10 -S15). Among the 122, 61, and 55 mRNA modules identified for breast, colorectal, and ovarian cancer, 16%, 26%, and 15% were enriched in at least one GO biological process (multiple-test adjusted p Ͻ 0.01, hypergeometric test), respectively. In contrast, among the 88, 62 and 36 protein modules for the three cancer types, significantly larger proportions (73%, 63%, and 67%; p Ͻ 2.2e-16, p ϭ 5.8e-05, and p ϭ 4.7e-07, respectively, Fisher's exact test) showed significant GO enrichment (Fig. 3A). To examine the conservation level of the mRNA modules in corresponding protein networks, we performed the hypergeometric test for each mRNA module to quantify the statistical enrichment of its edges in the corresponding protein network. As depicted in Fig. 3B, functionally coherent mRNA modules (i.e. red modules with significant GO enrichment) were more likely to have their edges preserved in corresponding protein networks than functionally incoherent mRNA modules (i.e. blue modules with insignificant GO enrichment), with p values equal 2.8e-08, 1.6e-04, and 0.06 for breast, colorectal, and ovarian cancer, respectively (onesided Kolmogorov-Smirnov test).
To further explore other possible biological underpinnings of the coexpression modules, we performed cytogenetic band enrichment analysis for both mRNA and protein modules (Fig.  4A). Interestingly, 76%, 59 and 87% of the mRNA modules in breast, colorectal, and ovarian cancers were enriched in at least one cytogenetic band (multiple-test adjusted p Ͻ 0.01, hypergeometric test), and the ratios were significantly higher in functionally incoherent mRNA modules (89%, 72%, and 98%) than in functionally coherent mRNA modules (35%, 31%, and 25%, p Ͻ 0.05, Fisher's exact test). For all three cancer types, significantly smaller proportions (15%, 14%, and 50%) of the protein modules showed significant cytogenetic band enrichment compared with the mRNA modules (p Ͻ 2.2e-16, p ϭ 2.8e-07, and p ϭ 2.1e-04, respectively, Fisher's exact test). In breast cancer and colorectal cancer, the proportions of the cytogenetic band enriched modules were similar between functional coherent and incoherent protein modules (p ϭ 0.56 and 0.18, Fisher's exact test), whereas in ovarian cancer, this proportion was significantly lower in functionally coherent modules than in functionally incoherent modules (29% versus 89%, p ϭ 0.01, Fisher's exact test). The impact of chromosomal colocalization on mRNA coexpression is obvious for all cancer types when visualizing gene coexpression along human chromosomes, as indicated by the striking red diagonal lines, but such impact was much weaker at the protein level, and the difference was highly statistically significant for most of the chromosomes (supplemental Fig. S7-S9). Taken together, our results demonstrate that protein coexpression modules are primarily driven by functional homogeneity of the genes, whereas chromosomal colocalization plays a significant role in determining mRNA coexpression. Moreover, functionally coherent mRNA modules are preferably preserved in protein networks.

Protein Networks Better Predict Biological Functions-To
evaluate the relative utility of mRNA and protein profiling data in predicting gene cofunctionality, we compared mRNA and protein coexpression networks for their gene function prediction potential for a wide variety of GO biological processes and KEGG pathways. Network-based gene function prediction was performed using the random walk-based network propagation algorithm (43), and prediction performance was evaluated using 5-fold cross validation and quantified on the basis of the AUROC. GO biological process and KEGG pathway terms were selected for each cancer type separately (supplemental Table S2) based on the size filtering criteria described in Materials and Methods. Protein networks showed better prediction performance compared with corresponding mRNA networks for 85%, 78 and 92% of the GO terms for breast, colorectal, and ovarian cancer, respectively (Fig. 5A, supplemental Table S16). Protein networks achieved good performance (AUROC Ͼ 0.8) for 223, 144, and 206 GO terms for breast, colorectal, and ovarian cancer, respectively. In contrast, mRNA networks achieved good performance only for 46, 71, and 8 GO terms for the three cancer types, respectively, and the vast majority of these overlapped with the GO terms for which protein networks had good performance (Fig.  5A). In the analyses based on KEGG pathway terms, protein networks demonstrated an even more striking advantage compared with mRNA networks, and performance gain was FIG. 3. Functional homogeneity of mRNA and protein coexpression modules. A, Pie charts comparing the functional coherence between mRNA and protein modules from the three cancer types. Red, green and blue represent modules with significant (adjusted p value Յ 0.01), marginally significant (0.01Ͻadjusted p value Յ 0.15), and insignificant (adjusted p value Ͼ 0.15) GO biological process enrichment, respectively. The total number of modules for each network is provided under each pie chart. The proportion differences between the same colored sections in the mRNA and protein pie charts are indicated in the parentheses beside the proportional number of the protein pie chart. The "ϩ" and "-" signs correspond to higher and lower proportion in the protein pie chart compared with corresponding mRNA pie chart, respectively. The p values in the parentheses were calculated by two-sided Fisher's exact test. B, Empirical cumulative distribution plots of the conservation scores of mRNA modules in corresponding protein networks for individual module groups. Line colors represent the same module groups as in (A). The p values were calculated by one-sided Kolmogorov-Smirnov test.
obvious for KEGG pathways previously reported to have poor mRNA-protein correlations (22)(23)(24), such as oxidative phosphorylation and spliceosome (Fig. 5B, supplemental Table  S17). For 23 GO biological process terms and 11 KEGG pathway terms, good performance was achieved by all three protein networks, but by none of the mRNA networks (supplemental Tables S16 -S17). These processes and pathways covered a wide range of biological phenomena including cell cycle, tricarboxylic acid cycle, focal adhesion, mRNA surveillance pathway, spliceosome, antigen processing and presentation, gluconeogenesis, regulation of ligase activity, lipid oxidation, mitotic DNA integrity checkpoint, mitochondrial transport, among others. These results show that proteomics data strengthened the connection between gene expression and function for at least 75% of the GO biological processes and 90% of the KEGG pathways. Our analyses also identified specific biological processes and pathways whose assessment would benefit most from direct protein measurements.
Protein Networks Reveal Novel Gene-function Relationships-To make the highly functionally relevant breast, colorectal, and ovarian cancer protein coexpression networks available and useful to the broad scientific community, we developed a web application Gene2Net, which allows users to expand one or multiple genes into a small network based on the protein coexpression networks and then perform GBA analysis to generate hypotheses on gene-function relationships. Here we use some examples to illustrate the potential use of the tool.
First, we applied the tool to predict functions for all recently published driver genes (47) in the three cancer types (see supplemental Table S18). Among the 126, 39, and 27 drivers that were included in the breast, colorectal, and ovarian cancer networks, we were able to predict functions for 38, 10, and 14 genes, respectively (GO enrichment analysis under FDR 1%, See supplemental Table S19 -S22). Previously known GO annotations were identified for 27, 9, and 11 among these genes. For example, using KRAS, a frequently mutated gene in colorectal cancer, to query the colorectal cancer network, we associated the gene to its well-known role in Ras protein signaling transduction (Fig. 6A). Using CDH1 to query the breast cancer network, we associated the gene to its wellestablished role in cell adhesion (Fig. 6B). Using STAG1 to query the ovarian cancer network, we associated the gene to known function in cell cycle (Fig. 6C). Moreover, new GO annotations were predicted for 21, 6, and 11 of the driver genes in the three cancer types, respectively. Some of these predicted annotations were closely related to existing GO annotations for the genes, but others associated the genes to new biological processes. For examples, using ERBB2 (HER2), a frequently amplified gene in breast cancer, to query the breast cancer network, we associated the gene to lipid biosynthetic process (Fig. 6D). Although ERBB2 has not been annotated to this process by GO, ERBB2 positive breast cancers have been shown to produce significantly high amounts of fats and the fat synthetic process is required for survival of ERBB2-positive breast cancer cells (48). Accordingly, knock down of two lipid synthesis genes in our ERBB2 network (ACACA and FASN) has been shown to significantly decrease cell viability of the ERBB2-positive breast cancer cell line BT474 (48).
In addition to individual genes, gene lists can also be used as input to Gene2Net. To identify new players in complement activation, a central process of cancer immunity (49), we queried each of the three networks using a list of complement activation genes annotated by GO. The three complement activation networks shared 15 common genes (Fig. 6E), including a new gene PLG (plasminogen) that was not included in GO annotation. Consistent with this prediction, a recent study showed that PLG can serve as a complement inhibitor in addition to its well-established role in fibrinolysis (50). As another example, we queried each of the three networks using a list of genes curated in MsigDB (6) for the epithelialmesenchymal transition (EMT), which plays a critical role in promoting metastasis in epithelium-derived carcinomas (51). The three EMT networks shared 42 common genes, including a new gene, AEBP1 (AE binding protein 1, Fig. 6F). Interestingly, AEBP1 is one of ten genes in a recently published pan-cancer EMT signature (52).
The above examples demonstrate that our protein coexpression network-based tool allows retrieval of known functions for a gene, prediction of new functions, and identification of new genes for a biological process of interest. DISCUSSION With matched mRNA and protein profiling data from three cancer types, we have performed the first systematic study to investigate the relative utility of mRNA and protein profiling data in predicting gene cofunctionality. Although many studies have reported only a moderate correlation between mRNA and protein profiles (15, 19 -22), whether protein profiling data better reflects cellular functions has remained unanswered, because the reported mRNA-protein discrepancy may have both biological and technical explanations (16,17,21). Our study provided quantitative evidence to demonstrate that protein profiling data is more closely aligned with function than mRNA profiling data. Proteomic data strengthened the link between gene expression and function for the vast majority of biological processes and pathways. We identified a subset of biological processes and pathways for which protein measurements would be most critical. We also developed Gene2Net, which will allow biologists to generate hypotheses on new gene-function relationships based on the protein coexpression networks.
Although mRNA profiling has been the dominant approach to studying gene expression and its relationship to cellular functions, it has been suggested that genes with similar mRNA expression profiles are not necessarily functionally coupled (11). Our results showed that chromosomal colocalization plays a significant role in determining mRNA coexpression. Somatic copy number alteration may be an important driver of this phenomenon (22)(23)(24). In addition, genomic colocalization-driven coexpression has been previously reported in Caenorhabditis elegans (53) and Saccharomyces cerevisiae (54). Thus, this observation may also be explained by other mechanisms such as colocalization of coexpressed genes in regions of active chromatin or enhancers shared by neighboring genes on chromosomes. The impact of genomic colocalization on gene coexpression is significantly reduced FIG. 6. Protein coexpression network-based inference of gene-function relationship. A, KRAS network in colorectal cancer. The small nodes are the top ranking neighbors of KRAS, and red nodes represent genes participating in Ras protein signaling transduction. B, CDH1 network in breast cancer, and red nodes represent genes participating in cell adhesion. C, STAG1 network in ovarian cancer, and red nodes represent genes participating in mitotic cell cycle. D, ERBB2 network in breast cancer. The small nodes are the top ranking neighbors of ERBB2, and red nodes represent genes participating in lipid biosynthetic process. E, Tri-cancer complement activation network. The small nodes represent known complement activation genes annotated to the GO term (GO:0006956) and the large node represents the common top ranking neighbor across the three cancer types. Red, blue and green lines represent edges from breast cancer, colorectal cancer, and ovarian cancer network, respectively. F, Tri-cancer EMT network. The small nodes represent known EMT related genes and the large node represents the common top ranking neighbor across the three cancer types. at the protein level than mRNA level. (supplemental Fig. S7-S9). Although mRNA coexpression was driven by both cofunction and chromosomal colocalization of the genes, protein coexpression was driven primarily by functional similarity between coexpressed genes. Importantly, functionally coherent mRNA modules are preferably preserved in protein networks (Fig. 3B), suggesting a role of protein level regulation in coordinating gene functions.
Among the three cancer types, proteomic data provided the largest added value for ovarian cancer (Fig. 1B, Fig. 3). We note that mRNA profiling data for ovarian cancer were generated by microarray, whereas mRNA data for breast and colorectal cancers were generated by RNA-Seq. The observed differences may be partially attributable to the different platforms. However, this also may reflect the unique biology of ovarian cancer. Prevalent copy number alterations in ovarian cancer (27) may create widespread gene expression alterations at the transcriptome level, thereby requiring more extensive post-transcriptional regulation to buffer against non-functional alterations (11,55). Indeed, chromosomal colocalization had a much stronger impact on mRNA coexpression in ovarian cancer compared with the other two cancer types, and such impact was reduced, but still visible at the protein level in ovarian cancer (supplemental Fig. S9).
Although current proteomic platforms can identify more than ten thousand proteins, the number of quantifiable proteins remain much smaller than those can be quantified by mRNA profiling. In this study, the quantifiable proteins in the breast, colorectal, and ovarian data sets were 6281, 3899, and 3327, respectively, whereas the number of quantified genes in corresponding mRNA profiling data sets were 20501, 20501, and 17814. Our study was limited to genes with both mRNA and protein abundance measurements. but we believe our conclusion is not biased, because the same trend was observed with the number of studied genes increasing from 2988 in ovarian cancer to 3764 in colorectal cancer and 5988 in breast cancer. Moreover, the robustness of our conclusion was also confirmed by down-sampling experiments using the breast cancer data sets (see supplemental Text S2 and supplemental Fig. S10).
The network topology may affect the priority scores of genes in the network. Zhang et al. (44) tried to remove this effect by assessing the statistical significance of the scores. To evaluate whether considering network topology could improve network-based function prediction, we combined two statistic metrics, localP and edgeP, with the rank ratio metric for assessing the significance of the priority scores (see supplemental Text S2). On average, considering network topology only increased the AUROCs less than 2% (supplemental Fig. S11). Furthermore, results based on all three types of AUROCs consistently suggest that protein networks significantly outperformed mRNA networks in gene function prediction.
In conclusion, our results demonstrate that proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. The GBA strategies developed in transcriptomic studies would be more effective when applied to proteomic data. Gene function and disease studies would benefit immensely from broad adoption of global proteome profiling technologies.