Integrated Proteomic, Transcriptomic, and Biological Network Analysis of Breast Carcinoma Reveals Molecular Features of Tumorigenesis and Clinical Relapse*

Gene and protein expression changes observed with tumorigenesis are often interpreted independently of each other and out of context of biological networks. To address these limitations, this study examined several approaches to integrate transcriptomic and proteomic data with known protein-protein and signaling interactions in estrogen receptor positive (ER+) breast cancer tumors. An approach that built networks from differentially expressed proteins and identified among them networks enriched in differentially expressed genes yielded the greatest success. This method identified a set of genes and proteins linking pathways of cellular stress response, cancer metabolism, and tumor microenvironment. The proposed network underscores several biologically intriguing events not previously studied in the context of ER+ breast cancer, including the overexpression of p38 mitogen-activated protein kinase and the overexpression of poly(ADP-ribose) polymerase 1. A gene-based expression signature biomarker built from this network was significantly predictive of clinical relapse in multiple independent cohorts of ER+ breast cancer patients, even after correcting for standard clinicopathological variables. The results of this study demonstrate the utility and power of an integrated quantitative proteomic, transcriptomic, and network analysis approach to discover robust and clinically meaningful molecular changes in tumors.

alytical methods. We have overcome these barriers in previous studies, producing both proteomic and transcriptomic characterization of breast cancer using LCM-enriched tumor and normal samples (14,20).
Once the technical challenges to experimental data acquisition are overcome, -omic profiling data pose both opportunities and difficulties in their interpretation. Although these profiles can offer a comprehensive picture of a disease state relative to healthy tissue, their analysis often yields long lists of molecular changes that are difficult to interpret. This difficulty is compounded when the profiling data are obtained from multiple platforms (e.g. genomic, transcriptomic, and proteomic) that require additional integration of complex molecular signals. Novel bioinformatics techniques are required to overcome these challenges and realize the potential of -omic technologies to shed light on fundamental disease biology.
Having obtained in two previously published studies both proteomic and transcriptomic profiles of estrogen receptor positive (ERϩ) invasive breast carcinoma, a subtype that constitutes ϳ70% of all invasive human breast cancers diagnosed worldwide, we sought to integrate and interpret these data in a clinically and biologically meaningful manner (14,20). We first investigated whether proteomic and transcriptomic profiles obtained from independent tumor versus normal purified breast epithelium sample sets yielded a concordant picture of breast cancer disease biology. We then examined three approaches for integrating these profiles with annotated biological networks to discover sets of genes and proteins associated with breast tumorigenesis (network discovery phase; Fig. 1, a-c). Finally, we investigated the ability of these gene sets to predict relapse in independent breast cancer clinical cohorts (prognostic validation phase; Fig. 1d).
In the network discovery phase of our analysis, we explored three related approaches to identify sets of functionally linked molecules with significantly different gene and protein expression in highly enriched malignant breast epithelium (MBE) from invasive breast carcinoma (IBC) versus normal breast epithelium (NBE). The first approach built networks from proteins that were significantly differentially expressed at both the transcriptomic and proteomic level (Fig. 1a). The second, "gene-centric" approach built networks from differentially expressed genes and identified those networks that were significantly enriched in differentially expressed proteins (Fig. 1b). The third, "protein-centric" approach built networks from differentially expressed proteins and scored these networks for their enrichment in differentially expressed genes (Fig. 1c). FIG. 1. Panels a-c depict the three approaches we took to integrate our proteomic and transcriptomic differential expression data sets with annotated biological networks. a, Approach 1 applied network analysis on the list of significantly differentially expressed genes whose corresponding proteins were also differentially expressed in NBE and MBE. b and c, Approaches 2 (genecentric NSEA) and 3 (protein-centric NSEA) built networks in either the transcriptomic or proteomic data set, respectively, and chose among these the subset enriched in differentially expressed proteins or genes, respectively. Black and white arrows in the histological tissue sections denote NBE and IBC epithelium, respectively. d, shows the clinical validation approach we used to test gene expression clinical classifiers based on networks generated by Approaches 1-3 using independent training and validation cohorts of ERϩ breast cancer patients. The numbers alongside each test and training data set indicate the number of ERϩ patients used in the analysis from each in each cohort.
In the prognostic validation phase of our analysis (Fig. 1d), the gene sets underlying significantly altered networks were subsequently trained and tested as prognostic classifiers in large sets of publicly available and clinically annotated ERϩ breast cancer gene expression data. After permutation analysis, only the "protein-centric" integration approach produced a classifier that was significantly predictive of relapse in these independent cohorts. This classifier was predictive despite the fact that these clinical expression data were generated from unpurified, stromally contaminated tumor tissue collected from clinically heterogeneous breast cancer cases.
In summary, this study provides motivation and outlines a framework for the integrated analysis of protein and gene expression cancer profiling data. The results suggest that the analysis of global molecular changes in small sample sets of microdissected primary tumor cells can uncover alterations relevant to large independent cohorts of complex tumor tissue samples. Finally, this study uncovers a novel molecular signature that yields potential insight into the biology of breast cancer tumorigenesis and relapse.

MATERIALS AND METHODS
Proteomic Specimen Acquisition and Processing-Proteomic data used in the network discovery phase were previously published (14). Briefly, NBE and MBE samples were obtained by LCM from each of nine different mammoplasty reduction specimens from women without a history of breast cancer and from each of nine different ERϩ IBC samples, respectively (see Tables I and II for clinical characteristics of tumor and normal specimens used in the proteomic analysis) (14). Approximately 60,000 cells were isolated per tissue sample by LCM and individually subjected to MS-based shotgun proteomics analyses. For peptide/protein identification, acquired MS/MS scans were searched against the Swiss-Prot human database by using a SEQUEST algorithm. Identified proteins were grouped by using Pro-teinProphet (Institute for Systems Biology) (21). The spectral index (SpI) was calculated from normalized spectral counts to determine proteins with differential abundance between normal and malignant specimens. Sample label permutation analysis of SpI was performed to identify differentially abundant proteins, using a nominal two-tailed p value threshold of 0.01 (͉SpI͉ Ͼ 0.556) to determine significance. Further details of sample acquisition and data processing for proteomic analysis can be found in Ref. 14.
Gene Expression Data-For the network discovery phase of our analysis, we used gene expression data that was previously published (20). Briefly, these data represent seven pairs of patientmatched LCM-purified MBE samples from IBC and NBE samples that were assayed on the Affymetrix U133X3P platform (see Table III for clinical characteristics); these samples are a subset of the patients reported in Ref. 20, corresponding to MBE samples and patientmatched NBE samples from the estrogen receptor positive (ERϩ) IBC specimens in the cohort). 13 of these 14 samples were obtained from patients not assayed in the proteomic analysis (the tumor sample from patient 153 was assayed in both transcriptomic and proteomic analyses). Additional details of sample acquisition and expression measurement can be found in Ref. 20. We renormalized the data in Ref. 20 from raw CEL files as described below.
Only ERϩ patients in each cohort were used for the analysis, comprising 149, 211, 209, 203, and 65 patients for the data sets in Refs. 22,23,24,25, and 26, respectively. Survival measures (distant metastasis-free survival, relapse-free survival, and disease-free survival), censure times, and additional clinical covariates (age, tumor grade, tumor size, lymph node metastasis, and systemic therapy) were also obtained for each data set depending on their availability. All of these data sets, except that in Ref. 25, were profiled on an Affymetrix oligonucleotide array platform; the data set in Ref. 25 was assayed on an Agilent oligonucleotide array platform. All of the patient samples in these published clinical data sets were obtained from treatment-naïve post-surgical bulk tumor (i.e. nonmicrodissected) samples, although a subset of the patients in all cohorts with the exception of (24) underwent post-operative (i.e. adjuvant) systemic therapy. The study identifiers, cohort of origin, array platform, and clinical covariate values of all clinical samples analyzed in the prognostic validation stage of our study are provided as supplemental data.
For each gene expression data set, the data were acquired as CEL files and reprocessed using the affyrma package in MATLAB (Mathworks Inc., Natick, MA) and the Bioconductor affy package (http:// www.bioconductor.org/) in R using custom chip definition files obtained from the BrainArray website (http://brainarray.mbni.med. umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp) (27). Probe sets achieving fewer than 20% MAS5 present calls in any single data set (via the R function mas5calls) were removed from subsequent analyses. For the MBE versus NBE comparison of the data in Ref. 20, differential expression was determined using a t test with a q value threshold of 0.05 (nominal p Ͻ 0.0089) (Note that the q value for a gene is the false discovery rate that is implied by setting a significance threshold equal to the p value of that gene (28).) The choice of a t test, as opposed to the use of limma (which was used in the original analysis of Ref. 20 to quantify differential expression), was chosen to yield a simple test statistic to facilitate downstream setbased enrichment analyses.
Gene-Protein Set Concordance Enrichment Analysis-Gene set enrichment analysis (GSEA) was adapted to evaluate the concordance of differential gene and protein profiles of breast cancer data (29). Restricting the analysis to 1236 genes/proteins identified and quantitated on both proteomic and transcriptomic platforms, a gene/ protein "concordance score" was computed by multiplying the t statistic (from the gene expression analysis) with SpI (from the proteomic analysis). This score reflected the magnitude and direction of the concordance (i.e. agreement) of observed gene and protein expression changes, with a positive score indicating concordance (gene up, protein up or gene down, protein down) and a negative score indicating discordance (gene down, protein up or gene up, protein down) of a gene/protein pair. The above genes/proteins were mapped to the MSigDB collection v.2.5 (http://www.broadinstitute.org/gsea/ msigdb/index.jsp), and sets with fewer than 15 genes/proteins were removed. For each gene/protein set, an enrichment score (ES) was computed as a weighted running sum statistic over the concordance score. The ES was then recomputed for each gene set across 20,000 random shufflings of "tumor" (MBE) and "normal" (NBE) labels of samples in both proteomic and transcriptomic data sets and rescaled to yield a distribution of normalized enrichment scores (NESs). As in Ref. 29, the histogram of NES across all genes/protein sets and permutations was used to derive a background distribution for concordance enrichment, from which p and q values were computed. All of the computations were performed in MATLAB (Mathworks Inc.).
Network Set Enrichment Analysis (NSEA)-NSEA was developed to uncover biological pathways with both altered protein and gene expression in the tumor samples. Two versions of this approach were implemented (gene-centric and protein-centric NSEA), both employing Ingenuity Pathway Analysis (v8.8, content version 3204, http:// www.ingenuity.com/) and a GSEA-like statistic to quantify enrichment (29). Gene-centric NSEA uses IPA to uncover biological networks significantly enriched in differentially expressed genes and chooses among these a subset of networks that is significantly enriched in differentially expressed proteins. A protein-centric version of NSEA employs IPA to uncover biological networks significantly enriched in differentially expressed proteins and chooses among these a subset of networks that is significantly enriched in differentially expressed genes. NSEA is closely related to GSEA, using Ingenuity networkderived gene and protein lists in place of manually curated gene sets and unsigned phenotype correlation statistics (Pearson r 2 , ͉SpI͉) instead of the signal-to-noise or Pearson r statistics used in GSEA (29). The NSEA algorithm is described in more detail below and is also depicted in a "road map" shown in supplemental Fig. 1. To explain gene-centric and protein-centric NSEA compactly, parentheses are used to denote the "reverse" analysis, i.e. genes to proteins (proteins to genes).
Significantly altered genes (proteins) from the transcriptomic (proteomic) analysis of LCM-enriched MBE versus NBE were inputted into IPA with standard settings (using all molecules, all direct/indirect relationships, network size of 35) to obtain an "observed" collection of enriched biological networks. A background collection of 50 (500) additional gene (protein) lists was obtained through sample label permutation of the transcriptomic (proteomic) data set, choosing genes (proteins) with p Ͻ 0.0089 (͉SpI͉ Ͼ 0.556) from each permutation. These permuted gene (protein) lists were also inputted into the Ingenuity Pathway Analysis tool to obtain a background collection of 500 -1000 networks enriched in differentially expressed genes (proteins) from phenotype label-permuted data sets. The range of 500 -1000 was chosen to be on the order of the size of the standard pathway collections used for GSEA (e.g. MSigDB).
Network collections derived from observed and permuted transcriptomic (proteomic) data were tested for proteomic (transcriptomic) enrichment against the LCM-enriched malignant breast epithelium versus normal epithelium protein (gene) expression data set using a weighted Kolmogorov-Smirnov-like ES for each network (29). In this step, ͉SpI͉ (Pearson r 2 ) values were computed for all proteins (transcripts) in the breast protein (gene) expression data set of correlation with NBE versus MBE phenotypic status. For each observed or background network, an ES was computed as the maximum of a weighted running sum statistic of ranked ͉SpI͉ (r 2 ) values corresponding to proteins (genes) encoding nodes in the given network. An unsigned score, i.e. ͉SpI͉ (r 2 ), was chosen as correlation metric to capture networks enriched in both up-regulated and down-regulated proteins (genes). This score was then recomputed across 20,000 permutations of tumor (MBE) and normal (NBE) sample labels in the protein (gene) expression data set to yield a background ES distribution for each set. ESs were Z-transformed across all phenotype label permutations for each network set to yield NESs. The histogram of NES across all network sets and phenotype label permutations was used to derive a null distribution for network set enrichment. Each observed network set was superimposed on this null distribution to obtain a one-tailed permutation p value for proteomic (transcriptomic) enrichment. p values were Bonferonni-corrected by a factor corresponding to the number of observed networks. Networks with p Ͻ 0.05 following the Bonferonni correction were determined to be significantly enriched in the proteomic (transcriptomic) data set. NSEA was restricted to gene (protein) sets that had more than five genes (proteins) measured in the transcriptomic (proteomic) data set, because small sets may distort the background distribution of NESs, as discussed in Ref. 29. All of the computations were performed in MATLAB (v7.9.0; Mathworks Inc.; http://www.mathworks.com).
Prognostic Classifier Validation-Networks enriched in significantly altered genes and proteins were tested for their ability to prognostically stratify the patients in data sets of Refs. 22-26 on the basis of gene expression. Each network was mapped to a set of genes on each expression array via the Entrez Gene ID and Hugo Symbols provided with the Ingenuity output file and the gene mappings bundled with the Affymetrix chip definition files (22)(23)(24)26) or UniGene identifiers provided with the text output of (25). A support vector machine classifier with a linear kernel was trained on 5-year distant metastasis-free survival in ERϩ patients from Ref. 22 and tested in ERϩ patients from Refs. 23-26. Briefly, this classifier consisted of a threshold on a weighted sum of gene expression values; a patient whose score fell above or below the threshold was, respectively, assigned a "good" or "poor" prognostic category label by the classifier. The support vector machine algorithm determined the optimal weights and threshold for the classifier during the training step. Ztransformed gene expression data were used as input into the classifier in both training and test sets.
The efficacy of the classifier was then validated against diseasefree (DFS) and relapse-free survival (RFS) measures (DFS in Refs. 23, 25 and 26 and RFS in Ref. 24). DFS, RFS, and distant metastasis-free survival are closely related survival measures that refer to the time elapsed between diagnosis and the reappearance of breast cancer (including a second primary tumor), local or distant breast cancer metastasis, or distant metastasis, respectively, in a patient. The estimated hazard ratios of high risk versus low risk groups in each cohort and in meta-analysis across cohorts were determined using Cox regression; the hazard ratio is a standard metric used in clinical trials to quantify the effect of a grouping variable (e.g. risk stratification) on the rate of events (e.g. disease relapse) in a group of patients over the period of the study. For cohorts having additional clinical data (i.e. lymph node status, ER status, age, tumor size, grade, and treatment), the independent prognostic ability of each classifier subject to these additional covariates using multivariate Cox regression was also determined. We also computed receiver operator characteristic curves for each classifier and each test-cohort by varying the classifier threshold from a value in which all patients received a "Poor" to a value in which all patients received a "Good" prognosis label. Classifier labels were compared with the 5-year survival measure (DFS in (23), RFS in (24), DFS in (25), DFS in (26)) to compute the sensitivity and specificity of the test for each value of the classifier threshold. Receiver operator characteristic curves were computed by plotting sensitivity versus 1-specificity, and the areas under the curve (AUCs) were computed and reported. AUCs are measures of the efficacy of a binary classifier and vary between 0.5 and 1.
Permutation testing was applied to compare classifiers obtained from observed (i.e. network-derived) gene lists with random gene sets. For each observed gene list, random gene lists of the same size were chosen from the list of genes assayed on the training set microarray (22). Classifiers were trained on these random gene lists, and hazard ratios in the test cohorts were computed using Cox regression (Ϯ clinical covariates). A background distribution of hazard ratios from these 1000 random gene lists was computed, and the observed hazard ratios (either in meta-analysis or individual cohorts and either univariate or multivariate) were compared against the appropriate background distribution. All of the computations were performed in MATLAB (Mathworks Inc.).

Concordance of Differential Gene and Protein Expression in
ERϩ Invasive Breast Carcinoma-Global protein and gene expression were examined in two cohorts of laser capture microdissected ERϩ breast cancer samples (Tables I-III). As noted above, both data sets have been previously published (14,20): the proteomic data set was derived from LCMenriched MBE from ERϩ breast cancer patients and NBE from mammoplasty specimens of non-breast cancer patients (Tables I and II), whereas the transcriptomic data set consisted of patient-matched MBE and patient-matched NBE (Table III). Quality control filtering (see "Materials and Methods") of gene expression data from seven ERϩ IBC and patient-matched adjacent normal tissue yielded 9804 genes, of which 966 were differentially expressed between purified NBE and IBC cells (q Ͻ 0.05, Student's t test). (Note that these results differ from those presented in Ref. 20 because we are analyzing only ERϩ IBC tissues and have used a different array normalization/quality control pipeline; see "Materials and Methods.") Analysis of protein expression in nine ERϩ MBE tumors and (non-patient matched) NBE samples yielded 1623 proteins, of which 121 were differentially expressed (͉SpI͉ Ͼ 0.556).
A total of 1236 genes-protein pairs were detected on both transcriptomic and proteomic platforms in the two cohorts. The normal-tumor differential expression of these genes/proteins as measured by the t statistic and SpI, respectively, is plotted in Fig. 2a. Global concordance of differential gene and protein expression was modest but highly significant (r 2 ϭ 0.19, p Ͻ 1 ϫ 10 Ϫ12 ). Moreover, there was a significant overlap between genes/proteins determined to be significantly differentially expressed on each platform (p Ͻ 5 ϫ 10 Ϫ8 ; Fisher's exact test). Examining the direction of differential expression (i.e. up versus down in tumor) in 210 genes/ proteins called significant in either platform showed striking concordance (p Ͻ 1 ϫ 10 Ϫ12 ; Fisher's exact test; Table IV). Restricting the comparison with 28 genes/proteins called significant on both platforms showed 100% concordance in the direction of expression change (p ϭ 1.21 ϫ 10 Ϫ7 ; Fisher's exact test; Table IV). These 28 genes/proteins are shown in supplemental Table 1.
Finally, the pathway concordance of gene and protein differential expression was investigated using the MSigDB database and gene-protein set concordance enrichment analysis, analogous to GSEA. Briefly, the 1236 genes/proteins detected by both proteomic and transcriptomic platforms were mapped onto 1015 gene/protein sets of 15 members or more. Gene/protein sets were assigned an enrichment score relative to a "concordance signal" computed from the proteomic spectral index and the transcriptomic t statistic. Computing this enrichment score for original and phenotype label permuted data sets yielded observed and null distributions of the NESs and p values for enrichment across gene/protein sets (see "Materials and Methods" for additional details). The vast majority of gene/protein sets (927 of 1015) demonstrated positive enrichment (NES Ͼ 0) (tumor versus normal), with 310 showing significant positive enrichment (q Ͻ 0.25) (Fig. 2b). These results suggest a significant degree of pathway level concordance in the differential gene/protein expression in invasive ERϩ breast cancer. More broadly, both the single gene/protein and pathway level results strongly suggest that Network Integration of Transcriptomic and Proteomic Results-Three distinct strategies were examined to integrate transcriptomic and proteomic data with known networks to gain insight into the biology of invasive ERϩ breast tumorigenesis (Fig. 1a). The first and simplest approach built IPA networks directly from genes/proteins that were significantly altered in both proteomic and transcriptomic data sets. The second approach (gene-centric NSEA) built IPA networks from significantly altered genes and then chose the subset of those networks significantly enriched in differentially expressed proteins (Fig. 1b). The third approach (protein-centric NSEA) built IPA networks from significantly altered proteins and then chose the subset of those networks significantly enriched in differentially expressed genes (Fig. 1c). Networks emerging from each approach were evaluated for their ability to prognostically stratify patients across multiple clinical breast cancer gene expression data sets (Fig. 1d). Underlying this strategy was the hypothesis that a network of genes/ proteins contributing both to tumorigenesis and clinical relapse would reveal potentially important features of breast cancer biology.
Approach 1: Build Networks from Both Differentially Expressed Genes and Proteins-A list of 28 genes/proteins with significant expression changes in both transcriptomic (q Ͻ 0.05) and proteomic data sets (͉SpI͉ Ͼ 0.556) (supplemental Table 1) was subjected to IPA core analysis with standard settings (see "Materials and Methods"), yielding two significant networks containing 15 and 13 focus molecules. Each of the two networks was examined for their ability to predict relapse using ERϩ breast cancer patients from five independent cohorts (22)(23)(24)(25)(26) and their corresponding gene expression data sets. For each network, a support vector machine classifier was trained in one cohort (22) to predict 5-year survival (22). Significant survival time differences between the resulting FIG. 2. Concordance of differential gene and protein expression in NBE and MBE epithelial samples. a, scatter plot of differential protein expression (SpI) and differential gene expression (t statistic) for 1236 genes/proteins measured on both transcriptomic and proteomic platforms. Each point represents a gene/protein. Points are colored according to whether the corresponding gene/protein is significant in none, one, or both of the genomic and proteomic data sets. b, gene-protein set concordance analysis results for 1015 MsigDB gene/protein sets plotted as a histogram of observed normalized concordance enrichment score (black) against the background distribution obtained through shufflings of tumor and normal sample labels (red). NES scores for gene-protein sets showing significant concordance (q Ͻ 0.25, NES Ͼ 1.36) are highlighted in blue. See "Materials and Methods" for details of the analysis. "favorable" and "poor" classification categories were determined across the remaining four independent clinical test sets (23-26) using Cox regression. Each of the networks yielded nominally significant classifiers in meta-analysis across the four cohorts (Table V). However, neither of these two associations remained significant when compared against a null distribution of classifiers built from 1000 random gene sets (Table V). Approach 2: Gene-centric NSEA-966 of 9804 genes with significant differential gene expression between NBE and MBE (q Ͻ 0.05, Student's t test) were identified. Input of this gene list into IPA yielded 45 significant networks. Gene-centric NSEA (see "Materials and Methods") was applied to determine enrichment of these 45 networks in the proteomic data set. Interestingly, none of the transcriptomically derived networks yielded significant enrichment in differentially expressed proteins. Specifically, 17 of the 45 networks were not included in the enrichment analysis because of inadequate power, i.e. less than five proteins in the network were represented in the proteomic data set. The remaining 28 networks contained between 5 and 15 detected proteins, but none of these networks showed a significant enrichment when compared with the background NES distribution (p Ͻ 0.05, following Bonferonni correction for 28 hypotheses). We did not test any of the transcriptomically derived networks as clinical relapse predictors because of this lack of observed proteomic concordance.
Approach 3: Protein-centric NSEA-121 differentially expressed proteins were inputted into IPA, yielding seven significant networks. Protein-centric NSEA found that two of the seven protein-derived networks showed significant enrichment in the gene expression data set (p Ͻ 0.05, after Bonferonni correction for seven hypotheses) (Fig. 3 for network 1; see supplemental Fig. 2 for network 2). These two significant gene/protein enriched networks were tested for their ability to predict relapse across independent clinical gene-expression cohorts. As in Approach 1, an support vector machine classifier on network genes was trained on 5-year relapse-free survival in one ERϩ breast cancer cohort (22), and then the performance of this network was tested across four independent test cohorts of ERϩ patients applying Cox regression on DFS (23,25,26) or RFS (24). The classifier performance (quantified as hazard ratios) was compared between the network genes and classifiers created from 1000 random gene sets to yield a stringent permutation p value for predicting relapse versus relapse-free survival.
Although both of the above networks showed nominal significance in meta-analysis across the four test cohorts, only one network (Network 1; see Fig. 3) was significant following permutation (p ϭ 0.004, permutation test). Stated differently, Network 1 outperformed 99.6% of random gene sets as a prognostic clinical classifier, yielding a nominal p value of 6.5 ϫ 10 Ϫ11 and a hazard ratio of 2.35 in meta-analysis across the four cohorts. Network 1 was also nominally significant in all four individual test cohorts (p ϭ 0.002 in Ref. 23 Multivariate Cox regression analysis revealed that the Network 1-based predictor was independent of histological grade,   Table 2). In summary, protein-centric NSEA yielded a network that was robustly informative of breast cancer relapse. Further- show the network with protein and gene expression overlaid, respectively. In each figure, red and green coloring represents proteins/genes that are up-and down-regulated, respectively, in the laser capture microdissected cancer cells relative to normal epithelium. Nodes in a and b are annotated with the spectral index and log fold change for that protein and gene, respectively. more, this network-based expression biomarker was prognostically significant even after correcting for commonly used clinicopathological variables and comparison against 1000 random gene sets.

DISCUSSION
Methodological Considerations-We examined three approaches for integrating differential proteomic and transcriptomic results with annotated biological networks for ERϩ invasive breast cancer-derived data sets. Approach 1 built a network from intersecting differentially expressed genes and proteins identified in the proteomic and transcriptomic analyses. Although simple, the approach ignored much of the transcriptomic data set, the majority of which mapped to proteins that were undetected on the proteomic platform. Furthermore, this approach only took into account the most significant changes in both data sets, which is likely to be conservative and insensitive. Approach 2 (gene-centric NSEA) built networks from significantly differentially expressed genes and evaluated those for enrichment with significantly altered proteins. This approach failed to yield any altered networks. There are several likely explanations for this result. First, the comparatively small number of proteins detected on the proteomic platform (versus gene expression) resulted in a minimal overlap between transcriptomically derived networks and proteins detected in the proteomic data set. Second, transcriptomic analysis yielded a large number of networks, many of which are likely driven by "biological noise" (i.e. false positives). Kaplan-Meier curves depict the percentage of patients without disease (a, b, and d) or relapse (c) as a function of time from diagnosis, stratified by computed prognostic class. Disease and relapse events were treated equivalently in the meta-analysis results shown in e. Notches on each curve (marked with ϫ) represent censure times. The significance of outcome differences between the two prognostic classes in each analysis was computed using the log rank test and is displayed alongside the respective Kaplan-Meier curve. The success of Approach 3 may be attributed to a notable feature of the protein-centric approach: protein expression is more directly linked to the biological state of a tissue. Quantitative proteomics measures protein expression, albeit with bias against the detection of low abundance molecular species. In contrast, gene expression platforms yield comprehensive genome-wide snapshots of the transcriptomic state. However, many of these gene expression changes are not reflected in the proteome, because of factors such as posttranscriptional regulation, variable rates of protein turnover, and post-translational modification. Thus, many gene expression changes are likely not functionally important, representing "passenger" rather than "driver" events in tumorigenesis. However, a subset of these transcriptomic changes may emerge as important biological signals because of their closeness in network space to differentially expressed proteins. Therefore, anchoring a network analysis on significant protein changes and then evaluating the resulting networks for significant differential gene expression changes may leverage the comprehensiveness of transcriptomic data while buffering for biological noise.
There are several precedents in the literature of networkbased interpretation of differential proteomic cancer data sets. For example, in a previous study, IPA was used to analyze global protein profiles of nontransformed and transformed breast cancer cell lines (30). In a second study, differential proteomic changes in normal versus tumor cells of Apc Min/ϩ mice were interpreted in the context of previously constructed gene coexpression networks (31). Similar to both of these studies, NSEA employs the existing biological networks to identify differentially expressed modules. However, unlike in Refs. 30 and 31, NSEA integrates both transcriptomic and proteomic expression data from the same tissue to identify networks altered at the gene and protein level. The choice of protein-protein interaction instead of gene coexpression networks (used in Ref. 33) is not essential to NSEA; however, it is believed that protein-protein interactions more reliably reflect underlying biology by capturing signaling, binding, and other biochemical interactions between molecules. The use of IPA for biological network analysis could similarly be replaced with any other proprietary or public network database (e.g. STRING (32) or SPIKE (33)) and alternate methods of scoring subnetworks enriched for differentially expressed genes/proteins (34 -36).
Notably, Network 1 was discovered through the analysis of a small cohort of purified epithelial tissue, although it was prognostically validated in a large cohort of unpurified (i.e. stromally contaminated) tumor tissue. The small size of our network discovery data sets (seven MBE-NBE pairs for gene expression and nine MBE and nine NBE samples for proteomic analysis) is a testament to the power of an integrated approach that combines different levels of biological signal with functional network annotation. In addition, because Network 1 was initially discovered in the analysis of highly purified epithelial tissue, one would expect the strength of any biological signal originating in this tissue compartment to diminish with the presence of stromal contamination. All of the samples profiled in the expression data sets analyzed in the prognostic validation component of our study (22)(23)(24)(25)(26) were unpurified samples, many of which were likely highly contaminated by stroma. The choice of using these data sets was made out of necessity, because, to our knowledge, there are no large genomic case studies with clinically annotated profiles of highly enriched breast cancer epithelium. However, despite this potential confounder of stromal contamination, our analysis was able to detect a significant statistical effect on relapse using a clinical classifier built from our network. The cohort-to-cohort differences in the degree and significance of prognostic impact seen with this classifier (Tables V-VII) are likely related to issues of statistical power and sampling noise in the face of this significant confounder. We surmise that the classifier resulting from the discovered network would show more profound and consistent prognostic impact using LCMenriched MBE expression profiles from a clinically annotated cohort of ERϩ breast cancer patients.
Guidelines for Evaluating Clinical Relapse Signatures-Part of the above analysis involved evaluation of significantly altered gene/protein networks for their ability to predict breast cancer relapse. An important caveat of evaluating gene sets and gene signatures for relapse is that it is quite possible to train a prognostically informative classifier (i.e. a biomarker(s) that predicts clinical outcome) from a random gene set. As has been previously observed in the gene expression literature, such a "random classifier" will even perform well on multiple validation data sets (37,38). This baseline predictive ability may arise from the "footprint" of proliferative index or tumor grade on all genes in the genome (37)(38)(39). As a result, a clinical classifier built from a random gene set will encode this generic molecular feature and be somewhat predictive of prognosis, without necessarily reflecting novel tumor biology. Indeed, this caveat has been observed with many of the network-derived and random gene sets of the present study. In prognostic validation phase of our study, this challenge was overcome through permutation, i.e. by ranking the prognostic ability of an observed gene set relative to a collection of randomly chosen gene sets. A gene set that performs no better than random on the test set is likely not encoding an interesting biological principle. For example, such was the case with both networks obtained by Approach 1. However, with Approach 3, a network was found that robustly surpassed this stringent permutation-based significance criterion. Furthermore, this network-based biomarker signature was significant even after correcting for commonly used clinicopathological variables used in determining breast cancer prognosis. As a result, this network very likely encodes a biologically significant feature of breast cancer molecular biology.
Biological and Clinical Implications of Discovered Network-Protein and gene expression changes superimposed on the prognostically significant network obtained via proteincentric NSEA (Fig. 3) suggest an overlap of three interrelated biological processes: 1) activation of stress response signaling, 2) altered small molecule metabolism, and 3) modulation of cytoskeletal and extracellular matrix components. These three themes comprise a host of interesting molecules and fuel many intriguing conjectures, a subset of which is discussed below.
Central to this network are the two canonical stress-activated protein kinases, JNK and p38 MAPK. These kinases mediate cellular response to environmental and genotoxic stresses, in particular, reactive oxygen species formation, while also playing important roles in inflammation, differentiation, survival, and cell migration (40). The impact of these kinases in cancer is complex-and context-specific. Thus, although certain cells employ the signaling pathways involving these kinases to inhibit growth and promote apoptosis, other cells employ JNK and p38MAPK to facilitate proliferation, survival, and invasion through activation of Ap1. This dual function mirrors the dual role of oxidative stress in cancer, in which tumor progression is promoted by inducing genomic instability, whereas apoptosis and other forms of cell death are simultaneously enhanced. Although phosphorylation activity of these two stress-activated protein kinases has not been measured, mild overexpression of JNK at the gene level and of p38 MAPK on both the protein and gene levels was observed in this study, as shown in Fig. 3.
In addition, consistent with JNK activation, dramatic downregulation of ferritin heavy chain at the protein level was observed. Ferritin is the major iron storage protein in human cells, and its down-regulation causes reactive oxygen species accumulation and JNK activation through the increase of intracellular labile iron pools (41,42). Conversely, JNK activation has also been shown to cause ferritin degradation, release of labile iron pools, and reactive oxygen species formation (43). Down-regulation of ferritin commonly occurs between early and late stages of esophageal carcinoma and is an essential event in transforming growth factor-␤-induced epithelial to mesenchymal transition in cultured cancer cells (42). Interestingly, ferritin heavy chain is not found to be downregulated at the transcript level. However, ferritin heavy chain is classically known to be post-transcriptionally regulated via a pathway involving iron response proteins IRP1 and IRP2. The analysis did not determine post-translational or transcription modifications, but either of these factors could be a possible explanation for this discordance (44).
Striking proteomic and transcriptomic up-regulation of heat shock protein HSBP1 (also known as Hsp27) was also observed. This finding is consistent with previous observations in which HSBP1 was found to be overexpressed in more than 50% of breast cancers and other epithelial tumors; heat shock protein has also been associated with multidrug resistance to chemotherapy (45)(46)(47). Similar to JNK and p38 MAPK, heat shock proteins are activated by cellular stress. In addition, HSBP1 is associated with the activation of JNK and inhibition of p38 MAPK in hematopoietic cells (48,49) and is a substrate of the kinase MAPKAP-K2, which is directly phosphorylated by p38 MAPK (50).
HSBP1 prevents the cleavage of PARP1 (51), which was one of the most significantly altered proteins in the proteomic data set. PARP1 cleavage is a hallmark of programmed cell death in both normal and cancer cells (52). The enzyme mediates a key response to genotoxic stress by sensing doubleand single-stranded breaks in DNA and rapidly recruiting a plethora of DNA repair proteins to the lesion site. PARP1 is also linked intricately to cancer cell metabolism, because its enzymatic activity causes significant depletion of cellular NADϩ and ATP, leading to a novel form of metabolic cell death termed parthananos (53). Most importantly, PARP1 has emerged as a centerpiece of breast cancer therapeutics, because PARP1 inhibitors have shown efficacy in the treatment of "triple negative" (i.e. ER, PR, and Her-2-Neu negative) breast cancers (52). Furthermore, recent overexpression and RNA interference knockdown experiments have implicated p38 MAPK as a major modulator of sensitivity to PARP1 inhibitors in breast cancer cell lines, providing an intriguing link between these two nodes in Network 1 (54,55). PARP1 is significantly up-regulated across both the proteomic and gene expression data sets, which suggests an intriguing novel role for this molecule in the stress response of ERϩ tumors.
Small molecule metabolism enzymes acid ceramidase, ATP citrate lyase (ACLY), and nucleotide diphosphate kinase 1 (NME1) were concordantly up-regulated in both the proteomic and transcriptomic data sets. Acid ceramidase, an enzyme that mediates the conversion of ceramide to sphingosine, links small molecule metabolism and stress response pathways in the network in Fig. 3. Ceramide accumulation provides a key molecular signal to coordinate the cellular response to serum deprivation, ionizing radiation, inflammatory cytokine stimulation, and other stresses (56). Notably, ceramide activates JNK and also causes programmed cell death through PARP1 cleavage (57,58). Ceramide-mediated apo-ptosis can be prevented through the overexpression of acid ceramidase, which metabolizes intracellular ceramide to sphingosine (59).
ACLY is a central metabolic enzyme that is essential for acetyl-CoA synthesis from citrate (60). The enzyme shuttles carbon moieties derived from glycolysis into de novo lipid synthesis, providing an important link between the Warburg effect and cancer anabolism (61). Pharmacological and genetic inhibition of ACLY suppresses tumor growth in multiple cell lines and xenografts (61). In addition to its established role in lipogenesis, it was shown that acetyl-CoA proteins derived from ACLY serve as the primary substrate for chromatin modifying histone acetylases, suggesting that ACLY may be involved in the epigenetic regulation of the differentiation state of tumor cells (62). Intriguingly, ACLY has been shown to undergo phosphorylation at a catalytic histidine residue by another protein in Network 1, the enzyme NME1 (63).
NME1 is a small molecule metabolism enzyme that was up-regulated in both the transcriptomic and proteomic data sets. NME1 catalyzes synthesis of non-ATP nucleotide diphosphates. NME1 overexpression is also associated with the inhibition of JNK and p38 MAPK, as well as broad inhibition of the proliferative kinases ERK, Akt, and PI3K (64,65). NME1 is also known as a "metastasis suppressor," because loss or underexpression of the protein is associated with high metastatic potential in breast cancer and other tumor types (65). This property may underlie its prognostic value in the network-based clinical predictor reported in this study.
The periphery of the proposed network contains multiple down-regulated extracellular matrix and cytoskeletal proteins that include decorin and type XIV collagen (COL14A1). Decorin is a extracellular matrix proteoglycan whose overexpression causes inhibition of the oncogenic Erbb2 protein in breast carcinoma cell lines (66). Decorin is degraded by matrix metalloproteases, which are endopeptidases that cleave multiple components of the extracellular matrix and are ubiquitously up-regulated across human cancers, including breast cancers (67,68). Matrix metalloproteinases are transcriptionally activated by JNK and p38 MAPK but are generally regulated through post-translational cleavage from an inactive zymogen state (40,67). Furthermore, decorin binds to type XIV collagen, and this interaction is postulated to promote the organized connection of collagen fibrils in the extracellular matrix (69). In vitro-based data demonstrate that decorin suppresses breast metastases (70 -72), whereas direct clinical data show that human breast tumors with reduced decorin expression are associated with poor outcome (73).
Taken together, Network 1 highlights an intriguing interplay of stress response, cancer metabolism, and tumor microenvironment pathways. The dual property of this network as being 1) proteomically and transcriptomically altered in the breast cancer versus normal epithelium analysis and 2) prog-nostically informative across a large clinical collection of breast cancer patients underscores the potentially important role of these pathways in the biology of tumorigenesis and metastasis in ERϩ breast cancers.
The functional role of this network in breast cancer is not established by our study. The patterns discussed above represent hypotheses that would require significant additional experimentation to functionally validate. Because many of the central nodes in this network are involved in phosphosignaling (JNK and p38 MAPK), targeted and global kinase profiling of MBE and NBE would critically test these hypotheses. Additional functional insight could be gained through knockdown and overexpression experiments in human breast cancer cell lines or genetically engineered mouse models combined with histological and molecular phenotypic profiling following each perturbation. Such experiments are outside the scope of our present study but represent a potentially fruitful direction of future experimental efforts.
Conclusion-This study has demonstrated the power of an integrated quantitative proteomic, transcriptomic, and network analysis approach to discover robust and potentially meaningful clinical molecular changes in breast cancer. The work began by demonstrating a basic concordance between proteomic and transcriptomic expression changes measured in two studies of ERϩ invasive ductal carcinoma patients. Leveraging this concordance, three novel methods were implanted to integrate proteomic and transcriptomic expression changes with biologically annotated networks, leading to the discovery of novel molecular features in breast cancer biology. The most successful approach constructed networks from significantly altered proteins and then chose among these networks significantly enriched in differentially expressed genes. Strikingly, it was found that a network enriched in differentially expressed genes and proteins in cancerous versus normal breast epithelium was informative of breast cancer relapse. This network linked pathways in cellular stress response, cancer metabolism, and tumor microenvironment through experimentally annotated signaling and binding interactions. The present results suggest physiologically important intersections of diverse cellular pathways that mediate both tumorigenesis and clinical aggressiveness in breast cancer. The findings have the potential to open new avenues of research into the molecular mechanisms of breast tumorigenesis and the development of novel therapeutics strategies for human breast cancer. of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
□ S This article contains supplemental material. ¶ To whom correspondence may be addressed. E-mail: b.karger@ neu.edu.
To whom correspondence may be addressed. E-mail: dsgroi@ partners.org.