Extracting Pathway-level Signatures from Proteogenomic Data in Breast Cancer Using Independent Component Analysis*

Independent component analysis was applied to human breast cancer proteogenomic data, and pathway-level signatures were further integrated with clinical information. Our results demonstrated that ICA can be used to extract biological relevant signals from multi-omics data in an unsupervised manner. Graphical Abstract Highlights Unsupervised feature extraction from proteogenomics data. Pathway level integration of multi-omics data based on clinical features. Recent advances in the multi-omics characterization necessitate knowledge integration across different data types that go beyond individual biomarker discovery. In this study, we apply independent component analysis (ICA) to human breast cancer proteogenomics data to retrieve mechanistic information. We show that as an unsupervised feature extraction method, ICA was able to construct signatures with known biological relevance on both transcriptome and proteome levels. Moreover, proteome and transcriptome signatures can be associated by their respective correlation with patient clinical features, providing an integrated description of phenotype-related biological processes. Our results demonstrate that the application of ICA to proteogenomics data could lead to pathway-level knowledge discovery. Potential extension of this approach to other data and cancer types may contribute to pan-cancer integration of multi-omics information.


Extracting Pathway-level Signatures from Proteogenomic Data in Breast Cancer Using Independent Component Analysis* □ S
Wenke Liu ‡ §, Samuel H. Payne ¶, Sisi Maʈ ‡ ‡, and David Fenyö ‡ §** Recent advances in the multi-omics characterization necessitate knowledge integration across different data types that go beyond individual biomarker discovery. In this study, we apply independent component analysis (ICA) to human breast cancer proteogenomics data to retrieve mechanistic information. We show that as an unsupervised feature extraction method, ICA was able to construct signatures with known biological relevance on both transcriptome and proteome levels. Moreover, proteome and transcriptome signatures can be associated by their respective correlation with patient clinical features, providing an integrated description of phenotype-related biological processes. Our results demonstrate that the application of ICA to proteogenomics data could lead to pathway-level knowledge discovery. Potential extension of this approach to other data and cancer types may contribute to pan-cancer integration of multi-omics information. Molecular & Cellular Proteomics 18: S169-S182, 2019. DOI: 10.1074/mcp.TIR119.001442.
Breast cancer is the most common cancer among women, and although targeted therapies have helped to significantly reduced breast cancer mortality rate in the past decade, further improvement will require a comprehensive understanding of the molecular mechanisms of the disease (1,2). Recently, deep mass spectrometry based proteomic characterization of genomically annotated breast cancer samples by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) 1 has marked the initial step of a proteogenomic integrative approach, in which recurrent mutations and copy number variations on the genomic level, expression profiles on the transcriptomic level and protein abundance and functional manifestations on proteomic level were measured for the same group of patient samples and examined in the same framework (3)(4)(5). The collection of high-quality multi-omics data immediately led to the discovery of concordant gene amplification and protein phosphorylation in key pathways (3). At the same time, there is increasing demand for analytical approaches that could incorporate all data types and extract pathway level signatures. Because in all human patients "-omics" data sets the number of features far exceeds the number of samples, analysis of any single data type is already susceptible to "the curse of dimensionality," and integration by simple concatenation of multi-omics data would be an even less desirable option. Our previous work has benchmarked the predictive power of multi-omics data sets for classifying breast cancer patients into different survival groups and showed that combined multi-omics data sets produced with data-driven fusion techniques were not able to outperform proteomic data alone (6). This result highlighted the possible redundancy among information contained in different biological levels and motivates us to explore other data fusion techniques that extract both concordant and complementary features from high-dimensional multi-omics data.
In the current study, we applied independent component analysis to proteomic and transcriptomic data of 77 breast cancer samples to extract pathway-level molecular signatures. Independent component analysis (ICA) is an unsupervised learning method widely used in signal processing and has been applied to cancer genomics with notable success (7)(8)(9). This approach decomposes the molecular profiles into linear combinations of non-Gaussian independent sources or components, each of which is comprised of weighted contributions from individual genes. Therefore, ICA reduces the dimensionality of original data by representing the molecular profile of each sample as weighted sum of several "metagenes" or "meta-proteins," and the weight of specific metagene/protein (mixing scores) in one sample reflects the "activity" of that component in the sample. Different from the more conventional dimension reduction method of principal component analysis (PCA), which seeks to find uncorrelated factors that explain the variance among the data, and works the best when the underlying components are normally distributed, ICA are able to discover more informative representations of high-dimensional biological signals, which are usu-ally super-Gaussian and contain more close-to-zero values than a normally-distributed sample (10). As clinical features are also available for the CPTAC samples, molecular signatures can be constructed from clusters of meta-genes/proteins that show activity patterns correlated with these clinical features. Further, taking advantage of a specific clinical feature as an "anchor," this method may help extract patterns at different biological levels and across different cohorts, which may originate from the same cellular functionality (Fig. 1). The signatures extracted from different data sets were filtered based on their intrinsic stability and association with known clinical features (see Experimental Procedures below) and grouped into modules that showed similar correlation patterns to clinical features. Subsequent gene set enrichment analysis revealed the biological relevance of these modules to pathways such as HER2 signaling, mitosis, and histone modification. Our analysis has demonstrated that ICA was able to blindly extract information in the form of weighted gene combinations, which may be biologically meaningful at pathway level. With input from clinical features or other sample subgrouping indices, these signatures may be further integrated into multilevel models that provide insights into the molecular mechanisms of breast cancer.

EXPERIMENTAL PROCEDURES
Data Sets-Transcriptome and proteome data of 77 breast invasive carcinoma (BRCA) patients and 84 ovarian serous cystadenocarcinoma (OV) patients were acquired by the TCGA and CPTAC projects: the transcriptome was characterized using the Agilent mRNA expression array platform (Santa Clara, CA) MS and global protein abundance data were obtained using iTRAQ 4-plex LC-MS/MS (3,(11)(12)(13). For OV data, Preprocessing procedure collapsed peptides mapped to the same gene by taking the mean value. Expression levels of genes and abundance of proteins were included in the transcriptome and proteome matrices. Clinical and demographic features were associated with each tumor sample.
Independent Component Analysis-Data were presented in a p ϫ n matrix X with genes in rows and samples in columns. The goal of ICA is to decompose the p ϫ n data matrix into the product of a source matrix S (p ϫ k) and a mixing matrix A (k ϫ n). The ith column of the source matrix represents coefficients of each of the p genes for the ith independent component. The coefficient vector of each component could be considered as p random samples that revealed probability distribution of a specific random variable. The mutual information between any pair of such variables is minimized, and the components are therefore statistically independent. Genes with coefficients of positive or negative values in the same component indicated that their levels may be enhanced or suppressed in the same biological process. The ith row of the mixing matrix represents contributions of the ith component in the source matrix to profiles of each of the n samples. Rows of the mixing matrix therefore record the activity of each components (or meta-genes/ meta-proteins) across n samples.
The stochastic nature of most ICA algorithms entails that different randomly initiated runs would give rise to different results, and there is no guarantee that the true structure of the data could be correctly estimated from any single run (10,14). To assess the statistical reliability of ICA results, stable components were filtered based on an approach adapted from Engreitz et al., which cluster components FIG. 1. Data analysis workflow. Coefficients of independent components and their corresponding mixing scores are extracted from both proteome and transcriptome data sets for multiple randomly initiated runs. Each independent component is a pathway-level representation. Molecular signatures were identified as cluster centroids of these components, and clinical features were used to select biologically relevant signatures, which were further annotated with pathway analysis.
obtained from different runs with K-medoids method (8). In brief, ICA was run with k ϭ n for 50 times to extract as much information as possible. As the sign of the same component may be flipped during different randomly-initiated runs, we followed the algorithm of Engreitz et al. and took advantage of the fact that the extracted components usually had skewed distributions (different sizes of tails on positive and negative ends) to align flipped components (8). For each individual component, if the larger tail was on the negative side, the component would be flipped to ensure that all larger tails were positive. All 50ϫn components were then considered as data points in p dimensional space and subjected to K-medoids clustering with Spearman correlation as the dissimilarity measure. For each cluster, the number of different runs that its members were extracted from was documented as a measure for cluster consistency, alongside with the average silhouette width. In general, clusters that with members appeared in more than 50% of all runs (25 out of 50) were considered as likely to contain true biological signals (see Signature Annotation section below).
All computations were carried out on the R platform. Package "fastICA" which implements the iterative FastICA algorithm (15) was used to extract non-Gaussian independent components with logcosh contrast function. Components were subsequently assigned to clusters using the "cluster" package. Clusters were visualized with 2d t-SNE using the R package "tsne." The number of clusters was determined as equal to number of components extracted at each run of ICA. When the number of samples is small comparing to the number of features, which is usually the case for biological data, it is convenient to retrieve as many as independent signal sources as possible, and the number of components extracted is equal to sample size.
The package "pcaMethods" was used to calculate principal components for comparison with independent components. In permutation tests, for each gene, protein abundance measurements of all samples were randomly shuffled without replacement. One hundred permuted data sets were created to assess the specificity of the ICA method.
Signature Annotation-Component clusters were annotated with GO terms by running Gene Set Enrichment Analysis against centroid coefficients as the pre-ranked gene lists (16,17). Enrichment map of components were visualized with Cytoscape 3 (18). Each cluster was also associated with clinical features as following: First, 22 clinical features were recoded into ordinal variables (supplemental Table S1). Second, ordinary linear regression models were built with corresponding mixing scores for members in a component cluster to predict each of the ordinal responses. Counts of significant associations between components and clinical features (p value for the slope coefficient Ͻ 5.9 ϫ 10 Ϫ7 , Bonferroni correction at ␣ ϭ 0.01 level) were tabulated. Hierarchical clustering with complete linkage was applied to the clinical associations of independent components clusters extracted from both transcriptome and proteome data.

Stable Molecular Signatures Extracted from Proteome and
Transcriptome Data-For both the proteome and transcriptome data sets we identified 77 clusters of meta-proteins or meta-genes from independent components obtained from multiple randomly initiated runs. In this study, we chose to run the procedure for 50 times with considerations of both computation time and numeric stability, as further increase the number of runs give rise to similar results (supplemental Fig.  S3). Centers of the stable meta-gene and meta-protein clusters, which could be found by averaging gene coefficients within each cluster, could represent pathway-level signatures. The stability of these signatures could be inspected by visualizing all meta-genes and meta-proteins with t-distributed stochastic neighbor embedding (t-SNE), a widely used dimensionality reduction technique (19)  Colors represent cluster assignments using the partition around medoids algorithm. B, D, average silhouette widths for the 77 clusters were plotted against the number of different individual runs from which the members of each cluster were extracted. Marginal histograms revealed the distribution of the metrics. Large values of individual runs indicate that the cluster is recurrent and larger values of silhouette width indicate that the cluster is more consistent. Both metric could be used to evaluate whether the cluster of components as a signature and more likely to capture true structures in the data.

Independent Component Analysis of Breast Cancer Proteogenomic Data
Molecular & Cellular Proteomics 18.14 S171 clusters are compact, whereas the rest formed a visually indistinguishable mixture, consistent with the observation that the average silhouette widths of all clusters followed a bimodal distribution (Fig. 2B, 2D). Another metric that could help evaluate the stability of extracted signatures is the number of different runs that the members of each cluster originated from. It has been proposed that reliable independent components should be highly recurrent, therefore, a cluster the members of which were extracted from different runs would be considered as more stable. Indeed, this metric largely agrees with silhouette width, meaning that compact clusters were most often comprised of meta-proteins or meta-genes extracted from all 50 runs.
Mixing Scores Reveal Clinical Relevance of Identified Components-In addition to their numerical stability, the clusters of components representing groups of meta-proteins or meta-genes identified from different ICA runs can also be evaluated with their relevance to known characteristics of breast cancer. For each ICA decomposition, the rows of the mixing score matrices represented the "activity level" of the corresponding independent components in all of the samples. Therefore, it is possible to establish associations between the activity patterns of meta-genes and meta-proteins and clinical features available for TCGA samples, which would in turn reveal the functional relevance of the signatures. We recoded 22 clinical features into ordinal factors and used linear regression to assess their correlation with activity scores of metagenes or meta-proteins in each signature cluster. To select the most significant associations, Bonferroni correction was applied to control for multiple comparisons at ␣ ϭ 0.01 level, such that a nominal p value of 5.9 ϫ 10 Ϫ7 was set as the significance threshold (corrected ␣ ϭ 0.01/(77 ϫ 22)). Within each cluster, the percentage of meta-proteins or meta-genes with activity levels that showed significant linear relation with the 22 clinical features was documented. Large portion of significant association between a signature cluster and a clinical feature indicates that the signature may contain pathwaylevel information about molecular mechanisms underlying the clinical feature. Thirty-three out of the 77 proteome IC clusters showed significant association with 10 clinical features, whereas 60 transcriptome clusters were correlated with 12 clinical features (Fig. 3). Six proteomic signatures and 7 tran- scriptomic signatures showed strong clinical associations, such that the corresponding mixing scores were significantly correlated with a clinical variable in more than 50% of all runs (counts Ͼ25). In contrast, when the same procedure applied to 100 sets of randomly permuted proteome data, no component clusters showed strong clinical associations, as the count of significant linear correlations between mixing scores of a given signature cluster and any of the tested clinical variables was no larger than 2 in all 7700 clusters of the 100 permutation settings.
To further assess the adequacy and efficacy of the ICA method on the CPTAC breast cancer proteomics data, we compared the results obtained with ICA to that extracted by principal component analysis (PCA), a "standard" and more widely used method of dimension reduction (Fig. 4). With PCA, the BRCA data could be represented by 77 principal components, each of which is a 77-element vector representing scores for the 77 samples. In the context of ICA, because of the stochastic nature of the method, a similar representation was constructed by averaging the activity scores corresponding to the independent component in each of the 77 clusters obtained from the 50 repetitive runs. As shown in Fig. 4, PCA gave rise to signatures that contained mixed information, as the significant associations with clinical features concentrated on Principal Component 2. A total of 6 PCA-extracted signatures showed significant correlation (p Ͻ 0.001, linear regression) with one or more clinical variables, whereas 17 ICA-extracted signatures were significant associated with at least one clinical variables. This comparison, consistent with the fact that ICA was designed for "de-mixing" signals of different sources, further demonstrated that ICA could extract more specific FIG. 4. Comparison between PCA and ICA applications on the BRCA proteome data. Significance levels of association between clinical features and dimension-reduced representation of data obtained by PCA and ICA revealed the difference between the two methods. Association between each score and the clinical features are characterized by Ϫlog 10 p value of linear regression slope coefficient.

Independent Component Analysis of Breast Cancer Proteogenomic Data
Molecular & Cellular Proteomics 18.14 S173

Independent Component Analysis of Breast Cancer Proteogenomic Data
S174 information about biological processes underlying high dimensional data. The biological relevance of the clusters identified with ICA could be further validated by inspecting the average gene coefficients (the centroids of the clusters). For example, the BRCA proteome signature cluster that contains 29 members that were significantly correlated with the HER2 receptor immuno status and 22 members associated with the Her2 subtype index (pr_02), was heavily weighted on ERBB4 and ERBB2, two tyrosine receptor kinases that mediate the Her2 signaling, and the two proteins were assigned the two largest coefficients (11.13 and 10.68, Table I). Another signature (pr_29) correlated with estrogen receptor (ER) and progesterone receptor (PR) status also exhibited high average scores for PGR (7.73) and ESR1 (4.42) proteins. Interestingly, the signature pr_04, which displayed strong correlation with the Luminal A subtype, is enriched with a group of mitosis-related proteins, which showed low levels of abundance in the Luminal A subtype (Fig. 5). This finding is consistent with the low proliferation level of Luminal A subtype (20). Protein abundance of genes with large weights in the same signatures also showed strong intra-module correlation (supplemental Fig.  S2). The aforementioned associations between proteome signatures and known biological processes demonstrated that ICA was able to rediscover clinically relevant information from proteome data in an unsupervised manner.
Large value coefficients of known marker genes revealed the biological relevance of a subset of the molecular signatures, but more biological information could be exploited on the pathway level. The coefficient vector for each signature is a preranked gene list that can be subjected to gene set enrichment analysis (GSEA) on curated gene sets and GO terms (16), and the collective effect of genes with coefficients of smaller absolute values could contribute to the enrichment metric. GSEA results showed that the clinically related proteome IC clusters retrieved a lot of the breast cancer related gene sets determined by experimental manipulations, as well as gene sets that characterized basic biological processes (Table II). For example, the proteome signatures pr_02 and pr_29, which were strongly associated with Her2 and ER status, also exhibited enrichment of the corresponding gene sets (Table II). The transcriptome signature tx_07, which were correlated with Basal subtype, ER and PR status, showed negative enrichment of targets of LSD1, a histone demethylase of H3K4 sites, which had known links to be poor breast cancer prognostics and ER negative status (21,22). Interestingly, the same signature cluster also exhibited positive enrichment of genes with H3K27Me3 sites, indicating that epigenetic regulation is highly orchestrated in breast cancer.
Although the most clinically relevant clusters (total count of significant association Ͼ25) contained proteome and transcriptome signatures that are highly recurrent, these clusters were not very homogeneous (supplemental Table S2). On the other hand, out of the 20 most consistent clusters with average silhouette widths larger than 0.9 and are recurrent in every run for both proteome and transcriptome, only 3 signatures (pr_04, tx_46, tx_63) showed direct clinical relevance. The meta-proteins and meta-genes in the clusters of pr_04 and tx_46 were both significantly associated with Luminal A subtype index and showed enrichment of genes involved in estrogen response (Table II). The signature tx_63 was associated with the Basal subtype and ER status, and its coefficients exhibited significant negative enrichment of genes in 16q24 loci (p Ͻ 0.001, FDR Ͻ 0.001), a breast cancer

Independent Component Analysis of Breast Cancer Proteogenomic Data
Molecular & Cellular Proteomics 18.14 S175 risk factor, as well as positive enrichment of genes involved in endocrine therapy response (p Ͻ 0.001, FDR Ͻ 0.001). The large extent of disagreement between clinical relevance and cluster consistency suggest that ICA may serve as a promising approach to knowledge discovery, as stable clusters that did not show clinical relevance may provide new insights into the molecular mechanisms of breast cancer.

Integrative Pathway Analysis of Proteomic and Transcriptomic Components Guided by Clinical Features-
The correlation between mixing scores and clinical features provided a valuable opportunity to find the link between independently extracted components from different data types. For example, in both BRCA proteome and transcriptome analyses, there was a highly consistent signature (pr_04 and tx_46), such that meta-proteins and meta-genes in the two clusters were recurrent in all 50 randomly initiated runs and the corresponding mixing scores all showed significant correlation with index of Luminal A subtype (Figs. 2 and 3). To integrate the proteomic and transcriptomic information with guidance from clinical relevance, we applied the hierarchical clustering algorithm to the vectors of clinical association counts for the most clinically relevant meta-protein and meta-gene clusters. Proteome and transcriptome signatures could therefore be grouped based on their similarity in functional indications, as the metric of direct correlation between gene coefficients of different signatures was limited by noise introduced by high dimensionality (supplemental Fig. S4). Combined network visualization of GO terms enriched in proteome and transcriptome signatures allowed us to examine complementary functional modules on different biological levels. For example, the pr_04 and tx_46 clusters may characterize Luminal A subtype specific processes from protein and RNA perspectives (Fig.  6B). Although majority of the gene set networks consisted of nodes and edges common between proteome and transcriptome, specific networks could be identified for both data modes, with tx_46 showed specific enrichment in a DNA

Independent Component Analysis of Breast Cancer Proteogenomic Data
S178 replicationrelated network, and pr_04 showed specific enrichment in polysaccharide metabolism (Fig. 6C).
To further validate the ICA workflow, we apply the method to proteomic data of the ovarian cancer CPTAC cohort (4). Signatures found in the data showed association with previously reported transcriptome-based subtypes (supplemental Fig. S5C) (23). Moreover, there were strong correlations between ICA-extracted proteome signatures and protein modules derived with weighted correlation network analysis (supplemental Fig. S5D) (4,24). To further highlight the potential of ICA workflow as an integrative method, we explored pancancer proteomic features by examining the correlation between ICA-extracted signatures obtained from breast cancer and ovarian cancer data sets. Between the 77 breast cancer proteome signatures (Fig. 2-3) and the 84 ovarian cancer proteome signatures (supplemental Fig. S5), four pairs of highly correlated signatures were identified (Spearman correlation larger than 0.2) (supplemental Fig. S6). Notably, the shared signature pairs include BRCA_IC_004/OV_IC_051, which was previously annotated to correlate with Luminal A subtype and show enrichments of mitosis-related genes (Fig. 5).

Independent Component Analysis of Breast Cancer Proteogenomic Data
Molecular & Cellular Proteomics 18.14 S179 The share signature pair BRCA_IC_014/OV_IC_004, which was associated with the Mesenchymal subtype of ovarian cancer (supplemental Fig. S6), showed consistent enrichments in extracellular-matrix related genes. Taken together, our results demonstrated the ICA workflow could consistently extract biological relevant information from multi-omics data sets. DISCUSSION We have used independent component analysis to gain insights into the mechanisms of breast cancer and develop protein/gene modules as clinical signatures. Meta-proteins and meta-genes are extracted blindly from the data, and signatures are further selected based on consistency of metagene/protein clusters or the association between their activity scores and known clinical features. Gene set annotation revealed that several selected signatures contained biologically relevant information. A proteome signature (pr_02) characterizing strong activation of the Her2 pathway was recovered as a Her2-related meta-protein cluster. On the transcriptome level, a meta-gene enriched for genes in the 16q24 risk loci (tx_63) formed a stable signature that showed correlation with both ER status and Basal subtype. Stable signatures on both proteome and transcriptome levels (pr_04 and tx_46) were found to be heavily associated with the Luminal A index, suggesting that cell division and growth is specifically regulated in this subtype.
As an unsupervised blind source separation method, independent component analysis has been applied to multiple biological data types. Consistent with previous reports, our results demonstrated that as an unsupervised learning method, ICA can extract biological meaningful information solely based on the intrinsic structures of the transcriptome and proteome data. Because of the underlying assumptions, this method would only yield satisfactory results when the "true" signals in the data do not follow Gaussian distribution, and it would fail to separate any mixture of normally distributed sources. Successful application in the current use case suggested that ICA have captured some aspects of the mechanistic processes underlying proteome and transcriptome profiles. It is reasonable to assume that the observed RNA and protein levels are the sum of several up-regulation and downregulation modules, in which the distribution of individual gene levels deviate radically from the normal distribution that characterizes a noisy background.
In the current use case, we have designated the number of components to be equal to the number of samples, as is usually assumed in blind source separation applications. It remains an open question whether all extracted signatures are reflective of some "true" biological processes. Moreover, the stochastic nature of the algorithm entails that, although all components were found simultaneously in each run without any component "privileged" over others (25), there is no guarantee that all solutions found at local optima are practically useful features. Indeed, the fact that a lot of the extracted components were neither consistent nor recurrent in multiple runs suggested that they may arise from noise instead of stable signals. However, previous applications of ICA on biological data have showed that overestimating the number of independent sources gave rise to signatures driven by smaller gene groups without affecting the rediscovery of the most stable components, while underestimating the number of sources compromised signal detection (26). It is possible that assuming the number of independent sources to be equal to the number of samples may result in noisy signals that are "unnecessary." However, in an exploratory analysis setting where the number of samples is relatively small, this problem could be remedied by inspecting the stability and biological relevance of extracted signatures post hoc and extracting more components may lead to better separation of signals (supplemental Fig. S3). In the case of large sample size, the number of components will have to be determined by heuristics, and different choices may be compared in downstream analyses.
Using clinical association (total significant associations Ͼ25 across all tested clinical variables) or signature stability (cluster silhouette Ͼ0.9) as selection criterion gave rise to two lists of potential signatures (Table II) with only a few overlapping members (pr_04, tx_46, tx_63), suggesting that the most clinically relevant signatures are not very stable under the current method. Other clustering method such as density-based clustering may be used to improve the estimation of stable signatures (14,27). On the other hand, it is possible that the most stable signatures described the housekeeping processes common in all samples, but they may also help reveal novel molecular mechanisms of breast cancer that are not previously linked to any phenotype. As a feature construction procedure, ICA could also facilitate knowledge integration from multiple data types. In addition to the integration of proteome and transcriptome signatures as demonstrated in this work, future studies could further apply ICA on omics data sets of multiple cancer types. Extracted molecular signatures may be grouped in another round of clustering procedure to reveal pan-cancer and cancer type specific mechanisms.