Correlation Analysis of Histopathology and Proteogenomics Data for Breast Cancer

Tumors are heterogeneous tissues with different types of cells such as cancer cells, fibroblasts, and lymphocytes. Although the morphological features of tumors are critical for cancer diagnosis and prognosis, the underlying molecular events and genes for tumor morphology are far from being clear. With the advancement in computational pathology and accumulation of large amount of cancer samples with matched molecular and histopathology data, researchers can carry out integrative analysis to investigate this issue. In this study, we systematically examine the relationships between morphological features and various molecular data in breast cancers. Specifically, we identified 73 breast cancer patients from the TCGA and CPTAC projects matched whole slide images, RNA-seq, and proteomic data. By calculating 100 different morphological features and correlating them with the transcriptomic and proteomic data, we inferred four major biological processes associated with various interpretable morphological features. These processes include metabolism, cell cycle, immune response, and extracellular matrix development, which are all hallmarks of cancers and the associated morphological features are related to area, density, and shapes of epithelial cells, fibroblasts, and lymphocytes. In addition, protein specific biological processes were inferred solely from proteomic data, suggesting the importance of proteomic data in obtaining a holistic understanding of the molecular basis for tumor tissue morphology. Furthermore, survival analysis yielded specific morphological features related to patient prognosis, which have a strong association with important molecular events based on our analysis. Overall, our study demonstrated the power for integrating multiple types of biological data for cancer samples in generating new hypothesis as well as identifying potential biomarkers predicting patient outcome. Future work includes causal


In Brief
Histopathology images are important for cancer diagnosis and prognosis. We extracted quantitative morphological features from breast cancers images and systematically analyzed their relationships with proteins and mRNAs. We observed concordant correlation patterns between image-protein and image-RNA and identified four cancer-related biological processes associated with morphological features related to different tumor components. Further, we observed that proteomic data reveal unique protein-related biological processes associated with morphology. Finally, prognostic morphological features were identified, and their roles are consistent with the underlying biological processes.

Graphical Abstract
The aggregation of large amount of trans-omics data including high-throughput genetic, transcriptomic, proteomic and clinical information has revolutionized disease research in the past decade but also led to a series of new analytical challenges, calling for new approaches and solutions that aim at improving diagnosis, prognosis, and treatment of complex diseases (1)(2)(3)(4)(5). Cancer is a disease with complex underlying molecular mechanisms and factors, and researchers have contributed an overwhelmingly large body of data to characterize, diagnose, and ultimately treat patients with greater precision (6 -8). This data revolution enabled researchers to identify genetic mutations and gene expression signatures associated with the development of cancers (6,9,10). However, despite these considerable progresses in understanding cancer at multiple levels of biological events, a substantial challenge is to link different types of data with cancer cell and tissue morphology, with the latter being essential for diagnosis and prognosis in clinical practice.
Solid tumors are heterogeneous tissues that contain a mixture of malignant and non-malignant cells, such as stromal cells and lymphocytes (11,12). Distinct molecular differences exist for cells derived from different tissues, reflected in different patterns in multi-omics data including gene and protein expression profiles (9,13). These molecular differences, in turn, induce changes in biological functions and morphology of tumor tissue and cells, which are often associated with different prognosis of cancers (14). To date, many studies have addressed the close relationship between molecular events and morphological features of tumor tissues. For instance, Baba et al. (15) systematically discussed the association between mitoses and metabolism with nuclear changes and Wang et al. (16) identified genes whose expression levels are associated with multiple morphological features of tumor cells in triple negative breast cancer. Although remarkable achievements have been made, there are still many important questions to be answered. For example, what is the underly-ing molecular basis for the cellular and tissue heterogeneity in the tumor (17)? How are the transcriptional and proteomic aberrations reflected on cellular morphology? Therefore, studying the correlations between nuclear morphology and molecular data, especially functional data including both transcriptomic and proteomic data will shed light on the molecular basis of various morphological features of cells and tissues, addressing important questions in cancer development.
Pathological diagnosis is critical for clinical oncology where morphological features are extensively used for diagnostics and prognosis (18). Histopathology images derived from hematoxylin and eosin (H&E)-stained cancer tissue slide contain information regarding morphology (e.g. nuclear area, nuclear shape) and spatial context (e.g. cell density) of diverse types of cells coexisting in the tumor microenvironment (12,14,19). Although our previous work (16,20,21) along with many work by others (17,19,22,23) have successfully demonstrated a connection between cellular and tissue morphology and clinical outcome for cancers, the underlying molecular basis especially key biological processes associated with these morphological features have not been well understood. Therefore, investigating the biological processes underlying the prognostic morphological features is an important issue in cancer biology and outcome prediction.
To address these issues, matched histopathology images and multi-omics datasets for cancers are required. Fortunately, large consortium endeavors, such as The Cancer Genome Atlas (TCGA) 1 have accumulated many large datasets to enable such analyses. TCGA aggregates an extensive collection of omics and clinical datasets from large cohorts of patients for more than 30 types of cancers (24). It also archives histopathology images for solid tumor samples from which omics data were sampled. Currently, more than 24,000 histopathology images are available and can be visualized at the Cancer Digital Slide Archive (CDSA, http://cancer. digitalslidearchive.net/). In addition, The NCI Clinical Proteomic Tumor Analysis Consortium (CPTAC) (https://proteomics. cancer.gov/programs/cptac) program also provides highthroughput proteomic data for some of the TCGA tumor specimens such as breast cancer, ovarian cancer, and colorectal cancer based on mass-spectrometry technology. These large-scale experimental datasets make comprehensive integrative and correlative analyses feasible.
In this study we aim to systematic explore the relationship among molecular, morphological, and clinical data for differential cell types in breast cancer. Previously, we developed a quantitative image analysis pipeline that automatically extracts quantitative cellular morphological features such as nuclear size, nuclear shape, and cell density from H&Estained whole-slide images (25). Based on this pipeline we performed a series of analysis correlating and integrating molecular data, morphological features, and clinical outcome using data from TCGA and CPTAC Breast invasive carcinoma (BRCA) project. Breast cancer is one of the most common malignant cancers (25) and matched histopathology images and omics data including the genome-wide proteomic, transcriptomic, and Reverse Phase Protein Array (RPPA) proteomic data were acquired from CPTAC and TCGA for a subset of 73 patients in BRCA (25,26). First, we performed a correlative analysis between multi-omics data (including proteomic, transcriptomics data) and morphological features extracted from histopathology images. We observed that proteomic and transcriptomic data shared consistent correlation pattern with various morphological features at genome scale. However, comparing to transcriptomics data, proteomic data can identify specific protein-related biological processes associated with morphological features that otherwise cannot be inferred from transcriptomic data. More comprehensive analysis revealed that four major categories of biology processes related to the hallmarks of cancer (6) are associated with different morphological feature based on the correlated proteomic data. Furthermore, we examined the relationship between nuclear morphology and patient outcome (i.e. survival time). Both prognostically favorable and unfavorable morphological features have been identified. The biological processes associated with these prognostic morphological features were also identified based on proteomic data. The biological processes such as immune responses, cell cycle, and extracellular matrix development have been previously associated with cancer patient outcome. In summary, our work linked molecular data, morphology, and clinical outcome, which led to new insights and hypotheses into the relationships between cancer tissue development and molecular events, thus contributing to a more comprehensive understanding of breast cancer. The entire process and workflow can be applied to other cancers.

EXPERIMENTAL PROCEDURES
Experimental Design and Statistical Rationale-The main objective of this study is to explore the relationship between molecular data and tumor morphology. Firstly, we performed correlation analysis between molecular data (i.e. mRNA-seq data from TCGA and proteomic data from CPTAC respectively) and quantitative morphological features extracted from histopathology images of breast cancers. We examined the distribution patterns of correlation coefficients between image-mRNA and image-protein pairs at genome scale. Secondly, we validated the above distribution patterns of correlation coefficients using proteomic data generated from the RPPA technology and morphological features. Thirdly, we compared the biological processes associated with morphological and spatial features based on the strongly correlated mRNAs and proteins. Finally, we summarized the major biological implications underlying the various morphological features. In addition to the correlation analysis, we explored the relationships among morphology, patient outcome, and the underlying biological processes derived from protein data.
The overall workflow for the analysis is summarized in Fig. 1 with the correlation and other bioinformatics analyses outlined in Fig. 1B. For nuclear morphology-proteomic-transcriptomic correlation analysis, we identified a set of 73 patients whose tumor samples are shared by the TCGA BRAC project and CPTAC breast cancer project with matched H&E-stained whole-slide images, MS-derived proteomic profiles, and RNA-seq transcriptomic data. Specifically, the CPTAC consortium contained 105 breast cancer proteome profiles, of which 77 samples contained non-degraded data (26) that include the 73 selected samples. Proteomic data were accessed and downloaded using the R package "TCGA-Assembler 2" (27, 28) from the CPTAC. Histopathology images were downloaded directly through the GDC Minor_Axis: length of minor axis of a nuclear; Ratio: the ratio between lengths of major and minor axes; Mean_Distance: mean distance to neighboring cells; Max_Distance: maximum distance to neighboring cells; Min_Distance: minimum distance to neighboring cells.) TCGA Data Portal, whereas transcriptomic data were downloaded from the UCSC Xena data portal (https://xena.ucsc.edu/public-hubs/) (29). To validate correlation analysis between proteomic and transcriptomic profiles for genes, matched RPPA proteomic data were obtained for each of the 73 samples described above from the Broad GDAC Firehose (https://gdac.broadinstitute.org). The RPPA dataset contains protein expression profiles for 183 genes instead of the entire genome.
To understand the relationship between image analysis-derived cell morphological features and patient survival outcomes, 1,057 BRCA-type breast patients with matched 1057 H&E-stained tissue images and corresponding clinical survival information were used. The patient clinical data were obtained from UCSC Xena. Demographic and clinical characteristics of the patients were described in Table I.
Analysis of Nuclear Morphology from Archived Histopathology Images-As outlined in Fig. 1A, using our previously developed image analysis algorithms and pipeline (30,31), automated image analysis was carried out and ten types of cell-level features from tissue images were extracted following the three main steps: 1) nuclei segmentation, 2) cell-level feature measurement, and 3) aggregation of cell-level measurements into patient-level statistics. In Step 1, the nuclei of all cells in the image are automatically segmented based on our previous workflow (31). In Step 2, ten types of cell-level features were extracted, including seven types of morphological and spatial traits and three types of pixel traits in the RGB color space. The seven types of morphological and spatial features of cell nuclei were: major axis length (Major_Axis), minor axis length (Minor_Axis), the ratio of major to minor axis length (Ratio), nuclear area (Area), mean distance to neighboring cells (Mean_Distance), maximum distance to neighboring cells (Max_Distance), and minimum distance to neighboring cells (Min_Distance). The seven types of morphological and spatial features of cell nuclei can be summarized as nucleic area (Area), nucleic shape (Major_Axis, Minor_Axis, and Ratio), and cell density (Mean_ Distance, Max_Distance and Min_Distance). In Step 3, 5-bin histogram and five distribution statistics (i.e. mean, standard deviation or S.D., skewness, kurtosis, and entropy) were calculated for each of the ten types of morphological features to aggregate the measurements over the whole slide image. Thus for each type of feature, ten measurements (i.e. five histogram bins and five distribution statistics) were generated and 100 image features were generated in total for the ten types of morphological features. The centers of the five bins were determined by clustering each type of cell-level features from all patients instead of a single patient, which ensured that the histogram features are comparable and consistent across the entire patient cohort. The value of each feature based on the five bins of the histogram represented the relative percentage of corresponding image feature over the entire slide for a patient.
To identify distinctive features among the morphological and spatial features, we focused our analysis on the smallest and largest ends of the morphological features for seven types cell-type image features. We designated these features with intuitive names such as Small_Nucleus_Area and Large_Nucleus_Area. The Small_Nucleus_ Area is the first bin of the five-bin histogram, thus representing the proportion of very small nuclei. Other feature names are similarly defined. The visual explanation of these morphological features and putative biological interpretations can be found in Table II.
Analysis of MS-based Proteomic Data-We obtained log 2 -transformed iTRAQ values of protein abundance from CPTAC. The iTRAQ values were calculated as the log 2 -transformed ratio of spectral counts of target proteins versus a reference. For this analysis, we used iTRAQ data that included peptides that were mapped to multiple proteins. First, to obtain high quality proteomic data, proteins with missing values in more than 20% of the samples were excluded from analyses. Second, expression levels of proteins with missing values in less than 20% of samples were imputed using the Multivariate Imputation tool "mice" package in R (32).
Analysis of RNA-seq Data-Log 2 -transformed and normalized RSEM gene transcript values of RNA-seq data were obtained and genes with a value of zero in more than 20% of the samples were excluded from analyses.
Image-mRNA and Image-Protein Correlation Analysis-8125 common genes that had both the mRNA and protein expression values were identified. For this subset of genes, the correlations between image features and expression values (protein and mRNA) of a gene were calculated using Spearman's rank correlation ().
Here we chose a relatively loose cutoff of Spearman's rank correlation coefficient , which corresponds to the p value Ͻ0.01. Specifically, we designated Ͼ 0.3 as correlated whereas Յ 0.3 was considered uncorrelated.
Validation of the Image-protein Correlation-Besides the CPTAC data, we used the matched proteome data based on RPPA technology of the selected 73 samples for validation. The RPPA proteomic data after normalization were downloaded and the similar processing method with MS-based proteomic data was applied. The Spearman correlation coefficients between image features and protein levels measured using RPPA technology was calculated. The correlations for image-RPPA and image-CPTAC data were then compared.
Biological Process Enrichment Analysis-In order to identify enriched biological processes associated with images features, Gene ontology (GO) analyses were performed using ToppGene (https//topgene.cchmc.org) (33) based on genes whose mRNA or protein product was correlated with morphological features. Here only genes correlated with selected morphological features in Table  II were used to perform enrichment analysis. The Fisher's exact test  Survival Analysis-To assess the association between image features and patient survival information, for each selected image feature, the patient cohort was divided into two groups (high image feature value and low value groups) by applying a cutoff on the image feature values. Then Kaplan-Meier estimator was used for patient stratification and log-rank test was adopted to compare the survival difference between two groups. To choose the best image feature value cut-off with most significant survival difference, we adopted the same procedure as the Human Pathology Atlas, which is part of the Human Proteome Atlas (34). Specifically, all values of the selected feature were ranked and values from the 20th to 80th percentiles were used to identify the cutoff for grouping patients, significant differences in the survival outcomes of the groups were examined and the value yielding the lowest log-rank p value was selected as the cutoff. Features were designated as prognostic image features for those with log-rank p value less than 0.001 for the selected cutoff.
Further, we determined if a prognostic image feature is a "favorable" or "unfavorable" feature by applying the univariate Cox proportional hazards regression analysis with the hazard ratio (HR) larger than 1.2. An unfavorable prognostic image feature was defined as whose higher value is associated with the poor survival whereas favorable prognostic image feature has lower value associated with poor survival.
All analysis above were performed using the "survival" R package (35).
Morphological Features Associated with Clinical Subtype Classification-The associations between each morphological features and clinical subtype classification (i.e. Basal, Luminal A, Luminal B, Her2) were examined using Kruskal-Wallis Test.
Statistical Analysis Software-Except where noted above, all statistical analyses were performed in R version 3.5.1. The analysis scripts that we used for this manuscript are available at GitHub: https://github.com/xiaohuizhan/ cor_image_omics_BRCA.

Correlations Analysis Between Multi-omics Data and
Morphology-To investigate the relationships between molecular data and histopathology features, we performed correlation analysis between imaging features and mRNA or protein (MSbased global proteomic data) profiles by calculating Spearman's rank correlation coefficients (). A total of 8,125 genes with both mRNA and protein expression and 100 image features for 10 types of cell-level image features extracted from histopathology images were analyzed. As the processes from transcriptome to proteome to morphology were quite complex, in order to be comprehensive in identifying potential molecular basis for different morphological features, correlated image-mRNA or image-protein pairs were designated using a cutoff of Ͼ 0.3 to avoid excessive stringency.
The results are summarized in Table III. Among the 100 ϫ 8,125 ϭ 812,500 mage-gene pairs, 5.82% of all image-mRNA pairs and 3.95% of all image-protein pairs were observed to be correlated. In addition, 92.96% showed consistent correlative relationships (either correlated or uncorrelated) in both image-mRNA and image-protein relationships including 0.79% positive correlations, 0.57% negative correlations, and 91.60% no correlation. In contrast, 4.45% image-gene pairs showed correlations only in the image-mRNA relationship, whereas 2.58% were only correlated in image-protein pairs. Opposite image-mRNA and image-protein correlative relationships (i.e. positive correlation in one pair but negative in the other) were observed for only 0.003% of all image-gene pairs. Such globally consistent patterns can also be observed between every image feature and mRNAs and proteins, regardless of positive or negative correlations as demonstrated in supplemental Fig. S1. These results suggested that, at the genome scale, image-mRNA and imageprotein shared consistent correlation patterns.
Next, we compared the distribution of correlation coefficients for image-mRNA and image-protein pairs for individual morphological features (i.e. Area, Major_Axis, Minor_Axis, Ratio, Mean_ Distance, Max_Distance and Min_Distance). Specifically, we focused on the selected morphological features in Table II. The correlation for image feature with most of the proteins were consistent with matched mRNA. Supplemental Fig. S2 showed the distribution of correlations for image-mRNA and image-protein for these features. At the individual feature level, the distribution of correlations for individual image features also revealed consistent correlation patterns between image-mRNA and image-protein pairs.
Comparison of Image-protein Correlation Between CPTAC and TCGA Data-In order to test whether the correlation pattern between image-CPTAC measurement overfits the data, we used the matched proteome data based on RPPA technology of these 73 samples for validation. Here we compared the correlation patterns between image features and protein measurements from MS-based technology and RPPA technology. As shown in Fig. 2 for an example for the Large_ Nucleus_Area feature. We observed overall consistent results supporting the robustness of correlation between protein profiles and morphological features (correlation coefficients range from 0.472 to 0.597). Results for the rest selected morphological features were displayed in supplemental Fig. S3.
Image-protein Correlation Analysis Reveals Specific Biological Processes Associated with Morphological Features-To test whether proteomics data can reveal biological processes associated with morphological features that cannot be inferred from transcriptomic (mRNA) data alone, we compared enriched gene ontology (GO) terms obtained from mRNAs and proteins that showed strong positive correlation with individual morphological features. After examining these significantly enriched GO Biological Process categories associated with morphological features, we found most of these Biological Process categories identified based on mRNA data can be confirmed based on proteomic data except for Large_Nucleus_Area (implying large nuclei) related biological process. However, some unique Biological Process categories associated with morphological features were found solely based on proteomics data. For instance, for Small_ Nucleus_Area (implying small nuclei), protein related Biological processes such as posttranscriptional regulation of gene expression and translation were identified from only proteomics data (supplemental Table S1). Table IV showed the most significantly enriched biological processes terms for individual morphological features based on the positive correlated proteins and mRNAs. In addition, we noticed that for the feature Large_Nucleus_Area (implying large nuclei), mitochondria protein synthesis process involving largely the mitochondrial Ribosomal proteins (MRPs) proteins was significantly enriched based on proteomics data, whereas in contrast it was RNA synthesis process inferred from mRNA data (Fig. 3). Based on these observations, proteomic data can reveal biological processes associated with certain  morphological features, which cannot be otherwise identified from transcriptional data alone. Specific Biological Processes Associated with Image Features-As the genes and proteins correlated with the morphological features may shed light on the molecular basis for the cellular and tissue morphology in cancer, gene ontology (GO) enrichment analysis was performed for proteins correlated with each individual morphological feature (i.e. Nuclear Area, Major_Axis, Minor_Axis, Ratio, Mean_Distance, Max_Distance, and Min_Distance) based on if the proteins were positively or negatively correlated. In order to identify biological processes associated with specific types of morphological features, we focused on the selected features listed in Table II. Overall, the analysis revealed GO terms related to four major categories of biology processes including metabolism, immune process, cell cycle, and extracellular matrix (ECM) were significantly enriched (FDR Ͻ 0.05) for morphological features (as shown Fig. 4, supplemental Fig. S4 and Table V). Mitochondrial Ribosomal proteins (MRPs) and mRNA processing related biology processes stood out among metabolic related GO processes. Although for ECM, cell adhesion, cell migration, and vascular system development related GO terms were shared biology processes ( Fig. 4 and supplemental Fig. S4). For positive correlations, both Small_Nucleus_ Area (implying small nuclei) and Small_Major_Axis (implying small nuclei) were significantly correlated with cell cycle re-FIG. 3. Significantly enriched GO biological processes for large Nuclei Area based on proteomic and Transcriptomic data. x axis lists significant enriched biological processes associated with Large_Nucleus_Area. y axis is -log10(FDR). Orange stands for the mitochondria protein synthesis related biological processes that were identified based on proteomic data. Blue stands for RNA synthesis related biological processes that were identified based on transcriptomic data. Here only top10 enriched biological processes were listed for each category. lated processes and well-known cell cycle related proteins such as CDCA8, CDC20, NDC80, and BUB1B. In addition, Small_Nucleus_Area (implying small nuclei), Large_Nucleus_ Area (implying large nuclei), Small_Aspect_Ratio (implying round nuclei), Small_Major_Axis (implying small nuclei), Large_ Minor_Axis (implying large nuclei), Small_Mean_Distance (implying high cell density), Small_Max_Distance (implying high cell density), and Small_Min_Distance (implying high cell density) were all significantly correlated with metabolic processes. Among the proteins associated with these processes, mammalian mitochondrial ribosomal proteins (i.e. MRPL9, MRPL21, MRPL39) showed high correlation, which function in RNA synthesis and processing as well as protein synthesis and translation in cytosol and/or mitochondria and are necessary for the fast growth of tumor cells (36). Although the detailed relationships between cancer cell nuclear size and protein expression as well as metabolism have not been fully investigated, studies based on cancer cell lines suggested protein synthesis rates are positively correlated with cell size, which may be related to nuclear size as well (37).
Moreover, morphological features like Small_Aspect_Ratio (implying round nuclei) and Small_Mean_Distance (implying high cell density), were significantly correlated with immune

Molecular & Cellular Proteomics 18.14 S45
processes. This is consistent with our knowledge in pathology that many lymphocytes can be identified based on their typical small and round shape (12,14) as well as densely aggregated patterns. Immune response related proteins such as FCN1, LY75, and major histocompatibility complex (MHC) related proteins such as TAP1, TAP2, B2M were among the ones that are highly correlated with these morphological features.
Further, features such as Large_Aspect_Ratio (implying elongated shape), Large_Major_Axis (implying large or elongated nuclei), Small_Minor_Axis (implying small or elongated nuclei), and Large_Mean_Distance (implying low cell density) correlated with proteins that are significantly enriched with ECM development, which is consistent with the development of tumor stroma in the microenvironment. As stromal cells such as fibroblasts are typically spindle-shaped with elongated nuclei (12) and sparsely scattered in the stroma, they are characterized by long major axes and/or large ratio between major and minor axes and low density compared with epithelial cells. Collagen related proteins such as COL5A1, COL5A2 and COL5A3, which are structural constituent of EMC were identified. From the breast cancer biopsy samples with immunohistochemical staining for COL5A1 in the Human Protein Atlas (HPA) database, we indeed observed a high percentage of stromal cell existing in breast cancer for the ones with high COL5A1 staining (supplemental Fig. S5). Similar results were also observed for MRC2 and COL3A1, which were also highly correlated with morphological features linked to stroma cells (supplemental Fig. S5). Together, the associated biological processes and well-known protein markers support our understanding of the biological basis of different cell type morphological features.
Like the positive correlations between proteomic data and morphological features, such patterns of shared high-level biological processes were also observed in proteins that are negatively correlated with morphological features. Because the values of the image features are relative (i.e. percentages) based on distribution of the values, most of the enriched biology processes associated with the selected extremal features showed inverse enrichment (i.e. the proteins show positive correlations with the large feature values often show negative correlations with the corresponding small aspect). For negative correlations, metabolic process was shown to be significantly associated with features including Small_ Nucleus_Area (implying small nuclei), Large_Aspect_Ratio (implying elongated shape), Small_Major_Axis (implying small Note: ECM related biology processes includes: extracellular matrix (ECM), cell adhesion, cell migration, and vascular system GO functions; Metabolic related biology process mostly include: Mitochondrial Ribosomal proteins (MRPs) and mRNA processing related GO functions. nuclei), Large_Major_Axis (implying large or elongated nuclei), Small_Minor_Axis (implying small or elongated nuclei), Large_Mean_Distance (implying low cell density), Large_ Max_Distance (implying low cell density), Small_Min_Distance (implying high cell density), and Large_Min_Distance (implying low cell density). Immune processes were significantly enriched in proteins negatively correlated with features such as Large_Nucleus_Area (implying large nuclei), Large_Minor_Axis (implying large nuclei), Large_Mean_Distance (implying low cell density), Large_Max_Distance (implying low cell density), and Large_Min_Distance (implying low cell density). The ECM related features were significantly enriched in Small_Aspect_ Ratio (implying round nuclei), Large_Minor_Axis (implying large nuclei), Small_Mean_Distance (implying low cell density), Small_Max_Distance (implying low cell density), and Small_Min_Distance (implying low cell density) (Table V and supplemental Fig. S4).
In summary, four major types of biology process including metabolism, immune, cell cycle and ECM development were identified based on proteomic data because of strong associations with morphological features.
Survival Analysis Based on the Morphological Features-Because morphological parameters extracted from histopathology images are essential for breast cancer diagnosis and prognosis by pathologists, we also investigated how well these morphological features are associated with clinical outcome of the patients as described in the Methods section to assess the association between morphological feature and patient overall survival information for all the 1,057 patients in the TCGA BRCA project. Morphological features (p value Ͻ 0.001, HR Ͼ 1.2) associated with both favorable and unfavorable prognosis have been identified using the workflow described above. In Fig. 5, examples of favorable and unfavorable prognostic morphological features were shown, based on the optimal stratification p value calculated using a similar approach as in (34) (detailed information for other morphological features were provided in supplemental Fig. S6). Five prognostic morphological features that were strongly correlated with patients' overall survival were selected (Fig. 6). After examining these survival-associated variables, we found unfavorable prognostic morphological features including Large_Nucleus_Area (implying large nuclei), Large_Minor_Axis (implying large or elongated nuclei), and Large_Max_Distance (implying low cell density). These morphological features were linked to large cell nuclei or large distances to neighboring cells, which were highly associated with metabolic or ECM related biology processes (Table VI). As for favorable prognostic morphological feature, they tended to be small distances to neighboring cells (Small_Mean_Distance and Small_Min_ Distance), which were highly correlated with metabolic or immune related processes (Table VI).
Morphological Features Significantly Associated with Clinical Subtype Classification-As breast patients can be categorized into subtypes including Basal, Luminal A, Luminal B, Her2 type based on histology and molecular signatures, we performed Kruskal-Wallis Test for each selected image feature to test if they have significant associations with these clinical subtypes. All the image features exhibit significant differences between breast cancer subtypes except for Small_Minor_Axis (supplemental Table S2). These results are consistent with the fact that morphological features are critical to guide diagnosis and treatment. DISCUSSION Solid tumors such as breast cancer are highly heterogeneous, with multiple types of cells such as epithelial cells, immune cells and other stromal cells. Given the importance of tumor morphological features in diagnosis and prognosis, investigating the relationship between the molecular data and morphology can lead to potential new insight on the molecular basis underlying cancer development and prognosis. Taking advantage of the computational pathology workflow we established for processing whole slide images, we were able to extract quantitative morphological features from histopathology slides of breast cancer tissues, thus enabling investigating relationships between tumor tissue morphology and omics data. In addition, because mRNA and protein data contain related but different levels of molecular information, integrating both data with morphological features can lead to discovery of different biological events associated with cancer tissue morphology.
Based on the correlation analysis between morphological features derived from whole slide images of tissue samples and molecular data (mRNA or proteomic data), four major types of biology processes, namely metabolic, cell cycle, immune, and ECM development processes have been identified. These processes have all been strongly associated with cancer hallmarks (6). Morphological features enriched with metabolic and cell cycle processes were associated with cancer (epithelial) cells. Among these metabolic processes, we observed strong signals for mitochondria related biology processes, the protein translational process related to mitochondrial Ribosomal proteins (MRPs). Kim et al. (36) previously demonstrated the important function of MRPs in regulating apoptosis, cell cycle, and cell proliferation. As for cell cycle processes, Yuan et al. have highlighted its close relationship with cancer morphologic features (12). However, although it is often anticipated that active cell cycle progress may be associated with large nuclei because of chromosome duplication and mitosis, our results suggest that they may also lead to more smaller cells in breast cancer possibly because of active division even though the detailed mechanism calls for future in-depth investigation. In addition, ECM development and immune response processes for tumor microenvironment were associated with stromal cells and tumor infiltrating lymphocytes respectively (38 -41). We found that these stromal cells related features are most strongly associated with tumor microenvironment (TME) development (e.g. ECM, cell adhesion, cell migration). Previous studies have demonstrated that the interaction between stromal cells (such as cancer-associated fibroblasts, a typical stroma cell) and ECM has a crucial role in tumor initiation, progression, and metastasis (42)(43)(44), which is an important hallmark of cancers. Beck et al. previously demonstrated the importance of TME related morphological features in breast cancer prognosis (22) and our results linked related features to the potential underlying genes. In addition, cancer-associated fibroblast (CAF) is a typical stromal cell and can recruit and bind collagen fibers (key components of ECM) thus convert a loose stroma into a dense stromal network (43,44), this network acts as a barrier to drug flow, thereby increasing chemoresistance. Lastly, Yuan et al. also identified that immune related pathways were correlated with the lymphocyte morphologic features (12), which is consistent with our observation. Taken together, our approach can identify the specific biological process associated with individual morphological features. These results not only confirm our understanding of the molecular basis of morphology, but also offer new insights and hypotheses regarding the development of cancer tissues for future investigation.
When comparing the significantly enriched biological processes associated with morphological features based on mRNA and protein, we found that although most of the significantly enriched biological process categories were similar, some unique biological processes associated with morphological features were identified only based on proteomics data (e.g. posttranscriptional related biological processes). In addition, the mitochondria-related metabolism processes also stood out based on proteomic data. Latonen et al. recently showed that post-transcriptional events take important roles in the mitochondria during cancer progression (45). These results strongly suggest that proteomic data are important in fully characterizing the molecular events associated with morphological changes at cellular and tissue levels and are important for understand the development of cancers.
Because histopathology images are essential for cancer diagnosis and prognosis, we also identified favorable and unfavorable prognostic morphological features and the corresponding biological process associated with them. Among these unfavorable predictors, large values of long distance to adjacent nuclei imply a high percentage of stromal components in the in whole-slide images. Yuan et al. and Beck et al. both demonstrated that stromal morphologic structure is an important prognostic factor in breast cancer, patients with higher stromal proportions had worse prognosis than other patients (12,22). In addition, we also observed that large nuclear area is associated with poor survival. Previous studies have highlighted that cancer cells with enlarged nuclei almost always indicate more aggressive outcomes (46). Currently, anti-estrogen therapy to decreased nuclear size in tumors are used for preoperative treatment of breast cancer patients (46). As for favorable predictors, most of them were related to immune responses, suggesting that activation of immune system plays critical roles in fighting cancer, which are consistent with many recent studies on cancer immunology and immunotherapy (12,47,48).
Despite the extensive observations and results generated from our analysis, some limitations of this study should be noticed as well. First, the key molecular regulators for the cell type morphology features were still unknown, even though the associated biological processes were inferred because our current study focuses on correlation analysis instead of causal analysis. Deeper analysis for the regulatory and driver genes and proteins using more sophisticated statistical methods combined with experimental validation will be carried out soon. Second, we only included 73 breast cancer patients for the correlation analysis between molecular data and morphology phenotypes in this study because of the limitation of available data. The image-protein and image-mRNA relationships identified here may not represent all breast cancer subtypes. Despite that the correlative relationships between proteomic data and morphology were validated using matched RPPA data, further confirmation using independent datasets is still needed despite the lack of such data at the meantime. Last but not the least, even though we showed that the cell nucleic features suggested stromal or tumor cells, it is difficult to distinguish different cell types accurately just based on the nucleic morphology alone.
In summary, we carried out a unique systematic study on the relationship between tumor tissue morphology and transcriptomic as well as proteomic data in breast cancer. We observed concordant distribution patterns of correlation coefficients between image-mRNA and image-protein at the genome scale. Four major types of important biological processes related to cancers have been associated with various morphological features. Importantly, proteomic data are critical in identifying protein related biological processes associated with morphological features, which cannot be captured by transcriptomic data. In addition, morphological features associated with patient survival have been identified and their underlying molecular processes based on the associated proteins can link these morphological features to different hallmarks of cancers.
In conclusion, our analysis demonstrated the potential for integrating morphological information and molecular data for generating new biological hypothesis for cancer research. The algorithmic development for computational pathology unleashes the potential for similar large-scale studies for different cancers. More sophisticated modeling and integration methods will lead to deeper understanding of the regulation of the tissue morphology and importance of protein in this process, contributing to the generation of new insights for cancer biology and outcome prediction. Acknowledgment-We thank Mrs. Megan Metzger for editing the manuscript.

DATA AVAILABILITY
The data used for this study are downloaded from various public sources. Proteomic data were accessed from the NCI CPTAC Data Portal. Histopathology images were downloaded directly through the NCI GDC TCGA Data Portal, whereas transcriptomic data were downloaded from the UCSC Xena data portal (https://xena.ucsc.edu/public-hubs/). Matched RPPA proteomic data were obtained from the Broad GDAC