Prognostic stromal gene signatures in breast cancer

Global gene expression analysis of tumor samples has been a valuable tool to subgroup tumors and has the potential to be of prognostic and predictive value. However, tumors are heterogeneous, and homogenates will consist of several different cell types. This study was designed to obtain more refined expression data representing different compartments of the tumor. Formalin-fixed paraffin-embedded stroma-rich triple-negative breast cancer tumors were laser-microdissected, and RNA was extracted and processed to enable microarray hybridization. Genes enriched in stroma were identified and used to generate signatures by identifying correlating genes in publicly available data sets. The prognostic implications of the signature were analyzed. Comparison of the expression pattern from stromal and cancer cell compartments from three tumors revealed a number of genes that were essentially specifically expressed in the respective compartments. The stroma-specific genes indicated contribution from fibroblasts, endothelial cells, and immune/inflammatory cells. The gene set was expanded by identifying correlating mRNAs using breast cancer mRNA expression data from The Cancer Genome Atlas. By iterative analyses, 16 gene signatures of highly correlating genes were characterized. Based on the gene composition, they seem to represent different cell types. In multivariate Cox proportional hazard models, two immune/inflammatory signatures had opposing hazard ratios for breast cancer recurrence also after adjusting for clinicopathological variables and molecular subgroup. The signature associated with poor prognosis consisted mainly of C1Q genes and the one associated with good prognosis contained HLA genes. This association with prognosis was seen for other cancers as well as in other breast cancer data sets. Our data indicate that the molecular composition of the immune response in a tumor may be a powerful predictor of cancer prognosis.

Gene expression analyses of tumor samples are generally performed on whole tumor homogenates and thereby represent a pattern that reflects the expression from all cell types present in the tumor. The tumor microenvironment, comprising a large variety of cells, such as fibroblasts, immune cells, and endothelial cells, can constitute a significant part of the tumor and substantially contribute to observed expression patterns. Cells in the microenvironment can influence cancer progression [8][9][10][11] and have been shown to predict tumor outcome and therapy response in breast carcinomas [12][13][14]. One way to obtain cancer cell and stromal compartment-specific expression patterns is to isolate the compartments by laser capture microdissection (LCM).
For routine histological diagnosis of surgically removed tumors, formalin-fixed paraffin-embedded (FFPE) tissue is normally used and thus a wide range of FFPE tumors are available for gene expression analyses. However, fixation and embedding have a detrimental effect on RNA quality, resulting in fragmentation and chemical modifications and making it of less use for expression analyses [15]. In this study, we have established a procedure for global-gene expression analysis using LCM on FFPE triple-negative breast cancers. Isolation of stromal and cancer cell compartments of the tumor with subsequent analysis of the global mRNA expression revealed compartment-specific gene expression. Expanding the stroma-specific gene set by identifying genes with correlating expression levels using tumor data from The Cancer Genome Atlas (TCGA) database gave rise to 16 gene signatures of stromal genes with highly correlating gene expression. Two signatures, consisting of genes related to an immune response, are of particular interest since they are prognostic in multivariate Cox proportional hazard models in several breast cancer data sets and data sets of other cancers.

Tumor material
Tumor specimens, from subjects that had given informed consent, were obtained from Skåne University Hospital, Malmö. Ethical permission has been obtained from the local Research Ethics Committee in Lund (Dnr 2009/658). Information about the tumors was obtained from the pathology reports. All tumors were negative for estrogen and progesterone receptors and had no ERBB2 (HER2) amplification. Tumors with sufficient amount of cells in the stromal compartment to allow conclusive microarray analyses were selected. Two of the tumors used for microarray analysis were high-grade, grade III (3 + 3 + 3 for tubule formation, nuclear pleomorphism, and mitotic count, respectively), and one was low-grade, NHG (Nottingham histological grade) grade I (1 + 3 + 1 according to the NHG grading system). Two of the tumors were invasive ductal carcinomas, whereas one was reported as a medullary carcinoma.

Tissue preparation
Five consecutive sections (5 μm) of each tumor were prepared on a microtome and mounted onto polyethylene terephtalate (PET) membrane slides (Leica Microsystems, Wetzlar, Germany). When indicated, 30% ethanol was included. Mounted tissue sections were allowed to dry for 30 minutes in room temperature prior to incubation in −20°C for 24 hours for optimal adhesion. To preserve the RNA quality in the archived FFPE tissue specimens, the blocks used for biomarker analysis during the routine prognostic procedure were stored in 4°C.

Staining
During method development, staining of the tissue sections with either a cresyl violet LCM staining kit (Ambion, part of Thermo Fisher Scientific, Waltham, MA, USA) or standard hematoxylin-and-eosin staining was evaluated. For further analysis, cresyl violet staining was used. Tissue sections were initially rinsed 2 × 1 minute with xylene for deparaffinization and rinsed in 100%, 75%, and 50% ethanol for 30 seconds. Rehydration of the tissue with diethylpyrocarbonate (DEPC)-treated water was performed prior to 40-second incubation in staining solution, followed by additional rinsing with DEPC-treated water for 30 seconds and dehydration with 100% ethanol twice for 30 seconds. Sections were thereafter dried in room temperature and immediately used for LCM or stored at 4°C for up to 1 week.

Laser capture microdissection
LCM to isolate tumor compartments was performed on a Leica LMD6500. After microdissection, the tissue was collected in 0.5-mL polymerase chain reaction (PCR) tube caps containing Allprep RNA/DNA FFPE kit lysis buffer (Qiagen, Hilden, Germany) with Proteinase K.

RNA extraction and hybridization
RNA was extracted by using Allprep RNA/DNA FFPE kit (Qiagen) and evaluated with Nanodrop and Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) in accordance with standard procedures, including calculation of RNA integrity number (RIN) values as previously described [16]. Isolated RNA samples with a 260/280 ratio of at least 1.8 and RIN value of more than 2.0 were used for amplification with SensationPlus FFPE Amplification and WT labeling kit (Affymetrix, Santa Clara, CA, USA) in accordance with the protocol of the supplier. The resulting ds-cDNA was hybridized to a Human Gene 1.0 ST array (Affymetrix).

Data sets
TCGA expression data were downloaded November 2013 (breast cancer), January 2014 (kidney and lung cancer), and March 2014 (head and neck cancer) from the TCGA database [17]. The data were log2-transformed after the addition of 1 to each normalized value. Clinical and follow-up data were downloaded in May 2014. The basic patient characteristics of the TCGA breast cancer data used for follow-up analyses can be found in Additional file 1: Table S1. The NKI295 [18], Wang [19], and Trans-Big [20] datasets were downloaded in an assembled form, as described [21]. The original array data can be found at [22] (NKI295) or on the National Center for Biotechnology Information Gene Expression Omnibus website as GSE2034 and GSE7390.

Data analysis
All analyses were done with R by using the limma, survival, and cluster packages. Breast cancer subtypes were determined by using the nearest correlations with the PAM50 centroids [23]. For analyses of gene signatures from the Wang and TransBig data sets, probes with a mean log2 signal of more than 6 were included, and for the NKI295 dataset a threshold of −0.3 was used.

Isolation of RNA from laser capture-microdissected FFPE breast tumors
To obtain RNA of sufficient quality from tumor compartments isolated with LCM, sectioning, mounting, deparaffinization, and staining steps were performed with the highest possible purity to avoid RNase contamination. Adhesion of the tissue section to the membrane slide was obtained with 30% ice-cold ethanol (Additional file 2: Table S2) to avoid further degradation. As the standard histological staining with hematoxylin and eosin may influence the RNA integrity negatively [24], cresyl violet was used for tissue staining. We obtained similar RNA quality and amount from a cresyl violetstained section as from a non-stained section either with or without rehydration (Additional file 3: Table S3 and Additional file 4: Figure S1). Since cresyl violet staining with rehydration was the only procedure that resulted in both an acceptable RNA quality and adequate tissue morphology, this was henceforth used as staining procedure.
To collect epithelial and stromal compartments from breast tumor tissue, five consecutive tissue sections from triple-negative tumors were laser-microdissected ( Figure 1A-C). We found that collection of 25 to 28 mm 2 from the cancer cell compartment yielded a sufficient amount of RNA (Table 1A). However, isolation of stromal compartments with or without inflammation showed that stroma with few inflammatory cells did not yield enough material for downstream applications and limited us to analyze only tumors with inflammatory stroma. Analysis of three triple-negative tumors with inflammatory stroma showed a similar relationship between dissected tissue area and extracted RNA amount for both compartments (Table 1B-D and Figure 1D-F).

Identification of stroma-and epithelium-specific gene expression
Extracted and amplified samples were hybridized to an Affymetrix Human Gene ST 1.0 array. To identify genes that are selectively expressed in either the stromal or the cancer cell compartment, the probe list was shrunk by removing probes for which every sample had a log2 expression value below a threshold (less than 7) and by applying a variance filter, which removed probes with variance of less than 0.15. The remaining 5,971 probes were analyzed for difference in expression between stroma and cancer cells by using the limma package in R.
Even if only three tumors were analyzed, this approach identified 107 probes (Additional file 5: Table S4), with an adjusted P value of less than 0.05, that had higher levels in stroma and 48 probes (Additional file 6: Table S5) that were higher in cancer cells with the same adjusted P value (Additional file 7: Figure S2A-B). Several of the genes enriched in the stroma compartment are genes encoding matrix proteins, such as collagens (for example, COL1A1) and decorin (DCN). There was also enrichment of genes related to an immune response such as chemokines (CXCL12 and CCL19) along with their receptors (CXCR4 and CCR7), matrix metalloproteases (MMP2 and MMP9), the T cell-specific genes CD4 and granzyme K (GZMK), and the B cell-specific immunoglobulin genes such as both heavy (IGHA1, IGHA2, IGHV1-5, and IGHM) and light (IGLJ3, IGLV6-57, IGKC, IGKV1-5, and IGK@) chain-encoding genes. This indicates that the analysis has worked technically and that it identifies genes that are essentially stromaspecific in their expression. On the other hand, several of the genes in the cancer cell compartment are epithelial, such as the cadherin family member desmoglein (DSG2), intracellular junction protein desmoplakin (DSP), the epithelium-specific transcription factors (ELF3 and 5), keratin 7 (KRT7), claudin 4 and 7 (CLDN4 and 7), and integrin beta 8 (ITGB8).

Stroma-specific gene sets
The genes highly expressed in stroma most likely represent contributions from several different cell types that may be at different maturation stages that are located mainly in the stromal compartment. Therefore, the expression levels of these genes in a tumor homogenate may potentially reflect the combination of these cell types in the tumor. This raises the possibility to identify gene signatures that can be used as an estimate of the molecular and of the cellular composition of the stroma in a tumor. However, the method we have used, LCM of FFPE material followed by amplification and hybridization, is not optimal for an accurate estimate of RNA levels and will conceivably have low sensitivity. Therefore, to expand the set of genes that may constitute specific stromal signatures, we used TCGA mRNA data from 982 primary breast cancers and identified all genes whose expression level correlated with a correlation coefficient above 0.85 with at least one of the original genes identified as enriched in the stromal compartment of the tumors in our analysis. This set was further expanded by including genes that had a correlation coefficient above 0.89 with one of the genes in the expanded set. This led to an enlarged set comprising 361 genes. All of these genes were also found to be expressed at higher levels in the stroma in the laser capturemicrodissected samples (Additional file 7: Figure S2C). To define signatures of highly correlating genes, an iterative correlation analysis was performed yielding clusters of genes, in which all genes in a cluster had an average correlation coefficient above 0.90 with all other genes in the set. Owing to the large amount of genes typical of an immune or an inflammatory response, the threshold was set to 0.91 for these genes. This step yielded 16 gene signatures which contained at least three genes (Table 2 and Additional file 7: Figure S2C). For each signature, an aggregated value was calculated for each tumor by taking the arithmetic mean of the log2 expression of the genes in the set. The breast cancer tumors in the TCGA database were thereafter clustered by using the signature scores as variables (Figure 2A). There was no obvious relationship between this clustering and the breast cancer molecular subtype, determined by the PAM50 centroids. However, the gene signatures were separated into three major groups: one with primarily matrix/ fibroblast-related genes (signatures 1 and 2), one with endothelium-associated genes (signatures 4 and 5), and one with genes typical for immune/inflammatory cells. We thereafter compared the aggregated expression levels for the signatures in the molecular subtypes ( Figure 2B demonstrates the signature with the largest number of genes, all signatures are shown in Additional file 8: Figure S3). Luminal B tumors were low in expression of all stromal signatures, whereas basal-like tumors were low in matrix/fibroblast and endothelial genes but high in immune/inflammatory signatures. HER2-enriched tumors were low in endothelial genes and luminal A in immune/ inflammatory signatures. Thus, the tumor group associated with good prognosis (luminal A) was high in both matrix and endothelial genes.

Prognostic value of gene signatures
To assess whether the identified signatures of highly correlated genes may have prognostic implications, univariate Cox proportional hazard regression analyses were performed for the tumors from the TCGA database that had follow-up data ( Table 3). The standardized mean of log2 expression levels of all the genes in a set were computed and used for the analyses. None of the gene signatures, except gene set 5, which is a signature with endothelial-related genes, was associated with risk of recurrence. The hazard ratio (HR) of signature 5 was 0.714.
In a full multivariate model, using all gene sets, some signatures had P values of less than 0.05 (not shown), indicating that a multivariate model may be more appropriate. Therefore, we performed both forwards and backwards selection to identify an optimal model. The P value of the likelihood ratio test of the model was used as an indicator of the quality of the model. This resulted in an optimal model that contained four signatures. A multivariate analysis with these signatures indicated that two (1 and 13) were associated with an increased and two (5 and 12) with a decreased risk for new tumor event (Table 4A). To analyze whether the HRs may be due to confounding effects of other parameters, we included tumor size, node stage, and estrogen receptor status in the model and stratified for stage (Table 4B). We also analyzed a model in which stratification was done for PAM50 subtype (Table 4C). This showed that the gene sets, as well as node stage and estrogen receptor status, are independent prognostic markers. The only exception was signature 5, which had a P value of more than 0.05 upon stratification for PAM50 subtype. Gene signatures 12 and 13 both have an inflammatory/immune profile with set 12 containing HLA genes and set 13 C1Q genes. These were the most influential signatures in the multivariate Cox model and their opposing HRs indicate that the characteristic of the immune response in a tumor has prognostic value.
Multivariate analyses with only these sets demonstrated significant HRs for both sets in models both without or with adjustments for the clinicopathological parameters and PAM50 subtype (Table 4D-F).

Stromal gene signatures in other tumors
The analyses indicate that the profile of an immune/inflammatory response in the tumor has a prognostic value. To analyze whether this is general and also applies to other tumors, we used RNA HiSeq data from four other cancer forms available in the TCGA database ( Figure 3A). For both renal clear cell carcinoma, with new tumor event as the end point, and lung squamous cell carcinoma, with death as the end point, a similar pattern with no significant roles of the gene signatures as isolated variables, but with opposing HRs in a multivariate model, was detected. A similar but not significant tendency was seen for lung adenocarcinoma and head and neck squamous cell carcinoma. We also analyzed the signatures in three other breast cancer data sets, in which the expression had been analyzed with microarray technology ( Figure 3B). Here, a threshold of mean log2 expression level for the probes was applied, excluding a few genes from calculation of the gene signature values. For two of the three data sets (NKI295 [18] and TransBig [20]) analyzed, significant associations with opposing HRs was seen in analogy with the breast cancer TCGA data. In the third set (Wang [19]), a similar but, for the C1Q signature, not significant association was observed.

An immune/inflammatory score
The observed prognostic values of the C1Q and HLA signatures raises the possibility that a prognostic score, based on the ratio of the two signatures, could be calculated. Therefore, we defined a C1Q-HLA score as the difference in mean log2 expression of the two signatures. When Kaplan-Meier curves for the TCGA tumors were constructed on the basis of quantiles of this score, it is evident that the top 1/3 is clearly at higher risk than the bottom 1/3 for recurrence. In the latter group, there were hardly any new tumor events within five years after the initial treatment ( Figure 4A).
We also evaluated the molecular subgroups. For HER2enriched, luminal A, and normal-like tumors, the number of events were too few for relevant analysis, but for basallike and luminal B tumors, Kaplan-Meier curves based on the C1Q-HLA score above or below the median were constructed ( Figure 4B and C). We also evaluated the C1Q-HLA score together with lymph node status in a multivariate Cox proportional hazard model using the basal and luminal B tumors (HR and P values are shown in Figure 4). All of these analyses gave P values below 0.05 for the score as prognostic indicator, except the log-rank test in luminal B tumors, in which the P value was 0.070. Taken together, the analyses indicate that a C1Q-HLA score is of prognostic value also in these isolated subgroups.

Other immune/inflammatory genes
The opposing HRs obtained with the HLA and C1Q signatures in several cancer data sets indicate that the profile of the inflammatory components in a tumor has prognostic importance. To screen for other inflammatory genes that may provide prognostic information, we analyzed all immune/inflammation-related genes obtained in the expanded stromal gene set pairwise with both the C1Q and the HLA signatures in multivariate Cox proportional hazard models (Tables 5 and 6). Using a cutoff of 0.01 for the  P value of the gene in the Cox models resulted in 26 genes with an HR of more than 1 in a multivariate model with the HLA signature and 53 genes with an HR of less than 1 in a model with the C1Q set. To further analyze these genes, a correlation matrix was generated by using the expression data from TCGA tumors ( Figure 5). A pattern can be discerned: genes that are associated with higher risk in general are more correlated with each other than with genes that are associated with lower risk, and vice versa.

Discussion
The tumor stroma is composed of several non-malignant cell types that together build up the tumor microenvironment which may promote both tumor progression and metastasis [8,10,11,25]. In this study, we have used LCM for isolation of epithelial and stromal compartments of FFPE triple-negative breast tumors to enable analysis of compartment-specific gene expression. The usage of FFPE tumors is advantageous because of the large amount of routinely stored tumors, but gene expression analysis of this material has been challenging because of chemical modifications and degradation of the RNA [26]. Methodology development has improved the possibilities of accurate analysis of whole transcript expression from FFPE material. Studies thus far have mainly been technically oriented [27,28], but also tumor grade-, prognostic-, and subtype-specific expression for identification of novel gene expression profiles has been analyzed [29][30][31][32][33].
In this study, we have optimized the RNA preparatory steps and used an amplification and labeling system specifically developed for analysis of degraded RNA, along with a microarray containing probes detecting sites across the whole transcript. The genes we thereby could detect as enriched in the stroma compartment could to a large extent be identified as typical for immune cells, such as the Ig genes, and extracellular matrix, conceivably emanating from fibroblasts. This indicates that the method, despite the limited quality and quantity of the material, is able to capture compartment-specific expression patterns.
We obtained a set of 107 genes expressed at substantially higher levels in stroma. By expanding this set using correlation analysis of TCGA tumor data and successive iterations, we could define 16 gene signatures with high intra-set correlation. All of the genes identified in the expansion were also found in our data to be expressed at higher levels in the stromal compartments, supporting that they are all related to stroma.
The gene signatures conceivably represent a combination of different stromal cell types or different maturation stages of the same cell type. Sets 1 and 2 mainly contain genes typical of extracellular matrix such as DCN, LUM, VCAN, and collagens. These genes have been seen to be highly expressed in stroma in other studies [10,11,34,35]. We also find FAP, a typical myofibroblast marker, as well as POSTN, which is synthesized by fibroblasts in set 1. It is likely that sets 1 and 2 could be used as indicators of the amount of fibroblasts and stroma in a tumor. Sets 4 and 5 contain genes typical of endothelial tissue, both actual angiogenic regulators TIE1/TIE2 (TEK), ARHGEF15, ROBO4, and ELTD1 and other endothelial markers such as CD34, CLEC14A, and ESAM. Several (7/12) of the genes in gene sets 4 and 5 were identified in the angiogenesis signature obtained from 1,250 tumors from different cancer types [36].
The remaining gene sets contain genes typical of immune and inflammatory cells, such as T cell-associated CD3 (CD3G, CD3D and CD3E), CD4, ZAP70, GIMAP and granzymes (GZMA/GZMK and GZMM), major histocompatibility complex (MHC) II-encoding HLAs, and more general lymphocyte-associated genes such as LAIR1 and LST1. Cox proportional hazard analyses revealed only limited value of the gene signatures in univariate models, which suggest that neither the amount of stroma nor the extent of vascularization or the magnitude of inflammatory components has prognostic information as isolated variables. However, in a multivariate model, four of the gene signatures were significant. Sets 1 and 13 are associated with an increased risk for new tumor event, whereas gene sets 5 and 12 negatively influenced the HR.
A typical fibroblast marker (FAP) was identified in gene set 1 together with fibroblast-expressed extracellular matrix-associated genes, such as collagens, whereas gene set 13 contained complement factor C1Q. FAP has been identified as a prognostic marker in various cancer studies, including breast and lung cancer, and has been suggested to be a potential target in solid tumors [37][38][39][40]. Complement initiator C1q proteins can be derived from various stromal cells and has, in addition to its role in immune complex recognition, been found to have a proangiogenic effect in wound healing [41,42] and to drive carcinogenesis [43]. Complement activation has, on the other hand, been considered to have tumor suppressive properties, and C1q can induce apoptosis in prostate cancer cells [44,45], indicating a complex and probably context-dependent role of C1Q in tumor development.
The gene sets with negative HR contain several MHC-II-encoding HLA genes (gene set 12) and angiogenesisregulating genes, such as TEK and ELTD1 (gene set 5). ELTD1 expression has previously been shown to be associated with an improved outcome [36] which may explain why this signature negatively influences the hazard. Furthermore, higher levels of HLA-DR have been shown  to predict better prognosis of invasive ductal carcinomas [46]. The signatures were also significant when lymph node status was included and when stratifying for molecular subtype, indicating that they are independent prognostic markers.
A more limited Cox model based on only the immune gene signatures 12 and 13, with or without lymph node status and stratification for subtype in the model, could also predict recurrence. The opposing effects by these signatures on the risk indicate that the composition of the immune response in the tumor is of importance for the progression. This conclusion is further supported by the fact that the signatures are predictive in a multivariate Cox model in several other tumor forms and other breast cancer data sets. It suggests that the importance of specific components of the immune response for progression of the disease may be a general phenomenon applicable to several cancer forms.
The prognostic signatures contain mainly C1Q and HLA genes, respectively. C1Q genes have been shown to be produced by a number of cells, including monocytes, macrophages, and dendritic cells [42]. However, the expression of C1Q has previously been reported to be highest in immature immune cells that are known to reflect a state of immune paralysis in cancer immunology [47,48]. Therefore, higher expression levels of these genes could potentially reflect a higher ratio of immature, non-functional antigen-presenting cells in the tumor. Likewise, higher expression levels of HLA genes may be indicative of a more active immune response. This assumption is further supported by the screen (Table 5) for immune genes that are associated with a higher risk for recurrence in a multivariate model with the HLA signature. Here, we found many genes typical for negative immune signaling. These include inhibitory co-receptors that are negative regulators of immune responses (LAIR1, LILRB1, LILRB2, and LILRB4) and macrophage genes (CD68, ITGB2, and CD86). On the other hand, genes that were associated with a lower risk in a multivariate model with the C1Q signature (Table 6) represented genes coding for proteins involved in a cytotoxic immune response (for example, GZMH, GZMA, GZMK, CD3D, CD3E, CD3G, CD247 (CD3ζ), CD8A, CD27, CD52, CD96, PYHIN1, SLAMF6, and IL12RB1). The number of genes with both high correlation of their expression and similar prognostic value indicates that it may be possible to identify fairly large gene sets that could be used as markers for the profile of an intratumor immune response. It thereby highlights potential factors determining this profile.
The results support the idea that the molecular profile of an immune response, rather than the amount of immune or inflammatory cells, is an important prognostic marker in breast cancer and in other cancer forms. Such a hypothesis is further supported by other results, including the finding that the amount of a specific set of CD4 + T cells, namely follicular helper T cells, predicts breast cancer survival [49]. Figure 5 Correlation matrix of expression levels of genes found to provide altered hazard together with either the C1Q or the HLA signatures. The correlation coefficients were calculated by using the log2 expression levels of the genes that were found to be associated with either a higher (red bar) or a lower (green bar) hazard in a multivariate Cox analysis together with the HLA or C1Q signature, respectively. The Cancer Genome Atlas breast cancer data were used for the analysis. The genes are listed in Tables 5 and 6. HR, hazard ratio.