Survival analysis across the entire transcriptome identifies biomarkers with the highest prognostic power in breast cancer

Graphical abstract


Introduction
Breast cancer is by far the most common cancer in women [1] with two main established molecular biomarkers for systemic therapy, estrogen receptor (ER) alpha and epidermal growth receptor 2 (ERBB2/HER2). These define three distinct molecular subtypes, the ER alpha positive/ERBB2 negative, the ERBB2 positive, and the basal tumors (these lack ER alpha, ERBB2, and also progesterone receptor). Progesterone receptor (PR) is a gene regulated by ER alpha and ER alpha positive/PR negative cancers are exceedingly rare [2].
ER positive tumors are treated with hormone therapy and occasionally with some chemotherapy, ERBB2 positive tumors are trea-ted with anti-ERBB2 therapy and chemotherapy and basal tumors receive chemotherapy only [3]. With the introduction of anti-ERBB2 therapies the previously inferior prognosis of ERBB2 positive patients improved dramatically [4]. Today, basal cases have the worst expected outcome with high risk of relapse within a few years following diagnosis [5]. The ER positive/ERBB2 negative patients represent 70% of all cases, the ERBB2 positive cases account for 15-20%, and the basal cases denote 15% of all breast cancer cases [6].
In recent years few new agents have been approved, including CDK4/6 inhibitors for the treatment of advanced ER positive patients [7] and PARP inhibitors for patients with germline BRCA mutation [8]. However, despite multiple large-scale tumor sequencing studies, germline mutations in BRCA1 and BRCA2 [9] remain the solitary mutations capable to serve as basis for clinically valuable targeted therapy. At the same time, monogenic gene expression based predictive biomarkers have been supplemented by new generations of multigenic prognostic test. Some of the multigenic tests claim to predict both early and late relapse [10].
The most important genes used today as predictive markers (capable to serve as biomarkers predicting response to a given agent) emerged first as prognostic biomarkers (genes capable to predict the expected survival of the patients). This was the case for both ER [11] and ERBB2 [12]. Considering the efficiency of current systemic therapies, the next level should lie in investigation of patient cohorts further stratified based on administered treatment -which itself is already based on the currently approved clinical markers. To achieve this goal, one has to identify and rank all prognostic biomarkers. Here, we aimed to perform such an analysis using publicly available gene expression datasets in basal and in the estrogen-positive/ERBB2 negative chemotherapy treated breast cancer.

Database setup
We performed a search in the GEO (https://www.ncbi.nlm.nih.gov/geo/) and EGA (https://ega-archive.org/) repositories to identify transcriptome-level gene expression datasets with available clinical information. In this, only datasets with at least 30 samples were considered and only those which were generated using the GEO platforms GPL96, GPL570, and GPL571. The reason for this filter is that these platforms have an overlapping set of 22,277 genes measured using the exact same probe sequences. It is only possible to have the same sensitivity, specificity, and dynamic range in case the same probe sets are used.

Quality control and pre-processing
First, each array was normalized using MAS5 -we selected MAS5 as it ranked among the best performing normalization techniques in our previous comparison of available methods using RT-PCR validated expression values [13]. In addition, MAS5 enables the normalization of a single sample separately, thus the insertion or removal of a sample or samples does not affect the other values within the dataset. Then, a second scaling normalization was performed to reduce batch effects by setting mean expression of the overlapping 22,277 probes to 1000 in each array [14].
In order to remove redundant samples, the normalized gene expression values across all samples were compared. In case of identical expression values, only the first publication of a given gene array was retained in the database, and all subsequent copies were removed. Five parameters were analyzed for quality control: the background, the raw Q, the percentage of present calls, the presence of bioBCD spikes, and the GAPDH/ACTB 3 to 5 ratio. Samples with positive values and -for continuous variables -those within the 95% range for all samples passed the quality control. Those where one parameter did not pass, were designated as outliers, and those where two or more parameters did not pass were marked as biased arrays. Biased arrays were excluded from the subsequent statistical analyses.

Molecular subtype determination
Molecular subtypes were determined using the StGallen criteria [15]. Because only the gene expression measured on the gene arrays were available for all samples, these were used to determine receptor status for each patient. In this, the cutoff of 500 for the probe set 205225_at was used to determine estrogen positivity and the probe set 216836_s_at with a cutoff of 4800 was used to assign patients into ERBB2 positive/negative groups [16]. Proges-terone receptor was not included, because there is no reliable probe set for this gene in the GPL96 gene arrays. Present analysis was restricted to two systemically treated cohorts, to those who are estrogen positive ERBB2 negative and to those who are negative for both estrogen and ERBB2 receptors.

Survival analysis
Cox proportional hazards regression analysis was made for each gene separately. In this, each possible cutoff value was examined between the lower and upper quartiles, and False-Discovery Rate using the Benjamini-Hochberg method was computed to correct for multiple hypothesis testing. The survival analysis was performed for relapse-free survival (RFS). Breast cancer specific survival was not used because almost all studies published OS and/ or RFS only. In case of identical p values the strongest hazard rate was identified. The results for the best performing cutoff were exported for each gene in a separate database, and these were used to generate Kaplan-Meier plots to visualize correlation between gene expression and survival.

Gene ontology analysis
We performed gene ontology analysis for the derived lists in each setting separately. In this, only the significant genes were included and The Database for Annotation, Visualization and Integrated Discovery (DAVID) tool was used to uncover overrepresented biological processes (BP) and molecular functions (MF) [17]. Only hits with a Benjamini-Hochberg False Discovery Rate below 0.05 were accepted as significant.

Updates of www.kmplot.com
Our database was initially established in 2010 with 1809 patients and at that time we also established an online survival analysis platform to enable the investigation of the assembled dataset by independent researchers [18]. In addition to updating the database in the online analysis platform new analysis options were also added to the site, including the cutoff determination algorithm and molecular subtypes utilized in present manuscript.

Database
The total number of breast cancer arrays was 9423 and these represent 7830 unique samples from 55 independent datasets. Of these, there were 1139 outliers and 77 biased arrays. All biased arrays were excluded from further analysis. Relapse-free survival was available for 5268 patients and overall survival time for 5165 patients. Clinical characteristics for the entire database are presented in Table 1. Of note, the total sample number in Table 1 for some studies is lower due to the exclusion of redundant samples, as described in the Methods section. Clinical characteristics of the entire database including receptor status, grade, lymph node status, molecular subtype distribution, applied treatment and length of follow-up for relapse-free survival are summarized in Fig. 1.
In order to select the most reliable probes, multiple filtering steps were executed. First, only probe sets mapped to a gene were retained. Then a second filter was added to remove all genes with a false discovery rate over 5%. Then, a third and a fourth filter were set in which the maximal expression had to be over 1000 and the cutoff values had to be over 100, respectively. The goal of these Table 1 An overview of the clinical characteristics of all datasets integrated into the complete database. NA: no data, RFS: relapse-free survival, OS: overall survival, ER: estrogen receptor, MTAB-365: E-MTAB-365 dataset, TABM-43: E-TABM-43 dataset.  (these include the datasets GSE1456, GSE16391,  GSE16446, GSE16716, GSE17907, GSE19615, GSE21653, GSE25066, GSE31519, GSE3494, GSE37946, GSE45255, GSE4611, GSE5327, and GSE69031). The cumulative number of patients included in these totaled at n = 712 (for some genes the n was 131 due to array platform differences). When running the Cox regression for relapse-free survival, there were 1496 genes below the 5% FDR threshold and 1257 of these had expression over 1000 in at least one sample. The threshold of 1000 was used as this was the mean expression for all genes after the normalization. The cutoff was over 100 for 1203 genes and 692 of these were JetSet best probe sets. The complete table of all significant genes ranked by absolute HR values is presented as Supplemental Table 1.
What is the maximal hazard rate a gene can achieve? We can estimate the potential effect of a gene when ranking all genes and selecting the most significant one. When investigating all genes in all patients in the estrogen receptor positive ERBB2 receptor negative cohort, Ribosomal Protein L22 (RPL22) reached the highest significance with a HR of 0.3 (higher expression of RPL22 was associated with better survival, and thus the value of 0.3 would equal to an absolute HR of 3.33) and a p of 5.4EÀ10 ( Fig. 2A). The first significant gene was Thyroid transcription factor I (TGT3) with a HR of 1.76 and a p of 5.8EÀ04 (Fig. 2B). Genes with inferior p value did not reached statistical significance after multiple hypothesis testing (FDR over 5%).

Estrogen-positive ERBB2 negative breast cancer with untreated excluded
In this setting we included all estrogen positive and ERBB2 negative patients (n = 2823) and then excluded all samples with no information about treatment and also excluded all systemically untreated patients. Of note, the restriction was for systemic therapies only (chemotherapy and endocrine therapy) as there was no information available about radiation therapy. Twenty-three datasets had eligible patients (these include GSE12093, GSE12276, GSE1456, GSE16391, GSE16446, GSE16716, GSE17705, GSE17907, GSE19615, GSE21653, GSE25066, GSE26971, GSE2990, GSE31519, GSE3494, GSE37946, GSE45255, GSE4611, GSE46184, GSE5327, GSE6532, GSE69031, and GSE9195), and the final number of patients included was 1679. Of note, some genes were only present in the HGU133plus2 arrays, and therefore only patients who were measured by this platform were included (n = 384). Of the 37,535 probe sets mapping to a gene, 17,088 genes reached statistical significance at FDR < 5%. Of these, 11,029 had expression over 1000 in at least one sample, and the cutoff was over 100 for 8607 genes. When mapping to JetSet best probe sets, 4709 genes remained as significant, the complete list of these ranked by absolute HR values is provided in Supplemental Table 2.

Genes associated with survival after chemotherapy in basal breast cancer
All together 13 datasets included basal breast cancers with documented chemotherapy, these include GSE1456, GSE16446, GSE16716, GSE19615, GSE21653, GSE25066, GSE31519, GSE3494, GSE37946, GSE45255, GSE4611, GSE5327, and GSE69031. The number of patient samples was 392. When running the survival analysis across all genes using relapse as the endpoint, only probe sets mapping to a gene were included (n = 37,535). Then filtering was made to include only those results where the False Discovery Rate was not higher than 5% (n = 652 genes remaining), and only those where the expression of the gene reached 1000 in at least one sample (n = 402 remaining). The cutoff designating highand low-expression cohorts had to be over 100 (n = 380 remaining) to exclude probes with expression levels close to the background noise. Finally, the significant probe sets were reduced to include only the JetSet best probe sets (n = 246 remaining). The complete list of all results ranked by the absolute HR values is presented in Supplemental Table 4. In the gene ontology analysis extracellular matrix organization (GO:0030198, p = 1.33EÀ08) reached the highest significance. When ranking all genes derived using all patients in this cohort, the most significant gene was Calmodulin-regulated spectrinassociated protein 1 (CAMSAP1) with a HR of 3.61 and a p of 1.5EÀ05 (Fig. 2C). On the other end of the spectra the first gene to reach significance was PDZ And LIM Domain 7 (PDLIM7) with a HR of 1.88 and a p of 7.2EÀ04 (Fig. 2D). Thus the variety of significant genes spanned a hazard rate ranging between 76% and 261% higher risk (when considering all HR values below one as inverted).

Genes associated with survival in basal breast cancer after adjuvant chemotherapy
Ten datasets had basal breast cancer patient samples with documented adjuvant chemotherapy, including GSE1456, GSE19615, GSE21653, GSE31519, GSE3494, GSE37946, GSE45255, GSE4611, GSE5327, and GSE69031. In the altogether 156 samples 1553 genes reached a FDR below 5%. When filtering for maximal expression over 1000 (n = 862) and cutoff over 100, 542 genes reached significance. The complete list of all genes related to relapse-free survival and ranked by the absolute HR values is provided in Supplemental Table 5. When examining the overrepresented biological processes among these genes, antigen processing and presentation (GO:0002504, p = 1.26EÀ06), T cell receptor signaling (GO:0050852, p = 7.09EÀ05), and immune response (GO:0006955, p = 1.62EÀ04) reached the highest significance. MHC class II receptor activity (GO:0032395, p = 2.5EÀ05) was the most significant molecular function. The seven biological process and three molecular function categories reaching significance in the GO analysis are listed in Supplemental Table 3.

Genes correlated to prognosis in untreated patients
The analysis was also executed by including only patients who did not receive a systemic treatment. Untreated estrogen-positive ERBB2 negative breast cancer patients (n = 686) were available from the GSE11121, GSE1456, GSE19615, GSE2034, GSE21653, GSE2990, GSE31519, GSE3494, GSE45255, GSE4922, GSE69031, and GSE7390 datasets, and the expression of all together 959 genes reached statistical significance in correlation to relapse-free survival (Supplemental Table 6). When comparing all genes related to survival in chemotherapy treated and untreated patients, 135 out of the combined 1515 genes were present in both lists (8.9%). Untreated basal breast cancer patients (n = 178) were accessible from the GSE11121, GSE19615, GSE2034, GSE21653, GSE2990, GSE31519, GSE3494, GSE45255, GSE69031, and GSE7390 datasets and 135 genes had a FDR below 0.05 in these (Supplemental Table 7). When compared to the genes related to chemotherapy response, 99.5% of genes were unique for one signature and only two genes (WARS and UBE2L6) were present in both lists.

Online survival analysis platform
The updated online analysis platform with transcriptomic and survival data of all 7830 breast cancer samples can be utilized at https://kmplot.com/analysis/index.php?p=service&cancer=breast. The correlation between survival and gene expression can also be evaluated for clinical cohorts not utilized in current project.

Discussion
Present study is based on multiple distinct steps. First, a sizeable database comprising thousands of breast cancer samples with clinical follow-up was assembled. The entire transcriptome was processed for each sample and redundant samples were removed. Then, survival analysis was made across all genes and the best performing genes were ranked for two cohorts with high clinical relevance: ER positive ERBB2 negative patients who received chemotherapy and basal breast tumors with chemotherapy.
Chemotherapy can improve the survival [20] but at the same time has significant risks due to the suppression of rapidly proliferating tissues including bone marrow (anemia, immunosuppression), hair follicles (alopecia), and the gastrointestinal tract (diarrhea) [21]. Chemotherapy can also have an effect on the central nervous system, lead to vomiting and early cognitive impairment [22]. Thus, it is crucial to select patients who get the most benefit from chemotherapy -different features are capable to assist in making this decision in different breast cancer subtypes.
ER positive ERBB2 negative tumors represent the largest cohort of breast cancer patients with over two third of all patients. The basic systemic therapy for these patients includes chemotherapy and endocrine therapy -we evaluated biomarkers of endocrine therapy previously [23]. The decision to administer chemotherapy can be based on clinical features including high stage or node positivity, or designation of high risk via gene expression profiles including Oncotype DX [24,25] or EndoPredict [26]. Here, we run survival analysis for all genes in all chemotherapy treated ER positive ERBB2 negative patients to uncover genes correlated to survival following chemotherapy. With 692 significant genes we exposed a surprisingly large proportion of genes related to survival. Of note, when running the analysis using a less restricted criteria by including all patients who were not untreated, the number of significant genes was even higher. The significant prognostic biomarker genes achieved a hazard rate between 1.76 and 3.33 with a p value below 5.8e-04. When investigating common features of these genes by analyzing biological processes and molecular functions, GO categories related to cell division and chromatid segregation including microtubule binding were identified. These observations are in line with the previously described paradox cor-relation between chemosensitivity and low proliferation rates [27]. While the in depth discussion of individual genes is not in the scope of present study, a ranking of all significant genes in conjunction with the established threshold values help to quickly identify and filter genes related to chemotherapy response in this cohort in future genetic and transcriptomic studies.
Basal breast cancers that lack the expression of both steroid hormone receptors and ERBB2 represent approximately one sixth of all cases. Despite sensitivity to chemotherapy, these tumors have generally a poor prognosis [28]. In these tumors, the identification of good-and worse-prognosis cohorts has little value as even good-prognosis patients have a 20% risk of relapse [29]. Basal tumors are heterogenous and can be further subdivided into four molecular subtypes based on their transcriptomic fingerprint including the basal-like 1, basal-like 2, the mesenchymal, and the luminal androgen receptor subtypes [30]. While these subtypes have differences in clinical characteristics, the number of samples with clinical follow-up available in our database was not sufficient to perform a robust analysis across all genes within each subtype separately. When using all chemotherapy treated basal breast cancers we identified 246 genes significantly correlated to survival. The association with survival ranged between 1.88 and 3.61 for these genes. Only few gene ontology categories related to extracellular matrix organization and collagen catabolism were significantly overrepresented. Notably, then restricting the analysis to include only basal tumors with adjuvant chemotherapy, multiple gene ontology categories related to immune response reached statistical significance hinting on the involvement of the immune system. These observations correspond with the trend of advancing immune-mediated therapies in these patients [31].
Although the discussion of each gene associated with survival following chemotherapy is beyond the scope of present study, some interesting observations can be made when examining the list of significant genes. Well-known genes previously linked to chemotherapy response including TOP2A [32], MKI67 [33], ABC efflux pumps [34], or APOBEC3B [35] were related to survival following chemotherapy in ER positive tumors only, and none of these genes reached significance in basal tumors. The best performing genes associated with survival after chemotherapy in basal breast cancer were not significant in ER positive tumors. The reasons for these differences are most probably the different molecular characteristics related to the molecular subtypes. A recent study investigating clinical prognostic factors including TP53 status, grade, size, node positivity, ER and HER2 status, and age found that only nodal status was significantly associated with chemotherapy outcomes [36]. Combined, these results suggest that ultimately molecular and not clinical features will enable the prediction of response for chemotherapy in breast cancer.
While the two selected chemotherapy treated cohorts discussed above cover the largest chunk of breast tumors, there are subcohorts and other combination of clinical features. Our established online platform www.kmplot.com was extended with the entire updated database enabling the future validation and ranking of gene-expression based biomarkers in any sub-cohort of breast cancer. The analysis can help to identify biomarker candidates for subsequent in vitro validation studies [37,38].
We have to mention a limitation of the presented approach. The collected and published clinical characteristics are incomplete for many of the available datasets. As a result, only a fraction of the total samples could be included in the statistical analyses. In addition, some detailed information, including the exact treatment protocol given was almost newer available.
In summary, by performing survival analysis across all genes we identified the best performing genes in chemotherapy treated estrogen-positive/ERBB2 receptor negative breast cancer and in basal breast cancer samples. A reference ranking for all significant genes is presented and the minimal hazard rates to reach clinically robust significance were established. The ranking and the established threshold values help to quickly identify and filter genes related to chemotherapy response in future genetic and transcriptomic studies.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.