Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

Introduction Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.


Introduction
Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis.

Aim
The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets.

Methods
Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariateadjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA).

Introduction
Microarray analysis has revolutionized our understanding of breast cancer. A molecular classification of breast cancer based on the expression of certain genes has gained acceptance in the last two decades. Luminal A, Luminal B, HER2 positive and Basal subtypes are the major subtypes identified in multiple independent cohorts [1][2][3][4][5], and this classification is being refined further by identification of more subtypes [6][7][8]. Analyses done on microarrays have suggested that genes related to the cell cycle are of great prognostic importance [9][10][11][12][13][14].
However, to the best of our knowledge, a comprehensive analysis concentrated on finding prognostically important cellular pathways and processes in the different breast cancer subtypes has not been done. The few studies which have been done have either tried to extract a prognostic gene signature [15][16][17][18][19][20][21][22] or have focused on a small number of pathways [12,23,24]. Furthermore, pathways which are prognostically important independent of the cell cycle related genes have also not been looked at. A significant short coming of existing gene signature studies using microarray technology are methodological flaws related to non-correction of batch effects in microarray data analysis, which may potentially reduce their robustness [25,26].
In order to understand the natural history of breast cancer, it is important to exclude factors that may impact on survival, and which may have changed over the past few decades. We realized that node negative tumors afforded us the best opportunity to study long term survivors and their gene signatures. At present, breast cancer patients are given a variety of treatments, and the responsiveness of these treatments may have an effect on the clinical end point selected, as well as may themselves be associated with gene expression [27][28][29][30]. Including patients receiving systemic therapy and those deserving but not receiving any such therapy would increase the heterogeneity of the analysis, thus rendering any modeling non-robust, and the results difficult to interpret. This problem is compounded further by the non-availability of the exact treatment regimen in most genomics data sets. The possibility of different treatment regimens being associated in a variable, non-uniform manner with gene expression renders the task of appropriate modeling and interpretation of results even more difficult. In order to exclude such treatment related effects, we selected only systemically untreated patients for our study. This would have the advantage of studying the true tumor related natural history unconfounded by treatment related effects. We included only those microarray data sets from which the batch in which the analysis was done could be deduced, so that the batch effects could be controlled for while analyzing the data.
The present study focused on finding significant pathways and processes associated with distant metastasis in node negative breast cancer. We tried finding not only the main processes associated with distant metastasis in the entire dataset, but also those which were significant independent of cell cycle. Analysis was also carried out on the different molecular subtypes separately.

Data sources and study selection
The Gene Expression Omnibus (GEO) database was queried in June 2014 for node negative breast cancer using the search terms "breast cancer" and "breast neoplasms", combining them with Boolean operator "OR", filtered to include only Gene Expression Series ("gse"), for organism "Homo sapiens" containing "Expression profiling by array". An initial search resulted in 1142 series. From the 1142 series, only those having 50 or more samples were included. This resulted in 317 data series. Of these 317 data series, only 5 contained arrays exclusively from node-negative breast cancer patients without any systemic adjuvant therapy. The rest 312 were excluded because they did not meet our inclusion criteria of node negative patients not receiving any systemic adjuvant therapy and having follow up data of a minimum of 05 years. Of the five remaining series, one did not contain information about the batch or date in which the microarrays were analyzed, and was excluded. Finally we were left with four series, viz. GSE2034 [31], GSE5327 [32], GSE7390 [33], and GSE11121 [34], all having confirmed node negative patients who did not receive any form of systemic therapy. The flow chart summarizing the steps in study selection is given in Fig 1.

Data extraction and synthesis
The raw microarray data from each series was downloaded and pre-processed separately, as recommended by Hahne et al [35]using the frozen Robust Multi-array Average (fRMA) algorithm [36]. The different series were then combined together with respect to our primary end point of interest Distant Metastasis Free Survival (DMFS) or distant recurrence, depending on coding of particular data series. The occurrence of distant metastasis was taken as the event of interest for which time-to-event analyses were done. The series were censored at 10 years after diagnosis.
Prior to using a Cox regression model, the probes were collapsed to genes, by the WGCNA package [37]. Probes which gave the maximum average expression were selected to represent the gene.
Only the four major molecular sub-types were investigated using the PAM50 classifier [5], i.e. Basal, HER2 Positive, Luminal A and Luminal B subtypes.
Six different analyses were conducted combining data from the four studies as follows: (a) Univariate analysis on the entire combined dataset, (b) Multivariate analysis on the combined dataset adjusting for expression of cell cycle related genes, (c) Univariate analysis on the HER2 subtype extracted from the combined dataset, (d) Univariate analysis on the basal subtype extracted from the combined dataset, (e) Univariate analysis on the Luminal A subtype extracted from the combined dataset, and, (f) Univariate analysis on the Luminal B subtype extracted from the combined dataset.
To adjust for expression of cell cycle related genes for the multivariate analysis done in analysis (b) listed above, the AURKA module score [12] as implemented in the genefu package in Bioconductor and R was added to the Cox regression model. Each analysis (a-f above), was evaluated by Cox regression models which included the gene expression as a fixed effects variable and one of the following as a random effects variable: i) Batch (as estimated from the scan date of the original microarray analysis), ii) Study series, iii) Centre or institution, iv) Batch as a nested effect in the Series, and, v) Batch as a nested effect in the Centre.
A sixth model (vi) was also performed as a control which did not include the batch or centre (a simple marginal model with the gene expression as the sole variable).
Therefore, in each analysis (a-f above), for each gene, six different Cox regression models were evaluated. For each analysis, the model which gave the lowest median Akaike Information Criterion (AIC) [38] value was selected as the most appropriate model for that particular analysis.
The coefficients obtained from the most appropriate Cox regression models (for each analysis) were then ranked according to their value. Since the standard GSEA (Gene Set Enrichment Analysis) currently lacks the software implementation for adjusting for batch effects, as well as being incapable of performing time-to event analysis, hence the coefficients obtained were subjected to a pre-ranked GSEA [39,40].
The following downloaded gene set collections were used in the present study: i) The gene sets representing the Canonical pathways (presented as C2: CP gene set collections), which included gene sets from the Biocarta [41], KEGG, [42,43] Reactome [44] and Protein Interaction Database [45] repositories, and, ii) The Gene Ontology [46,47] gene sets (presented as C5 gene set collections).
A Family Wise Error Rate (FWER)-adjusted p-value of 0.05 was selected as our level of significance for the pre-ranked GSEA. Ten thousand permutations were used (default being one thousand); the default setting was kept for all the other parameters of the GSEA study. The gene sets which were found significant in the pre-ranked GSEA analysis were then extracted.
The entire workflow is described in Fig 2. Visualization of the relationship between the different significant gene sets from the preranked GSEA above was done by means of the Enrichment Map Cytoscape plugin [48], which groups the highly redundant gene sets in clusters. The Overlap coefficient [48], at a threshold of 0.5, was used for clustering the gene sets, as recommended in the Enrichment Map manual.
The "leading edges" of significant gene sets from each of the six analyses (a-f) were then studied after the pre-ranked GSEA. Genes which were the most commonly distributed among the leading edges of the significant gene sets were identified for each analysis.
A standard GSEA was performed again at the end to find the relationship between each gene set included in each analysis (a-f above) and the AURKA module score for each data series. For this analysis, the "Pearson" metric for ranking genes was used, as recommended by the GSEA user manual; no correction was made for the batch effects, and one thousand permutations were used in the analysis of each data series. A final adjusted p-value was calculated by combining the p-values obtained from the four studies by the Stouffers method [49]. The FWER was calculated by the method of Holm [50]. An FWER-adjusted p-value of 0.05 was selected as our level of significance.
All the gene sets used in the GSEA were downloaded from the website of the Broad institute [57]: The GSEA analyses were carried out by GSEA2-2.0.14 software downloaded from the Broad institute.

Results
The Clinical characteristics of the patients in the data series are given in Table 1, while the tumor-related characteristics are detailed in Table 2.
The comparability of the expression values from the different samples was visualized using boxplots. fRMA pre-processed expression data were seen to be comparable between samples, with the distribution of the expression values between different arrays being similar to one another (S1 Fig). This supported the contention that the fRMA pre-processed data from different series could be combined together for analysis.
Of the various mixed models considered, the model with just batch effects as random variable (Cox regression model (i) in Methods above) gave the lowest AIC values for the analyses from (a)-(d) and analysis (f) above. For analysis (e), i.e. Univariate analysis on the Luminal A subtype, the cox regression model with Batch nested in Centre as RE Variable (model (v) in methods above) gave the lowest AIC values. These models (which gave the lowest AIC value for an analysis) were selected as the most appropriate model for that particular analysis.
Details of the AIC values are shown in S1 Table. Note on interpretation of the GSEA results: GSEA is a competitive gene set analysis method. In this context, a gene set negatively associated with metastasis means that it is more negatively associated with metastasis than random gene sets formed from genes other than those in the gene set tested. A gene set positively associated with metastasis means that it is more positively associated with metastasis than random gene sets formed from genes other than those in the gene set tested.

I) Results of the analysis on the entire combined dataset
When the combined dataset was analyzed in a univariate setting (analysis (a) in Methods above), gene sets related to protein translation were found to be negatively associated with distant metastasis (good prognosis), as depicted in Table 3 and Fig 3; while gene sets related to the Cell cycle (i.e. proliferation-related gene sets) were the only gene sets to be significantly positively associated with metastasis (poor prognosis) ( Table 4, Fig 3).
On performing the multivariate analysis adjusted for the Cell cycle by adding the AURKA module to the Cox regression model (analysis (b)), gene sets representing the integrin1 pathway and the proteinaceous extracellular matrix were positively associated with metastasis ( Table 5).

II) Results of the subtype analysis
The results for the gene sets significantly associated with distant metastasis in the individual molecular subtypes are detailed below. Univariate analysis on HER2 tumors (Analysis (c) in Methods). Gene sets related to interferon gamma signaling was the only gene set found negatively associated with distant metastasis (Table 6). No gene sets were found to be positively correlated with distant metastasis.
Univariate analysis on Basal tumors (Analysis (d)). Gene sets representing protein translation were the only sets significantly positively associated with metastasis ( Table 7, Fig 4).
Univariate analysis on Luminal A tumors (Analysis (e)). Gene sets representing translation were negatively associated with distant metastasis ( Table 6, Fig 5). One gene set representing Kinesins (thus possibly related to mitosis)were significantly positively associated with metastasis (i.e. negatively associated with survival) ( Table 7, Fig 5) Univariate analysis on Luminal B tumors (Analysis (f)). Gene sets representing translation were negatively associated with distant metastasis ( Table 6, Fig 6). Gene sets associated with cell cycle (proliferation) dominated the sets significantly positively associated with metastasis (i.e. negatively associated with survival) ( Table 7, Fig 6).

III) Results of the leading edge analysis
Univariate analysis on the entire combined dataset. Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the gene sets significantly associated with good prognosis, while genes encoding for replication proteins A1 and A3, DNA directed polymerase and Cyclin dependent Kinase 2 were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Tables 8  and 9).  Multivariate analysis on the entire combined dataset. Genes encoding for various types of collagens, fibrillin 1 and laminins were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Table 9).
Univariate analysis on HER2 tumors. Since only one gene set was found significant in our analysis, a leading edge analysis could not be done on HER2 tumors.
Univariate analysis on the Basal tumors. Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Table 9).
Univariate analysis on the Luminal A tumors. Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the significant gene sets associated with good prognosis (Table 8). Only one gene set (Reactome_Kinesins) was found significantly associated with bad prognosis in Luminal A tumors; hence a leading edge analysis could not be done for bad prognosis gene sets in Luminal A tumors.
Univariate analysis on Luminal B tumors. Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the gene sets significantly associated with good prognosis, while genes encoding for proteins involved in cell division (including cyclins, cyclin dependent kinases, proteasomes and replication factors) were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Tables 8 and 9). IV) Results of the GSEA performed for the relationship between the AURKA module score and the other gene sets In each data series, as well as in the combined dataset, a strong negative association was found between the AURKA module score and the gene sets related to the extracellular matrix ( Table 10). The association of the gene sets representing translation did not reach statistical significance (FWER 0.05), though they were consistently expressed in a direction opposite to that of the AURKA score.

Discussion
Most studies done on microarray data in breast cancer have found Cell cycle-related genes to have prognostic importance in breast cancer [9][10][11][12][13][14]58]; some studies have found the importance of immune related networks [9,16,19] and RNA splicing [9]. However, the prognostic importance of many processes may have been missed because: i) the aim of many of the studies was to find a gene signature predictive of outcome, and not the important cellular pathways and processes [15][16][17][18][19][20][21][22], and, ii) other studies only studied a few processes [12,59]; or did not include gene sets from many databases [23]. In contrast to the above studies, the present study focused on biological pathways and processes. Different datasets were combined to make the results more robust, with adjustment for study-specific differences being made by random effects models. Furthermore, batch effects were also treated as random effects, in order to adjust for any potential confounding effects thereof. The aim of the present study was to find processes and pathways associated with distant metastasis in the different molecular subtypes of breast cancer, as well as those processes associated with DMFS which were important in breast cancer taken as a whole.
The important pathways and processes were estimated from the entire list of genes ranked in order of their prognostic strength in each molecular subtype as well as overall breast cancer cases analyzed. In order not to miss out any important processes from the analysis, multiple gene sets were tested. These gene sets represent biological processes curated by multiple independent reliable sources, as mentioned in Methods above. The appropriate design in the present case is a time-to-event analysis. Of the many gene set analysis implementations, we had to select one which could make sense of a time-to-event analysis and yet adjust for batch effects. Therefore it was felt suitable to perform mixedeffects Cox regression on each gene after adjustment for Centre and batch effects as a random variable, and then analyze the ranked list of the Cox regression coefficients.
A pre-ranked Gene Set Enrichment analysis (GSEA) [39,40] was preferred to analyze the ranked list, since the alternative(over-representation analysis [60] using the hypergeometric  test) is limited by an arbitrary cut-off [61]value for gene selection and the statistical significance is also related to the size of the gene-set in the analysis. It may be noted that the standard GSEA currently lacks the software implementation for adjusting for batch effects, as well as being incapable of performing time-to event analysis.
To counteract the false positives, a strict criterion for significance was adopted by keeping the Family-wise Error Rate (FWER) 0.05. Pre-ranked GSEA results use gene set permutation rather than the more stringent phenotype permutation used by standard GSEA. The GSEA User Guide [62] suggests a cut-off of False Detection Rate (FDR) of 0.05 for analyses done with gene set permutation. This study consists of multiple independent analyses, which necessitate a stringent cut off to exclude false positives. An FDR of 0.05 means that in any given single analysis, the total number of false positives would be 5%; however, an FWER 0.05 means that in any given single analysis, the probability of having even one single false positive is not more than 5%. Therefore, we have used this more stringent threshold than suggested by the GSEA user guide. An even stricter cut off may have resulted in missing out many significant gene sets. To make the results of the pre-ranked GSEA robust, ten thousand permutations were used for finding the prognostically important gene sets. However, for the standard GSEA used to find gene sets associated with the AURKA module score, we were more interested in the general direction of association in each of the four data series, hence the standard recommended one thousand permutations were thought to confer sufficient robustness to the results.
Probably the most striking aspect of the results is the association of genes involved in protein translation with distant metastasis in breast cancer. Although the importance of protein translation in the molecular biology of cancers is well known, the association of this process with the prognosis of breast cancers is under appreciated. To our knowledge, this is possibly the first study to report on the prognostic importance of gene sets related to protein translation in breast cancer subtypes. On univariate analysis of the entire dataset, gene sets representing protein translation and the immune system were the only gene sets positively associated with survival (negatively associated with metastasis, i.e. having a good prognosis). As expected, gene sets representing cell cycle (or proliferation) were found to be negatively associated with survival in this analysis The leading edges of the significant translation-related gene sets in the different analyses were predominantly populated by genes encoding for ribosomal proteins. On adjustment for proliferation in the multivariate analysis, protein translation ceased to retain its significance. On first impression, it may seem that the association of protein translation with DMFS is spurious. However, on univariate analysis of the individual molecular sub-types, it was found that these gene sets were associated with DMFS in the Luminal A and Basal sub-types, even though in such analyses proliferation-related gene sets were not found to be associated with  Table 9. Genes which are found in a large number of leading edges of significant gene sets positively associated with distant metastasis (bad prognosis) for each analysis.   (Tables 6 and 7). Thus, protein translation seems to have an association independent of cellular proliferation with survival in certain molecular sub-types of breast cancer. It was interesting to note that the direction of association of gene sets related to protein translation with metastasis was different in the different sub-types. In the Luminal A and B tumors, protein translation was found to be negatively associated with metastasis (good prognosis), but in the Basal subtype, it was found to be positively associated with the same (poor prognosis). This was corroborated by the high degree of overlap of the leading edge genes which were associated with favorable prognosis in Luminal A and B tumors, while being associated with a poor prognosis in Basal tumors.
Why translation should have differing prognostic effects in different sub-types is unclear. We offer a provisional hypothesis that the prognostic effect of translation might be related to the deranged proteins in the cells. Increased protein translation in Basal tumors might imply greater translation of mutated proteins which cause further dysregulation of growth (leading to a worse prognosis); while increased protein translation in the Luminal tumors might imply an increased expression of normal proteins which tends the cells towards homeostasis (leading to a better prognosis). Thus, different protein translation derangement mechanisms may confer different prognostic significance in the various breast cancer subtypes.
To our knowledge, ours is the first study to link protein translation as a process/pathway to clinical outcomes in breast cancer. However, since ours is an exploratory analysis, our findings need to be confirmed on larger datasets. However, such a novel finding is intriguing and may well give us insights about the oncobiology and natural history of breast cancer and its sub-types.
The different molecular subtypes have different expressions of proliferation-related genes, as a result of which the range of expression of these genes are reduced in the different subtypes. Therefore, it is not surprising that, in our study, cell cycle has a relatively lower prognostic importance in many of the individual molecular subtypes, as compared to analysis on the entire breast cancer cases. Even then, proliferation retained its prime position in Luminal B tumors. A gene set representing Kinesins also attained significance in Luminal A tumors. The genes in the leading edge of this gene set are all kinesins which play an important role in mitosis.
Only a few studies have found the extracellular matrix (ECM) to be of prognostic importance in breast cancer [9,63,64]. One study [9] found that a module made of ECM genes was related to a prognostic gene signature. However, the above mentioned study finally selected Proliferation, Immune Response and RNA splicing as the main cellular events predictive of outcome in breast cancer. Some proteins related to ECM were identified as having prognostic importance in breast cancer by network analysis [63]; even then, the importance of the ECM was understated. Only one study [64] found gene sets related to ECM to be of primary importance.
In our study, the ECM-related gene sets did not show prognostic association with breast cancer on any univariate analysis. However, when adjusted for proliferation (in the multivariate analysis), there was a positive association between the ECM-related gene sets and metastasis, underscoring a negative prognostic significance of the expression of ECM-related genes in breast cancer. This was bolstered by the finding of collagen-related genes being over-represented in the 'leading edge' analysis performed separately.
We investigated as to why the prognostic effect of ECM-related genes remained hidden in the univariate analysis, while being apparent in the multivariate setting. Considering that the ECMrelated gene sets showed association with metastasis after adjustment for proliferation (Table 5), we hypothesized that there could be suppression of the effect of ECM-related genes due to association of their expression with proliferation-related genes. To test this hypothesis, standard GSEA was performed to assess the relationship between the proliferation-related AURKA module score and all the gene sets in individual data series. This analysis confirmed a strong negative association between the AURKA score and the ECM gene sets, i.e. increased expression of ECM-related genes is associated with decreased expression of the proliferation-related genes, and vice versa (Table 10). Since proliferation is known to be one of the strongest markers of poor prognosis, hence the decreased expression of proliferation-related genes might mask the poor prognostic association of the increased expression of ECM-related genes, thus explaining our seemingly paradoxical results. This negative correlation between the expression of ECM-related genes and proliferation may also be a possible explanation for the prognostic non-significance of ECMrelated genes in other studies which were not adjusted for proliferation. The overexpression of ECM and integrins being associated with poor prognosis (Table 5) makes biological sense, as they represent a part of the molecular mechanism of metastasis as is presently known [65].
A few caveats while interpreting the results have to be made: i) The study consisted of systemically untreated node negative females pooled from various studies which differed in their time periods from the 1980s through the 2000s. This has the advantage of having a more homogeneous cohort compared to random datasets having a variety of systemic treatments which would have complicated the analysis greatly. However, this criterion may itself lead to a dataset consisting of a biased population.
After combining data from the four series, 499 patients had ER positive tumors, while 295 had PR positive tumors-the data series did not allow for an interpretation whether these were mutually exclusive or overlapping in some way. However, despite being hormone receptor positive, none received any adjuvant hormonal therapy, the reasons for the exclusion from systemic treatment being unknown. Similarly, there were 325 patients who had T2 or larger tumors, but who, for unknown reasons, did not receive chemotherapy. We do not know whether the characteristic of the patients which led to their exclusion from systemic therapy would be a confounding variable in itself, as well as the possibility that since they did not receive any systemic therapy that itself could have affected their metastasis free survival in any way. It is certainly probable that the characteristics of patients and tumors (for instance, preferential evolutionary clonal selection [66]) in the different datasets changed with time as systemic treatment became more prevalent and few patients were recommended no systemic treatment.
ii) Since the population consisted of patients who were systemically untreated, the present study gives a good picture of the natural history of breast cancer unmodified by any systemic iatrogenic factor. However, the very same characteristic causes it to be unsuitable for the formation of a prognostic gene signature; prognosis being a much more complex end process, having highly multi-factorial stochastic and causal elements to it. At present, breast cancer patients are given a variety of treatments, and the responsiveness of these treatments may themselves be associated with gene expression [27][28][29][30].
iii) A third limitation of the study concerns the relatively small sample size of this study. Despite combining four large studies in the database, our study had 742 patients, and relatively little number of events of interest (i.e. distant metastases) had occurred at the time of data censoring. At the time of data censoring, a total of 200 events of interest had occurred-52, 34, 47 and 59 for the Basal, HER2, Luminal A and Luminal B tumors respectively-which we feel might represent a complex interplay between differences in tumor biology and changing patterns of survival over the last few decades.

Conclusion
This study, in addition to confirming the known prognostic role of proliferation in breast cancer, to the best of our knowledge, reports for the first time, the hitherto unappreciated prognostic effect of translation in breast cancer and its various molecular sub-types. It also shows the opposing prognostic association of the translation associated genes in the various breast cancer subtypes. Finally, the study highlights the independent prognostic significance of the genes related to integrin1 pathway and the extracellular matrix. This is possibly the first study to demonstrate the confounding effect of proliferation. Importantly, this study showed how the significant prognostic effects of biologically meaningful pathways and processes may be hidden by their association with a prognostically strong confounder (proliferation).
This study was undertaken as we felt that previous studies had methodological flaws, and no one has yet reported any results by combining different series in a single large dataset using a meta-analytic approach for the different breast cancer subtypes. The primary aim of this study was to better understand the processes and pathways affecting prognosis in breast cancer, and possibly identify novel pathways of interest for further analysis and hypothesis generation, rather than making a predictive signature. Confirmatory studies, preferably on large datasets, are needed to validate the findings of this study.
Supporting Information S1 Fig. Boxplot of the entire microarray data before and after fRMA processing. The yaxis of the data before pre-processing is plotted after log 2 transformation. Data after fRMA preprocessing are log 2 transformed during pre-processing, hence no further log transformation is done.The different colors represent different data series (orange for GSE2034, blue for GSE5327, yellow for GSE11121 and green for GSE7390). (TIFF) S1 Folder. A zipped folder containing the R script for the analysis along with the follow up data, instructions for analysis, and the.rnk files as well as text files having the GSEA parameters with which to reproduce the results of the present study. (ZIP) S1 Spreadsheet. Excel spreadsheet giving the full Leading edge results. 1 represents membership of the gene in the geneset, 0 denotes non-membership. A total denoting total number of leading edges the gene is part of is also given. (XLSX) S1 Table. Supplementary Appendix 1 shows the AIC values for each Cox Regression model used in each analysis. In each analysis, Gene Expression was the fixed effects variable. In analysis (b) the AURKA module score was an additional fixed effects variable. For any given analysis, the particular model having the least median AIC values (indicated in bold red) was chosen as the best model, for the respective analysis, to be used in the pre-ranked GSEA.