Novel Immune-Related Gene Signature for Risk Stratication and Prognosis of Survival in Estrogen Receptor (ER) or Progesterone Receptor (PR) Positive and Human Epidermal Growth Factor Receptor 2 (HER2) Negative Breast Cancer

Background: Although intrinsic molecular subtype has been extensively used, the risk stratication have not been fully elucidated in estrogen receptor (ER) or progesterone receptor (PR) positive and human epidermal growth factor receptor 2 (HER2) negative breast cancer. Methods: RNA transcriptional data from The Cancer Genome Atlas (TCGA), METABRIC and GEO were used. Immune-related genes were obtained from the datasets and literature search. Univariate, lasso regression and multivariate cox regression were employed to identify prognostic immune-related genes and establish the risk signature. Relationships between the risk signature and clinical parameters, tumor-inltrating immune cell abundances and cancer phenotypes were further evaluated. Results: Noted, 102 immune-related prognostic genes were identied in METABRIC dataset by univariate cox analysis. Consecutively, 7 immune genes (SHMT2, AGA, COL17A1, FLT3, SLC7A2, ATP6AP1 and CCL19) were selected as risk signature by lasso regression and multivariate cox analysis. Its performance was further veried in TCGA,GSE20685 and GSE9195 datasets. Multivariate Cox regression indicated that the risk signature was an independent predictor. The prognostic signature showed signicant correlation with intrinsic molecular subtypes, 70-gene signature and tamoxifen resistance signature. The CIBERSORT algorithm revealed that CD4+ memory T cells were signicant higher in low-risk group. Conversely, M0-type macrophages were signicant higher in high-risk group in both TCGA and METABRIC cohorts, which may have effect on the prognosis. Furthermore, we found that low-risk group may be associated with immune-related pathway and high-risk group was with cell cycle-related pathway, which also showed impact on the prognosis. Conclusion: The present study constructed a robust seven immune-related gene signature and established an effective method in risk stratication and prediction of clinical outcome in ER or PR positive and HER2 negative breast cancer.


Introduction
Breast cancer ranks as the rst in incidence rate among female malignant tumors and signi cantly affects women's health. Breast cancer now is considered a heterogeneous disease with different clinical and prognostic features. Though breast cancer has been classi ed into four subtypes with distinct molecular features and clinical outcome [1,2], great heterogeneities were still found in each subgroup, especially in luminal breast cancer. [3,4] The most commonly used luminal A/B classi cation does not fully reveal the heterogeneity in luminal breast cancers and therefore failed to properly address risk strati cation. [5,6] Thus, it is necessary to explore new subtyping for prognostic prediction to guide individualized treatment beyond the existing molecular subtyping.
Gene expression analysis has emerged as a powerful tool to reveal molecular diversity of breast cancer.Recent studies showed that the immune-related gene signature were associated with prognosis in various types of cancer.
[6] A recent study identi ed substantial heterogeneity in immune pro les across and within cancer types in The Cancer Genome Atlas (TCGA) Pan-Cancer datasets. [7] In this study, we pro led immune gene expression of estrogen receptor (ER) or progesterone receptor (PR) positive and human epidermal growth factor receptor 2 (HER2) negative breast cancer in the TCGA, METABRIC and GEO datasets, for whom extensive clinical and transcriptional expression pro le data were collected. We constructed a novel prognostic model, which was then evaluated and validated in independent cohorts for the prognostic robustness. Furthermore, we explored the association between the model and clinical parameter as well as the potential mechanism.

Data processing
We collected publicly transcriptional expression data of breast cancer cohorts from Gene Expression Omnibus ,UCSC xena and TCGA data portals. For METABRIC and TCGA database,we selected those patients with ER or PR positive and HER2 negative breast cancer.
The GSE20685 cohort was derived from a study of gene expression pro ling conducted on fresh frozen breast cancer tissue collected from 327 patients in conjunction with thoroughly documented clinical data.
All clinical and microarray data of these patients can be publicly downloaded at the GEO website (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20685) . The GSE9195 cohort was derived from a study of gene expression pro ling including 77 ER or PR positive and HER2 negative breast cancer patients.All clinical and microarray data of these patients can be publicly downloaded at the GEO website (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9195) Bioinformatic analysis R package genefu was used to classify the PAM50 molecular subtypes and calculate the 70-gene signature and the tamoxifen resistance signature (TAMR13) score of each case based on the gene expression data.
[8] Gene set enrichment analysis (GSEA) was conducted using Molecular Signatures Database (MSigDB) collections by the R package clusterPro er [9] To determine the optimal number of groups, we used the Nbclust and ConsensusClusterPlus R package. [10,11] The CIBERSORT algorithm was performed on transcriptional expression data to estimate the proportions of twenty-two types of immune cells in each case. [12] The ESTIMATE algorithms were used to characterize immune cell composition in a given sample. [13] Gene set analysis was carried out using the GSVA Bioconductor package. [14] We selected gene sets for various immune and cell cycle-related pathways. For each sample, enrichment score of selected gene set was obtained. Limma R package was used to identify signi cantly differential expressed genes (DEGs). [15] The Heat map of the representative DEGs were generated using the package ComplexHeatmap in R version 3.6.1. [16]

Statistical analyses
Statistical analyses were performed using SPSS version 23.0 (IBM, USA), GraphPad Prism version 8.00 (GraphPad Software, USA) and R version 3.6.1 (R Core Team, Vienna, Austria) . Pearson's chi-square test and Fisher's exact test were used to compare the categorical variables and ordered categorical variables Pearson correlation analysis was used to evaluate the association between two continuous variable. Mann-Whitney U tests were performed to evaluate the statistical signi cance within boxplots.Survival analysis was implemented by log-rank test. The Lasso Cox regression model was analysed using the glmnet package. Univariate and multivariate regression analyses were performed with the Cox proportional hazards regression model to determine the parameters that were signi cantly correlated with prognosis.

Identi cation of prognostic immune-related genes
The work ow of this study is delineated in Fig. 1. A total of 2600 genes were identi ed from datasets and literature search. And then, the gene expression pro les from 1369 patients with ER or PR positive and HER2 negative breast cancer identi ed in the METABRIC dataset were used to perform the univariate cox analysis. As a result, 102 genes were discovered to be signi cantly associated with overall survival.
Then we performed the Lasso Cox regression analysis to eliminate the redundant collinearity and further validate the robustness. As a result, 7 genes were identi ed in Lasso regression from the 102 genes. Then we did stepwise multivariate Cox regression analysis of the 7 genes in the METABRIC dataset and ultimately, a prognostic signature comprising these 7 genes, including SHMT2, AGA, COL17A1, FLT3, SLC7A2, ATP6AP1, and CCL19, were selected to construct a prediction model. As shown in the forest plot, SHMT2 and ATP6AP1 were risk factors, whereas the other ve genes were protective factors.
The comprehensive risk score (RS) was imputed as follows: h0(t)*exp(β1X1 + β2X2 +.... + βnXn). The Nbclust and ConsensusClusterPlus analysis showed that the optimal number of clusters was two. Therefore, the median risk score was set as the cut-off value, patients were categorized as two groups, RS-low and RS-high groups. (Fig. 2) Performance of Risk Signature in Luminal breast cancer from METABRIC and TCGA In order to validate the prognostic predicting role of the RS-based group in different datasets, METABRIC, TCGA, GSE20685 and GSE 9195 cohorts were included. The RS performed consistently in predicting longterm outcomes, including overall survival, disease free survival or distant-metastasis free survival. (Fig. 3) In subgroup analysis, the RS also showed excellent predicting ability in luminal A, luminal B, stage II, with or without lymph node, and different age groups in both METABRIC and TCGA datasets. (Fig. 3) Multivariate Cox analysis showed that the risk signature was the independent prognostic factor in both METABRIC and TCGA cohorts, after adjusting for established prognostic variables including TNM stage, age and tumor grade. (Table.1 -2) The Association Between Risk Signature And Clinicopathological Parameters Subsequently, we analysed the relationship between the RS and clinicopathological parameters. It is shown that the RS-high group accounted for the highest percentage (> 80%) in the HER2-positive disease, followed by Luminal-B and Basal subtype in both METABRIC and TCGA cohort. (Fig. 4) In order to explore the relationship between other gene prognostic score and our risk signature. 70-gene prognostic score and tamoxifen resistance signature (TAMR13) were calculated by genefu package. The results demonstrated that both 70-gene prognostic score and TAMR13 were signi cantly higher in the RShigh compared with RS-low group. (Fig. 4)

Correlation Of The Risk Signature With Tumor Microenvironment
To investigate tumor immunity relevance of the risk signature, the association of the RS with tumor purity and the presence of in ltrating stromal/immune cells in tumor tissues were evaluated by applying the ESTIMATE and CIBERSORT algorithm. As showed in Fig. 5, the stromal-and immune-score were both signi cantly higher in the RS-low group. RS was also strongly correlated with several important immunerelated genes such as IL33 and TGFBR2. (Fig. 5) By applying the CIBERSORT algorithm, the relative proportions of 22 immune cell subsets of ER or PR negative and HER2 positive breast cancer in the TCGA and METABRIC datasets were estimated. Consecutively, compared with RS-high group, the in ltration levels of CD4 memory resting T cells was increased, while the level of macrophage M0 was decreased signi cantly in RS-low group. (Fig. 5) We further investigated the prognostic values of the abundance of in ltrative immune cells. The high abundance of CD4 memory resting T cell was signi cantly associated with favourable prognosis (p < 0.001), while the high abundance of macrophage M0 was associated with unfavourable survival (p = 0.021). (Fig. 5) Risk signature is associated with immune and cell cycle related phenotypes In order to further characterize the phenotype contributing to the worse prognosis in the high-risk group, we rstly performed differential expressed gene (DEGs) analysis of RS-high versus low group in the METABRIC and TCGA dataset. Then we performed GSEA using the collection of the MsigDB for these DEGs. As a result,we found that in the RS-low group, the immune-related pathway such as GSE22886_NAIVE_BCELL_VS_NEUTROPHIL_UP and KEGG_CYTOKINE-CYTOKINE RECEPTOR_INTERACTION were signi cantly enriched, while in the RS-high group, the cell cycle-related pathway such as KONG_E2F3_TARGETS and ISHIDA_E2F_TARGETS were enriched. (Fig. 6) The heatmap displayed the expression of the core genes that contribute to pathway enrichment between the two groups. Notably, those cell cycle-related genes were predominantly up-regulated in the high-risk group, while the immune-related genes were signi cantly up-regulated in the low risk group. (Fig. 6) In order to explore the relationship between the enrichment of pathway and survival, we selected gene sets involving immune and cell cycle pathway. A total of 10 gene sets from the MsigDB were selected and the gene set enrichment score of each sample was calculated by the GSVA method. The activity of each gene set in each sample was estimated by the enrichment score. Then the univariate cox analysis were performed. As showed in Fig. 7, the enrichment of cell cycle-related pathway was signi cantly associated with worse survival, and the enrichment of immune-related pathway was signi cantly related with favourable survival.

Discussion
In this study, we provided a novel immune-related 7 gene signature with relevance for the prognosis in the ER or PR-positive and HER2-negative breast cancer. The risk signature was independent of other prognostic factors, including PAM50 and TNM stage. Besides, the group were characterized by speci c composition of immune in ltrate and molecular phenotype, which may contribute to the clinical outcome. Our nding of this novel prognostic model demonstrated a association between tumor phenotype and immune gene expression.
Through dimensionality reduction, survival analysis and unsupervised consensus clustering, we identi ed two groups of patients with substantial prognostic difference in ER or PR-positive and HER2-negative breast cancer. This model (1) provided independent prognostic information (2) associated with speci c immune microenvironment and phenotype.
Our ndings added to the emerging body of evidence that the expression of immune genes could provide additional prognostic information. Xavier Tekpli et al [17] identi ed three groups of patients with distinct levels of immune in ltration. The intermediate cluster is correlated with poor clinical outcome and lower response to therapy. Additionally, proliferation and epithelial mesenchymal transition phenotypes were associated with the immune gene-based model. On the other hand, Bin Zhu et al [18] strati ed the luminal breast cancer into three subtypes based on immune gene expression in an Asian population. One subtype was characterized by higher level of in ltrating lymphocytes, activation of immune checkpoint genes and higher tumor mutation burden.
Our ndings are consistent with prior studies showing that SHMT2 was an important risk factors that contributed to poor clinical outcomes. Proteomics data veri ed that high SHMT2 protein expression was signi cantly correlated with poor OS. [19,20] Besides, ndings from our analysis added more evidence on the protective effect of COL17A1, SLC7A2 and CCL19 expression in breast cancer. [21][22][23] On the other hand, the association between ATP6AP1 ,AGA and FLT3 expression and prognosis in breast cancer were rstly reported.

Conclusion
In summary, we developed and validated a 7-gene prognostic signature based on immune gene expression in ER or PR positive and HER2 negative breast cancer, displaying distinct patterns of prognosis and genomic features. If con rmed, these ndings may have important clinical implications in risk strati cation for precision oncology treatment in this population.  Figure 1 The study ow chart The association of 7-gene risk signature and tumor microenvironment (A) The comparison of stromal score between high-and low-risk group in METABRIC and TCGA cohorts (****: p<0.0001;***:p<0.001) (B) The comparison of immune score between high-and low-risk group in METABRIC and TCGA cohorts (****: p<0.0001;***:p<0.001) (C) Risk signature score was signi cantly correlated with immune genes (D) The comparison of CD4 memory resting T cell score between high-and low-risk group in METABRIC and TCGA cohorts (****: p<0.0001;***:p<0.001) (E) Kaplan-Meier survival curve for patients with high or low CD4 memory resting T cell score in METABRIC cohort (F) The comparison of macrophage M0 cell score between high-and low-risk group in METABRIC and TCGA cohorts (****: p<0.0001;***:p<0.001) (G) Kaplan-Meier survival curve for patients with high or low macrophage M0 cell score in METABRIC cohort Figure 6 Risk signature is associated with immune and cell cycle related phenotypes in ER or PR positive and HER2 negative breast cancer (A) Genes with signi cantly higher expression in high or low risk groups were used in a gene set enrichment analysis using the collections of the MsigDB. The ve most enriched processes in each collection are denoted. The cell cycle-related or immune-related pathways were signi cantly up-regulated in high or low risk group,respectively. (B) The heatmap displayed the selected DEGs in representative pathways. Figure 7