Novel immune-related genes in the tumor microenvironment with prognostic value in breast cancer

Breast cancer is one of the most frequently diagnosed cancers among women worldwide. Alterations in the tumor microenvironment (TME) have been increasingly recognized as key in the development and progression of breast cancer in recent years. To deeply comprehend the gene expression profiling of the TME and identify immunological targets, as well as determine the relationship between gene expression and different prognoses is highly critical. The stromal/immune scores of breast cancer patients from The Cancer Genome Atlas (TCGA) were employed to comprehensively evaluate the TME. Then, TME characteristics were assessed, overlapping genes of the top 3 Gene Ontology (GO) terms and upregulated differentially expressed genes (DEGs) were analyzed. Finally, through combined analyses of overall survival, time-dependent receiver operating characteristic (ROC), and protein-protein interaction (PPI) network, novel immune related genes with good prognosis were screened and validated in both TCGA and GEO database. Although the TME did not correlate with the stages of breast cancer, it was closely associated with the subtypes of breast cancer and gene mutations (CDH1, TP53 and PTEN), and had immunological characteristics. Based on GO functional enrichment analysis, the upregulated genes from the high vs low immune score groups were mainly involved in T cell activation, the external side of the plasma membrane, and receptor ligand activity. The top GO terms of the upregulated DEGs from the high vs low immune score groups exhibited better prognosis in breast cancer; 15 of them were related to good prognosis in breast cancer, especially CD226 and KLRC4-KLRK1. High CD226 and KLRC4-KLRK1 expression levels were identified and validated to correlate with better overall survival in specific stages or subtypes of breast cancer. CD226, KLRC4-KLRK1 and other new targets seem to be promising avenues for promoting antitumor targeted immunotherapy in breast cancer.


Background
Breast cancer is one of the most frequently diagnosed cancers among women worldwide [1]. With improvements in early detection and treatment, the number of deaths from breast cancer has continuously declined since 1990 [2]. However, in the United States, an estimated 268,600 women will still be diagnosed with breast cancer, and 41,760 women will die of breast cancer in 2019 [3].
Multidisciplinary approaches, including chemotherapy, endocrine therapy, molecular targeted therapy, radiotherapy, and surgery, are commonly used to improve breast cancer patient survival. However, not all patients benefit from combined treatment, and as many as 40% of women with breast cancer will still be resistant to currently available targeted therapy approaches [4]. At present, numerous issues remain unresolved, and there is a lack of adequate evidence to fully clarify the mechanism of breast tumorigenesis. In recent years, obvious alterations in the tumor microenvironment (TME) have been increasingly recognized as key in the development and progression of breast cancer.
The TME consists of tumor cells, stromal cells (such as fibroblasts, endothelial cells, pericytes, myoepithelial cells, and adipocytes), immune cells (such as macrophages, dendritic cells, natural killer (NK) cells, T lymphocytes, and B lymphocytes), extracellular matrix (ECM), signaling molecules and cytokines [5,6]. Stromal cells and immune cells are two major types of nontumor components in the TME, so the stromal or immune components are proposed to be valuable for diagnostic and prognostic assessments of the TME. It was demonstrated that epigenetic alterations generating aberrant gene expression in the cells of the TME are predictive of clinical outcomes [7]. In addition, there is increasing interest in the breast cancer microenvironment as a prognostic factor as well as a potential therapeutic target; thus, a wider assessment of immune responses within the TME by gene expression profiling might effectively predict clinical outcomes [8]. A deeper comprehension of the genetic profile of the breast cancer TME is urgently needed.
In this study, comprehensive bioinformatics analyses were performed to better understand the immune-related genetic profile and determine the relationship between gene expression in the microenvironment of breast cancer and prognosis. Our study will provide applicable information on novel immunological targets from the TME, complement other treatment options for breast cancer and improve the overall survival (OS) of patients.

Data sources and preprocessing
RNA sequencing (RNA-seq) data in fragments per kilobase of transcript per million mapped reads (FPKM), single-nucleotide polymorphism data and clinical followup information such as age, molecular subtype, outcome and survival time of breast cancer were downloaded from The Cancer Genome Atlas (TCGA) database. A total of 1097 cases were obtained from the TCGA dataset. Then, the RNA-seq data were converted into transcripts per kilobase million (TPM) expression profiles. The immune and stromal scores of the samples were calculated by the Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data (ESTIMATE) algorithm via the ESTIMATE R software package. Next, two series of gene expression profiles from the Gene Expression Omnibus (GEO) database (GSE20685 and GSE42568) with clinical data from 431 breast cancer patients were used as the validation datasets. The GEO data were downloaded and normalized by the LIMMA R software package.

Comprehensive assessment of the TME characteristics in breast cancer
Based on the mean stromal or immune scores as the cutoff value, the patients were divided into two groups: the high stromal/immune score group and the low stromal/immune score group. Then, we analyzed the relationships between different stages/subtypes of breast cancer and stromal/immune scores. Kaplan-Meier (K-M) analysis was performed to determine the differences in OS between the high-and low-score groups of the four subtypes of breast cancer. In addition, we detected the correlations between stromal/immune scores and conventional gene mutations.

Identification of differentially expressed genes (DEGs)
Following the high and low score grouping, we screened for DEGs to obtain those with a false discovery rate (FDR) < 0.05 and a log2 fold change (FC) > 1. Heatmaps and cluster maps of the DEGs were generated using the Pheatmap package, and volcano plots were designed by the ggplot2 package in R software. Then, the upregulated genes with high vs low scores were further analyzed with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. GO categories including biological process (BP), molecular function (MF), or cellular component (CC) were analyzed using the EnrichGO function in the clusterProfiler R package and KEGG analysis was performed by the EnrichKEGG function in the clusterProfiler R package, with the parameters pvalue-Cutoff = 0.05 and qvalue-Cutoff = 0.05. A Venn diagram was used to screen the overlapping genes between the upregulated DEGs from high vs low scores and the top 3 GO terms by the online tool VENNY 2.1 (https://bioinfogp.cnb.csic.es/tools/ venny/index.html).

Construction of the protein-protein interaction (PPI) network
The PPI network was established in the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database and reanalyzed by Cytoscape software. The medium confidence was set as 0.4. Only individual networks with 10 or more nodes were included for further analysis. The connectivity degree of each node of the network was calculated. The logFC value of the gene expression was used to reflect the color of each node, and the size of the node represented the number of proteins interacting with the designated protein. Molecular COmplex Detection (MCODE) was then used to locate the central gene cluster.

Screening immune-related genes with prognostic value
To identify genes with prognostic value in the TME, we performed univariate survival analysis and used K-M analysis to analyze the relationship between the expression of these genes and the OS of breast cancer patients. Then, a time-dependent receiver operating characteristic (ROC) curve was generated to assess the predictive accuracy.

Statistical analysis
All data in the present study were performed in R software (version 3.6.1; https://www.r-project.org/). The stromal or immune scores were calculated by ESTIMA TE package. Survival curve was constructed by K-M analysis and compared by log-rank test, which was done to explore the survival differences in subtypes or stages of breast cancer and to identify the prognosis of selected genes using the GraphPad Prism 6.0 software (GraphPad Software Inc., La Jolla, CA, USA). Heatmaps, volcano plots, GO enrichment, KEGG pathway and Timedependent ROC curve were plotted by R software. Time-dependent ROC curve as done with the TimeROC package. The confidence index of AUC was calculated by combined with Survival and TimeROC package. All statistical tests were two-sided and P-value< 0.05 was considered statistically significant.

Results
Immune scores and stromal scores are closely related to breast cancer subtypes but not associated with breast cancer stages In our study, the stromal scores ranged from − 2164.14 to 2050.55, and the immune scores were distributed between − 1724.88 and 3459.35. First, we assessed the relationships between stromal scores or immune scores and different stages of breast cancer. However, neither stromal scores nor immune scores showed significant differences in different stages of breast cancer (Fig. 1a, b). Then, we explored whether stromal scores or immune scores were correlated with breast cancer subtypes. The average stromal scores of the basal-like subtype were the lowest of all four subtypes (Fig. 1c); however, the average immune scores of the basal-like subtype ranked highest (Fig. 1d). In addition, the immune scores of the HER2enriched subtype were higher than those of the luminal B subtype (Fig. 1d). Finally, we divided the scores into high and low groups and evaluated the survival probabilities of distinct subtypes in both groups of stromal scores and immune scores. Unfortunately, the high-and low-score groups did not show any survival differences in the four breast cancer subtypes (Fig. S1).
Breast cancer gene mutations were correlated with stromal scores or immune scores divided into mutant and wild-type groups, and the distributions of stromal scores and immune scores were plotted based on the status of the mutant genes in breast cancer. When compared with the wild-type group, either the BRCA1 or BRCA2 mutant group exhibited no significantly difference in the stromal scores and immune scores ( Fig. 2a, b, f, g). However, CDH1 mutant cases had higher stromal scores and immune scores (Fig. 2c, h). Although stromal scores of the PTEN/TP53 mutant and wild-type groups were not statistically significant (Fig. 2d, e), PTEN/TP53 mutant cases exhibited higher immune scores than wild-type cases (Fig. 2i, j). Other mutant genes, such as CHEK2, PALB2, BRIP1, STK11, ATM, BARD1, MLH1, MRE11A, MSH2, MSH6, NBN, MUTYH, PMS1, PMS2, RAD50, and RAD51C, had no significant differences in stromal scores or immune scores between the mutant and wild-type groups (data not shown). Thus, CDH1, PTEN and TP53 mutations were closely associated with stromal/immune scores in breast cancer.

Differential gene expression profiles with stromal scores and immune scores in breast cancer
To determine the correlation between comprehensive gene expression profiles and stromal/immune scores, we analyzed the 1097 cases acquired from the TCGA database. The heatmap shows the top 100 distinct gene expression profiles of cases belonging to the high vs. low score groups (Fig. 3a, b). Based on the stromal score grouping, 2832 genes were upregulated and 253 genes were downregulated in the high vs. low stromal score groups. Similarly, 1930 genes were upregulated and 491 genes were downregulated in the high vs. low immune score groups (p < 0.05, log2 FC > 1). Figure 3c and d show the volcano plot of differentially expressed genes (DEGs) in high vs. low stromal/immune score groups. Next, in order to comprehend the function of DEGs, we selected the upregulated DEGs for GO enrichment and KEGG pathway analyses. The top GO terms identified for the upregulated DEGs from the high vs. low stromal score groups were extracellular structure organization, collagen-containing extracellular matrix, and receptor ligand activity (Fig. 4a). In addition, the top GO terms of the upregulated DEGs from the high vs. low immune score groups were T cell activation, external side of the plasma membrane, and receptor ligand activity (Fig. 4b). Moreover, the top KEGG pathways of the upregulated DEGs from the high vs. low stromal score and high vs. low immune score groups were neuroactive ligand-receptor interaction and cytokinecytokine receptor interaction, respectively (Fig. 4c, d).
For deeply exploring the gene profiles of the top GO terms, we integrated the top 1 term of the BP, MF, or CC categories as gene datasets. Then, we screened the overlapping genes between upregulated DEGs and the gene datasets (Table S1). Venn diagrams showed that 225 genes overlapped between the upregulated genes from the high vs. low stromal score groups and the gene datasets (for convenience of description, we named these genes overlapping 1 genes), and 169 genes overlapped between the upregulated genes from the high vs. low immune score groups and the gene datasets (for convenience of description, we named these genes overlapping 2 genes) (Fig. 4e, f).
The screened immune-related DEGs are associated with good prognosis in breast cancer We performed a univariate survival analysis to reveal the relationship between the selected overlapping genes and prognosis in breast cancer patients from TCGA. Although the overlapping 1 genes did not show statistical significance in the OS analysis, the overlapping 2 genes exhibited highly significant differences in the prognosis of breast cancer, so we chose the overlapping 2 genes for further analysis. Using p value < 0.05, 54 genes related to good prognosis were included. Then, a time-dependent ROC curve (area under the curve (AUC) > 0.6) was employed to assess the prediction accuracy, and 31 genes with prognostic value in breast cancer remained. To better understand the interplay among the identified 31 genes, we performed PPI analysis, and Fig. 5a shows the overall PPI network, which included 29 nodes and 110 edges (Fig. 5a). Then, we selected the extremely critical module containing 15 nodes for further analysis (Fig. 5b).
High expression of CD226 and KLRC4-KLRK1 results in a better prognosis in stage II, stage III or luminal B breast cancer To further explore the identified CD226, KLRD1, and KLRC4-KLRK1 genes with prognostic value, we analyzed the expression of these genes and their association with OS in different breast cancer stages or subtypes. Although the high expression of KLRD1 was not related to a better prognosis in any stage or subtype variation (data not shown), the expression of CD226 and KLRC4-KLRK1 displayed a strong correlation with OS for stages or subtypes. High CD226 expression was related to better prognosis in stage II and stage III breast cancer, as well as in luminal B breast cancer (Fig. 8a-c). Similarly, despite not being significant for stage II (Fig. 8d), high KLRC4-KLRK1 expression revealed preferable survival probabilities in stage III breast cancer and in the luminal B subtype (Fig. 8e, f). In other stages or subtypes of breast cancer, CD226 and KLRC4-KLRK1 expression was not significantly related to OS (Fig. S2).  Validating that the prognostic genes identified from the TCGA database are equally significant in the GEO database To validate whether the prognostic genes identified from the TCGA database are equally significant in the GEO database, two gene profiles from GEO (GSE20685 and GSE42568) were used as validation datasets. The gene expression data of 327 cases from GSE20685 and 104 cases from GSE42568 with clinical information were downloaded and evaluated for OS. Data from GSE42568 demonstrated that CD226 was related to good prognosis in breast cancer patients, while KLRC4-KLRK1 did not show any correlation with breast cancer OS (Fig. 9a, b). However, the data from GSE20685 showed the reversed results; instead of CD226, KLRC4-KLRK1 was related to good prognosis in breast cancer (Fig. 9c, d). Next, we employed the data from GSE20685 to better confirm the relationships between gene expression and OS in different cancer stages (Fig. 9e-j), and found CD226 was firmly associated with good prognosis in stage II breast cancer patients. Thus, our validation datasets partly proved that high CD226 and KLRC4-KLRK1 expression levels would predict satisfied overall survival of breast cancer, especially in specific stages.

Discussion
The TME is crucial in tumor initiation, progression and drug resistance in breast cancer, and the infiltration of nontumor cells in the TME was calculated by the ESTI MATE algorithm. The ESTIMATE algorithm developed based on gene expression is valid and effective in various cancers, such as prostate cancer, breast cancer, and colon cancer [9,10]. By utilizing the breast cancer cohorts in the TCGA database and ESTIMATE algorithmderived scores, we first analyzed the relationship between stromal/immune scores and different stages or subtypes of breast cancer, found that stromal/immune scores were highly correlated with subtypes of breast cancer, especially in the basal-like subtype. In addition, the luminal B and HER2-enriched subtypes also showed relevance with immune scores. These findings demonstrated that TME variation correlated with subtypes of breast cancer and possessed immunological characteristics, which was consistent with the findings of a previous study [11]. However, the high immune-related scores did not predict a better prognosis in breast cancer subtypes. This is because prognosis is more influenced by various factors, such as age, race, never being pregnant or having a first child after age 30, type of surgical procedure, initial tumor size, clinical lymph mode status, and neoadjuvant chemotherapy [12,13]. Moreover, when considering the tumor stroma, cancer-associated fibroblasts (CAFs) are the major constituent of tumor stroma, take part in induction of tumor progression in breast cancer. Mounting evidence indicates that CAFs are heterogeneous, like tumor cells, among different subtypes or in the same subtypes of breast cancer which patients have different length of survival [14,15]. Currently, the potential role of CAFs as a predictive biomarker of tumor prognosis is still debated.
Second, we analyzed the relationship between the TME and gene mutations. Although our data demonstrated that CDH1, TP53 and PTEN mutations were closely associated with the TME, BRCA1 and BRCA2 mutations did not show significant differences in TME variation. BRCA1 and BRCA2 mutant genes predispose individuals to an elevated risk of breast cancer, and those with a family history of cancer are recommended to undergo gene detection based on the National Comprehensive Cancer Network (NCCN) guidelines [16]. Thus, BRCA1 and BRCA2 mutations are not highly frequent in sporadic cases, the design of upcoming trials stratifying patients by BRCA status to avoid potential bias are needed. CDH1, TP53 and PTEN are tumor suppressor genes, their alteration presents poor survival and worse prognosis in breast cancer [17][18][19]. Our study findings coincide with those of a previous exploration that after the exclusion of BRCA1 or BRCA2, the TME correlates with TP53, CDH1, and PTEN mutations, their mutations even revealed a highly or moderately increased risk of breast cancer [20].
Third, when further exploring the DEGs in the TME, many previous studies have ignored genes with low expression but exhibit high significance in antitumor activity. Thus, to eliminate the factor of tumor cells downregulating the expression of antitumor genes during tumor progression, we focused on the genes of the top GO terms of the upregulated DEGs based on functional enrichment.
Fourth, through step-by-step screening via OS analysis, time-dependent ROC analysis, and PPI network analysis, we revealed that the top GO term genes of the upregulated DEGs from the high vs low immune score groups exhibited better prognosis in breast cancer, which could be explained by the fact that the immune system plays an important role in cancer development and therefore potentially offers novel targeted therapies in antitumor treatment [21]. Ultimately, we found 2 important TME genes with good prognosis (CD226 and KLRC4-KLRK1).
CD226, also known as DNAM-1, is an activating receptor expressed on various immune cells, such as CD4+ and CD8+ T lymphocytes, regulatory T cells (Tregs), monocytes, macrophages, and NK cells [22,23]. CD226 serves as a costimulator, enhances T cell or NK cell activation [22], and exhibits significance in innate/ adaptive immune regulatory networks. When combined with its ligand CD115 or CD112 upregulated in tumor cells [24], CD226 facilitates the cytotoxicity of NK cells [25]. Moreover, in Treg-mediated tumor immune escape, Tregs express relatively high levels of TIGIT and low levels of CD226 compared with effector T cells (Teffs), resulting in a high ratio of TIGIT/CD226 expression and accelerating tumor development. In contrast, augmenting CD226 expression and reversing the TIGIT/  CD226 ratio would predict a good clinical outcome [26]. However, reduction in CD226 expression decreases the immune regulatory capacity, and CD226-deficient CTL or NK cells exhibit markedly less cytotoxic activity against DNAM-1 ligand-expressing tumors [27]. Accumulating evidence has shown that CD226 plays a pivotal role in tumor recognition and cancer immune surveillance [28], even promoting antitumor immune responses mediated by NK and T cells.
Killer cell lectin-like receptor subfamily C4 -killer cell lectin-like receptor subfamily K1 (KLRC4-KLRK1) belongs to killer cell lectin-like receptor family [29] and represents naturally occurring read-through transcription between neighboring KLRC4 and KLRK1. KLRC4 lacks a significant portion of the KLRC4 coding sequence but encodes the KLRK1 protein. Once tumorigenesis occurs, the amount of KLRK1 ligand increases immediately. KLRK1 (or NKG2D) is also an activating receptor expressed by NK cells and T cell subsets, can augment the cytotoxicity of NK cells/T cells or synergize with immune checkpoint inhibitors to eliminate tumor cells [30]. However, tumor cells evoke a range of mechanisms to evade KLRK1 surveillance system detection and impair the clinical benefits of immunotherapy in various cancers [31,32]. The downregulation of KLRK1 hampered NK cell cytotoxicity [33]. Conversely, blocking the shedding of ligands by tumors or the release of KLRK1 ligand-bearing exosomes might restore the expression of KLRK1 receptors on NK cells or T cells and improve their activity [34].
It was reported that the KLRK1 axis is becoming an emerging target in cancer immunotherapy [35,36], and the overexpression of CD226 or KLRK1 on NK cells resulted in efficient anti-sarcoma activity [37]. These studies were in accordance with our exploration to provide a molecular basis for the development of CD226-and KLRC4-KLRK1-targeted antitumor immune therapeutics. Considering that the samples from GEO database were much fewer than those from TCGA database, further exploration should include in more breast cancer patients for verification.
Tumorigenesis is initiated by 3 steps: cancer cell elimination by various immune cells, such as NK cells and CD8+ T cells, then immune pressure leading to the selection of tumor cell variants and finally, immune escape by inhibiting effector cells or inducing tolerogenic cells [38]. Escaping antitumor immunity is a hallmark for the progression of breast cancer. In the TME, tumor cells interact with various types of immune cells by activating immune checkpoint pathways [39,40]. Immune checkpoints (CTLA-4, PD-1/PDL1, LAG3, TIM3 and TIGIT) are orchestrated by a series of costimulatory and inhibitory signaling molecules and then modulate effector T lymphocyte (Teff) activity. Recent advancements in antibodies against immune checkpoints have highlighted the benefits of immune checkpoint inhibitors in both animal studies and clinical trials [41].
In this study, we also detected immune checkpoint genes, including CTLA-4, PD-1, LAG3 and TIM3, and found that their high expression was not associated with good prognosis in breast cancer (data not shown). This finding is probably related to the level of checkpoint gene expression and immune state of the TME [8]. To achieve unbiased results, the identification of immune checkpoint gene expression level is critical. The investigation into the relationship between immune checkpoint gene expression and the TME immune state (such as tumor-infiltrating lymphocytes and the T cell receptor repertoire) would provide key insights into checkpoint blockade therapy. At present, there are multiple antitumor approaches: inverting tumor immunosuppression (for example, employing immune checkpoint inhibitors and the direct induction of the Teff immune response), using immunebased therapies targeting specific immune cell types (including improving cytotoxic efficacy, and promoting immune surveillance through NK cells or Teffs), reducing the number of immunosuppressive myeloid cells, inhibiting Tregs, and altering the function of myeloid cells [7].
Despite combining multidisciplinary treatment strategies, breast cancer patients still have a comparably high mortality. An improved understanding of the immune-related genetic profile of TME in breast cancer and the identification of new immunological targets are critical for improving clinical outcomes. CD226, KLRC4-KLRK1 and subsequent new targets seem to be promising avenues for promoting antitumor targeted therapy in breast cancer.

Conclusions
The exploration of the TME and immunological treatment in breast cancer have become increasingly important in recent years. In our study, although the TME did not correlate with the stages of breast cancer, we verified that it was highly associated with the subtypes of breast cancer and gene mutations (CDH1, TP53 and PTEN) and possessed immunological characteristics. The combined analysis of OS, time-dependent ROC, and the PPI network revealed that the genes of the top 3 GO terms of the upregulated DEGs from the high vs. low immune score groups were associated with better prognosis in breast cancer, and 15 of them were related to good prognosis in breast cancer, especially CD226 and KLRC4-KLRK1. High CD226 and KLRC4-KLRK1 expression levels were identified, and their correlation with better OS in specific stages or subtypes of breast cancer was validated.