Identification of a novel microRNA recurrence-related signature and risk stratification system in breast cancer

Increasing evidence has revealed that microRNAs (miRNAs) play vital roles in breast cancer (BC) prognosis. Thus, we aimed to identify recurrence-related miRNAs and establish accurate risk stratification system in BC patients. A total of 381 differentially expressed miRNAs were confirmed by analyzing 1044 BC tissues and 102 adjacent normal samples from The Cancer Genome Atlas (TCGA). Then, based on the association between each miRNAs and disease-free survival (DFS), we identified miRNA recurrence-related signature to construct a novel prognostic nomogram using Cox regression model. Target genes of the four miRNAs were analyzed via Gene Ontology and KEGG pathway analyses. Time-dependent receiver operating characteristic analysis indicated that a combination of the miRNA signature and tumor-node-metastasis (TNM) stage had better predictive performance than that of TNM stage (0.710 vs 0.616, P<0.0001). Furthermore, risk stratification analysis suggested that the miRNA-based model could significantly classify patients into the high- and low-risk groups in the two cohorts (all P<0.0001), and was independent of other clinical features. Functional enrichment analysis demonstrated that the 46 target genes mainly enrichment in important cell biological processes, protein binding and cancer-related pathways. The miRNA-based prognostic model may facilitate individualized treatment decisions for BC patients.


INTRODUCTION
Breast cancer (BC) is most commonly malignancy and lead to main cancer death among females worldwide [1][2][3][4]. Although the improvement of BC prognosis has been made, most of BC-related deaths are caused by tumor relapse or progression [5]. BC has been recognized as a heterogeneous disease that displays distinct differences in respect of biological behavior, gene expression profiles and survival outcome, even in the same tumor-node-metastasis (TNM) stage [6]. prediction of survival for each patient. Hence, there is an urgent clinical need to establish a practical tool incorporating molecular markers and other prognostic factors for accurately predicting survival in BC patients.
MicroRNAs (miRNAs) are small, noncoding RNAs that regulate multiple cellular processes, such as cell differentiation, apoptosis, and cell-cycle progression [8,9]. A growing number of researches have shown that the differentially expressed miRNAs (DEMs) might serve as prognostic molecular biomarkers for various tumors [10][11][12][13][14][15][16][17][18][19][20]. Traditional clinicopathological risk factors were unable to clearly distinguish between BC patients who have a low or high risk of relapse. Thus, a comprehensive approach of integrating DEMs and other prognostic factors to achieve reliable risk stratification is highly necessary. Nomogram, which is a visual statistical model, can provide a numerical probability of a clinical event for each patient [21]. Furthermore, nomogram enables to make individualized estimates of survival to aid clinical decision making.
Consequently, we aimed to develop and validate a novel multi-miRNA-based model incorporating recurrencerelated miRNAs and other risk factors for evaluating disease-free survival (DFS) and make effective risk stratification in BC patients.

Patient characteristics
In this study, a total of 897 patients with invasive BC from The Cancer Genome Atlas (TCGA) database were selected. Table 1 showed baseline characteristics of the derivation, and the internal validation sets. No significant difference of baseline characteristics were observed between the two data sets (all P > 0.05). The median age of the patients were 58 year (interquartile range [IQR]: 48-66) and 56 year (IQR: 47-66) in the two independent cohorts. The 5-year DFS rates of the patients were 0.823 (95% CI: 0.780-0.858) and 0.864 (95% CI: 0.809-0.905) in the derivation and the internal validation cohort, respectively.

Construction of miRNA-based risk score and prognostic model
After edgeR filtering (false discovery rate [FDR] < 0.05 and log2 fold change [log2FC] ≥ 1) between 1044 BC samples and 102 adjacent normal tissues, we screened out 381 DEMs from 1601 miRNAs expression profiles. Of these DEMs, 273 miRNAs were upregulated, and 108 miRNAs were downregulated. And the 1601 miRNAs were visualized via volcano plot in Figure 1.
Risk score = 0.495 × expressionhsa-miR-3145 + 0.245 × expressionhsa-miR-3178 + 0.100 × expressionhsa-miR-1293 -0.409 × expressionhsa-miR-4522. The results of the univariate and multivariate CPHR analyses in the derivation cohort were listed in Table 3. On the basis of multivariate CPHR analysis, we confirmed two independent risk factors for DFS (P < 0.05), including TNM stage, and miRNA recurrence-related signature. It should be pointed out that T stage and N stage were associated with TNM stage, known as multicollinearity, could affect the beta coefficients on multivariate CPHR analysis, giving rise to spurious associations and unreliable results [21]. Therefore, T stage and N stage were not entered into multivariate CPHR analysis. To help clinicians with a quantitative tool for individualized risk prediction of DFS, we developed a novel prognostic nomogram that incorporated the miRNA recurrence-related signature and TNM stage to predict 5-year DFS in BC patients ( Figure 2).

Evaluate the predictive performance of the miRNAbased prognostic model
The established miRNA-based prognostic nomogram was shown in Figure   a total nomogram score. With the optimal cutoff total scores (1.7255) determined via X-tile software [22], patients were stratified into the low-risk group (n=778) and high-risk group (n=119) in the derivation set. With the same cutoff scores, patients were divided into the low-risk group (n=397) and high-risk group (n=52) in the internal validation set. We also observed the distribution of risk scores, DFS, DFS statuses in the two independent data sets ( Figure 4A-4B). In addition, on the basis of risk stratification system, Kaplan-Meier curves were performed in both the derivation and the internal validation cohort, which demonstrated that patients in the high-risk group had poorer DFS than those in the lowrisk group (P < 0.0001, Figure 4C-4D). The 5-year DFS rates of 897 patients were 0.862 (95% CI: 0.820-0.895) and 0.506 (95% CI: 0.339-0.652) in the low-and highrisk groups, respectively (P < 0.0001). Besides, we conducted effective risk stratification analyses in BC patients with T stage, N stage, TNM stage, hormone receptor (HR) and human epithelial growth factor receptor 2 (Her2) status. And patients in the low-risk group had significantly better DFS than those in the highrisk group in T1 (P < 0.0001), T2 (P = 0.0056), T3/T4 (P = 0.00011), TNM stage III (P = 0.00024), N1 (P = 0.00033), N2/N3 (P = 0.014), HR-(P < 0.0001), HR+ (P < 0.0001), Her2-(P < 0.0001) and Her2+ (P = 0.032) ( Figure 5). Additionally, time-dependent receiver operating characteristic (ROC) analyses indicated that the miRNA-based prognostic model had better predictive performance than any clinical risk factors, or single prognostic miRNA alone in Figure 6. In terms of predictive accuracy, the miRNA-based prognostic nomogram was distinctly greater than that of the TNM stage (AUC: 0.710 vs 0.667, P < 0.0001).

Functional enrichment analysis of predicted target genes
To further identify the potential cellular biological functions and mechanisms of the four prognostic miRNAs, 46 target genes were predicted using three databases, including TargetScan, miRTarBase and miRDB. Gene ontology (GO) analysis revealed that these genes were related with protein binding, cytoplasm and nucleus ( Figure 7A). Kyoto Encyclopedia of Genes and Genomes analysis (KEGG) pathways analyses found that these genes mainly enrichment in cancerrelated pathways and Epstein-Barr virus infection ( Figure 7B).

DISCUSSION AGING
Prognostic evaluation is vital for making appropriate treatment decisions. Because the traditional TNM stage is mainly based on anatomical information, it is unable to achieve adequate assessment of disease recurrence in Table 2. miRNA recurrence-related signature in the derivation cohort.  This tool conducted well as supported by the good predictive accuracy (AUC > 0.7) in the derivation and internal validation sets, respectively. Moreover, the calibration curves illustrated the good agreements  between nomogram prediction and actual observations. Combination of miRNA recurrence-related signature and TNM stage was superior to TNM stage, indicating that the miRNA recurrence-related signature added the prognostic value of TNM stage. Furthermore, this miRNA-based nomogram could significantly stratify patients into the low-and high-risk group independent of the same TNM stage. Such accurate risk stratification could allow oncologists to identify the high-risk patients for aggressive therapy to improve BC prognosis.
Previous researches about miRNAs demonstrated that the miRNA-based signature is a crucial predictor for tumor recurrence [14,16,19,20,[23][24][25][26][27][28][29][30]. Gong et al constructed a 10-miRNA-based classifier to predict recurrence in hormone receptor (HR)+ Her2-BC patients [19]. However, this study did not combine 10-miRNA signature with TNM stage to build a prognostic nomogram and was limited by small miRNA expression profiling. Prognostic nomogram, which comprise the visualization of statistical models, has potential to achieve a more individualized risk prediction of survival outcomes on the basis of combination of different prognostic variables [7,21]. A large dataset of TCGA project provides us with a comprehensive foundation to mine multi-miRNA-based prognostic signature. Thus, a novel prognostic nomogram based on TCGA database, which incorporates multi-miRNA-based and TNM stage, is essential to make individualized estimates of 5year DFS in BC patients. On the other hand, some previous studies have been limited by small number of miRNAs screened, small sample sizes and lack of independent validation [31][32][33]. It should be pointed out that the sample size influences the result of statistical AGING indicates the optimal cutoff point of the nomogram score used to stratify patients into the low-and high-risk group. Kaplan-Meier curves of the low-and high-risk patients based on the miRNA-based prognostic model in the derivation cohort (C) and validation cohort (D). DFS, disease-free survival.
significance [21]. As a result, these previous studies may not have an adequate sample size to identify a significant effect estimate. Thus, our study is more reliable and relevant from those published in several previous studies [31][32][33].
There are some limitations in the study. Firstly, we lack of experimental study to explain the biological implications of the miRNA recurrence-related signature. Thus, the molecular mechanism of these miRNAs should be investigated in further study. Secondly, the miRNAbased prognostic nomogram needs to be further validated by a prospective, large-scale multicenter study before it can be applied in clinical practice. Thirdly, the TCGA database lacks of postoperative information (adjuvant chemotherapy, radiotherapy). Hence, we could not identify the low-risk patients to tailor adjuvant therapy and foresee which patients are likely to benefit from adjuvant chemotherapy.

CONCLUSIONS
In summary, a novel prognostic nomogram incorporating miRNA recurrence-related signature and TNM stage was established and internally validated to improve individualized risk estimation of 5-year DFS and make accurate risk stratification in BC patients. This easy-touse tool can help clinicians to predict 5-year DFS and easily select proper patients who are in need of a specific therapeutic strategy.   (B) Kyoto Encyclopedia of Genes and Genomes analyses (KEGG) enrichment analysis. The x-axis indicates the number of genes, and the y-axis represents the GO terms and KEGG pathway names. The color represents the P value.

Participants and study design
Data for selected samples of 1044 BC samples and 102 adjacent normal tissues were downloaded from TCGA database. Inclusion criteria were as follows: (i) histological diagnosis of invasive BC; (ii) complete follow-up data and miRNA expression profile available. Then, a total of 897 patients were included in this study. According to a computer-generated allocation numbers, 449 patients as validation cohort were randomly selected from the 897 patients (derivation cohort). Because the data were all publicly derived from the TCGA project, approval by our institutional ethics committees was not needed.

Establishment of multi-miRNA-based risk score and prognostic nomogram
To screen out the DEMs between 1044 BC samples and 102 adjacent normal tissues, we defined DEMs with a FDR < 0.05 and |log2FC)| ≥1. Then, univariate CPHR analysis was conducted to found the association between each DEMs, clinical risk factors and DFS (P < 0.05). Multivariate CPHR analysis was used to confirm the independent variables (P < 0.05). Thus, independent prognostic DEMs were selected to build a multi-miRNAbased risk score. And the multi-miRNA-based risk score = sum of coefficients × expression level of miRNAs. Furthermore, to make full use of the prognostic miRNAs, a novel model integrating the multi-miRNA-based signature and clinical factors to improve survival prediction in BC patients.

Assessment of the multi-miRNA-based prognostic nomogram
To further evaluate the risk stratification ability of the multi-miRNA-based nomogram, we classified patients into the high-and low-risk subgroups according to the optimal cutoff nomogram score determined by X-tile plot [22]. Moreover, the predictive accuracy of the multi-miRNA-based model was calculated via AUC based on time-dependent ROC analysis [34]. Finally, calibration plot was performed to assess the calibration ability of the multi-miRNA-based nomogram.

Target gene prediction and functional enrichment analysis
We applied TargetScan, miRTarBase and miRDB to confirm the target genes of prognostic miRNAs [35][36][37]. Then, GO and KEGG pathway enrichment analyses were executed to analyze these target genes using the database for Annotation, Visualization, and Integrated Discovery 6.8 Bioinformatics Tool (DAVID 6.8) [38].

Statistical analysis
The χ2 test and the Mann-Whitney U test were used to compare the differences of variables between the two data sets, when appropriate. Survival curves were conducted via the Kaplan-Meier method and compared via the log-rank test. A threshold P < 0.05 was determined as statistical significance. The optimal cutoff values of prognostic nomogram scores were confirmed using X-tile software, version 3.6.1 (Yale University, New Haven, CT, USA) [22].

AUTHOR CONTRIBUTIONS
NL, FXS, JGL, BC, GCZ and ZHP made substantial contribution to conception and design, analysis and interpretation of data, wrote and revised the manuscript. JGL and ZHP collected and analyzed the data, and also took part in the drafting of the manuscript. YLW, LZW and HM revised the manuscript. All authors contributed toward data analysis, drafting and revising the paper and agree to be accountable for all aspects of the work. All authors have read and approved the final version of the manuscript.

ACKNOWLEDGMENTS
This work benefited from the Cancer Genome Atlas (TCGA) database. We were grateful to the access to the