Identification of a nomogram based on long non-coding RNA to improve prognosis prediction of esophageal squamous cell carcinoma

Purpose: Esophageal squamous cell carcinoma (ESCC) remains a common aggressive malignancy in the world. Several long non-coding RNAs (lncRNAs) are reported to predict the prognosis of ESCC. Therefore, an in-depth research is urgently needed to further investigate the prognostic value of lncRNAs in ESCC. Results: From the training set, we identified a eight-lncRNA signature (including AP000487, AC011997, LINC01592, LINC01497, LINC01711, FENDRR, AC087045, AC137770) which separated the patients into two groups with significantly different overall survival (hazard ratio, HR = 3.79, 95% confidence interval, 95% CI [2.56-5.62]; P < 0.001). The signature was applied to the validation set (HR = 2.73, 95%CI [1.65-4.53]; P < 0.001) and showed similar prognostic values. Stratified, univariate and multivariate Cox regression analysis indicated that the signature was an independent prognostic factor for patients with ESCC. A nomogram based on the lncRNAs signature, age, grade and stage was developed and showed good accuracy for predicting 1-, 3- and 5-year survival probability of ESCC patients. We found a strong correlation between the gene significance for the survival time and T stage. Eight modules were constructed, among which the key module most closely associated with clinical information was identified. Conclusions: Our eight-lincRNA signature and nomogram could be practical and reliable prognostic tools for esophageal squamous cell carcinoma. Methods: We downloaded the lncRNA expression profiles of ESCC patients from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) datasets and separated to training and validation cohort. The univariate, least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analysis were used to identify a lncRNA-based signature. The predictive value of the signature was assessed using the Kaplan-Meier method, receiver operating characteristic (ROC) curves and area under curve (AUC). Weighted gene co-expression network analysis (WGCNA) was applied to predict the intrinsic relationship between gene expressions. In addition, we further explored the combination of clinical information and module construction.


INTRODUCTION
Esophageal cancer (EC) is a leading malignancy worldwide, with approximately 572,000 new patients and 508,000 deaths annually [1]. Nearly 90% of patients with EC in Eastern Asia have esophageal squamous cell carcinoma (ESCC) [2][3][4]. Due to its rapid progress and insensitive to chemotherapy, the outcome of ESCC remains extremely poor. In spite of the development of surgical and medical management, many ESCC patients suffered from diagnosed at late stage. The pathogenesis of ESCC is a multi-step process, including several stages until ultimately carcinoma [5]. Therefore, focusing on the molecular mechanisms underlying the initiation and development of ESCC will help reveal promising diagnostic biomarkers and novel therapeutic targets.
The development of high-throughput sequencing technologies has improved our understanding of the heterogeneity and molecular basis underlying the ESCC [6]. LncRNAs, defined as non-protein-coding RNA transcripts (>200 nt), are reported to play a vital role recently in the genomics era [7]. To date, accumulating evidences supported the potential biomarkers of lncRNAs in a large range of cancers including ESCC. A few lncRNAs have been suggested as oncogenic roles in ESCC so far. For instance, increased expression of FMR1-AS1 has recently been reported to be a poor marker in female esophageal carcinoma [8], while LINC01503 is confirmed to be significantly higher level in ESCC tumors and correlated to poorer survival times of patients [9]. With the increasing progress in bioinformatics, a wide spectrum of disease prediction and investigation of molecular mechanisms have come to light. Therefore, the biological significance of lncRNAs in the development of ESCC needs further research.
In our research, we intended to develop a concise lncRNA-based signature and nomogram to improve prognostic value of ESCC via integrated bioinformatics approaches.

DELs identification
All 502 lincRNAs was identified as differentially expressed lincRNAs (DELs) between tumor tissues and adjacent normal tissues, including 223 up-regulated and 279 down-regulated genes ( Figure 1A and 1B).

Construction and validation of the eight-lincRNA signature
The DELs of the training set (GSE53625) were exposed to univariate COX regression and 33 DELs related to the OS (P < 0.05) were measured as predictive lincRNAs for LASSO analysis (Table 1). Finally, eight lincRNAs were  . As above, it was suggested that the lncRNAs in the signature were all risk factors for OS. Their coefficients indicated their impact on OS prediction. For example, the influence of LINC01711 was greatest while that of LINC01592 was least. All patients were separated to high-and low-risk sets based on median risk score in the GEO ( Figure 2B) and TCGA ( Figure 2C) cohorts. The patients' status, survival time, and lincRNA expression levels are shown in Figure 2B (GEO) and 2C (TCGA). The survival analysis presented that the OS of low-risk set was better than that of high-risk set in the GEO cohort (hazard ratio, HR = 3.79, 95% confidence interval, 95% CI [2.56-5.62]; P < 0.001) ( Figure 3A). The results were consistent in the TCGA cohort (HR = 2.73, 95%CI [1.65-4.53]; P < 0.001) ( Figure 3B). The area under the ROC curve (AUC) for 0.5-, 1-, 3-, and 5-year OS were 0.673, 0.734, 0.798, 0.816, 0.795 and 0.777, 0.644, 0.642, 0.649, 0.765 in the TCGA and GEO cohorts, respectively. Together, it was indicated that the signature showed an excellent performance for OS prediction.

Nomogram construction
Based on the prognostic signature and clinical factors, such as age, grade and stage, a nomogram was constructed ( Figure 6A). The calibration curve was used to describe the prediction value of the nomogram and the 45-degree line indicated the actual survival outcomes. The results for predicting 1-, 3and 5-year OS showed that the nomogram-predicted survival closely matched with the best prediction performance ( Figure 6B

WGCNA
WGCNA was used to develop a gene co-expression system to select biologically meaningful gene modules which are related to the lincRNAs in the signature. We set the cut-off as Person correlation coefficient > 0.6 and P < 0.001 to screen genes co-expression with lincRNAs. Then, differential expressed gene (DEG) analysis was performed among these genes. Figure 7A shows the volcano plot of DEGs. To create a scale-free system, the scale-free topology fit index reached 0.8 by setting the soft threshold power value beta to 5 ( Figure 7B). In addition, genes with similar patterns was located in different modules by average linkage clustering ( Figure  7C). The minimum cluster size was identified as 30 per module. The dynamic shear method was performed to determine the gene module. We calculated module eigengene (ME) which was the overall gene expression level of corresponding modules and clustered them based on their correlation in order to explore co-expression similarity of all modules ( Figure 7D). To analyze the connection of gene modules and clinical characteristics, eigengenes were calculated correlations with clinical factors, such as survival time, status, gender, age, alcohol, grade, stage, T, N, pneumonia. Finally, the robust correlation was identified between the gene significance and the survival time, T stage ( Figure 7E). The eight modules were generally separated into two clusters ( Figure 7F). Gene significance were also determined to evaluate the correlation between gene expression and survival time ( Figure 8A). To further analyze genes in these modules, we found a strong correlation between the gene significance for the survival time and Module Membership in the green module (cor = 0.47, P = 2.8e − 08), the brown module was negatively correlated with the survival time (cor = −0.24, P = 0.0025), the blue module was positively correlated with the survival time (cor = 0.45, P = 2.9e − 11), and the red module was positively correlated with the survival time (cor = 0.3, P = 0.0026) ( Figure 8B). The function enrichment analysis was performed to explore the GO database and KEGG pathway in which are involved ( Figure 8C). The results indicated that the biological process mainly involved in keratinocyte differentiation, peptide cross-linking, epidermal cell differentiation, and skin development and so on. The results showed that the molecular function was related to peptidase activity, acting on L-amino acid peptides, endopeptidase inhibitor activity, cadherin binding and serine-type peptidase activity. The cell components which was correlated to the results included tertiary granule lumen, intermediate filament cytoskeleton, keratin filament, and intermediate filament.
Furthermore, KEGG pathway functional enrichment showed that influenza A, chemical carcinogenesis and drug metabolism were mainly related to the genes in these modules.
The same analysis was completed to assess the correlation between gene expression and T stage ( Figure 9A). Thus, we found a strong correlation between the gene significance for the T stage and Module Membership in the green module (cor = 0.57, P = 3.3e − 12), the black module was positively correlated with the T stage (cor = − 0.24, P = 2.1e -8), and the blue module was negatively correlated with the T stage (cor = 0.21, P = 0.003) ( Figure 9B). The function enrichment analysis was also performed to explore the GO database and KEGG pathway in which are involved ( Figure 9C). The results indicated that the biological process mainly involved in extracellular matrix organization. The results showed that the molecular function was related to metallopeptidase activity, metalloendopeptidase activity and collagen biding. The cell components which was mainly correlated to the results included endoplasmic reticulum lumen. Moreover, KEGG pathway functional enrichment showed that protein digestion and absorption was mainly related to the genes in these modules.

DISCUSSION
ESCC remains a serious burden on health system worldwide. Traditional algorisms such as tumor-nodemetastasis (TNM) staging system failed to consider the genetic alterations of ESCC. Thus, in view of the high heterogeneity of ESCC, investigation of novel biomarkers and models is necessary. Recently, lincRNA-based signatures have received much focus and revealed excellent potential in prognosis prediction of numerous cancers [10,11].
In the present study, we identified DELs between tumor and normal tissues in GEO data, and confirmed lincRNAs significantly correlated with prognosis using univariate COX regression and LASSO analysis. Finally, eight lincRNAs (AP000487, AC011997, LINC01592, LINC01497, LINC01711, FENDRR, AC087045, AC137770) were selected to build a prognostic signature for ESCC. A robust nomogram consisted of the 8-lncRNAs signature, age, grade and stage was constructed for prognostic prediction of ESCC patients. Furthermore, the AUC value of the signature-based nomogram was better than the AUC values of age, grade and stage in 1-, 3-, 5-years.
In this study, we identified 8 key prognostic lncRNAs of ESRCC patients, however, no review had been studied about intriguing mechanisms of these lncRNAs except FENDRR. FENDRR is transcribed from the FOXF1 promoter and considered to be one of the favorable lncRNA biomarkers for various cancers, such as liver cancer [12], lung squamous cell carcinoma [13], bladder cancer [14], gastric cancer [15] and lung adenocarcinoma [16]. lncRNA FENDRR was first identified associated with chromatin-modifying complexes in 2009 [17]. FENDRR could suppress the progression of NSCLC by regulating miR-761 and TIMP2 [18]. Consistent with the previously published papers, the expression levels of FENDRR were found positively related to the risk score of ESCC patients in our research. It is worth noting that patients with high FENDRR expression appeared to be closed to poor prognosis in ESCC.
lncRNA-based signature is a novel tool which could provide simple and accurate clinical outcome prediction. So far, three lncRNA-based signatures have been developed for ESCC through bioinformatics methods [19][20][21]. For instance, a three-lncRNA signature was established based on multivariable Cox regression analysis and could precisely predict OS and disease-free survival (DFS) for ESCC [19]. Another nine-lncRNA signature was constructed by random forest algorithm and support vector machine algorithm and identified to predict the tumor stage and patient survival rate [21]. The third seven-lncRNA signature was identified by random survival forest algorithm and Cox regression analysis and this signature combined with TNM showed better prognostic predict capability than either alone [20]. However, these signatures were failing to describe the clinical significance of the signature. As far as we know, clinical factors such as gender and grade could also have an influence on the OS. These factors need to be included to improve the prediction accurate.      Female  33  Stage  I  10  Male  146  II  77  Age  <=60  99  III  92  >60  80  T  T1  12  Alcohol  No  73  T2  27  Yes  106  T3  110  Grade  G1  32  T4  30  G2  Nomogram is a commonly used tool in oncology which can create an individual probability by integrating diverse prognostic and determinant variables according to corresponding clinical characteristics [22]. Several nomograms were constructed to guide individualized treatment based on clinicopathological risk factors in renal cell carcinoma [23], breast cancer [24] and colorectal cancer [25]. In this study, a prognostic nomogram combined signature with clinical factors was settled. The clinical factors in the nomogram are not affected by researchers and can be easily obtained.
What's more, our nomogram had a better predictive accuracy than that of each factor alone.
WGCNA is a useful tool to investigate the molecular mechanisms of many malignancies, such as breast cancer [26] and colon cancer [27]. In present research, WGCNA was performed to analyze the genes associated with the lincRNAs in the signature. We used this method to transform the expression profiles from these genes to 8 modules. In these modules, we further focused on gene pivots highly related to various clinical features. Functional enrichment of key module genes was also analyzed.
In summary, our eight-lincRNA signature and nomogram could be practical and reliable prognostic tools for esophageal squamous cell carcinoma. They can offer incremental clinical value over traditional staging system for overall survival prediction of ESCC, which can utilize treatment decisions.

Data collection
The RNA-sequencing and clinical information for ESCC were acquired from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) as the training cohort, and The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) as the validation cohort. GSE53625 from GEO was conducted by GPL18109 (Agilent-038314 CBC Homo sapiens lncRNA + mRNA microarray V2.0) [28]. The clinical features of patients with ESCC in the TCGA and GEO cohorts are presented in Table 2.

Identification of differentially expressed lincRNAs in ESCC
The RNA-sequencing data were normalized with Expectation-Maximization algorithm (log2 transformation). Then 37501 lincRNAs were annotated. The differentially expressed lincRNAs (DELs) were identified between tumor and normal tissues using the R package "Limma" with log2 | fold-change (FC) | > 1 and adjusted P-value < 0.05.

The construction of lncRNAs-based prognostic signature
First, we used univariate and least absolute shrinkage and selection operator (LASSO) COX regression analysis to select the independent risk lncRNAs. Next, multivariate COX regression was used to identify corresponding coefficients of ESCC prognostic signature using R package "glment", "survminer" and "survival". The risk score of every patient from the TCGA and GEO cohorts were calculated based on the signature. All samples were randomly separated to high-and low-risk sets with the median score as cut-off value. Survival analysis for each set was evaluated by the Kaplan-Meier curve and logrank test. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were drawn using R package "survivalROC".

Weighted gene co-expression network analysis (WGCNA)
WGCNA is a useful tool to establish the co-expression network between gene pattern and clinical factors using R package "WGCNA" (Version: 1.68). The procedure of WGCNA included identifying gene expression similarity matrix, adjacency matrix and co-expression network. ScaleFree plot was used for evaluating whether the network exhibits a scale free topology. The power value of soft threshold of adjacency matrix was identified as 5 to meet scale-free topology criterion. The hierarchically clustering analysis based on averagelinkage were originated from the Dynamic using Tree Cut method for Branch Cutting (deep-split = 2, cut height = 0.4, minimum cluster size = 30). The association between modules and variables was identified to select relevant module.

The nomogram establishing
A concise nomogram of predicting the survival of ESCC was established using the R package "rms", "Hmisc", "lattice", "Formula", and "foreign". The concordance index (C-index) was performed to assess prediction capability. The patients with ESCC were separated to diverse risk clusters along with their scores. All analyses were completed with R software. The P-value < 0.05 was considered statistically significant.

Data availability
All datasets generated for this study are included in the manuscript.

Ethics statement
As the data (TCGA and GEO datasets) are publicly available, no ethical approval is required.

AUTHOR CONTRIBUTIONS
WL L and HT Z conceived and designed the experiments. J L and WL L analyzed the data. J L and HT Z wrote the paper. All authors read and approved the final manuscript.