Identification and validation of an eight-lncRNA signature that predicts prognosis in patients with esophageal squamous cell carcinoma

Esophageal squamous cell carcinoma (ESCC) is correlated with worse clinical prognosis and lacks available targeted therapy. Thus, identification of reliable biomarkers is required for the diagnosis and treatment of ESCC. We downloaded the GSE53625 dataset as a training dataset to screen differentially expressed RNAs (DERs) with the criterion of false discovery rate (FDR) < 0.05 and |log2fold change (FC)| > 1. A support vector machine classifier was used to find the optimal feature gene set that could conclusively distinguish different samples. An eight-lncRNA signature was identified by random survival forest algorithm and multivariate Cox regression analysis. The RNA sequencing data from The Cancer Genome Atlas (TCGA) database were used for external validation. The predictive value of the signature was assessed using Kaplan–Meier test, time-dependent receiver operating characteristic (ROC) curves, and dynamic area under the curve (AUC). Furthermore, a nomogram to predict patients’ 3-year and 5-year prognosis was constructed. CCK-8 assay, flow cytometry, and transwell assay were conducted in ESCC cells. A total of 1136 DERs, including 689 downregulated mRNAs, 318 upregulated mRNAs, 74 downregulated lncRNAs and 55 upregulated lncRNAs, were obtained in the GES53625 dataset. From the training dataset, we identified an eight-lncRNA signature, (ADAMTS9-AS1, DLX6-AS1, LINC00470, LINC00520, LINC01497, LINC01749, MAMDC2-AS1, and SSTR5-AS1). A nomogram based on the eight-lncRNA signature, age, and pathologic stage was developed and showed good accuracy for predicting 3-year and 5-year survival probability of patients with ESCC. Functionally, knockdown of LINC00470 significantly suppressed cell proliferation, G1/S transition, and migration in two ESCC cell lines (EC9706 and TE-9). Moreover, knockdown of LINC00470 downregulated the protein levels of PCNA, CDK4, and N-cadherin, while upregulating E-cadherin protein level in EC9706 and TE-9 cells. Our eight-lncRNA signature and nomogram can provide theoretical guidance for further research on the molecular mechanism of ESCC and the screening of molecular markers.

ncbi. nlm. nih. gov/ geo/) database [15] under the GPL18109 platform (Agilent human lncRNA + mRNA array V.2.0). These 179 samples from GEO were used as a training set. Meanwhile, the data of RNA-seq expression, including 161 tumor tissue samples (80 squamous carcinoma and 81 adenocarcinoma) and 11 controls (platform: Illumina HiSeq 2000 RNA Sequencing), were obtained from the TCGA database. We kept 80 squamous carcinoma sample as the validation set. Statistical clinical information of patients in the training set and validation set is summarized in Table 1.

Identification of significantly DERs
Differential expression analyses were performed for the identification of differentially expressed RNAs (DERs), including lncRNAs and mRNAs (hereafter referred to as "DEl-ncRNAs" and "DEmRNAs, " respectively) between 179 tumor samples and 179 control samples using Limma package version 3.34.7 in R3.4.1 language [16]. The same cutoff value (FDR < 0.05 and |log 2 FC|) was taken as the inclusion criteria for selection of DElncRNAs and DEmRNAs. According to the value of DERs in training set, pheatmap version 1.0.8 in R3.4.1 language [17] based on centered Pearson correlation algorithm [18] was utilized to perform bidirectional hierarchical clustering for describing the gene expression differences between tumor samples and control samples.

Construction and evaluation of SVM classifier
Combined with survival information in training set, we performed univariate Cox regression analysis from survival package version 2.41-1 in R3.4.1 language [19] to screen significantly prognostic-related DERs (PDERs, including PDElncRNAs and PDEmRNAs) with log-rank p-value < 0.05 as the cutoff criterion. The screened PDElncR-NAs were used to conduct recursive feature elimination (RFE) analysis in caret package in R3.4.1 language [20,21] to extract the optimal feature genes with the minimum root mean square error (RMSE) obtained by the 100-fold cross-validation. Subsequently, these optimal feature genes were applied to construct Sigmoid kernel support vector machine (SVM) model using the R3.4.1 e1071 package (https:// cran.r-proje ct. org/ web/ packa ges/ e1071) [22]. We then evaluated the model's performance in GSE53625 training set and TCGA validation set using area under the curve (AUC) in receiver operating characteristic (ROC) curve. Meanwhile, we calculated each index value of the ROC curve, including sensitivity, specificity, positive prediction value (PPV), and negative prediction value (NPV).

Identification of signature lncRNAs and RS calculation
On the basis of the optimal feature genes, signature lncRNAs correlated with independent prognosis were identified using a multivariable Cox proportional hazards model implemented with the R3.4.1 survival package version 2.41-1 [19] with log-rank p-value < 0.05 as the cutoff criterion. Then, we calculated risk score (RS) following the risk formula: ∑β lncRNA × Exp lncRNA , where β lncRNA indicates the coefficient and Exp lncRNA indicates the expression level of signature lncRNA. Afterwards, all patients in training set and validation set were divided into high-risk and low-risk groups according to their median risk score. We used the Kaplan-Meier method in R3.4.1 survival package version 2.41-1 [19] to analyze the overall survival of the two groups and verified the prediction value of the model by plotting ROC curves for the training set and validation set.

Independent prognosis analysis and nomogram construction
The prognostic value of clinical variables and the RS calculated based on lncRNA signature in training set was initially assessed in univariate Cox proportional hazards regression analyses. Subsequently, each significantly different variable was further evaluated in a multivariate Cox proportional hazards regression analysis. The log-rank p-value < 0.05 was served as the cutoff criterion. Furthermore, a nomogram to predict patients' 3-year and 5-year prognosis was constructed using R3.4.1 rms package version 5.1-2 (https:// cran.r-proje ct. org/ web/ packa ges/ rms/ index. html) [23,24].

Prediction analysis of signature lncRNA-related genes and functional enrichment
To evaluate the function of signature lncRNAs, we first identified mRNAs significantly related to the signature lncRNAs via calculating the Pearson correlation coefficient (PCC) between 8 signature lncRNAs and 92 PDEmRNAs in the data from the training set using the cor.test function in R3.4.1 language [25]. After screening the connection pairs with RCC > 0.6, signature lncRNA and PDEmRNAs co-expression network was constructed and visualized using Cytoscape version 3.6.1 [26]. Subsequently, these PDEmRNAs in co-expression network were inputted into David website (https:// david. ncifc rf. gov) to perform GO biological process and KEGG pathway enrichment analysis, with p < 0.05 as the cutoff value.

Clinical samples and cell lines
The tissue samples used were collected from the Harbin Medical University Cancer Hospital between September 2018 and October 2019, including 15 ESCC tissues and 15 adjacent tissues, all from surgically removed specimens. The study was approved by the ethics committee of the Harbin Medical University Cancer Hospital, and each patient signed a written informed consent form. Two ESCC cell lines (EC9706 and TE-9) were purchased from the Cell Bank of Type Culture Collection of Chinese Academy of Sciences (Shanghai, China), which were cultured in DMEM with 10% FBS (Gibco, USA) at 37 °C containing 5% CO 2 .

Cell transfection
For gene knockdown, EC9706 and TE-9 cells were seeded into six-well plates at a density of 3 × 10 5 cells per well to 80% confluence and transfected with small interfering RNA targeting LINC00470 (si-LINC00470) or negative control (si-NC) generated by GenePharma (Shanghai, China) in accordance with the instructions of Lipofectamine 3000 Reagents (Invitrogen, USA). After 48 h, cells were harvested for further analysis.

Cell proliferation assay
CCK-8 assay was performed to evaluate the cell proliferation ability in ESCC cells. In brief, transfected cells were inoculated into 96-well plates at a density of 3000 cells per well. At the indicated timepoint (0, 24, 48, and 72 h, respectively), 10 µl of CCK-8 solution (Sigma-Aldrich, USA) was added to each well. After 2 h incubation, the absorbance in each well was measured at 450 nm under a microplate reader.

Flow cytometry
The cell cycle distribution was analyzed using flow cytometry. Briefly, transfected cells (1 × 10 6 ) were harvested, washed with PBS, and fixed by ice-cold ethanol (70%) overnight at 4 °C. Afterwards, cells were washed with PBS twice and stained with propidium iodide (PI) for 30 min at 37 °C. The DNA content of stained cells was determined using BD FACSCalibur flow cytometer (BD Biosciences, Franklin Lakes, NJ, USA) and analyzed with ModFitLT.

Cell migration assay
Cell migration was measured using transwell 24-well chambers (Corning Inc, Corning, NY, USA). In brief, transfected cells (5 × 10 5 ) were harvested and resuspended in serumfree medium. Then, the cell suspensions were added to the upper chamber, and 600 µl medium containing 15% FBS was added to the lower chamber. After 12 h culture, the migratory cells in the lower chamber were fixed with 4% paraformaldehyde for 10 min and stained in 0.5% crystal violet (Sigma-Aldrich, USA) for 30 min. Finally, migratory cells were photographed and counted from five random fields under a light microscope.

Statistical analysis
All quantitative data were analyzed using GraphPad Prism 5 (La Jolla, CA, USA) and expressed as mean ± standard deviation (SD). Differences between si-NC and si-LINC00470 groups were assessed using Student's t-test. A p-value of < 0.05 was considered statistically significant.

Identification of significantly DERs
Significant DERs were first identified among 179 tumor samples compared with 179 control samples in the training set. A total of 129 DElncRNAs (74 downregulated and 55 upregulated) and 1007 DEmRNAs (689 downregulated and 318 upregulated) were identified and are listed in Additional file 1: Table S1. These data were used to build the volcano plot of DElncRNAs and DEmRNAs (Fig. 1A) and the bidirectional hierarchical clustering heatmap (Fig. 1B), indicating the samples tend to cluster in two distinct directions.

Optimal feature gene selection
A total of 114 PDERs, including 22 PDElncRNAs and 92 PDEmRNAs, were obtained after univariate Cox regression analysis and are listed in Additional file 2: Table S2. Based on the screened 22 PDElncRNAs, the lncRNA combination with the lowest RMSE was selected as the optimal feature genes in the RFE recursive algorithm screening. As shown in Fig. 2, when the number of lncRNAs was 13, the optimal parameter (minimum RMSE = 0.1352) was obtained, and corresponding 13 optimal feature genes are summarized in Additional file 3: Table S3. A classification model was constructed in training set, whose performance was assessed in the GSE53625 training set and TCGA validation set. The classification results of samples based on the classifier are shown in the scatter diagram in Fig. 3 (left), in which the points with two different colors and shapes are clearly distinguished. The area under the ROC curve is shown in Fig. 3 (right), and corresponding index values of the ROC curve are presented in Table 2. ROC curve analysis revealed an AUC of 0.997 in the training set and 0.901 in the validation set. These results indicate that these optimal feature genes could be used as effective and accurate ESCC diagnostic biomarkers.
The risk coefficients suggested that ADAMTS9-AS1, LINC01497, and MAMDC2-AS1 were risk factors for ESCC (coef > 0), whereas DLX6-AS1, LINC00470, LINC00520, LINC01749, and SSTR5-AS1 appeared to be protective factors (coef < 0) ( Table 3). The RS of each patient in the training set and validation set was calculated with the following formula: RS = (0.147172) × Exp ADAMTS9-AS1 + (−0.063991)   Accordingly, patients were divided into high-and low-risk groups depending on their median risk score to assess the score's ability to accurately predict survival in a Cox regression model (Additional file 4: Table S4). Kaplan-Meier analysis showed that patients in the low-risk group had better prognosis than those in the high-risk group in the training set (Fig. 4A) and validation set (Fig. 4B). The AUC of the ROC curve was 0.989 in the training set and 0.865 in the validation set (Fig. 4C). These results confirmed that the risk score could be an independent predictor of overall survival.   The eight-lncRNA signature was an independent predictor of ESCC prognosis To investigate whether the eight-lncRNA signature was an independent predictor of prognosis among patients with ESCC in the training set, we performed univariate and multivariate Cox regression analyses. As illustrated in Table 4, the age, pathologic N, pathologic stage, adjuvant therapy, and RS model status were significantly correlated with patients' overall survival in the univariate Cox regression. Moreover, the age, pathologic stage, and RS model status based on the eight-lncRNA signature remained three independent predictors. In addition, the results from Kaplan-Meier analysis showed that age (Fig. 5A) and pathologic stage (Fig. 5B) had a significant impact on the prognosis of patients with ESCC (with a log-rank test p-value less than 0.0001). Furthermore, a nomogram was constructed that integrated age, pathologic stage, and RS model status to analyze the relationship between these three predictors and survival prognosis (Fig. 6A), which indicated that a higher total number of points on the nomogram presented a worse prognosis. Further analysis suggested that the predicted 3-year and 5-year survival rates by the survival model in the histogram were consistent with the actual 3-year and 5-year survival rates (Fig. 6B).

Functional characteristics of signature lncRNA-related genes
We first calculated the PCC between expression levels of 92 PDEmRNAs and eight-lncRNA signature and obtained 279 connection pairs with PCC > 0.6 (Additional file 1: Table S5). A total of 82 nodes, including 8 signature lncRNAs and 74 PDEmRNAs, were obtained in the constructed co-expression network (Fig. 7). Then we performed GO and KEGG functional enrichment analysis for these 74 PDEmRNAs. As shown in Fig. 8 and Table 5, these mRNAs were mainly enriched in the differentiation and development of epidermal and epithelial cells in GO biological process analysis, as well as the secretion of digestive juices in KEGG enrichment analysis.

Validation of the expression levels of eight-lncRNA signature in ESCC tissues
Quantitative real-time PCR analysis was performed to determine the expression levels of eight-lncRNA signature in 15 pairs of tumor tissues and matched adjacent tissues derived from patients with ESCC. As shown in Fig. 9, the expression levels  of DLX6-AS1 and LINC00470 were significantly upregulated, while LINC01479, LINC01749, and SSTR5-AS1 were markedly downregulated in ESCC tissues compared with adjacent tissues. However, there was no significant differences in expression levels of ADAMTS9-AS1, LINC00520, or MAMDC2-AS1 between two groups. According to the higher fold change, we selected LINC00470 for subsequent functional assays.

Knockdown of LINC00470 suppresses ESCC cell proliferation, G1/S transition, and migration
To investigate the function of LINC00470 in ESCC in vitro, LINC00470 expression was first knocked down in EC9706 and TE-9 cells by using si-LINC00470 transfection, which was demonstrated by quantitative real-time PCR analysis (Fig. 10A). CCK-8 assay showed that knockdown of LINC00470 resulted in growth retardation of EC9706 and TE-9 cells (Fig. 10B). Moreover, the percentage of cells at G0/G1 phase was significantly increased, in accordance with S and G2/M phase being decreased in si-LINC00470 group compared with si-NC group in both EC9706 (Fig. 10C) and TE-9 (Fig. 10D) cells.
In addition, transwell assay indicated that knockdown of LINC00470 markedly inhibited the cell migration ability in EC9706 and TE-9 cells (Fig. 10E). At the molecular level, knockdown of LINC00470 downregulated the protein levels of PCNA, CDK4, and N-cadherin, while upregulating E-cadherin protein level in EC9706 and TE-9 cells (Fig. 10F). The above results demonstrate that knockdown of LINC00470 can inhibit the proliferation and migration of ESCC cells.

Discussion
To the best of our best knowledge, the tumor-node-metastasis (TNM) staging system acts as the main transitional algorithm to direct the treatment strategies and also serves as a prognostic predictor, but fails to consider the genetic alterations in most types of cancers, including ESCC [27,28]. In recent years, identification of lincRNA-based signatures has received great attention for its potential to aid in the prognosis of cancers, including hepatocellular carcinoma [29], bladder cancer [30], and pancreatic cancer [31].
In the present study, we first identified 1136 significantly DEGs between tumor tissues and normal tissues in GEO data and confirmed 114 DEGs correlated with prognosis. Finally, eight-lncRNA signature (DLX6-AS1, LINC00470, LINC01479, LINC01749, SSTR5-AS1, ADAMTS9-AS1, LINC00520, and MAMDC2-AS1) was constructed for ESCC. Importantly, a robust nomogram consisting of age, pathologic stage, and RS model status based on the eight-lncRNAs signature was constructed for prediction of prognosis for patients with ESCC. Further analysis suggested the predicted 3-year and 5-year survival rates by the survival model in the histogram were consistent with the actual 3-and 5-year survival rates. By integrating diverse prognostic variables based on clinical characteristics, nomogram has been a widely used tool in oncology that could determine individual probability [32]. Here, our data suggest that our constructed nomogram had better predictive accuracy than each factor alone. Similar to our data, Khalil et al. [33] established a three-lncRNA signature and demonstrated that it could precisely predict overall survival and disease-free survival for ESCC. Three-lncRNA signature (RP11-366H4.1.1, LINC00460, and AC093850.2) was constructed by random forest algorithm and support vector machine algorithm and identified to be potential predictor of overall survival for patients with ESCC [34]. In addition, Mao et al. [32] identified a robust seven-lncRNA signature associated with overall survival that was independent of classical prognostic factors and molecular subtypes in ESCC. The different lncRNA signatures identified in ESCC might be mainly ascribed to different sample resources, sample sizes, and analysis methods. Subsequently, our data showed that 74 PDEmRNAs in co-expression network were mainly enriched in the differentiation and development of epidermal and epithelial cells, as well as the secretion of digestive juices. Consistently, ESCC progression was closely associated with epidermal and epithelial cell differentiation and growth [35,36].
Subsequently, we confirmed that the expression levels of DLX6-AS1 and LINC00470 were significantly upregulated, while LINC01479, LINC01749, and SSTR5-AS1 were markedly downregulated in ESCC tissues compared with adjacent tissues. By searching published articles, we found that no review had explored the intriguing mechanisms of these five lncRNAs in ESCC, except DLX6-AS1. Several studies have demonstrated that DLX6-AS1 is associated with malignant progression and promotes cell growth and metastasis in ESCC cells [37][38][39]. Considering the relatively higher increased fold change in expression level, we selected LINC00470 for further functional experiments. As expected, knockdown of LINC00470 significantly suppressed cell proliferation, G1/S transition, and migration in two ESCC cell lines (EC9706 and TE-9). In fact, LINC00470 has been reported to be an oncogene in other malignant tumors. For instance, Wu et al. [40] reported that LINC00470 promoted glioma cell proliferation and invasion and attenuated chemosensitivity. Yan et al. [41] performed overexpression and knockdown experiments to demonstrate the oncogenic functions of LINC00470 on gastric cancer cell proliferation, migration, and invasion. The findings by Huang et al. [42] indicated that knockdown of LINC00470 expression inhibited cell proliferation and cell cycle progression, while overexpression of LINC00470 showed the opposite effects in hepatocellular carcinoma. In addition, LINC00470 promoted invasiveness, migration, and angiogenesis of endometrial cancer cells [43]. Knockdown of LINC00470 could significantly inhibit the melanoma cell proliferation and migration, and suppress the growth of tumor in vivo [44]. On the basis of this evidence, we speculate that high LINC00470 expression appears to be related to poor prognosis in ESCC. It must be mentioned that there are several limitations to this study, including lack of further in vitro experimental study and in vivo data to validate the prognostic performance of our proposed lncRNA signature.