A seven-long noncoding RNA signature predicts overall survival for patients with early stage non-small cell lung cancer

Non-small cell lung cancer (NSCLC) is the most common cancer and cause of cancer-related mortality globally. Increasing evidence suggested that the long non-coding RNAs (lncRNAs) were involved in cancer-related death. To explore the possible prognostic lncRNA biomarkers for NSCLC patients, in the present study, we conducted a comprehensive lncRNA profiling analysis based on 1902 patients from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) datasets. In the discovery phase, we employed 682 patients from the combination of four GEO datasets (GSE30219, GSE31546, GSE33745 and GSE50081) and conducted a seven-lncRNA formula to predict overall survival (OS). Next, we validated our risk-score formula in two independent datasets, TCGA (n=994) and GSE31210 (n=226). Stratified analysis revealed that the seven-lncRNA signature was significantly associated with OS in stage I patients from both discovery and validation groups (all P<0.001). Additionally, the prognostic value of the seven-lncRNA signature was also found to be favorable in patients carrying wild-type KRAS or EGFR. Bioinformatical analysis suggested that the seven-lncRNA signature affected patients’ prognosis by influencing cell cycle-related pathways. In summary, our findings revealed a seven-lncRNA signature that predicted OS of NSCLC patients, especially in those with early tumor stage and carrying wild-type KRAS or EGFR.


INTRODUCTION
Non-small cell lung cancer (NSCLC) is one of the most common and lethal malignant diseases worldwide during the past decades due to lacking of early diagnostic and predictive biomarkers [1,2]. With the development of the molecular targeted agents [3] and the therapeutic strategy [4], the 5-year overall survival (OS) rate of NSCLC has been prolonged. However, only one third of NSCLC patients are diagnosed at an early stage [5]. To date, there are limited methods for us to provide a prognostic prediction of NSCLC patients, which is important to choose appropriate therapies for those patients [6]. Therefore, our study focused on finding a prognostic risk-score model derived from the whole-genome data.
AGING Long non-coding RNAs (lncRNAs) are defined as nonprotein coding transcripts longer than 200 nucleotides [7,8]. During the past decade, genome-wide sequencing helped researchers discover a large amount of lncRNAs [8]. Increasing studies suggested that the lncRNAs were involved in multiple cellular biological process and were associated with different human diseases including cancer [9,10]. Furthermore, several lncRNAs have been identified as oncogenes, such as MALAT1 in lung cancer [11], HOTAIR in breast cancer [12], and PANDA in hepatocellular carcinoma [13]. Above lncRNAs were also found to be associated with the prognosis of cancer patients [13][14][15].
Recently, the prognostic values of lncRNAs have attracted much attention in the field of cancer research. Researchers put great effort into investigating the lncRNA biomarkers in multiple malignancies, including gastric cancer [16], breast cancer [17], hepatocellular carcinoma [18] and ovarian cancer [19]. In lung cancer, Tu et al explored the lncRNA biomarkers in based on 739 patients derived from GEO datasets [20] and identified an eight-lncRNA signature which was associated with clinical outcomes. In the present study, we conducted a large sample size (n=1902) and crossplatform (including RNA-seq and microarray data) analysis, and performed multiple validation. Considering the above clear advantages, our findings will provide robust evidence for the exploration of lncRNA biomarkers in NSCLC.

Identification of seven lncRNAs as prognostic biomarkers in discovery group
By subjecting the lncRNA profiling data of discovery group to univariable Cox proportional hazards regression analysis, we obtained 457 lncRNAs whose P-value were less than 0.05. Among them, seven lncRNAs were selected for further analysis, including APTR, DHRS4-AS1, ITGA9-AS1, LINC01137, LOC101927972, RPARP-AS1 and SH3BP5-AS1. Next, we fitted the 7 lncRNAs expression, the corresponding survival status and survival time into the multivariable Cox regression. Using the coefficients obtained from the multivariable Cox regression, a risk-score formula was constructed as following: risk score = -0.06 x APTR -0.20 x DHRS4- Notably, in the formula, every lncRNA expression value is weight by a negative coefficient ( Table 1). As shown in Figure 1A and 1B, the heatmap revealed that the expression level of seven lncRNAs was decreased accompanying with the higher risk scores. Furthermore, we calculated the association between risk score and cancer-related death ( Figure 1C). Our data showed that mortality rate in high risk group was significant higher than low risk group ( Figure 1D), indicating the seven lncRNAs might play a protective role in NSCLC.

The prognostic values of seven-lncRNA signature in discovery and validation groups
Considering the relationship between the seven-lncRNA signature and patients' deaths, we next explored whether the seven-lncRNA signature could affect patients' OS time. Our data in discovery group (Meta-GEO dataset, n=682) showed that the patients who had low risk-scores were supposed to have a longer OS time than higher risk group (Figure 2A). To validate above finding, we employed two validation groups, TCGA and GSE31210 dataset, containing 994 and 226 patients, respectively. As expected, patients in high risk group had a significant increased mortality risk than low AGING risk group either in TCGA (HR=1.34, 95% CI= 1.10-1.63, P =0.004) or GSE31210 datasets (HR=3.06, 95% CI= 1.58-5.93, P =0.002) ( Figure 2B and 2C). Moreover, prognostic meta-analysis based on above three groups (n=1902) confirmed that the seven-lncRNA signature was a risk factor for NSCLC patients (combined HR=1.76, 95% CI= 1.26-2.47, P=0.001) ( Figure 2D).

The seven-lncRNA signature was associated with OS in stage I patients
To explore the impacts of clinical characteristics on the prognostic values of the seven-lncRNA signature, we performed a set of predefined stratified analysis. As shown in Table 2, we stratified the NSCLC patients by five clinical characteristics (including age, gender, smoking status, pathological subtypes and AJCC stage).
According to the results of stratified analysis, the most significant result was found in stage I patients in both discovery and two validation groups (all P<0.001) ( Table 2). To validate above findings, we further divided stage I patients into Ia and Ib according AJCC system, and calculated the prognostic values of the seven-lncRNA signature in each subgroups, respectively. Kaplan-Meier method was used to visualize the OS probabilities between high-and lowrisk groups. As expected, the overall survival time of high risk group was all significantly shorter than low risk group in either stage Ia or Ib patients ( Figure 3). The prognostic meta-analysis showed that the stage Ia patients with high risk scores had a two-fold increased mortality risk than those with low risk scores (combined HR=2.17, 95% CI= 1.15-4.09, P=0.017) ( Figure 3D). Similarly, the association between the seven-lncRNA signature and OS also reached a statistical significance in stage Ib patients (combined HR=1.88, 95% CI= 1.30-2.72, P=0.001) ( Figure 3H).
The seven-lncRNA signature was associated with OS in patients carrying wild-type KRAS or EGFR genes EGFR and KRAS were the most frequent mutant genes and associated with poor prognosis in NSCLC. Previous studies suggested that the mutation frequency of above two genes was significantly decreased in patients with AGING early stage. Bearing this in mind, we also performed stratified analysis based on EGFR or KRAS mutation status. Our data showed that the higher risk score was associated with higher mortality risk in wild-type KRAS or EGFR subgroups ( Figure 4). Above results were consistent in the two validation groups, suggesting the seven-lncRNA signature acted as a risk factor for patients carrying wild-type KRAS or EGFR genes.
Cell cycle-related pathways were associated with the seven-lncRNA signature Finally, to explore the potential mechanisms of the seven-lncRNA signature, we used the WGCNA method to cluster genes that highly correlated with the risk scores based on the profiling data of TCGA dataset [24]. We identified a total of 11 modules and found that pink module was the only positively correlated with the risk-score (P =2x10 -105 ) ( Figure 5A and 5B). Pathway enrichment analysis was then performed on an online-based web tool "Metascape" (http://metascape.org/) using the genes in pink module. As shown in Figure 5C, 90 genes were significantly enriched in cell cyclerelated pathways, including "cell cycle", "cell cycle phase transition" and "regulation of cell cycle process" pathways, suggesting the activation of cell cycle-related pathways might contribute to higher mortality risk in patients with high risk scores.

DISCUSSION
In the past decade, lncRNAs have been demonstrated by accumulating evidence to contribute in carcinogenesis and tumor progression [22]. In the present study, we conducted a seven-lncRNA signature, and validated their association with OS in two independent datasets. Moreover, we found the prognostic values of the seven-lncRNA signature reached a high statistical significance in patients with stage I and carrying wild-type KRAS and EGFR genes. To the best of our knowledge, this is AGING the first study, which was based on the high-throughput lncRNA profiling data from nearly two thousand patients, to explore the lncRNAs as prognostic biomarkers for early stage NSCLC patients. Our findings might provide convincing evidence for clinical application.
In the present study, we identified a total seven lncRNAs which were associated with OS in NSCLC, including APTR, DHRS4-AS1, ITGA9-AS1, LINC01137, LOC101927972, RPARP-AS1 and SH3BP5-AS1. Most of these identified lncRNAs were reported in cancer for the first time, except APTR (Alu-mediated p21 trans-criptional regulator). Previous studies reported that APTR could promote the cell proliferation and contribute to carcinogenesis by repressing p21 promoter activities in human glioblastomas [21]. By contrast, in this study, we found that the coefficient of APTR was minus, indicating that APTR might play a protective role in NSCLC. The detailed molecular mechanism warrant further investigations.
The most important finding in this study was that the seven-lncRNA signature was significantly associated with OS in stage I patients. Our data showed their association reached a high statistical significance (P<0.001)    [23,24]. Thus, the frequencies of wild-type EGFR and KRAS were always high in patients with early stage. We reviewed the EGFR and KRAS mutation status in TCGA-NSCLC dataset, and found that the mutation frequencies of EGFR and KRAS in stage 1 patients were only 10.2% and 5.7%, respectively, which were lower than advanced stage according to literature reports [25,26]. Interestingly, our data also showed that the seven-lncRNA signature was significantly associated with OS in patients carrying wild-type EGFR or KRAS genes, which might partially explain the favorable prognostic values of the seven-lncRNA signature in early stage patients.
In order to investigate the potential mechanisms affected by the seven lncRNAs, we also performed bioinformatic analysis. By using the WGCNA method, we clustered 11 gene modules from more than 5000 differentially expressed genes, and found that only one of them (pink) was positively correlated with seven-lncRNA signature. Pathway enrichment analysis further suggested that the genes in pink module were mostly enriched in cell cycle-related signaling pathways, indicating the seven-lncRNA signature might affect cell cycle-related pathways and consequently contributed to tumor progression.
In summary, we constructed a risk-score model derived from seven lncRNAs to predict the OS time of NSCLC patients. The risk association of the seven-lncRNA signature with survival was more evident in patients with early stage and carrying wild-type EGFR or KRAS genes. We warrant further studies, especially the large, wellperformed cohorts to confirm or refuse our findings.

Sample sources and study design
The raw data of gene expression and corresponding clinical information of NSCLC patients were downloaded from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) websites. Four GEO datasets (GSE30219, GSE31546, GSE37745 and GSE50081) were combined and used as the discovery group. After removal the five samples without survival data or enough clinical information, 682 patients from four GEO datasets were finally used for further analysis. Meanwhile, 994 patients from TCGA dataset were employed as validation group. Moreover, we also employed 226 patients from another independent GEO dataset (GSE31210) as the second validation group. All patients in GSE31210 belonged to the stage I and II of the American Joint Committee on Cancer (AJCC) system. The RNA expression profiling data of five GEO datasets were all performed on Affymetrix U133 Plus 2.0 microarray platform and the TCGA data was performed on Illumina sequencing platform. The clinical characteristics of discovery and validation groups were shown in Table 3.

Normalization and lncRNA annotation of GEO data and TCGA data
The raw data of five GEO datasets (GSE30219, GSE31210, GSE31546, GSE37745 and GSE50081) were downloaded as probe-level CEL files. Then, all raw data were quantile normalized using Robust Multiarray Average (RMA) method. With the lncRNAspecific probes presented on Affymetrix U133 Plus 2.0 downloaded from Affymetrix website (http://www.affymetrix.com) [15], we finally identified 2986 lncRNA transcripts with RefSeq transcript IDs. As AGING for TCGA data, we downloaded the sequencing data with the type of fragments per kilobase of exon per million fragments (FPKM) from the TCGA website (https://tcga-data.nci.nih.gov/).

Construction of risk-score formula
After normalization and annotation of GEO raw data, we subjected 2986 lncRNAs to univariable Cox regression proportional hazards regression analysis to select lncRNAs which were associated with OS of NSCLC patients. Those lncRNAs with a statistical significance in univariable Cox regression were then selected into multivariable Cox regression to obtain the coefficients. By linearly combining the expression value of selected lncRNAs weighted by their coefficients, a risk-score formula was constructed as following [20]: risk score N In our formula, the N is the number of 7, the are the expression value of every 7 lncRNAs and the are their corresponding coefficients from the multivariable Cox regression analysis [27]. Abbreviations: AJCC, the American Joint Committee on Cancer; KRAS, Kirsten rat sarcoma viral oncogene homolog; EGFR, epidermal growth factor receptor.

Survival analysis
According to above formula, the risk scores of NSCLC patients were calculated, by which patients were divided into high-and low-risk group with the cutoff of the median. The Kaplan-Meier method was used to assess the difference in survival time of high-and low-risk NSCLC patients. Difference of survival times between the two groups was considered as significant according to the P value of two-sided log-rank test less than 0.05.

Prognostic meta-analysis
To investigate the combined prognostic values of the seven-lncRNA signature of discovery and two validation groups, we performed a prognostic metaanalysis. All meta-analysis was performed on STATA software, version 12.0 (StataCorp LP, College Station, TX, United States). Pooled HR value was calculated using the random-effects model.

Weighted Correlation Network analysis (WGCNA)
To find the gene modules associated with our risk scores, we construct a co-expression network using the R package "WGCNA" according to our previous reports [28]. The soft thresholding power was selected to 9 to produce a weighted network. The enrolled genes were hierarchically clustered into 11 modules except the gray module.

Pathway enrichment analysis
The pink modules, the most significant modules being associated with risk score, were picked out to perform the pathway enrichment analysis. Pathway enrichment analysis of genes in pink module was performed on an online-based web tool "Metascape" (http://metascape. org/).