A three‐lncRNA expression signature associated with the prognosis of gastric cancer patients

Abstract Long noncoding RNAs (lncRNAs) have emerged as essential players in gene regulation. An ever‐increasing number of lncRNAs has been found to be associated with the biogenesis and prognosis of gastric cancer (GC). We aimed to develop an lncRNA signature with prognostic value for survival outcomes of GC. Using an lncRNA mining approach, we analyzed the lncRNA expression profiles of 492 GC patients from the Gene Expression Omnibus (GEO), which consisted of the GSE62254 set (N = 300) and the GSE15459 set (N = 192). The associations between the lncRNAs’ expression and survival outcome were evaluated. A set of three lncRNAs (LINC01140, TGFB2‐OT1, and RP11‐347C12.10) was identified to significantly correlate with overall survival. These lncRNAs were then combined to form a single prognostic signature. Based on this three‐lncRNA expression signature, patients in the GSE62254 set were classified into high‐ and low‐risk subgroups with significantly different overall survival (hazard ratio [HR] = 1.93, P < 0.001) and disease‐free survival (HR = 1.91, P < 0.001). Good reproducibility for the prognostic value of this lncRNA signature was confirmed in the GSE15459 set. Further analysis showed that the prognostic value of this signature was independent of some clinical characteristics. Gene set enrichment analysis indicated that high‐risk scores positively related to several molecular pathways of cancer metastasis. Our results suggest that this innovative lncRNA expression signature may be a useful biomarker for the prognosis of patients with GC based on bioinformatics analysis.

protein-coding genes and their critical genome alterations in the pathogenesis of GC [3,4]. Nevertheless, proteincoding genes only account for <2% of the whole genome sequence, and the remaining noncoding genes are transcribed into noncoding RNAs (ncRNAs). The ncRNAs are divided into two primary categories according to their size: small ncRNAs  nucleotides, e.g., microRNAs and small interfering RNAs) and long noncoding RNAs (lncRNAs, >200 nucleotides) [5]. Emerging evidence has indicated that ncRNAs play an important role in cancer pathogenesis, which could provide new insight into GC biology [6,7]. Recently, microRNAs have moved to the forefront of lncRNA GC research, whereas the role of lncRNAs is emerging, with an increasing number of lncRNAs being reported as associated with GC tumorigenesis. Aberrant lncRNA expression has been found in many types of cancers, such as GC [6], esophageal cancer [8], and liver cancer [9]. LncRNAs have attracted major attention due to their essential regulatory functions in cell proliferation, migration, and apoptosis. For GC, a substantial portion of lncRNAs is expressed specifically, which suggests their potential role as possible biomarkers and may be predictive of the clinical outcome.
Currently, the methodology of repurposing frequently used microarray data for expression profiling of ncRNAs has been well established [10,11]. For instance, Hu et al. used a series of microarray datasets to build a resource of clinically relevant lncRNAs and found a tumor-specific prognostic lncRNA signature in colorectal cancer [10]. We initially explored previously published gene expression microarray data of large GC cohorts from the Gene Expression Omnibus (GEO) and constructed lncRNA profiles using the abovementioned mining method. Based on the sample-splitting method and Cox regression analysis, we tried to identify useful lncRNAs associated with the survival of GC patients. The critical goal of this study was to discover key lncRNAs that can act as novel biomarkers to determine GC prognosis.

GC data sets
The GC gene expression data used in this study were obtained from publicly available GEO databases. To evaluate the association between lncRNA expression signatures and GC survival, we selected the microarray expression profiles based on three criteria: (1) the profiles should be generated by the Affymetrix HG-U133 Plus 2.0 Array (GPL570 platform), (2) the corresponding clinical data, such as histological classification and follow-up information, were available online, and (3) the sample size was >100. This resulted in two sets (GSE62254 and GSE15459) that were screened in our study.

Microarray expression processing and lncRNA profile mining
Raw CEL files of the expression for the two GEO datasets were downloaded, and the Robust Multichip Average (RMA) algorithm was performed for backgroundadjustment, quantile normalization, and log-transformation by the R package "affy" [12]. LncRNA profiles were achieved by Seqmap V1.0.8 on a local computer [13]. Briefly, the probe sets of Affymetrix HG-U133 Plus 2.0 were retrieved from the Affymetrix website (http://www. affymetrix.com). We then re-mapped those probes to the chromosomal positions of the ncRNAs derived from GENCODE (release 24, GRCh38) with no mismatch [14]. A total of 2380 probes and 2118 corresponding ncRNA genes were obtained. When multiple probes mapped to the same ncRNA, we used the arithmetic mean of the probe intensities.

Gene set enrichment analysis (GSEA)
Gene set enrichment analysis (GSEA) was performed using GSEA software V2.2. The gene set used for the enrichment analysis was "c2.cp.v5.0.entrez.gmt" (1330 gene sets), which are canonical representations of biological processes. The GSEA results were visualized in Cytoscape software V3.2.1 using the Enrichment Map plug-in [15]. Gene sets with a false discovery rate (FDR) value <0.05 after performing 1000 random sample permutations were termed "enriched."

Statistical analysis
The correlation of lncRNA expression with patients' OS or DFS was assessed by a univariate Cox regression analysis along with a permutation test using Biometric Research Branch-Array Tools V4.1.1 (FDR <0.05, P < 0.01) [16]. Genes were considered statistically significant with a permutation P < 0.01. A random survival forests (RSF) variable hunting algorithm was undertaken to further identify valuable lncRNAs [17]. In the RSF model, the number of Monte Carlo iterations (nrep) was set as 100, and the value controlling the step size used in the forward process (nstep) was set as 5. Because the GSE62254 set offered a larger sample size and more detailed clinical information than the GSE15459 set, we chose the GSE62254 to determine the risk score formula using a multivariable Cox regression model for the selected lncRNAs. The formula was established by including the expression of these selected lncRNAs, weighted by their estimated regression coefficients. According to this risk score formula, patients in each set were classified into a high-or low-risk group by using the median risk score as the cutoff point. The P. Song et al.

LncRNA and Gastric Cancer Prognosis
Kaplan-Meier method with a log-rank test was adopted to evaluate the survival differences between the low-and high-risk groups. Hazard ratios (HRs) and 95% CIs were estimated by univariate or multivariate Cox regression analysis. Multivariate Cox stepwise regression analysis was employed to determine predictive factors for GC prognosis, with a significance level of P < 0.05 for entering and P > 0.10 for removing the respective explanatory variables. Receiver operating characteristic (ROC) curves were used to compare the sensitivity and specificity of the prognosis of the lncRNA risk score. Mann-Whitney U tests (continuous variables) or chi-squared tests (categorical variables) were employed to evaluate the associations between lncRNA risk scores and patients of different clinical features. All of the statistical analyses were performed using the R V3.1.3 program (www.rproject.org) and SAS software V9.1. A two-sided P < 0.05 was considered statistically significant.

Characteristics of the two GC sets
Two independent sets of GC subjects were included in the present study. The GSE62254 set contained 300 GC patients and had a mean follow-up time of 50.6 months (range: 1.0-105.7 months). There were 199 males (66.3%) and 101 females (33.7%), 32 (10.7%) cardia patients and 268 (89.3%) noncardia GC patients, and 134 (44.7%) patients with diffuse type and 146 (48.7%) with intestinal type. In addition, 10.0%, 32.0%, 31.7%, and 25.7% of patients were identified as TNM stage I, II, III, and IV, respectively. For the GSE15459 set, 95 patients (49.5%) died of disease-related to GC in a period of up to 157.8 months (mean: 38.4 months) of follow-up. There were 125 males (65.1%) and 67 females (34.9%), and 75 (39.1%) diffuse type and 99 (51.6%) intestinal type GC. Additionally, 16.1%, 15.1%, 37.5%, and 31.3% of patients were diagnosed as stage I, II, III, and IV disease, respectively. The demographics and some clinical characteristics of the patients in these two sets were quite similar (P = 0.779 for sex, P = 0.191 for Lauren classification), whereas the proportion of stage III/IV patients in the GSE15459 set (68.8%) was larger than that in the GSE62254 set (57.3%, Table S1). As two patients in the GSE62254 set did not have definite TNM stage, they were further excluded in the subgroup analysis by TNM stage.

Identification of prognostic lncRNAs
As depicted in Figure 1, the 300 GC patients in GSE62254 were randomly divided into two sets, and both of the GC sets were used for the detection of prognostic ncRNAs.
After subjecting the ncRNA expression data to univariable Cox regression analysis by BRB-Array Tools, we identified 85 and 21 ncRNAs that strongly correlated with the patients' OS from the two sets, respectively. Among those transcripts, there were 11 overlapping ones and the lengths were more than 200 nucleotides. To make the model more practical, RSF were performed on the basis of the 11 lncRNAs, resulting in three lncRNAs remaining in the model. Therefore, a set of three lncRNAs was selected as the predictor for survival of GC (Table 1). Of these, LINC01140 and TGFB2-OT1 showed a positive coefficient in the univariate analysis, indicating that their higher levels of expression were associated with a shorter survival. A negative coefficient indicated that patients with a higher level of expression of RP11-347C12.10 tended to have a longer survival compared with those with a lower expression. Table 1 also describes a list of these three genes with their obtained variable importance values, with LINC01140 showing greater importance than the other predictors ( Figure  S1). All three lncRNAs have been verified in LNCipedia (a database for annotated human lncRNA sequences) and confirmed as ncRNAs in this website [18]. Additionally, the noncoding nature of these lncRNAs was verified by coding potential analysis.

Three-lncRNA signature and GC survival
We created a risk-score formula according to the expression of these three lncRNAs for OS outcome in the total GSE62254 set using multivariate Cox regression as follows: (0.84321 × expression level of LINC01140) + (0.87302*expression level of TGFB2-OT1) + (−2.4496 × expression level of RP11-347C12.10). The risk scores of the three-lncRNA signature for each sample in the GSE62254 set were calculated and ranked according to the values. Figure 2 shows that patients with low-risk scores tended to express high levels of protective lncRNAs (RP11-347C12.10), whereas patients with highrisk scores tended to express high levels of risky lncRNAs (LINC01140 and TGFB2-OT1). Using the median risk score (0.149) as the cutoff point, patients were divided into a highrisk group (score >0.149, N = 150) and a low-risk group (score ≤0.149, N = 150). As shown in Figure 3, we observed that GC patients with high-risk scores had lower OS and DFS rates than those with low-risk scores (both log-rank test P < 0.001). To validate our findings, we classified patients in the GSE15459 set into a high-risk (N = 88) and a lowrisk group (N = 104) by using the same cutoff value. Consistent with the findings described above, patients in the high-risk group suffered significantly poorer OS than did those in the low-risk group (log-rank test P = 0.003).

Stratified analysis of the three-lncRNA signature and patients' survival
To control for potential confounders and analyze phenotype-specific survival, the patients were stratified by demographic characteristics and clinical features. As demonstrated in Table 2, we found that the decreased OS and DFS rates were noticeable for the patients with highrisk scores among the subgroups of ≤64 years, >64 years, female, male, intestinal, diffuse, and noncardia in the GSE62254 set. Importantly, increased death was also pronounced for individuals with high-risk scores among the subgroups of >64 years, male, diffuse, and stage III/IV in the GSE15459 set (adjusted HR = 1.84, 95% CI = 1.08-3.12 for >64 years; 1.93, 1.17-3.17 for male; 2.49, 1.18-5.24 for diffuse; 1.96, 1.24-3.12 for stage III/IV).  Correlation between the three-lncRNA signature and clinicopathological features of gastric cancer patients According to the median risk score, patients were equally classified into two groups (relative high-risk group and low-risk group). In the GSE62254 set, the differences in lncRNA risk scores were significantly associated with age (P = 0.049), histological types (P < 0.001), and TNM stage (P < 0.001). However, this significant correlation was only associated with histological types in the GSE15459 set (P < 0.001). When the risk score of the three-lncRNA signature was evaluated in different strata of clinicopathological features, similar results were observed (Table 3).

Stepwise Cox regression model for survival
A multivariate stepwise Cox regression analysis was performed to evaluate the correlation between variables, including selected demographic characteristics and clinical features, risk scores (as continuous variables) and GC survival. Finally, three variables (age, TMN stage, and three-lncRNA risk score) of the GSE62254 set and two variables (TMN stage and three-lncRNA risk score) of the GSE15459 set were included in the stepwise regression model (Table 4). Furthermore, multivariate stepwise models were applied in each data set based on selected variables. As a result, the variable of the three-lncRNA risk score appeared in most stratified strata except the cardia, stage I/II subgroups in the GSE62254 set and the ≤64 years, female, intestinal, and diffuse subgroups in the GSE15459 set (Table S2).
Identification of the three-lncRNA signature associated with biological pathways and processes We carried out GSEA to identify associated biological processes and signaling pathways based on the risk score of the three-lncRNA signature in the GSE62254 set (Table  S3). The gene sets with a significantly different expression were visualized as interaction networks with the Cytoscape and Enrichment Map ( Fig. 4A and B). Several cancerrelated networks, namely extracellular matrix pathways, integrin pathways, focal adhesion pathways and the TGF-β pathway, were enriched in the high-risk group, which implied that the signature might be involved in tumor metastasis. Thus, we compared the risk scores of patients with different TNM stages and found that patients at advanced stages tended to have higher risk scores than those at early stages (Fig. 4C, P < 0.001).
Discriminatory and prognostic ability of the three-lncRNA signature for survival Because the GS62254 set contained the DFS information, we used ROC analysis to compare the sensitivity and specificity of GC recurrence between the risk score of the three-lncRNA signature, TNM stage, and age of these patients. The area under the receiver operating characteristic (AUROC) was determined and compared among these three prognostic factors. Figure 5A shows that the AUROC of the three-lncRNA risk score was 0.688, which was larger than that of each single lncRNA (0.677 for TGFB2-OT1, 0.620 for LINC01140, and 0.610 for RP11-347C12. 10). In addition, as illustrated in Figure 5B, there was no significant difference between the AUROC of the three-lncRNA risk score with the TNM stage (AUROC = 0.741, P = 0.187). However, we observed that the merged AUROC of three-lncRNA risk score and the TNM stage (AUROC = 0.782) was larger than for each individually (for the three-lncRNA risk score, P = 0.018; for the TNM stage, P = 0.301) and showed a good performance.
We also used a likelihood test to determine whether the signature really added consistent prognostic power to the TNM stage. Akaike information criterion (AIC) and Schwarz criterion (SBC) were employed to access the most appropriate model for OS in the GSE62254 set. The lowest value of AIC and SBC indicated the preferred model, which in this case was adopting both the three-lncRNA signature and the TNM stage prognostic parameters (Table S4).

Discussion
Over the past several years, a number of the genome's repertoire of nonprotein-coding transcripts, including lncRNAs, has been viewed as inconsequential transcriptional "garbage." Owing to the achievement of ENCODE and the implementation of The Cancer Genome Atlas (TCGA) program, lncRNAs have been highlighted for their important roles in cancer development and progression [19,20]. The involvement of lncRNAs in fundamental biological processes such as cell cycle regulation, apoptosis, and the DNA damage response and their implication in some human diseases are increasingly reported. More recently, most studies have shown that altered lncRNA expression levels are associated within the spectrum of disease development, but their prognostic values have rarely been investigated. To explore potential prognostic lncRNAs for GC, we achieved lncRNA profiling by mining the existing microarray gene expression data from the GEO database. In this present study, by analyzing the associations between lncRNA expression profiles and clinical features of GC patients in two large cohorts, a three-lncRNA signature was identified that was significantly associated with patients' OS and DFS. IQR, interquartile range (from 25th percentile to the 75th percentile); SD, standard deviation; NA, not applicable.
LncRNA and Gastric Cancer Prognosis P. Song et al.  By applying the three-lncRNA signature to the patients of the GSE62254 and GSE15459 sets, obvious separations could be observed in the survival curves between patients with low-or high-risk signatures. What was apparent from the results was that patients with low-risk scores had a significantly prolonged survival time compared with patients with high-risk scores. Regardless of whether the risk scores were evaluated as continuous variables or category variables, the correlation between this three-lncRNA expression signature and survival was significant. In the stratified analysis, we further found that the three-lncRNA risk score influenced the survival in different strata except for the cardia type, stage I/II, and stage III/IV subgroups for GSE62254, and female and stage I/II subgroups for GSE15459.
Further analysis uncovered that the lncRNA risk score was associated with age, Lauren type and TNM stage, especially the Lauren type. Compared with intestinal GC, diffuse GC has a greater tendency toward lymph node metastasis, advanced TNM stage, and a poor survival. In the diffuse GC patients, the risk scores of lncRNA were significantly higher than those in the intestinal GC patients. Despite the exact mechanism of the lncRNA signature with survival remaining unknown, the poor prognosis for patients with high-risk scores could be partially due to its association with some key clinical characteristics (more aggressive pathological type and later TNM stage). Interestingly, even when stratified by some clinical variables (e.g., Lauren type), the prognostic value of the lncRNA signature still existed, which suggests that it might be a significant determinant of survival in GC and not an accidental feature of the transcription noise.
For the characteristics of the three lncRNAs, no functional studies involving them in GC were reported. To the best of our knowledge, our present study is the first to report the relationship between their expression levels and survival time. We then analyzed the genomic locations of these putative lncRNAs and found that they overlapped with some transcripts of oncogenes or tumor suppressor genes. TGFB2-OT1 is a newly discovered lncRNA derived from the 3′-UTR of TGFB2 and can regulate autophagy in vascular endothelial cells. Huang et al. reported that TGFB2-OT1 acts as a ceRNA, competes for binding with miR-4459, miR-3960, and miR-4488, and regulates the expression of the miRNA targets to affect autophagy and inflammation [21]. LINC01140 and RP11-347C12. 10 have been identified as long intergenic noncoding RNA, which may regulate the transcription of genomically neighboring protein-coding genes in cis (such as HS2ST1 and CD2BP2) and of distant protein-coding genes in trans [22]. Thus, it will be worthwhile to investigate the functional roles of these lncRNAs.
The identification of this three-lncRNA signature that was associated with outcome in GC patients had some clinical implications. On the one hand, we found that the prognostic value of our three-lncRNA signature was independent of age and TNM stage in the stepwise Cox regression. Presently, age and TNM stage, particularly the TNM stage, have been regarded as important predictors for survival of GC patients [23]. Stage III and IV patients demonstrate a high local recurrence rate and poor survival outcome compared with those at stage I and II. However, clinically we can find that even patients with the same TNM stage might have different prognoses. This highlight the reasons for the unremitting exploration and hard work for identifying new biomarkers for a more precise survival prediction of high-risk patients with GC and a consequently improved personalized cancer treatment. For this purpose, we developed a three-lncRNA expression signature model that was closely associated with survival of GC patients of stage III/IV. However, this phenomenon was observed in the GSE15459 set and not in the GSE62254 set, so prospective multicenter large-scale studies are essential to test the idea. On the other hand, the three-lncRNA signature had a similar predictive value as the TNM stage for disease recurrence in the ROC analysis. The combination of the three-lncRNA signature and the TNM stage could have a stronger power for DFS. The ability of our lncRNA signature implied that it could be useful for identifying subgroups of GC patients with identical TNM stages. Collectively, these data suggested that the three-lncRNA signature might be a novel molecular target.
Moreover, GSEA was conducted to determine whether a predefined functional gene set showed coordinated expression based on the risk scores. The results of the GSEA revealed that the three-lncRNA signature was more likely to involve extracellular matrix pathways, integrin pathways and focal adhesion pathways. The extracellular matrix can, via its receptors (especially the integrins), induce a variety of intracellular signals and regulate several cellular responses, including migration, differentiation, and proliferation, and has emerged as a major pathway contributing to cancer cell survival [24]. Most notably, integrinmediated cell attachment has been shown to be required for tumor invasion and metastasis [25,26]. As is already well known, the major risk factors for GC prognosis are lymph node and distant metastasis, which usually occur in the advanced stages. Patients in the advanced stage were validated in our study to get higher risk scores than those in the early stage. Therefore, the enriched signaling pathways might support that our three-lncRNA signature had survival prediction power and suggested possible avenues for future targeted therapies.
To date, gene expression profiling has commercially served as adjuncts for the treatment of cancers, including breast, prostate, and colon cancer. For example, the 21gene recurrence scores (Oncotype DX Breast Cancer Assay) are utilized as an important indicator to evaluate distant disease recurrence and the benefit of adjuvant chemotherapy in estrogen-receptor-positive breast cancer. However, no such effective prognostic tool is available for patients with GC to help physicians and patients determine the best course of therapy. As shown in this study, a small number of genes (three genes) could be sufficient to predict the prognosis of GC, which provide a valuable and feasible reference for clinicians.
Several limitations to this study needed to be noted. First, in our study only a portion of human ncRNAs was analyzed, and these ncRNAs were obtained by repurposing the microarray probes. Thus, the prognostic lncRNAs identified here might not represent all of the lncRNA candidates that potentially correlate with GC survival. Second, we did not investigate the mechanisms behind the prognostic values of these three lncRNAs in GC, and experimental studies on cancer cell lines and xenograft models would provide important information to further the understanding of their functional roles. Third, we only recapitulated our findings in two published datasets, and thus more datasets are required for further validation. It is worth mentioning that some important characteristics (age, TNM stage) between the two datasets are quite different, but the three-lncRNA signature was still associated with patients' survival in both two sets, indicating the prognostic value of this signature was robust.
In conclusion, our study reveals a three-lncRNA signature that is linked to survival in GC patients. The prognostic value of this signature is independent of the TNM stage, one of the main predictive factors. Taken together, this innovative signature may serve as possible candidate biomarkers and therapeutic targets for GC. Future studies will focus on the validation of our findings in clinical trials and the functional effects of these identified lncRNAs. Figure S1. Random survival forests-variable hunting analysis for identifying valuable lncRNAs. (A) Error rate for the data as a function of trees; (B) out-of-bag importance values for the three lncRNAs. Table S1. Demographic and clinical features of including subjects involved in this study. Table S2. Results of stepwise Cox regression analysis on gastric cancer patients' survival among different groups. Table S3. Gene set enrichment analysis describes biological pathways associated with risk score. Table S4. Likelihood tests for different survival models.