A Recurrence-Specific Gene-Based Prognosis Prediction Model for Lung Adenocarcinoma through Machine Learning Algorithm

Background After curative surgical resection, about 30-75% lung adenocarcinoma (LUAD) patients suffer from recurrence with dismal survival outcomes. Identification of patients with high risk of recurrence to impose intense therapy is urgently needed. Materials and Methods Gene expression data of LUAD were obtained from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. Differentially expressed genes (DEGs) were calculated by comparing the recurrent and primary tissues. Prognostic genes associated with the recurrence-free survival (RFS) of LUAD patients were identified using univariate analysis. LASSO Cox regression and multivariate Cox analysis were applied to extract key genes and establish the prediction model. Results We detected 37 DEGs between primary and recurrent LUAD tumors. Using univariate analysis, 31 DEGs were found to be significantly associated with RFS. We established the RFS prediction model including thirteen genes using the LASSO Cox regression. In the training cohort, we classified patients into high- and low-risk groups and found that patients in the high-risk group suffered from worse RFS compared to those in the low-risk group (P < 0.01). Concordant results were confirmed in the internal and external validation cohort. The efficiency of the prediction model was also confirmed under different clinical subgroups. The high-risk group was significantly identified as the risk factor of recurrence in LUAD by the multivariate Cox analysis (HR = 13.37, P = 0.01). Compared to clinicopathological features, our prediction model possessed higher accuracy to identify patients with high risk of recurrence (AUC = 96.3%). Finally, we found that the G2M checkpoint pathway was enriched both in recurrent tumors and primary tumors of high-risk patients. Conclusions Our recurrence-specific gene-based prognostic prediction model provides extra information about the risk of recurrence in LUAD, which is conducive for clinicians to conduct individualized therapy in clinic.


Introduction
Lung cancer, the 5-year overall survival (OS) rate of which is as low as 23% [1], is the leading cancer threatening people's health worldwide [2]. Lung cancer contains two major histological types: non-small-cell lung cancer (approximately 83%) and small-cell lung cancer [1]. According to the Cancer Statistics Review 2012 [3], lung adenocarcinoma (LUAD) accounts for 43.3% of all lung cancers, replacing squamous cell lung carcinoma as the most common type of lung cancer. For early-stage LUAD patients, surgical resection is recommended [4]. However, after curative surgical resection, about 30-75% patents suffer from recurrence [5][6][7]. Once recurrence happens, survival outcomes are dismal, with a range of 8-14 months of postrecurrence survival (PRS) [8] and 1year mortality as high as 48.3%-80.6% depending on the tumor stage [9,10].
Identifying patients with high probability of submitting to recurrence and imposing intense therapy might tremendously improve the survival outcomes of LUAD. Clinical decisions for LUAD patients were mainly based on clinicopathological features like TNM stage, surgical margins, differentiation, vascular invasion, or single gene mutation status like the epidermal growth factor receptor (EGFR) mutation and the BRAF V600E mutation [11,12]. However, these clinicopathological features fail to clearly identify patients with high risk of recurrence. Since tumorigenesis is complicated with numerous pathways regulated, researchers hypothesize that multigene profiles are capable of discriminating patients with heterogeneity survival outcomes [13]. Several groups have developed gene expression-based prediction model that successfully stratified LUAD patients into high-and low-risk groups [13][14][15][16][17][18]. Based on quantitative-PCR assay, Prof. David M Jablons developed a 14-gene expression prediction model, which stratified patients into low-risk, intermediate-risk, and high-risk groups. And the 5-year OS were 71.4%, 58.3%, and 49.2% for low-risk, intermediate-risk, and high-risk patients, respectively [19]. Benefiting from the accumulated public expression database like The Cancer Genome Atlas (TCGA) and Gene expression Omnibus (GEO), Prof. Chun-lai Lu built a risk model by screening prognostic-related genes using expression data from TCGA [20]. Many of these gene signatures are based on literature review or roughly screening survival-related genes using the Cox regression, which makes them not stable enough to be generalized in clinic.
It is rationally hypothesized that building a prediction model on the basis of recurrence-specific genes would better distinguish high-risk patients of recurrence. Therefore, aiming to identify high-risk LUAD patients of recurrence, we explored the recurrence-associated genes using the public GEO dataset and established a recurrence-free survival (RFS) prediction model using the expression data of LUAD patients from TCGA and validated its accuracy and feasibility in an external dataset.  2 BioMed Research International TCGA database were randomly divided into the training and testing subsets, with a ratio of 7 : 3.

Identification of Differentially Expressed Genes (DEGs).
The differentially expressed genes (DEGs) of the microarray-based data (GSE7880) were identified using the "limma" package [21] while the DEGs of sequencing-based data (TCGA) were identified using the "DESeq2" package [22]. DEGs of both datasets were determined based on an absolute log2 ðfold changeÞ > 1 and a P value less than 0.05.
The GSEA [23] software was used to calculate the normalized enrichment scores (NES) and false discovery rate (FDR) values for the Hallmark gene sets [24]. The genes were preranked according to the log fold change values. NES corresponds to the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. The normalization is based on the gene set enrichment scores for all dataset permutations.

Development of Prediction
Model. All data analyses were calculated using the related R packages on the R platform (https://cran.r-project.org/src/base/R-3/) (version 3.6.2). The univariate and multivariate Cox regression analyses were carried out using the "survival" package (v3. [1][2][3][4][5][6][7][8]. We normalized TCGA gene expression value into log 2 ðTPM + 1Þ and performed the univariate Cox regression analysis to find out the RFS-related gene candidates (P < 0:05) using the DEGs (note: TPM, transcripts per kilobase of exon model per million mapped reads). Then, the LASSO Cox regression analysis was carried out to select features (gene signature) with the best prediction power. The multivariate Cox regression analysis was performed to construct the prognostic model using the selected features, by which we calculated the risk scores of each patient and separated them into low-(risk score < 0) and high-risk (risk score > 0) subgroups. The survival plot was calculated with the "rms" package (v5.1-4), which were used to detect the significant difference of RFS risks between these two subgroups, and the logrank test was performed to state the differential significance between the two subgroups. Besides, the receiver operating characteristic (ROC) curve was employed to test the stability and sensitivity of this prognostic model using the R package "pROC" (v1.16.1) [25].

Establishment of Recurrence-Specific Gene-Based RFS
Predicting Model. In order to develop a robust RFS predicting model for LUAD, we collected the expression data of 426 LUAD patients from TCGA with available clear RFS. We extracted the 37 DEGs' expression profile from TCGA datasets and performed the univariate Cox regression analysis to 3 BioMed Research International identify RFS-related gene candidates. As a result, 31 genes were found significantly associated with the RFS of LUAD patients (P < 0:05, logrank test; Figure S1). Then, we randomly selected 70% of patients from TCGA dataset as the training cohort (298 samples) and the rest as testing cohort (128 samples) ( Table 1). The LASSO Cox regression analysis was applied to extract key genes with most RFS prediction power in training cohort. Finally, thirteen key genes including ACTR2, ALDH2, FBP1, HIRA, ITGB2, MLF1, P4HA1, S100A10, S100B, SARS, SCGB1A1, SERPIND1, and VSIG4 were extracted to establish the RFS prediction model. We  Figure 2). Risk score < 0 infers patients with low risk of recurrence, while risk score > 0 infers patients with high risk of recurrence.

Efficiency of the RFS Prediction Model.
Using the RFS prediction model, 48.66% and 49.22% of patients were classified into the high-risk group in the training and validation cohort, respectively. We found that patients with high risk suffered from worse RFS compared to patients with low risk in the training cohort (median RFS: 795 days vs. 3521 days; P < 0:01, logrank test; Figure 3(a)). Concordantly, similar result was further confirmed in the validation cohort (median RFS: 1084 days vs. 2701 days; P = 0:03, logrank test; Figure 3(b)). Furthermore, we validated the efficiency of the prediction model using an external validation cohort (443 patients) reported by Prof. David G Beer [26] from the GEO database (GSE68465; Table 1). After extracting the expression data of thirteen key genes, we categorize patients into high-risk and low-risk groups as previously elaborated. As expected, patients with high risk suffered from worse RFS compared to patients with low risk (median RFS: 31.50 months vs. 59.17 months; P = 0:01, logrank test; Figure 3(c)). Furthermore, we evaluated the efficiency of our prediction model in clinicopathological subgroups. In subgroup of age, gender, pathologic stage, smoking history, and location in lung parenchyma, better RFS happened in patients of the low-risk group compared to those of the high-risk group (Figures 3(d)-3(g); Figure S2A-F).
Combining the clinicopathological features including patient age, gender, pathologic stage, smoking history, location in lung parenchyma, and expression subtype [27] with our prediction signature, we performed the multivariate Cox regression analysis. Except our prediction signature, none of the clinicopathological signatures was related to the risk of RFS (HR = 13:37, CI: 1.75-99.10, P = 0:01, logrank test; Figure 3(h)). Also, we wonder whether the efficiency of the DEG-based signature is better than other clinicopathological signatures, so we compared the stability and sensitivity of the RFS prediction model using the ROC curve (Figure 3(i)). Compared to other clinicopathological features, the RFS prediction model achieved the supreme efficiency of predicting RFS (AUC = 96:3%; Figure 3(i)). Taken together, the recurrence-specific gene-based signature is capable of better stratifying LUAD patients into high-and low-risk groups compared to other clinicopathological features.

Key Pathways Associated with the High Risk of Recurrence in LUAD.
In order to figure out the key molecular pathways associated with the recurrence of LUAD, we detected the DEGs between patients with high risk and those  (Figures 4(b)-4(h)). As previously reported, all these pathways were associated with tumor progression [28][29][30][31][32][33]. It is noted that the G2M checkpoint pathway was the only pathway that was enriched in both recurrent tumors and primary tumor with high risk of recurrence (Figures 1(b) and 4(f)), which indicates its potential as treatment targets for patients prone to recurrence.

Discussion
With the aim of identifying LUAD patients with heterogeneity RFS, we detected the DEGs between primary and recurrent LUAD tumors, extracted RFS-associated genes, and  established the RFS prediction model using a machine learning algorithm based on a large cohort. Using the prediction model, we classified the patients into high-and low-risk groups and found that patients in the high-risk group suffered from worse RFS compared to those in the low-risk group. Concordant results were confirmed in the internal and external validation cohort. Compared to clinicopathological features, our prediction model possessed higher accuracy to identify patients with high risk of recurrence. Finally, we found that the G2M checkpoint pathway was enriched in both recurrent tumors and primary tumors of high-risk patients.
Due to the high proportion of recurrence that occurred in LUAD, more and more researchers realize the importance of identification of patients with high risk of recurrence. Considering the limitations of clinicopathological features, combination of multisurvival-associated genes might be an ideal way to solve this problem, and pilot studies achieved significant progress [13-15, 19, 20, 26]. Instead of extracting genes merely associated with survival outcomes using the Cox regression analysis, we developed our prediction model based on recurrence-specific genes using the machine learning algorithm. Most genes included in our final prediction model were reported to be related to survival in lung cancer or other cancers [34][35][36][37][38][39][40][41][42], which indicates the rationality of our prediction model. For example, high expression of P4HA1 and S100A10 was reported to be associated with dismal survival outcomes in LUAD [36,38]. Prof. Xiao-jing Wang found that MLF1 promotes the proliferation and colony-forming abilities of lung adenocarcinoma cells and significantly decreases apoptosis in vitro [39]. Since we reported the conduction of our prediction model clearly, it is feasible and convenient for clinicians to design the specific target panel and apply it in clinic to evaluate the risk of recurrence for each patient.
Our prediction model provides extra information about the risk of recurrence, which is conducive for clinicians to identify high-risk patients and impose intense therapy like adjuvant chemotherapy.
The G2M checkpoint pathway was found to be enriched both in recurrent tumors and primary tumors of high-risk patients, which infers its important association with recurrence. G2M checkpoint is an essential process of cell cycle which ensures that cells do not initiate mitosis until damaged or incompletely replicated DNA is sufficiently repaired. Thus, cell cycle checkpoint is the critical barrier to preserve genome integrity and chromosomal stability and prevent progression of tumors from early stages to malignant invasive lesions [29]. Expression of genes involved in cell cycle checkpoint pathway has been reported to be related to the survival outcomes in lung cancer [29,43]. For example, overexpression of PLK1, an early trigger for G2/M transition, is a negative prognostic factor in non-small-cell lung cancer patients [44]. Due to its critical role in tumorigenesis and progression, inhibitors of cell cycle regulators have attracted intense research interests [29]. As an example, MK-0457 (VX-680) blocks tumor xenograft growth and induces tumor regression in preclinical models [45]. Since the G2M checkpoint pathway was significantly enriched in recurrent and high-risk patients, combination of inhibitors of cell cycle reg-ulators and traditional chemotherapy or radiotherapy might achieve improved efficacy in patients with high risk of recurrence.
In conclusion, the signature we developed using the recurrence-specific genes is robust in predicting RFS outcomes of LUAD. Our prediction model provides extra information about the risk of recurrence, which is conducive for clinicians to conduct individualized therapy in clinic. To further apply in clinic, multicenter-based large-scale studies are warranted to verify the feasibility and stability of the model.

Data Availability
The data used to support the findings of this study are included within the article. The data and materials in the current study are available from the corresponding author on reasonable request.

Consent
All authors consent to submit the manuscript for publication.

Conflicts of Interest
The authors declare that they have no potential conflicts of interest.

Authors' Contributions
SH and ZJ both contribute to the work, including conception and design, article drafting, and revising. LK, ZM, and ZF are the guarantors for the article who accept full responsibility for the work. Figure S1: univariate Cox regression analysis to screen out the recurrence-free survival-related differentially expressed genes. Figure S2: efficiency of prediction model under different clinical subgroups. The K-M curve confirmed that the signature could significantly distinguish low-and high-risk groups in the sex subgroups (A and B), age subgroups (C and D), and location in lung parenchyma subgroups (E and F). Table S1: differential expressed genes between recurrent and primary LUAD. Table S2: differential expressed genes between high and low risks of LUAD patients of TCGA.