A gene-based survival score for lung adenocarcinoma by multiple transcriptional datasets analysis

Lung adenocarcinoma (LUAD) remains a crucial factor endangering human health. Gene-based clinical predictions could be of great help for cancer intervention strategies. Here, we tried to build a gene-based survival score (SS) for LUAD via analyzing multiple transcriptional datasets. We first acquired differentially expressed genes between tumors and normal tissues from intersections of four LUAD datasets. Next, survival-related genes were preliminarily unscrambled by univariate Cox regression and further filtrated by LASSO regression. Then, we applied PCA to establish a comprehensive SS based on survival-related genes. Subsequently, we applied four independent LUAD datasets to evaluate prognostic prediction of SS. Moreover, we explored associations between SS and clinicopathological features. Furthermore, we assessed independent predictive value of SS by multivariate Cox analysis and then built prognostic models based on clinical stage and SS. Finally, we performed pathway enrichments analysis and investigated immune checkpoints expression underlying SS in four datasets. We established a 13 gene-based SS, which could precisely predict OS and PFS of LUAD. Close relations were elicited between SS and canonical malignant indictors. Furthermore, SS could serve as an independent risk factor for OS and PFS. Besides, the predictive efficacies of prognostic models were also reasonable (C-indexes: OS, 0.7; PFS, 0.7). Finally, we demonstrated enhanced cell proliferation and immune escape might account for high clinical risk of SS. We built a 13 gene-based SS for prognostic prediction of LUAD, which exhibited wide applicability and could contribute to LUAD management.


Background
Lung cancer remains intractable but imperative to cope with for the highest morbidity and mortality among cancers [1]. A principle subtype of lung cancers is lung adenocarcinoma (LUAD), whose investigation means a great deal to us [2][3][4]. Advance in cancer biology demonstrated cancer could be regarded as a disorder caused mainly by aberrant genes, while some core ones even drive carcinogenesis [5,6]. That is to say, genes are undoubtedly valuable targets for cancer management.
In fact, remarkable achievements in clinical practice have proved powerful effect of genes on clinical oncology especially for LUAD [7,8]. First take chemotherapy for example. Many widely applied chemotherapeutic agents are aimed at critical genes in biological processes like cell proliferation and metabolism [9,10]. Besides, targeted therapy based on driver gene, such as epidermal growth factor receptor (EGFR), has significantly improved the prognosis of patients with specific genetic background [11,12]. Moreover, immunotherapy targeted at immune-checkpoint genes has achieved revolutionary progress for LUAD patients, especially for who have no targetable driver mutation till now [4,7,13].
Moreover, clinical predictions based on gene signatures also contribute much to handling cancer [14][15][16]. For example, some canonical biomarkers are references for distinguishing specific cancer from normal counterparts and other histologic subtypes [17,18]. Besides, other gene-based clinical prediction like prognostic prediction has emerged as a hot spot for the relative convenience to obtain and the great significance for treatment [15]. Tremendous advance in omics and mature application of statistical methods in bioinformation contributed greatly to bridge genes signatures and cancer characteristics. For example, The Cancer Genome Atlas (TCGA) program and Gene Expression Omnibus (GEO) database both offer abundant resources for cancer investigation. The Least Absolute Shrinkage and Selection Operator (LASSO) regression can both adjust the complexity and execute variable selection, thereby improving the prediction precision and interpretability of the regression model [19]. Moreover, compared with bioenrichment methods based only on differentially expressed genes (DEGs), Gene Set Enrichment Analysis (GSEA) could take into account those genes with subtle expression changes but significant biological significance, therefore, it is more comprehensive and precise [20]. Herein, we tried to build a gene-based survival score (SS) for LUAD via systematic transcriptome analysis, and this SS exhibited favorable predictive efficacy in multiple datasets.
Besides, 515 samples of TCGA-LUAD, GSE30219, GSE31210 and GSE50081 were candidates for enrichment analysis. And 253 samples of TCGA-LUAD containing comprehensive clinicopathologic records (age, gender, TNM parameters, clinical stage, OS and PFS) were applied for multivariate analysis.

Statistical methods
DESeq2 package (RNA-sequencing) and limma package (microarrays) were applied to DEGs (Adjusted P-value < 0.05, fold change > 2 or < 0.5) [28,29]. Z score was used to normalize data (Function: scale). Cox regression model, LASSO regression model, Kaplan-Meier (K-M) curve and log-rank test were used for survival analysis (Packages: survival, survminer and glmnet). Principal component analysis (PCA) was utilized for comprehensive assessment (Function: princomp; Packages: FactoMineR and factoextra). Receiver operating characteristic (ROC) curve analysis was performed to determine optimal cut-off (Packages: pROC). Logistic regression model was applied to find associations between genes and two-category data (TNM parameters and clinical stage were transferred to two-category data) (Function: glm). Pearson correlation analysis was applied for correlation assessment (Packages: ggcorrplot). GSEA was employed for biological investigation [20]. Wilcoxon rank sum test was applied for differential analysis between two groups (Function: wilcox.test). P < 0.05 was considered significant. Arithmetic functions were operated in R language [30].

Results
Identifying 13 core genes to establish SS for LUAD Genes closely related to tumor prognosis are likely to play key roles in tumor progression. Valuable candidates are DEGs between tumors and normal tissues. So we obtained the intersection of DEGs from four LUAD transcriptomic datasets (GSE10072, GSE32863, GSE43458 and TCGA-LUAD), and we acquired 52 upregulated DEGs and 180 downregulated DEGs (fold change > 2 or fold change < 0.5, Adjusted P-value < 0.05) (Fig. 1a) (Detailed information about acquiring DEGs could be seen in our previous research [31]). Filtrating survival-related genes from these DEGs was executed in TCGA-LUAD dataset, for its largest sample size and most complete clinical records. We conducted univariate Cox regression analysis towards OS for preliminary identification, and we obtained 16 hazardous genes from upregulated DEGs and 35 protective genes from downregulated DEGs (for hazardous genes, which may promote cancer, p < 0.05, HR > 1; for protective genes, which may prevent cancer, p < 0.05, HR < 1) (Fig. 1b, c). Since independence among variables is a prerequisite for establishing multi-factor models, we analyzed associations among the screened genes through a correlation matrix. The results showed strong relationship among these genes, so further screening was needed (Fig. 1d). Thereupon, we implemented LASSO analysis and we got 13 core genes: Abnormal Spindle Microtubule Assembly (ASPM), Epithelial Cell  Supplementary Table 1). Furthermore, we performed PCA for the comprehensive evaluation based on core genes (Fig. 1g). Finally, we chose the first 6 principal components (comps) to establish a SS for LUAD (The cumulative contribution rate is more than 70%): Score = 0.409981*comp.1 + 0.170820*comp.2 + 0.129785*comp.3 + 0.106358*comp.4 + 0.09938*comp.5 + 0.083671*comp.6.
And coefficients for genes to make up each comp are exhibited in Supplementary Table 2.

SS exhibits high-risk probability for OS in LUAD
Next, we estimated prognostic value of SS in LUAD. We chose OS for prognostic indicator and examined in four independent LUAD datasets (TCGA-LUAD, GSE30219, GSE31210, GSE50081). We divided patients into two groups (High and Low SS) based on optimal cut-off value via ROC curve analysis according to the survival status of OS in each dataset respectively (Fig. 2a-d). We found SS showed obvious high-risk probability for OS and patients with higher SS had shorter OS periods in four datasets (Fig. 2e-h).

SS possesses high-risk probability for PFS in LUAD
Then, we estimated predictive value of SS upon PFS in four independent LUAD datasets (TCGA-LUAD, GSE30219, GSE31210, GSE50081). We implemented ROC curve to divided patients into High and Low SS groups based on outcome status of PFS in these datasets respectively (Fig. 3a-d). Analogously, we found higher SS indicated bigger risk probability for PFS in all testing datasets (Fig. 3e-h).
SS correlates highly with clinicopathological features and functions as a novel independent risk factor for LUAD prognosis Further, we investigated relationships between SS and clinicopathological parameters of LUAD. Marker of proliferation Ki-67 (MKI67) and proliferating cell nuclear antigen (PCNA) are both canonical biomarkers for clinical oncology [32][33][34][35]. We found SS was intensively positively correlated with MKI67 in four LUAD datasets (TCGA-LUAD, GSE30219, GSE31210, GSE50081) ( Fig. 4a-d). Analogously, SS possessed strong positive association with PCNA in these datasets (Fig. 4e-h). TNM parameters and tumor clinical stage also play important roles in tumor handling. And we chose TCGA-LUAD dataset with relatively more complete clinical information for following analysis. We found that SS indicated high-risk probability for N (lymph node metastasis), M (distant metastasis) and clinical stage (Fig. 4i). Usually, univariate analysis might cover up the real prognostic function due to confounding factors. So we verified further SS could function as an independent risk predictor for both OS and PFS via multiple Cox regression analysis considering age, gender, TNM parameters and clinical stage in TCGA-LUAD (Table 1, 2). Moreover, we used clinical stage and SS to build concise nomographs predicting OS and PFS probability of LUAD (Fig. 4j, k). And predictive potencies were acceptable (C-indexes: OS, 0.7; PFS, 0.7).

Exploring molecular characteristics underlying SS in LUAD
We tried to uncover molecular mechanisms underlying clinical role of SS in LUAD. We first ranked the patients (515 samples in TCGA-LUAD) in order of SS. The top 50 patients were divided into High SS group, and last 50 patients were Low SS group (Fig. 5a). Then we performed GSEA to investigate molecular features of SS based on transcription profiling. We found high SS showed enhanced cell cycle in several gene sets (Fig. 5b). Further, we validated it in other three datasets (GSE30219, top 40 vs last 40; GSE31210, top 50 vs last 50; GSE50081, top 50 vs last 50), and found similar results (Fig. 5c-h). Last, we explored expression profile of immune checkpoints under SS in above four LUAD datasets, and found most of immune checkpoints possessed increased expression in high SS group (Fig. 5i-l).

Discussion
LUAD possesses strong heterogeneities in both tumor biology and clinical characteristics [36]. Therefore, it is urgently needed to precisely assess LUAD prognosis for applying appropriate intervention as well as avoiding overtreatment. Deregulation of gene expression during malignant transformation and progression offers theoretical basis to interpret carcinogenesis via gene signatures, while the tremendous advance in onco-genomics provides prominently practical convenience [6,14,16].
Here, we established a gene-based survival assessment  named SS, which exhibited good accuracy and stability in multiple datasets. Compared with former gene-based prognostic predictions [37][38][39][40], SS owns three specialties or innovations: 1. SS has favorable stability, or adaptability in clinical application. Our initial candidates are common DEGs between tumors and normal tissues in four independent LUAD datasets. After screening by Cox model and LASSO regression, we established a 13 gene-based SS in TCGA-LUAD, and validated its efficiency in other three LUAD datasets. 2. SS was multifunction for clinical usage. First, SS could assess both OS and PFS. OS is a golden standard of prognosis evaluation but takes a long time to collect, while PFS is a relatively convenient indicator for clinical intervention. Besides, SS was positively related to malignant biomarkers and tumor stage, and could function as an independent risk factor for prognosis. 3. Biological significance of SS was verified in multiple datasets. In four LUAD datasets, higher SS all indicated enhanced cell proliferation, which confirmed the prime trait of abnormal proliferation in carcinogenesis. Moreover, LUAD with higher SS exhibited increased expression of immune checkpoint genes, which underlined the prominent role of immune escape in malignant progression of LUAD. These 13 core genes building SS are involved in a diversity of biological processes like cell proliferation, nutrition transportation and material metabolism. Most of these genes have been reported to play critical roles in lung carcinogenesis, however, some lack detailed research. For example, ASPM and ECT2 both are proliferation-related genes and participate in DNA synthesis and cytokinesis [41][42][43]. Studies have shown that LUAD had high expression of ASPM and ECT2, which both indicated poor prognosis, and ECT2 could facilitate lung tumorigenesis [43][44][45][46][47][48][49]. Lung cancer also has high expression profile of GCNT3 (a gene regulating mucin synthesis), GOLM1(a gene coding for a Golgi transmembrane protein) and IGF2BP3 (a gene coding for a RNA binding protein), which are all positively correlated with malignant progression of lung cancer [50][51][52][53]. Nutrient transporter SLC2A1 and SLC7A5 both exhibit cancer-promoting potential in lung cancer [54][55][56]. TIMP1 was originally thought to be a tumor suppressor gene, since it could intensely inhibit matrix metalloproteinases (MMPs), canonical oncoproteins [57,58]. However, recent studies have shown that TIMP1 was highly expressed in LUAD, functioned as an independent prognostic risk factor and could facilitate malignant progression through non-MMPs pathways [58][59][60]. TYMS, a nucleotide synthetase, is commonly used as an indicator of chemotherapy sensitivity, that is its high expression in lung cancer often indicates insensitive for pemetrexed [61]. Also, for patients under platinum-based adjuvant treatment after surgical resection, high TYMS expression often indicates poorer prognosis [62]. SCGB1A1, coding for secretory globulins, has a protective role against smoking-induced lung tumorigenesis [63]. However, for CYP4B1, ARHGEF6 and FAM189A2, their function in lung cancer are rarely studied.
Of course, there are some flaws in our research. First, the present transcriptomic data mainly covered proteincoding genes, but increasing studies have uncovered that non-coding RNAs like microRNAs (miRNAs), long noncoding RNAs (lncRNAs) and circulating RNAs (cir-RNAs) present powerful biological functions, especially in cancer research [64]. Second, the transcriptomic data for analysis came from tissues, while estimation based on liquid detection would be less invasive and more executable [65]. Improvement will be made in our future work.

Conclusions
In brief, we built a gene-based survival SS for LUAD and proved its wide applicability for clinical predictions, which will assist in handling LUAD effectively.