Development and Validation of a 23-Gene Signature for Prognosis Prediction in Lung Adenocarcinoma

Background: Lung cancer remains the most fatal tumorous disease in the worldwide. Among that, lung adenocarcinoma (LUAD) was the most common histological type. A precise and concise prognostic model was urgently needed of LUAD. We developed a 23-gene signature for prognosis prediction based on EMT, immune and stromal datasets. Methods: Univariate Cox regression analysis was performed to select genes which were signicantly associated with overall survival (OS) of the TCGA LUAD cohorts. LASSO regression and multivariate Cox regression analysis was used to build the multi-gene signature. Enrichment analyses and a protein-protein interactions (PPI) network were performed to show the interaction and functions of the signature. A nomogram was developed based on risk score and other clinical features. Predictive performance of the signature was externally validated in two independent datasets from Gene Expression Omnibus (GSE37745 and GSE13213). Results: A total of 1334 EMT, immune and stromal associated genes were obtained. After LASSO regression and multivariate Cox regression analysis, a 23-gene signature for risk stratication was built. K-M curves showed that the patients with high risk had a poorer outcome. Finally, a nomogram was built to predict prognosis. The predictive performance of the 23-gene signature was conrmed in internal and external validation. Conclusion: We developed and veried a 23-gene signature based on EMT, immune and stromal gene sets. It provided a convenient and concise tool for risk stratication and individual medicine. and Genomes; GO, Gene ontology; ROC, Receiver Operating Characteristic; AIC, Akaike Information Criterion; AUC: area under the curve; LASSO, least absolute shrinkage and selection operator; EMT-TFs, EMT-inducing transcriptions factors; MMPs, matrix metalloproteinases.


Introduction
Lung cancer remains the most commonly diagnosed cancer and the leading cause of oncological death, accounting for 18.4% of total cancer-related deaths worldwide, which makes it the most fatal tumorous disease (1). Non-small cell lung cancer (NSCLC) accounts of approximately 85% cases of lung cancer, among which, lung adenocarcinoma (LUAD) is the most common histological type, accounting for at least a half of all NSCLC cases (2,3).
Despite continued development in treatment modalities, including targeted therapy and immunotherapy, the prognosis of patients has improved signi cantly but the 5-year survival rate of patients remains < 20% (4,5). To date, histopathologic diagnosis and tumor staging system remain the main approaches of the prognostic prediction(6). However, these traditional approaches are not su cient for assessing the prognosis of LUAD patients. Therefore, identi cation and validation of a series of speci c prognostic biomarkers for screening and evaluating patients are imperative. In recent years, with the development of tumor biology, a series of biological behaviors and phenomena attracted our attention, including epithelial-mesenchymal transition (EMT) and tumor microenvironment (TME).
Epithelial-mesenchymal transition (EMT) is a complicated process involved in tumor metastasis (7).
During activation of the epithelial-mesenchymal transition (EMT) program, the cohesive and polarized tumor epithelial cells transform into a motile mesenchymal state with lacking the adhesion character of junction between epithelial cells(8). EMT process results in the increase of tumor metastasis as well as multidrug resistance (MDR), which affect prognosis of patients (9,10). However, EMT is a complex process involving genetic modi cation and immune alteration(8). Recently, a series of studies indicate that the tumor microenvironment (TME) play an important role in EMT, metastasis and MDR(6). Tumor microenvironment (TME) is a complex network, which comprises of a variety of cell type, including the epithelial cells, immune cells, stromal cells, cancer-associated broblasts (CAFs) and extracellular matrix (ECM) (11). During TME, the immune cells and TME-induced factors are involved in EMT phenotype. TMEinduced factors secreted by cancer cells and CAFs can directly induce the expression of EMT transcription factors or indirectly by activating other signaling pathways, such as WNT/β catenin signaling pathway, TNF-a signaling pathway(6). In lung cancer, a recent study demonstrates that the crosstalk between Th9/Th17 lymphocytes and cancer cells promote the EMT and metastasis (12). The TME and EMT interact to promote MDR, metastasis and tumor progression together(6, 12). These ndings indicate that EMT and TME-related genes may hold great promise as lung cancer prognostic markers.
Due to the complexity of EMT and TME related genes, traditional single-gene markers are often not accurate enough for prognosis prediction. We have carried out systematic discovery and veri cation of biomarkers, and constructed a 23-gene signature using The Cancer Genome of Atlas (TCGA) datasets.
The signature was based on EMT, immune and stromal (EIS) gene datasets, and external validation was performed in two independent LUAD datasets from Gene Expression Omnibus (GEO). The EIS signature may offer an accurate prediction and risk-strati cation in LUAD patients' prognosis and play an important role in the clinical management.

Study population
In the study, the RNA-sequencing (RNA-seq) datasets and corresponding clinical characteristics of 59 normal and 535 LUAD specimens were both downloaded from the TCGA website (https://portal.gdc.cancer.gov/). And the data was normalized and processed using R version 4.0.0 software. 500 patients with intact clinical characteristics were categorized as the training set, including gender, age, clinical stage, follow-up time and survival status.
The GSE37745 and GSE13213 datasets were obtained from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), which included 100 and 130 clinical and microarray-based gene expression data of LUAD specimens, respectively. They were categorized as the validation sets.

Enrichment analysis and PPI (protein-protein interactions) network construction
The top 300 differentially expressed genes between low risk and high risk groups were selected. After combining 23 genes of signature, Gene ontology (GO) analysis were performed in the 323 genes using R version 4.0.0 software.
We generated a PPI network of the 323 genes using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, https://string-db.org , version 11.0). The minimum required interaction score was set to 0.9.

ESTIMATE algorithm
As an algorithmic tool, ESTIMATE algorithm (Estimation of STromal and Immune cells in MAlignant Tumour tissues using Expression data) was performed to evaluated scores and in ltrating of stromal and immune cells between low and high risk groups. Besides, the tumor purity was calculated based on these.
The R package "estimate" was applied in R version 4.0.0 software.

CIBERSORT estimation
Cell-type Identi cation By Estimating Relative Subsets Of RNA Transcripts (CIBERSORT) was applied to assess the abundance ratio matrix of 22 kinds of immune cells. The differential abundance of immune cells in ltrates was obtained in cohorts with low and high risks. A violin illustration was plotted to visualize the results.

Statistical analysis
1184 EMT associated genes were identi ed from dbEMT 2.0 website(http://dbemt.bioinfominzhao.org/index.html), 102 immune (13) and 48 stromal (14) associated genes were selected from published datasets. R package "Edger" was utilized to identify differentially expressed genes of the 1334 genes with |the log(foldchange)| > 1.0 and FDR (adjusted p value) < 0.05. By univariate Cox proportional hazard regression analysis of selected genes, we attained the genes signi cantly related to overall survival (OS) of TCGA-LUAD patients. After performing the least absolute shrinkage and selection operator (LASSO) regression analysis of these OS associated genes, an EMT, immune and stromal associated multi-gene signature was developed. Kaplan-Meier curves were plotted to determine survival outcomes and the log-rank test was used to evaluate the statistical signi cance. We then calculated the risk score for each patient to predict OS. The cutoff value was taken as the median of the risk score, and the TCGA-LUAD cohort was dichotomized based on it. The Cox proportional hazards regression model was used to perform univariate and multivariate analysis to access the association between risk score and OS. A nomogram was constructed with age, stage and risk score, using the R library "survival" and "rms" package. Following that, the validation of the nomogram-based prediction model was accessed by calibration curves and Receiver Operating Characteristic (ROC) curves. Calibration was used to access the concordance between predicted and actual survival. ROC curves were used to evaluate the sensitivity and speci city of survival prediction based on the nomogram. All statistics were performed by R software (Version 4.0.0; https://www.R-project.org). In all analyses, a P value of <0.05 was considered statistically signi cant.

A 23-genes signature was identi ed for LUAD patients' risk strati cation
We selected 1184 EMT associated genes from dbEMT 2.0 website(http://dbemt.bioinfominzhao.org/index.html) and a total of 150 immune and stromal related genes from published datasets to generate multigene signature. (Supplementary Table 1) A total of 1334 genes were subjected to differential expressed genes selection and univariate Cox regression analysis. After that, a total of 121 genes were identi ed which were signi cantly related to the OS of TCGA-LUAD patients. Performing  (Figure 2A and 2B). Following that we examined the genetic alteration of these 23 genes in the cBioPortal database (http://www.cbioportal.org). A total of 507 LUAD patients/samples with mutation and CNA data were concluded, 23 genes were altered in 272 (54%) of queried patients/samples. And the main alteration was ampli cation ( Figure 2C). Frequent genetic alterations mean that these genes play an important role in the development of LUAD.
The risk score of each patient was calculated based on the Cox regression coe cients and the mRNA expression value. Set the median of the risk score as cutoff value and dichotomize the cohorts into high and low risk groups. A heatmap was plotted to show the gene expression difference between high and low risk groups ( Figure 3A). The distribution of risk score and survival time from top to bottom was performed in Figure 3B and 3C, which showed the relationship between risk score and survival time in our LUAD cohort. The K-M survival curve showed that the patients with high risk had a poorer outcome ( Figure 3D). ROC curve analysis was performed to evaluate the forecasting property of the 23-gene signature. The AUC values of 1, 3 and 5 years were 0.8, 0.783 and 0.7538, respectively ( Figure 3E).

Establishment of a risk associated nomogram for prognostic prediction
We performed univariate and multivariate Cox regression analysis based on the risk score and various clinicopathological characteristics. After univariate analysis, statistically signi cant features were identi ed which included age(P=0.049), risk score (P<0.001), stage (P<0.001). Multivariate regression analysis was performed on the above features. A forest plot was built, revealing that the risk score served as an independent predictor of prognosis ( Figure 4A).
Nomogram is considered a reliable tool for individualized risk assessment, by integrating multiple risk factors (15)(16)(17). In order to improve the accuracy of our gene signature prediction, by integrating the signi cant factors in univariate analysis (age, stage, risk score), we established a risk-nomogram to predict the 1-, 3-and 5-year OS of LUAD patients ( Figure 4B). Each factor assigned points proportionally according to the impact of its risk on survival. A higher total score, based on the sum of the assigned score of each factor, is associated with a poorer prognosis. The internal validation of the model was performed by ROC curves and calibration plots in 1-, 3-and 5-years ( Figure 4C and 4D). The AUC value of 1-, 3-and 5 years was 0.8, 0.784 and 0.73988 respectively, which showed the high predictive accuracy of the model.

Enrichment analysis and PPI (protein-protein interactions) network
GO analysis indicated that 23-gene signature and the top 300 differentially expressed genes was mainly related to DNA molecular structure and expression regulation. (Figure5 A). The PPI network showed the interaction between the 323 proteins. (Figure5 B) The minimum required interaction score was set 0.9 and the free molecules in the network were removed. GSEA analysis indicated that differentially expressed genes were signi cantly enriched in several essential pathways, including cell cycle, DNA replication, adherens junction, T cell receptor and B cell receptor signaling pathways. (Figure 5C) 3.4. Correlation between immune microenvironment, tumor staging and risk score.
CIBERSORT algorithm was applied to analyze the differential composition of tumor-in ltrating immune cells in cohorts with low and high risk. ( Figure 6A) Several immune cells were signi cantly associated to risk score, including T cells CD4 memory resting, monocytes, macrophages M0, dendritic cells resting, mast cells resting and mast cells activated. ESTIMATE algorithm was performed to calculated immune score, stromal score and tumor purity between cohorts with low and high risk, which showed signi cant correlation. (Figure 6B and 6C). Besides, risk score was also signi cantly related to tumor staging. ( Figure  6D We next evaluated the predictive property of the 23-gene signature in several independent LUAD cohorts (GSE13213 and GSE 37745) from GEO database. In each validation, we calculated risk score of each patients and divided patients into high and low risk groups based on the median of risk score. K-M curve was plotted to demonstrate that patients with a high risk had a worse outcome ( Figure 7A). The patients were ranked according to the risk and survival time ( Figure 7B and 7C). The accuracy of prediction was evaluated by ROC curves and calibration plot of 1-, 3-and 5-years ( Figure 7D and 7E).

Discussion
With the advancement of diagnosis and treatment methods, the death rate of lung cancer continues to decrease signi cantly, lung cancer remains the most fatal neoplastic disease all over the world, nowadays(18). Among which, LUAD is the most common histological type and treatment of LUAD has made important progress in chemotherapy, targeted therapy and other treatment methods. There are still a series of problems, such as drug resistance and identi cation of novel biomarkers (19,20). Therefore, the development of biomarkers is particularly important in the diagnosis and treatment of LUAD.
EMT is a complex cellular process, during which the epithelial features decrease and epithelial cells obtain mesenchymal phenotype (21). At the cellular level, activation of EMT leads to the loss of cell polarity, the interruption of cell-cell junctions, the degradation of basement membrane and the reorganization of extracellular matrix (ECM) (22). At the molecular level, activation of EMT repress the expression of epithelial cadherin (E-cadherin) and certain cytokeratins and active the expression of mesenchymal markers, such as neural cadherin (N-cadherin), vimentin, bronectin and β1 and β3 integrins (23). The activation of EMT and the above process are orchestrated by EMT-inducing transcriptions factors (EMT-TFs). EMT is closely related to tumor malignant progression, which includes tumor-initiating properties, motility and resistance to both chemotherapy and immunotherapy (24)(25)(26).
These brought EMT to our attention.
However, the activation of EMT rarely as a spontaneous process in carcinoma cells. It is reported that signals from the tumor microenvironment (TME) act on cancer cells to induce the expression of EMT-TF, thereby coordinating the expression of various components in the EMT-related program (22). A large number of studies veri ed that the activation of EMT were closely associated to TME-associated cells, including immune cells, stromal cells, cancer-associated broblasts (CAFs) (11,22). CAFs coordinately induce the activation of EMT in carcinoma cells by secreting a series of cytokines and growth factors, such as TGFβ, IL-6, EGF, VEGF and HGF (27)(28)(29)(30). The activated effector T cells facilitate the EMT activation by releasing cytokines such as IL-6, TNF,TGFβ and other unknown mechanisms (31). Besides, TAMs secrete an array of factors to induce EMT (32)(33)(34)(35). However, the effect of TME on EMT is not unilateral. Acting reciprocally, the tumor cells with EMT-related markers can regulate the activity of cellular components in the tumor microenvironment(36). Under the in uence of EMT-TFs, the cellular components with antitumor functions, such as cytotoxic CD8 + T cells and etc. are overwhelmed by immunosuppressive cells (37)(38)(39). In addition to changes in cell composition, immunomodulatory markers on cell surface were also altered which induce immune evasion (40,41). In summary, through interaction, EMT and TME in uence each other and play an important role in the malignant progression of tumors.
A series of studies indicate the potential of EMT markers as tumoral high-grade malignancy markers in NSCLC and other malignant tumors (42,43). However, due to the complexity, continuity and interaction with other phenotypes, a recent review indicated that one or a small number of markers maybe hard to assess EMT status. A combination of markers is more reliable (21). Therefore, the junction of EMT and TME markers is necessary for clinical oncology.
In our study, we selected 1184 EMT associated genes from dbEMT 2.0 website (http://dbemt.bioinfominzhao.org/), 102 immune associated genes and 48 stromal associated genes from published datasets to build our signature. We performed univariate Cox proportional hazard regression analysis on the 1334 genes and selected genes signi cantly associated to overall survival (OS) of TCGA-LUAD patients. After lasso regression analysis and multivariate Cox regression analysis, 23 genes were selected to build our EMT, immune and stromal associated genes, including KL, ECT2, FBLN5, TIMELESS, TERT, TYMS, FHL2, MNDA, TIMP1, IL11, CDH19, AOC4P, ETV1, KRT18, SATB2, EFNB2, TNFRSF11A, MEOX2, MEG3, ANGPTL4, WNT1, L1CAM, FURIN. The risk score was calculated of each patient to predict OS of LUAD patients. The K-M curves showed that the patients with high risk had a poorer outcome. More importantly prognostic power of the 23-gene signature then was validated in GEO cohorts. Gene signature is a powerful classi er to and has been widely used to predict prognosis of patients (44). In published research, different EMTassociated signatures were built in head and neck squamous cell carcinoma and pan-cancer, which were highly relevant to TME markers, respectively (45,46). However, as far as we know, there is no signature based on EMT and TME markers.
Bioinformatic analysis demonstrated that the 23-gene signature mainly related to cell cycle and cell division, ubiquitin-mediated proteolysis and cell adhesion molecules. Interestingly, a large number of genes from 23-EIS signature are related to cell cycle, division and DNA replication. For example, Epithelial cell transforming 2 (ECT2), an activator of Rho family GTPases, was involved in a variety of cell process, including adhesion, migration, cell division, growth, and centromere maintenance (47). As a oncoprotein, ECT2 is reported to play an important role in the development of multiple cancers including LUAD(48). Moreover, a series of genes, such as TIMELESS (49), TERT(50) and FHL2 (51) were also involved in cell cycle process, more than EMT, immune and stromal.
In addition, several key molecules of the EIS signature were also involved in classic signal pathways. EFNB2, cell surface transmembrane ligand for Eph receptors, is involved in Eph receptors and ephrins signaling, which affects tumor growth, invasion, metastasis and angiogenesis (52).Moreover, TIMP1(metalloproteinase inhibitor 1) binds to matrix metalloproteinases (MMPs) and inactivates them, which is closely associated to cell motility, invasion and angiogenesis. TIMP1 also acts as a growth factor to regulate cell differentiation, migration and cell death directly (53).
Importantly, we established a nomogram to predict the 1-, 3-and 5-year OS of LUAD patients. By integrating multiple risk factors, nomogram is considered as a reliable tool for risk prediction (54). As a predictive model, each factor associated with OS of patients was included into the nomogram, such as age, gender, and TNM stage. Besides, risk score based on gene signature can also incorporated as a predictive factor. According to the impact of its risk on survival, each factor was assigned points proportionally. In our study, after univariate analysis, the signi cant clinical factors (age, stage) and risk score was included into the nomogram. Due to the combination of the multi-gene signature, the predictive performance of the nomogram maybe more accurate than traditional nomogram with only clinical features. Since EMT is related to tumor metastasis and drug resistance, in addition to OS, this signature may have a potential predictive effect on metastasis possibility and medication nonadherence. Moreover, the detection method is simple and feasible, which is suitable to validate our gene-signature in biopsies.
However, certain limitations existed in our study. For example, the clinical features of the public cohort were not complete enough, such as neoadjuvant treatment information and smoking status of patients. Although the 23-gene signature can predict OS of patients accurately based on gene expression level, a deeper research of gene, such as gene mutation, was not included in our research. And also, the possibility of the genes being potential therapeutic targets was not clari ed.
In conclusion, we developed and veri ed a 23-gene signature of EMT, immune and stromal based on TCGA and GEO datasets. These genes are all of great importance and expected to be new targets. Besides, a nomogram based on these genes and clinical features was developed, which could accurately This study was based on publicly available data from the TCGA database, and did not involve interaction with human subjects or the use of personally identi able information. The study did not require informed consent for TCGA registration cases, and the author obtained a "limited use data agreement" from TCGA. No trial registration was required.