Systematic profiling of invasion‐related gene signature predicts prognostic features of lung adenocarcinoma

Abstract Due to the high heterogeneity of lung adenocarcinoma (LUAD), molecular subtype based on gene expression profiles is of great significance for diagnosis and prognosis prediction in patients with LUAD. Invasion‐related genes were obtained from the CancerSEA database, and LUAD expression profiles were downloaded from The Cancer Genome Atlas. The ConsensusClusterPlus was used to obtain molecular subtypes based on invasion‐related genes. The limma software package was used to identify differentially expressed genes (DEGs). A multi‐gene risk model was constructed by Lasso‐Cox analysis. A nomogram was also constructed based on risk scores and meaningful clinical features. 3 subtypes (C1, C2 and C3) based on the expression of 97 invasion‐related genes were obtained. C3 had the worst prognosis. A total of 669 DEGs were identified among the subtypes. Pathway enrichment analysis results showed that the DEGs were mainly enriched in the cell cycle, DNA replication, the p53 signalling pathway and other tumour‐related pathways. A 5‐gene signature (KRT6A, MELTF, IRX5, MS4A1 and CRTAC1) was identified by using Lasso‐Cox analysis. The training, validation and external independent cohorts proved that the model was robust and had better prediction ability than other lung cancer models. The gene expression results showed that the expression levels of MS4A1 and KRT6A in tumour tissues were higher than in normal tissues, while CRTAC1 expression in tumour tissues was lower than in normal tissues. The 5‐gene signature prognostic stratification system based on invasion‐related genes could be used to assess prognostic risk in patients with LUAD.

diagnostic methods, chemotherapy, radiotherapy, and surgical diagnosis and treatment options in recent years, the prognoses of patients with LUAD remain poor. 7 At present, lung cancer treatment mainly depends on histological type and clinical stage, but due to the high heterogeneity of LUAD, even patients with the same histological type and clinical stage of LUAD have different prognoses, so the classification of LUAD based on high-throughput sequencing data is of great significance for individualized and accurate LUAD treatment.
In recent years, the use of high-throughput sequencing technology to detect a large number of gene expression changes, combined with the use of bioinformatics methods to systematically analyse tumour-related genes and their regulatory mechanisms, has become an important research means in functional genomics, and it has been widely used to screen potential tumour biomarkers. [8][9][10] In their research of LUAD, Krzystanek et al 11  In this study, we identified molecular subtypes of LUAD based on tumour invasion-related genes by using gene expression data from public databases, such as The Cancer Genome Atlas (TCGA) and GEO, for the first time. We evaluated the relationships between the molecular subtypes and prognosis and clinical features. The prognostic risk model based on differentially expressed genes (DEGs) between the LUAD subtypes could be used to evaluate LUAD prognosis. In addition, the nomogram we constructed could be used to help clinical decision-making and prognosis judgement.

| Data download and preprocessing
RNA-sequencing data and clinical follow-up information for LUAD were downloaded from the TCGA database. The GSE31210 chip data set containing survival time information was downloaded from the GEO database.
Invasion-related genes were obtained from the CancerSEA website, 16 which contains 97 genes (Table S1).
The TCGA-LUAD RNA-sequencing data were preprocessed as follows: (1) the samples with no clinical follow-up information were removed; (2) the samples with no survival time information were removed; (3) the samples with no status information were removed; (4) the Ensemble IDs were transformed into gene symbols; and (5) the median expression of multiple gene symbols was obtained.
The GEO data were preprocessed as follows: (1) the samples with no clinical follow-up information were removed; (2) the samples with no survival time or status information were removed; (3) the probes were converted into gene symbols; (4) the probes were mapped to multiple genes, and the probes were deleted; and (5) the median expression of multiple gene symbols was obtained.
After preprocessing, there were 500 TCGA-LUAD samples and 126 GSE31210 data set samples. The clinical statistics of the samples can be found in Table 1.

| Consistent clustering
The expression levels of 97 invasion-related genes were extracted from the TCGA expression profiles, and genes related to LUAD prognosis were obtained by univariate Cox analysis using the coxph function in R (P < .01). ConsensusClusterPlus (V1.48.0) was used to cluster the samples consistently according to significant genes from the singlefactor Cox analysis (parameters: reps = 100, pItem = 0.8, pFeature = 1, distance = Minkowski). Pam and Minkowski distances were used as the clustering algorithm and distance measure, respectively.

| Identification of differentially expressed genes
The DEGs of different molecular subtypes were calculated by using the limma package in R. 17 The DEGs were filtered according to the threshold of FDR <0.01 and |log2fc| > 1, and then, volcano maps were plotted.

| GO and KEGG enrichment analyses
The results of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses using the DAVID database showed that P < .05 was statistically significant. The results were visualized by using the ggplot2 package in R (V3.5.3).

| Calculation of immune scores
Stromal, immune and estimate scores were calculated by using the ESTIMATE package in R. Ten immune cells were evaluated by using MCPcounter, and 28 immune cells were evaluated by using singlesample gene set enrichment analysis (ssGSEA) with the GSVA package in R. 18

| Construction of prognostic risk model
The 500 samples in the TCGA data set were divided into training and validation cohorts. To avoid the influence of random assignment bias on the stability of subsequent modelling, all samples were put back into the random grouping 100 times in advance, and the group sampling was carried out according to the training cohort-to-validation cohort ratio of 1:1. The final training cohort contained 250 samples, and the final validation cohort contained 250 samples.

| Univariate and multivariate Cox regression analyses
The survival coxph function in R was used to analyse the DEGs among the molecular subtypes and the survival data through univariate Cox proportional hazard regression. We selected P < .001 as the filter threshold, and genes related to prognosis were obtained. The R package glmnet was used to perform Lasso regression on the DEGs and prognosis-related genes to compress the risk model to reduce the number of genes. 19 The step method in the stats package in R starts from the most complex model and reduces the number of parameters by deleting 1 variable in turn. The smaller the step value, the more superior the model. This means that the fitting degree of the model is better with fewer parameters. The number of genes in the risk model was further reduced by using the Akaike information criterion (AIC) algorithm.

| Gene set enrichment analysis
To observe the relationships between risk scores and biological were selected. Based on the results of single-and multi-factor analyses, a nomogram model was constructed. 20

| Gene expression in pan-carcinoma
We downloaded 6 immune-infiltrating cell scores from 33 cancers in the TIMER database (https://cistr ome.shiny apps.io/timer/), and we analysed the expression of 5 genes in these 33 cancer tissues.

| Clinical expression of genes in the Oncomine and GEO cohorts
Oncomine (http://www.oncom ine.org) is a gene chip-based database and integrated data-mining platform. In this study, we set the screening criteria as follows: (a) cancer type: LUAD; (b) analysis type: cancer vs normal analysis; and (c) threshold criteria: P < .05, fold change >1.5 and gene rank = top 10%. The LUAD cohort was downloaded from the GEO database, and the ggplot2 R package was used to visualize the expression levels of 5 genes in the LUAD data set.

| Immunohistochemistry and proteinlevel validation
The Human Protein Atlas (HPA) provides information on the tissue and cell distributions of 26 000 human proteins. We explored protein levels relating to the 5 genes in normal lung and tumour tissues.

| Study flow chart
To make the research easier for readers to understand, we drew a methodology flow chart ( Figure 1).

| Molecular typing of LUAD based on invasionrelated genes
Through univariate Cox analysis of the 97 invasion-related genes in the TCGA expression profile data, 19 genes were found to be associated with LUAD prognosis (P < .01; Table S2). Consistent cluster analysis showed that the samples could be clustered at k = 3 ( Figure 2A). The expression levels of the invasion-related genes in the 3 subtypes are shown in Figure 2B. These levels were different among the C1, C2 and C3 subtypes. Most of the genes were highly expressed in the C3 subtype and lowly expressed in the C2 subtype. We further analysed the relationships between the 3 subtypes and prognosis. The results showed that there were significant differences between the 3 subtypes. The prognoses of patients with the C2 subtype were the best, and those of patients with the C3 subtype were the worst (log-rank P < .05; Figure 2C,D).

| Identification and functional analysis of DEGs among subtypes
The DEGs between C1, C2 and C3 were identified by using the limma package in R. The volcano map of the DEGs between C1 and C3 is shown in Figure S1A; there were 98 up-regulated genes and 123 down-regulated genes. The volcano map of the DEGs between C1 and C2 is shown in Figure S1B; there was 1 up-regulated gene and 4 down-regulated genes. The volcano map of the DEGs between C2 and C3 is shown in Figure S1C; there were 389 up-regulated genes and 267 down-regulated genes.
A total of 669 DEGs between C1/C2, C2/C3 and C1/C3 were obtained, and these DEGs were further analysed by KEGG pathway and GO functional enrichment analyses using the WebGestaltR

| Clinical correlations of molecular subtypes and comparison with existing subtypes
The distributions of different clinical features among the C1, C2 and C3 subtypes were compared. The results showed that there were significantly more C2 patients than C1 and C3 patients in the T1, N0 and Stage I samples, while there were significantly fewer C2 patients than C1 and C3 patients in the T2, N1 and Stage II samples ( Figure 2E-G). The number of survivors in the C2 group was significantly higher than in the C1 and C3 groups ( Figure 2H). These results confirmed that patients with the C2 subtype had the best prognoses.
Previous studies have analysed 33 cancers in the TCGA database. These studies clustered non-blood tumours into 6 immune subtypes based on the distributions of various features, such as macrophages, immune-infiltrating lymphocytes, transforming growth factor-beta response, interferonγ response and wound healing; these subtypes include C1 (wound healing), C2 (INFγ predominance), C3 (inflammation), C4 (lymphocyte depletion), C5 (immunological silencing) and C6 (transforming growth factorbeta predominance), among which C1 and C6 have been associated with poor prognosis. 21 By comparing the molecular subtypes with these immune subtypes, it was found that most LUAD patients in the TCGA data set belonged to the C1, C2 and C3 immune subtypes (about 89.5%), and there were no patients with the C5 immune subtype in the LUAD TCGA data set ( Figure 2I). By comparing the distributions of the molecular and immune subtypes, it was found that patients with the C3 molecular subtype showed the highest proportion of the C2 immune subtype, reaching 54% ( Figure 2J). The proportion of the C2 immune subtype among the C2 molecular subtype was lower, and the proportion of the C3 immune subtype was higher than that of the C3 molecular subtype. The survival curve analysis results showed that there were significant differences in OS among the immune subtypes (P < .05; Figure 2K). These results suggested that the prognosis of the C3 immune subtype was better than that of the C2 immune subtype.

| Comparison of immune scores among subtypes
The relationships between the molecular subtypes of the TCGA data set and immune scores were identified by using the ESTIMATE software package in R, MCPcounter, and the ssGSEA method in the GSVA package. The results showed that there were significant differences in immune scores among the different subtypes ( Figure 3A-C).
The heat map of immune scores among the 3 subtypes is shown in Figure 3D.

| Construction of risk model
The 500 samples in the TCGA data set were grouped according to the training set-to-validation set ratio of 1:1, and the univariate Cox proportional hazard regression model method was used to evaluate the 669 DEGs between the molecular subtypes. A total of 29 genes were found to be associated with prognosis (Table S3).
Lasso regression was used to further compress the 29 genes. The trajectory of each independent variable is shown in Figure S2A.
As lambda decreased, the number of independent variable coefficients tending to 0 increased. We

| Verification of risk model robustness in internal and external data sets
The robustness of the model was verified by the internal data set (TCGA validation set and all data sets) and external data set (GSE31210 data set). In all data sets, the same model and coefficients as those in the training set were used. The survival curve showed significant differences between the high-and low-risk groups in the verification set and all data sets ( Figure 4D,G). The risk score of each sample was calculated according to gene expression, risk score distributions were plotted in TCGA internal validation set and all data sets in Figure 4E,H. The classification efficiencies of prognosis prediction at 1, 3 and 5 years in the TCGA testing cohort and entire TCGA cohort are shown in Figure 4F,I, respectively. The 1-year AUC reached 0.73 in both data sets.
Z-score transformation of risk scores was performed in GSE31210 data set. Samples with risk scores >0 after Z-score transformation were divided into the high-risk group, and samples with risk scores <0 after Z-score transformation were divided into the low-risk group. This resulted in 94 samples in the high-risk group and 132 samples in the low-risk group. The survival curve showed a significant difference between the high-and low-risk groups (P = .0028; Figure 4J).
The risk score distribution of the samples in the GSE31210 cohort was consistent with that of the training set ( Figure 4K). Receiver operating characteristic (ROC) analysis showed that the 1-year AUC reached 0.79 ( Figure 4L). The results showed that the risk scores of C3 subtype samples with poor prognosis were significantly higher than those of C2 subtype samples with good prognosis (Figure S3H), which further suggested that high-risk scores were associated with poor survival outcome.

| Relationships between risk scores and pathways
To observe the relationships between the risk scores and biological   Figure S5A,B). A nomogram was constructed based on the significant variables of multiple factors ( Figure 6A), and the results showed that risk scores had the greatest effect on survival prediction, suggesting that the 5-gene signature was a good predictor of survival. Furthermore, by using calibration curves to evaluate the accuracy of the model ( Figure 6B), it was observed that the calibration curves at 1, 3 and 5 years were close to the standard curve, suggesting that the model had good prediction performance.

| Construction of nomogram
Moreover, decision curve analysis was used to evaluate the model's reliability ( Figure 6C), and the results showed that the benefits of risk scores and the nomogram were significantly higher than those of the extreme curve, and the effect of the nomogram was higher than the effects of T stage, N stage and risk scores, which were close to the extreme curve, suggesting that risk scores and the nomogram had good clinical applicability.

| Comparison of risk model with other models
To prove the superiority of our model, 3 risk models, including an 8-gene signature (Li), 22 a 3-gene signature (Yue) 23 and a 3-gene signature (Liu), 24 were chosen to compare with our 5-gene signature.
To make the models comparable, we calculated the risk score of each LUAD sample in the TCGA data set by the same method, and we evaluated the ROC curve of each model. Z-score transformation of risk scores was performed. The samples with risk scores >0 after Z-score transformation were divided into the high-risk group, and samples with risk scores <0 after Z-score transformation were divided into the low-risk group. The survival curves were plotted.
The results showed that all 3 models could significantly classify the high-and low-risk groups into prognostic categories ( Figure 6E,G,I).
However, the AUCs of the ROC curves of the 3 models were lower than those of the 5-gene signature at 1, 3 and 5 years in the TCGA data set ( Figure 6D,F,H). These results showed that our model had good clinical predictive power.

| Expression and prognosis of 5 genes in 33 pan-cancers
The box diagram showed that MS4A1 was significantly highly ex-

| Clinical validation of 6 genes in terms of protein and mRNA expression
The results showed that in the Oncomine database, CRAT1 was lowly

| D ISCUSS I ON
In this study, we first genotyped the 500 LUAD samples of the TCGA data set based on 97 invasion-related genes, and we divided these cell transformation. 27 The KRT6A protein is a potential biomarker for distinguishing LUAD from squamous cell carcinoma. 28 The IRX5 is a transcription factor that is closely associated with a variety of malignancies. 29 We Z-scored the risk scores and divided the samples whose risk scores were >0 into the high-risk group and those whose risk scores were <0 into the low-risk group. The results showed that the high-risk score samples had significantly shorter survival times than the low-risk score samples. By analysing the relationships between risk scores and pathways, we found that the tumour-related path-  According to the significant clinical characteristics in the univariate and multivariate regression analyses, T stage, N stage and risk scores were used to construct the nomogram. Calibration and decision curve analysis curves suggested that the model had good prediction performance. Both the internal and external data sets also confirmed that the 5-gene signature was robust, and it could perform well in the independent data set (GSE31210). Our model performed better than other models of LUAD. One advantage of our model is that targeted sequencing based on particular genes reduces healthcare costs significantly compared with whole-genome sequencing. Second, we selected invasion-related genes as the target genes, which is very important for the early diagnosis and prognosis prediction of LUAD.
More importantly, in the routine clinical diagnosis and treatment process, patients' treatment plans and prognoses are largely dependent on pathological stage, the determination of which currently depends on the anatomic location of LUAD, so the biological heterogeneity of patients with LUAD is not currently being fully reflected. The nomogram we constructed can make up for this deficiency and provide a basis for the individualized treatment of patients with LUAD.
Gene expression was explored by using the Oncomine, GEO and HPA databases. The results showed that the expression levels of MS4A1 and KRT6A in tumour tissues were higher than in normal tissues, while CRTAC1 expression in tumour tissues was lower than in normal tissues.
Our study has some limitations. First, the population in the TCGA database is predominantly White and Black, and our results need to be validated in other racial groups. Second, the construction of the alignment map was done retrospectively, so our results need to be further validated in multi-centre clinical trials and prospective studies. In the future, we will explore whether other regression modelling methods can further improve the prediction accuracy of the model.
In conclusion, we identified molecular subtypes of LUAD based on tumour invasion-related genes, and we developed a 5-gene signature prognostic hierarchical system. We recommend the use of this classifier as a molecular diagnostic test to assess the prognostic risk of LUAD.