Autophagy-related tumor subtypes associated with significant gene expression profiles and immune cell infiltration signatures to reveal the prognosis of non-small cell lung cancer

Autophagy plays an important role in non-small cell lung cancer (NSCLC). We aimed to establish novel autophagy-related tumor subtypes to distinguish the prognosis of NSCLC. In this study, gene expression profiles, mutation data and clinical information obtained from the Cancer Genome Atlas. Kaplan Meier-plotter could evaluate prognostic value of autophagy-related genes. Consensus clustering revealed autophagy-related tumor subtypes. Gene expression profiles, mutation data and immune infiltration signatures were identified, oncogenic pathways and gene-drug interactions were performed according to the clusters. Finally, a total of 23 prognostic genes were screened and consensus clustering analysis divided the NSCLC into 2 clusters. The mutation signature showed that 6 genes are special. Immune infiltration signatures showed that higher fraction of immune cells was associated with cluster 1. The oncogenic pathways and gene-drug interactions also showed different patterns. In conclusion, autophagy-related tumor subtypes have different prognosis. Understanding the subtypes of NSCLC are helpful to accurately identify the NSCLC and personalized treatment.


Introduction
Lung cancer remains the leading cause of cancer-related mortality worldwide, in which non-small cell lung cancer (NSCLC) account for nearly 85% of all the lung cancer [1,2]. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the main type of NSCLC [3,4]. There are improvements of therapy methods, such as chemotherapeutic drugs and immune therapy, while the 5-year survival rate of NSCLC patients is only 18% [5]. Furthermore, surgical resection is the most beneficial treatment for NSCLC, but most newly diagnosed patients are at the onset of advanced or metastatic stages and usually lost the chance for operation. For the clinical tumor-node-metastasis (TNM) stage IIIB NSCLCs, the 5-year survival rate is only 7%, and the 5-year survival rate of TNM stage IV NSCLC patients as low as 2% [6]. Hence, it is essential to manage the patients according to the biomarkers for early detection of NSCLC in order to improve the prognosis and reduce the mortality rates.
Autophagy is a key biological process, it could maintain the cellular homeostasis by engulfing cytoplasmic proteins, complexes or organelles within the autophagosome [7,8]. Autophagosome is a Ivyspring International Publisher cytoplasmic double-membrane and it can be transported and fused with lysosome to generate the autolysosome [9]. Autophagy was reported to associated with tumorigenesis [10,11]. Over the past few years, a lot of studies have elucidated that autophagy take part in the development and progression of NSCLC [12,13]. In brief, autophagy has dual functions in the tumorigenesis, including positive and negative effects. Positive parts behaved as proper degree of autophagy could clear damaged proteins and organelles in the early stages of the tumor so as to inhibiting tumor development [14]. Negative effects are involved in the advanced stages of tumorigenesis, autophagy could promote rapid growth of tumor cells by degrading and recycling the damaged or aged organelles components [15].
Up to date, there are increasing evidence indicated the role of autophagy-related genes in the development of cancer. Autophagy-related gene expression signature could serve as an independent prognostic indicator for serous ovarian cancer [16]. Eight autophagy-related genes (BCL2, BIRC5, EIF4EBP1, ERO1L, FOS, GAPDH, ITPR1 and VEGFA) were explored and the author found that these genes are significantly associated with overall survival in breast cancer [17]. The Beclin1 and LC3 genes were correlated with the tumor stage, metastasis conditions and mortality in pancreatic cancer [18]. There is also another one study proposed an autophagy-related gene prognostic signature and divided all the patients into high-risk and low-risk groups and the author concluded that autophagy-related gene prognostic signature is a promising biomarker for monitoring the outcomes of LUAD and LUSC [19]. These findings confirm the role of autophagy in cancers and suggest that autophagy-related genes maybe served as prognostic biomarkers.
Although there are many studies focus on the relationship between autophagy-related gene expression signature and the prognosis of cancer, very few studies have studied the reason why the autophagyrelated gene expression signature could influence the prognosis of NSCLC. The purpose of this study was to establish novel autophagy-related tumor subtypes to predict the prognosis of NSCLC. Meanwhile, we also want to explain possible reasons why the novel autophagy-related tumor subtypes could influence the prognosis of NSCLC.

Data collection
The gene expression profiles, mutation data and clinical information of NSCLC patients were downloaded from the Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/tcga/). In detail, TCGA contains a total of 1102 patients (including 103 adjacent normal lung tissues and 999 NSCLC tissues). The selected criteria were the followings: gene expression profiles, mutation expression profiles, studies compared adjacent non-tumorous lung tissues and NSCLC tissues in human. The excluded criteria were the followings: those studies that compared genes between lung cancer and benign disease in human, expression profiles using cell lines or serum, saliva, peripheral blood; patient had no survival time or survival status, patient had clinical information but no gene expression data. After screening according to the criteria, there are 990 NSCLC patients left, including 486 LUAD patients and 504 LUSC patients ( Table 1). The Human Autophagy Database (HADb; http://www.autophagy.lu) is the first human autophagy-dedicated database, it is a public repository containing information about the human genes described so far as involved in autophagy. A total of 232 genes from the HADb were identified as autophagy-related genes.

Identification of differentially expressed autophagy-related genes
Gene expression data from TCGA was analyzed by the R package limma package. The cut-off criterion was set as the p < 0.05 and absolute fold change > 2. In addition, the R package ggplot2 package was used to perform the volcano plots of all the autophagy-related genes between adjacent normal lung tissues and NSCLC tissues. Heat maps for the differentially expressed autophagy-related genes was generated using the R package pheatmap package. Then Kaplan Meier-plotter (https://kmplot.com/analysis/) was used to evaluate the prognostic value of differentially expressed autophagy-related genes in NSCLC.

Functional enrichment analysis and protein-protein interaction (PPI) network construction
Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions. Gene ontology (GO) enrichment analysis predicts the function of the target genes in three aspects, including biological processes, cellular components and molecular function. There are several ways to performed the functional enrichment analysis and clusterProfiler package was used in our study [20]. P<0.05 was the threshold for the identification of significant GO terms and KEGG pathways. The GOplot package was employed to visualize the enrichment terms. PPI network was constructed by the STRING database (https://string-db.org) [21] and cytoscape software [22].

Evaluation of tumor-infiltrating immune cells
CIBERSORT is an algorithm that uses gene expression data to quantify specific cell types in mixed cell populations [23]. The CIBERSORT method is used to estimate the score of immune cells in LUAD and LUSC samples. The normalized gene expression data were prepared using standard annotation files and data were uploaded to the CIBERSORT, then the R package Genefilter package was utilized to screen each LUAD and LUSC sample. With the threshold of p<0.05, the result of the inferred score of immune cell populations were considered accurate.

Statistical analysis
All statistical analysis was performed using the R software, and p<0.05 was regarded as statistically significant. The unpaired t test was used to assess the expression level of the autophagy-related genes between cluster 1 and cluster 2. The Kaplan-Meier survival curve analysis and the log-rank test were used to analyze overall survival. The difference of infiltrating immune cells between cluster 1 and cluster 2 was assessed by unpaired t test.

Identification of differentially expressed autophagy-related genes and their prognostic value in NSCLC
In this study, gene expression profiles from TCGA database in NSCLC were selected and the expression levels of differentially expressed autophagy-related genes were extracted. Genes with p<0.05 and absolute fold change>2 were considered as differentially expressed autophagy-related genes. A total of 232 autophagy-related genes were pooled from HADb. After screening process, there are 39 autophagy-related genes were differentially expressed in NSCLC, including 14 down-regulated genes and 25 up-regulated genes (Table 2) ( Figure  1A-B). Then Kaplan Meier-plotter online database was used to evaluate the prognostic value of 39 differentially expressed autophagy-related genes in NSCLC, and the results showed that only 23 differentially expressed autophagy-related genes ( Figure 1C) were related to the overall survival for NSCLC, including 7 down-regulated genes and 16 up-regulated genes ( Figure S1).

Functional enrichment analysis of differentially expressed autophagy-related genes
The expression levels of 23 prognostic differentially expressed autophagy-related genes were visualized by violin plots (Figure 2A-C) and heatmap ( Figure 2D). Besides, correlation of 23 prognostic differentially expressed autophagy-related genes were also explored ( Figure 2E). To determine biological functions of the prognostic 23 differentially expressed autophagy-related genes, gene ontology (GO) enrichment analysis was performed to predict the function of the 23 differentially expressed autophagy-related genes in biological processes, cellular components and molecular function. The results showed that the identified differentially expressed autophagy-related genes were mainly involved in biological processes were autophagy, process utilizing autophagic mechanism, intrinsic apoptotic signaling pathway, peptidyl-serine modification and regulation of endopeptidase activity. The most significantly enriched molecular function concentrated on protein phosphatase binding, phosphatase binding, eukaryotic initiation factor 4E binding, translation initiation factor binding and calmodulin binding. It seems differentially expressed autophagy-related genes have no significant relationship to CC according to the p value ( Figure  3A-B). Further Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis was also performed to investigate the significance of differentially expressed autophagy-related genes in the development of NSCLC. The result showed that 23 differentially expressed autophagy-related genes were enriched in 8 KEGG pathways ( Figure 3C-E). According to the results of functional enrichment of differentially expressed autophagy-related genes, we found that differentially expressed autophagy-related genes was not only connected to autophagy but also involved in other biological processes. So, in this study, we hope to identify autophagy subtypes different biological characters based on 23 differentially expressed autophagy-related genes associated with of NSCLC.

Consensus clustering and principal components analysis of differentially expressed autophagy-related genes identified two clusters of NSCLC
Autophagy may have different expression patterns among NSCLC patients, which partially affects the prognosis and gene expression signature. In this study, 23 differentially expressed autophagyrelated genes were used to identify autophagy subtypes associated with overall survival of NSCLC. Consensus clustering was used to explore the similarity of 23 differentially expressed autophagyrelated gene expression patterns. By selecting k value of 2, we obtained the optimal cumulative distribution function (CDF) value and classified the NSCLC patients into 2 clusters ( Figure 4A-C). Principal components analysis (PCA) revealed two significantly different distribution patterns of NSCLC patients. The samples of cluster 1 and cluster 2 were distributed on the left side and right side, respectively ( Figure 4D). Consensus clustering and principal components analysis suggested that autophagy may plays a role in the occurrence and development of NSCLC. Besides, to explore whether these 2 clusters will affect the clinical outcomes, we constructed a prognostic classifier using Kaplan-Meier analysis. The results showed that the prognosis of cluster 2 expression pattern is better than cluster 1 (p = 0.031) ( Figure 4E). In detail, the five-year overall survival rate in cluster 2 (43.2%) is better than cluster 1 (40.4%), cluster 2 also have better ten-year overall survival rate (29.1%) than cluster 1 (20.4%). There is a difference in fifteen-year overall survival for cluster 1 have 10.4% fifteen-year overall survival, while there is no clue about cluster 2. Besides, we also noticed that the survival curves of these two clusters crossed before year fifteen, which mean there are other factors affect prognosis and it should be further discussed in the future.

Identification of differentially expressed genes (DEGs) and functional enrichment analysis between cluster 1 and cluster 2
Since different clusters have shown variations in the autophagy-related genes and patient prognosis, we explored the DEGs in the cluster 1 and cluster 2. A total of 109 DEGs (28 up-regulated genes in cluster 1 and 81 up-regulated genes in cluster 2) were screened and visualized ( Figure 5A). To explore biological functions of the prognostic value of DEGs from these 2 clusters, GO and KEGG enrichment analysis were performed. The results showed the identified DEGs in cluster 1 mainly involved in biological processes were nucleosome assembly, chromatin assembly and nucleosome organization. The most significantly enriched cellular components were nucleosome, DNA packaging complex and protein-DNA complex. As for molecular function, which were statistically concentrated on nucleosomal DNA binding, nucleosome binding and chromatin DNA binding ( Figure 5B-C). KEGG analysis result showed that DEGs in cluster 1 were enriched in six pathways, including systemic lupus erythematosus, alcoholism, viral carcinogenesis, necroptosis, transcriptional misregulation in cancer and shigellosis ( Figure 5D). While the functional analysis results in cluster 2 showed the DEGs mainly involved in vascular process in circulatory system, regulation of blood vessel size and regulation of tube size in biological processes. While in cellular components the DEGs mainly enriched in lamellar body, rough endoplasmic reticulum and multivesicular body. As for molecular function, the DEGs mainly involved in carbohydrate binding, G-protein coupled peptide receptor activity as well as peptide receptor activity ( Figure 5E-F). The KEGG analysis result showed that the DEGs were only enriched in renin secretion ( Figure 5G). Besides, the STRING database and Cytoscape software were used to construct the PPI networks for DEGs of each cluster ( Figure 5H-I).
Identification of the significant mutation profile signature between cluster 1 and cluster 2 The accumulation of somatic DNA mutation plays an important role in the formation of tumor. In this study, we explored the difference of mutation profile signatures for LUAD and LUSC based on the different prognosis between cluster 1 and cluster 2. In these 2 clusters, the proportion of missense mutations is the major mutation, the most variant type is SNP both in LUAD and LUSC. In the cluster 1, C > A (58996) is the highest mutation mode of SNP in LUAD ( Figure 6A), while C > A (29600) is the main mutation mode of SNP in LUSC ( Figure 6B). The average number of mutations in each sample is 136 in LUAD In LUAD, 8 genes (TTN, MUC16, CSMD3, RYR2, TP53, LRP1B, ZFHX4 and KRAS) were mutated in both cluster 1 and cluster 2 among top 10 mutated genes, which means that the mutation rate of these 8 genes have no significant difference between cluster 1 and cluster 2. While the mutated FLG (21%) and USH2A (28%) are special for cluster 1 (Figure 6E), meanwhile the mutated KEAP1 (42%) and COL11A1 (30%) are special for cluster 2 ( Figure 6F). In LUSC, 9 of top 10 mutated genes were consistent in these 2 clusters, while KMT2D (22%) are special for cluster 1 and XIRP2 (25%) are special for cluster 2 ( Figure  6G-H). Besides, the most significant mutated genes (Figure 7), mutation mode of SNP ( Figure 8A-D), and the somatic interactions among these significant mutated genes were also visualized ( Figure 8E-H).

Identification of immune cell infiltration signatures of each cluster based on tumor mutation burden (TMB)
A lot of studies have showed that autophagy was involved in tumor microenvironment, and the autophagy-related genes could affect the immune responses. Based on the different prognosis between these 2 clusters and different TMB level for the samples in each cluster, we explored the difference of immune cell infiltration signatures for each cluster in both LUAD and LUSC according to the TMB levels. The fractions of infiltrating immune cells in tumor tissue were calculated by CIBERSORT algorithm. The cut-off of p value is 0.05. In LUAD, there are 436 patients which including 205 low TMB samples and 198 high TMB samples in cluster 1. In cluster 2, there are 17 low TMB and 13 high TMB samples. In LUSC, there are 128 and 97 low TMB samples, 124 and 94 high TMB samples in cluster 1 and cluster 2, respectively. Immune cell infiltration signature of cluster 1 and cluster 2 both in LUAD and LUSC patients were displayed as boxplots ( Figure 9A-D).
The results shown that cluster 1 of LUAD have higher fraction of T cells CD8 (p < 0.001), T cells CD4 memory resting (p < 0.001), T cells CD4 memory activated

Significant oncogenic pathways and drug-gene interactions between cluster 1 and cluster 2
For the DEGs from different clusters, significant mutation profile signature from LUAD and LUSC were screened. Besides, we also explored the oncogenic pathways between cluster 1 and cluster 2. After we visualized all the oncogenic pathways, we found that the mutated oncogenic pathways are mainly involved in WNT, RTK-RAS, PI3K, NOTCH and Hippo signaling pathways both in LUAD and LUSC according to the cluster classification ( Figure  10). The most significant difference of the same mutated signaling pathway between cluster 1 and cluster 2 is the numbers of mutated genes and mutated sample numbers. It seemed that cluster 1 have more mutated samples and genes compared to cluster 2 both in LUAD and LUSC.
After screening mutated oncogenic pathways, we also performed drug-gene interactions among the mutated top 5 genes in these 2 clusters ( Figure   11A-D). As the results displayed, the plot shows potential druggable gene categories along with up to top 5 genes (if any) involved in them. Besides, the overview of differentially mutated genes according to cluster classification were visualized ( Figure 11E-F). Meanwhile, the mutated gene interaction to drugs according to cluster classification were further explored (Table 3).

Discussion
Autophagy underlying the initiation, progression, and metastasis of various cancers, including NSCLC. While aberrantly regulated autophagy in the prognosis of NSCLC and the mechanisms are less well defined. It reported that deregulation of UBE2C-mediated autophagy repression aggravates NSCLC progression [24]. TRIM59 could as a new molecular biomarker for predicting the prognosis of NSCLC patients [24]. The prognostic effect of circulating exosomes miR-425-3p on the response of NSCLC to platinum-based chemotherapy [26]. In this study, we first screened 39 differentially expressed autophagy-related genes, then according to their prognostic value, 23 of 39 were related to the overall survival for NSCLC. Differ from regular analysis of cox and logistic regression, we classified the NSCLC into 2 subtypes according to consensus clustering based on the 23 prognostic genes. Besides, Kaplan-Meier analysis showed different prognosis between cluster 1 and cluster 2 subtypes. So according to these 2 clusters, DEGs of each cluster were identified to explore their biological functions.
According to our present study, we identified two autophagy-related gene subgroups using consensus clustering analysis based on 23 prognostic autophagy-related genes, in which we found cluster 1 have poor prognostic value in NSCLC. Considering the occurrence of mutation in NSCLC are very common, and the mutation site could be used as therapy target to improve the prognosis of NSCLC. At the same time, there are many studies focused on the relationship between mutation and immunotherapy, such as the epidermal growth factor receptor (EGFR) mutations, which are the second most common oncogenic driver event in NSCLC [27]. Besides, Anti-PD1/PD-L1 immunotherapy has emerged as a standard of care for stage III-IV NSCLC over the past decade. Patient selection is usually based on PD-L1 expression by tumor cells and/or tumor mutational burden. While mutations in oncogenic driver genes will modify the immune tumor microenvironment and may promote anti-PD1/PD-L1 resistance [28]. It is important to explore new mutation genes and investigated the biological functions to improve the prognosis. KMT2D mutation is associated with poor prognosis of NSCLC [29]. Patients with LUSC gain overall favorable survival advantage from TTN mutation type, and overall favorable survival and disease-free survival advantage from TTN/TP53 double mutation [30]. Patients with TP53/EGFR double mutations, especially missense mutations, have shorter response rates and PFS when treated with EGFR TKI [31]. KRAS G12D and STK11 mutations confer poor prognoses for patients with KRAS-mutant NSCLC [32]. In the current study, mutation profile signature showed that mutated FLG (21%), USH2A (28%) and KMT2D (22%) are special for cluster 1, mutated KEAP1 (42%), COL11A1 (30%) and XIRP2 (25%) are special for cluster 2.
Autophagy was involved in tumor microenvironment and the autophagy-related genes could affect the immune responses. There is evidence that the microenvironment of NSCLC is rich in different types of immune cells which are associated with clinical outcomes [33,34]. The composition of the immune microenvironment differs across patients as well as in cancers of the same type. Early clinical studies revealed that immune cell infiltration had a major impact on the clinical course of several cancers [35][36][37][38]. Thus, we also pooled immune cell infiltration signatures for each cluster. In cluster 1, T cells CD8, T cells CD4 memory resting, T cells CD4 memory activated, Macrophages M1, Dendritic cells resting, Mast cells resting, Monocytes, Dendritic cells activated, T cells follicular helper, B cells memory, T cells regulatory (Tregs) and Plasma cells have significant different infiltration levels. While Monocytes (p=0.035) and Macrophages M0 seems associated with cluster 2. Combined with previous studies, CD8 T cells are an important immune cell in tumor immune microenvironment. In addition to CD8, the importance of other subtypes of immune cells, including CD4 T cell and macrophage has also been reported. It has been reported that increased CD4 and CD8 T cell abundance in tumor immune microenvironment is associated with better survival outcomes [39,40]. Macrophages are the most abundant cells, which performed several functions within the tumor microenvironment. Tumor-associated macrophages commonly refer to an alternative M2 phenotype, exhibiting anti-inflammatory and pro-tumoral effects. On the contrary, Macrophages M1 have pro-inflammatory effect and anti-tumoral effects [41]. In the current study, we identified immune cell infiltration signatures based on TMB. The results showed that higher fractions of T cells CD4 memory resting, T cells follicular helper and Plasma cells in cluster 1 both in LUAD and LUSC, which maybe the reason why cluster 1 have worse prognosis. According to the results, it seems that the different immune cell infiltration signatures between cluster 1 and cluster 2 could influence the prognosis of NSCLC.
The oncogenic pathways involved in multiple cancers and related to the progression and prognosis of cancers, such as RTK−RAS, WNT, NOTCH, Hippo, PI3K, Cell Cycle, MYC, TGF−Beta and TP53 pathways. For instance, overexpression of Wnt-1, -2, -3, and -5a and of Wnt-pathway components Frizzled-8, Dishevelled, Porcupine, and TCF-4 is common in resected NSCLC and is associated with poor prognosis [42]. Harmful NOTCH mutations are identified as new predictors of effective immunotherapy for NSCLC [43]. YAP1 is the main Hippo pathway effector and an effective oncogene. It is overexpressed in NSCLC and the loss of YAP1 could be used as a clinical indicator to predict neuroendocrine characteristics and chemosensitivity [44]. G3BP1 may play a key role in activating the PI3K/AKT/mTOR pathway and can be used as a new prognostic biomarker for patients with NSCLC undergoing surgery [45]. KDM4A promotes the growth of NSCLC through Wnt/β-catenin signaling pathway and DLX5mediated Myc expression [46]. In our study, we found that the mutated oncogenic pathways are mainly involved in WNT, RTK-RAS, PI3K, NOTCH and Hippo signaling pathways in each cluster. The main difference between cluster 1 and cluster 2 is the mutated sample numbers, mutated gene numbers and types. It seems that cluster 1 have more mutated genes and sample numbers, caused the different activity of oncogenic pathways compared to cluster 2, which may contribute to the different prognosis of each cluster. Gene-drug interactions mainly divided into 3 categories, which including inhibitory interactions, induction interactions and phenoconversion interactions. Among them, inhibitory and induction interactions could affect the pharmacokinetics of drugs. Besides, these interactions can occur with the administration of a perpetrator drug that alters the drug metabolism or transport, as well as with the presence of loss-or gain-of-function genetic variants that alter function of enzymes [47]. In the current study, we identified the potential druggable gene categories along with up to top 5 genes (if any) involved in them and found that Keap1 in cluster 2, could inhibited pharmacokinetics of bradoxolone methyl and dimethyl fumarate both in LUAD and LUSC. However, this needs to be further verified in the future. We believe our findings could provide newly insight in both classification and personalized treatment strategy.  In conclusion, based on autophagy-related gene expression characteristics, we used consensus clustering to identify 2 subtypes of NSCLC. These 2 subtypes shown significant different mRNA expression signatures and different immune cell infiltration patterns and mutation signatures. These differences may affect the tumor progression and tumor prognosis. This study may be helpful to accurately classify NSCLC patients and provide newly insight in personalized treatment.