An N6-methyladenosine and target genes-based study on subtypes and prognosis of lung adenocarcinoma

Purpose: Lung adenocarcinoma (LUAD) is a highly lethal subtype of primary lung cancer with a poor prognosis. N6-methyladenosine (m6A), the most predominant form of RNA modification, regulates biological processes and has critical prognostic implications for LUAD. Our study aimed to mine potential target genes of m6A regulators to explore their biological significance in subtyping LUAD and predicting survival. Methods: Using gene expression data from TCGA database, candidate target genes of m6A were predicted from differentially expressed genes (DEGs) in tumor based on M6A2 Target database. The survival-related target DEGs identified by Cox-regression analysis was used for consensus clustering analysis to subtype LUAD. Uni-and multi-variable Cox regression analysis and LASSO Cox-PH regression analysis were used to select the optimal prognostic genes for constructing prognostic score (PS) model. Nomogram encompassing PS score and independent prognostic factors was built to predict 3-year and 5-year survival probability. Results: We obtained 2429 DEGs in tumor tissue, within which, 1267 were predicted to m6A target genes. A prognostic m6A-DEGs network of 224 survival-related target DEGs was established. We classified LUAD into 2 subtypes, which were significantly different in OS time, clinicopathological characteristics, and fractions of 12 immune cell types. A PS model of five genes (C1QTNF6, THSD1, GRIK2, E2F7 and SLCO1B3) successfully split the training set or an independent GEO dataset into two subgroups with significantly different OS time (p < 0.001, AUC = 0.723; p = 0.017, AUC = 0.705).A nomogram model combining PS status, pathologic stage, and recurrence was built, showing good performance in predicting 3-year and 5-year survival probability (C-index = 0.708, 0.723, p-value = 0). Conclusion: Using candidate m6A target genes, we obtained two molecular subtypes and designed a reliable five-gene PS score model for survival prediction in LUAD.


Introduction
Lung adenocarcinoma (LUAD) is the most common histological subtype of lung primary lung malignancy, approximately accounting for 40% of all cases [1]. LUAD is a leading cause of cancer-related mortality worldwide, with overall survival (OS) time shorter than five years [2]. LUAD frequently involves disseminated metastasis and is highly resistant to conventional radiotherapies and chemotherapies [3]. LUAD patients have high recurrence rate after surgical excision [4]. Therefore, identification of prognostic biomarkers of LUAD is critical for improving long-term survival of LUAD patients and facilitates developing novel effective treatment strategies. N 6 -methyladenosine (m 6 A) is the most abundant and prevalent modification in the regulation of splicing, stability, translocation, and translation of eukaryotic mRNAs [5]. The m 6 A regulators are primarily comprised of "writers" (methyltransferase), such as METTL3, RBM15/15B, "erasers" (demethylase), such as FTO and ALKBH5, and "readers" (m 6 A-binding proteins that targets the m 6 A site on mRNA), such as YTHDF1 and YTHDF2 [6]. The aberrantly expressed m 6 A regulators play a critical role in tumorigenesis [7]. Several pieces of evidences have revealed that m 6 A regulators have important prognostic implications in LUAD [8,9]. m 6 A writers, erasers, and readers could serve prognostic biomarkers in LUAD [10]. Moreover, a growing number of studies have developed useful risk scoring systems based on the m 6 A regulators to predict survival in LUAD [1113]. However, these studies only characterize prognostic roles of m 6 A regulators in LUAD, their regulatory mechanisms including downstream target genes and relevant prognostic implications remain elusive.
Here we drew on gene expression data of LUAD from The Cancer Genome Atlas (TCGA) database to identify the potential target genes of m 6 A methylation regulators and investigate their associations with LUAD subtyping and prognosis prediction. We developed a prognostic score (PS) system based on prognostic target genes to predict survival of patients. This study provided illumulating insights into the biological roles of m 6 A methylation regulators in LUAD.

Data source and preprocessing
Microarray gene expression data of 501 tumor samples and 58 normal samples with corresponding clinical characteristics (Illumina HiSeq 2000 RNA Sequencing platform) were downloaded from TCGA database (https://gdc-portal.nci.nih.gov/) and used as training set.
We launched a search for validation datasets in the Gene Expression Omnibus (GEO) [14] at the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/geo/). The following criteria should be met: gene expression data; tumor tissue samples; histology type information available; all samples no less than 150 and eligible samples no less than 100; survival information available. GSE50081 [15] included 127 tumor samples with corresponding clinical information (GPL570 Affymetrix Human Genome U133 Plus 2.0 Array platform) and was used as the validation dataset.

Differential expression analysis
Differentially expressed genes (DEGs) were screened between normal tissue samples and tumor tissue samples, with FDR < 0.05 and |log2FC| > 1 as the cutoff. Using expression data of top 50 up-regulated DEGs and top 50 down-regulated DEGs according to descending value of |log2FC|, two-way hierarchical clustering analysis was performed using pheatmap version 1.0.8 (https://cran.r-project.org/web/packages/pheatmap/index.html) in R3.6.1 based on centered pearson correlation.

Construction of prognostic m 6 A-DEGs network
In order to investigate the m 6 A-related regulatory mechanisms, target genes of m 6 A regulators were predicted based on M 6 A2 Target database [16] (http://m6a2target.canceromics.org/) and were then mapped to the identified DEGs. The overlapped genes were subjected to the uni-variate Cox-regression analysis. The significant survival-related target genes and m 6 A RNA methylation regulators were used to construct a prognostic m 6 A-DEGs network. The network was visualized and its topological characteristics was analyzed using Cytoscape software [17] (version 3.6.1, https://cytoscape.org/). Gene ontology (GO) function and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed for the genes in the network using the Database for Annotation, Visualization and Integrated Discovery (DAVID) tool (version 6.8. https://david. ncifcrf.gov/). The cutoff was set at p-value < 0.05 and the count of significantly enriched genes > 5.

LUAD subtypes analysis
Using gene expression data of the survival-related target genes, consensus clustering analysis was performed by ConsensusClusterPlus Version 1.54.0 to identify subtypes of LUAD. ConsensusClusterPlus implements the consensus clustering method in R that provides quantitative and visual stability evidence for estimating the number of unsupervised classes in a dataset [18]. The optimal clusters were determined according to cumulative distribution function. Overall survival (OS) time, clinical characteristics and fractions of tumor-infiltrating immune cells (TIICs) were compared between different subtypes. CIBERSORT [19] software (https://cibersort.stanford.edu/index.php) was used to estimate proportions of tumor-infiltrating immune cells.

Development of A PS model for prognosis prediction
To build a PS model for predicting prognosis in LUAD, firstly, multi-variable Cox regression analysis was performed to identify independent prognostic genes with log-rank p-value < 0.05 from the survival-related target genes. Least absolute shrinkage and selection operator (LASSO) Cox-PH regression model [20] was used to determine the optimal prognostic genes from the independent prognostic genes. Prognostic score (PS) was defined as the linear combination of logarithmically transformed gene expression levels of the optimal prognostic genes weighted by canonical discriminant function coefficients and was used to evaluate mortality risk of a patient. Using expression levels and regression coefficients of the optimal prognostic genes, PS score was calculated for every sample using the following formula: Prognostic score (PS) = ∑Coefgenes ×Exp genes where Coefgenes represents regression coefficient of a prognostic gene and Expgenes represents expression level of a prognostic gene.
According to median PS score, all samples of the training set or the validation set were separated into a high-risk group and a low-risk group.

Construction and assessment of a nomogram model
Uni-variable and multi-variable Cox regression analysis was performed for all clinical characteristics and PS status. Nomogram is a graphical calculating tool for individualized risk estimation based on use of multiple variables and continuous variables continuously [21]. A nomogram model incorporating all independent prognostic factors was built to predict 3-year and 5-year survival probability. Calibration curves of the nomogram were plotted and concordance index (C-index) [22] was calculated to evaluate its predictive performance.

Statistical analysis
Survival analysis was performed to compare OS probabilities of different groups using Kaplan-Meier method and log-rank test with the threshold of p-value < 0.05. Receiver Operating Characteristic (ROC) curves were plotted to analyze accuracy of the PS system.

Identification of DEGs in LUAD
A total of 2429 genes that met the criteria of FDR < 0.05 and |log2FC| > 1 were significantly differentially expressed between tumor samples (N = 501) and normal samples (N = 58, Figure 2A) in the training set. As shown in a heatmap ( Figure 2B), expression patterns of top 50 up-regulated DEGs and top 50 down-regulated DEGs were obviously different between tumor samples and normal samples, indicating the DEGs are able to distinguish tumor and normal samples.

A prognostic m 6 A-DEGs network included 12 m 6 A RNA methylation regulators and 224 genes
Total 1267 DEGs were predicted to be target genes of m 6 A RNA methylation regulators based on m 6 A2 Target database and 224 of these were significantly associated with survival in uni-variable Cox regression analysis. Consequently, 12 m 6 A RNA methylation regulators (writers: RBM15, RBM15B; erasers: FTO; readers: YTHDF1, YTHDF2, YTHDF3, YTHDC1, YTHDC2, HNRNPC, IGF2BP1, IGF2BP2, IGF2BP3) and the 224 survival-related target genes formed a prognostic m 6 A-DEGs network (Figure 3). Topological analysis showed that IGF2BP1 was the most hub gene with a degree of 145. The other nodes with high degree included YTHDF1, YTHDC1, IGF2BP2 and IGF2BP3 ( Table 1). The genes in the network were significantly involved in 26 GO biological processes largely related to cell cycle, and 5 KEGG pathways, such as cell cycle and p53 signaling pathways ( Table 2).

Two LUAD subtypes had different OS time, clinicopathological characteristics and fractions of TIICs
Consensus clustering analysis for the 224 survival-related genes led to the identification of two LUAD subtypes ( Figure 4A). In the training set, 222 tumor samples belonged to subtype 1, while 279 tumor samples fell into subtype 2. Kaplan-Meier survival curves in Figure 4B showed that subtype 2 had significantly shorter OS time compared to subtype 1 (p-value = 0.00096). Subtype 2 had less patients older than 60 years (p-value = 2.51E03), less female patients (p-value = 1.15E03), higher Pathologic N stage (p-value = 4.68E03), higher Pathologic T stage (p-value = 2.85E03), higher pathologic stage (p-value = 5.81E04), and less never-smokers (p-value = 1.33E02, Table 3) in comparison with subtype 1. It has been reported that the m 6 A genes are important regulators in tumor microenvironment of LUAD and affect TIICs [7]. Therefore, we also analyzed composition of TIICs in the two subtypes.
Remarkably, compared to subtype 1, subtype 2 had significantly less memory B cells, more CD8 + T cells, less resting CD4 + memory T cells, more activated CD4 + memory T cells, more Tfh cells, more activated NK cells, less monocytes, more M0 and M1 macrophages, less M2 macrophages, more resting myeloid dendritic cells, less activated mast cells, more resting mast cells, and more Neutrophils (p-value < 0.001, Figure 5).

A five-gene PS model for stratification of LUAD patients
Total 93 genes in the m 6 A-DEGs network were found to be independent prognostic factors in multi-variable Cox regression analysis. Five independent prognostic genes [C1QTNF6 (C1q/tumor necrosis factor-related protein 6), THSD1 (Thrombospondin type I domain 1), GRIK2 (Glutamate receptor, ionotropic, kainate 2), E2F7 (E2F transcription factor 7) and SLCO1B3 (solute carrier organic anion transporter family member 1B3)] achieved the optimal lambda value and were selected by LASSO Cox-PH model as the optimal gene panel for prognosis prediction (Table 4 and Figure 6). Kaplan-Meier survival curves in Figure 7 showed that for each of the five optimal prognostic genes, the high expression and low expression samples had significantly different OS time (p < 0.001, p = 0.0052, p = 0.00092, p = 0.00012, and p = 0.0036) in the training set.
Based on expression levels and regression coefficients of the five optimal prognostic genes, PS score was calculated for every sample in the training set. With median PS score as the cutoff, the training set was split in a high-risk group and a low-risk group. The high-risk patients had significantly shorter OS time compared to the low-risk patients (p < 0.0001, Figure 8A). AUC for the training set was 0.723 (specificity = 0.560, sensitivity = 0.820, Figure 8B). Similarly, the validation set (GSE50081) was divided by the five-gene PS score into high-risk and low-risk groups with significantly different OS time (p = 0.017, Figure 8A). AUC for the validation set was 0.705 (specificity = 0.711, sensitivity = 0.627, Figure 8B).   Table 5). In order to improve predictive performance of the five-gene PS score and facilitate its application, we combined pathologic stage, recurrence and PS status to build a nomogram for predicting 3-year and 5-year survival probability ( Figure 9B). Calibration plots for the nomogram showed good consistence between the predicted and actual 3-year and 5-year survival probabilities (C-index = 0.708, 0.723, p-value = 0, Figure 9C), suggesting good predictive accuracy of the composite nomogram based on pathologic stage, recurrence and PS status. All these results demonstrated that the five-gene PS score was a useful prognostic tool in LUAD.

Discussions
m 6 A RNA methylation is a critical player in tumor initiation and progression through affecting gene expression and various cellular processes [27]. m 6 A genes have been increasingly acknowledged as potential prognostic biomarkers of LUAD [28]. However, candidate target genes of m 6 A modification regulators and their prognostic significance remain poorly studied. Our present study predicted 1267 DEGs in LUAD tumor as candidate target genes of m 6 A regulators based on m 6 A2 Target database. Using the 224 survival-related target genes, we obtained a prognostic m 6 A-DEGs network and two molecular subtypes (subtype 1 and 2) of LUAD with significantly different OS time, distinct clinical characteristics and composition of immune cells. Moreover, we developed and validated a robust five-gene PS system for survival prediction in LUAD. This study deepens our understanding on the regulatory mechanisms underlying the roles of m 6 A RNA methylation in LUAD, contributing to improvement of prognosis prediction and aiding in individualized genome-based therapy in LUAD. LUAD has various histological subtypes ranged from pre-invasive lesion to aggressive adenocarcinoma [2]. Our study identified two molecular subtypes of LUAD (subtype 1 and 2) using consensus clustering analysis based on expression data of the survival-related target genes. Subtype 2 had poorer survival then subtype 1. Subtype 2 was prone to be at more advanced stage and behave more aggressively and invasively. These results suggest that these survival-related target genes are important determinants in LUAD progression in consistence with a previous report [29]. Besides, subtype 2 is more likely to be observed in young male smokers, while subtype 1 has a greater propensity to be found in older female never-smokers. TIICs have a close association with LUAD progression and prognosis [30]. Our study found that subtype 1 and 2 were significantly different in proportions of 14 types of immune cells. These findings together demonstrate diverse clinicopathological characteristics and varied composition of immune cells between the two subtypes. Identification of the two m 6 A methylation-related molecular subtypes may have important clinical implications and lend support to targeted therapies to combat LUAD.
The present study showed that our PS model was a reliable and robust tool for survival prediction in LUAD, manifested in several levels. Firstly, the PS score could discriminate high-risk patients from low-risk patients in the training set. Secondly, prognostic performance of our PS model was successfully verified in an independent dataset from GEO database, supporting the generalizability of the PS model. Thirdly, AUC values of ROC curves for the two sets were larger than 0.7, providing evidence for predictive accuracy of the PS model. Fourthly, PS status had been identified as an independent prognostic factor of LUAD. Finally, the nomogram model incorporating PS status, pathologic stage and recurrence performed well in predicting 3-year and 5-year survival probability of LUAD patients (C-index > 0.7).
Our PS model was developed based on five optimal prognostic genes (C1QTNF6, THSD1, GRIK2, E2F7 and SLCO1B3). According to the HR values, 4 genes (C1QTNF6, GRIK2, E2F7 and SLCO1B3) were associated with increased risk of LUAD, while the expression level of one gene (TNSD1) was associated with reduced risk of LUAD. C1QTNF6 overexpression is involved in cell proliferation, migration and invasion and may be an independent prognostic factor in LUAD [31,32]. Inhibition of C1QTNF6 could attenuate cell proliferation, migration, and invasion and promote apoptosis, indicating C1QTNF6 might be a new perspective in treating non-small-cell lung cancer [33]. GRIK2 belongs to an ionotropic glutamate receptor and is identified as a novel epigenetic target in gastric cancer [34]. Its high expression is related to a poor prognosis in urinary tract carcinoma [35]. E2F7 is an E2F transcription factor, which plays an essential role in regulation of cell cycle progression [36]. E2F7 is highly upregulated in non-small-cell lung cancer samples, and aberrantly allow the cells to enter into S phase of cell cycle [37]. High expression of E2F7 is associated with short OS time of LUAD patients [38]. SLCO1B3 is mainly expressed in the basement membrane of liver cells and plays important roles in transporting endogenous compounds into cells [39]. It has been recognized as a positive prognostic biomarker of breast cancer [40]. SLCO1B3 exerts an oncogenic effect through promoting epithelial-mesenchymal transition in progression of non-small cell lung cancer [41]. Nagai et al. showed that SLCO1B3 is expressed in lung cancer tissues but not in normal tissues [42]. Taken together, C1QTNF, GRIK2, E2F7 and SLCO1B3 are highly expressed in cancer samples, which are consistent with our results. THSD1 is a novel tumor suppressor gene mapping to 13q14. It is demonstrated to be downregulated in esophageal squamous cell carcinoma, which might be related with promoter hypermethylation [43]. In LUAD, THSD1 is found being aberrantly methylated and its expression is significantly correlates with prognosis of LUAD patients [44]. Our result found that TNSD1 was associated with reduced risk of LUAD, further suggesting its role as a tumor suppressor gene. Nevertheless, molecular mechanism and prognostic significance of GRIK2 in LUAD remains poorly understood. The molecular and cellular mechanisms of different caners share similarities, such as mutations in proto-oncogenes and tumor suppressors, exposures to chemicals and discordant regulation or activities of many critical signaling pathways [45]. The dysregulation or association with prognosis of the identified predictive genes in other neoplastic conditions might provide evidence of their role in LUAD carcinogenesis. However, downstream analysis on the roles of them in carcinogenesis broadly are still warranted to identify cross-talk and pleiotropic mechanisms as well as potential adjuvant therapies and repurposable drugs across different neoplastic conditions.
There are some limitations in our study. First, we have identified two subtypes of LUAD based on the expression level of 222 DEGs that were significantly related with survival. These subtypes might be clinically meaningful. However, sub-types need to be widely validated before they could be agreed upon as acceptable. Second, downstream analyses on the roles of the identified predictive genes are still warranted. Third, clinical information of some patients in the training set and validation set is lacking, which might lead to bias in Cox regression analysis.

Conclusions
Using predicted target genes of m 6 A regulators, we obtained two molecular subtypes of LUAD with differential survival time, distinct clinicopathological and immunological features. Based on five prognostic target genes, we developed and validated a PS model for risk in LUAD. This study contributes to better tumor classification and supplies a reference for individualized survival estimation and targeted treatment strategy for LUAD.