Integrative Analysis of Methylation and Gene Expression in Lung Adenocarcinoma and Squamous Cell Lung Carcinoma

Lung cancer is a highly prevalent type of cancer with a poor 5-year survival rate of about 4–17%. Eighty percent lung cancer belongs to non-small-cell lung cancer (NSCLC). For a long time, the treatment of NSCLC has been mostly guided by tumor stage, and there has been no significant difference between the therapy strategy of lung adenocarcinoma (LUAD) and squamous cell lung carcinoma (SCLC), the two major subtypes of NSCLC. In recent years, important molecular differences between LUAD and SCLC are increasingly identified, indicating that targeted therapy will be more and more histologically specific in the future. To investigate the LUAD and SCLC difference on multi-omics scale, we analyzed the methylation and gene expression data together. With the Boruta method to remove irrelevant features and the MCFS (Monte Carlo Feature Selection) method to identify the significantly important features, we identified 113 key methylation features and 23 key gene expression features. HNF1B and TP63 were found to be dysfunctional on both methylation and gene expression levels. The experimentally determined interaction network suggested that TP63 may play an important role in connecting methylation genes and expression genes. Many of the discovered signature genes have been supported by literature. Our results may provide directions of precision diagnosis and therapy of LUAD and SCLC.


INTRODUCTION
Lung cancer, considered to be a highly prevalent type of cancer, is a leading cause of cancer-related mortality worldwide, resulting in 1.6 million deaths each year with poor 5-year survival rate of about 4-17% (Hirsch et al., 2017;Altorki et al., 2019). Lung cancer is classified as follows: smallcell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC), accounting for approximately 20 and 80% of all lung cancer cases, respectively (Oser et al., 2015). NSCLC is a complex systems disease with dysfunctions on multiple pathways and multiple molecular levels Li et al., 2013;Zhou et al., 2015;Chen et al., 2016;Liu et al., 2017). It can also be typically divided into three main subtypes, lung adenocarcinoma (LUAD), squamous cell lung carcinoma (SCLC), and large cell cancer (LCC), according to standard pathology methods (Socinski et al., 2016;Swanton and Govindan, 2016;Herbst et al., 2018). Compared with squamous lung cancer, adenocarcinoma was associated with better prognosis. Despite the advances in diagnostic and therapeutic technology, lung cancer remains a serious global public health concern.
For a long time, the treatment of NSCLC has been mostly guided by tumor stage, and there has been no significant difference between the therapy strategy of LUAD and SCLC. Most lung cancers are usually diagnosed at an advanced stage and are treated primarily with systemic chemotherapy, typically with platinum-based regimens (Bishop et al., 2010). Recent progress in characterization of NSCLC by molecular typing, especially in adenocarcinomas of the lung, have brought new investigation of therapeutic agents that target dominant oncogenic mutations, such as epidermal growth factor receptor (EGFR)-targeted therapies, which have showed improved response rates in patients with NSCLC (Shigematsu et al., 2005).
Currently, progress in molecular biology of lung cancer has resulted in the identification of multiple potential biomarkers that may be related to the clinical management of NSCLC patients. In recent years, with the emergence of next-generation sequencing technologies, important molecular differences between LUAD and SCLC are increasingly identified, indicating that targeted therapy will be more and more histologically specific in the future (Kim et al., 2005;Sun et al., 2007;Li et al., 2014). Several studies have identified multiple gene expression subtypes that differ in prognosis, genomic alterations, clinical characteristics, including tumor differentiation, stage-specific survival, underlying drivers, and potential responses to treatment within LUAD and SCLC (Wilkerson et al., 2010;Thomas et al., 2014;Lu et al., 2016). For example, LUAD patients that harbor EGFR, ALK, ROS1, or BRAF mutations were discovered to benefit the most (Villalobos and Wistuba, 2017;Herbst et al., 2018). Targeted therapies for gene abnormalities of HER2, MET, RET, and NTRK1 appear to be an effective approach to treat LUAD (Dearden et al., 2013;Mazieres et al., 2013). SCLC shows different mutation spectrum from that of adenocarcinoma, and the mutation targeted therapy for SCLC has not been thoroughly studied to obtain approved treatment Soldera and Leighl, 2017).
A series of imaging studies suggested that NSCLC may progress rapidly between occurrence and primary treatment (Koh et al., 2017). Therefore, it is necessary for clinicians to identify between these two subtypes of NSCLC in a convenient and rapid way. With the improvement of the above clinical and molecular levels, growing evidences have shown that immunohistochemistry (IHC) is an effective tool for differentiating adenocarcinoma from squamous cell carcinoma (Bass et al., 2009;Weiss et al., 2010).
It is reported that the formation and development of lung cancer are related to the accumulation of permanent genetic changes and dynamic epigenetic changes. Therefore, enhancing our understanding of tumor biology and gene expression profiles will be critical for cancer treatment and diagnosis. In this study, an integrative analysis of lung cancer methylation data and gene expression data was performed, and mixed features were also screened out for analysis.

The Joint Methylation and Expression Profiles of Lung Cancer Patients
The methylation and gene expression profiles of lung cancer patients were obtained from GEO (Gene Expression Omnibus) 1 . The data were originally generated by Karlsson et al. (2014). They used the data to cluster the patients into five groups, and these groups showed different overall survival (Karlsson et al., 2014). We were more interested in how the methylation and expression differ from well-known subtypes, especially LUAD and SCLC. Therefore, we analyzed the 77 LUAD and 22 SCLC patients who had both methylation and expression data.
The methylation profiles were measured with Illumina HumanMethylation450 BeadChip while the gene expression profiles were measured with Illumina HumanHT-12 V4.0 expression BeadChip. The probe expression levels were averaged onto 20,178 genes. The 354,251 methylation sites within genes were analyzed. Therefore, each patient was represented with 20,178 genes and 354,251 methylation sites.

Screen for the Relevant Methylation and Expression Features
Since the number of methylation and expression features was very large, it was difficult to analyze directly. We applied the Boruta method (Kursa and Rudnicki, 2010) to screen the combined data and identify the relevant methylation and expression features. The Boruta method was based on random forest classification, and the relevance of features to sample classes was measured by the ensemble of the random forest classifier's stochasticity.

Evaluate the Importance of Relevant Methylation and Expression Features
After the irrelevant features were removed, the relevant methylation and expression features were ranked based on their importance evaluated with MCFS (Monte Carlo Feature Selection) (Draminski et al., 2008). The MCFS was a widely used method to rank features based on classification trees (Chen et al., , 2019Pan et al., 2018Pan et al., , 2019aLi et al., 2019). First, for the d features, we selected s subsets and each subset included m features (m was much smaller than d). Then, for each subset, t trees were constructed. Based on the s × t trees, we can estimate a feature's importance by considering how many times it appeared in these trees and how well it performed in these trees as a node. By comparing the permutation results, the significance of features was evaluated.

Perdition Performance of the Mixed Methylation and Expression Signature
The MCFS can find the significant top-ranking features by comparing with permutations. To objectively evaluate the significant top-ranking features' prediction performance, we performed LOOCV (Leave One Out Cross Validation) using SVM (Support Vector Machine) classifier Sun et al., 2018;Pan et al., 2019a). Each time, one sample was chosen as test samples and all other samples were used to train the SVM predictor. After all samples were tested once, we compared the actual sample classes with predicted sample classes and calculated the sensitivity, specificity, accuracy, and Mathew's correlation coefficient (MCC) based on the confusion matrix (Huang et al., 2011(Huang et al., , 2013Cai et al., 2012).

Rank the Methylation and Expression Features
The methylation and gene expression data were combined and, therefore, each lung cancer patient was represented with mixed methylation and gene expression features. The number of mixed features (20,178 gene expression features and 354,251 methylation features) was too large to conduct sophisticated statistical analysis. So, we removed irrelevant features using the Boruta method (Kursa and Rudnicki, 2010). At last, 711 relevant features were remained.
Then, these 711 Boruta selected features were further ranked with the MCFS method (Draminski et al., 2008). As a classification tree-based ensemble learning algorithm, MCFS can rank the features based on how many times and how much it contributed to the sample classification in s × t trees. By comparing with permutation results, it can evaluate the significance of features.

Identify the Methylation and Expression Signature
The 136 significant top-ranking features were identified using the latest dmLab version 2.3.0 software downloaded from 2

Comparison With CNV Signature
Comparing with the 136 LUAD and SQCLC CNV signatures identified by Li et al. (2014), we found that the methylated genes HORMAD2, KLHL3, LPP, and PTPN3 are also CNAs genes. HORMAD2 is expressed in nearly 10% of Chinese Han lung cancer tissues, which is a new target for lung cancer research (Liu et al., 2012). Lipoma preferred partner (LPP) may be an important candidate molecular marker for the classification of NSCLC tissue subtypes. PTPN3 can inhibit lung cancer by regulating EGFR signal . However, there are no reports of KLHL3 in lung cancer, which also suggests a new idea of candidate molecular markers for the identification of lung cancer subtypes.

The Relationship Between Methylation and Expression Signature Genes
The 113 methylation features can be mapped onto 93 genes. We overlapped the selected methylation feature genes and expression feature genes and found that HNF1B and TP63 were dysfunctional on both methylation and gene expression levels. HNF1B was one of the DNA methylated markers of the same subtype (Matsuo et al., 2014;Shi et al., 2017). TP63, also known as P63, was considered to be the most common marker for SCLC (Bishop et al., 2012;Van de Laar et al., 2014). We downloaded the 66 lung cancer genes from KEGG hsa05223 NSCLC 4 and mapped them and the overlapped two genes: HNF1B and TP63, onto STRING network (Szklarczyk et al., 2018). What's more, we searched the methylation genes and expression genes in STRING database (Szklarczyk et al., 2018) and extracted the experimentally determined interaction and plotted the network in Figure 3. The light-yellow nodes were methylation genes, the light-blue nodes were expression genes. The overlapped methylation and expression genes were marked in red, the overlapped methylation and CNV genes from Li et al. (2014) were marked in pink. It can be seen that TP63 played an important role in connecting methylation genes and expression genes. The methylation genes and expression genes were closely connected to form a dense functional module on the network. FIGURE 3 | The methylation genes and expression genes with experimentally determined interactions on STRING network. The light-yellow nodes were methylation genes, and the light-blue nodes were expression genes. The overlapped methylation and expression genes were marked in red, and the overlapped methylation and CNV genes were marked in pink. TP63 played an important role in connecting methylation genes and expression genes.

The Biological Significance of the Identified Signature
To develop more specific and individualized targeted therapy, there is an urgent need to improve our knowledge on the molecular basis, in addition to different phenotypes. It is noteworthy that adenocarcinoma and squamous cell carcinoma show marked differences in expression profiles, DNA methylation, and lesion location. In this study, the features containing methylation and expression data were screened by Boruta and then further sorted by MCFS. After comparing the selected features with related literatures, a certain correlation was found between these features and lung cancer subtypes.
In this study, 113 methylation features were screened and mapped to 93 genes. We inquired about the functions of these genes and their relationship with lung cancer to discuss whether they have the potential as molecular markers to recognize LUAD and SQCLC. Many genes have been proved to promote or inhibit the progression of lung cancer. For instance, FOXK1 was expressed in many malignant tissues (Huang and Lee, 2004) and Ma et al. (2018) also found that FOXK1 plays a carcinogenic role in lung cancer. MAD1L1 is a checkpoint gene, with its mutation been proved to play a pathogenic role in lung cancer (Tsukasaki et al., 2001). Some genes have been reported to be related with the prognosis of NSCLC, such as HORMAD2 and ANO1. The overexpression of ANO1 is related to the high expression of EGFR, which can be used as a predictor of recurrence after NSCLC (He et al., 2017). In addition, according to Zhang et al. (2014) HORMAD2 gene polymorphism has great potential prognostic value in Chinese patients with NSCLC. Other genes are associated with NSCLC subtypes, such as another member of the FOX family, FOXK2, which was reported to be closely related to the overall survival of LUAD . DOK1 and HOPX were found to serve as lung tumor suppressors for LUAD (Berger et al., 2010;Chen et al., 2015). In the study of Zhou et al. (2017) the methylation locus of PARD3 gene was positively correlated with the expression of PARD3 and suppression of PARD3 intensified chemoresistance in LUAD cells. SFTA3 was found obviously overexpressed in LUAD, and its expression in LUAD and SQCLC was quite different. Therefore, the sensitivity and specificity of using SFTA3 to distinguish the two subtypes will be relatively high (Zhan et al., 2015). ARHGEF1 aliased p114RhoGEF and its expression might help to predict progression and survival of SQCLC patients (Song et al., 2013). Notably, LPP has multiple functions of actin binding protein and transcriptional coactivator (Kuriyama et al., 2016). Ngan et al. (2017) proved that the expression of LPP reduces the number of circulating tumor cells and inhibits lung cancer metastasis. Kang et al. (2009) used high-resolution array-CGH to find that the difference in genomic imbalance patterns between SQCLC and LUAD was most significant in 3q26.2-q29, while LPP (3q28) was significantly targeted in SQCLC, suggesting that LPP may be an attractive candidate molecular marker for histological subtype classification of NSCLC and may be involved in the pathogenesis of SQCLC.

The GO Enrichment Analysis of the Identified Signature
In order to further analyze the relationship between mixed characteristics and lung cancer, we carried out GO enrichment analysis. The results suggest that characteristic genes are mainly related to keratinization, epidermal cell differentiation, tissue development, and cytoplasm. The GO enriched results with FDR (False Discovery Rate) smaller than 0.05 are listed in Table 5. P63 appears to be useful in differentiating SQCLC from LUAD in small biopsies with no keratosis or glandular differentiation, helping to establish different treatments (Camilo et al., 2006). The expression of keratinocyte transglutaminase and cytokeratin 10 was measured as markers of squamous differentiation (Lokshin et al., 1999). Epidermal cell differentiation is related to EGFR signal pathway, which can inhibit the proliferation and metastasis of cancer cells, while EGFR mutation is largely limited to LUAD (Ladanyi and Pao, 2008). The expression of Promyelocytic leukemia zinc finger (PLZF) in SQCLC was weak or absent, which was significantly lower than that in LUAD (Xiao et al., 2015). To sum up, most of the 113 methylated genes and 23 expressed genes we found are closely related to lung cancer, and some of them have the possibility of distinguishing SQCLC from LUAD, which is helpful for the targeted selection of lung cancer treatment and provide more research support for lung cancer molecular markers.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
HZ, ZJ, LC, and BZ contributed to the study design. HZ, ZJ, and LC conducted the literature search. HZ, ZJ, and BZ acquired the data. ZJ and LC wrote the manuscript. HZ and BZ performed the data analysis. All authors gave the final approval of the version to be submitted, read, and approved the final manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe. 2020.00003/full#supplementary-material TABLE S1 | The annotations of the 113 methylation features.