Identication of Key Genes for Nasopharyngeal Carcinoma by Gene Expression and DNA Methylation Data Integration Analysis

Background: The pathogenesis of Nasopharyngeal carcinoma (NPC) is very complicated. The present study aimed to identity some candidate genes as biomarkers for NPC diagnosis and pathogenesis. Methods: Three Microarray datasets GSE53819, GSE64634 and GSE12452 and a methylation array (GSE52068) were re-analyzed. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the differentially expressed genes (DEGs) were applied. STRING software was used to construct a protein-protein interaction (PPI) network of DEGs and visualized by Cytoscape. Random Forest (RF) algorithm was performed to construct classiers and identied key genes. Results: A total of 91 DEGs were screened from the three datasets. GO term and KEGG pathway analysis suggested that the DEGs were predominantly enriched in drug metabolism-cytochrome P450 pathway, metabolism of xenobiotics by cytochrome P450 pathway, chemical carcinogenesis pathway, ciliary part, motile cilium, axoneme, microtubule and ciliary plasm. We obtained nine hub genes and one signicant module. We constructed a classier based on DEGs and found CLIC6 and CLU have the best classication ability. Finally, ve hypermethylated and downregulated genes (hyper-down) were identied by integrating methylation data. Conclusions: With gene expression and methylation data integration analysis, several key genes were identied may be potential biomarkers for NPC diagnosis and pathogenesis.

Background Nasopharyngeal carcinoma (NPC) is the most common squamous cell carcinoma arising from nasopharynx and is characterised by its high incidence in distinct geographic and ethnic populations.
The top 3 high national incidence rates are estimated to be in Malaysia, Indonesia, and Singapore 1 . In particular, the NPC incidence rates observed in South China, including Guangdong province and Hong Kong were signi cantly higher than in other regions 1,2 . A number of NPC patients tend to develop into an advanced stage of disease before were diagnosed with NPC. A new report reported that the 5-year overall survival of advanced NPC patients was 50.0% 3 . Radiotherapy is still a major treatment for nonmetastatic disease and approximate 30% of cases will develop local or distant metastasis recurrence 4,5 . Although, previous literatures revealed that genetic, age, Epstein-Barr virus infection and certain non-coding RNAs (circRNA and microRNA) are well associated with NPC 6-9 , the molecular mechanism regulating the activation and progression of NPC is still unclear. Therefore, it is necessary to identify key genes for NPC pathogenesis.
The occurrence of cancer is the result of a series of molecular changes that occur in cells, which leads to a large number of changes in gene expression. For example, numerous studies have indicated that differentially expressed genes (DEGs) frequently associated with NPC. Abnormally expressed c9orf24, PCDP1, and LRRC46 may serve as diagnostic and therapeutic markers for NPC 10 . TSPAN8 were con rmed to highly expressed in NPC tissues and to relate with poor survival in NPC patients 11 .
Abnormal DNA methylation can lead to differential genes expression, therefore, the alterations of DNA methylation and genes may affect the development of NPC. DNA methylation is a common epigenetic alteration that happens exclusively on cytosine nucleotides and a majority of in the background of CpG islands (speci c gene regions 5′-C-phosphate-G-3′). The changes occurred in modi cation pattern ultimately not only in uence cell predisposition and tumour phenotype, but promote the malignancy of tumor 14,15 . For example, a report found that the HOP homeobox HOPX showed a highest methylation level and lead to poor clinical outcomes in NPC patients 16 .The tumor suppressor gene ten-eleven translocation methyl cytosine dioxygenase1 (TET1) can facilitate the shifts of 5-methylcytosine to 5hydroxy methyl cytosine in NPC cells 17 . These ndings indicated that the abnormal methylated and differentially expressed genes may serve as potential biomarkers for cancer treatment and prediction 18 .  Fig. 1 DEGs identi cation The raw microarray data of the four datasets downloaded from the GEO database were processed by the online tool GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r) to identify DEGs and DMRs between normal nasopharyngeal tissues and nasopharyngeal carcinoma tissues. We identi ed the DEGs using threshold p value < 0.05 and absolute log2FC > 2.0. The overlapping DEGs among GSE53819, GSE64634 and GSE12452 were identi ed.

Protein-protein interaction (PPI) network construction and hub genes identi cation
The PPI network of 91 overlapping DEGs was constructed with the online tool STRING (https://stringdb.org/). A node represented a gene and each edge represented the interaction between genes in the PPI network. Subsequently, Cytoscape software (download from http://www.cytoscape.org/) was used to search for hub genes. Several models were obtained from the PPI network by the MCODE analysis in Cytoscape. Hub nodes were ltrated out according to the connectivity degree, betweenness centrality, and closeness centrality of a node. Degree Cutoff = 2, K-Core = 2 and Node Score Cutoff = 0.2 were regarded as cutoff parameters for module identi cation.

Classi er construction
Random Forest (RF) algorithm was applied to construct one RF classi er based on gene expression data to distinguish NPC patients from normal controls. We used 93 samples from the three datasets and the 91 DEGs expression data to train the classi er. The leave-one-out cross validation (LOOCV) was used to assess classi cation ability. Then, we assessed the average performance of each DEG for classi er using the "importance" function of "Random Forest" package, and the DEGs were ordered in Mean Decrease Accuracy based on their importance.

DEGs identi cation
To identify the DEGs between normal tissues and NPC tissues, we analyze the data of GSE12452, GSE53819 and GSE64634 datasets. Then, the volcano plots of DEGs were shown in Fig.2 (Fig.3). These 91 overlapping genes were con rmed as candidate DEGs and employed for further analysis.

GO term and KEGG pathway analysis
The top enriched GO terms and KEGG pathways were listed in Fig. 4. As Fig.4 shows, the 91 overlapping genes were highly associated with pathways including drug metabolism-cytochrome P450 pathway, metabolism of xenobiotics by cytochrome P450 pathway and chemical carcinogenesis pathway. With the analysis of GO terms, these genes were largely enriched in cell component (CC) including ciliary part, motile cilium, axoneme, microtubule and ciliary plasm. The enriched biological process (BP) notations included cilium organization, axoneme assembly and cilium movement.

Protein-protein interaction (PPI) network construction
The DEG expression PPI network was showed in Fig.5. The PPI network contained 91 nodes and 9 DEGs were recognized as hub genes (see Table 1) with node degree≥10. In addition, in this PPI network, we observed three signi cant models after MCODE analysis in Cytoscape and we choose the most signi cant module with MCODE score = 7.7 and number of nodes =9. The module revealed that seven of the nine genes (DNALI1, RSPH1, SPAG6, ARMC4, DNAI2, DNAH9, and RSPH4A) were belong to the hub genes. With regards to the functions, these DEGs were closely involved the cilium movement, ciliary part and cilium organization, in which DNALI1, TEKT1,ARMC4, DNAI2, DNAH9, RSPH4A,SPAG6 and RSPH1 were highly enriched [19][20][21][22][23][24][25][26][27] . ARMC4 is a top tumor suppressor gene and mutation occurs in breast cancer 28 . SNTN is a kind of apical structure protein and its abnormal expression was veri ed to be related to pathological and cancerous phenotypes 29 . These ndings indicated that these tightly interactional genes associated molecular pathways and might activate pathogenesis of NPC.

Classi er construction
We built the RF classi er using the function in "RandomForest" package for R and added these DEGs to the classi er one by one in order of importance (supplementary table 1). For a variable, the higher decrease in Gini and the higher MeanDecreaseAccuracy measure, the greater is its importance in the classi cation process. The top 30 DEGs obtained from both rankings showed in Figure 6C. As Fig. 6A shows, the top 2, including CLIC6 and CLU, are the best predictors of the classi er, and have a good prediction power AUC:0.985 . The chloride intracellular channel (CLIC) family contains six homologs in human (CLIC1-6) which plays a critical role in the processes of human cancer. A previous study suggested that the expression level of CLICs (CLIC1,4,5 and 6) were signi cantly correlated with histological tumor grade in breast cancer 30 . Clusterin (CLU) was regulated by N,N'-Dinitrosopiperazine (DNP) and promotes NPC metastasis 31 . In the case of unknown clinical information, it is also very good at identifying tumors and healthy tissues based on two-dimensional representation of unsupervised classi cation analysis (Fig. 6B).

Identi cation of differentially methylated CpG sites (DMCs)
To our knowledge, the distribution of DMCs and their in uence on key functional genomic elements including CpG islands in NPC are poorly de ned. We studied the global methylation patterns of 91 DEGs across gene sub-regions and methylated genomic regions. CpG island hypermethylation is a landmark epigenetic modi cation of cancer. In our study, 3264 DMCs were identi ed from NPC tumor genomes, among of them, 2599 were hypermethylated CpG sites (hyper-CpGs) and 665 were hypomethylated CpG sites (hypo-CpGs). The number of hyper-CpGs was greater than the hypo-CpGs in NPC tiusses 32 . As the Fig.7 shows that 1132 hyper-DMCs were enriched the body, and 1959 of hyper-CpGs were distinctly enriched with CpG islands (79.80%), but very few hyper-CpGs located in the shelfs. Our analyses also suggested that the regions close to TSS (TSS1500, TSS200, and 5′UTR) and the CpG islands (shores), in general, have more DMCs in NPC tumor tiusses. In addition, the distribution of few hypo-CpGs were comparatively even among shelfs, s-shore and CpG islands, except for higher in N-shore. Expectedly, we observed that the CpG island and its neighborhood location (shores) were predominantly hypermethylated which is similar with an early study 32,33 35 . This nding indicated that hypermethylation related genes may were highly associated with the cell division, cell cycle and development of NPC. Finally, by comprehensive analysis of DMCs and DEGs, we observed 5 differential methylation genes (DMGs), including CLDN10, CLIC6, LRRC10B, SORBS2 and UBXN10, which both hypermethylated and downregulated. The correlation analysis between the 5 DMGs and DMCs was generalized in Table 2.

Discussion
In this study, we analyzed DEGs and DMGs comprehensively by bioinformatics analysis. 91 DEGs were identi ed in NPC, including 5 up-regulated genes and 86 down-regulated genes. CLIC6 and CLU selected from 91 DEGs have the best classi cation ability by RF classi er. Five genes (CLDN10,CLIC6, SORBS2, UBXN10 and LRRC10B) which were both hypermethylated and downregulated in NPC were identi ed as well. There were no one that both hypomethylated and upregulated genes (hypo-up), which reveal that the hypermethylated genes may largely related to epigenetic modi cation in NPC and might lead to downregulation of these genes.
Claudin-10 (CLDN10) is the member of claudins family. Claudins are a key element of tight junction (TJ) protein and its abnormal expression can lead to the invasion and distant metastasis of tumor cells 36 . The previous report indicated that CLDN10 can enhance the metastatic phenotype of human osteosarcoma 37 . CLDN10 also shows high expression level in the patients who developed papillary thyroid cancer and had a worse survival 38 . A previous study reported that its homologous protein CLDN11 was an anti-tumor gene , abnormal methylated and expressed of CLDN11 regulated the development of the NPC by inhibiting cell migration and invasion 39 (Li et al. 2018)(39)(39) 39 . As described earlier, CLIC6 had best predictive ability and may play important role in NPC. Although, their relationship with NPC are not well known, these ndings suggested CLDN10 and CLIC6 as biomarkers for NPC are feasible. Especially, CLIC6 has the best predictive power and has three hyper-CpGs (1stExon, TSS200 and 1stExon), which indicates that it has great potential as a biomarker for NPC.
Sorbin and SH3 domain-containing protein2 (SORBS2) is a scaffolding protein and acts to refrain the ovarian cancer invasion 40 . SORBS2 suppressed the proliferation, diffusion and migration of hepatocellular carcinoma (HCC) cell both in vivo and in vitro studies 41 . Similarly, two other studies showed that SORBS2 may as a tumor suppressor gene not only is associated with cervical carcinogenesis but also negatively regulated the pancreatic cancer 42,43 . Hence, we can speculate that the hyper-down SORBS2 may work in the same way as in ovarian and HCC. UBXN10 and LRRC10B have not been reported in NPC. A study reported that LRRC10B might be a target gene regulated by RPLP0P2.
RPLP0P2 was involved into lower prognosis, proliferation and adhesion ability of lung tumor cells 44 .
Malavika Raman et al demonstrated UBXN10 locates in VCP-dependent manner cilia 45 .VCP and UBXN10 are essential for promoting cilia formation. As mentioned above, these genes are abundant in ciliary part, motile cilium, cilium organization and cilium movement and therefore may perform certain functions in NPC progress. However, this hypothesis requires to be veri ed in further experiments.
Pathways in cancer, we found that these abnormal genes were primarily enriched metabolismcytochrome P450 pathway (CYP450), metabolism of xenobiotics by cytochrome P450 pathway and chemical carcinogenesis pathway. CYP450 was a group of enzymes that appears highly polymorphic and their variants play an important role in cancer. Currently, CYP450 was used in the 80% of drugs, including some anticancer drugs 46 . Similarly, a previous report indicated that one of the expression patterns of cytochrome P450 (CYP2A6) genes had an increased risk for NPC and affected the susceptibility of NPC 47 . Hence, the DEGs were involved in metabolism of xenobiotics by CYP450 pathway may ultimately affected the treatment of NPC. The chemical carcinogenesis pathway has been proven to connected with high incidence of NPC 48,49 . Active cilia can remove inhaled harmful particles and bacteria. Therefore, without normal cilia, mucociliary clearance is severely impaired, and bacterial mucus stays in the nose and sinuses causing in ammation and recurrent infections. Therefore, we can acquire an insight to understand the mechanism of cancer by analysis of these related signaling pathways.

Conclusions
In our study, we veri ed some signature genes that were directly or indirectly involved with the pathological process of NPC. Some of them were identi ed in NPC for the rst time and may act as potential biomarkers.