Identification of key genes and pathways associated with cholangiocarcinoma development based on weighted gene correlation network analysis

Background As the most frequently occurred tumor in biliary tract, cholangiocarcinoma (CCA) is mainly characterized by its late diagnosis and poor outcome. It is therefore urgent to identify specific genes and pathways associated with its progression and prognosis. Materials and Methods The differentially expressed genes in The Cancer Genome Atlas were analyzed to build the co-expression network by Weighted gene co-expression network analysis (WGCNA). Gene ontology (GO) as well as Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were conducted for the selected genes. Module–clinical trait relationships were analyzed to explore the association with clinicopathological parameters. Log-rank tests and cox regression were used to identify the prognosis-related genes. Results The most related modules with CCA development were tan module containing 181 genes and salmon module with 148 genes. GO analysis suggested enrichment terms of digestion, hormone transport and secretion, epithelial cell proliferation, signal release, fibroblast activation, response to acid chemical, wnt, Nicotinamide adenine dinucleotide phosphate metabolism. KEGG analysis demonstrated 15 significantly altered pathways including glutathione metabolism, wnt, central carbon metabolism, mTOR, pancreatic secretion, protein digestion, axon guidance, retinol metabolism, insulin secretion, salivary secretion, fat digestion. Key genes of SOX2, KIT, PRSS56, WNT9A, SLC4A4, PRRG4, PANX2, PIR, RASSF8, MFSD4A, INS, RNF39, IL1R2, CST1, and PPP3CA might be potential prognostic markers for CCA, of which RNF39 and PRSS56 also showed significant correlation with clinical stage. Discussion Differentially expressed genes and key modules contributing to CCA development were identified by WGCNA. Our results offer novel insights into the characteristics in the etiology, prognosis, and treatment of CCA.

85 activation of oncogenic pathways, has been linked with cholangiocarcinoma 86 development (Park et al. 1999). Frequent mutations of oncogenes such as KRAS, as well 87 as cancer suppressor genes of TP53 and SMAD4 were identified by next generation

91
Although a number of genes and mechanisms have been proved to be closely 92 implicated in the development of CCA, the comprehensive picture of the whole genes and 93 regulations of CCA is still unclear. In recent years, bioinformatic methods become 94 increasingly effective in exploration and analysis of multiple genes or proteins of 95 complicated diseases. Weighted gene co-expression network analysis (WGCNA), a new 96 gene co-expression analysis method, has been successfully used to screen biomarkers 97 and pathways that could be applied in susceptibility genes, diagnose and treatment of 98 cancer. In this study, WGCNA was conducted to analyse data of TCGA data repository 99 of CCA to screen modules and core genes in pathogenesis, progression and survival of 100 CCA.

Materials and Methods
102 Publically available data sets 103 RNA expression as well as clinical parameters of CCA patients were obtained from 104 The Cancer Genome Atlas (TCGA) database (cancergenome.nih.gov). The level of gene 106 reads (TPM). Clinical characteristics had the sample type, histology grade, recurrence, 107 histologic grade, and prognosis. Each sample must had complete pathology stage and 108 histology records. If the expression of genes showed limited variation, we regraded them 109 as noise and discard these ones because the results of these genes might come from 110 systematic error and have limited significance.
111 Construction of co-expression network of genes 112 In this study, we adopted WGCNA method to build a co-expression network for 113 certain genes using R language (Langfelder & Horvath 2008). We used WGCNA method 114 to calculate power number in order to construct modules through co-expression. WGCNA 115 method was also performed for construction of the co-expression network and extraction 116 of the genetic information in the most relevant module. "Heatmap tool" package of R 117 software was selected to analyze the correlation degree among modules. As a 118 representative of the gene expression profiles of a module, module eigengene (ME) was 119 used to evaluate the relationship between module and overall survival. 120 Identification of clinical traits-related modules 121 We then performed Pearson's correlation examination to explore the relation of MEs 122 with clinical traits of histologic grade，recurrence, pathologic M, pathologic N, pathologic 123 T. A p value less than 0.05 indicated statistical significance. The two most significant 124 modules are selected as hub modules. 126 In the present study, we adopted the clusterprofiler package of R language to Prognosis investigation was conducted using the 'survival' package within R 134 software. Cox regression test was used to calculate the hazard ratio and its 95% 135 confidence interval while Kaplan-Meier method was adopted to draw survival curve. The 136 increased and decreased expression level of each gene in the candidate module was 137 decided according to median value. 138 Screening for candidate hub genes 139 We chose genes associated with prognosis as candidate genes. Next, we analyzed 140 the association between candidate genes and stage parameter. The genes associated 141 with both overall survival and clinical stage were selected as hub genes. 162 Enrichment investigation of the hub modules 163 We next proceeded GO and KEGG investigation of the genes within tan module and 164 salmon module. Altogether 50 terms showed significant difference in GO enrichment 165 analysis (Supplementary Table 1) and the most important terms were summarized in 166 Table 1. As was illustrated in Figure 3, the salmon module in GO analysis was related 167 with digestion, hormone transport, hormone secretion, epithelial cell proliferation, 211 glutathione metabolism, wnt signaling, mTOR signaling, pancreatic secretion, protein 212 digestion and absorption, insulin secretion and salivary secretion.

213
In this present study, RNA sequencing data for CCA samples from TCGA were