Identification of genes and functional coexpression modules closely related to ulcerative colitis by gene datasets analysis

Background Ulcerative colitis is a type of inflammatory bowel disease posing a great threat to the public health worldwide. Previously, gene expression studies of mucosal colonic biopsies have provided some insight into the pathophysiological mechanisms in ulcerative colitis; however, the exact pathogenesis is unclear. The purpose of this study is to identify the most related genes and pathways of UC by bioinformatics, so as to reveal the core of the pathogenesis. Methods Genome-wide gene expression datasets involving ulcerative colitis patients were collected from gene expression omnibus database. To identify most close genes, an integrated analysis of gene expression signature was performed by employing robust rank aggregation method. We used weighted gene co-expression network analysis to explore the functional modules involved in ulcerative colitis pathogenesis. Besides, biological process and pathways analysis of co-expression modules were figured out by gene ontology enrichment analysis using Metascape. Results A total of 328 ulcerative colitis patients and 138 healthy controls were from 14 datasets. The 150 most significant differentially expressed genes are likely to include causative genes of disease, and further studies are needed to demonstrate this. Seven main functional modules were identified, which pathway enrichment analysis indicated were associated with many biological processes. Pathways such as ‘extracellular matrix, immune inflammatory response, cell cycle, material metabolism’ are consistent with the core mechanism of ulcerative colitis. However, ‘defense response to virus’ and ‘herpes simplex infection’ suggest that viral infection is one of the aetiological agents. Besides, ‘Signaling by Receptor Tyrosine Kinases’ and ‘pathway in cancer’ provide new clues for the study of the risk and process of ulcerative colitis cancerization.


INTRODUCTION
Ulcerative colitis (UC) is a subtype of inflammatory bowel disease (IBD), which is a kind of idiopathic, chronic, recurrent, debilitating and nonspecific inflammatory condition, and its characteristic is the alternate periods of remission and active disease (Planell et al., 2013;Strober, Fuss & Mannon, 2007). Worldwide, UC is more common to gather those datasets and synthetically integrate those massive data through systems biology tools, and finally receive the stable and credible results (Marques et al., 2010;Rung & Brazma, 2017;Seifuddin et al., 2013).
The robust rank aggregation (RRA) analysis is a strict tool of systems biology, which can be adopted to the comparison of multiple gene ranking lists obtained from experiments on different platforms greatly expanded the sample size, making the identification of genes related to diseases more reliable and valuable (Kolde et al., 2012). The theory of RRA is that by looking at the location of genes respectively in each ranked list and comparing it with a randomly shuffled baseline list, each gene will be assigned a p-value, and the better the location in these ranked lists, the smaller the p-value will be. The final ranking of genes is based on the P value, and logarithmic fold changes (logFC) can be calculated as needed to determine the importance of genes together with the P-value.
In the current, systematic review and comprehensive integration of genome-wide gene expression datasets in UC is still missing. Therefore, we performed the systematic review and comprehensively integrated those genome-wide gene expression datasets through RRA to identify the most probable causative genes of UC. We hope to mark out some deepening insights into UC pathogenesis and provide some molecular target for therapeutic.
Moreover, we would use weighted gene co-expression network analysis(WGCNA) to categorise those important and aberrantly expressed genes into several biologically functional modules (Langfelder & Horvath, 2008;Prom-On et al., 2010), which could be biologically meaningful gene clusters and play important roles in UC pathogenesis.

Datasets search and eligibility criteria
On the Gene Expression Omnibus (GEO) home page (http://www.ncbi.nlm.nih.gov/geo/), ''UC biopsy'' was used as the search term, and the datasets in the search results were filtered according to the following criteria: (1) the gene expression profile measured by microarray chip technology; (2) the dataset was a comparison between active UC patients' tissue and non-UC patients' healthy tissue; (3) Sample size should be at least 5; (4) The database provided raw data or gene expression. Fragments Per Kilobase of transcript per Million fragments mapped (FPKM) matrix files for these datasets and can be used for reanalysis. The raw data is the direct information measured by instrument, in CEL format, which can be processed by R and converted into TXT format of gene expression FPKM matrix. The gene expression FPKM matrix files provided by the website should not have been normalized. Datasets that did not meet the above criteria are excluded.

Robust Rank Aggregation (RRA) analysis
The data set of a single platform is difficult to reach a large sample size, and the result is of low credibility. We used the RRA analysis method to comprehensively compare and analyze the results obtained from the genetic difference analysis of each platform, and selected the genes with strong consistency and difference, so as to make the final differentially expressed genes (DEGs) more convincing. Multiple packages of R software were applied for data processing and statistical analysis (R Core Team, 2018;Gentleman et al., 2004).

Affy package for data preprocessing
read.AnnotatedDataFrame(), read in the grouping information file for the samples(UC patients and controls); read.csv(), read in the annotations files of gene expression omnibus platform (GPL), including the conversion of probes to gene symbols; eset.rma <-justRMA(), datExpr=exprs (eset.rma), these two-step functions apply the RMA method to normalize original files, with the purpose of adjusting the overall characteristics of a single sample to make it more suitable for comparison.

Surrogate variable analysis (SVA) package for batch effect removing
Batch effect is caused by different samples under different conditions such as experiment time, experiment environment, instrument, etc., and merely data normalization cannot remove batch effect. SVA package were used to remove the batch effects from different samples of the same platform (Chen et al., 2011;Leek et al., 2012). This step is performed using Empirical Bayes method, whose core function is ComBat Finally, gene expression value matrix files with row name as gene symbol and column name as sample number were obtained for each platform for further analysis.

Limma package for differential genes analysis
The limma package is a comprehensive package with many options for loading data, data pre-processing (background correction, intra-group normalization and inter-group normalization), and differential genetic analysis. The function of empirical Bayes linear regression method for finding differential genes is very popular. At the same time, limma package is very scalable. Both one channel and tow channel data can be analyzed for differential genes, even including quantitative PCR and RNA-seq data types (Ritchie et al., 2015).
The gene expression matrix files obtained in the last step were used for differential gene analysis between UC and Control groups by Limma package respectively, so as to acquire the DEGs of each platform (Wettenhall & Smyth, 2004). MakeContrasts() as the key function and gene rank lists of different platforms were generated. In the process, the False Discovery Rate (FDR) is calculated by benjamini-hochberg correction method, which means a adjusted P-Value, but the P-Value is still used as the basis for the significance judgment of the result.

RobustRankAggreg package for RRA analysis
The RobustRankAggreg package was used to implement RRA analysis for Gene rank lists of different platforms to generate the most valuable DEGs (Kolde et al., 2012). Core functions: list(), rankMatrix(), aggregateRanks Genes with P value < 0.05 and |logFC|>1 were selected, and the smaller the P value, the higher the ranking, often small P value of the gene corresponds to a large |logFC|. The final result was visualized by pheatmap package.

WGCNA
In order to clarify the main role of DEGs in the pathogenesis of UC, this method is used to cluster genes with close relationship in the same module. The weighted gene co-expression network was constructed by the WGCNA package of R.
First, an appropriate gene expression FPKM matrix file is required. A number of genes and suitable samples were extracted from the raw data, and the matrix file is the FPKM of these genes for each sample. The DEGs generated in the RRA analysis were only the most important genes, and could not present the overall picture of the co-expression network. In order to cover most valuable difference genes, we adjusted the cut off value to p < 0.05 and |logFC|>0.14. In other studies, |logFC| values are often different in order to select sufficient and relatively high value genes for WGCNA. For example, Yan et al. selected |logFC|>0.26 (Yan et al., 2018), while Lu et al. (2014 set |logFC|>0.585 in order to get more differentiated genes . Besides, only samples from the same platform can be combined for WGCNA. To make the results more convincing, we selected GPL570 with the largest sample size, including 143 UC patients and 79 controls from eight datasets. Then, hclust() was used to hierarchical clustering of samples by average method and results in the initial sampletree. The following we defined sample clustering height = 80 to remove the isolated samples from the group, so as to obtain a more hierarchical sampletree for further analysis.
The core process of WGCNA is to build a scale-free distributed topological network, making the functional modules developed more cohesive (Langfelder & Horvath, 2008). In the view of many relevant references prove that when the scale-free fit index is greater than 0.85, the network already conforms to the scale-free network distribution (Lancu et al., 2015;Zhang & Horvath, 2005). We set an appropriate soft threshold power value to make the generated Scale free Topology Model Fit >0.85.
Next, module identification was realized by Dynamic Tree Cut method, setting minModuleSize = 30 and deepSplit = 2. Further, mergeCloseModules(), a function that can be merged automatically, completes the merging of similar modules by setting the minimum height for merging modules at 0.3. Finally, some genes that could not be classified into any functional module were uniformly collected into the grey60 module. Incidentally, the colors of each module are randomly assigned.

Functional enrichment analysis
Functional enrichment analysis was performed by Metascape (http://metascape.org) accord to the genes assigned to each module (Tripathi et al., 2015). In the results, the top 10 biological processes with the minimum p value of each module were listed, which reflected the functional characteristics of the modules.

Statistical analysis
The version of R used for statistical analysis was 3.5.0 (R Core Team, 2018). In all cases, P < 0.05 was considered statistically significant.

UC microarray datasets
In the end, 14 datasets from five platforms were selected. Details of datasets were shown in Table 1, including GSE number, sample size, Source types, detection platform, data file type and authors. In the study, the number of UC patients in each dataset ranged from

significant Differentially Expressed Genes (DEGs) between UC and non-UC Patients
The top 100 up-regulated genes and the top 50 down-regulated genes by Robust Rank Aggregation (RRA) analysis were shown in Table S1. P < 6.11E-07 and |logFC|>1 reminded significant differences of the top 100 up-regulated genes. Besides, the top 50 down-regulated genes had significant difference index of p < 6.32E-07 and |logFC|>1 (Table S1). In order to highlight the effect of the presentation, Fig. 1 displayed the logFC for unique dataset platforms and multi-dataset platforms of the top 50 up-regulated and top 50 downregulated genes. Green represents down-regulation and red represents up-regulation. The colors deepen with the increase of |logFC| respectively. The similarity of color saturation reflects the consistency of these important genes in the datasets of each platform.
Color stratification displayed the difference of expression between the two groups. In the top several genes with the greatest difference, genes such as MMP1, REG1A and AQP8 had been confirmed to abnormally expressed in UC (Planell et al., 2013).

5,344 DEGs were clustered into seven functional modules through WGCNA
Appropriate samples and genes were screened to construct gene expression FPKM matrix files. Data sets must come from the same platform to be combined into a single matrix file for analysis, and we selected all samples from the GPL570 platform with the largest sample size. After adjusting the cut off value of RRA process to p < 0.05 and |logFC|>0.14, 5344 DEGs were obtained, which was more suitable for WGCNA.
When soft-threshold power was set to 10, the scale-free topology index was >0.85, and mean connectivity was infinitely close to 0 (Fig. 3A). The analysis produced 8 co-expression modules, among which seven modules contained more genes and were the main functional modules (Fig. 3B). The number of genes in each module ranged from 97 to 1,718. The module with the largest number of genes was the blue module and the second largest module is the black module with 1,398 genes. Blue and black modules also contain the largest number of the150 most important DEGs. Therefore, we believe that the pathways involved in the two modules dominate the occurrence and progress of UC. The detailed gene names were listed in Table S2. The network heatmap plot showed that these major modules maintain a good independence from each other (Fig. 3C). Table 2 listed the functional enrichment analysis results of seven major co-expression modules. Biological processes from were ranked by Log10(P), and having the greatest |Log10(P)| was considered critical.

Co-expression modules were enriched to obtain significant pathways
The genes of the blue module were significantly enriched in the 'extracellular matrix organization, lymphocyte activation, blood vessel morphogenesis, leukocyte migration and inflammatory response'. Also, 'nucleobase-containing small molecule metabolic process, small molecule catabolic process, isoleucine degradation' pathways were the most important pathway enriched in the black module. Besides, the genes of salmon module were mainly enriched in the biological processes of 'interferon signaling, defense response to virus and herpes simplex infection'. The cyan module was enriched into functional pathways involved in multiple fields, including protein regulation, neutrophil immunity, tyrosine kinase pathway, cancer-related pathways and many other aspects. In addition, the enriched pathways of Grey60 and midnightblue modules were closely related to inflammatory response, while the 'Cell Cycle' and 'Cell Cycle Checkpoints' were the results of green module functional enrichment (Table 2).

DISCUSSION
UC is a type of IBD that affects the large intestine and colon. The pathogenesis of UC is complex and remains largely unknown. It is believed that genetic features, the immune response to microbial dysbiosis, mucosal immune response and environmental factors contribute to the pathogenesis of UC (Danese & Fiocchi, 2011). Though many genes have been found be involved in UC, the gene networks associated with the etiology of UC has not been clearly defined.

Notes.
'Count' is the number of genes contained in enriched pathway. '%' is the proportion of the total number of genes in each module. 'Log10(P)' is the p-value in log base 10. 'Log10(q)' is the multi-test adjusted p-value in log base 10.
In this study, 14 genome-wide gene expression datasets were finally included, which involved a total of 328 UC patients and 138 healthy controls. Integrated analysis using the RRA method identified quite a few crucially up-regulated or down-regulated genes (Table 1 & Fig. 1). Some of those genes are novel UC gene signatures and their molecular roles in UC pathogenesis are still largely unknown. These abnormally expressed genes may be therapeutic targets for UC and need further research.
The WGCNA clustering criteria have a great biological significance which have been widely used to explore the molecular mechanisms of various diseases (Yan et al., 2018), including IBD (Lin et al., 2018Xie, Zhang & Qu, 2018). In our study, the expressions of 5344 UC associated genes obtained from the RRA analysis were used in the WGCNA analysis, together with they were classified into seven co-expression biologically functional modules (Fig. 3B), which highlighted some new insights into the pathogenesis of UC at a systems level.
By functional enrichment analysis of the modules, we revealed several significant pathogenic mechanisms closely related to UC. In the absence of clinical traits, the importance of module is often judged by the number of genes they contain. The blue and black modules both have more than 1,000 genes, and contain the largest number of top 150 genes, which are considered to be the two most important modules.
To further understand the significance of these functional modules in the pathogenesis of UC, enrichment analysis was performed using Metascape. The importance of pathways is based on Log10(P) values. Important pathways in important modules probably have the strongest correlation with the symptoms or pathophysiology of UC. The enrichment analysis of genes in the blue module mainly involved in 'extracellular matrix organization, lymphocyte activation, blood vessel morphogenesis, leukocyte migration' which relevant to inflammatory responses revealed that inflammatory pathway occupies a core position in various pathways related to UC. Extracellular matrix can regulate inflammation, healing and fibrosis. The intestinal extracellular matrix is comprised of various macromolecules, including glycoproteins such as collagens, vitronectin, fibronectin and matricellular proteins. A recent study has reported that extracellular matrix organization strongly promotes the occurrence of Intestinal fibrosis which is common in IBD (Latella et al., 2014;Wynn & Ramalingam, 2012). The black modules with the second largest number of genes and the enriched functional pathways mainly include 'nucleobase-containing small molecule metabolic process, small molecule catabolic process, isoleucine degradation'. The regulation of metabolism of various small molecular substances suggests that many pathways and metabolism are active in tissue cells when UC is activated. 'Cell Cycle' and 'Cell Cycle Checkpoints' were the most outstanding pathways of Green module. One study pointed out that the cell cycle regulates the immune, tolerance and autoimmunity functions of T cells, and the excessive inflammation of IBD is the loss of immune tolerance caused by abnormal regulation of the cell cycle (Sturm et al., 2004). The enrichment results of Cyan module pathway can be seen that immune response-related pathways are still common and the localization of a large number of proteins inside and outside the cell once again indicates the activity of cell metabolism. In addition, 'pathway in cancer' process conforms to the recognized fact that UC and colorectal cancer (CRC) are closely related. Studies have shown that 8 to 10 years after diagnosis of UC, the risk of CRC begins to increase (Yashiro, 2014). Tyrosine kinase receptor pathway, which regulates cell proliferation and differentiation and promotes cell survival, has been closely associated with CRC (Herr et al., 2018). Meanwhile, it has been reported that tyrosine kinase receptor RON is highly expressed in UC mucosa (Hirayama et al., 2007). Therefore, we believe that tyrosine kinase pathway plays an important role in the occurrence of UC canceration.
Moreover, there are obvious similarities between the pathway enrichment results of grey60 module and midnigntblue module. The former chiefly include 'myeloid leukocyte activation, inflammatory response, response to bacterium'. The latter also focuses on the fields of 'inflammatory response, immune response'. Numerous studies have demonstrated the association between clostridium difficile infection and UC. Clostridium difficile toxins may lead to an enhanced inflammatory response in the presence of Clostridium difficile infection (Martinelli et al., 2014). With regard to other bacteria, salmonella and campylobacter infections have also been noted to cause an exacerbation of IBD (Malik, 2015;Singh, Graff & Bernstein, 2009). The functional enrichment pathways of salmon module mainly involve in 'interferon signaling, defense response to virus and herpes simplex infection', of which 'interferon signaling' is the most important. There were some observational studies on the link between Interferon Signaling and UC. It is generally known that IFN-gamma plays a key role in the early steps of installation of inflammation, promoting monocyte recruitment and activation, and inducing the expression of other inflammatory cytokines. IFN-gamma expression was increased in the pouch mucosa of UC patients compared with controls, and thus it seems to play a pivotal role in UC patients (Leal et al., 2010). Interferon signaling has been identified as a central aspect of innate immune response which induces a wide variety of antiviral proteins against pathogens infection. Moreover, interferon signaling play a crucial role in the response to herpes virus infection by antagonizing viral replication and spread (Noisakran & Carr, 2001;Su, Zhan & Zheng, 2016). This reminds us that the occurrence of UC is probably a sequential process of herpes simplex infection-defense response to virus-interferon signaling in a part of patients. A study reported corticosteroid refractory patients may benefit from antiviral therapy (Shukla et al., 2015). This subgroup of patients who were refractory to corticosteroid was likely to undergo above-mentioned sequential process continuously. Therefore, screening for herpes virus infection, prompt diagnosis and antiviral therapy may effectively relief these patients' condition and reduce colectomy risk. However, the molecular mechanisms underlying the roles of nucleobase-containing small molecule metabolic process in UC are still poorly understood and need to be elucidated in the future.
RRA analysis in the study identifies a large amount of significant DEGs that were drastically up-regulated or down-regulated, plenty of which have been reported in previous articles. We listed top 100 DEGs in the visualization operation to show the reliability of the results. The most significant causative genes are likely to be contained in the top 100 genes and need further experimental verification. Therefore, in our discussion, we will focus on the genes that are considered to be closely associated with the occurrence and development of UC.
MMP1, REG1A and AQP8 have been reported in related literatures (Planell et al., 2013). MMP1, which belong to metal dependent enzymes family, is known as interstitial collagenase involved in extracellular matrix turnover (Fanjul-Fernandez et al., 2009). MMP1 expression increased in the colonic mucosa of UC patients compared to normal controls, and the mucosa up-regulation of MMP1 correlated with the severity of disease in UC (Wang, Tan & Zhang, 2009).There is growing evidence that MMP-1 reflect acute tissue injury and involved in the initial steps of ulceration in UC and new blood vessel formation, but the molecular mechanism underlying its effects remains unclear (McKaig et al., 2003;Wang & Yan, 2006). In the previous literature it has been pointed out by several authors that Abnormally high expression of REG1A is present in the colonic mucosa in UC patients, but its precise molecular mechanism is far from being completely understood (Planell et al., 2013). Currently several researches reported that AQP8 play important roles in gastrointestinal diseases, including UC. The expression of AQP8 is a marker of normal proliferating colonic epithelial cells and AQP8 are closely connected with fluid transport in colon (Zhao et al., 2016). A study reported that AQP8 expression reduced in the ileum of UC patients while AQP8 was dramatically induced in the colon of UC patients (Zahn et al., 2007). However, a study with larger number of samples found that the AQP8 expression was markedly decreased in UC colon tissue compared to healthy subjects in agreement with the our results (Min et al., 2013).The decrease of AQP8 may lead to the disorder of colonic mucosal fluid absorption and reduce the secretion of intestinal tract, but its molecular mechanism is poorly understood (Calamita et al., 2001;Elkjaer et al., 2001). High expression of DUOX2 and DUOXA2 have been shown in patients with active UC, especially where inflammation is prominent. Both of them are regulated by inflammation and crypt-by-crypt basis in UC tissues, which can increase the production of H2O2. This process can enhance innate defense, but has the risk of potential DNA damage (MacFie et al., 2014). Studies have confirmed that DMBT1 and IL-22 mRNA are obviously highly expressed in UC mucosa, and have a significant correlation. Il-22 increased DMBT1 expression by stimulating STAT3 and NF-κB. This process is likely to have an important effect on the innate immunity of UC mucosa (Fukui et al., 2011). A study of 32 UC patients found that the detected levels of MMP-9 and LCN-2 in feces of patients with active UC were significantly increased, and that fecal MMP-9 could be a reliable biomarker of IBD activity (Buisson et al., 2018). Coincidentally, another report suggested that Serum LCN2 level significantly increased in patients with active UC, and it can serve as a biomarker of active UC (Stallhofer et al., 2015).
Among the significantly down-regulated genes, the high ranking ABCG2 also demonstrated low expression in patients with active UC in a previous study. ABCG2 is an efflux transporter involved in mucosal barrier function, low expression of which may increase the risk of tissue exposure to carcinogens, bacterial toxins and drugs (Englund et al., 2007). There are also some genes with significant differences, such as HMGCS2 and PCK1 are novel gene signatures of UC, but still lack of direct experimental evidence. Therefore, their relationship and value with UC need to be validated in future studies.
As mentioned above, this study creatively applied the RRA method to comprehensively analyze the DEGs of large samples from multiple platforms. The important DEGs filtered out are more reliable, and the functional distribution of the DEGs is more concentrated, which is conducive to clustering clearer functional modules in the process of WGCNA, so as to reveal an intimate pathway network with UC. The results of our study on important genes are compared with the results of other similar studies in Table S3. It can be found that in some studies, the DEGs of RNA microarray between UC and control group have a great overlap with our result, or at least a similarity in functional distribution.
Since the sample size of our study is larger, the results are more comprehensive. Some unreported genes still have considerable research value due to their homology with many genes that have been confirmed to be closely related to UC in gene function (Kobayashi et al., 2013;Noble et al., 2008;Planell et al., 2013;Wu et al., 2007). Compared to other data re-analyses researches on UC, almost all of the studies were conducted directly by functional clustering for a large number of DEGs to display the main mechanisms of the disease, which can not reflect the importance of the individual genes (Feng et al., 2017;Song et al., 2018). We innovatively used RRA to summarize and analyze the differential genes in multiple data sets to obtain the likely important causative genes of UC.
Regarding this study, findings are consistent with many previous research conclusions and current mainstream views, which reflects the reliability of research methods and results. However, due to various reasons, the research has some limitations. Firstly, because the raw data does not provide enough information about clinical traits and disease outcomes of samples, the correlation degree between modules and clinical traits cannot be analyzed by WGCNA method, which is limited in the judgment of module importance. Secondly, as for the setting of cut off value, p < 0.05 is considered to have statistical significance. LogFC is set based on the similar studies and the appropriate number of genes needed for the next analysis. The difference in the value set makes a difference in the results, but not in the essence. Third, the comprehensive analysis of multiple data sets is conducive to the selection of genes with relatively consistent differences for key research. However, there are differences in experimental conditions and sample composition of different data sets, which may cause some valuable information to be cleared in the data processing. Finally, this study only delineates the possible range of closely related genes through bioinformatics method, showing the most important pathways related to the pathogenesis. The results still need to be verified by specific experimental research, and provide help for the progress of disease diagnosis and treatment.

CONCLUSIONS
Bioinformatics analysis helps us narrow the scope of our research, which deepens the understanding of the molecular mechanism and provides theoretical foundation for molecular target therapy. The biggest characteristic of this study is that in the pathogenesis of UC, immunity and infection are the two most important factors. We suspect that the two are most likely to be cause-and-effect in the process of disease initiation and progression, which is a hot topic in medical research at present. Herpesvirus infection-viral response-interferon pathway may be the trilogy of corticosteroid refractory UC patients, who are necessary to accept antiviral therapy.
We can use this research as the basis for further clinical specimens experiment to verify these genes and pathways, which may lead to future insights into disease pathogenesis, diagnosis, and treatment.