A Pipeline to Call Multilevel Expression Changes between Cancer and Normal Tissues and Its Applications in Repurposing Drugs Effective for Gastric Cancer

Differential gene analyses on gastric cancer usually focus on expression change of single genes between tumor and adjacent normal tissues. However, besides changes on single genes, there are also coexpression and expression network module changes during the development of gastric cancer. In this study, we proposed a pipeline to investigate various levels of changes between gastric cancer and adjacent normal tissues, which were used to repurpose potential drugs for treating gastric cancer. Specifically, we performed a series of analyses on 242 gastric cancer samples (33-normal, 209-cancer) downloaded from the cancer genome atlas (TCGA), including data quality control, differential gene analysis, gene coexpression network analysis, module function enrichment analysis, differential coexpression analysis, differential pathway analysis, and screening of potential therapeutic drugs. In the end, we discovered some genes and pathways that are significantly different between cancer and adjacent normal tissues (such as the interleukin-4 and interleukin-13 signaling pathway) and screened perturbed genes by 2703 drugs that have a high overlap with the identified differentially expressed genes. Our pipeline might be useful for understanding cancer pathogenesis as well as gastric cancer treatment.


Introduction
Despite the development of science and technology, cancer is still a major disease we have to face. According to statistics, each year, more than 18 million new cancer cases are diagnosed worldwide. Among the living patients who have been diagnosed with cancer, about 44 million patients are told that they have less than five years of life. Cancer is also a disease with a high mortality rate. Nearly 10 million people die from cancer each year [1]. Among various cancers, gastric cancer is one of the most prevalent ones and gastric cancer patients mostly suffer from the poor prognosis of malignancy. Though there are a few anticancer drugs specifically designed for gastric cancer, more novel drugs are required to treat gastric cancer patients with different disease status.
Through the screening and testing of model organisms, more and more potential anticancer drugs have been discovered. Since most potential anticancer drugs are identified in model organisms with little human data support, their effectiveness in promoting human health remains unknown, and this uncertainty brings costly clinical trials to the pharmaceutical industry [2].
In the past 60 years, only about 130-180 kinds of anticancer drugs have been approved by the US FDA. There are about 1,300 to 1,500 kinds of various anticancer drug preparations formulated with these drugs. The research and development of new drugs and rational and effective medication guidance are still a big scientific problem.
At present, the most common drug development of gastric cancer is mainly through a large number of experimental screening, and its strategy may be slightly different, such as using some of the histological data for correlation, so as to achieve better efficacy [3]. But this method is very expensive for early drug screening, especially when the number of potential drugs is large. It has also been reported that using drug data and clinical case characteristics for machine learning to build software to guide drug use [4], such strategies have limited understanding of the pathogenesis of cancer, and the precise treatment of cancer is limited.
Several methods of computer-aided anticancer drug development have been reported. For example, many proteins have been resolved by X-ray or nuclear magnetic resonance (NMR) spectra and are available from the Open Access Protein Database (http://www.rcsb.org). This information enables researchers to understand and characterize many physiological processes based on the interactions between proteins or between proteins and small molecules (ligands), such as when drugs bind to targets. In addition to the 3D structure of the molecule, van der Waals radii, the parameters of covalent bonds, torsions, and dihedral angles were also considered, so people can quickly develop some anticancer drugs based on specific receptor proteins on the surface of cancer cells [5].
For the effective use of anticancer drugs, a more elegant solution does not seem to have been proposed at present. In this article, we will provide a new idea for cancer research and drug treatment. Here, we will compare gene expression before and after cancer and changes in gene regulatory networks, construct key sets of cancer genes, and provide poten-tial medication guidance through drug regulatory data. In this article, we will focus on gastric cancer.

Data Collection and Processing
2.1.1. Gene Expression Data. The sample data (gastric cancer) are all from The Cancer Genome Atlas (TCGA) database. We used the R package of "RTCGAToolbox" to download the RNA sequencing data (reads counts data) of gastric cancer. After excluding some irregular sample data, we obtained a total of 242 gastric cancer samples (33 normal samples, 209 cancer samples).

Drug
Regulatory Gene Data. Justin Lamb once proposed a network pharmacogenomic approach based on the concept of Connectivity Map (CMap) [6]. The CMap project contains more than 6,000 drug-perturbed gene expression profiles generated from multiple human cell lines, including more than 1,309 compounds. CMap can be used to query gene expression profiles related to various diseases, thereby identifying drugs that may "reverse" the expression of these genes. Such drugs have potential application value for treating corresponding diseases. The CMap concept has been successfully applied to disease research [7,8], and these successes have stimulated researchers to build a larger-scale perturbation-induced expression database, e.g., The Library of Network-Based Cellular Signatures (LINCS) Program  Figure 1: An overview of the pipeline. Four major steps are (1) differential expression genes analysis (DEGs) using TCGA dataset; (2) construction of gene coexpression network using WGCNA R packages to obtain clustering module; (3) perform functional enrichment analysis for each module, select cancer-related functional modules for subnetwork analysis, and screen for key genes; and (4) compare the gene expression characteristics induced by drug disturbances in CREEDS with DEGs, key modules, and key genes and select drugs with high overlap into the candidate list. 3 BioMed Research International [9], and a crowd extracted expression of differential signatures (CREEDS) [10]. Here, we used 8590 drug perturbationinduced gene expression signatures collected in Crowd Extracted Expression of Differential Signatures (CREEDS) for our analysis (http://amp.pharm.mssm.edu/creeds).

Analysis of Gene Expression Differences between Cancer
and Normal Samples. Read counts were used to call differential expression genes by DESeq2 [11] between cancer and normal samples (adjusted p value less than 0.05 was set as threshold). Before using DESeq2 for analysis, we used principal component analysis (PCA) to screen all samples of gastric cancer, excluding some outlier points to reduce sample disturbance.

Construction of Cancer and Normal Gene Coexpression
Network. We divided gastric cancer samples into normal and cancer samples, and we removed the abnormal samples to build a coexpression network through hierarchical clustering provided by weighted gene correlation network analysis (WGCNA) [12]. The soft threshold is set as follows: gastric cancer (normal-6, cancer-6).

Analysis of Gene Regulatory Networks.
We can obtain the clustered gene modules through weighted gene correlation network analysis (WGCNA). The correlation of gene expression within the modules is relatively high, which may belong to the same regulatory subnetwork. We select the modules clustered by the normal sample group in gastric cancers for functional enrichment analysis, find cancer-related functional pathways, and use the genes in the module as a benchmark to compare the changes of these genes in the corresponding cancer sample groups to explore cancer gene regulation changes from normal samples.

Analysis of Potential Applicability
Drugs. 8590 drug perturbation-induced gene expression signatures collected in Crowd Extracted Expression of Differential Signatures (CREEDS) were used in our analysis. For signatures from CREEDS, we use Fisher's exact test to rank them, and we calculated the significance of the overlap between up-and down-regulated genes in normal and cancer samples with drug perturbation-induced up-and down-regulated genes. A drug was ranked to the top if drug-induced genes significantly overlapped with differentially expressed genes in normal and cancer samples.

BioMed Research International
In the case of the above threshold screening, there are many significantly differentially expressed genes. We decided to increase the screening criteria (adjusted p value ≤ 0.001, base mean ≥ 100, | log2 fold change | ≥1) according to the data distribution (Figure 3). We obtained 2,409 significantly differentially expressed genes (1,084 up-regulated genes and 1,325 down-regulated genes, compared with normal samples). "ClueGO" [13] software was used for pathways enrichment analysis to create and visualize networks of pathways ( Figure 4). A total of 57 significantly enriched pathways were obtained (p value less than 0.001), and most of the pathways were mainly related to synthesis and modification (class I MHC-mediated antigen processing and presentation: R-HSA:983169, collagen biosynthesis and modifying enzymes: R-HSA:1650814, activation of the prereplicative complex: R-HSA:68962 [14],  [18], scavenging by class A receptors: R-HSA:3000480 [19], etc.), which were highly related to the characterization of cancer. We screened some signal paths to build a network graph based on the connectivity of pathways ( Figure 5).

WGCNA Analysis
Results. The input data of DESeq2 is also used as the input data of WGCNA. Based on the data results, we first divided the sample data into normal samples and cancer samples and run the WGCNA program separately. We remove those genes whose expression level is 0 in all samples, and then, we remove the sample points of the partial separation group based on the hierarchical clustering results of the samples (Figures 6(a) and 6(b)). There are 187 valid samples in cancer group and 27 valid samples in normal group. After running WGCNA, we obtained 72 gene modules in cancer group and 36 gene modules in normal group (Figures 6(c) and 6(d)). ClueGO cyREST tools are used for each module for functional enrichment analysis [20], which is a good batch task processing tool. The main pathway enrichment results of each module can be viewed in the Supplementary Table S2.
Due to the large results, we can analyze a specific pathway, taking the interleukin-4 and interleukin-13 signaling pathway as an example. It was reported that IL-4 and IL-13 inhibited colon cancer cell-cell adhesion by down-regulation of Ecadherin and CEA molecules [20]. In the normal group, there The results of 35 gene expression differences were extracted from the results of DESeq2. According to the WGCNA clustering results, they were divided into two groups and ranked according to the expression multiples. (because it is to use drugs for correction treatment, the down-regulated gene caused by drugs will match the up-regulated gene caused by gastric cancer, and the up-regulated gene caused by drugs will match the down-regulated gene caused by gastric cancer) and (e) set the corresponding threshold standard. Select the drug with high matching degree as the candidate treatment drug. 9 BioMed Research International were a total of 27 gene hits. In the cancer group, this number was reduced to 8 genes, and these genes did not overlap. The expression difference of these 35 genes in cancer and normal samples can be seen in Table 1. The correlation of these 35 genes was also calculated by DGCA: a comprehensive R package for differential gene correlation analysis [21]. We obtained the pairwise correlations of the thirty-five genes in the healthy group and the cancer group, retaining all the results of the absolute value of the correlation greater than 0.3 and the p value less than 0.05, and plotted the network diagram ( Figure 7) through Cytoscape (https://cytoscape.org/). It is not difficult to find that related genes in the interleukin-4 and interleukin-13 signaling pathways have decreased activity in cancer samples, and their associations have also been disrupted. This may be the direction of potential drug treatment.

Potential Drug Discovery.
We used drug perturbationinduced gene expression signatures obtained from CREEDS to compare with genes whose gene expression was significantly (p value < 0.05) different in cancer and normal samples and calculated the intersection of the genes in each drug perturbation-induced gene expression signature with significantly different genes (Figure 8). The results of the comparison can be viewed in Supplementary Table S3. Because drug mining needs to be more cautious, we used stricter screening criteria (p value is less than 1e − 5) and obtained a total of 2703 matching drug perturbation-induced gene expression signatures (Supplementary Table S4). We compared the number of overlapping genes with the number of genes affected by the drug itself to obtain coverage, ranked the coverage, and selected the results with a coverage greater The drugs obtained are rigorously screened and their mechanisms discovered through the relevant literature. 10 BioMed Research International than 0.95 for further observation. Seven drugs were screened out, and all were found to be related to cancer treatment through verification (Table 2), and most of the genes regulated by these seven drugs were significantly differently expressed genes for gastric cancer. This may provide new ideas and directions for our use of drugs.

Discussions
In this article, we propose a new idea for cancer mechanism research and drug treatment. We use RNA sequencing data from cancer and normal samples as our input files and perturbation-induced gene expression signatures as our reference files. Through the analysis of gene expression differences, WGCNA analysis, comparison analysis of the gene disturbance characteristics affected by drugs, etc., the related pathway changes of gastric cancer were explored, and some potential drugs that highly matched the characteristics of gastric cancer gene changes were obtained. This method is very meaningful for systematically understanding the cure mechanism of gastric cancer, the changing characteristics of pathways, and medication guidance.
In the specific research process, we obtained the list of differential genes through expression differential analysis. Through certain threshold screening and GO/KEGG enrichment analysis, we obtained some potential cancer-related pathways, such as interleukin-4 and interleukin-13 signaling: R-HSA:6785807, class I MHC-mediated antigen processing and presentation: R-HSA:983169, collagen biosynthesis and modifying enzymes: R-HSA:1650814, and activation of the prereplicative complex: R-HSA:68962. In the collinearity analysis, we annotated the function of the coexpression gene network module and compared the gene association changes of the subnetwork modules including interleukin-4 and interleukin-13 signaling pathway in normal samples and gastric cancer samples. We found that interleukin-4 and interleukin-13 signaling pathways have discredited activity in cancer samples, and their associations have also been disrupted. This may be the direction of potential drug treatment.
There are a few factors that bring possible errors and uncertainties to the analyses. For example, the cancer samples collected by TCGA come from multiple individuals and multiple platforms, which will bring batch effects. In addition, gastric cancer may have multiple subtypes, which might have intrinsic differences. In the future, we will develop methods to minimize the effects of these confounding factors, and try to identify gene expression changes and pathway changes in different subtypes. As the concept of personalized medicine is proposed and promoted, the cost of next-generation sequencing is decreasing year by year. Our objective is very likely to be achieved.

Data Availability
All the data used in this study could be downloaded from TCGA Gastric cancer transcriptome data https://portal.gdc .cancer.gov/projects/TCGA-STAD. Here, we downloaded it using R package of "RTCGAToolbox."

Conflicts of Interest
The authors have declared no conflict of interest.