Clinical Features and Prognostic Impact of Coexpression Modules Constructed by WGCNA for Diffuse Large B-Cell Lymphoma

Objective Diffuse large B-cell lymphoma (DLBCL) is a highly aggressive malignant tumor, accounting for 30-40% of non-Hodgkin's lymphoma. Our aim was to construct novel prognostic models of candidate genes based on clinical features. Methods RNA-seq and clinical data of DLBCL were retrieved from TCGA database. Coexpression modules were constructed by WGCNA. Then, we investigated the interactions between modules and clinical features. By overall survival analysis, prognostic candidate genes from modules of interest were identified. A coexpression network of prognostic candidate genes was then constructed through WGCNA. GEPIA was used to analyze the expression of a candidate gene between DLBCL and normal samples. Results 19 coexpression modules were constructed by 12813 genes from 52 DLBCL samples. The number of genes in modules ranged from 34 to 5457. We found that the purple module was significantly related with histological type (p value = 1e-04). Overall survival analysis revealed that MAFA-AS1, hsa-mir-338, and hsa-mir-891a were related with prognosis of DLBCL (p value = 0.027, 0.039, and 0.022, respectively). A coexpression network was constructed for the three prognostic genes. MAFA-AS1 was interacted with 36 genes, hsa-mir-891a was interacted with 11 genes, while no gene showed interaction with hsa-mir-338. Using GEPIA, we found that MAFA-AS1 showed low expression in DLBCL samples (p < 0.01). Conclusion We constructed a coexpression module related with histological type and identified three candidate genes (MAFA-AS1, hsa-mir-338, and hsa-mir-891a) that possessed potential value as prognostic biomarkers and therapeutic targets of DLBCL.


Introduction
DLBCL is a highly aggressive malignant tumor originating from mature B-cells, accounting for 30-40% of non-Hodgkin's lymphoma [1,2]. Patients with DLBCL usually have a poor prognosis due to ineffective primary and second-line therapy or recurrence after stem cell transplantation [3]. Therefore, easily applicable prognostic parameters are necessary for clinicians, especially since new molecular markers have not yet entered clinical routines [4]. The International Prognostic Index (IPI) is the most common tool for risk stratification in DLBCL. However, due to improved treatment options, pathobiology, and life expectancy of patients with DLBCL, IPI has been challenged [5]. Therefore, it is necessary to propose novel prognostic models based on clinical features and prognostic biomarkers.
As a molecular heterogeneous disease composed of different histopathologic and genetic subtypes, genetics of DLBCL has clinical implications for patient risk prediction and treatment [6]. Molecular traits are increasingly being used to guide DLBCL drug development, predict patients' clinical outcomes, and make treatment plans [7]. Therefore, the collection and evaluation of molecular features in clinical samples are critical to improve the prognosis of patients with DLBCL. Genomic studies have revealed a large number of mutant genes in DLBCL; their clinical significance remains unclear.
As a powerful method for transcriptomics analysis, RNA sequencing (RNA-seq) has been widely used to explore gene function and biological patterns, as well as to find candidate drug targets and to identify biomarkers for predicting disease risk and prognosis [8,9]. The Cancer Genome Atlas (TCGA) has produced RNA-seq data, which provides an unprecedented opportunity for cancer biology. Weighted gene coexpression network analysis (WGCNA) is a systematic biological method, which is widely used to generate gene coexpression networks [10]. Instead of linking thousands of genes to physiological characteristics, it focuses more on the relationship between several modules and features [11]. It provides a specific measure for clinical prediction of DLBCL diagnosis and developing new treatment strategies [12]. Therefore, WGCNA can explore hidden biological patterns. The method has been used to analyze many kinds of diseases, such as breast cancer, uveal melanoma, gastric cancer, and colon cancer [13][14][15][16]. However, there is no study on clinical modules of DLBCL using the WGCNA method.
In the present study, coexpression modules were constructed by WGCNA. After investigating the correlations between modules and clinical features, a module of interest was identified for functional enrichment analysis. We found three candidate genes (MAFA-AS1, hsa-mir-891a, and hsamir-338) that were related with prognosis of DLBCL by overall survival analysis. And MAFA-AS1 had low expression in DLBCL by GEPIA.

Materials and Methods
2.1. Data Processing. RNA-seq and clinical data of lymphoid neoplasm diffuse large B-cell lymphoma patients were downloaded from TCGA data repository (https://cancergenome .nih.gov/). The gene expression level was normalized as fragments per kilobase of transcript per million mapped reads (FPKM) using Robust Multiarray Average (RMA) algorithm. The miRNA expression level was measured as RPM. Clinical information included clinical TNM stage, histologic grade, age, gender, and survival information. As genes with little variation in expression usually represent noise, the most variant genes were filtered. Gene variables were measured by median absolute deviation (MAD).

Construction of Gene Coexpression
Network. Gene coexpression network was constructed using WGCNA package in R [17]. Power values were screened out by the WGCNA algorithm. Firstly, gene expression similarity matrix S = ðs ij Þ was constructed via calculating the absolute value of Pearson correlation coefficient between two genes. The formula was listed as below: where x i and x j were vectors of expression value for gene i and j and s ij represented the Pearson correlation coefficient of gene i and gene j. Next, gene expression similarity matrix was transformed into adjacency matrix a ij . The formula was a ij = js ij j β . To further identify functional modules in the coexpression network, the adjacent matrix was transformed into a topological overlap matrix (TOM), and the corresponding dissimilarity (1-TOM) was calculated.
where α was the weighted adjacency matrix given by a ij = js ij j β and β =5 was the soft threshold power.
where D expressed the degree of dissimilarity gene expression in different samples. Scale independence and average connectivity analysis of modules with different power values was performed by gradient test (power value ranging from 1 to 20). Appropriate power value was determined when the scale independence value was equal to 0.9. The WGCNA algorithm was then used to construct the coexpression network and extract the gene information in the most relevant module. The candidate network was selected according to the coexpression weight > 2:5. The correlations between modules and clinical features were analyzed using WGCNA. Module eigengene (ME) can summarize the gene expression profiles, and the formula was as follows: where q represents the qth module.
We calculated the correlations between MEs and external clinical data as the module membership (MM). p < 0:05 was statistically significant. The genes in the most relevant module were chosen as candidate genes, as follows: where ME q i means the identification of the ith gene in the qth module.

BioMed Research International
We calculated the correlations between MEs and external clinical data. A p value < 0.05 was statistically significant. The genes in the most relevant module were chosen as candidate genes.

Functional Enrichment Analysis.
To explore the potential biological themes and pathways of genes from modules of interest, the clusterProfiler package in R was used to annotate and visualize Gene Ontology (GO) terms (including biological processes, molecular functions, and cellular components) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [18]. A p value < 0.05 was considered significant pathways.

Prognostic Analysis of Candidate Module
Genes. Then, we further assessed the prognostic value of candidate genes by overall survival analysis. Log-rank tests were used to select prognosis-related genes from candidate genes. Survival package was used to carry out log-rank tests and survminer package was used to plot Kaplan-Meier survival curves.

Candidate Gene Coexpression Network Construction.
WGCNA can evaluate coexpression information of genes. Through survival analysis, we obtained the core genes related with prognosis. Moreover, we constructed a coexpression network of prognosis-related genes through WGCNA.

2.7.
Candidate Gene Risk Assessment. The online database Gene Expression Profiling Interactive Analysis (GEPIA) (http://gepia.cancer-pku.cn/index.html) was used to analyze the gene expression between cancer and normal samples [19]. As an interactive web, GEPIA can analyze the RNA-   BioMed Research International  [20]. Using GEPIA, we analyzed candidate genes related with prognosis between cancer and normal samples from TCGA.

Identification of the Purple Module as the Module Most
Relevant to Clinical Traits. To explore the molecular mechanisms behind the trait, we identified genes associated with a certain clinical trait. In the present study, the clinical parameters of DLBCL patients, including age, clinical stage, gender, recrudescence stage, and histological type were involved in the module-trait relationship analysis. As shown in Figure 3, the purple module was closely related with histological type (p value = 1e-04) and the cyan module was associated with age (p value = 0.02). Therefore, the purple module was selected as module of interest for subsequent analysis.

Functional Enrichment Analysis of Genes in the Purple
Module. To explore potential pathways of the purple module, GO and KEGG enrichment analyses were performed on the genes from the purple module. 52 GO-enriched terms were shown Table 2. The top ten GO terms included regulation of neuron projection development, gliogenesis, regulation of synaptic plasticity, myelin sheath, glial cell differentiation, glial cell development, calcium-dependent protein binding, extracellular matrix assembly, axon ensheathment in central nervous system, and central nervous system myelination (Figure 4(a)). In the KEGG analysis, 10 pathways were enriched by genes in the purple module, including synaptic vesicle cycle, Ras signaling pathway, GABAergic synapse, morphine addiction, cell adhesion molecules (CAMs), PI3K-Akt signaling pathway, retrograde endocannabinoid signaling, gastric cancer, histidine metabolism, and reninangiotensin system (Table 3, Figure 4(b)).

MAFA-AS1
, hsa-mir-338, and hsa-Mir-891a as Candidate Genes Related with Prognosis of DLBCL. The purple module was closely related with histological type. Therefore, genes in the module were of great significance to evaluate the potential value. Among the 100 genes, 3 genes (MAFA-AS1, hsa-mir-338, and hsa-mir-891a) were associated with prognosis of DLBCL (Table 4). As shown in Figure 5, highly expressed MAFA-AS1 and hsa-mir-338 had shorter survival time than their low expression (p value = 0.027 and 0.039, respectively). And low expression of hsa-mir-891a had poorer clinical outcomes than its high expression (p value = 0.022).

Candidate Gene Coexpression Network Construction.
We constructed a coexpression network for genes with prognostic significance by WGCNA. As shown in Figure 6, MAFA-AS1 was interacted with 36 genes, hsa-mir-891a was interacted with 11 genes, while no gene showed interaction with hsa-mir-338.

Discussion
As a heterogeneous disease, DLBCL can be classified into activated B-cell, germinal center B-cell, and primary mediastinal B-cell subtypes based on gene expression profiling [21]. Though about 70% of DLBCL patients have survival time longer than five years when treated with immunochemotherapy involving rituximab plus cyclophosphamide, doxorubicin, vincristine, and prednisolone (R-CHOP) [22], however, the remaining patients are still dying of this malignant tumor. In addition, the short-and long-term toxicity of chemotherapy, including secondary malignancies and leukemia, adversely affects the long-term prognosis of patients. Therefore, it is necessary to study new therapeutic targets in DLBCL.
RNA-seq is a next-generation sequencing technology for genome-wide quantitative gene expression, which has some advantages over microarrays in characterizing transcriptomes [23,24]. However, few studies have investigated the network characteristics of coexpression networks based on RNA-seq. In the present study, RNA-seq and clinical data of DLBCL were retrieved from TCGA. 19 coexpression modules were built by the 12813 most variant genes from 51 DLBCL samples using WGCNA. WGCNA has been proved to be an effective method for detecting coexpression modules and hub genes in many aspects [25]. The interaction between genes in different coexpression modules can be found.
Although a large number of biomarkers have been identified and validated, few studies have considered correlations between genes. Genes with similar expression patterns can encode proteins with similar functional properties that can form complexes or can function in similar pathways. Therefore, we made full use of the WGCNA algorithm to detect key genes associated with sample traits in coexpressed gene networks [26]. The purple module was closely related with histology type (p value = 1e-04). And the cyan module was closely related with age. The incidence of DLBCL is positively correlated with age, about two-thirds of DLBCL patients are over 65 years old worldwide [27]. Research has found that age at diagnosis has a relationship with the molecular features of DLBCL [28].
We found that the genes in the purple module were involved in multiple pathways such as Ras signaling pathway, PI3K/AKT signaling pathway, and cell adhesion molecules. The Ras signaling pathway plays an important role in cancer biology, which regulates cell growth and proliferation. Activating mutations in Ras can result in abnormal activation of its downstream target MEK1/2 [29]. MEK has become a potential target for the treatment of DLBCL. The PI3K/AKT signaling pathway is activated in DLBCL, which plays a key role in controlling the proliferation and survival of DLBCL cells [30]. Activation of this pathway in DLBCL can cause gene mutations, loss of PTEN, or constitutive activation of upstream regulatory pathways [31]. Cell adhesion molecules function in complex biological processes like cancer progression, inflammation, angiogenesis, and metastasis [32][33][34].
miRNAs are involved in several biological processes by regulating gene expression at the posttranscriptional level, such as cell proliferation and apoptosis [35,36]. For example, downregulated microRNA-155 promotes cell cycle arrest and apoptosis in DLBCL [37]. And miR-10a suppresses cell proliferation and promotes cell apoptosis via targeting BCL6 in DLBCL [38]. The main reason for the poor results of DLBCL chemotherapy is that DLBCL cells are resistant to chemotherapeutic drugs [39]. It has been confirmed that miRNAs are closely associated with cancer chemosensitivity. To investigate the prognosis value of the genes in the purple module, we made overall survival analysis and results revealed that three miRNAs were closely related with survival. Highly expressed MAFA-AS1 and hsa-mir-338 had shorter survival Ras signaling pathway 0.01222 3 hsa04727 GABAergic synapse 0.01507 2 hsa05032 Morphine addiction 0.01572 2 hsa04514 Cell adhesion molecules (CAMs) 0.03700 2 hsa04151 PI3K-Akt signaling pathway 0.03718 3 hsa04723 Retrograde endocannabinoid signaling 0.03890 2 hsa05226 Gastric cancer 0.03938 2 hsa00340 Histidine metabolism 0.04811 1 hsa04614 Renin-angiotensin system 0.04811 1 . And low expression of hsa-mir-891a had poorer clinical outcomes than its high expression (p value = 0.022). So far, there is no study concerning MAFA-AS1 and hsa-mir-891a. As for hsa-mir-338, it has been reported that hsa-mir-338-3p inhibits invasion and migration of colorectal cancer cells by suppressing smooth expression [40]. Aberrantly expressed hsa-mir-338-3p increases the risk of esophageal cancer [41]. In addition, hsa-mir-338 can be a prognostic biomarker for oral squamous cell carcinoma [42]. Then, we constructed a coexpression network for the three genes by WGCNA. MAFA-AS1 was interacted with 36 genes, hsa-mir-891a was interacted with 11 genes, while no gene showed interaction with hsa-mir-338. Using GEPIA datasets, we found that MAFA-AS1 showed low expression in DLBCL samples (p < 0:01). Taken together, three candidate genes (MAFA-AS1, hsamir-338, and hsa-mir-891a) from the purple module could be considered prognostic biomarkers for DLBCL. Compared to IPI, the new prognostic model had some advantages. For example, the three candidate genes could be detected using immunohistochemistry, which is convenient to screen patients with DLBCL and predict their prognosis. Moreover, based on clinical features and candidate genes, it can help improve treatment options, pathobiology, and life expectancy of patients with DLBCL. However, several limitations of our study need to be pointed out. The heterogeneity of the treatment protocols is inevitable, which can bias outcomes. Furthermore, the number of samples is limited; therefore, the prognostic role of the three candidate genes requires to be verified in a larger sample of DLBCL.

Conclusion
In our study, RNA-seq and clinical data of DLBCL were retrieved from TCGA. 19 coexpression modules were built  BioMed Research International by WGCNA. And we found that the purple module was closely associated with histological type. Further analysis suggested that three candidate genes (MAFA-AS1, hsa-mir-338, and hsa-mir-891a) were significantly related with clinical outcomes. Therefore, our findings revealed a coexpression module related with histological type and identified three candidate genes (MAFA-AS1, hsa-mir-338, and hsa-mir-    Prognostic index RNA-seq: RNA sequencing TCGA: The Cancer Genome Atlas WGCNA: Weighted gene coexpression network analysis FPKM: Fragments per kilobase of transcript per million mapped reads MAD: Median absolute deviation ME: Module eigengene GO: Gene Ontology KEGG: Kyoto Encyclopedia of Genes and Genomes GEPIA: Gene Expression Profiling Interactive Analysis BP: Biological processes CC: Cellular components MF: Molecular functions.

Data Availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.  Figure 9: Validation of candidate prognostic miRNAs in independent datasets. The differences in expression patterns of hsa-miR-338 (a) and hsa-miR-891a (b) between DLBCL and controls. Overall survival analysis of hsa-miR-338 (c) and hsa-miR-891a (d) in DLBCL.