Sparse Gene Coexpression Network Analysis Reveals EIF 3 JAS 1 as a Prognostic Marker for Breast Cancer

Predictive and prognostic biomarkers facilitate the selection of treatment strategies that can improve the survival of patients. Accumulating evidence indicates that long noncoding RNAs (lncRNAs) play important roles in cancer progression, with diagnostic and prognostic potential. However, few prognostic lncRNAs are reported for breast cancer, and little is known about their functions that contribute to cancer pathogenesis. In this paper, we used weighted correlation network analysis (WGCNA) to construct networks containing noncoding and protein-coding genes based on their expression in 1097 breast cancer patients. The differentially expressed genes were significantly overlapped with gene modules regulating cell cycle and cell adhesion. The cell cycle-related lncRNAs were consistently downregulated in breast cancer. One lncRNA, EIF3J-AS1, is significantly associated with clinicopathological characteristics, including tumor size, lymph node metastasis, estrogen receptor (ER), and progesterone receptor (PR) status. Kaplan–Meier survival analysis revealed that EIF3J-AS1, a downregulated lncRNA in breast tumor, is a potential prognostic marker for breast cancer. EIF3J-AS1 may function in an estrogen-independent manner and could be inhibited by the compound FDI-6. Therefore, integrating sparse gene coexpression network and clinicopathological features can accelerate identification and functional characterization of novel prognostic lncRNAs in breast cancer.


Introduction
Breast cancer is a highly heterogeneous disease, which is commonly divided into five subtypes, basal-like, HER2, luminal A, luminal B, and normal-like, using histopathological status of either estrogen receptor (ER), human epidermal growth factor receptor (HER2), or a gene expression-based classifier (PAM50) [1].The use of the mRNA-based prognostic marker, comprised of multiple differentially expressed genes, has been supported by clinical guidelines, which assists the clinical treatment of breast cancer by integrating clinicopathological factors [2,3].
Gene coexpression networks (GCNs) have been widely used in the studies of cancer for the identification of prognostic signature [4].GCN from transcriptomic profiles facilitates elucidating gene interactions and exploring regulatory mechanisms [5].For each gene expression profile, it contains expressions of tenths of thousands of genes in detected samples.The coexpression network is constructed based on the pairwise gene correlation matrix.In the network, each node represents one gene, while each edge represents a pair of genes with highly correlated expression pattern.The large coexpression network is not easy to interpret because of its high dimensionality.Besides, there are few master regulatory genes which basically control the state of the network [6].It is promising to decompose the sparse network into smaller components [7,8], which are also referred to as gene modules.GCN is quite sparse with only a few "hub" genes densely connected to each other.For years, the scale-free network model has been supported for biological networks [9].For example, sparse signal transduction networks follow the scale-free properties.In E. coli and S. cerevisiae, the degree distribution is P k = k −γ , r ≈ 2 [10], which implies that majority of the molecules are involved in few interactions and minority of them have many interactions [9,11].
Long noncoding RNA (lncRNA), with length longer than 200 nt, has been regarded as the dark matter of the genome for decades.However, with the development and application of next-generation sequencing (NGS), lncRNAs have been found to have a myriad of molecular functions in diseases including cancers [12].LncRNAs such as HOTAIR and MALAT1 had been reported as a prognostic marker for breast cancer [13,14].Differential analysis and coexpression network has been successfully applied to identify prognostic lncRNAs in breast cancer [15].Therefore, in this study, weighted correlation network analysis (WGCNA) was used to identify modules of highly correlated genes.Then, we focus on those modules significantly enriched by differentially expressed genes, which play important roles in breast cancer.The deregulated lncRNAs were identified by integrating clinicopathological characteristics and further investigated to determine their prognostic potential in biologically meaningful modules.

Materials and Methods
2.1.Data Preprocessing.We downloaded the transcriptomic expression profiles of TCGA breast cancer from TCGA (https://tcga-data.nci.nih.gov/).The dataset contains primary tumor and adjacent normal samples from primary and metastatic samples.Before constructing the gene coexpression network (GCN), we first filtered out genes with FPKM <1 in more than half of all the samples.The expression profiling contains the expression value of 12,488 genes in 1202 samples from 1097 patients.In the TCGA BRCA cohort, the number of normal samples is ~10% of the tumor samples.

Weighted Correlation Network Analysis (WGCNA).
WGCNA uses adjacency to measure the similarity between two genes in the network, which is calculated based on the correlation coefficient.In the network, similarity s ij is defined as the absolute correlation coefficient between the profiles of genes i and j: s ij = cor x i , x j For a traditional unweighted network, adjacency a ij is defined as follows: where τ is the hard (fixed) threshold parameter to weigh the edges.However, the unweighted networks do not accord with the continuous characteristics of the coexpression information, which will lead to loss of information.Therefore, weighted networks fit the nature of the continuous coexpression.The corresponding adjacency can be defined as a ij = s β ij , where β ≥ 1.The threshold β = 10 is chosen based on the approximate scale-free topology criterion of the coexpression network.The adjacency a ij is further transformed to a topological overlap matrix (TOM), which is a measure to evaluate how strongly two genes are correlated to the same set of neighboring genes.Then, 1-TOM is used as a dissimilarity measure for hierarchical clustering.In the clustering dendrogram, each branch represents one module, which compromises of genes with highly similar expression pattern.In this way, modules can be defined based on different branch-cutting methods, for example, the dynamic tree cut methods [16].For more details, please refer to [17].

Identifying Differentially
Expressed Genes.To identify the differential genes, we only chose the matched tumor and normal samples from 112 patients, in order to avoid the bias caused by unbalanced sample size in the TCGA BRCA cohort.The raw read counts were downloaded from TCGA (https://tcga-data .nci.nih.gov/).R package DESeq [18] was used to identify differentially expressed genes (DEGs) in breast cancer.The significance P value was adjusted by Benjamini-Hochberg FDR.The cutoff of significant P value was 0.05.

Gene Ontology Annotation and Enrichment
Analysis.We used the online web tool DAVID [19] v6.8 for functional enrichment analysis.Gene Ontology (GO) defines concepts used to describe gene functions along three aspects: biological process (BP), molecular function (MF), and cellular component (CC).When performing functional enrichment analysis on the genes in each module, we considered GO terms of BP branch.EASE score is a modified Fisher's exact P value to evaluate whether the interested genes are significantly enriched in a specific gene function, which contains a lot of genes to achieve this function.The smaller the P value is, the more enriched the interested genes are.The Benjamini-Hochberg false discovery rate (BH-FDR) was used for correcting multiple comparisons.The enrichment threshold of P value was set to 0.01.

Associating LncRNAs with Clinicopathological
Characteristics.The patients were divided into high and low groups, according to the median expression level of candidate lncRNAs.The patients with the lncRNA expression level larger than its median expression value were assigned into the high group and vice versa.Chi-squared test was used to associate gene expression with clinicopathological features.

Statistic Method for Cross-Dataset Validation and
Survival Analysis.The expression difference of candidate lncRNAs is compared by Mann-Whitney U test in cancer versus normal samples.Kaplan-Meier and Cox regression analyses were utilized to assess the prognostic significance of lncRNAs.The statistical analysis was performed using R.

Results
3.1.Gene Modules Identified Using WGCNAs.WGCNA was used to construct the gene coexpression network (GCN) based on the TCGA BRCA dataset.Only genes with appreciable expression levels (FPKM > 1) in more than half of the samples were considered for further analysis.The gene expression profiles, comprising of 12,488 genes in 1202 samples, were log 2 transformed and subjected to WGCNA.As shown in Figure 1(a), power 10 was chosen as the soft threshold to identify coexpressed gene modules (for details, see Section 2.2). 16 gene modules were identified, and module names were color-coded including blue, brown, green, grey, red, turquoise, yellow, and black.As the "grey" module was reserved for unassigned genes, we further As shown in Figure 1(c), the turquoise, blue, and brown modules were the top 3 modules containing the highest number of genes.We further checked the number of lncRNAs in each module.There are 8 lncRNAs in the turquoise module.And in the brown module, there are 6 lncRNAs, while 5 in the red module (Figure 1(d)).

Modules Significantly Enriched in DEGs of Breast Cancer.
The dysregulation of important genes (protein-coding and noncoding) plays important roles in tumorigenesis.We would like to know how many genes are differentially expressed in breast cancer.DESeq [18] was used to determine the DEGs between cancer and normal breast tissue, from the TCGA BRCA dataset.In total, 3032 DEGs were identified.Among these DEGs, an lncRNA Xist has experimentally supported associations with human breast cancer [20,21].
We further checked the dysregulated lncRNAs in the coexpressed gene modules.For each module, we calculated the number of DEGs.The hypergeometric test was used to test if the DEGs are significantly enriched in the module.From the P values, we found that the blue, yellow, red, and black modules are significantly overlapped with DEGs (P < 0 05, Figure 2(a)).For genes in these four modules, the expression heat map is shown in Figure 2(b).There are more upregulated DEGs in cancer tissues than those in downregulated ones.In our analysis, only 30 lncRNAs were included, excluding those lncRNAs with low expression levels [22].Among these lncRNAs in the four modules, 5 lncRNAs are upregulated in cancer relative to normal tissue and 9 lncRNAs are downregulated (Figure 2(c)).

Genes in Modules Mainly Involved in Cell Cycle and Cell
Adhesion.The gene modules in the network are often enriched with specific functions [23], which enable its 4 Complexity application in generating the hypothesis of biological significance.Besides, the "coexpression" approach has been used to understand the lncRNA function [24].To test whether the identified modules are biologically meaningful, information from Gene Ontology (GO) and KEGG pathway was used for function enrichment analysis.GO terms with BH adjusted P value of <0.01 are regarded as significant.
Among those four modules, the red module has no significantly enriched GO function and KEGG pathway.The genes in the blue module are enriched in the GO terms "cell division," "DNA replication," and "cell cycle" (Figure 3(a)).As for the KEGG pathway, "cell cycle" and "DNA replication" were the top enriched pathways (Figure 3(b)).For genes in the yellow module, they are enriched in the GO terms "signal transduction," "cell adhesion," and "angiogenesis" (Figure 3(c)).As for the KEGG pathway, "focal adhesion" and "signalling pathway" were highly enriched (Figure 3(d)).Similar to the yellow module, genes in the black module are enriched in the GO terms "cell adhesion" and "collagen catabolic process" (Figure 3(e)).The KEGG pathways "focal adhesion" and "PI3K-Akt signalling pathway" were highly enriched by genes in the black module (Figure 3(f)).The function similarity of the yellow and black modules can be known from the gene dendrogram from Figure 1(b).We also performed the similarity comparisons of all the 16 modules (Figure S1).From the clustering tree, the black module and yellow module are in the same branch, which further supports their similarity in gene functions.

Clinical Significance of Deregulated LncRNAs in Breast
Cancer.As the results showed (Figure 3), the genes in the blue module are cell cycle-related while genes in the black and yellow modules are related with cell adhesion and signal transduction.We further explored the clinical significance of the deregulated lncRNAs in these modules, according to their relationship with clinicopathological characteristics.
We divided the patients into high and low groups, according to the median expression level of candidate lncRNAs.As shown in Table 1, higher EIF3J-AS1 expression group has more older patients (P = 0 016) and more lymph node metastasis (P = 0 038).Low EIF3J-AS1 expression 5 Complexity group has larger tumor size (P = 0 006) and more advanced clinical stage and PR-negative patients (P = 0 002).
The blue module is enriched in cell cycle, which plays important roles in cancer.Moreover, considering EIF3J-AS1 and LPP-AS2 are significantly correlated with most of the clinicopathological characteristics, we further validate their expression and prognostic potential in breast cancer.

EIF3J-AS1 and LPP-AS2 Are Candidate Biomarkers in
Breast Cancer.As shown in Figure 2(c), EIF3J-AS1, LPP-AS2, and CKMT2-AS1 are lowly expressed in tumor samples, compared with their corresponding matched normal breast tissues from the TCGA cohort.It suggests that these three lncRNAs are candidate biomarkers in breast cancer.In Figure 4, it demonstrated that the expressions of EIF3J-AS1 (Figure 4  As validated in other datasets other than TCGA, EIF3J-AS1 and LPP-AS2, lncRNAs in the blue module, are candidate biomarkers in breast cancer.We further determine their prognostic potency.Using an online survival analysis tool for breast cancer [26], we performed overall survival (OS) and relapse-free survival (RFS) analysis for both EIF3J-AS1 and LPP-AS2.Patients were divided into high-and low-expressed groups, using its median expression value as the cutoff.As shown in Figure 5(a), patients with high expression of EIF3J-AS1 have better OS (P = 0 0029).However, the low and high LPP-AS2 expression groups do not exhibit significant difference in OS (Figure 5(b)).Further, we explored the RFS for both lncRNAs.High expression of EIF3J-AS1 suggests better RFS (Figure 5(c), P < 1 0e − 16).In contrast to OS, there is significant RFS difference between high and low LPP-AS2 expression groups (Figure 5(d), P = 0 003).

Coexpressed Genes of EIF3J-AS1 Participate in G2/M
Phase of Cell Cycle.To dissect the possible mechanism of lncRNAs in breast cancer, we further constructed the subnetwork of the three lncRNAs and their coexpressed genes in the blue module.As shown in the network (Figure 6(a)), EIF3J-AS1 and LPP-AS2 shared coexpressed genes.Using TANRIC [27], we have identified 812 of these genes showing strong correlations with lncRNA EIF3J-AS1 across the TCGA breast cancer dataset.54 genes of them have also interacted with EIF3J-AS1 in the GCN of the blue module.As our results showed (Figure 3(a)), the genes in the blue module participate in cell cycle.Therefore, we further found that seven genes (PTTG1, CDC20, BUB1, TTK, CDC45, PLK1, and CCNE1) are known cell cycle genes (NanoString Technologies) and are DEGs in the TCGA cohort.The expression of these seven cell cycle genes is highly correlated with that of EIF3J-AS1.
To elaborate the role of EIF3J-AS1 in cell cycle, we mapped its coexpressed genes to the KEGG pathway "cell cycle."CCNE1 and CDC45 participate in the G1 and S phases of cell cycle, while the other five genes are involved in the G2/M phase (Figure 6(b)).We speculate that EIF3J-AS1 regulates the later phase of cell cycle, based on the location of its coexpressed genes in the pathway "cell cycle."Cell cycle assays revealed significantly higher proportions of cells in the G2/M phase, suggesting a cell cycle arrest at the G2/M phase by CKI in MCF7 cells [28].From Figure S3, we found that downregulated genes after 5FU (Figure S3A) or CKI treatment (Figure S3B-C) were significantly overlapped with DEGs in the blue module.This also supports our conclusion that coexpressed genes of EIF3J-AS1 participate in the G2/M phase of cell cycle.7 Complexity

EIF3J-AS1 May Function in an Estrogen-Independent
Manner and Could Be Inhibited by FDI-6.As shown in Figure S2E, EIF3J-AS1 is significantly differentially expressed between ER+ and TNBC patients.Estrogen is the most important regulator of breast cancer.We next check EIF3J-AS1 expression after E2 treatment based on a public dataset (accession ID: GSE62789).Figure 7(a) shows that the expression of EIF3J-AS1 decreased after E2 treatment at early time points.At later time points, its expression level increased gradually.siRNA experiments (Figure 7(b)) of ERa (accession ID: GSE53532) demonstrated that EIF3J-AS1 expression increased after siERa.From E2 treatment and siERa experiments, EIF3J-AS1 may function in an estrogen-independent manner.
FOXM1 and CCNB1 are coexpressed with EIF3J-AS1 and are included in the blue module.FDI-6 was used as an inhibitor of FOXM1, according to a pubic dataset (GSE58626).EIF3J-AS1 immediately decreased with FDI-6 treatment: the fold change is around 1.5 (Figure 7(c)).The expression reduction is also expected for its coexpressed genes CCNB1 (Figure 7(d)).Therefore, we speculate that FDI-6 may be a candidate compound that can inhibit EIF3J-AS1 expression, which provides clues for further functional assay on EIF3J-AS1.

Discussion
LncRNAs have been reported as key players of many important signalling pathways in cancer, including p53 pathway, hypoxia signalling and epithelial-mesenchymal transition (EMT), telomere maintenance, and hormone receptor signalling [12].The expression of lncRNAs is reported to be specific to tissue and cancer types, which enables lncRNAs as favorable candidate biomarkers for cancer.For example, lncRNA SChLAP1 is cancer-and prostate-specific expressed and is a candidate prognostic marker [29].Coexpression of lncRNA and protein-coding genes has been utilized to predict the function of uncharacterized lncRNAs [30].Due to the inherent sparsity of the gene coexpression network, WGCNA was applied to identify the highly connected components (gene modules) from the network.Therefore, the functions of lncRNAs can be predicted based on their coexpressed protein-coding genes in the module.Among the deregulated lncRNAs, antisense lncRNAs are a class of long noncoding transcripts from the antisense strand of protein-coding genes.They can function as positive or negative regulators of its paired genes [31].In the TCGA breast cancer cohort, the expression level of LPP-AS2 is positively correlated with its protein-coding gene LPP (r = 0 52).In colorectal cancer, LPP-AS2 has been reported to be repressed by MYC, which is a proto-oncogene-regulating cell proliferation through cell cycle [32].LPP-AS2 and EIF3J-AS1 are significantly associated with clinicopathological characteristics, such as tumor size, lymph node metastasis, and PR status, which highlight their potency in clinical application.
The seven genes highly correlated with EIF3J-AS1 are deregulated in cancer, and some are reported as prognostic 9 Complexity markers of breast cancer.According to the results from survival analysis, high CDC20 expression and high PTTG1 indicated aggressive pathological course of breast cancer, particularly TNBC [33].BUB1 expression is correlated with a poor clinical prognosis in patients with breast cancer [34].High expression of BUB1 was associated with better OS of low-grade breast cancers [35].TTK is upregulated in several cancers, especially in TNBC.TTK was also associated with aggressive subgroups, has poor survival, and is a therapeutic target [36].Moreover, inhibiting TTK has been proposed as a novel strategy for cancer treatment, including TNBC [37].PLK1 regulates the phosphorylation of RAD51, which promotes the genome stability in breast cancer [38].PLK1 is hopefully a target gene in ER-positive breast cancer patients that have acquired resistance to estrogen deprivation therapy [39].Inhibition of PLK1 is a promising therapeutic approach for patients suffering triple-negative breast cancer (TNBC) [40].The expression of CCNE1, one cell growth-related gene, has been reduced by the lncRNA LINC00152 via knockdown experiments [41].Based on the in vitro experiments from public datasets, we speculate that EIF3J-AS1 may function in an estrogen-independent manner.EIF3J-AS1 could be inhibited by the compound FDI-6.Therefore, experiments like overexpression and RNA interference (RNAi) of EIF3J-AS1 are needed to further elaborate its regulation of cell cycle via its target genes.10 Complexity This study constructed and analyzed sparse gene coexpression network based on transcriptomic profiles of the TCGA breast cancer cohort.It identified a prognostic lncRNA that participates in cell cycle process via its coexpressed genes.

2
Complexity focused on the other modules except the grey module.The clustering dendrogram of genes is shown in Figure 1(b).

Figure 1 :Figure 2 :
Figure 1: Gene modules detected using the weighted correlation network analysis (WGCNA).(a) Scale-free topology index and mean connectivity were used to determine the soft threshold.(b) Clustering dendrogram of genes.The dissimilarity of genes is based on the topological overlap.The genes are assigned to different modules and modules are named using different colors.(c) Number of genes in each module identified from WGCNA.The numbers in the parentheses represent the number of genes in each module.(d) Number of lncRNAs in the modules containing more than 100 genes.

Figure 3 :
Figure 3: Biological function of DEGs in each module.(a) Gene Ontology and (b) KEGG pathway enrichment analysis for the blue module.(c) Gene Ontology and (d) KEGG pathway enrichment analysis for the yellow module.(e) Gene Ontology and (f) KEGG pathway enrichment analysis for the black module.
(a)), LPP-AS2 (Figure 4(b)), and CKMT2-AS1 (Figure 4(c)) are indeed significantly lower in tumor.The dataset GSE31448 contains cancer and normal mammary samples from 353 patients.Another dataset GSE58135 also contains expression profiles of 140 normal and tumor breast tissues.From Figures 4(d) and 4(e) and Figure S2A-B, EIF3J-AS1 is lowly expressed in breast tumor.LPP-AS2 is also lowly expressed in tumor tissues, compared to normal breast tissues (Figures 4(f) and 4(g)).

Figure 5 :
Figure 5: Prognostic potential of EIF3J-AS1 and LPP-AS2.(a) According to the Kaplan-Meier plot, patients with high EIF3J-AS1 expression have better overall survival.(b) Patients with high LPP-AS2 expression do not have different overall survival, compared to patients with low LPP-AS2 expression.(c) Patients with high EIF3J-AS1 expression have better relapse-free survival.(d) Patients with high LPP-AS2 expression have better relapse-free survival.

Figure 6 :
Figure 6: Coexpression network of EIF3J-AS1, LPP-AS2, and CKMT2-AS1.(a) The coexpression network was constructed for differentially expressed lncRNAs and their coexpressed genes in the blue module.Node size is proportional to the number of coexpressed genes.(b) The pink background genes are the seven coexpressed genes of EIF3J-AS1 in the KEGG pathway "cell cycle."

Table 2 :
The relationship between LPP-AS2 and clinicopathological features in the TCGA cohort.

Table 1 :
The relationship between EIF3J-AS1 and clinicopathological features in the TCGA cohort.