Inference of gene networks using gene expression data with applications

Gene networks (GNs) use graphs to represent the interaction relationships between genes. Large-scale GNs are often sparse and contain hub genes that interact with many other genes. In this paper, we propose a novel method called NetARD, which utilizes Automatic Relevance Determination (ARD) to estimate partial correlations, to infer GNs with the hub genes from gene expression data. We test NetARD on simulated GNs and in silico GNs, and it outperforms existing methods. In our high-throughput gene expression data analysis, we integrate the NetARD into a method called GN Co-expression Extension (GNCE). This approach infers the GNs of co-expressed genes, with genes from a predefined GN serving as hub genes. We validate this approach by extending the core GN of transcription factor genes of E. coli using microarray data. In an application example, we identify biological process (BP) Gene Ontology (GO) terms that are significantly involved in cancer progression. This task is accomplished by analyzing the GN inferred through GNCE using the core GN associated with the colorectal cancer pathway and RNA-seq data.


Introduction
Genes interact with each other through gene expression products in a living cell.The gene network (GN) represents the interaction relationships between genes by a graph whose nodes correspond to genes and undirected and directed edges represent the interactions and causal regulatory relationships between genes, respectively.The identification of GNs is crucial for gaining insights into cellular functions and disease-associated biological processes.The results obtained from studying GNs have important applications in the fields of cellular biology and medicine [1].With the advancements in DNA microarray and RNA-seq technologies, it is now possible to measure the mRNA expression levels of thousands of genes in an experiment, thereby capturing information about gene interactions.Mathematical or statistical networks have been employed as models to infer GNs using gene expression data.For a comprehensive understanding of model-based GN inference methods, one can refer to reviews such as [2][3][4].
GNs involved in biological processes often demonstrate sparsity, where individual genes typically interact with only a limited number of other genes.However, within these GNs, there are also hub genes that exhibit extensive interactions with a large number of genes.A prime example of such hub genes is those responsible for encoding transcription factors (TFs), which have the capability to regulate the expression of numerous genes through their TF products.Also, it has been noted that large-scale GNs often exhibit hub genes in the presence of scale-free topology [5].Hub genes, which are central to these GNs, play a critical role in cell survival and are potential targets for drug development.In recent years, network model-based methods have been developed to infer GNs with hub E-mail address: cchen@nchu.edu.tw.
Contents lists available at ScienceDirect Heliyon journal homepage: www.cell.com/heliyonhttps://doi.org/10.1016/j.heliyon.2024.e26065Received 31 January 2024; Received in revised form 6 February 2024; Accepted 7 February 2024 genes or scale-free topology from high-throughput gene expression data, e.g., correlation and weighted correlation networks [6,7], partial correlation networks or Gaussian graphical models (GGMs) [8,9], and vector autoregressions and regression-based dynamic Bayesian networks [10][11][12].These methods employ diverse regularization techniques to incorporate desired topological properties into the inferred GNs.However, inferring large-scale GNs still poses a challenging problem due to the intricate nature of gene interactions.With a vast number of genes potentially interacting directly or indirectly, unraveling these complex interactions become increasingly difficult.
In this paper, we utilize Automatic Relevance Determination (ARD) [13], a class of sparse Bayesian regressions.Our approach, called NetARD, utilizing the ARD to estimate partial correlations, enables us to infer gene networks (GNs) with hub genes from gene expression data.To validate the NetARD, we compare its performance against published methods in the inference of small-scale simulated GNs featuring hub genes, as well as in silico GNs.In our analysis of high-throughput gene expression data, we begin with a predefined core GN.We extend this core GN using the Gene Network Co-expression Extension (GNCE) method.This method incorporates co-expressed genes that interact with the core GN genes.In this process, we utilize the NetARD to infer the GNs of co-expressed genes, with genes from a predefined GN serving as hub genes.To evaluate the efficiency of GNCE, we implement it on microarray data to extend the transcription factor GN of E. coli.In our application example, we identify the Gene Ontology (GO) terms significantly associated with the colorectal cancer progression.This task is achieved by analyzing the GN that is extended via the CNCE from the core GN associated with the colorectal cancer pathway using the RNA-seq data.

Partial correlation network model
Let us consider the continuous random vector X = (X 1 , ⋯, X p ) T ∈ R p×1 that contains the expression levels of genes 1, ⋯, p.We assume that the mean vector E(X) = 0 p×1 ∈ R p×1 and the covariance matrix Cov(X) = Σ ∈ R p×p , where Σ is positive-definite, denoted as Σ ≻ 0. Let us denote by j X j represent the optimal linear regression with the minimum variance Var(ε Moreover, assume Both Eqs. ( 1) and ( 2) can be verified by utilizing the partitions of Σ and Ω, along with the formulas for optimal regression coefficient vectors expressed in terms of submatrices of Σ.For instance, a reference for the derivation of Eq. ( 2) can be found in Eqs.(17.17)-(17.9) of [14].It follows that Let the undirected graph G = (V, E) be a representation of the GN, where V contains nodes corresponding to X k (k = 1, ⋯, p) and E ⊆ V × V comprises undirected edges connecting nodes in V.The edge (k ′ , k) connecting nodes k ′ , k signifies the interaction between gene k ′ and gene k.Let W ∈ R p×p be the weighted adjacency matrix of G, where the k ′ , k entry often interpreted as the direct interaction between gene k ′ and gene k.Gene k ′ is a regulator of gene k and/or vice versa.

p-node ARD
Assume that X = (X (1) , ⋯, X (p) ) ∈ R N×p is the data matrix that contains N i.i.d.samples of X.Let us denote by X \k ∈ R N×(p− 1) the submatrix of X excluding the column X (k) .The linear regression of X (k) on X \k is given as X \k b (k) , where b In the Bayesian estimation framework, the ARD assumes that b (k)   follows a 2-level prior [13].At the first level, the prior on the diagonal.At the second level, the prior where , and g(t|s, r) = r s Γ(s) t s− 1 e − rt (t > 0) is the Gamma probability density function.Additionally, we assume the prior P (β|s β , r β ) = g(β ⃒ ⃒ s β , r β ).With all has been assumed, we formulate pARD as follows.
From the rule of conditional probability, it follows that Here, P k (b (k)   ⃒ ⃒ ⃒X, α (k) , β) represents the conditional posterior probability density function of b (k) and L k (α (k) , β ⃒ ⃒ X) the marginal likelihood function of regression model specified by α (k) , β.The conditional posterior probability density function P k (α (k) , β|X, s (k) , r (k) , s β , r β ) of regression model specified by α (k) , β is proportional to the product of L k (α (k) , β ⃒ ⃒ X) and Gamma priors over α (k) , β.By combining p ARDs, the p-node ARD (pARD) solves the maximization problem of the objective function Namely, for each fixed Θ = (s (1) , r (1) , ⋯, s (p) , r (p) , s β , r β ), the optimization seeks for the mode (α (1) , ⋯, α(p) , β) of the joint conditional posterior distribution The solution to the pARD determines a set of Bayesian regression models.According to Eq. (A1) in Appendix, P k (b (k) ⃒ ⃒ ⃒X, α(k) , β) follows the multi-normal distribution and its mean vector is given by This mean vector represents the Bayes estimate of b (k) that is the same as the weighted ridge estimate of b (k) with the regularization weights in α(k) β for k = 1, ⋯, p. Algorithm pARD for generating α(1) , ⋯, α(p) , β, b(1) , ⋯, b(p) simultaneously is described in Section A1 in Appendix. For k > 0 and 0 otherwise, and

Adaptive estimation and NetARD
In the adaptive estimation approach, the hyper-parameters within Θ serve to regularize W = (w k ′ k ) via regularizing b(k) (k = 1, ⋯, p) generated by the pARD algorithm.Let s β , r β be fixed.This approach involves updating W and fine-tuning (s (1) , r (1) , ⋯, s (p) , r (p) ) alternatingly over multiple iterations.Let us assume that W is computed in one iteration of this estimation process.For k = 1,⋯,p, we compute ṽk = 1 p ∑ p j=1 wjk ∈ [0, 1].In the subsequent iteration, we update r (k) , s (k) according to the following rules: Here, λ > 0, ξ ∈ [0, 1] are fixed constants, the function φ(t) is a strictly increasing function of t that maps [0, 1] into (0, ∞), and the symbol ∘ is the operation of element-wise product.With the elements of s (k) greater than 1 2 for k = 1, ⋯, p, Algorithm pARD generates b(1) , ⋯, b(p) using the updated Θ.Note that r (k) , sβ rβ are the prior means of α (k) , β, respectively.Additionally, is the regularization weight vector of b(k) .Indeed, we can view r (k) rβ sβ as a form of "prior" regularization weight vector.Roughly, when wk ′ k decreases (respectively, increases) in an iteration, it encourages an increase (respectively, decrease) in the "prior" regularization constants Consequently, this encouragement leads to a further decrease (respectively, increase) in wk ′ k in the subsequent iteration.Likewise, the inclusion of υk ′ in r new has an impact on the change in υk ′ during an iteration.A decrease (or increase) in υk ′ in one iteration can indeed encourage a further decrease (increase) of υk ′ in the subsequent iteration.This iterative process, often described as "the strong get stronger and the weak get weaker" effectively results in estimates of Algorithm apARD, which performs the adaptive estimation to generate W, is described in Section A2 in the Appendix.
Let us consider the assumption represented by G 0 = (V,E 0 ).In this context, the edges included in E 0 indicate likely edges, while the absence of edges in E 0 suggests that these edges are likely to be missing in the prior knowledge of G. To implement the apARD in this scenario, we scale down wk ′ k to rw k ′ k for (k ′ , k) ∕ ∈ E 0 in each iteration of the adaptive estimation process, where r ∈ [0, 1) is the shrinkage constant.The modified apARD can generate W in which wk ′ k (k ′ , k) ∕ ∈ E 0 decrease toward 0. The occurrence of wk ′ k > 0 (k ′ , k) ∕ ∈ E 0 diminishes as r decreases to 0. The modified apARD is utilized to infer GNs with pre-specified hub genes.In this scenario, we define , we utilize a modified regression algorithm (Algorithm 17.1) from Ref. [14] to compute the maximum likelihood (ML) estimate Ω of Ω that maximizes the log-likelihood l (Ω) = lndet(Ω) − trace(SΩ), where for generating G using the apARD and Ŵ using the ML estimation of Ω for G as the NetARD.The NetARD H is the NetARD implemented with the modified apARD under given hub genes.

Co-expression extension of GN
The co-expressed genes are often involved in the same biological processes.Let us assume that a predefined core GN associated with specific biological pathways is given.We propose the Co-expression Extension of GN (CEGN) to extend the core GN.Let ⃒ ⃒ estimate the similarity between X k ′ , X k .The dissimilarities or distances between X k ′ , X k are calculated as 1 − s k ′ k for all k ′ ∕ = k, and then analyzed using hierarchical cluster analysis [15,16] to construct the co-expressed gene cluster hierarchy.Typically, the results are visualized using a dendrogram.The co-expressed gene sets are derived from the generated dendrogram using tree cut techniques [17][18][19][20].We utilize the modified NetARD to infer the GN within each co-expressed gene set that includes at least one of the core GN genes, where the core GN genes serve as hub genes.This enables us to identify interactions between core GN genes and their co-expressed genes.The core GN is expanded by incorporating the inferred interacting co-expressed genes and the interactions between them.
The computational experiments for GN inference and analysis are conducted in the R environment [21] on a personal computer running the Windows operating system.The computational settings of apARD are described at the end of Section A2 in the Appendix.
When hub nodes are given, we set ξ = 0, r = 0 for the modified apARD.The W resulting from the (modified) apARD is further sparsified by resetting wk ′ k = 0 for wk ′ k < 10 − 3 .The hierarchical clustering analysis involved in the CEGN uses a chosen linkage method to generate the gene cluster hierarchies.Following this, the co-expressed gene clusters are derived from the resulting cluster hierarchy using the dynamic tree cut algorithm [20] implemented in the R package WGCNA [22].Unless otherwise stated, the above computational settings remain unchanged.

Simulated data
We perform simulation experiments to demonstrate the effectiveness of NetARD in inferring GNs with the hub genes (hub GNs).For these experiments, we use the R package hglasso [9] to simulate hub GNs with p = 100 genes.Within each simulated graph, each hub node is connected to approximately an average of 32 nodes, while each non-hub node is connected to approximately an average of 2 nodes.The graph's sparsity level (SL) and hub level (HL) are defined as the fraction of missing edges among all possible edges and the connectivity degree centralization [23], respectively.Fig. 1 illustrates a simulated hub GN with 4 hub genes, exhibiting SL, HL ≈ 0.97, 0.33, respectively.In this diagram, the pink nodes represent the hub genes.For a simulated GN, we follow the procedure in Ref. [9] to generate the associated Ω ≻ 0.More specifically, assume A = (a k ′ k ) ∈ R p×p is the adjacency matrix of the simulated graph G, where C.-K. Chen associated with Ŵ are approximately 0.36, 0.46, 0.52, respectively.Let the genes be ranked based on the number of their interaction partners.The top 1-3 genes in each G correspond to the actual hub genes of the hub GN.We evaluate NetARD's performance in inferring 5 simulated hub GNs.These simulated GNs have varying numbers of hub genes, ranging from 2 to 6. Furthermore, we use same simulated gene expression data to compare NetARD's performance with other published graph estimation methods.The methods used for comparison are as below: 1. glasso [24]: It maximizes the regularized l (Ω) with the L 1 -norm penalty to generate sparse estimates of Ω. 2. hglasso [9]: It maximizes the regularized l (Ω) with a convex penalty to produce the sparse estimates of Ω with the hub property.3. DWLasso [8]: This method symmetrizes the estimates of b (k) (k = 1, ⋯, p) of node-wise weighted LASSO regressions with estimated normalized regularization weights.This produces sparse estimates of Ω with the hub property.
For the DWLasso, we re-estimate Ω by Ω, which maximizes l (Ω) subject to the estimate of G corresponding to the estimate of Ω produced by the DWLasso.
In   pink indicate the expected AUROC and AUPR of randomly generated W over the 5 GN inference experiments, respectively.These dashed lines serve as the baseline performance indicators.All compared methods yield AUROCs, AUPRs above the baseline levels.
Table 1 presents the median values of AUROC, AUPR for Ŵ, as well as the median values of SL, HL of estimated G corresponding to Ŵ generated by each method.When ranking the compared methods for sparse GN inference based on the median AUPR, it is observed that the NetARD performs best in this experiment.

GNW in silico data
The GeneNetWeaver (GNW) [27] is a software tool that generates in silico GNs by extracting subnetworks from real GNs and simulates gene expression data using a system of non-linear ordinary differential equations.The dataset gnw2000 in the R package grndata [28] includes an undirected yeast GN with 2000 genes and 2000 simulated steady-state samples of noise-free gene expression data of GN generated by GNW.For our study, we generate 5 test datasets, each of which includes a p = 100-gene connected subnetwork extracted from the yeast GN and N = 60 samples of gene expression data extracted from the first 60 samples of gene expression data in gnw2000 dataset.To simulate observation errors, we add Gaussian noise generated using the standard deviation of 0.01 to the extracted data.Fig. 4 illustrates one of the generated 100-gene yeast GN that contains 119 edges, with pink nodes representing hub genes that connect to more than 5 other genes.
In the experiments for GN inference, we first standardize the expression data for each gene in a gene expression dataset.This is achieved by centering the mean to 0 and scaling the standard deviation to 1.The GN inference methods are then applied to these Fig. 5. AUROCs, AUPRs and boxplots of AUROCs, AUPRs resulting from GN inference methods on 5 in silico 100-gene yeast GNs.
C.-K. Chen standardized gene expression datasets using similar computational settings as in previous experiments.We also implement the NetARD H on data, considering the provided hub genes.Fig. 5 displays the AUROCs, AUPRs and boxplots for each method used to infer 5100-gene yeast GNs.Table 2 presents the median values of AUROC, AUPR of Ŵ and the median values of SL, HL of estimates of G yielded by each method.Overall, the NetARD outperforms the reference methods in terms of generating higher AUPRs.The performance of NetARD H exhibits significant improvement compared to that of NetARD.

E. coli microarray data
The Many Microbe Microarrays Database (M 3D ) [29] provides gene expression datasets measured using Affymetrix microarrays.From this database, we select the E. coli dataset, which includes 466 samples of expression data for 4297 genes.To acquire gene and transcription factor (TF) annotations, as well as experimentally validated regulatory relationships between TFs and genes in E. coli K-12, we utilize the R package regutools [30] to access datasets from RegulonDB [31].
To prepare the data for analysis, genes lacking gene symbol identifiers are excluded from the microarray dataset.The expression data of the remaining 4212 genes are log-transformed and then standardized.To construct the gold standard GN, we utilize the genegene regulation dataset, which includes 4405 regulatory relationships involving 201 TF-encoding gene or multi-gene regulators and 1842 target genes.We split multi-gene regulators into individual gene regulators.Genes not present in the measured genes for analysis, along with regulations involving missing regulators or target genes, are excluded.Self-regulations are omitted, and ultimately, regulatory directions are disregarded.The resulting gold standard GN includes 1887 genes and 4791 gene-gene interactions.Moreover, we utilize the TF-TF regulation dataset from the database, which includes 177 TFs and 449 TF-TF regulatory relationships, to extract interactions between TF genes from the gold standard GN.The resulting TF gene GN includes 154 TF genes and 333 gene-gene interactions.We observe that the gold standard GN exhibits a relatively low coverage of gene-gene interactions between the 4212 genes used for analysis.
To evaluate the performance of GNCE, we apply it to extend the TF gene GN using the processed microarray data.By employing the "Ward.D2" linkage for the hierarchical clustering and applying the dynamic tree cut algorithm [20] with the minimum cluster size set to 7 to the resulting cluster hierarchy, we obtain 148 co-expressed gene clusters and 1 noise gene cluster.We have utilized the largest minimum cluster size in the dynamic tree cut algorithm so that the noise gene cluster does not contain any TF genes.We identify and select 79 co-expressed gene sets, each of which includes TF genes.The sizes of these sets range from 7 to 117 genes.A total of 264  gene-gene interactions within the gold standard GN fall within the selected co-expressed gene sets.Subsequently, the NetARD H is utilized to infer the GN within each of the 79 co-expressed gene sets, where the included TF genes serve as hub genes.As a result, in the 2006 inferred gene-gene interactions, 203 of them match the gold standard gene-gene interactions with Re ≈ 0.77 and Pr ≈ 0.10.The limited coverage of gene-gene interactions in the gold standard GN may contribute to the significantly lower Pr compared to the median Pr ≈ 0.28 induced by the NetARD H . Merging the TF gene GN and the inferred gene-gene interactions results in the extension of TF GN that comprises 1630 genes.Out of 2327 edges of this inferred extension, 524 edges correspond to the edges within the gold standard GN.Fig. 6 (a) illustrates the TF gene GN and (b) the extension of TF gene GN.

Colon cancer RNAseq data
The Cancer Genome Atlas (TCGA) Project is a useful resource for obtaining human cancer gene expression data.In our study, we use the COAD RNA sequencing (RNAseq) dataset, which pertains to colon adenocarcinoma and is available through the R package RTCGA.rnaseq [32].The dataset provides normalized RNAseq expression data of 285 cancer tissue and 41 matched normal tissue samples, encompassing 20531 human genes.To acquire the predefined gene-gene interactions involved in colorectal cancer, we refer to the pathway map hsa05210 stored in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33].This pathway map outlines the molecular interactions during the development of human colorectal cancer.To obtain the GN associated with this pathway map, we use the R package KEGGgraph [34] to parse the hsa05210 file that is in extensible markup language (XML) format.As a result, we generate a KEGG colorectal cancer GN comprising 86 genes and 149 edges.Here, the regulatory directions between genes are not considered.
To prepare the data for analysis, we take a few steps of data preprocessing.First, genes with missing values are excluded.In addition, genes with duplicated gene symbol identifiers or genes that are not annotated by R package org.Hs.eg.db [35] are excluded.Furthermore, genes with expression levels lower than 0.1 in 95% of the samples are excluded from the dataset.The expression data of the remaining 16234 genes are first log-transformed and then standardized.We then use the R package limma [36] to conduct a differential expression analysis using the standardized log-transformed gene expression data.This package utilizes robust statistical methods in a linear modeling framework to test whether the gene expression levels are significantly different in cancer and normal tissues and control for multiple testing.From this analysis, a total of 11677 genes are identified as differentially expressed (DE) across cancer and normal tissue samples, with their false discovery rate (FDR) adjusted p-values below 0.05.On the other hand, 15 KEGG colorectal cancer GN genes are identified as non-DE genes.
We apply the GNCE to the standardized log-transformed expression data for 11692 DE and KEGG colorectal cancer GN genes obtained from 285 cancer tissue samples.By employing the "complete" linkage for the hierarchical clustering and applying the dynamic tree cut algorithm with the minimum cluster size set to 10, we obtain 559 co-expression gene clusters and 1 noise gene cluster.Except for 1 gene (SAMAD2) included in the noise gene cluster, the remaining 85 KEGG colorectal cancer GN genes are distributed across 74 co-expressed gene clusters.The NetARD H with r = 0.2 is utilized to infer the GN within each of these 74 co-expressed gene sets, where the included KEGG colorectal cancer GN genes serve as hub genes.As a result, we identify 1144 interactions between hub genes and their co-expressed genes.Merging the KEGG colorectal cancer gene GN and the inferred gene-gene interactions results in the extended GN that contains 1269 genes and 4386 edges.C.-K. Chen

Colorectal cancer-related Gene Ontology terms
We analyze the inferred GN presented in Fig. 7 (b) to discover Gene Ontology (GO) terms related to biological processes (BPs) that play an important role in the context of colorectal cancer progression.We first employ GO enrichment analysis (GOEA) to identify BP GO terms significantly enriched in each co-expressed gene set within the inferred GN.In this analysis, we use the gene set of inferred GN as the background gene set.For each GO term, the associated gene set comprises genes from the background gene set that are annotated by that particular GO term.The strength of association between a GO term and a co-expressed gene set within the inferred GN is assessed by evaluating the strength of association between the gene set associated with the GO term and the co-expressed gene set.We employ the R package topGO [37] to compute the p-value of the Fisher exact test, which quantifies the likelihood of observing such an association through random sampling.BP GO terms with their FDR adjusted p-values lower than a specified level are considered significantly enriched in the co-expressed gene set.
Subsequently, we proceed to identify the GO terms that are significantly associated with the inferred GN from the previously identified enriched BP GO terms.Here, a GO term is considered associated with the inferred GN if the genes of its associated gene set distribute closely in the inferred GN.We utilize the Knet function [38], akin to Ripley's K-function, to examine the distribution of gene set within the inferred GN.A higher value of the area under the Knet function (AUK) indicates a greater degree of clustering of gene set within the inferred GN.A GO term is associated with the inferred GN if the AUK of its associated gene set significantly surpasses that of C.-K. Chen any randomly selected gene set of the same size.We employ the R package SANTA [38], where the shortest-path distance is utilized to quantify the distance between two genes within the GN, to compute the p-value of permutation test, by comparing the AUK of the GO term gene to the AUKs of randomly generated gene sets of the same size as the GO term gene set, for each GO term of interest.BP GO terms with their FDR adjusted p-values lower than a specified level are considered significantly associated with the inferred GN.
In the GO term analysis experiment, we set the significance level of the FDR-adjusted p-value for Fisher exact tests in GOEA at 0.05.We identify 891 BP GO terms that exhibit significant enrichment within 29 co-expressed gene sets.We choose from these BP GO terms whose associated gene sets contain 5 or more genes for subsequent analysis.Employing a significance level of FDR-adjusted p-value at 0.1 for permutation tests, we identify 40 GO terms, which exhibit significant enrichment within 11 co-expressed gene sets, significantly associated with the inferred GN.We present the results of our GO term analysis pertaining to a specific co-expressed gene set as follows.
Fig. 8 illustrates the largest subnetwork consisting of 103 co-expressed genes within the inferred GN shown in Fig. 7 (b).In this diagram, the pink nodes represent hub genes, including TGFB3 (Transforming Growth Factor Beta 3) and TCF7L1 (Transcription Factor 7-like 1), both included in the KEGG colorectal cancer GN.Among the 92 BP GO terms significantly enriched in genes in Fig. 8, there are 38 GO terms associated with gene sets containing 5 or more genes.Fig. 9 depicts the GO plot for the final 15 BP GO terms that are significantly associated with the inferred GN.These identified GO terms are related to the organization of extracellular matrix components (GO:0030198, GO:0030199), the development and differentiation of connective tissues (GO:0051216, GO:0001503, GO:0002062, GO:0048701, GO:0045669) and skins (GO:0043588), the development and differentiation of organs and tissues within a multicellular organism (GO:0007275, GO:0030324, GO:0035987, GO:0060021), as well as cell signaling (GO:0007229), regulation of Fig. 9. GO plot of identified BP GO terms associated with colorectal cancer.In this plot, red and blue dots represent the upregulated (logFC > 0) and downregulated (logFC < 0) GO term genes on the inferred GN in Fig. 7  cell differentiation (GO:0045597), and the reorganization of the actin cytoskeleton (GO:0031532).These GO terms provide informative annotations for the co-expressed genes associated with cancer-related BPs.Fig. 10 illustrates the workflow of the proposed method for identifying cancer-related GO terms using gene expression data and predefined core GN of cancer.Fig. 10.Workflow of the proposed method for identifying cancer-related GO terms using gene expression data and predefined core GN specific to cancer.

Conclusion
According to the AUPR, which provides a more insightful measure of sparse GN inference performance, the NetARD outperforms the reference GN inference methods when applied to simulated hub GNs and in silico yeast GNs.When provided with hub genes, the NetARD significantly improves the inference accuracy on in silico yeast GNs.The GNCE, which integrates co-expression analysis and the NetARD, retrieves the majority of interactions between TF genes and their co-expressed genes of E. coli.The GNCE enables the prediction of GO terms associated the colorectal cancer using combined information from curated core cancer GN and gene expression data.

Discussion
The proposed NetARD utilizes the apARD to infer the GN and then re-estimates the non-zero partial correlations between genes using the GGM.It is important to note that gene expression data can frequently display non-normally distributed characteristics, and the gene expression profiles can exhibit complex nonlinear correlations.Partial correlation or GGM-based graph estimation methods primarily target the capture of direct linear dependencies between (log-transformed) gene expression levels.Consequently, these methods may encounter difficulties in identifying GNs, particularly as the size of GN grows larger.
It is known that multiple genes frequently collaborate and exhibit co-expression during cellular processes.The GNCE identifies interactions between predefined core GN genes and their co-expressed genes using the NetARD, achieved through the clustering of genes into small-scale co-expressed gene sets.In our experiments, by controlling the minimum cluster size in the dynamic tree cut algorithm, the resulting co-expressed gene sets typically contain below 120 genes.Smaller co-expressed gene set sizes allow the NetARD to efficiently infer GNs with shorter computational time.However, this GNCE does not capture the interactions between nonco-expressed genes.To identify GO terms associated with the progression of colorectal cancer, we employ the GN inferred by GNCE, incorporating KEGG colorectal cancer core GN and gene expression data.Our approach is based on the fundamental assumption that GNs consisting of genes displaying co-expression patterns and interacting with those involved in the curated cancer pathway primarily contribute to the coordination of cellular processes.
Finally, additive models extend linear regressions by incorporating nonlinear component functions, such as spline functions, to accommodate the nonlinear interactions between predictor variables and the response variable.One of our future research interests involves the inference of GNs with biologically meaningful topological properties using additive models of gene expression.The GNs inferred in this context are expected to yield novel insights into the interactions among genes that coordinate cellular processes in both healthy and diseased states.f(α (1) , ⋯, α (p) , β) = 0, we obtain the maximizer of f(α (1) , ⋯, α (p) , β) in Eq. ( 5) with α (k ′ ) (k ′ ∕ = k), β fixed as where diag( Sb (k) ) is the diagonal vector of Sb (k) and the vector division is the element-wise division.By solving ∂ ∂β f(α (1) ,⋯,α (p) ,β) = 0), we obtain the maximizer of f(α (1) , ⋯, α (p) , β) with α (k) (1, ⋯, p) fixed as C.-K. Chen end for

W←Φ(B)
if W ceases to change then break end repeat return W = W, Θ.

Fig. 1 .
Fig. 1.Graph of simulated 100-gene hub GN.The pink nodes represent hub genes that interact with more than 10 genes in the GN.(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 3 .
Fig. 3. AUROCs, AUPRs and boxplots of AUROCs, AUPRs resulting from inference methods on 5 simulated 100-gene hub GNs.The color dots indicate the AUROCs, AUPRs yielded by GN inference methods on each of 5 simulated GNs.(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 4 .
Fig. 4. Graph of in silico 100-gene yeast GN.The pink nodes represent hub genes that interact with more than 5 genes in the GN.(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 6 .
Fig. 6.(a) GN of TF genes of E. coli (b) Co-expression extension of GN of TF genes of E. coli.

Fig. 7 (
a) illustrates the KEGG colorectal cancer gene GN and (b) the extension of KEGG colorectal cancer gene GN.

Fig. 8 .
Fig. 8. Graph of 103 co-expressed gene subnetwork of the inferred GN in Fig. 8 (b).The pink nodes represent hub genes TGFB3, TCF7L1 in Fig. 7 (a).(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.) Fig. 9. GO plot of identified BP GO terms associated with colorectal cancer.In this plot, red and blue dots represent the upregulated (logFC > 0) and downregulated (logFC < 0) GO term genes on the inferred GN in Fig. 7 (b), respectively.Here, the logFC represents the difference of averages of log2-transformed gene expression data over cancer and normal tissue samples.The z-score of gene set is calculated as u− d ̅ ̅ c √ , where u is the number of upregulated genes, d is the number of downregulated genes, and c is the total number of genes of gene set.(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Table 2
Median values of AUROC, AUPR, SL, HL resulting from GN inference methods on 5 in silico 100-gene yeast GNs.