Inference of Gene Regulatory Networks Using Bayesian Nonparametric Regression and Topology Information

Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture the nonlinear relationships flexibly and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We present the application of our method on DREAM3 and DREAM4 datasets and two real biological datasets. The results show that our method performs better than existing methods and the topology information prior can improve the result.


Introduction
Gene regulatory network plays an important role in diverse cellular functions. A reliable method to identify the structure and dynamics of such regulation is important for understanding complex biological processes and is helpful for treatment of diseases. With the development of high throughout technologies in recent years, gene expression data has provided a useful way to investigate the cellular system.
Generally, there are two types of gene expression data used to predict the structure of GRNs, which are steady-state data and time-series data. The steady-state data measures the steady-state levels in different samples, while time-series data measures the expression levels at several successive time points. Since the time-series data contains the dynamic information of the network while the steady-state data does not [1], we focus on the time-series data in this paper.
Over the last several years, a number of network inference methods have been developed to tackle this problem, including Bayesian network [2,3], dynamic Bayesian network [4,5], Boolean network [6,7], ordinary differential equation [8,9], and mutual information [10,11]. A comprehensive review can be found in [12,13]. Among these methods, dynamic Bayesian network has become the major focus for inferring gene regulatory network because it can infer causal interactions, model cyclic interactions, and has less computational complexity than ordinary differential equation.
Inferring a GRN from time-series data is known to be challenging partly due to the high number of genes relative to the number of data points. More importantly, the interactions between genes are typically nonlinear; thus linear model may be inefficient to recognize the nonlinear interactions. A flexible way to solve this problem is to use B-spline functions to describe the nonlinear interactions, and the Bspline functions have been used to infer GRNs in previous studies [14,15]. A key problem in spline regression is the knot selection which greatly influences the curve fitting. Reference [14] suggested using penalized-splines to avoid overfitting and reduce the number of parameters to be estimated. Among many penalized methods, lasso [16] is the most popular method due to its ability to select and estimate simultaneously and can produce exact 0 estimates. Group lasso [17] was also developed to select grouped variables. Reference [18] proposed group lasso or Bayesian group lasso when spline regression was used because the predictors belong to a same gene forming a natural group. Reference [19] also developed a Bayesian adaptive group lasso to perform simultaneous model selection and estimation for B-spline regression. However, Bayesian spline regression methods still predict a lot of false positive interactions because of the indirect effects existing in the GRNs.
Recently, [20] proposed a new method which uses network topology information to improve gene regulatory network inference; they used a prior that both prokaryotic and eukaryotic transcription networks exhibit an approximately scale-free out-degree distribution while the in-degree distribution is a more restricted exponential function; this structure property is described in [21]. Reference [20] also pointed out that 79% or more genes regulators are less than 3. This property means that most genes in a GRN are regulated by a few regulators and may be possible to be combined with the B-spline regression to improve the results of the GRN inference.
In this paper, we work with a dynamic Bayesian network and use spline regression to detect the nonlinear interactions between genes. A Bayesian group lasso is also used to avoid overfitting and reduce the number of parameters to be estimated. Comparing with group lasso, Bayesian group lasso is a better choice because there are 2 major advantages of Bayesian selection methods: (1) The tuning parameter can be set flexibly.
(2) The topology information can be incorporated easily. Further, instead of taking a traditional Bayesian group lasso, we use a Bayesian group lasso model with spike and slab priors since this problem only requires the sparsity on the group level and spike and slab priors can exclude or include the entire group of B-spline basis functions. Finally, we incorporate the topology information as a prior in the Bayesian approach which controls the size of the selected model. This method is assessed by applying to DREAM3 and DREAM4 datasets and two real biological datasets.

The Nonlinear Regression
Model for GRN Inference. Consider an × matrix , where is the number of the gene expression levels measured times and is the number of genes. A DBN model represents probabilistic relationships between genes via a directed acyclic graph . In this graph, genes are represented by a set of nodes = { 1 , . . . , } and the interactions between genes are represented by a set of directed edges ⊆ {( , ) : , ∈ }. A directed edge from node to node means gene is a regulator of gene . The probability distribution of genes given its parents can be expressed as where , is the gene expression where , is the expression level of gene and − , −1 is the vector without −1 : We assume that the GRN is a time-invariant network; thus (⋅) = (⋅) and the error term ∼ (0, 2 ). Although (⋅) can be characterized by any nonlinear functional representation, [15] suggested using B-spline basis functions instead of using Fourier basis, wavelets, or other nonlinear basis functions because of the pattern of the relationship between genes is unknown. Therefore, we also use B-spline basis functions in this article and the regulatory relationships can be written as where is the intercept and ( , ) = ∑ =1 ( , ). { ( , )} are B-spline basis functions of degree and is the parameters to estimate from data. Let { } be the set of equally spaced knots with min{ } = 1 < 2 < ⋅ ⋅ ⋅ < = max{ }, and = + . We get rid of the subscript for the variables for simplicity of notation. Then the regression equation can be written as where is the bases matrix of size × and is the corresponding coefficients vector.

Incorporating the Topology Information and Bayesian
Inference. We use the Bayesian group lasso method proposed in [22]; the hierarchical Bayesian model is where = 1 for = 1 and = otherwise. Here we use a spike and slab prior on and get the ranking of the potential regulatory links from . Although we can place a positive and very small as a prior when the in-degree of the target gene is small, there are still a lot of false positive interactions to be predicted. Inspired by the idea of maxP technique proposed Computational and Mathematical Methods in Medicine 3 by [20], we use a prior proposed in [23], to place a restriction on , that only allow the model to be of small size.
Here the integer-valued hyperparameter restricts the maximum number of parents for the target gene in each iteration. However, there are still some genes regulated by a large number of genes. Therefore, a fixed will affect the accuracy of the prediction. Thus a uniform prior on [1, ] is placed on , where is a predetermined integer. Then the model becomes The likelihood is According to the prior and the likelihood above, the joint posterior distribution on data is The Gibbs sampling scheme is as follows: We use − = ( 1 , . . . , −1 , +1 , . . . , ) to denote the coefficient vector without the th group and − = ( 1 , . . . , −1 , +1 , . . . , ) to denote the covariate matrix corresponding to − . The full conditions of ( = 1, ) and ( = 0, ) are where = Σ ( − − − ) and Σ = ( + (1/ 2 ) ) −1 .
Integrating out , we have From these equations, we can draw through Then the full conditional posterior distribution of is Thus, the full conditional distribution of is a normal distribution: The full conditions of ( 2 , = 1) and ( 2 , = 0) are 2 ) , Then the full conditional distribution of 2 is .
The full conditional distribution of 2 is Then the conditional posterior distribution of 2 is where And it can be verified that the conditional posterior distributions of other parameters are And a Monte Carlo EM algorithm is used to estimate : where equal to 1 + × ( − 1) is the number of the total regressors and ( −1) [ 2 | ] can be replaced by the sample average of 2 generated in the −1th step of the Gibbs sampler. We choose the second half of the samples and the result is the average of the samples.

Results
To demonstrate the effectiveness of the topology information and the B-spline functions, our method is used to infer GRNs from in silico time-series data and real biological data; a linear model with topology information and a nonlinear model without topology information are also applied as competing methods. Here we use the time-series data in DREAM3 and DREAM4 challenges as the in silico data, and we use a cell cycle regulatory subnetwork in Saccharomyces cerevisiae and Human Hela cell network as the real biological datasets. We generate 10000 samples from the posterior distribution and choose the second half of the samples to derive the results. The posterior estimates of all the parameters are obtained through the posterior averages of the chains. For the B-spline functions, we adopt the setting as [14] and use a cubic Bspline with 10 interior knots. Here we choose = 5 in our experiments.

Application to In Silico Networks.
We first evaluate our method on DREAM4 challenges networks of sizes 10 and 100 [24][25][26]. The size 10 network data consists of 5 simulated networks, each of which consists of 21 time points and 5 replicates. The size 100 network data also consists of 5 simulated networks, each of which consists of 21 time points and 10 replicates. We also evaluate our method on DREAM3 challenges networks of sizes 10, which is also used in [27]. This data consists of 5 simulated networks, each of which consists of 21 time points. There are also steady-state data provided by the DREAM4 challenge. However, we only focus on timeseries data in this article. Although the winning entry in DREAM4 competition used only the knock-out data [28] and combining the time-series and steady-state data can achieve much better results [27,29], it is infeasible to do knock-out experiments for all genes in practice and generally the knockout experiments only are done for a small part of genes [30].
Each of the five networks is inferred using all available time-series data, and the area under the receiver operating characteristic (AUROC) curve and the area under precisionrecall (AUPR) curve are computed according to the gold standard network topology provided by DREAM3 and DREAM4 challenge. The prediction performances on the DREAM4 10-gene networks and 100-gene networks are summarized in Tables 1 and 2. Table 1 shows that the Bayesian lasso and Bayesian group lasso perform similarly on size 10 data while the BGL prior has a better performance than the methods above in both average AUROC and AUPR. For net 2 and net 5, the BGL prior outperforms other methods significantly. We also compared our method with another 2 dynamic Bayesian network methods [31]; the result of our method is also comparable to these methods. Table 2 shows that the nonlinear model performs poorly on this dataset;   while the topology information can remarkably improve the prediction performance of the nonlinear model, the Bayesian group lasso with topology information outperforms the Bayesian group lasso methods in both AUROC and AUPR, and these methods also have higher AUROC than linear model, although the AUPR is a little worse. Compared with the results of the other 2 DBN methods, the result of the Bayesian lasso is similar to them and our method still has the highest average AUROC. The prediction performances on the DREAM3 10 gene networks are summarized in Table 3.
We also compared our method with another additive model based on ODE [27] and Inferelator 1.0 [32]. For Ecoli 1, Ecoli 2, Yeast 2, and Yeast 3, the 3 additive models perform better than the linear model; for Yeast 1, although BL performs much better than BGL, the BGL prior still gets slightly better results. The average results show that BGL prior outperforms the other methods in both AUROC and AUPR.

3.2.
Application to IRMA Network. The IRMA network data is a subnetwork embedded in Saccharomyces cerevisiae consisting of 5 genes: CBF1, GAL4, SWI5, GAL80, and ASH1. Both of the two time-series gene expressions include switch-on data and switch-off data. The switch-on data is taken from 5 experiments and the switch-off data is taken  from 4 experiments with a total of 142 samples measured by [33] and also used in [20,34]. The IRMA network is well studied and is a gold standard network. This network also has a fixed topology and the genes in the network are not regulated by other yeast genes. Here we use the precision rate (PR = TP/(TP + FP)), recall rate (RR = TP/(TP + FN)), and -measure = (2 ⋅ PR ⋅ RR)/(PR + RR) to evaluate the performance and select a best threshold as [35]. The signs of the interactions and self-regulations are not considered; thus the total number of the potential interactions is 20. Table 4 shows the inference performance for the IRMA network. The nonlinear model still performs better than linear model. The method with the prior has a higher TP than the Bayesian group lasso, which implies that the topology information improves the performance. Comparing with another B-spline based method [14] and the method used and compared in [36], although our method cannot achieve the best performance, it is still comparable to the TSNIF and performs much better than another B-spline based method.

Application to Hela Network.
We then apply our method on the cell cycle genes in human cancer cell lines (HeLa) which were analyzed by Whitfield [37]. A subnet consisting of 9 Hela cell genes was extracted by Sambo et al. [38] and the topology of this gene regulatory network is determined in the BioGRID database. They also developed a method called CNET to analysis the Hela network. This network is also analyzed by Lozano et al. [39] and Shojaie and Michailidis [40]; they proposed 2 1 penalized method, grpLasso, and TAlasso to infer causal interactions.
Here we use the third experiment of Whitfield [37] as the previous studies, consisting of 47 samples. The results of CNET, grpLasso, and TAlasso are taken from [40]. Table 5 shows the inference performance for the Hela network.
Comparing with the BGL, the BGL prior has a higher precision. Comparing with other methods, the penalized method seems to perform better than another B-spline based method and has a similar performance to the other 2 1 penalized methods and all the true positives of Morrissey's method are also found by BGL and BGL prior. On the other hand, the interactions from RFC4 to CDC2 and CDC2 to CCNE1 are found not only by BGL and BGL prior, but also by 2 of other 3 comparable methods. It may be because these interactions exist in real regulatory network but are not included in the BioGRID dataset.

Conclusion
In this study, we propose a fully Bayesian method, based on B-spline, group lasso, and topology information to infer gene regulatory network from time-series data. We use B-spline functions to capture the nonlinear interactions between genes, 1 norm penalty to prevent overfitting, and topology information, the knowledge of the exponential decrease in indegree that most genes have only a small number of regulators as a prior. A spike and slab prior is used to facilitate variable selection by putting a multivariate point mass at 0 ×1 for an -dimensional coefficients group. The performance of the proposed method is demonstrated by applications to the DREAM4 in silico data of sizes 10 and 100 network challenges and the real biological data of IRMA and Hela cell network. The results show that the topology information indeed contributes to the gene regulatory network inference which can improve the AUROC remarkably of the DREAM4 in silico data and improve the results of the IRMA network and Hela cell data. B-spline regression model also performs better than linear model in real biological data. Therefore, our method is an effective way of inferencing gene regulatory network from the time-series data.