A new multiple regression approach for the construction of genetic regulatory networks

https://doi.org/10.1016/j.artmed.2009.11.001

Abstract

Objective

Reconstruction of a genetic regulatory network from given time-series gene expression data is an important research topic in systems biology. One of the main difficulties in building a genetic regulatory network is that a practical data set has a huge number of genes but only a small number of sampling time points. In this paper, we propose a new linear regression model that may overcome this difficulty in uncovering the regulatory relationships in a genetic network.

Methods

The proposed multiple regression model makes use of the scale-free property of a real biological network. In particular, a filter is constructed by using this scale-free property and some appropriate statistical tests to remove redundant interactions among the genes. A model is then constructed by minimizing the gap between the observed and the predicted data.

Results

Numerical examples based on yeast gene expression data are given to demonstrate that the proposed model fits the practical data very well. Some interesting properties of the genes and the underlying network are also observed.

Conclusions

In conclusion, we propose a new multiple regression model based on the scale-free property of real biological networks for genetic regulatory network inference. Numerical results using a yeast cell cycle gene expression dataset show the effectiveness of our method. We expect that the proposed method can be widely used for genetic network inference using high-throughput gene expression data from various species for systems biology discovery.

Introduction

The development of microarray technologies has dramatically accelerated the exploration of living organisms at the genomic level. Nowadays, huge amounts of quantitative gene expression data can be routinely generated, and the regulatory relationships among genes can be inferred with suitable modeling approaches. The difficulty in network modeling is that, in a time-series gene expression dataset, the number of genes is often huge whereas the number of sampling time points is small. Thus effective mathematical models and efficient computational algorithms are necessary to infer the gene regulatory relationships from such data. In fact, many mathematical models and numerical algorithms have been proposed for inferring genetic regulatory networks [1], [2]. For discrete gene expression data (expressed: 1, unexpressed: 0), the Boolean network model, the probabilistic Boolean network model, the multivariate Markov model, and Bayesian networks have been proposed and employed [3], [4], [5], [6], [7], [8], [9]. For continuous expression data, clustering algorithms, Bayesian networks and methods based on ordinary differential equations (ODEs) have been successfully used for network inference [1].

We are particularly interested in continuous gene expression data. Although clustering is not a proper network inference method, it is still widely used when the data volume is huge. The rationale behind clustering is that genes in the same cluster are more likely to be functionally related [10]. Using such a method, a high-dimensional gene expression dataset can be reduced to a small number of clusters. However, such a method cannot be used to infer the regulatory relationships in a captured genetic network. A widely used probabilistic model is the Bayesian network [11], [12], in which the regulatory relationships among genes are represented as a directed acyclic graph, with the parent gene being the regulator of the child gene. The assumption behind the Bayesian network is the Markov property, which states that each gene is independent of its non-descendants given its parents. This excludes the case in which a gene regulates its parents, which is the major limitation of the Bayesian network approach. The dynamic Bayesian network has been developed to overcome this limitation [12]. However, inference of a genetic network using probabilistic models such as the Bayesian network remains a difficult task.

Another class of methods for inferring a genetic regulatory network from given continuous data is the ODE-based algorithm [13], [14], [15], [16], which describes the genetic regulations in the form of directed graphs. Moreover, it can be applied to both steady-state and time-series expression data for predicting the behavior of the genetic network under different conditions. The usual ODE model takes the following form:

$$\dot{x}_i(t) = \sum_{j=1}^{n} a_{ij} x_j(t) + b_i u(t), \tag{1}$$

where $i = 1, 2, \ldots, n$ and $t = 1, 2, \ldots, m$. Here $n$ is the number of genes, $m$ is the number of time points, $x_i(t)$ is the concentration of transcript $i$ at time point $t$ and $\dot{x}_i(t)$ is the rate of change of the concentration of gene $i$ at time $t$. The parameter $a_{ij}$ represents the influence of gene $j$ on gene $i$, $b_i$ represents the effect of the external perturbations on $x_i$ and $u(t)$ represents the external perturbation at time $t$. In Gardner et al. [15], a multiple regression method was proposed to compute $a_{ij}$ from steady-state gene expression data ($\dot{x}(t) = 0$). This method allows the number of input genes to be determined by the users. Similar methods, such as the time-series network identification algorithm and the mode-of-action by network identification algorithm, have been proposed in [13], [17], respectively. All these methods are restricted to the use of perturbations. When inferring a model, $u(t)$ in Eq. (1) is assumed to be known and is determined when the data sets are generated. In van Someren et al. [18], a least absolute regression network analysis method was proposed. The authors assumed the following model:

$$X_{t+1} = A X_t + \varepsilon, \quad t = 1, 2, \ldots, m,$$

where $\varepsilon$ models the noise and each entry $a_{ij}$ of the matrix $A$ models the influence of the expression of gene $j$ (at time $t$) on gene $i$ (at time $t + 1$). To get a good estimate of $A$, the authors did not solve a plain least squares (LS) problem, which only minimizes the error between the observed and the predicted data; they added a penalty cost to the objective. The role of the penalty cost is to balance the data-fitting term and to control the connectivity among the genes.
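
To make the idea of a penalized least-squares fit concrete, here is a minimal Python sketch. It is not the method of van Someren et al. [18] nor the authors' implementation: it assumes a squared Frobenius-norm (ridge) penalty, which is only one possible choice of penalty cost, and the names `estimate_transition_matrix` and `lam` are hypothetical. The data matrix `X` is assumed to hold one gene per row and one time point per column.

```python
import numpy as np


def estimate_transition_matrix(X, lam=1e-2):
    """Ridge-penalized LS estimate of A in X[:, t+1] ~ A @ X[:, t].

    Minimizes ||Y - A Z||_F^2 + lam * ||A||_F^2, where Z holds the
    expression vectors at times 1..m-1 and Y those at times 2..m.
    The ridge penalty is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    Z = X[:, :-1]          # predictors: expression at time t
    Y = X[:, 1:]           # responses: expression at time t + 1
    n_genes = X.shape[0]
    # Closed form: A = Y Z^T (Z Z^T + lam I)^{-1}.
    # The matrix below is symmetric, so we solve for A^T instead of inverting.
    gram = Z @ Z.T + lam * np.eye(n_genes)
    return np.linalg.solve(gram, Z @ Y.T).T


if __name__ == "__main__":
    # Toy example: far more genes than time points, as in real data sets.
    rng = np.random.default_rng(0)
    X_demo = rng.standard_normal((50, 17))
    A_hat = estimate_transition_matrix(X_demo, lam=0.1)
    print(A_hat.shape)
```

With more genes than time points, `Z @ Z.T` is singular, so a plain LS fit has no unique solution; a positive `lam` makes the system solvable, which is one way to see why a penalty term is needed in this setting.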

The difficulty in applying ODE-based models is the estimation of the interaction coefficients. It is well known that many real biological networks have the scale-free property (i.e., the degree distribution approximately follows a power law) [19]. More precisely, it is observed that in a genetic regulatory network the out-degree follows a power-law distribution and the in-degree follows a Poisson distribution [20]. However, none of the above approaches considers such distributions when inferring a genetic regulatory network. In this paper, we take the scale-free properties into account and propose a linear multiple regression model similar to the one in van Someren et al. [18] for modeling the regulatory relationships among genes. The scale-free properties are employed in the design of a filter, which is then applied to remove the relatively small nonzero entries in the matrix A. This filtering process is important in constructing the gene–gene connection matrix A so that the matrix has the following properties: the number of nonzero entries in each row follows a Poisson distribution and the number of nonzero entries in each column follows a power-law distribution. Two statistical tests, the t-test and the χ²-test, are applied to test the power-law distribution and the Poisson distribution, respectively. The LS method with regularization is applied to estimate the filter and then the matrix A based on the obtained filter.
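
The row and column properties described above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the authors' filter: it uses a single magnitude threshold (the paper estimates its filter by regularized LS), a log-log line fit as a rough power-law check, and a hand-computed Pearson χ² statistic against a Poisson distribution with the sample mean. All function names and the `threshold` parameter are hypothetical.

```python
import math

import numpy as np


def filter_small_entries(A, threshold):
    """Set entries of A with magnitude below `threshold` to zero (assumed filter)."""
    A = np.asarray(A, dtype=float)
    return np.where(np.abs(A) >= threshold, A, 0.0)


def degree_counts(A):
    """Return (in_degree, out_degree) of the network encoded by A.

    A[i, j] is the influence of gene j on gene i, so nonzeros in row i
    count the regulators of gene i (in-degree) and nonzeros in column j
    count the targets of gene j (out-degree).
    """
    nonzero = np.asarray(A) != 0
    return nonzero.sum(axis=1), nonzero.sum(axis=0)


def power_law_slope(degrees):
    """Slope of log(frequency) versus log(degree), using degrees >= 1."""
    degrees = np.asarray(degrees)
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    if values.size < 2:
        raise ValueError("need at least two distinct positive degrees")
    slope, _intercept = np.polyfit(np.log(values), np.log(counts), 1)
    return float(slope)


def poisson_chi2_statistic(degrees):
    """Pearson chi-square statistic of degree counts against Poisson(mean).

    The upper tail is lumped into the last bin; bins with small expected
    counts are not pooled, which a careful test would do.
    """
    degrees = np.asarray(degrees, dtype=int)
    mean = degrees.mean()
    if mean == 0:
        return 0.0
    k = np.arange(degrees.max() + 1)
    log_fact = np.array([math.lgamma(i + 1) for i in k])
    pmf = np.exp(-mean + k * np.log(mean) - log_fact)
    pmf[-1] = 1.0 - pmf[:-1].sum()
    expected = degrees.size * pmf
    observed = np.bincount(degrees, minlength=k.size)
    return float(np.sum((observed - expected) ** 2 / expected))


if __name__ == "__main__":
    A_demo = np.array([[0.5, 0.01, 0.0],
                       [0.3, -0.4, 0.02],
                       [0.9, 0.0, 0.0]])
    in_deg, out_deg = degree_counts(filter_small_entries(A_demo, 0.1))
    print(in_deg, out_deg, poisson_chi2_statistic(in_deg))
```

In this sketch a slope from `power_law_slope` would be examined for the column (out-degree) counts and the statistic from `poisson_chi2_statistic` for the row (in-degree) counts, mirroring the roles the text assigns to the t-test and the χ²-test.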

The rest of the paper is structured as follows. In Section 2, we present the methodology for model building and inference. In Section 3, numerical examples based on a yeast gene expression data set are given to illustrate the proposed method. Finally, concluding remarks are given in Section 4 to address further research issues.

The linear model

In this subsection, we present our linear model. We assume that interactions among the genes are described by the following linear model:

$$X_{t+1} = A X_t + \varepsilon_t, \quad \text{for } 1 \le t \le m.$$

Here $X_t$ is an n × 1 vector describing the expression levels of n different genes at time t, and A is an n × n matrix. The entry $A_{ij}$ in the ith row and jth column of the matrix A describes the influence of gene j on gene i in the network. The random variable $\varepsilon_t$ models the noise at time t. Given the gene expression levels of

Numerical examples

In this section, we will demonstrate the procedures of our proposed algorithm using the yeast cell cycle gene expression dataset, which contains 384 genes measured at 17 time points. All the genes are identified based on their peak times at five phases of the cell cycle and are annotated. The expression levels of each gene were standardized so as to enhance the performance of model-based methods. The whole data set [23] can be downloaded at ‘http://faculty.washington.edu/kayee/model/’ and the

Conclusions

In this paper, we proposed a new multiple regression model for genetic network inference using time-series gene expression data. In this model, a filtering process based on the scale-free properties of a real gene regulatory network was first adopted to eliminate the redundant connections among genes in order to obtain a good estimate of the network structure. A minimization process was then used to estimate the influence coefficients of gene relationships. Since the number of sampling time

Conflict of interest statement

No conflict of interests.

Acknowledgments

The authors would like to thank the anonymous referees and the editor for their helpful suggestions and corrections. Research supported in part by HKRGC Grant No. 7017/07P, HKUCRGC Grants, HKU Strategy Research Theme fund on Computational Sciences, Hung Hing Ying Physical Research Sciences Research Grant, National Natural Science Foundation of China Grant No. 10971075 and National Natural Science Foundation of Guangdong Grant No. 915102240-1000002. SQ is supported by National Natural Science

References (24)

  • R.J. Cho et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell (1998)
  • M. Bansal et al. How to infer gene networks from expression profiles. Mol Syst Biol (2007)
  • H. de Jong. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol (2002)
  • T. Akutsu et al. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics (2000)
  • E. Boros et al. Error-free and best-fit extensions of partially defined Boolean functions. Inform Comput (1998)
  • W. Ching et al. On construction of stochastic genetic networks based on gene expression sequences. Int J Neural Syst (2005)
  • O. Hirose et al. Estimating gene networks from expression data and binding location data via Boolean networks. Lect Note Comput Sci (2006)
  • T.E. Ideker et al. Discovery of regulatory interactions through perturbation: inference and experimental design
  • K. Noda et al. Finding genetic network from experiments by weighted network model. Genome Inform (1998)
  • I. Shmulevich et al. Inference of genetic regulatory networks under the best-fit extension paradigm
  • M. Eisen et al. Clustering analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA (1998)
  • Friedman N, Elidan G. Bayesian network software libB 2.1. http://www.cs.huji.ac.il/labs/compbio/LibB/ (Accessed 20 June...

A preliminary version was presented at the International Conference on BioMedical Engineering and Informatics, 2008.
