Anomaly detection in gene expression via stochastic models of gene regulatory networks

Background The steady-state behaviour of gene regulatory networks (GRNs) can provide crucial evidence for detecting disease-causing genes. However, monitoring the dynamics of GRNs is particularly difficult because biological data only reflects a snapshot of the dynamical behaviour of the living organism. Also most GRN data and methods are used to provide limited structural inferences. Results In this study, the theory of stochastic GRNs, derived from G-Networks, is applied to GRNs in order to monitor their steady-state behaviours. This approach is applied to a simulation dataset which is generated by using the stochastic gene expression model, and observe that the G-Network properly detects the abnormally expressed genes in the simulation study. In the analysis of real data concerning the cell cycle microarray of budding yeast, our approach finds that the steady-state probability of CLB2 is lower than that of other agents, while most of the genes have similar steady-state probabilities. These results lead to the conclusion that the key regulatory genes of the cell cycle can be expressed in the absence of CLB type cyclines, which was also the conclusion of the original microarray experiment study. Conclusion G-networks provide an efficient way to monitor steady-state of GRNs. Our method produces more reliable results then the conventional t-test in detecting differentially expressed genes. Also G-networks are successfully applied to the yeast GRNs. This study will be the base of further GRN dynamics studies cooperated with conventional GRN inference algorithms.


Background
Identifying the key features and dynamics of gene regulatory networks (GRNs) is an important step towards understanding behaviours of biological systems. Thanks to the development of high-throughput technology, the amount of microarray gene expression data has greatly increased, and numerous mathematical models attempt to explain gene regulations using gene networks [1,2]. Once a network structure is inferred, its dynamics needs to be considered. However, most Open Access methods focus on the inference of network structure which only provides a snapshot of a given dataset. Probabilistic Boolean Networks (PBNs) represent the dynamics of GRNs [3], but PBNs are limited by the computational complexity of the related algorithms [4].
In [5], a new approach to the steady-state analysis of GRNs based on G-Network theory [6,7] is proposed, while G-Networks were firstly applied to GRNs with simplifying assumptions concerning gene expression in [8]. However, the G-Network approach also exhibits specific difficulties because of the large number of parameters that are needed to compute their steadystate solution. Thus, in this study we reduce the number of model parameters on the basis of biological assumptions and focus on estimating two parameters in particular: the total input rate and steady-state probability of a gene.
A G-Network is a probabilistic queuing network having special customers which include positive and negative "customers", signals and triggers [6,7]. It was originally developed also as a model of stochastic neuronal networks [9] with "negative and positive signals or spikes" which represent inhibition and excitation. In terms of GRNs, a queue is a "place" in which mRNAs are stored, and an mRNA can be considered to be a "customer" of the G-Network. The positive and negative signals are interpreted as the protein activities such as transcription factors, inducers and repressors. Note that the customers or signals of the G-Network can be any biological molecules. However, in our study, we focus on behaviours of mRNAs because the available GRN data are usually mRNA expressions. Each queue has an input and service rates which represent a transcription and degradation processes, respectively. Our interest is to estimate the steady-state probability that a queue is busy, which corresponds to the probability that an mRNA is present, and we are also interested in the total mRNA input rate of each queue. To evaluation the accuracy of the proposed method, we generated a simple simulation dataset by using the stochastic gene expression models processed with the widely accepted Gillespie algorithm [10,11]. We also examine a real biological dataset obtained from the cell cycle of the budding yeast [12].
Although queueing theory is a common computational tool, G-Networks are an essential departure from queueing theory; in particular conventional queues could not be possibly applied to GRNs because the notion of inhibition does not exist in queueing theory but was introduced by G-Network theory. There are two other essential novelties in our work. First, our approach enables us to obtain the steady-state of GRNs with only polynomial computational complexity due to the product form solution of G-Networks; the computational cost due to large memory space and non-polynomial computational complexity are basic limitations in conventional methods such as PBN. Also our method can provide more reliable measures to detect differentially expressed genes in microarray analysis (as shown in our simulation study).

G-networks and gene regulatory networks
The GRN model used in this study is the probabilistic gene regulatory model introduced in [5]. In this model, let K i (t) be integer-valued random variables which represent a quantity (mRNA) of the gene i at time t. If the K i (t) is zero, the gene i cannot interact with other genes. Then we have the following Probabilities, where Λ i is the total input rate (sum of transcription rate, l i and increment rate of mRNAs come from outside of system, I i ), μ i is the service rate (e.g. Degradation rate of mRNAs). o(Δt) 0 as t 0. Let r i is representing the activity (signal process) rate of each gene i. Then 1/r i is the average time between successive interactions of gene i with other genes. If the ith gene interacts with other genes, the following events occur: • With probability P + (i, j), gene i activates gene j; when this happens, K i (t) is depleted by 1 and K j (t) is increased by 1 • With probability P -(i, j), gene i inhibits gene j; when this happens, both K i (t) and K j (t) are depleted by 1 • With probability Q(i, j, l) gene i joins with gene j to act upon gene l in excitatory mode, as a result of which both K i (t) and K j (t) are reduced by 1, while K l (t) is increased by 1 • With probability d i , which is defined as follow, Let's define a random process K(t) = [K 1 (t), ..., K n (t)], t ≥ 0 and an n-vector of non-negative integers k = [k 1 , ..., k n ]. The P (k, t) is the probability that K(t) takes k at time t, P (k, t) = P (K(t) = k). Then the probability that K(t) have k at time t + Δt is defined by ( ) is a vector that the value of ith element is k i + 1 (k i -1) and I(x) is indicator function which is 1 if the condition, x, is satisfied or 0 other wise. The first and second terms describe the increment and decrement of the length of queue i, respectively. Third term is the probability that the gene i is activated but nothing is happened except queue i lose one mRNA. From fourth to sixth terms are the probabilities that gene i is activated and interacts with gene j. The rest terms of (1) represent the probabilities that the interaction of gene i and gene j affect the gene l (length of lth queue). Divide (1) by Δt and introduce the equilibrium probability distribution of the system P(k) = lim t ∞ P (k, t) then we obtain following dynamic behaviour, Now, let's consider following equations, Where q i (= Λ i + /(r i + Λ i − )) represents the probability that gene i is expressed in steady-state. Using (2) and (3), E. Gelenbe showed the following product form is satisfied [5,7].

Results and discussion
Simple gene regulatory networks using stochastic gene expression model In order to assess our G-Network model, we construct a simple GRN structure and generate the expression data using a synthetic stochastic gene expression model [13,14]. This stochastic gene expression model has several important features such as protein dimerization [15] and time delay for protein signalling [13]. Figure 1 shows the simulated network structure which is based on the following basic principles: the number of proteins per cell chases the number of mRNAs which in turn chases the number of active genes [14]. Figure   where i, j {A, B, C, D} in Figure 1. In addition, we assume that proteins such as transcription factors and repressors require accumulation times for their activation [11,13], and use the modified Gillespie algorithm to generate the expression data [10,11]. The cell growth rate and cell volume are fixed, and we consider five cells. Detailed parameters are summarized in Table 1 with their references.
The transcription process in (5) follows an exponential distribution with transcription initiation rate l 2 [16]. The translation processes are given in (6) and include direct competition between the ribosome binding site and the RNAse-E binding site which degrade the mRNAs. Thus the translation process follows a geometric distribution with probability p and busting size b = p(1 -p) [13,16]. T D is the average time interval between successive competitions, and the number of surviving mRNAs n 2 in the population after transcription is blocked with n 2 = n 2,0 p t T D / . This is equal to T half = -(log(2)/log(p))T D [13]. Thus the translation initiation rate, l 3 = 1/T D , can be computed. The protein dimer association and disassociation rates are k a2 and k d2 , respectively, as shown in (7) and (8) [17]. We also consider the DNA-protein association and disassociation rates (k a1 and k d2 in (9) and (10), respectively) [18]. The degradation rate of mRNA and of proteins are obtained by using the half-life of each molecule (11) [16,17].
We generate three sets of expression data (Dataset 1, 2, and 3); each dataset has two groups, the normal and the case group. These groups are obtained with the same parameter values except for the transcription initiation rate of G A in case group is 0.0012 sec -1 which is half of the transcription rate in normal group, 0.0025 sec -1 . Both groups are simulated during 3000 seconds. In order to compare these two groups, we perform not only the G-Network analysis but also the t-test which is widely used to find differentially expressed genes in microarray analysis. Datasets 1 and 2 consist of 50 samples each     1 and 2), our method provides consistent results with large sample analysis (Dataset 3). The ratios (case/normal) also show that the q A and Λ A , in the case group, are smaller than one while the other ratios stay around one. In Dataset 3, the p-value of G B is significant along with that of G A because the expression of G A directly affects the expression of G B . However, G B is not the causal gene in this study. Our G-Network analysis reveals that only G A has lower q and Λ values than other nodes including G B . All these results concur with the simulation data generated with one half of the normal transcription rate.

Modeling cell cycle gene regulatory networks in budding yeast
The cell cycle regulated transcription and its overall controls have been studied in detail for budding yeast [19]. Recent developments in high-throughput microarray techniques help to reveal many of yeast genes controlling the cell cycle [20] which consists of four distinct phases: Gap1 (G1), Synthesis (S), Gap2 (G2), and Mitosis (M). The cells grow during their G1 and G2 phases and their DNA is replicated during the S phase. In the M phase, cell growth stops and the cell divides into two daughter cells that include nuclear division. Many genes are involved with specific cell cycle phases, but the number of key regulators that are responsible for the control of the cell cycle process is much smaller. Thus, based on published information, we build a cell cycle GRN with the key regulators in budding yeast as shown in Figure 3, although the relationships that contribute to the true regulatory network structure of the cell cycle still remain uncertain. Therefore we simplify the cell cycle network structure by selecting thirteen key regulatory genes (the gray circles in Figure 3) and connect the genes without regard to the transcriptional and post-transcriptional processes. Figure 4 shows the reconstructed regulatory network structure.
The activity of cyclin-dependent kinases (CDKs) plays an important role in controlling periodic events during cell cycle. Some studies of cell cycle with high-throughput technologies have suggested alternative regulation models of periodic transcription [20]. D. Olando et., al. [12] measured the transcription levels of cell cycle related genes with the use of Yeast 2.0 oligonucleotide array and determined the manner in which transcription factor networks contribute to CDKs and to global regulation of the cell-cycle transcription process. This microarray dataset is used in our study with the cell cycle network structure of Figure 4; it consists of two groups: one group is obtained from wild-type (WT) cells and the other is from cyclinmutant (CM) cells which are disrupted for all S-phase and mitotic cyclins (mutate clb1, 2, 3, 4, 5, and 6).
The microarray data consist of a total of 30 data points taken over 270 minutes. We subdivide it into five states (groups), each consisting of 6 data points. The expression levels are transformed by taking the natural logarithm. Figure 5 depicts the transformed expression profiles of the 13 genes with 5 states. The black and gray solid lines are the expression profiles from WT and CM cells, respectively, and S1, S2, ..., S5 represent the five states. It is obvious that the profiles of CLB2 are different between WT and CM cells because the CM dataset is designed to monitor the cell cycle processes without the clb cyclines. Table 3 summarizes the steady-state probabilities of 13 genes in the cell cycle GRN. All genes have similar steady-state probabilities in the WT and CM cell groups except for CLB2 in the CM group, which has a lower steady-state probability than the elements of the WT group: as shown in Table 3, the ratio of CM/WT is smaller than one (bold letter). This smaller probability can be explained by considering the experimental design of the CM dataset which is obtained without clb cyclines. Also, the original study of this dataset suggested alternative cell cycle regulatory pathways in [12] which had revealed that over 70% of the cell cycle related genes were expressed periodically without the clb cyclines. In our results, the steady-state probabilities of the CM group are consistent with that of the WT group. These results draw the same conclusion as the original study, i.e. that the steady-state of the 12 genes does not entirely depend on the expression of CLB2. Table 4 shows the estimated total input rate of the 13 genes. These results also show that only the input rates of CLB2 decrease in the CM group.

Conclusion
This paper has used the G-Network approach [5][6][7][8] to model GRNs. Two model parameters, the steady-state probability, q i , and the total input rate, Λ I , are estimated by determining the boundary of Λ i and using a grid search. We first use simulated gene expression data generated on the basis of a stochastic gene expression model. Two groups (normal and case) of expression data are examined. These two groups are exactly the same except for one parameter, the transcription initiation rate. We have observed that the G-Network based method is able to detect the abnormally expressed genes, while the t-test produces false positives. Then, using real data, we have observed that the steady-state probability of CLB2 is lower than that of other agents and concluded that the key genes of cell cycle regulation can be expressed without the clb cyclines; this result is consistent with the original experimental study.
However, the unchanged steady-state probabilities in all the five states may need to be considered, because the cell cycle has four phases (G1, S, G2, M) and expressions of genes involved with a specific phase are expected to be different from those in other phases. Also the small decrease rate and relatively large total input rates of CLB2 may require a more careful analysis of the G-Network approach in relation to cell cycle GRN structure. The manner in which we have used G-Network models in this paper did not currently include simultaneous interactions with three or more nodes. However this is not really a limiting effect of the model, since it suffices to include chain representations of dependencies in the G-Network model as has been done for neuronal networks [9] to cover excitatory and inhibitory effects that involve three or more nodes, and in fact random chains of nodes of any length. Although in this study the probabilities that gene i affect gene j, P + (i, j) and P -(i, j) in (3), are fixed at the value one, we think that the conventional reverse engineering GRN methods using the "Ensemble" method [21] can provide these probabilities more accurately for an improved steady-state analysis of GRNs.
In conclusion, our study has illustrated the use of G-Networks as a new approach for the steady-state analysis of GRNs, and has shown their usefulness in obtaining quantities such as the effective transcription rate and the steady-state probabilities, using them to detect differentially expressed genes, thus introducing a new approach which differs from more conventional microarray analysis methods. Future research will investigate the ensemble approaches to GRNs [21] based on the G-Network methodology in [5], which will allow to infer GRN structures, and also to monitor their steady-state behaviour.

Methods
Once a GRN structure is determined, it is necessary to estimate the total input rate (Λ i ) of ith queue and its steady-state probability, (q i ). For the simplicity, the probabilities, P + (i, j), P -(i, j), and Q(i, j, l) in (3) In (12), the Λ i and R i is the total input (Λ i = l i + I i ) and total output rates (R i = r i + μ i ), respectively. f i + is a function of activation probabilities of genes which affect to gene i positively and f i − is a function of activation probabilities of genes which affect to gene i negatively. We fix the r i as the number of out degrees of gene i and the degradation rate of mRNA, i, as a constant (Table 1) because the total output rate, R i is not our interest. Therefore, we need to estimate two parameters, the total input rate, Λ i , and the steady-state probability, q i .
Let Λ i lower is the lower bound of the Λ i , which is larger than zero. The lower bound of total input is regarded as an initial transcription rate without any external input. In this study, we use Λ i lower = 0.0025 [16]. The upper bound of Λ i Λ i upper is obtained by assuming inputs from other nodes are zero and the queues fully work. That is where the probabilities q i * and q j * are one.   ). Then the steady-state probability q i can be obtained numerically by solving (12) with the q i ( ) 0 and the Λ iu . Once the steady-state probability is determined, the log-likelihood of the given model can be computed by using (4) which is the same form of the likelihood of geometric distribution. It is known that the log-likelihood of geometric distribution is convex so we choose appropriate value Λ i which maximizes the loglikelihood function.
For each value of total input, Λ iu ( ( , we compute the steady-state probability, q iu , with initial value, q i ( ) 0 , and obtain the log-likelihood score, log L iu , which is used to choose the optimal I total input rate, Λ i Note that the q iu is a numerical solution of (12) with initial value, q i ( ) 0 , and total input rate, Λ iu . In order to compute Λ i which is a numerical solution of (12) with initial value, q i ( ) 0 , and total input rate, Λ i ( ) 0 . Then the steady-state probability of gene i, q i , can be obtained by updating its value iteratively until the d (t) <δ where d (t) is the difference between q i t ( ) and q i t ( ) −1 at tth iteration. In this study, δ is 0.0001.