Inference of gene regulatory networks with the strong-inhibition Boolean model

The inference of gene regulatory networks (GRNs) is an important topic in biology. In this paper, a logic-based algorithm that infers the strong-inhibition Boolean genetic regulatory networks (where regulation by any single repressor can definitely suppress the expression of the gene regulated) from time series is discussed. By properly ordering various computation steps, we derive for the first time explicit formulae for the probabilities at which different interactions can be inferred given a certain number of data. With the formulae, we can predict the precision of reconstructions of regulation networks when the data are insufficient. Numerical simulations coincide well with the analytical results. The method and results are expected to be applicable to a wide range of general dynamic networks, where logic algorithms play essential roles in the network dynamics and the probabilities of various logics can be estimated well.


Introduction
A gene regulatory network (GRN) [1,2] consists of a large number of DNA, RNA, proteins and other molecular entities with complicated interactions. The genes influence each other with their expression products in the structure of complex networks [3]. Since a huge number of data is available in GRNs while the detailed interaction connections are still unknown in many cases, the inference of genetic network structure has become a very important topic. A number of studies focused on the reconstruction of general networks [4][5][6][7], especially on the inference of GRNs [8][9][10][11][12][13]. A plethora of methods [14][15][16][17][18][19][20][21][22][23] have been proposed for exploring the network structures from different kinds of biological experimental data, such as time series and steadystate data. The development of this field is crucial for understanding the function, stability and evolution of biological systems. However, as pointed in [24], the inference problem is still far from being solved. For instance, one problem of great significance in practice, namely with how large a probability can an interaction be correctly inferred using a given number of measured data, especially when the data are insufficient, has scarcely been touched analytically. This is the central task of the present paper [25].
Different inference methods are based on different network models. Some commonly used methods adopt approximations of quantized variables and time, and infer GRNs with Boolean networks [25][26][27]. Since the seminal work of Kauffman [28], Boolean networks have been extensively used to depict GRNs [29][30][31]. In spite of its simplicity, in many cases it can catch the essence of regulations [32]. The results obtained in these simplified models can be used as starting bases to study more general and practical systems with continuous time and variables (usually described by coupled ordinary differential equations). In this paper, we follow this development to propose a logic-based Boolean algorithm in our inference analysis. We consider the problem of how to reconstruct the GRNs from evolution data arbitrarily given with a known regulatory rule. In particular, we focus on a threshold network model widely used in GRN analysis: the strong-inhibition Boolean (SIB) model [30,31], which in fact is one case of the so-called nested canalyzing function introduced recently in biology [33,34]. In SIB models, regulation by any single repressor of high density can definitely suppress the expression of the gene regulated. This rule represents some typical biological transcriptional regulatory processes [35][36][37]. The key result of this paper is that we are able to derive analytic formulae for dealing with the inference problem. We not only determine the inferences by various logics, but also compute analytically the probabilities for different logics to occur. Therefore, we answer a different question that is of great importance when the data are not 3 enough to determine the whole network: With a given amount of data randomly chosen, with how much probability can one succeed in correctly predicting different types of interactions? Furthermore, we obtain exact forms analytically computing these probabilities by correctly ordering the computations, step by step, for different types of regulations. These results can be used to estimate the necessary number of measured data and experimental costs for the given tasks of GRN inference with required prediction precision. All the analytical predictions are justified perfectly by numerical simulations. Moreover, although we consider only a Boolean model with specific strong inhibition dynamics, the ideas and methods can be extended to all general complex systems in which a logic algorithm (i.e. Boolean maps (BMs)) is important for the system dynamics and the probabilities of various logics are analytically computable. Some systems of practical importance may fall into this class, such as more general genomic regulatory networks, some neural networks and epidemic networks.

Model and some concepts
First let us make clear our model and classify different gene sets in inferring the network. In the SIB model, the state of node i at time t is denoted by S i (t). Each node i has only two possible states, on-state (S i = 1) or off-state (S i = 0), and its state at time t + 1 is determined by the state of the network at time t via the SIB rule: where a i j = 1 if node j activates node i, a i j = −∞ if j represses i and a i j = 0 if j does not regulate i. This rule is inspired by the observation that for a given gene its repressive interactions often dominate the activations in some important genomic regulations [30,35,37]. As in [36], we do not consider self-regulation in the networks, i.e. a ii = 0 for all nodes in our present discussion as in figure 1.
The structure of GRN can be inferred from the expression of genes. We denote the state of the whole GRN at time t, (S 1 (t), S 2 (t), . . . , S N (t)) T , by a vector pattern S(t). Under the SIB mechanism, one pattern S(t) of a given GRN will evolve to another pattern S(t + 1). S(t) and S(t + 1) make up a pair of input/output data which can be used to infer the GRN. For example, if in an input pattern only node i and node j stay at on-state, i.e. S i (t) = 1, S j (t) = 1 and S k (t) = 0 (k = i, j), whereas in the corresponding output pattern node i switches to off-state, i.e. S i (t + 1) = 0, then from this pair of input/output data, we can make a logic derivation that j has repressive regulation on i. Given enough pairs of data, it is possible to identify all links existing in GRN. Generally speaking, if we have all the necessary data (maximum 2 N different sets of fully chosen data) this inference process can be implemented using Boolean algebra [31]. A practical question is: Given a certain number of data randomly chosen that is much less than the above maximum number, to what extent can we reconstruct the networks? Because the inference from an arbitrary number of data is much more complicated than the example above, we propose an analytic method to simplify this logic-base inference. Our basic idea is the partition of the whole GRN as in figure 2 and this partition will be used to order the computations of different logics in the next section. The symbols used in figure 2 and in this paper are listed below: N : the number of the nodes in the whole GRN; q: the number of input/output pairs of patterns used in inference; A: the aim gene; α: the set whose elements activate A actually; β: the set whose elements repress A actually; S α : the set whose elements are suspected to activate A; S β : the set whose elements are suspected to repress A; E α : the set whose elements can be inferred to definitely not activate A; E β : the set whose elements can be inferred to definitely not repress A; I α : the set whose elements can be identified to definitely activate A; I β : the set whose elements can be identified to definitely repress A; D α : the set whose elements in fact do not activate A but remain inS α ; D β : the set whose elements in fact do not repress A but remain inS β ; α, β, S α , S β , E α , E β , I α , I β , D α , D β : the sizes ofα,β,S α ,S β ,Ẽ α ,Ẽ β ,Ĩ α ,Ĩ β ,D α ,D β , respectively, where the symbols with tildes represent the names of sets, while the symbols without tildes represent the sizes of corresponding sets. Partition of the whole GRN in the algorithm corresponding to the inference of regulations upon a certain gene. A represents the aim gene.S α (S β ): whose elements are suspected to activate (repress) A.Ẽ α (Ẽ β ): whose elements can be inferred definitely not to activate (repress) A.Ĩ α (Ĩ β ): whose elements are correctly inferred as activators (repressors).D α (D β ): whose elements in fact do not activate (repress) A but remain inS α (S β ) because of lack of data.α (β): the set of activators (repressors). Obviously,Ĩ α is a subset of the true activeα, andĨ β is a subset ofβ. Note thatα S α , because some nodes inα (the shadow domain) may be suspected to repress A so that they belong toD β . {i} shadow is very small, the meaning of which is given in the text.
In detail, the partition is implemented as follows. For a certain gene A, all other genes can be classified into three sets: some genes activate it, some others repress it, and the remaining genes do not regulate it. Assume that setα (β) represents the set whose elements activate (repress) A. Given a certain number of data, we can obtain some information about the links. Some nodes can be inferred to definitely not activate (repress) A, which are denoted byẼ α (Ẽ β ). Some nodes are suspected to activate (repress) A, which are denoted byS α (S β ). Among the setS α (S β ), we can definitely identify a subset denoted byĨ α (Ĩ β ) whose elements activate (repress) A. Obviously we haveĨ α ⊆α andĨ β ⊆β. Here the links via which the nodes iñ I α (Ĩ β ) regulate A are necessary in GRN and just called rigid edges in [31]. Moreover, there is a subsetD α (D β ) whose elements, in fact, do not activate (repress) A remaining inS α (S β ) because of insufficiency of the data. Strictly speaking, there may still be some intersections betweenD β andα represented by the shadow in figure 2 (note: no intersection betweenD α and β). Nevertheless, this intersection {i} shadow =D β ∩α =S β ∩α is small, and it is extremely small for the sparse matrix. With increasing data we can reduce considerablyD β as well as {i} shadow , so that we treat {i} shadow = ∅ for simplicity in the following analysis. Therefore we havẽ and The last equality in (3) will be better explained in the next section. Before inference all the nodes are suspected to repress A, so all belong to the suspected-repress setS β . Using data of A's transitions 1 → 1 and 0 → 1 some nodes can be excluded from the setS β so they belong to the excluded-repress setẼ β . (b) EA. Using data of A's transition 0 → 0 some of the nodes inẼ β that do not activate A can be excluded. The setẼ β is divided into two parts: the suspected-actS α and the excluded-actẼ α . (c) AR. Using the data of A's transition 1 → 0 some of the repressive links in the down-triangle can be identified fromS β . These nodes are denoted byĨ β . The squareD β represents these nodes inS β which in fact do not repress A. They are the nodes that step (i) fails to exclude fromS β . (d) AA. Using the data of A's transition 0 → 1 some nodes inS α can be identified to activate A. They are denoted as the up-trianglẽ I α . The squareD α represents the nodes that step (ii) fails to exclude fromS α .

Algorithm for inference
Based on the partition, an algorithm is proposed for identifying the partitioned sets step by step. Firstly, we can reduce the whole network into a subset of nodes that are suspected to regulate the aim gene, and then identify those links from the reduced suspected set. The first process is named Exclusion and the second one Affirmation. Since there are two kinds of regulations, the entire algorithm consists of four steps as shown in figure 3. Steps (i) and (ii) (figures 3(a) and (b)) are used to reduce the size of the suspected repressive set (S β ) and active set (S α ), respectively, while steps (iii) and (iv) (figures 3(c) and (d)) are used to identify some true repressive (Ĩ β ) and active ones (Ĩ α ) among these suspected sets. A novel trick in this work is to order these logic derivations in correct sequence so that the derivations of latter steps are based on the results of the previous ones. These steps require different kinds of data, which are summarized in Table 1. The data used in the algorithm.
Step A's Conditions for transition the given step The types of input/output pairs used in the four steps. The first column represents the abbreviations of the steps; the second column denotes the transitions of the aim node A, i.e. S A (t) → S A (t + 1); the third one specifies the requirements to achieve the input patterns. These requirements include not only the necessary conditions to achieve A's transition, but also the conditions under which the inference can be implemented in the corresponding step. For example, in step EA, the condition that all the nodes inα stay in off-state can guarantee A's transition 0 → 0. But in order to exclude some nodes from the suspected-act setS α , a condition that all nodes inS β stay in off-state in input pattern is additionally required. table 1. After these four steps, some nodes regulating the aim gene A can be identified. Then by following the same procedures, these input/output data can be used again to infer the regulations exerted upon other aim nodes. The details of the algorithm can be described as follows: (i) Exclusion of repression (ER). This step is ordered as the first derivation, and all other steps are based on the results of this logic computation. Before inference, all nodes except the aim gene A are suspected to repress A, so they all belong to the suspected-repress setS β . From A's transitions 0 → 1 and 1 → 1 listed in table 1, we can know that the nodes with on-state in the input pattern do not repress A. They are excluded fromS β and classified into a new set, the excluded-repress setẼ β . The function of this step is to partition all the nodes except A into two parts,S β andẼ β , as shown in figure 3(a).
(ii) Exclusion of activation (EA). The logic computation of this step depends on the result of (i), and the results of (i) and (ii) serve as the starting bases of steps (iii) and (iv), respectively. Before this step, all nodes inẼ β are suspected to activate A. With the help of A's transition 0 → 0, if all the nodes inS β stay in off-state as in table 1, we can exclude the nodes with onstate in the input pattern fromS α and put them in the excluded-act setẼ α . So this step partitions E β into two sets: suspected-act setS α and excluded-act setẼ α , as shown in figure 3(b).
(iii) Affirmation of repression (AR). This logic computation can be conducted based on the result of step (i). If only one node i inS β stays at on-state in the input pattern, and 8 A switches from S A (t) = 1 → S A (t + 1) = 0 (table 1), it can be inferred that i represses A. In this step, we use this kind of data to identify the repressive links. If there are more than one node inS β staying in on-state in the input pattern, we cannot identify exactly which one represses A. So the requirements in table 1 include not only these conditions that guarantee that A can switch from 1 to 0, but also the conditions to guarantee that we can identify the precise node repressing A. In this logic computation, the states of all nodes inẼ β set can be arbitrary, and this desirable condition can considerably enhance the efficiency of AR. Note that the information onS β andẼ β is prepared in step (i). The smaller the size ofS β , the easier it is to identify the repression. In figure 3(c), the setĨ β that has been identified to repress A is denoted by the down-triangle. Please note that there are some other situations where some repressive regulations can be affirmed. But these situations are so rare that we ignore them reasonably in this paper.
(iv) Affirmation of activation (AA). This step is similar to AR by considering different sets of data. The logic derivation is based on the results of step (ii). If only one node i inS α has S i (t) = 1 and all other nodes inS α have S(t) = 0, the A's transition 0 → 1 can help us to affirm that i activates A. After this step, the setĨ α is determined and is denoted by the up-triangle in figure 3(d).
The time expense of the algorithm can be estimated as follows. Given N nodes and q pairs of input/output data, we are collecting N q bits of information. For each node we need to examine these data not more than four times since there are four steps in the algorithm. Therefore, to infer the regulations upon N nodes, the computing expense is proportional to q N 2 .

Analytical derivations of inference precision
In the previous section, we showed how to use different logics to infer a given GRN. Here we go further to compute the probabilities of these logics and then derive explicit formulae quantitatively measuring the quality of the inference. In this section, we have two major assumptions. Firstly, we assume that the data about the input expression patterns are generated randomly, i.e. in each input pattern the probability that one node takes on-state or off-state is equal to 50%. Secondly, strong-inhibition dynamics, i.e. we assume the regulations obey the simply and wildly observed rule. With these two assumptions we discuss the extent to which we can correctly reconstruct the regulation network given a certain number of data.
This problem can be formulated as follows: for an aim gene A of a GRN that complies with the SIB mechanism of (1), we ask how many rigid edges on average can be identified, given q pairs of input/output patterns with the input randomly chosen. In other words, what are the sizes of sets such asĨ α andĨ β ?
(i) ER. This step will compute the probability P(D β ) as well as D β and S β = D β + β.
In order to explore the setẼ β , we need A's transition 1 → 1 or 0 → 1. The first kind of transition requires that S A (t) = 1 and S i (t) = 0, ∀i ∈β, and the second kind requires that S A (t) = 0, S i (t) = 0, ∀i ∈β and ∃ j ∈α, S j (t) = 1. The probability that one pair of input/output data belongs to these two kinds can be exactly computed by p 1 .
where the first term corresponds to the first kind of maps S A (t) = 1 → S A (t + 1) = 1, while the second term considers maps with S A (t) = 0 → S A (t + 1) = 1. With q pairs of random data, the probability p(k β ) that k β pairs satisfy condition (4) can be computed analytically by the binomial distribution B(q, k β , p 1 ) as After this step, node i ∈β can remain inD β ⊆S β if its value is 0 in all of these k β input patterns. Here, the probability p(D β |k β ) that inS β there are D β nodes not repressing A can be calculated as Hence, the total probability of D β is which yields the expectation of D β and (ii) EA. This step will derive P(D α ) and D α based on the results of D β . Node A's transition 0 → 0 is required in this step. If S A (t) = 0 and S i (t) = 0, ∀i ∈α, we can have the transition of S A 0 → 0. But in order to exclude the influences of nodes fromS β , we need additionally S j (t) = 0, ∀ j ∈S β , so that we can infer node l ∈Ẽ α if S l (t) = 1. As a result, the requirement of step (ii) is that S A (t) = 0 and S i (t) = 0, ∀i ∈S β ∪α (table 1). We call the input/output data that satisfy the requirement utilizable pairs of data for the EA step. Based on the two assumptions at the beginning of this section, the probability for a pair of data belonging to this utilizable set is p 2 : According to the binomial distribution, given the size ofD β , the probability that there are k α pairs of utilizable data is denoted by Similar to (6), the probability of the size ofD α can be computed as Then the total probability p(D α |D β ) that, in the setS α , there are D α nodes that do not activate node A can be summed by leading to the expectation of D α so (iii) AR. Based on D α and D β , the probability P(I β ) and the expectation I β can be computed.
If in an input pattern, S A (t) = 1, ∃ j ∈β, S j (t) = 1 and S i (t) = 0, ∀i ∈ (β ∪D β − { j}), then node A will transit from 1 → 0 and node j can be identified to repress A. In other words, in total (β + D β + 1) nodes in the pattern should stay at required states, so the probability for an input pattern to satisfy the requirement is 2 −(β+D β +1) . Then, given q pairs of data, the probability that node j ∈β can be affirmed is Treating the affirmation of each node inβ as independent, the probability of the size of the total affirmed repressive setĨ β reads Therefore, given D β , the expectation of I β reads (iv) AA. Finally, the formulae of P(I α ) and I α can be derived after those of D α and D β . If S A (t) = 0, ∃ j ∈α, S j (t) = 1 and S i (t) = 0, ∀i ∈ (β ∪D β ∪α ∪D α − { j}), and if further we observe S A (t + 1) = 1, the link that j activates A ( j ∈α) can be affirmed. The probability that an input pattern can be used to identify j ∈α is 2 −(α+β+D α +D β +1) . And given q pairs of data, the probability that ∀ j ∈α can be affirmed reads Given the sizes ofD α andD β from steps (i) and (ii), the probability that I α active links are affirmed in this step can be computed as and Finally, given q pairs of data, the expectation of the rigid edges we can identify can be calculated as and where p(D β ), p(D α |D β ), I β | D β and I α | D α ,D β are given in equations (7), (13), (18) and (21), respectively. The probability p α ( p β ) that an active link (repressive link) regulating gene A can be identified is Note that the quantities I α , I β and I α + I β as well as p i , i = 1, 2, 3, 4, all depend on α and β of node A; therefore, we should denote them as I α | α A ,β A , I β | α A ,β A and I α + I β | α A ,β A and so on.
In the next section, we will compare the expectation calculated by (22), (23) and (24) with the average inference results of direct numerical simulations. We use the method proposed in [38] to calculate all binomial distributions.

Numerical results
In this section, we compare our analytic results with numerical simulations of the algorithm in section 3 to check the validity of (24). For this purpose, we consider an example of a GRN with N = 128 and random links between nodes. Random input patterns are assigned to the network, based on which output patterns are generated according to the SIB mechanism (1). Given a group of data that consists of q pairs of input/output patterns, the logic algorithm can numerically identify a part of the links. For a node that is actually activated by α nodes and repressed by β nodes, the number of identified active links and repressive links can be denoted by I α | α,β and I β | α,β , respectively. For different sets of data, different inference results I α | α,β and I β | α,β can be obtained.
The above comparison demonstrates that with analytical results I α +I β tot = N i=1 I α i +I β i , we are able to answer a question of practical importance: how many data are required when we want to identify a required fraction of links of a GRN. The above calculations are based on the information on detailed α i and β i values. Unfortunately, these pieces of information are often not available in practical cases. Instead, some statistical features of GRNs, e.g. the distribution of α and β, P(α, β), can possibly be estimated; we can then derive the formula from this statistics . Circles represent the simulation results obtained by inferring randomly generated networks from randomly generated data, while the lines represent the prediction from (24). The numerical data are averaged from 100 random GRN samples.
For instance, let us consider the inference of a realistic biological network, where we use the cell-cycle network of the budding yeast [29] pruning all self-regulations as in figure 1 as an example. Assume that the distribution of links exerted on each node is quasi-Poisson as which guarantees that each node is regulated by at least one link. Assume further that the numbers of active and repressive links exerted on each node are almost the same, i.e. β = α (the degree is even) or β = α + 1 (the degree is odd). Then the expectation of inference of the whole GRN can be calculated as Given the actual node number N = 11 and the average regulation E = 2.9, we calculate the expectation with (28) shown by the solid line in figure 5. The numerically averaged results of inference with q pairs of data are recorded as circles, and the standard deviations are shown by error bars. The figure demonstrates that the formulae we derived can predict well the inference process even when the interaction degrees of each node are not fully accessible.
Regarding the temporal data, each initial input pattern generates a sequence of data that consists of a number of input/output patterns until the system evolves to a stationary state. As a result, given q initial input patterns, more than q pairs of input/output data can be acquired, so that as shown by the dotted line in the inset of figure 5, many more regulations can be inferred with these data. However, in order to predict the inference with temporal data, the equations in this paper should be considerably modified. It may be related to the properties of attractors and trajectories, and further study is required. The simulation results are averaged over 100 groups of data consisting of q pairs of random input patterns. In contrast, the dotted line demonstrates the numerical result of inference with temporal data. In the case of temporal data, each initial input pattern generates a sequence of data which consists of multiple pairs of input/output data. So given q initial input patterns, the algorithm can identify more links in the GRN, especially at the early stage of inference. The curve is also averaged over 100 groups of data. The details when q 40 are enlarged in the inset.

Conclusion
In this paper, we discussed the inference of GRNs with SIB logic dynamics, and obtained analytical formulae to calculate the probability that one link can be inferred from a certain number of data. The theoretical predictions are consistent with the numerical results.
The novelty of our work lies in two aspects. Firstly, in contrast with previous works on the inference of Boolean networks [25][26][27] in which the GRN is a 'black box' where we do not have any information about the regulation rules, we propose here an algorithm to infer GRNs treating them as 'gray boxes' as in [7], where we know a priori the rules of regulation. This algorithm can be used to efficiently infer GRNs in which the genes regulate each other via the SIB mechanism. Furthermore, its idea, i.e. to partition the network according to different kinds of regulations and to order properly the computation steps, is expected to apply to inference based on general regulation rules.
Secondly, we present analytical formulae for the number of links that can be inferred when the data are insufficient. In many cases, the data that can be acquired are not enough to realize the completely reconstruction of networks. Since the inference of GRNs is always underdetermined [8], researchers in practice are usually concerned with the problem of how many links can be inferred with insufficient information. However, to date most of the works on this topic are not analytical as in this paper, but just numerical. With these formulae presented above, we can estimate, for some classes of gene transcriptional regulatory networks, the experimental costs when we require a certain precision of inference to reveal the topology of GRNs [39]. This capability for estimation is important in many practical inference tasks.
Finally, the approach proposed in the paper needs to be extended in several aspects. Interesting directions for further research could include, for instance, treating the temporal data or uncertainties in the data. These topics will be explored in future work.