Informatics

Gene Regulatory Networks (GRNs) represent the complex interactions, with time delays, between Transcription Factors (TFs) and other genes. They are important to the working of the cell. Learning the GRN is an important first step towards understanding the working of the cell and consequently curing diseases related to its malfunctioning. One significant problem in learning GRNs is that the available time series expression data is still limited relative to the network size. To alleviate this problem, besides using multiple expression replicates, we propose to decompose the large network into small subnetworks without prior knowledge. Our algorithm first infers an initial GRN using CLINDE, then decomposes it into possibly overlapping subnetworks, then infers each subnetwork by either CLINDE or DD-lasso, and finally merges the subnetworks. We have tested this algorithm on synthetic data of many networks with 500 and 1000 genes. We have also tested it on real data on 41 human TF regulatory networks. Results show that our proposed algorithm does improve the GRN learning performance over using either CLINDE or DD-lasso alone on the large network.


Genes produce proteins when transcribed into mRNAs and subsequently translated. Some proteins are Transcription Factors (TFs), which trigger or inhibit the transcription of other genes. Also involved are miRNAs, which inhibit the translation of some mRNAs. Moreover, transcription and translation each takes time; though the two may be done simultaneously (e.g. in bacteria), non-negligible translational time delays have been observed [1,2]. The transcription rate has been observed to range from 12 to 100 nucleotides per second (nt/s) under different conditions in different organisms [3,4]. Also, it has been observed that RNA polymerase, a main working protein in transcription, may pause during transcription, adding a cumulative 204-307 s over a 2.3 kb region [4]. These delays have been known to affect network stability or cause oscillations [5][6][7][8]. Therefore, there are complex regulatory relationships in the cell between all these components, with different time delays. In order to understand the working of the cell, and subsequently to understand the cause of diseases related to the malfunctioning of the cell (such as cancer), it is crucial that the GRN first be mapped out. It would be too time-consuming and expensive to experimentally determine the regulators and regulatory targets of each TF, so computational methods are attractive complementary means to infer the GRN. With the advent of high-throughput microarray and RNA-seq technology, it is possible to obtain the expression levels of thousands of genes at one time, and a time series of expression levels of many genes can be obtained when this is done at a number of time points.


Background

There have been many GRN learning algorithms and models, with different levels of detail; see [9,10] for surveys of GRN modelling and [11] for a survey of GRN learning algorithms for microarray expression data.

Due to the nature of GRNs, most models of GRNs can be described as a graph, where the vertices are the genes under consideration and the edges represent the regulatory relationships.

Different levels of detail can be achieved by labelling the edges with extra information. In the simplest case, an undirected graph can be used, in which case only an association network of the genes is captured. ARACNE [12] infers an undirected network using mutual information, but also uses the Data Processing Inequality to try to eliminate indirect interactions. C3NET [13] first identifies gene pairs with significant mutual information, then links each gene to the neighbour (if any) with the highest mutual information, and outputs an undirected and conservative network, with no explicit means of eliminating indirect effects. But without direction in the edges, there is no causal interpretation. Alternatively, directed edges can be used, as in [14], which uses a genetic algorithm to optimize a score based on partial correlation, estimated regulatory direction and effect, but the output edges are not labelled with time delays. Some algorithms consider only a delay of one time step, as in [15], which considers discretized expression data and uses association rule mining to find frequent regulatory patterns, but without eliminating indirect associations. The Boolean network, e.g. in [16], is a classic model of GRNs where the expression of each gene is discretized to only on or off, and the expression of each gene at the next time step is a Boolean function of the expression of its regulators at the current time step. Another popular class of GRN models is Ordinary Differential Equations (ODEs), where the rate of change of expression of each gene is a (linear or nonlinear) function of the expression of the gene and its regulators. When discretized in time, it reduces to a one-time-step model. Examples are [17], which uses Gaussian processes for Bayesian inference of an ODE model, and DELDBN [18], which combines an ODE model with local Bayesian analysis and uses the estimated Markov blanket as the regulators of each gene. There are also Dynamic Bayesian Network (DBN) based models, which avoid the limitation of plain Bayesian networks that no cycles are allowed. An example is [19], which utilizes Bayesian structural expectation maximization to infer a one-time-step DBN model.

There are relatively few algorithms that infer multiple time delays. [20] first estimates the possible delays from pairwise mutual information of discretized expression data, then infers a multiple-time-step DBN by minimizing an MDL score using a genetic algorithm. Banjo [21] also optimizes a score metric on a DBN using discretized expression data by an MCMC based method, and an updated version of the program allows multiple delays. TD-ARACNE [22] is an extension of ARACNE with time delays. But these algorithms do not label the edges of the GRN with the regulatory effect. In contrast, in DD-lasso [23], the expression of a gene is a linear combination of the expression of its regulators at (possibly different) previous time steps. It first estimates the delays between each gene pair by maximum likelihood, then uses lasso [24] to remove indirect effects and estimate the coefficients, so the edges are labelled with the delays as well as the regulatory effects. CLINDE [25] uses a similar model, but uses conditional independence of the shifted time series to estimate the delays and eliminate indirect effects.

Some other algorithms use perturbation data, or a combination of perturbation data and time series expression data. [26] is a parallel implementation of the Network Identification by multiple Regression (NIR) algorithm utilizing perturbation data. [27] needs promoter sequence and TF binding site information in addition to (non-time-series) expression data. [28] is an IC [29,30] based method, which uses steady-state data with partial prior knowledge of the ordering of regulatory relationships, and uses entropy to test conditional independence, giving an acyclic network where some edges may remain undirected. [31] uses convex programming on an ODE model using perturbation data. TSNI [32] solves a discretized linear ODE model using time series expression data after each gene is perturbed.


Methods

In this section, the proposed algorithm is described. We assume the true GRN consists of a number of subnetworks, where most connections are within subnetworks, but each subnetwork is not densely connected, and there are some connections across subnetworks. Figure 1 illustrates the assumed structure of the GRN.

These assumptions may suggest a clustering based method, but there are some issues to address in using a plain clustering method. Firstly, the similarity measure needs to take into account the time delays. This could be solved by considering the correlation between shifted time series, trying the possible time delays (up to a maximum allowed delay) to define the similarity measure. Secondly, indirect effects, which lead to correlation between genes that are not directly dependent, need to be taken into account. For example, if a → b and b → c, then a high correlation between a and c will be observed. Since there are some connections across subnetworks, indirect effects may cause more genes in different subnetworks to appear dependent than actually are, which may make the clustering more difficult. This is illustrated in Figure 2. Thirdly, since the subnetworks are not disconnected, if clustering is used, "overlapping" clusters are preferred to disjoint ones, but most clustering methods give disjoint clusters as output. If only disjoint clusters are found, either further processing is needed to find the cross edges between subnetworks, or these edges have to be ignored. Therefore, either overlapping clusters are found, or the disjoint clusters have to be "expanded" a little to allow the cross edges to be inferred in the process of inferring each subnetwork, instead of needing further inference.

Because of the above, we propose to first infer an initial GRN using CLINDE, which can handle time delays and help eliminate some indirect effects, and then to decompose the initial GRN into overlapping subnetworks to get the subsets of genes. The overall flow is given in Figure 3. The steps are: 1) initial GRN learning using CLINDE; 2) decomposition using edge betweenness to get grouped subsets of genes; 3) inference of a subnetwork from each gene subset using either CLINDE or DD-lasso; 4) combination of the subnetworks to obtain the final GRN. In the following, the input data, the GRN model, and the steps of the proposed algorithm are described.
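The four steps can be outlined as a small pipeline. The following is a hypothetical sketch: the stand-in callables are placeholders for CLINDE, the edge-betweenness decomposition, and CLINDE/DD-lasso on subnetworks, not the actual implementations.

```python
def learn_grn(expression, infer_initial, decompose, infer_sub):
    """Sketch of the proposed pipeline with pluggable inference steps."""
    initial = infer_initial(expression)                     # step 1: initial GRN
    subsets = decompose(initial)                            # step 2: gene subsets
    subnets = [infer_sub(expression, g) for g in subsets]   # step 3: re-learn each
    return set().union(*subnets) if subnets else set()      # step 4: merge by union

# Toy run with stand-in callables: two overlapping subsets; the union keeps
# the cross edge (2, 3), which appears in both subnetworks, only once.
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
grn = learn_grn(
    expression=None,
    infer_initial=lambda data: edges,
    decompose=lambda g: [{0, 1, 2, 3}, {2, 3, 4}],
    infer_sub=lambda data, genes: {(u, v) for (u, v) in edges
                                   if u in genes and v in genes},
)
```

Because the subsets overlap, the cross edges are recovered during subnetwork inference itself, and the merge is a plain union of edge sets.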


Data and model

The given data is {x_i(t)}, for i = 1, …, n and t = 1, …, m, where x_i(t) is the expression value of gene i at time t, and there are n genes and m equidistant time points. If the raw input data does not have equidistant time points, interpolation (e.g. spline interpolation) could be performed as pre-processing before using this algorithm.

The GRN model assumed here is the same as that for DD-lasso [23] and CLINDE [25]:
x_j(t) = Σ_i a_ij x_i(t − T_ij) + ϵ_j(t)

so that a_ij is the regulatory effect of gene i on gene j, where the regulatory effect is repressive if a_ij is negative, activatory if positive, and absent if zero; T_ij is the positive time delay of the edge i → j (if a_ij ≠ 0); and ϵ_j(t) is the error term for gene j at time t. We assume that the error terms of each gene at each time point are zero-mean and mutually independent, but otherwise we do not make stringent assumptions on their distribution. Note that this model does not preclude self-regulation or cycles in the GRN, though any cycle must have positive delays.
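To make the model concrete, expression data can be simulated from it. The two-gene network below is a made-up example; this is an illustrative sketch, not the data generation procedure used in the experiments.

```python
import numpy as np

def simulate(a, T, m, rng):
    """Simulate x_j(t) = sum_i a[i, j] * x_i(t - T[i, j]) + eps_j(t).

    a: n-by-n coefficients; T: n-by-n positive delays where a != 0.
    Returns an m-by-n array; the first max-delay rows act as history.
    """
    n = a.shape[0]
    d = max(int(T.max()), 1)
    x = rng.normal(size=(m + d, n))   # every entry starts as its error term
    for t in range(d, m + d):
        for j in range(n):
            for i in range(n):
                if a[i, j] != 0:
                    x[t, j] += a[i, j] * x[t - T[i, j], i]
    return x[d:]

# Toy network: gene 0 activates gene 1 with delay 2 and effect 0.8, so
# x[:, 1] correlates with x[:, 0] shifted by 2 time steps.
a = np.array([[0.0, 0.8],
              [0.0, 0.0]])
T = np.array([[0, 2],
              [0, 0]])
x = simulate(a, T, m=200, rng=np.random.default_rng(0))
```

The shifted correlation between gene 0 and gene 1 is substantial at the true delay but near zero at lag 0, which is exactly the signal that delay-aware methods such as CLINDE and DD-lasso exploit.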


Initial GRN

The first step is to obtain an initial GRN. There are not many GRN learning algorithms that handle multiple time delays; CLINDE [25] and DD-lasso [23] are two of them. From the comparison in [25], CLINDE outperforms DD-lasso when the number of time points is small relative to the number of genes, so we choose CLINDE to infer the initial GRN. CLINDE is based on the PC algorithm [30] and consists of two stages. Stage 1 considers all (directed) pairs of genes x and y, and all possible delays d up to a maximum allowed delay, to determine whether x → y is significant with delay d, based on either a correlation test or a mutual information test. The test is considered significant if its score is larger than a score threshold. In the correlation test and the partial correlation test below, the score is -log10(p-value); in the mutual information test and the conditional mutual information test below, the score is the (conditional) mutual information. So a higher score threshold means a more stringent test. For the correlation test, the regulatory effect (positive or negative) is also estimated if the edge is significant. After stage 1, there may be multiple edges from x to y, but with different delays.

Stage 2 attempts to prune the edges by partial correlation tests or conditional mutual information tests. Iteratively, the remaining edges are considered for pruning by conditioning first on h=1 neighbour, then h=2 neighbours, and so on up to h=N0, for a given parameter N0. In each iteration, each remaining edge is tested by conditioning on h neighbours, shifted properly using the delays estimated in stage 1. If the conditional test is insignificant, the edge is pruned. After stage 2, the remaining edges are output as the GRN. For the purpose of finding the decomposition into subnetworks, we "condense" the multiple edges x → y with different delays between the same pair of genes so that only the one with the most significant p-value in stage 1 remains (Figures 4 and 5).
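Stage 1's pairwise test can be illustrated with a small sketch. This is an approximation, not CLINDE's code: the p-value of the correlation is obtained via Fisher's z normal approximation rather than the exact test.

```python
import math
import random

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

def delay_score(x, y, d):
    """Score of candidate edge x -> y at delay d: -log10 of the two-sided
    p-value of corr(x(t), y(t + d)), via the Fisher z approximation."""
    u, v = x[:-d], y[d:]
    r = corr(u, v)
    z = math.atanh(max(min(r, 0.999999), -0.999999)) * math.sqrt(len(u) - 3)
    p = math.erfc(abs(z) / math.sqrt(2))
    return -math.log10(max(p, 1e-300)), r

# Hypothetical example: y echoes x with delay 3 plus noise, so the score
# at d=3 is far above a typical score threshold (e.g. 2), unlike at d=1.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [rng.gauss(0, 1) for _ in range(3)]
y += [0.9 * x[t - 3] + 0.2 * rng.gauss(0, 1) for t in range(3, 100)]
s3, r3 = delay_score(x, y, 3)
s1, r1 = delay_score(x, y, 1)
```

The sign of the returned correlation plays the role of the estimated regulatory effect for a significant edge.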


Decomposition

Given an initial GRN, the proposed strategy of decomposing it into subnetworks is to identify which edges are likely to be across subnetworks. This is accomplished based on "edge betweenness" [33], as used in community structure discovery algorithms [33]. In a graph, for every two distinct vertices (genes), consider the undirected shortest path(s) between them. If there are k shortest paths, the constituent edges of each shortest path each receive a weight of 1/k from the path. The "edge betweenness" of an edge is the sum of weights received after considering all vertex pairs. This is illustrated in Figure 4. Intuitively, a higher edge betweenness means the edge is more likely to be across subnetworks; this is illustrated in Figure 5. Consider subnetworks connected by only a few edges: the shortest paths of gene pairs in the same subnetwork likely stay within the subnetwork and would not overlap too much. On the other hand, for genes not in the same subnetwork, their shortest paths would need to go through one of the few edges across subnetworks. Therefore, the edges across subnetworks would likely have higher edge betweenness. [34] gives a fast method to calculate the edge betweenness of a graph with m edges and n vertices in O(mn) time.

The steps for decomposition are as follows:

1. Starting from the initial GRN, first identify all the components (the maximal connected subgraphs, considering the edges as undirected).

2. For each component, if its size is larger than or equal to a size threshold S0, calculate the edge betweenness and remove the edge with the highest edge betweenness.

3. Go back to step 2 until all components are smaller than S0.
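Edge betweenness in step 2 can be computed by BFS-based shortest-path counting. The following is a simplified Brandes-style sketch (the O(mn) method of [34] is the refined version of this idea):

```python
from collections import deque

def edge_betweenness(adj):
    """Undirected edge betweenness via BFS shortest-path counting.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds, order, queue = {v: [] for v in adj}, [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):            # accumulate the 1/k path weights
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                eb[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2 for e, b in eb.items()}  # each pair was counted twice

# Two triangles {0,1,2} and {3,4,5} joined by the bridge 2-3: all 9 cross
# pairs route through the bridge, so it gets the highest edge betweenness.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
eb = edge_betweenness(adj)
```

Removing the bridge here splits the graph into the two triangles, which is exactly the behaviour the decomposition loop relies on.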

For the choice of the size threshold S0, it should not be set too small, because this decomposition procedure is only a heuristic, and removing more edges means a higher chance that the genes are grouped wrongly. Also, even if S0 is small, the resulting overlapping subnetworks may still be larger than desired after the expansion below. On the other hand, S0 should not be too large, as the subsequent subnetwork learning algorithm may have difficulty dealing with a large network, given the limited data. As a balance, the choice should be a large value that the subnetwork learning algorithm can still comfortably handle, taking into account the possible expansion of the gene subset discussed below. The default threshold S0 used is 60, based on the previous performance of CLINDE and DD-lasso.

Having identified the components, we consider three ways to obtain the final subsets of genes for the subnetworks:

1. Component: Simply output each component as a subset. This gives disjoint partitions of the genes, so it is used mainly for comparison purposes, i.e. to show the performance of the algorithm if the predicted cross edges are ignored.

2. Parents: For each component, include its parents based on the initial GRN. Since the removed edges are presumably those across subnetworks, we include the parents so that these cross edges can be identified in the subnetworks. In this case, the subnetworks are likely overlapping.

3. XParents: For each component, include its parents based on the initial GRN, but for each subnetwork, after learning, remove the edges between the parents (those not in the component). The rationale is that the parents of the parents may not be included in the subnetwork, so some indirect effects between the parents may not be removed properly. Also, if there is a genuine edge between two parents, presumably these two parents would have been present in some component.
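The Parents expansion can be sketched as follows (a hypothetical sketch, not the authors' code): each component gains any gene that regulates one of its members in the initial GRN.

```python
def expand_with_parents(components, edges):
    """components: list of sets of genes; edges: set of directed
    (parent, child) pairs from the initial GRN.
    Returns possibly overlapping gene subsets."""
    subsets = []
    for comp in components:
        parents = {p for (p, c) in edges if c in comp}
        subsets.append(comp | parents)
    return subsets

# Toy example: the cross edge (2, 3) was removed during decomposition, so
# gene 2 is pulled into the second subset as a parent of gene 3, making
# the subsets overlap and letting the cross edge be re-inferred.
comps = [{0, 1, 2}, {3, 4, 5}]
edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)}
subsets = expand_with_parents(comps, edges)
```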


Infer subnetworks

After the possibly overlapping subsets of genes for each subnetwork (from one of Component, Parents or XParents) have been estimated from the decomposition of the initial GRN, the corresponding subset of expression data can be obtained and each subnetwork can be re-learnt. Both CLINDE and DD-lasso can be used for this purpose, as both can handle the now small subnetworks. DD-lasso is based on lasso [24], which is a regularized regression method that also has the effect of feature selection. DD-lasso extends lasso to handle time delays and consists of three stages. In stage 1, for all directed pairs of genes x and y, determine the delay d such that x → y has the maximum absolute correlation when shifted by delay d. In stage 2, for each gene g, treat all the genes as potential parents, with the delays determined in stage 1; then lasso is used to predict the real parents of g, through the feature selection nature of lasso, by using the expression data of g as target and the shifted expression data of all the genes as predictors. In stage 3, backward elimination is done for each gene to further remove parents that are likely only indirect effects. Finally the output GRN is obtained. Note that DD-lasso assumes that between any two genes there is only one edge with a single delay, whereas CLINDE does not assume this. Since the initial GRN is estimated using CLINDE, using CLINDE again to infer the subnetworks after decomposition may not lead to great improvement, since the learning method is not changed much, unless the learning parameters are adjusted based on the network sizes. In contrast, DD-lasso is based on a different approach from CLINDE and has different performance characteristics, so it may lead to greater improvement on the subnetworks, even though DD-lasso may not perform well on the large network, given the limited data.
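The feature-selection role of lasso in stage 2 can be illustrated with a generic coordinate-descent lasso solver. This is a sketch of lasso itself, not DD-lasso's implementation; in DD-lasso the predictors would be the delay-shifted expression series of all candidate regulators.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's own contribution added back
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # soft-thresholding: small correlations are driven to exactly 0
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

# Target gene depends on only 2 of 10 candidate regulators; lasso keeps
# those two (with the correct signs) and zeroes out the rest.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=200)
beta = lasso_cd(X, y, lam=20.0)
```

This sparsity is what lets DD-lasso pick out a small parent set for each gene even when all genes are offered as potential regulators.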


Merge the subnetworks

After the subnetworks are re-learnt using either CLINDE or DD-lasso, the subnetworks can simply be unioned to obtain the final estimate of the GRN, because the overlapping nature of the subnetworks avoids the need for post-processing of the cross edges.


Experiment Results

This section evaluates the effectiveness of our proposed algorithm. Since it is difficult to find known large gene networks and sufficiently long time series data of the involved genes, we mainly rely on synthetic data, where we know the underlying gene network and there is no lack of expression data. On the other hand, it is useful to evaluate the proposed algorithm on real data, even though the time series data may not be long enough or of high enough quality. For this purpose we use the human transcription factor regulatory networks from [35] as the known GRNs and use GDS4238, obtained from the Gene Expression Omnibus, as expression data.


Performance metrics

We first describe the performance metrics used in evaluating the proposed algorithm. It is more appropriate to focus on correctly predicting the presence of links rather than their absence, due to the sparse nature of GRNs. As in [25], we assess the performance of the learning algorithm on three aspects, namely Links (a link is considered correct if and only if both the gene pair and the direction are correct), Delays (considered correct if and only if both the link and the time delay T_ij are correct) and Effects (considered correct if and only if both the link and the sign of the effect a_ij are correct). For each aspect, we look at three metrics respectively, namely


Recall = TP/(TP+FN), Precision = TP/(TP+FP) and F-score = 2·Recall·Precision/(Recall+Precision), where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
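For the Links aspect, these metrics can be computed directly from edge sets, as in this straightforward sketch:

```python
def link_metrics(true_edges, pred_edges):
    """Recall, precision and F-score over directed links (edge sets)."""
    tp = len(true_edges & pred_edges)   # true positives
    fp = len(pred_edges - true_edges)   # false positives
    fn = len(true_edges - pred_edges)   # false negatives
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f

# Example: 3 true links; 2 are recovered plus 1 spurious prediction,
# giving recall = precision = F-score = 2/3.
true = {(0, 1), (1, 2), (2, 3)}
pred = {(0, 1), (1, 2), (3, 1)}
r, p, f = link_metrics(true, pred)
```

The Delays and Effects variants only differ in also requiring the delay or the sign of the effect on each matched link to agree.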


Synthetic data generation

Generation of network: We first describe how to generate a small network of size k, where each gene has a maximum of M0 parents and the maximum delay is T0. We represent a network of size k as a pair of k-by-k matrices ([a_ij], [T_ij]), where a_ij, if non-zero, is the coefficient of the link i → j with the associated time delay T_ij, and a_ij is 0 if there is no link from i to j. For each gene i, 1 ≤ i ≤ k, we:

1. Uniformly choose the number of parents s_i from {1, …, M0}.

2. Randomly choose its parents P_i ⊆ {1, …, k}, where |P_i| = s_i.

3. For u ∈ P_i, set a_iu = ρ_iu z_iu, where ρ_iu is uniformly chosen from {−1, 1} and z_iu is uniformly chosen from (0.5, 1.5).

4. For u ∈ P_i, uniformly choose T_iu from {1, …, T0}.

5. For u ∉ P_i, set a_iu = 0 and T_iu = 0.

6. Lastly, scale the coefficients to make the network stable.

We generate a large network of n genes with the structure illustrated in Figure 1, where each subnetwork has size n0, as follows:

1. Generate K = ⌈n/n0⌉ subnetworks {([a_ij^(u)], [T_ij^(u)])}, 1 ≤ u ≤ K, each of size n0 as above.

2. Generate a subnetwork ([a_ij^(0)], [T_ij^(0)]) of size K.

3. Set A′ = [a′_ij] and T′ = [T′_ij] as follows. For 1 ≤ u ≤ K, set a′_pq = a_ij^(u) and T′_pq = T_ij^(u), where p = (u−1)n0 + i and q = (u−1)n0 + j. For each a_ij^(0) ≠ 0, set a′_pq = a_ij^(0) and T′_pq = T_ij^(0), where p is uniformly chosen from {(i−1)n0 + 1, …, (i−1)n0 + n0} and q is uniformly chosen from {(j−1)n0 + 1, …, (j−1)n0 + n0}. Set a′_ij = T′_ij = 0 otherwise.

4. Generate a random permutation π of {1, …, n}.

5. Output the final network as ([a′_π(i)π(j)], [T′_π(i)π(j)]), 1 ≤ i, j ≤ n.

Essentially we first generate ⌈n/n0⌉ subnetworks, each of size n0, and then generate a network of size ⌈n/n0⌉ to connect them. The final permutation is to prevent the subnetworks from being easily identified from the indices.
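The small-network generation can be sketched as follows. The stability scaling in the last step is a stand-in heuristic (shrinking the coefficients until the spectral radius of |A| is below 1), since the text does not specify the exact scaling used.

```python
import numpy as np

def gen_small_network(k, M0, T0, rng):
    """Generate ([a_ij], [T_ij]) for a network of size k, each gene having
    at most M0 parents, with delays in {1, ..., T0} on the links."""
    a = np.zeros((k, k))
    T = np.zeros((k, k), dtype=int)
    for i in range(k):
        s = rng.integers(1, M0 + 1)                  # number of parents s_i
        parents = rng.choice(k, size=s, replace=False)
        for u in parents:
            # coefficient rho * z with rho in {-1, 1}, z in (0.5, 1.5)
            a[i, u] = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.5)
            T[i, u] = rng.integers(1, T0 + 1)        # delay in {1, ..., T0}
    # Stand-in stability heuristic: shrink so spectral radius of |a| < 1.
    radius = max(abs(np.linalg.eigvals(np.abs(a))))
    if radius >= 1:
        a *= 0.9 / radius
    return a, T

a, T = gen_small_network(k=20, M0=4, T0=4, rng=np.random.default_rng(2))
```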


Generation of expression data:

Having generated the network ([a_ij], [T_ij]), and given the parameters σ² and α (which control the variance and the gaussianity of the error terms, respectively), we generate the expression data with m time points as follows:

1. For −T0 < t ≤ 0 and 1 ≤ j ≤ n, set x_j(t) = e_j(t), where each e_j(t) is generated from N(0, 1).

2. For 1 ≤ t ≤ m, generate x_j(t) using the model above with the given error terms.

We use random replicates; for each replicate, we first generate a time series of 200 points and then take prefixes of different lengths to give different numbers of time points.

For CLINDE alone on the large networks, with score thresholds of 2, 3 and 4: for fewer time points (m=20 and m=50), using a score threshold of 2 is usually slightly better than 3 and 4; for more time points (m=100 and m=200), using a score threshold of 3 is usually better, though for m=200 and α=3 it is sometimes better to use 4. Therefore, in the following, if space is limited, we show the results of score threshold 3 for CLINDE.

For a given m and α, the F-score usually decreases only slightly from Links to Delays and Effects, which suggests that it is more usual to predict both the delay and the effect of a link correctly, rather than to get only one correct but not the other. Therefore, in the following, we focus mainly on the performance of Effects.


Results of CLINDE and DD-lasso alone on the large networks:

Tables 4 and 5 show the median Effects F-scores of CLINDE and DD-lasso alone on the n=500 and n=1000 networks respectively, where the medians are taken over 20 replicates; the score threshold for CLINDE is 3 (see also Tables S1 and S2). Firstly, when the number of time points is small (m=20), both CLINDE and DD-lasso perform badly; DD-lasso is slightly better, but the performance is too poor to be conclusive.

As m increases, CLINDE performs increasingly well, but for DD-lasso, while the recall increases quickly, the precision does not improve much or even decreases, leading to poor F-scores. For m ≥ 50, the F-score of CLINDE is better than that of DD-lasso. Secondly, for both CLINDE and DD-lasso, for a given m, there is only a slight difference in F-score for different σ and α, except for m=200, where the drop in F-score is more prominent for CLINDE at α=3. This shows that both CLINDE and DD-lasso are quite robust to slight deviations from gaussian error terms, and to different variances of the error terms.


Results of CLINDE and DD-lasso by first decomposing the large networks:

Tables 6 and 7 show the median Effects F-scores of first inferring an initial GRN by CLINDE (referred to as CL-init below), then decomposing the initial GRN (Component, Parents and XParents), then using CLINDE or DD-lasso to infer the subnetworks (referred to as CL-sub and DD-sub respectively), and finally merging the subnetworks. The score threshold for CLINDE is 3 for both the initial GRN and the subnetworks. Table 6 shows the results for the n=500 networks, while Table 7 shows the results for the n=1000 networks. We also perform one-sided Wilcoxon signed-rank tests of whether the median F-scores of CL-sub and DD-sub are better than that of CL-init for m=100 and m=200. The p-values are shown in Table 8, where entries larger than 0.1 are omitted.

For small numbers of time points (m=20 and m=50), CL-init is better than CL-sub. As m increases, the performance of CL-sub gets closer to, but is still worse than, that of CL-init for m < 100. For m=100 and m=200, the performance of CL-sub is comparable to, or sometimes slightly better than, CL-init. For DD-sub, for small m, the performance is poor. Presumably this is the combined effect of poor decomposition because of a poor initial GRN, and of the subnetworks still being too large for DD-lasso to handle when the number of time points is small, even after decomposing the initial GRN into (possibly overlapping) subnetworks. Although the decomposition stops only when every component is smaller than 60, Parents and XParents also include the parents of the component, so the subnetwork size for Parents and XParents may be larger than 60. From our preliminary analysis, the subnetwork sizes are around 100. From our experience and the results in [25], for a network of size k, DD-lasso needs at least k to 2k time points to perform decently. Therefore, when m=100, DD-sub becomes comparable to CL-sub and slightly worse than CL-init. When m=200, the performance of DD-sub becomes better than CL-init and CL-sub, although DD-lasso performs poorly when applied to the large network directly (Tables 4 and 5).

Comparing Component, Parents and XParents, there is not much difference for either CL-sub or DD-sub when m is small. When m is larger, the trend is not very clear for CL-sub, but for DD-sub, Parents and XParents are better than Component, and Parents is slightly better than XParents. This shows that including the parents of the component helps, presumably because it helps to recover the links across subnetworks. Also, the comparison of Parents and XParents shows that removing the links between parents not in the component does not help much, presumably because DD-lasso can already effectively remove indirect effects when given sufficient time points.

These results show that the strategy of using CLINDE to infer an initial GRN, decomposing it into subnetworks, and then using DD-lasso (and sometimes CLINDE) to infer the subnetworks can be better than using CLINDE or DD-lasso alone on the large network.


Real data pre-processing

Human transcription factor regulatory networks: It is rare to find large known gene regulatory networks. Fortunately, [35] has compiled the regulatory networks of 475 sequence-specific human transcription factors across 41 diverse cell and tissue types, using genome-wide maps of in vivo DNaseI footprints. These networks should serve a useful role in evaluating GRN learning algorithms. But we cannot use the whole GRNs, as some constituent genes lack the expression data used below. Table 9 shows the sizes of the TF GRNs before and after retaining only the genes with expression data used below. Also, since the TF GRNs give only regulatory links with direction, but not the regulatory effects (excitatory or repressive) nor the time delays, in the following we evaluate only the Links performance.

Expression data: Even with high-throughput technologies, it is not common to find long time series with small equidistant time steps. We have found GDS4238, which is the time series expression data of human bronchial epithelial cell response to influenza virus, viral RNA and interferon-beta. Unfortunately these experiments are not performed specifically for the cell/tissue types of the GRNs in Table 9. This time series is relatively long, with samples at 0.25, 0.5, 1, 1.5, 2, 4, 6, 8, 12 and 18 hours after treatment, and has multiple treatments with up to 2 replicates. We have filtered out some treatments with fewer time points, resulting in 14 (heterogeneous) series.

The synthetic data results show that DD-lasso on subnetworks gives good results with a moderate amount of data, so here we focus on DD-lasso for the subnetworks. The Links performances of CLINDE on the initial GRN, DD-lasso alone on the network, and DD-lasso on the subnetworks (with Component, Parents and XParents) are shown in Tables 10-13. In Tables 10 and 11 we show the performances on the first 4 TF GRNs with different score thresholds, for raw and log10 expression data respectively; in Tables 12 and 13 we show the performances on all TF GRNs with score threshold 3, for raw and log10 respectively, because the performances are quite similar for the other TF GRNs. Secondly, comparing raw and log10, CLINDE on the initial GRN, DD-lasso alone and DD-lasso on the subnetworks all perform slightly better using raw. Thirdly, even DD-lasso alone performs better than CLINDE on the initial GRN, with much better recall but slightly worse precision. DD-lasso on the subnetworks with Parents or XParents is better than DD-lasso alone, though DD-lasso with Component performs worse than CLINDE. The precision of DD-lasso with Component is better than that of DD-lasso alone and CLINDE alone, but the recall is much worse, leading to a poorer F-score. Decomposing into subnetworks by Parents and XParents improves mainly the recall, and a little the precision, for DD-lasso. This suggests that the decomposition is not done very well because the initial GRN is not very good, but including the parents of the components allows DD-lasso to recover more links, which alleviates the problem of the bad decomposition. The performance of DD-lasso on subnetworks varies with the score threshold of CLINDE on the initial GRN, but the trend is not very clear, though all are better than DD-lasso alone.

We have also performed one-sided Wilcoxon signed-rank tests comparing the median F-scores of CLINDE on the initial GRN, DD-lasso on the initial GRN, and DD-lasso with decomposition using Parents or XParents, with different score thresholds, using the 41 TF GRNs as the cases. The p-values are shown in Table 14. We can see that except for raw with a score threshold of 2, in all other cases the p-values are very small, showing that the improvements from decomposing into subnetworks are statistically significant. The results on real data show that there is good potential in our proposed strategy of decomposing an initial GRN into subnetworks and then inferring the subnetworks with a possibly different algorithm, even when the initial GRN is not estimated very accurately and thus the decomposition is not accurate either.


Discussions and Conclusion

Ideally, there would be prior knowledge on the decomposition of a large network into subnetworks, so that GRN learning could work on smaller networks using the limited expression data. Lacking this prior knowledge, the expression data has to be utilized in the decomposition.

Table 11: Links performances of log10 for the first 4 real data sets, for CLINDE for the initial GRN, and DD-lasso for the whole network and for the subnetworks. CL-init is CLINDE on the initial network with score threshold st. DD is DD-lasso alone on the network, so it is the same for different st for a given TF GRN. DD-Component, DD-Parents and DD-XParents are DD-lasso on the subnetworks with different decompositions. R is Recall, P is Precision, F is F-score.

In this paper, we use CLINDE to infer the initial GRN because it can handle delays and indirect effects, and performs better than DD-lasso on small numbers of time points relative to the network size. It is natural to ask whether using another algorithm for the initial GRN would give better results. This should be investigated in future work. The goal of inferring the initial GRN is to decompose the large network into subnetworks. In this paper, we have used the measure of "edge betweenness" used in community structure discovery algorithms, and the simple strategy of successively removing the edge with the highest edge betweenness until the components are below a certain size. Different algorithms used in community structure discovery could be investigated and compared with this simple strategy. Furthermore, clustering based algorithms could be considered to approximate the subnetworks, though most probably customized algorithms would need to be developed to handle delays and indirect effects, as mentioned before.

The synthetic data results show that while DD-lasso alone performs poorly on the large network, its performance dramatically improves after decomposing the initial GRN into subnetworks. The real data results show that even when the initial GRN is not very good, DD-lasso on the subnetworks nevertheless performs better than DD-lasso alone. The results raise the possibility of combining CLINDE and DD-lasso more tightly, for example by using CLINDE to estimate the parents and the associated delays for each gene, then using lasso as in DD-lasso to further remove indirect effects, without explicitly estimating the (possibly overlapping) subnetworks.

To conclude, we have proposed an algorithm that first infers an initial GRN using CLINDE, then decomposes it into possibly overlapping subnetworks, then infers each subnetwork by either CLINDE or DD-lasso, and finally merges the subnetworks, to try to alleviate the problem of insufficient expression data relative to the network size. We have tested this algorithm on synthetic data of networks with 500 and 1000 genes. We have also tested it on real data on 41 human TF regulatory networks. Results show that our proposed algorithm does improve the GRN learning performance over using either CLINDE or DD-lasso alone on the large networks.

CL-init is CLINDE on the initial network. DD is DD-lasso alone on the network. DD-Parents and DD-XParents are DD-lasso on the subnetworks with different decompositions. The tests are based on the 41 TF GRNs.

Figure 2: Indirect effects across subnetworks. The dotted lines indicate the indirect effects across the two subnetworks.


Figure 1: Structure of subnetworks. The overall GRN consists of some subnetworks connected by a few links.


Figure 5: Example of edge betweenness. An example of edge betweenness calculated for each edge. The edge across subnetworks will get a higher edge betweenness because the shortest path(s) of vertices in different subnetworks have to go through that edge.