Reverse Engineering Cellular Networks with Information Theoretic Methods

Building mathematical models of cellular networks lies at the core of systems biology. It involves, among other tasks, the reconstruction of the structure of interactions between molecular components, which is known as network inference or reverse engineering. Information theory can help in the goal of extracting as much information as possible from the available data. A large number of methods founded on these concepts have been proposed in the literature, not only in biology journals, but in a wide range of areas. Their critical comparison is difficult due to the different focuses and the adoption of different terminologies. Here we attempt to review some of the existing information theoretic methodologies for network inference, and clarify their differences. While some of these methods have achieved notable success, many challenges remain, among which we can mention dealing with incomplete measurements, noisy data, counterintuitive behaviour emerging from nonlinear relations or feedback loops, and computational burden of dealing with large data sets.


Introduction
Systems biology is an interdisciplinary approach for understanding complex biological systems at the system level [1]. Integrative mathematical models, which represent the existing knowledge in a compact and unambiguous way, play a central role in this field. They facilitate the exchange and critical examination of this knowledge, make it possible to test whether a theory is applicable, and yield quantitative predictions about the system's behaviour without having to carry out new experiments. In order to be predictive, models have to be "fed" (calibrated) with data. Although the conceptual foundations of systems biology had been laid several decades earlier, during most of the 20th century the experimental data to support its models and hypotheses were missing [2]. With the development of high-throughput techniques in the 1990s, massive amounts of "omics" data were generated, providing the push required for the rapid expansion of the field.
This review paper deals with the problem of constructing models of biological systems from experimental data. More specifically, we are interested in reverse engineering cellular systems that can be naturally modeled as biochemical networks. A network consists of a set of nodes and a set of links between them. In cellular networks the nodes are molecular entities such as genes, proteins, or metabolites. The links or edges are the interactions between nodes, such as the chemical reactions where the molecules are present, or a higher level abstraction such as a regulatory interaction involving several reactions. Thus cellular networks can be classified, according to the type of entities and interactions involved, as gene regulatory, metabolic, or protein signaling networks.
The main goal of the methods studied here is to infer the network structure, that is, to deduce the set of interactions between nodes. This means that the focus is put on methods that, taking metabolism as an example, aim at finding which metabolites appear in the same reaction, as opposed to methods that aim at the detailed characterization of the reaction (determining its rate law and estimating the values of its kinetic constants). The latter is a related but different part of the inverse problem, and will not be considered here.
Some attributes of the entities are measurable, such as the concentration of a metabolite or the expression level of a gene. When available, those data are used as the input for the inference procedure. For that purpose, attributes are considered random variables that can be analyzed with statistical tools. For example, dependencies between variables can be expressed by correlation measures. Information theory provides a rigorous theoretical framework for studying the relations between attributes.
Information theory can be viewed as a branch of applied mathematics, or more specifically as a branch of probability theory [3], that deals with the quantitative study of information. The foundational moment of this discipline took place in 1948 with the publication by C.E. Shannon of the seminal paper "A mathematical theory of communication" [4]. Indeed, that title is a good definition of information theory. Originally developed for communication engineering applications, the use of information theory was soon extended to related fields such as electrical engineering, systems and control theory, computer science, and also to more distant disciplines like biology [5]. Nowadays the use of information-theoretic concepts is common in a wide range of scientific fields.
The fundamental notion of information theory is entropy, which quantifies the uncertainty of a random variable and is used as a measure of information. Closely related to entropy is mutual information, a measure of the amount of information that one random variable provides about another. These concepts can be used to infer interactions between variables from experimental data, thus allowing reverse engineering of cellular networks.
A number of surveys, which approach the network inference problem from different points of view, including information-theoretic and other methods, have been published in the past. To the best of the authors' knowledge, the first survey focused on identification of biological systems dates back to 1978 [6]. More recently, one of the first reviews to be published in the "high-throughput data era" was [7]. Methods that determine biochemical reaction mechanisms from time series concentration data were reviewed in [8], including parameter estimation. In the same area, a more recent perspective (with a narrower scope) can be found in [9]. Techniques developed specifically for gene regulatory network inference were covered in [10], which included an extensive overview of the different modeling formalisms, and in [11], as well as in other reviews that also include methods applicable to other types of networks [12,13]. Methods for the reconstruction of plant gene co-expression networks from transcriptomic data were reviewed in [14]. The survey [15] covers not only network inference but also other topics, although it does not discuss information theoretic methods. Recently, [16] studied the advantages and limitations of network inference methods, classifying them according to the strategies that they use to deal with underdetermination. Other reviews do not attempt to cover all the literature, but instead focus on a subset of methods on which they carry out detailed comparisons, such as [17-20].
The problem of network inference has been investigated in many different communities. The aforementioned reviews deal mostly with biological applications, and were published in journals of the bioinformatics, systems biology, microbiology, molecular biology, physical chemistry, and control engineering communities. Many more papers on the subject are regularly published in journals from other areas. Systems identification, a part of systems and control theory, is a discipline in its own right, with a rich literature [21,22]. However, in contrast to biology, it deals mostly with engineered systems, and hence its approaches are frequently difficult to adapt or not appropriate for reverse engineering complex biological systems. Other research areas such as machine learning have produced many theoretically rigorous results about network inference, but these results are rarely transferred to biological applications. In this survey we intend to give a broad overview of the literature from the different (although sometimes partially overlapping) communities that deal with the network inference problem with an information theoretic approach. Thus we review papers from the fields of statistics, machine learning, systems identification, chemistry, physics, and biology. We focus on those contributions that have been or are more likely to be applied to cellular networks.

Correlations, Probabilities and Entropies
Biological sciences have a long history of using statistical tools to measure the strength of dependence among variables. An early example is the correlation coefficient r [23,24], which quantifies the linear dependence between two random variables X and Y. It is commonly referred to as the Pearson correlation coefficient, and it is defined as the covariance of the two variables divided by the product of their standard deviations. For n samples it is:
$$r(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{1}$$
where $X_i$, $Y_i$ are the n data points, and $\bar{X}$, $\bar{Y}$ are their averages. If both variables are linearly independent, r(X, Y) = 0 and knowledge of one of them provides no linear information about the other. In the opposite situation, where one variable is completely determined by the other through a linear relation, all the data points lie on a line and r(X, Y) = ±1.
It should be noted that in this context the word "linear" may be used in two different ways. When applied to a deterministic system, it means that the differential equations that define the evolution of the system's variables in time are linear. On the other hand, when applied to the relationship between two variables, it means that the two-dimensional plot of their values (not of the variables as a function of time, X(t), Y (t), but of one variable as a function of the other, X(Y )) forms a straight line, independently of the character of the underlying system.
A related concept, partial correlation, measures the dependence between two random variables X and Y after removing the effect of a third variable Z. It can be expressed in terms of the correlation coefficients as follows:
$$r(X,Y|Z) = \frac{r(X,Y) - r(X,Z)\,r(Y,Z)}{\sqrt{\left(1 - r^2(X,Z)\right)\left(1 - r^2(Y,Z)\right)}}$$
The Pearson coefficient is easy to calculate and symmetric, and its range of values has a clear interpretation. However, as noted in [25,26], it uses only the second moment of the pair distribution function, Equation (1), discarding all higher moments. For certain strong nonlinearities, and for correlations extending over several variables, moments higher than the second of the pair probability distribution function may contribute, and an alternative measure of dependence may be more appropriate. Hence the Pearson coefficient is not an accurate way of measuring nonlinear correlations, which are ubiquitous in biology.
A more general measure is mutual information, a fundamental concept of information theory defined by Shannon [4]. To define it we must first introduce the concept of entropy, which is the uncertainty of a single random variable. Let X be a discrete random variable with alphabet χ and probability mass function p(x). The entropy is:
$$H(X) = -\sum_{x \in \chi} p(x) \log p(x) \tag{3}$$
where log is usually the logarithm to the base 2, although the natural logarithm may also be used. Entropy can be interpreted as the expected value of log(1/p(X)), that is
$$H(X) = E_p\left[\log \frac{1}{p(X)}\right]$$
The joint entropy of a pair of discrete random variables (X, Y) is
$$H(X,Y) = -\sum_{x}\sum_{y} p(x,y) \log p(x,y)$$
Conditional entropy H(Y|X) is the entropy of a random variable conditional on the knowledge of another random variable. It is the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable. For example, for two random variables X and Y we have
$$H(Y|X) = \sum_{x} p(x)\, H(Y|X=x) = -\sum_{x}\sum_{y} p(x,y) \log p(y|x)$$
The joint entropy and the conditional entropy are related so that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other:
$$H(X,Y) = H(X) + H(Y|X)$$
The relative entropy is a measure of the distance between two distributions with probability functions p(x) and q(x). It is defined as:
$$D(p\|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
The relative entropy is always non-negative, and it is zero if and only if p = q. However, it is not a true distance because it is not symmetric and it does not satisfy the triangle inequality.
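As a purely illustrative aid, the quantities introduced in this section can be evaluated for a small discrete distribution. The following minimal Python sketch is not taken from any of the cited works; the joint distribution and all function names are hypothetical, and probabilities are assumed to be given exactly rather than estimated from data. It checks numerically that the joint entropy equals H(X) + H(Y|X) and that the relative entropy is not symmetric.

```python
from math import log2

def entropy(p):
    """Shannon entropy (bits) of a pmf given as a dict {outcome: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

def marginal(joint, axis):
    """Marginal pmf of a joint pmf {(x, y): p} along axis 0 (X) or axis 1 (Y)."""
    out = {}
    for key, v in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + v
    return out

def conditional_entropy(joint):
    """H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)."""
    px = marginal(joint, 0)
    return -sum(v * log2(v / px[x]) for (x, _), v in joint.items() if v > 0)

def relative_entropy(p, q):
    """D(p||q); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(v * log2(v / q[x]) for x, v in p.items() if v > 0)

# Hypothetical joint distribution of two binary variables X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = marginal(joint, 0)

h_xy = entropy(joint)                     # joint entropy H(X,Y)
h_x = entropy(px)                         # marginal entropy H(X)
h_y_given_x = conditional_entropy(joint)  # conditional entropy H(Y|X)

# Two hypothetical distributions used to illustrate the asymmetry of D.
p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}
d_pq, d_qp = relative_entropy(p, q), relative_entropy(q, p)
```

Comparing `h_xy` with `h_x + h_y_given_x`, and `d_pq` with `d_qp`, reproduces the two properties stated in the text.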

Mutual Information
Mutual information, I, is a special case of relative entropy: it is the relative entropy between the joint distribution, p(x, y), and the product distribution, p(x)p(y), that is:
$$I(X,Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
Linfoot [27] proposed the use of mutual information as a generalization of the correlation coefficient and introduced a normalization with values ranging from 0 to 1:
$$r_I(X,Y) = \sqrt{1 - e^{-2 I(X,Y)}}$$
The mutual information is a measure of the amount of information that one random variable contains about another. It can also be defined as the reduction in the uncertainty of one variable due to the knowledge of another. Mutual information is related to entropy as follows:
$$I(X,Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) \tag{11}$$
Finally, the conditional mutual information measures the amount of information shared by two variables when a third variable is known:
$$I(X,Y|Z) = H(X|Z) - H(X|Y,Z)$$
If Y and Z carry the same information about X, the conditional mutual information I(X, Y|Z) is zero.
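The relations above can be checked on a toy example. The sketch below is a hypothetical illustration (function names and the distribution are our own, and natural logarithms are assumed so that Linfoot's normalization can be applied directly). It computes the mutual information as a relative entropy, compares it with the entropy-based expression, and verifies that the conditional mutual information vanishes when Z is an exact copy of Y.

```python
from math import exp, log, sqrt

def marginal(joint, keep):
    """Marginalise a pmf {tuple: p} onto the tuple positions listed in `keep`."""
    out = {}
    for key, v in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + v
    return out

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * log(v) for v in p.values() if v > 0)

def mutual_information(joint):
    """I(X,Y): relative entropy between p(x,y) and p(x)p(y), in nats."""
    px, py = marginal(joint, [0]), marginal(joint, [1])
    return sum(v * log(v / (px[(x,)] * py[(y,)]))
               for (x, y), v in joint.items() if v > 0)

def conditional_mutual_information(joint3):
    """I(X,Y|Z) = H(X|Z) - H(X|Y,Z) for a pmf over triples (x, y, z)."""
    def h(keep):
        return entropy(marginal(joint3, keep))
    # H(X|Z) = H(X,Z) - H(Z) and H(X|Y,Z) = H(X,Y,Z) - H(Y,Z)
    return (h([0, 2]) - h([2])) - (h([0, 1, 2]) - h([1, 2]))

def linfoot(i_nats):
    """Linfoot's normalization of mutual information to the range [0, 1)."""
    return sqrt(max(0.0, 1.0 - exp(-2.0 * i_nats)))

# Hypothetical joint pmf of two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
mi = mutual_information(joint)
mi_from_entropies = (entropy(marginal(joint, [0])) + entropy(marginal(joint, [1]))
                     - entropy(joint))

# Z is an exact copy of Y, so Y adds nothing about X once Z is known.
same_info = {(x, y, y): v for (x, y), v in joint.items()}
cmi = conditional_mutual_information(same_info)
normalized = linfoot(mi)
```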
The relationship between entropy, joint entropy, conditional entropy, and mutual information is graphically depicted in Figure 1. Note that until now we have considered implicitly discrete variables; in the case of continuous variables the ∑ are replaced by ∫ . For more detailed descriptions of these concepts, see [28]. Mutual information is a general measure of dependencies between variables. This suggests its application for evaluating similarities between datasets, which allows for inferring interaction networks of any kind: chemical, biological, social, or other. If two components of a network interact closely, their mutual information will be large; if they are not related, it will be theoretically zero. As already mentioned, mutual information is more general than the Pearson correlation coefficient, which is only rigorously applicable to linear correlations with Gaussian noise. Hence, mutual information may be able to detect additional non-linear correlations undetectable for the Pearson coefficient, as has been shown for example in [29] where it was demonstrated with metabolic data.
In practice, for the purpose of network inference, mutual information cannot be analytically calculated, because the underlying network is unknown. Therefore, it must be estimated from experimental data, a task for which several algorithms of different complexity can be used. The most straightforward approximation is to use a "naive" algorithm that partitions the data into a number of bins of a fixed width, and approximates the probabilities by the frequencies of occurrence. This simple approach has the drawback that the mutual information is systematically overestimated [30]. A more sophisticated option uses adaptive partitioning, where the bin size in the partition depends on the density of data points. This is the case of the classic algorithm by Fraser and Swinney [31], which manages to improve the estimations although at the cost of increasing the computation times considerably. A more efficient version of this method was presented in [32], together with a comparison of alternative numerical algorithms. Another computationally demanding option is to use kernel density estimation for estimating the probability density p(x), which can then be applied to estimation of mutual information [33]. Recently, Hausser and Strimmer [34] presented a procedure for the effective estimation of entropy and mutual information from small sample data, and demonstrated its application to the inference of high-dimensional gene association networks. More details about the influence of the choice of estimators of mutual information on the network inference problem can be found in [35,36], including numerical comparisons between several methods.
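To make the "naive" estimator concrete, the following sketch implements fixed-width binning with frequencies used as probabilities, and compares it with the Pearson coefficient on synthetic data. It is only an illustration under stated assumptions: the data, the number of bins (10), the sample size and all names are hypothetical, and none of the more refined estimators cited above is implemented.

```python
import random
from collections import Counter
from math import log2, sqrt

def bin_index(v, lo, hi, bins):
    """Index of the fixed-width bin containing v (the maximum goes to the last bin)."""
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * bins), bins - 1)

def binned_mi(xs, ys, bins=10):
    """Naive plug-in estimate of I(X,Y) in bits, using fixed-width bins."""
    n = len(xs)
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    bx = [bin_index(x, lo_x, hi_x, bins) for x in xs]
    by = [bin_index(y, lo_y, hi_y, bins) for y in ys]
    cxy, cx, cy = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # sum p(x,y) log[p(x,y) / (p(x) p(y))] with frequencies as probabilities
    return sum(c / n * log2(c * n / (cx[i] * cy[j])) for (i, j), c in cxy.items())

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so that the example is reproducible
xs = [rng.uniform(-1, 1) for _ in range(2000)]
quadratic = [x * x + rng.gauss(0, 0.05) for x in xs]   # nonlinear dependence on xs
unrelated = [rng.uniform(-1, 1) for _ in range(2000)]  # independent of xs

r_quad = pearson(xs, quadratic)      # close to zero despite the strong dependence
mi_quad = binned_mi(xs, quadratic)   # clearly positive
mi_indep = binned_mi(xs, unrelated)  # small but positive: the estimator is biased upwards
```

The small positive value obtained for the independent pair illustrates the systematic overestimation mentioned above, and is one reason why a threshold is needed to separate interaction from non-interaction.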
Another issue related to estimation of mutual information is the determination of a threshold to distinguish interaction from non-interaction. One solution is given by the minimum description length (MDL) principle [37], which states that, given a dataset and several candidate models, one should choose the model that provides the shortest encoding of the data. The MDL principle seeks to achieve a good trade-off between model complexity and accuracy of data fitting. It is similar to other criteria for model selection, such as the popular Akaike (AIC) and Bayesian information criterion (BIC). Like the BIC, the MDL takes into account the sample size, and minimizes both the model coding length and the data coding length.
We finish this section by mentioning that a discussion of some issues concerning the definition of multivariate dependence has been presented in [38]. The aim of the analysis was to clarify the concept of dependence among different variables, in order to be able to distinguish between independent (additive) and cooperative (multiplicative) regulation.

Generalizations of Information Theory
In the 1960s Marko proposed a generalization of Shannon's information theory called bidirectional information theory [39,40]. Its aim was to distinguish the direction of information flow, which was considered necessary to describe the generation and processing of information by living beings. The concept of Directed Transinformation (DTI) was introduced as an extension of mutual information (which Shannon called transinformation). Let us consider two entities $M_1$ and $M_2$, with X being a symbol of $M_1$ and Y of $M_2$. Then the directed transinformation from $M_1$ to $M_2$ is
$$T_{12} = \lim_{n \to \infty} E\left[\log \frac{p(X|X_n Y_n)}{p(X|X_n)}\right]$$
where $p(X|X_n)$ represents the conditional probability for the occurrence of X when n previous symbols $X_n$ of the own process are known, and $p(X|X_n Y_n)$ is the conditional probability for the occurrence of X when n previous symbols $X_n$ of the own process as well as of the other process, $Y_n$, are known. The directed transinformation from $M_2$ to $M_1$, $T_{21}$, is defined in the same way, replacing X with Y and vice versa. The sum of both transinformations equals Shannon's transinformation or mutual information, that is:
$$T_{12} + T_{21} = I(X,Y)$$
Marko's work was continued two decades later by Massey [41], who defined the directed information $I(X^N \to Y^N)$ from a sequence $X^N$ to a sequence $Y^N$ as a slight modification of the directed transinformation:
$$I(X^N \to Y^N) = \sum_{n=1}^{N} I(X^n, Y_n | Y^{n-1})$$
where $X^n = (X_1, \ldots, X_n)$ denotes the first n symbols of the sequence. If no feedback between Y and X is present, then the directed information and the traditional mutual information are equal,
$$I(X^N \to Y^N) = I(X^N, Y^N)$$
Another generalization of Shannon entropy is the concept of nonextensive entropy. Shannon entropy (also called Boltzmann-Gibbs entropy, which we denote here as $H_{BG}$) agrees with standard statistical mechanics, a theory that applies to a large class of physical systems: those for which ergodicity is satisfied at the microscopic dynamical level. Standard statistical mechanics is extensive, that is, it assumes that, for a system S consisting of N independent subsystems $S_1, \ldots, S_N$, it holds that
$$H_{BG}(S) = \sum_{i=1}^{N} H_{BG}(S_i)$$
This property is a result of the short-range nature of the interactions typically considered (think, for example, of the entropy of two subsets of an ideal gas). However, there are many systems where long-range interactions exist, which thus violate this hypothesis, a fact not always made explicit in the literature. To overcome this limitation, in 1988 Constantino Tsallis [42] proposed the following generalization of the Boltzmann-Gibbs entropy:
$$H_q = k\,\frac{1 - \sum_{i=1}^{\omega} p_i^q}{q - 1} \tag{16}$$
where k is a positive constant that sets the dimension and scale, $p_i$ are the probabilities associated with the ω distinct configurations of the system, and $q \in \mathbb{R}$ is the so-called entropic parameter, which characterizes the generalization. The entropic parameter characterizes the degree of nonextensivity; in the limit $q \to 1$ the Boltzmann-Gibbs entropy is recovered, $H_1 = H_{BG} = -k \sum_{i=1}^{\omega} p_i \ln p_i$, where $k = k_B$ is the Boltzmann constant. The generalized entropy $H_q$ is the basis of what has been called non-extensive statistical mechanics, as opposed to the standard statistical mechanics based on $H_{BG}$. Indeed, $H_q$ is non-extensive for systems without correlations; however, for complex systems with long-range correlations the reverse is true: $H_{BG}$ is non-extensive and is not an appropriate entropy measure, while $H_q$ becomes extensive [43]. It has been suggested that the degree of nonextensivity can be used as a measure of complexity [44]. Scale-free networks [45,46] are an example of systems for which $H_q$ is extensive and $H_{BG}$ is not. Scale-free networks are characterized by the fact that their vertex connectivities follow a scale-free power-law distribution. It has been recognized that many complex systems from different areas (technological, social, and biological) are of this type. For these systems, it has been suggested that it is more meaningful to define the entropy in the form of Equation (16) instead of Equation (3).
By defining the q-logarithm function as $\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}$, the nonextensive entropy can be expressed in a form similar to that of the Boltzmann-Gibbs entropy, Equation (3):
$$H_q = k \sum_{i=1}^{\omega} p_i \ln_q \frac{1}{p_i}$$
and analogously one can define nonextensive versions of conditional entropy or mutual information.
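The behaviour of the nonextensive entropy can be explored numerically. The sketch below is a hypothetical illustration with k = 1 and arbitrary example distributions; the function names are our own. It evaluates the entropy through the q-logarithm, checks that values of q close to 1 approach the Boltzmann-Gibbs (Shannon) entropy, and verifies the standard pseudo-additivity relation for two independent subsystems, $H_q(A,B) = H_q(A) + H_q(B) + (1-q) H_q(A) H_q(B)$, which is the sense in which $H_q$ is non-extensive in the absence of correlations.

```python
from math import log

def q_log(x, q):
    """q-logarithm ln_q(x) = (x**(1-q) - 1)/(1-q); equals ln(x) for q = 1."""
    if q == 1:
        return log(x)
    return (x ** (1 - q) - 1) / (1 - q)

def tsallis_entropy(p, q, k=1.0):
    """Nonextensive entropy H_q = k * sum_i p_i ln_q(1/p_i)."""
    return k * sum(pi * q_log(1 / pi, q) for pi in p if pi > 0)

def independent_joint(pa, pb):
    """Joint pmf of two independent subsystems as a flat list of products."""
    return [a * b for a in pa for b in pb]

# Hypothetical distributions for two independent subsystems A and B.
pa = [0.5, 0.3, 0.2]
pb = [0.7, 0.3]
q = 2.0

h_a, h_b = tsallis_entropy(pa, q), tsallis_entropy(pb, q)
h_ab = tsallis_entropy(independent_joint(pa, pb), q)
pseudo_additive = h_a + h_b + (1 - q) * h_a * h_b  # differs from h_a + h_b when q != 1

shannon_a = -sum(pi * log(pi) for pi in pa)
near_limit = tsallis_entropy(pa, 1.000001)  # approaches shannon_a as q -> 1
```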

Detecting Interactions: Correlations and Mutual Information
Early examples of techniques based on mutual information in a biological context can be found in [47], where it was used to determine eukaryotic protein coding regions, and [48], where it was applied for analyzing covariation of mutations in the V3 loop of the HIV-1 envelope protein. Since then many more examples have followed, with the first applications in network inference appearing in the second half of the 1990s. Specifically, in the 1998 Pacific Symposium on Biocomputing two methods for reverse engineering gene networks based on mutual information were presented. The REVEAL [49] algorithm used Boolean (on/off) models of gene networks and inferred interactions from mutual information. It was implemented in C and tested on synthetic data, with good results reported for a network of 50 elements and 3 inputs per element. In another contribution from the same symposium [50] mutual information was normalized as
$$I_N(X_i,X_j) = \frac{I(X_i,X_j)}{\max\left(H(X_i),H(X_j)\right)}$$
and a distance matrix was then defined as
$$d(X_i,X_j) = 1 - I_N(X_i,X_j)$$
The distance matrix was used to find correlated patterns of gene expression from time series data. The normalization presents two advantages: the value of the distance d is between 0 and 1, and $d(X_i,X_i) = 0$.
Two years later, Butte et al. [51] proposed a technique for finding functional genomic clusters in RNA expression data, called mutual information relevance networks. Pair-wise mutual information between genes was calculated as in Equation (11), and it was hypothesized that associations with high mutual information were biologically related. Simultaneously, the same group published a related method [52] that used the correlation coefficient r, Equation (1), instead of mutual information. The method, known as relevance networks (RN), was used to discover functional relationships between RNA expression and chemotherapeutic susceptibility. In this work the similarity of patterns of features was rated using pair-wise correlation coefficients defined as $\hat{r}^2 = \frac{r}{|r|}\, r^2$. Butte et al. mentioned a number of advantages of their method over previous ones. First, relevance networks are able to display nodes with varying degrees of cross-connectivity, while phylogenetic-type trees such as the aforementioned [50] can only link each feature to one other feature, without additional links. Second, phylogenetic-type trees cannot easily cluster different types of biological data. For example, they can cluster genes and anticancer agents separately, but do not easily determine associations between genes and anticancer agents. Third, clustering methods such as [50] may ignore genes whose expression levels are highly negatively correlated across cell lines; in contrast, in RN negative and positive correlations are treated in the same way and are used in clustering.
Pearson's correlation coefficient was also used in [53] to assemble a gene coexpression network, with the ultimate goal of finding genetic modules that are conserved across evolution. DNA microarray data from humans, flies, worms, and yeast were used, and 22,163 coexpression relationships were found. The predictions implied by some of the discovered links were experimentally confirmed, and cell proliferation functions were identified for several genes.
In [54] transcriptional gene networks in human and mouse were reverse-engineered using a simple mutual information approach, where the expression values were discretized into three bins. The relevance of this study is due to the massive datasets used: 20,255 gene expression profiles from human samples, from which 4,817,629 connections were inferred. Furthermore, a subset of not previously described protein-protein interactions was experimentally validated. For a discussion on the use of information theory to detect protein-protein interactions, see Section 3.1 of [55].
The aforementioned methods were developed mostly for gene expression data. In contrast, the next two techniques, Correlation Metric Construction (CMC) and Entropy Metric Construction (EMC), aimed at reverse engineering chemical reaction mechanisms, and used time series data (typically metabolic) of the concentration of the species present in the mechanism. In CMC [56] the time-lagged correlations between two species are calculated as
$$S_{ij}(\tau) = \left\langle \left(x_i(t) - \bar{x}_i\right)\left(x_j(t+\tau) - \bar{x}_j\right) \right\rangle$$
where $\langle\,\rangle$ denotes the time average over all measurements, and $\bar{x}_i$ is the time average of the concentration time series of species i. From these functions a correlation matrix R(τ) is calculated; its elements are
$$r_{ij}(\tau) = \frac{S_{ij}(\tau)}{\sqrt{S_{ii}(0)\,S_{jj}(0)}}$$
Then the elements $d^{CMC}_{ij}$ of the distance matrix are obtained as
$$d^{CMC}_{ij} = \left(c_{ii} - 2c_{ij} + c_{jj}\right)^{1/2} = \sqrt{2}\,\left(1 - c_{ij}\right)^{1/2}, \qquad c_{ij} = \max_{\tau} \left|r_{ij}(\tau)\right|$$
Finally, Multidimensional Scaling (MDS) is applied to the distance matrix, yielding a configuration of points representing each of the species, which are connected by lines that are estimates for the connectivities of the species in the reactions. Furthermore, the temporal ordering of the correlation maxima provides an indication of the causality of the reactions. CMC was first tested on a simulated chemical reaction mechanism [56], and was later successfully applied to the reconstruction of the glycolytic pathway from experimental data [57]. More recently, it has been integrated in a systematic model building pipeline [58], which includes not only inference of the chemical network, but also data preprocessing, automatic model family generation, model selection and statistical analysis.
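The first stage of CMC, from time series to distances, can be sketched as follows. This is a hypothetical illustration rather than the implementation of [56]: the synthetic series, the maximum lag of 10 samples, the handling of the shortened overlap at nonzero lags, and all names are assumptions, and the subsequent MDS step is not included.

```python
import random
from math import sqrt

def lagged_cov(xi, xj, tau):
    """S_ij(tau): average of (x_i(t) - mean_i)(x_j(t + tau) - mean_j) over the overlap."""
    n = len(xi)
    mi, mj = sum(xi) / n, sum(xj) / n
    if tau >= 0:
        pairs = list(zip(xi[:n - tau], xj[tau:]))
    else:
        pairs = list(zip(xi[-tau:], xj[:n + tau]))
    return sum((a - mi) * (b - mj) for a, b in pairs) / len(pairs)

def lagged_corr(xi, xj, tau):
    """r_ij(tau) = S_ij(tau) / sqrt(S_ii(0) S_jj(0))."""
    return lagged_cov(xi, xj, tau) / sqrt(lagged_cov(xi, xi, 0) * lagged_cov(xj, xj, 0))

def cmc_distance(xi, xj, max_lag):
    """Return (d_ij, lag of the correlation maximum) with d_ij = sqrt(2 (1 - c_ij))."""
    lags = range(-max_lag, max_lag + 1)
    best_lag = max(lags, key=lambda tau: abs(lagged_corr(xi, xj, tau)))
    # Finite-sample estimates can slightly exceed 1, so the value is clipped.
    c = min(abs(lagged_corr(xi, xj, best_lag)), 1.0)
    return sqrt(2.0 * (1.0 - c)), best_lag

rng = random.Random(1)  # fixed seed so that the example is reproducible
base = [rng.gauss(0, 1) for _ in range(303)]
x1 = base[3:]                               # hypothetical species 1
x2 = base[:-3]                              # species 2 repeats species 1 three samples later
x3 = [rng.gauss(0, 1) for _ in range(300)]  # species 3 is unrelated to species 1

d12, lag12 = cmc_distance(x1, x2, max_lag=10)
d13, lag13 = cmc_distance(x1, x3, max_lag=10)
```

In this toy case the related pair is placed much closer than the unrelated one, and the lag of the correlation maximum recovers the delay, which is the information CMC uses to suggest temporal ordering.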
The Entropy Metric Construction method, EMC [25,26], is a modification of CMC that replaces the correlation measures with entropy-based distances. Originally, EMC was applied to an artificial reaction mechanism with pseudo-experimental data, for which it was reported to outperform CMC. Recently [59] it has been tested with the same glycolytic pathway reconstructed by CMC [57], with both methods yielding similar results.
Recently, CMC/EMC has inspired a method [60] that combines network inference by time-lagged correlation and estimation of kinetic parameters with a maximum likelihood approach. It was applied to a test case from pharmacokinetics: the deduction of the metabolic pathway of gemcitabine, using synthetic and experimental data.
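The empirical distance correlation (DCOR) was presented in [61,62]. Given a random sample of n random vectors (X, Y), Euclidean distance matrices are calculated as
$$a_{kl} = |X_k - X_l|, \qquad A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}$$
where $\bar{a}_{k\cdot}$, $\bar{a}_{\cdot l}$ and $\bar{a}_{\cdot\cdot}$ are the row, column and grand means of $a_{kl}$, and similarly for $B_{kl}$, obtained from $b_{kl} = |Y_k - Y_l|$. Then the empirical distance covariance $\nu_n(X,Y)$ is the nonnegative quantity defined by
$$\nu_n^2(X,Y) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl}$$
Similarly, $\nu_n^2(X) = \nu_n^2(X,X) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl}^2$, and the distance correlation DCOR = $R_n(X,Y)$ is the square root of
$$R_n^2(X,Y) = \begin{cases} \dfrac{\nu_n^2(X,Y)}{\sqrt{\nu_n^2(X)\,\nu_n^2(Y)}}, & \nu_n^2(X)\,\nu_n^2(Y) > 0 \\ 0, & \nu_n^2(X)\,\nu_n^2(Y) = 0 \end{cases}$$
Unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. Furthermore, it is defined for X and Y in arbitrary dimensions, rather than only for univariate quantities. DCOR is a good example of a method that has gained recognition inside a research community (statistics) but whose merits have hardly become known to scientists working in other areas (such as the applied biological sciences). Some exceptions have recently appeared.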
In [63] it was used for the detection of long-range concerted motion in proteins. In a study concerning mortality [64], significant distance correlations were found between death ages, lifestyle factors, and family relationships. As for applications in network inference, [65] compared eight statistical measures, including distance covariance, evaluating their performance in gene association problems (the other measures being Spearman rank correlation, Weighted Rank Correlation, Kendall, Hoeffding's D measure, Theil-Sen, Rank Theil-Sen, and Pearson). Interestingly, the least efficient methods turned out to be Pearson and distance covariance.
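For reference, the empirical distance correlation described above can be computed directly from its definition. The following sketch is a hypothetical, unoptimized illustration (quadratic in the number of samples; the data and names are our own), and it does not reproduce the benchmark of [65]. Samples are represented as tuples so that vectors of any dimension can be used.

```python
import random
from math import sqrt

def centered_distances(sample):
    """Double-centred Euclidean distance matrix A_kl for a list of vectors."""
    n = len(sample)
    a = [[sqrt(sum((u - v) ** 2 for u, v in zip(p, r))) for r in sample] for p in sample]
    row = [sum(r) / n for r in a]
    grand = sum(row) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[a[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

def distance_correlation(xs, ys):
    """Empirical distance correlation R_n(X, Y) of two paired samples."""
    n = len(xs)
    A, B = centered_distances(xs), centered_distances(ys)
    idx = [(k, l) for k in range(n) for l in range(n)]
    v_xy = sum(A[k][l] * B[k][l] for k, l in idx) / n ** 2
    v_x = sum(A[k][l] ** 2 for k, l in idx) / n ** 2
    v_y = sum(B[k][l] ** 2 for k, l in idx) / n ** 2
    if v_x * v_y == 0:
        return 0.0
    return sqrt(max(v_xy, 0.0) / sqrt(v_x * v_y))

rng = random.Random(2)  # fixed seed so that the example is reproducible
xs = [(rng.uniform(-1, 1),) for _ in range(200)]
quadratic = [(x[0] ** 2,) for x in xs]                   # nonlinear function of xs
unrelated = [(rng.uniform(-1, 1),) for _ in range(200)]  # independent of xs

dcor_self = distance_correlation(xs, xs)
dcor_quad = distance_correlation(xs, quadratic)
dcor_indep = distance_correlation(xs, unrelated)
```

For finite samples the value obtained for independent variables is small but not exactly zero, so in practice a significance test or threshold is still required.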
The Maximal Information Coefficient (MIC) is another recently proposed measure of association between variables [66]. It was designed with the goal of assigning similar values to equally noisy relationships, independently of the type of association, a property termed "equitability". The main idea behind MIC is that if two variables (X, Y) are related, their relationship can be encapsulated by a grid that partitions the data in the scatter plot. Thus, all possible grids are explored (up to a maximal resolution that depends on the sample size) and for each m-by-n grid the largest possible mutual information I(X, Y) is computed. Then the mutual information values are normalized between 0 and 1, ensuring a fair comparison between grids of different dimensions. The MIC measure is defined as the maximum of the normalized mutual information values [67]:
$$\mathrm{MIC}(X,Y) = \max_{|X||Y| < B} \frac{I(X,Y)}{\log_2\left(\min\left(|X|,|Y|\right)\right)}$$
where |X| and |Y| are the number of bins for each variable and B is the maximal resolution. This methodology has been applied to data sets in global health, gene expression, major-league baseball, and the human gut microbiota [66], demonstrating its ability to identify known and novel relationships. The claims about MIC's performance expressed in the original publication [66] have generated some criticism. In a comment posted on the publication web site, Simon and Tibshirani pointed out that, since there is "no free lunch" in statistics, tests designed to have high power against all alternatives have low power in many important situations. Hence, the fact that MIC has no preference for some alternatives over others (equitability) can be counterproductive in many cases. Simon and Tibshirani reported simulation results showing that MIC has lower power than DCOR for most relationships, and that in some cases MIC is less powerful than Pearson correlation as well. 
These deficiencies would indicate that MIC will produce many false positives in large scale problems, and that the use of the distance correlation measure is more advisable. In a similar comment, Gorfine et al. opposed the claim that non-equitable methods are less practical for data exploration, arguing that both DCOR and their own HHG method [68] are more powerful than the test based on MIC. At the moment of writing this article, the debate about the concept of equitability and its relation to mutual information and the MIC is very active at the arXiv website, with opposite views such as the ones expressed in [67,69].
Recently, the nonextensive entropy proposed by Tsallis has also been used in the context of reverse-engineering gene networks [70]. Given some temporal data, the method fixes a target gene $x_i$ and looks for the group of genes g that minimizes the nonextensive conditional entropy of $x_i$ given g, $H_q(x_i|g)$, for a fixed q. The reported results show an improvement in the inference accuracy when nonextensive entropies are adopted instead of traditional entropies. The best computational results in terms of reduction of the number of false positives were obtained with the range of values 2.5 < q < 3.5, which corresponds to subextensive entropy. This claim stresses the importance of the additional tuning parameter, q, allowed by the Tsallis entropy. The fact that q has to be fixed a priori is a drawback for its use in reverse engineering applications, since it is unclear how to choose its value.
Finally, we discuss some methods that use the minimum description length principle (MDL) described in Subsection 2.2. MDL was applied in [71] to infer gene regulatory networks from time series data, reporting good results with both synthetic datasets and experimental data from Drosophila melanogaster. While that method eliminated the need for a user-defined threshold value, it introduced the need for a user-defined tuning parameter to balance the contributions of model and data coding lengths. To overcome this drawback, in [72] it was proposed to use as the description length a theoretical measure derived from a maximum likelihood model. This alternative was reported to improve the accuracy of reconstructions of Boolean networks. The same goal was pursued in [73], where a network inference method that included a predictive MDL criterion was presented. This approach incorporated not only mutual information, but also conditional mutual information.

Distinguishing between Direct and Indirect Interactions
A number of methods have been proposed that use information theoretic considerations to distinguish between direct and indirect interactions. The underlying idea is to establish whether the variation in a variable can be explained by the variations in a subset of other variables in the system.
The Entropy Reduction Technique, ERT [25,26], is an extension of EMC that outputs the list of species X* with which a given species Y reacts, in order of the reaction strength. The mathematical formulation stems from the observation that, if a variable Y is completely independent of a set of variables X, then H(Y|X) = H(Y); otherwise H(Y|X) < H(Y). The ERT algorithm [25] proceeds iteratively. Starting from an empty set X*, each cycle adds to X* the variable that minimizes the conditional entropy of Y given the enlarged set. The procedure stops when a further addition does not decrease H(Y|X*), or when all species except Y are already in X*; otherwise the cycle is repeated. Intuitively, the method determines whether the nonlinear variation in a variable Y, as given by its entropy, is explainable by the variations of a subset (possibly all) of the other variables in the system. This technique leads to an ordered set of variables that control the variation in Y. A methodology called MIDER (Mutual Information Distance and Entropy Reduction), which combines and extends features of the ERT and EMC techniques, has been recently developed, and a MATLAB implementation is available as a free software toolbox [59].
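The greedy entropy-reduction cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [25] or of MIDER: the data are assumed to be already discretized, entropies are estimated with a simple plug-in (frequency count) estimator, and the names `joint_entropy`, `conditional_entropy`, `entropy_reduction` and the tolerance `tol` are hypothetical.

```python
import itertools
import numpy as np

def joint_entropy(columns):
    """Plug-in Shannon entropy (bits) of the joint distribution of the given columns."""
    _, counts = np.unique(columns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(data, target, cond):
    """H(Y | X) = H(Y, X) - H(X); with an empty conditioning set this is H(Y)."""
    y = data[:, [target]]
    if not cond:
        return joint_entropy(y)
    x = data[:, cond]
    return joint_entropy(np.hstack([y, x])) - joint_entropy(x)

def entropy_reduction(data, target, tol=1e-9):
    """Greedily add the variable that most reduces H(Y | selected set)."""
    remaining = [j for j in range(data.shape[1]) if j != target]
    selected = []
    h_best = conditional_entropy(data, target, selected)
    while remaining:
        scores = {j: conditional_entropy(data, target, selected + [j])
                  for j in remaining}
        j_star = min(scores, key=scores.get)
        if h_best - scores[j_star] <= tol:   # no further entropy reduction: stop
            break
        selected.append(j_star)
        remaining.remove(j_star)
        h_best = scores[j_star]
    return selected, h_best

# Toy data: every combination of three binary species, with Y = X0 AND X1.
states = np.array(list(itertools.product([0, 1], repeat=3)))
y = (states[:, 0] & states[:, 1]).reshape(-1, 1)
data = np.hstack([states, y])                # columns: X0, X1, X2, Y
selected, h_final = entropy_reduction(data, target=3)
print(selected, h_final)                     # X0 and X1 are selected, X2 is not
```

Because the selection is greedy, a relation in which no single variable reduces the entropy on its own (for example Y = X0 XOR X1) would stop this sketch at the first cycle.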
The ARACNE method [74][75][76] is an information-theoretic algorithm for identifying transcriptional interactions between gene products, using microarray expression profile data. It consists of two steps.
In the first step, the mutual information between pairs of genes is calculated as in Equation (11), and pairs that have a mutual information greater than a threshold I_0 are identified as candidate interactions. This part is similar to the method of mutual information relevance networks [51]. In the second step, the Data Processing Inequality (DPI) is applied to discard indirect interactions. The DPI is a well-known property of mutual information [28] that simply states that, if X → Y → Z forms a Markov chain, then I(X, Y) ≥ I(X, Z). The ARACNE algorithm examines each gene triplet for which all three MIs are greater than I_0 and removes the edge with the smallest MI value. In this way, ARACNE manages to reduce the number of false positives, which is a limitation of mutual information relevance networks. Indeed, when tested on synthetic data, ARACNE outperformed relevance networks and Bayesian networks. ARACNE has also been applied to experimental data, with the first application being reverse engineering of regulatory networks in human B cells [74]. If time-course data are available, a version of ARACNE that considers time delays [77] can be used.
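The two steps can be sketched on a precomputed mutual information matrix. This is a minimal illustration assuming a symmetric matrix `mi` and a threshold `i0` (both hypothetical names); it omits refinements of the published algorithm and marks edges for removal using the original MI values, so the result does not depend on the order in which triplets are visited.

```python
from itertools import combinations
import numpy as np

def aracne_prune(mi, i0):
    """Threshold the MI matrix, then drop the weakest edge of every closed triplet."""
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    adj = mi > i0                          # step 1: candidate interactions
    np.fill_diagonal(adj, False)
    remove = np.zeros_like(adj)
    for i, j, k in combinations(range(n), 3):
        if adj[i, j] and adj[j, k] and adj[i, k]:
            edges = [(i, j), (j, k), (i, k)]
            a, b = min(edges, key=lambda e: mi[e])   # step 2: DPI, weakest edge
            remove[a, b] = remove[b, a] = True
    return adj & ~remove

# Toy chain 0 -> 1 -> 2: the indirect pair (0, 2) has the smallest MI.
mi = np.array([[0.0, 0.8, 0.4],
               [0.8, 0.0, 0.7],
               [0.4, 0.7, 0.0]])
network = aracne_prune(mi, i0=0.1)
print(network.astype(int))
```

In this example the edges (0, 1) and (1, 2) are kept and the indirect edge (0, 2) is discarded.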
The definition of Conditional Mutual Information (12) clearly suggests its application for distinguishing between direct and indirect interactions. This is the idea underlying the method proposed in [78], which was tested on artificial and real (melanoma) datasets. The so-called direct connectivity metric (DCM) was introduced as a measure of the confidence in the prediction that two genes X and Y were connected. The DCM is defined as the following product:
$$\mathrm{DCM}(X, Y) = I(X, Y) \cdot \min_{Z \in V - XY} I(X, Y \mid Z)$$
where $\min_{Z \in V - XY} I(X, Y \mid Z)$ is the least conditional mutual information given any other gene Z. This method was compared with ARACNE and mutual information relevance networks [51] and was reported to outperform them for certain datasets.
The Context Likelihood of Relatedness technique, CLR [79], adds a correction step to the calculation of mutual information, comparing the value of the mutual information between a transcription factor X and a gene Y with the background distribution of mutual information for all possible interactions involving X or Y. In this way the network context of the interactions is taken into account. The main idea behind CLR is that the most probable interactions are not necessarily those with the highest MI scores, but those whose scores are significantly above the background distribution; the additional correction step helps to remove false correlations. CLR was validated [79] using E. coli data and known regulatory interactions from RegulonDB, and compared with other methods: relevance networks [52], ARACNE [74], and Bayesian networks [80]. It was reported [79] that CLR demonstrated a precision gain of 36% relative to the next best performing algorithm. In [81] CLR was compared with a module-based algorithm, LeMoNe (Learning Module Networks), using expression data and databases of known transcriptional regulatory interactions for E. coli and S. cerevisiae. It was concluded that module-based and direct methods retrieve distinct parts of the networks.
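The background correction of CLR can be sketched as follows. The z-score form used here (each MI value is standardized against the row of each of its two genes, negative values are clipped to zero, and the two z-scores are combined as the square root of the sum of their squares) is the commonly described formulation and is an assumption with respect to this text. The function name `clr_scores`, the inclusion of the diagonal in the background, and the toy matrix are hypothetical choices.

```python
import numpy as np

def clr_scores(mi):
    """Score each pair by how far its MI lies above the background of both genes."""
    mi = np.asarray(mi, dtype=float)
    mu = mi.mean(axis=1, keepdims=True)        # background mean for each gene
    sigma = mi.std(axis=1, keepdims=True)      # background spread for each gene
    sigma = np.where(sigma > 0, sigma, 1.0)    # guard against constant rows
    z = np.maximum(0.0, (mi - mu) / sigma)     # z[i, j]: MI(i, j) vs background of i
    return np.sqrt(z ** 2 + z.T ** 2)          # combine the two contexts

# Toy MI matrix: pair (0, 1) stands far above the background of both genes,
# pair (2, 3) is weak in absolute terms but notable relative to its context.
mi = np.array([[0.0, 0.9, 0.1, 0.1],
               [0.9, 0.0, 0.1, 0.1],
               [0.1, 0.1, 0.0, 0.2],
               [0.1, 0.1, 0.2, 0.0]])
scores = clr_scores(mi)
print(np.round(scores, 2))
```

Pairs whose MI is below the background of both genes, such as (0, 2) here, receive a score of zero regardless of their raw MI value.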
The Minimum Redundancy Networks technique (MRNET [82]) was developed for inferring genetic networks from microarray data. It is based on a previous method for feature selection in supervised learning called maximum relevance/minimum redundancy (MRMR [83][84][85]). Given an output variable Y and a set of possible input variables X, MRMR ranks the inputs according to a score that is the difference between the mutual information with the output variable Y (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). In this way MRMR is intended to select, from the least redundant variables, those that have the highest mutual information with the target. Thus, direct interactions should be ranked higher than indirect interactions. The MRNET method uses the MRMR principle in the context of network inference. Comparisons with ARACNE, relevance networks, and CLR were carried out using synthetically generated data, showing that MRNET is competitive with these methods. The R/Bioconductor package minet [86] includes the four methods mentioned before, which can be used with four different entropy estimators and several validation tools. A known limitation of algorithms based on forward selection, such as MRNET, is that their results strongly depend on the first variable selected. To overcome this limitation, an enhanced version named MRNETB was presented in [87]; it improves the original method by using a backward selection strategy followed by a sequential replacement.
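The MRMR ranking step for a single target can be sketched as follows, assuming that the mutual information values have already been estimated. The names `mrmr_rank`, `relevance` (MI of each input with the target) and `redundancy` (pairwise MI between inputs), as well as the numerical values, are hypothetical; this is not the minet implementation.

```python
import numpy as np

def mrmr_rank(relevance, redundancy):
    """Forward selection: relevance minus mean redundancy with already ranked inputs."""
    relevance = np.asarray(relevance, dtype=float)
    redundancy = np.asarray(redundancy, dtype=float)
    remaining = list(range(len(relevance)))
    ranked = []

    def score(j):
        if not ranked:                       # first pick: relevance only
            return relevance[j]
        return relevance[j] - redundancy[j, ranked].mean()

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy case: input 1 is informative about the target only through input 0
# (high redundancy with it), while input 2 is a weaker but independent regulator.
relevance = [0.9, 0.7, 0.5]
redundancy = np.array([[0.0, 0.8, 0.1],
                       [0.8, 0.0, 0.1],
                       [0.1, 0.1, 0.0]])
order = mrmr_rank(relevance, redundancy)
print(order)
```

Although input 1 has a higher MI with the target than input 2, it is ranked last because most of that information is already provided by input 0. The example also shows the dependence on the first selected variable noted above.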
A statistical learning strategy called three-way mutual information (MI3) was presented in [88]. It was designed to infer transcriptional regulatory networks from high-throughput gene expression data. The procedure is in principle sufficiently general to be applied to other reverse engineering problems. Consider three variables R_1, R_2, and T, where R_1 and R_2 are possible "regulators" of the target variable, T. The MI3 metric is then defined as a function of these three variables [88]. Both MI3 and ERT try to detect higher order interactions and, for this purpose, they use scores calculated from entropies H(*), and 2- and 3-variable joint entropies, H(*,*) and H(*,*,*). MI3 was specifically designed to detect cooperative activity between two regulators in transcriptional regulatory networks, and it was reported to outperform other methods such as Bayesian networks, two-way mutual information and a discrete version of MI3. A method [89] exploiting three-way mutual information and CLR was the best scorer in the 2nd conference on Dialogue for Reverse Engineering Assessments and Methods (DREAM2) Challenge 5 (unsigned genome-scale network prediction from blinded microarray data) [90].
A similar measure, averaged three-way mutual information (AMI3), was defined in [91] as a four-term score I_ijk, where X_i represents the target gene, and Y_j, Y_k are two regulators that may regulate X_i cooperatively. The first two terms are the traditional mutual information. The third term represents the cooperative activity between Y_j and Y_k, and the fourth term ensures that Y_j and Y_k regulate X_i directly (without regulation between Y_j and Y_k): if Y_j regulates X_i indirectly through Y_k, both the third and fourth terms will increase, cancelling each other and not leading to an increase in I_ijk. In [91] this score was combined with non-linear ordinary differential equation (ODE) modeling for inferring transcriptional networks from gene expression, using network-assisted regression. The resulting method was tested with synthetic data, reporting better performance than other algorithms (ARACNE, CLR, MRNET and SA-CLR). It was also applied to experimental data from E. coli and yeast, leading to new predictions.
The Inferelator [92] is another freely available method for network inference. It was designed for inferring genome-wide transcriptional regulatory interactions, using standard regression and model shrinkage techniques to model the expression of a gene or cluster of genes as a function of the levels of transcription factors and other influences. Its performance was demonstrated in [93], where the transcriptional regulatory network of Halobacterium salinarum NRC-1 was reverse-engineered and its responses in 147 experiments were successfully predicted. Although the Inferelator is not itself based on mutual information, it has performed best when combined with MI methods [94]. Specifically, it has been used jointly with CLR.
Other methods have relied on correlation measures instead of mutual information for detecting indirect interactions. A method to construct approximate undirected dependency graphs from large-scale biochemical data using partial correlation coefficients was proposed in [95]. In a first step, networks are built based on correlations between chemical species. Either the Pearson correlation coefficient, Equation (1), or the Spearman correlation coefficient may be chosen. The Spearman correlation coefficient is simply the Pearson correlation coefficient between the ranked variables, and measures how well the relationship between two variables can be described using a monotonic function. In a second step, edges for which the partial correlation coefficient, Equation (2), falls below a certain threshold are eliminated. This procedure was tested both on artificial and on experimental data, and a software implementation was made available online.
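Both steps can be illustrated with a short sketch. It assumes that the partial correlation in question is the standard first-order coefficient, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)), and that the data contain no ties, so that ranks can be obtained by sorting. The function names and the simulated chain x → z → y are hypothetical and are not taken from [95].

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two samples."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(x):
    """Ranks 0..n-1 of the values in x (assumes no ties)."""
    return np.argsort(np.argsort(x))

def spearman(x, y):
    """Spearman coefficient: Pearson correlation between the ranked variables."""
    return pearson(ranks(x), ranks(y))

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

# Simulated chain x -> z -> y: x and y are correlated only through z.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = x + 0.5 * rng.normal(size=5000)
y = z + 0.5 * rng.normal(size=5000)

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)
print(r_xy, r_xy_given_z)
```

The plain correlation between x and y is high, whereas the partial correlation given z is close to zero, so a threshold on the latter would eliminate the indirect edge.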
In [96] partial correlation was used to reduce the number of candidate genes. In the partial correlation Equation (2) the Pearson correlation coefficient r was replaced by the Spearman correlation coefficient, since the latter was found to be more robust for detecting nonlinear relationships between genes. However, the authors acknowledged that the issue deserved further investigation.
In [97] both Pearson and Spearman correlation coefficients were tested; no practical differences were found between the two measures. Furthermore, no clear differences were detected between linear (correlation) and nonlinear (mutual information) scores. For detecting indirect interactions, three different tools were used: partial correlation, conditional mutual information, and the data processing inequality, which were found to improve noticeably on the performance of their non-conditioned counterparts. These results were obtained from artificially generated metabolic data.

Detecting Causality
Inferring the causality of an interaction is a complicated task, with deep theoretical implications. This topic has been extensively investigated by Pearl [98]. Philosophical considerations aside, from a practical viewpoint we can intuitively assign a causal relation from A to B if A and B are correlated and A precedes B. Thus, causal interactions can be inferred if time series data are available.
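The intuition that "A precedes B" can be illustrated with a time-lagged correlation. The sketch below is a deliberately simple illustration, not one of the methods reviewed in this section: it assumes two equally sampled series and searches for the lag at which their correlation is strongest. The function names `lagged_correlation` and `best_lag` and the synthetic series are hypothetical.

```python
import numpy as np

def lagged_correlation(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag]."""
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    return float(np.corrcoef(a, b)[0, 1])

def best_lag(a, b, max_lag):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: abs(lagged_correlation(a, b, k)))

# Synthetic example: b follows a with a delay of three samples, plus noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = np.roll(a, 3) + 0.1 * rng.normal(size=500)

lag = best_lag(a, b, max_lag=10)
print(lag)
```

A positive best lag means that changes in a are followed by changes in b, which is compatible with a directed interaction from A to B. A common driver acting on both variables with different delays would produce the same pattern, so the lag alone does not establish causality.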
It was already mentioned that CMC can determine directionality because it takes time series information into account, as shown in [57] for a glycolytic path. Another network reconstruction algorithm based on correlations was proposed in [99] to deduce directional connections based on gene expression measurements. Here the directionality came from the asymmetry of the conditional correlation matrix, which expressed the correlation between two genes given that one of them was perturbed. Another approach for causal correlations was presented in Opgen-Rhein and Strimmer [100]. Once the correlation network is obtained, a partial ordering of the nodes is established by multiple testing of the log-ratio of standardized partial variances. In this way a directed acyclic causal network is obtained as a subgraph of the original network. This method was validated using gene expression data of Arabidopsis thaliana.
Some methods based on mutual information have taken causality into account. One of them is EMC [26], which is essentially the same method as CMC with a different definition of distance. Another one is the already mentioned TimeDelay-ARACNE method [77]. The concept of directed information described in Section 2 has also been applied to the reconstruction of biological networks. In [101] it was used for reconstructing gene networks; the method was validated using small random networks and simulated data from the E. coli network for flagella biosynthesis. It was reported that, for acyclic graphs with 7 or fewer genes with summation operations only, the method was able to infer all edges. In [102] directed information was used for finding interactions between transcription factor modules and target co-regulated genes. The validity of the approach was demonstrated using publicly available embryonic kidney and T-cell microarray datasets. DTInfer, an R package for the inference of gene-regulatory networks from microarrays using directed information, was presented in [103]. It was tested on E. coli data, predicting five novel TF-target gene interactions; one of them was validated experimentally. Finally, directed information has also been used in a neuroscience context [104], for inferring causal relationships in ensemble neural spike train recordings.

Previous Comparisons
As mentioned in the Introduction, there are some publications where detailed analyses and comparisons of some of the methods reviewed here have been carried out. For example, in [17] the performance of some popular algorithms was tested under different conditions and on both synthetic and real data. Comparisons were twofold: on the one hand, conditional similarity measures like partial Pearson correlation (PPC), graphical Gaussian models (GGM), and conditional mutual information (CMI) were compared with Pearson correlation (PC) and mutual information (MI); on the other hand, linear measures (PC and PPC) were compared with non-linear ones (MI, CMI, and the Data Processing Inequality, DPI).
The differences and similarities of three other network inference algorithms-ARACNE, Context Likelihood of Relatedness (CLR), and MRNET-were studied in [35], where the influence of the entropy estimator was also taken into account. The performance of the methods was found to be dependent on the quality of the data: when complete and accurate measurements were available, the MRNET method combined with the Spearman correlation appeared to be the most effective. However, in the case of noisy and incomplete data, the best performer was CLR combined with Pearson correlation.
The same three inference algorithms, together with the Relevance Networks method (RN), were compared in [18], using network-based measures in combination with ensemble simulations. In [105] Emmert-Streib studied the influence of environmental conditions on the performance of five network inference methods: ARACNE, BC3NET [106], CLR, C3NET [107], and MRNET. The comparison of their results for three different conditions led to the conclusion that different statistical methods yield comparable but condition-specific results. The tutorial [19] evaluated the performance of ARACNE, BANJO, NIR/MNI, and hierarchical clustering, using synthetic data. More recently, [20] compared four tools for inferring regulatory networks (ARACNE, BANJO, MIKANA, and SiGN-BN), applying them to new microarray datasets generated in human endothelial cells.

Conclusions, Successes and Challenges
A number of methods for inferring the connectivity of cellular networks have been reviewed in this article. Most of these methods, which have been published during the last two decades, adopt some sort of information theoretic approach for evaluating the probability of the interactions between network components. We have tried to review as many techniques as possible, surveying the literature from areas such as systems and computational biology, bioinformatics, molecular biology, microbiology, biophysics, physical and computational chemistry, physics, systems and process control, computer science, and statistics. Some methods were designed for specific purposes (e.g., reverse engineering gene regulatory networks), while others aim at a wider range of applications. We have attempted to give a unified treatment to methods from different backgrounds, clarifying their differences and similarities. When available, comparisons of their performances have been reported.
It has been shown that information theory provides a solid foundation for developing reverse engineering methodologies, as well as a framework to analyze and compare them. Concepts such as entropy or mutual information are of general applicability and make no assumptions about the underlying systems; for example, they do not require linearity or absence of noise. Furthermore, most information theoretic methods are scalable and can be applied to large-scale networks with hundreds or thousands of components. This gives them in some cases an advantage over other techniques that have higher computational cost, such as Bayesian approaches.
A conclusion of this review is that no single method outperforms the rest for all problems. There is "no free lunch": methods that are carefully tailored to a particular application or dataset may yield better results than others when applied to that particular problem, but frequently perform worse when applied to different systems. Therefore, when facing a new problem it may be useful to try several methods. Interestingly, the results of the DREAM challenges show that community predictions are more reliable than individual predictions [108][109][110]; that is, the best option is to take into account the reconstructions provided by all the methods, as opposed to trusting only the best performing ones.
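One simple way of forming such a community prediction, assumed here purely for illustration, is to average the rank that each method assigns to every candidate edge. In the sketch below, `scores` is a hypothetical array with one row per inference method and one column per candidate edge, where a higher value means higher confidence; ties are broken by column order.

```python
import numpy as np

def average_rank(scores):
    """Mean rank of each edge across methods (rank 1 = most confident)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=1)              # per method, best edge first
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks.mean(axis=0)

# Three hypothetical methods scoring four candidate edges.
scores = np.array([[0.9, 0.2, 0.6, 0.1],
                   [0.7, 0.8, 0.3, 0.1],
                   [0.5, 0.1, 0.9, 0.2]])
mean_ranks = average_rank(scores)
consensus = np.argsort(mean_ranks)                   # edges, most supported first
print(mean_ranks, consensus)
```

Working with ranks rather than raw scores avoids having to put the confidence values of different methods on a common scale.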
In the last fifteen years different information theoretic methods have been successfully applied to the reverse engineering of genetic networks. The resulting predictions about existing interactions have enabled the design of new experiments and the generation of hypotheses that were later confirmed experimentally, demonstrating the ability of computational modeling to provide biological insight. Another indication of the success of the information theoretic approach is that in recent years methods that combine mutual information with other techniques have been among the top performers in the DREAM reverse engineering challenges [94]. Success stories have also appeared regarding their application to the reconstruction of chemical reaction mechanisms. One of the earliest was the validation of the CMC method, which was able to infer a significant part of the glycolytic path from experimental data.
Despite all the advances made in the last decades, the problem faced by these methods (inferring large-scale networks with nonlinear interactions from incomplete and noisy data) remains challenging. To progress towards that goal, several breakthroughs need to be achieved. A systematic way of determining causality that is valid for large-scale systems is still lacking. Computational and experimental procedures for identifying feedback loops and other complex structures are also needed. For these and other obstacles to be overcome, future developments should take the existing methodologies into account and build on their capabilities. We hope that this review will help researchers in that task.