Learning Loosely Connected Markov Random Fields

We consider the structure learning problem for graphical models that we call loosely connected Markov random fields, in which the number of short paths between any pair of nodes is small, and present a new conditional independence test based algorithm for learning the underlying graph structure. The novel maximization step in our algorithm ensures that the true edges are detected correctly even when there are short cycles in the graph. The number of samples required by our algorithm is C log p, where p is the size of the graph and the constant C depends on the parameters of the model. We show that several previously studied models are examples of loosely connected Markov random fields, and our algorithm achieves the same or lower computational complexity than the previously designed algorithms for individual cases. We also get new results for more general graphical models; in particular, our algorithm learns general Ising models on the Erdős-Rényi random graph G(p, c/p) correctly with running time O(np^5).

1. Introduction. In many models of networks, such as social networks and gene regulatory networks, each node in the network represents a random variable and the graph encodes the conditional independence relations among the random variables. A Markov random field is a particular such representation which has applications in a variety of areas (see [3] and the references therein). In a Markov random field, the lack of an edge between two nodes implies that the two random variables are independent, conditioned on all the other random variables in the network.
Structure learning, i.e., learning the underlying graph structure of a Markov random field, refers to the problem of determining if there is an edge between each pair of nodes, given i.i.d. samples from the joint distribution of the random vector. As a concrete example of structure learning, consider a social network in which only the participants' actions are observed. In particular, we do not observe, or are unable to observe, interactions between the participants. Our goal is to infer relationships among the nodes (participants) in such a network by understanding the correlations among the nodes. The canonical example used to illustrate such inference problems is the US Senate [4]. Suppose one has access to the voting patterns of the senators over a number of bills (and not their party affiliations or any other information). The question we would like to answer is the following: can we say that a particular senator's vote is independent of everyone else's when conditioned on a few other senators' votes? In other words, if we view the senators' actions as forming a Markov Random Field (MRF), we want to infer the topology of the underlying graph.
In general, learning high dimensional densely connected graphical models requires a large number of samples, and is usually computationally intractable. In this paper, we focus on a more tractable family which we call loosely connected MRFs. Roughly speaking, a Markov random field is loosely connected if the number of short paths between any pair of nodes is small. We show that many previously studied models are examples of this family. In fact, as densely connected graphical models are difficult to learn, some sparsity assumptions are necessary to make the learning problem tractable. Common assumptions include an upper bound on the node degree of the underlying graph [7,15], restrictions on the class of parameters of the joint probability distribution of the random variables to ensure correlation decay [7,15,2], lower bounds on the girth of the underlying graph [15], and a sparse, probabilistic structure on the underlying random graph [2]. In all these cases, the resulting MRFs turn out to be loosely connected. In this sense, our definition here provides a unified view of the assumptions in previous works.
However, loosely connected MRFs are not always easy to learn. Due to the existence of short cycles, the dependence over an edge connecting a pair of neighboring nodes can be approximately cancelled by some short non-direct paths between them, in which case correctly detecting this edge is difficult, as shown in the following example. This example is perhaps well-known, but we present it here to motivate our algorithm presented later.
Example 1.1. Consider three binary random variables $X_i \in \{0, 1\}$, $i = 1, 2, 3$. Assume $X_1, X_2$ are independent Bernoulli($\frac{1}{2}$) random variables and $X_3 = X_1 \oplus X_2$ with probability 0.9, where $\oplus$ means exclusive or. We note that this joint distribution is symmetric, i.e., we get the same distribution if we assume that $X_2, X_3$ are independent Bernoulli($\frac{1}{2}$) and $X_1 = X_2 \oplus X_3$ with probability 0.9. Therefore, the underlying graph is a triangle. However, it is not hard to see that the three random variables are pairwise marginally independent. For this simple example, previous methods in [15,3] fail to learn the true graph.
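The claim in Example 1.1 can be checked numerically. The following is a minimal sketch, assuming 0-based indices (so $X_1, X_2, X_3$ are coordinates 0, 1, 2) and using exact mutual information computed from the joint distribution; the helper names are hypothetical and not part of the paper.

```python
import itertools
import math

def joint_example_1_1(q=0.9):
    """Joint pmf of (X1, X2, X3): X1, X2 fair coins, X3 = X1 xor X2 w.p. q."""
    pmf = {}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        agree = (x3 == (x1 ^ x2))
        pmf[(x1, x2, x3)] = 0.25 * (q if agree else 1.0 - q)
    return pmf

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx (idx may be empty)."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[k] for k in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def cond_mutual_info(pmf, i, j, cond=()):
    """Exact I(X_i; X_j | X_cond) in nats."""
    cond = tuple(cond)
    p_ijc = marginal(pmf, (i, j) + cond)
    p_ic = marginal(pmf, (i,) + cond)
    p_jc = marginal(pmf, (j,) + cond)
    p_c = marginal(pmf, cond)
    total = 0.0
    for key, pr in p_ijc.items():
        if pr <= 0.0:
            continue
        xi, xj, xc = key[0], key[1], key[2:]
        ratio = pr * p_c[xc] / (p_ic[(xi,) + xc] * p_jc[(xj,) + xc])
        total += pr * math.log(ratio)
    return total

pmf = joint_example_1_1()
mi_marginal = cond_mutual_info(pmf, 0, 2)             # I(X1; X3)
mi_given_x2 = cond_mutual_info(pmf, 0, 2, cond=(1,))  # I(X1; X3 | X2)
print(mi_marginal, mi_given_x2)
```

The unconditional mutual information vanishes while the conditional one does not, which is the cancellation effect the example describes.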
We propose a new algorithm that correctly learns the graphs for loosely connected MRFs. For each node, the algorithm loops over all the other nodes to determine if they are neighbors of this node. The key step in the algorithm is a max-min conditional independence test, in which the maximization step is designed to detect the edges while the minimization step is designed to detect non-edges. The minimization step is used in several previous works such as [2,3]. The maximization step has been added to explicitly break the short cycles that can cause problems in edge detection. If the direct edge is the only short path between a pair of neighboring nodes, the dependence over the edge can be detected by a simple independence test. When there are other short paths between a pair of neighboring nodes, we first find a set of nodes that separates all the non-direct paths between them, i.e., after removing this set of nodes from the graph, the direct edge is the only short path connecting the two nodes. Then the dependence over the edge can again be detected by a conditional independence test where the conditioned set is the set above. In Example 1.1, $X_1$ and $X_3$ are unconditionally independent as the dependence over edge (1,3) is canceled by the other path (1,2,3). If we break the cycle by conditioning on $X_2$, then $X_1$ and $X_3$ become dependent, so our algorithm is able to detect the edges correctly. As the size of the conditioned sets is small for loosely connected MRFs, our algorithm has low complexity. In particular, for models with at most $D_1$ short paths between non-neighbor nodes and $D_2$ non-direct paths between neighboring nodes, the running time for our algorithm is $O(np^{D_1+D_2+2})$.
If the MRF satisfies a pairwise non-degeneracy condition, i.e., the correlation between any pair of neighboring nodes is lower bounded by some constant, then we can extend the basic algorithm to incorporate a correlation test as a preprocessing step. For each node, the correlation test adds those nodes whose correlation with the current node is above a threshold to a candidate neighbor set, which is then used as the search space for the more computationally expensive max-min conditional independence test. If the MRF has fast correlation decay, the size of the candidate neighbor set can be greatly reduced, so we can achieve much lower computational complexity with this extended algorithm.
When applying our algorithm to Ising models, we get lower computational complexity for a ferromagnetic Ising model than a general one on the same graph. Intuitively, the edge coefficient $J_{ij} > 0$ means that $i$ and $j$ are positively dependent. For any path between $i, j$, as all the edge coefficients are positive, the dependence over the path is also positive. Therefore, the non-direct paths between a pair of neighboring nodes $i, j$ make $X_i$ and $X_j$, which are positively dependent over the edge $(i, j)$, even more positively dependent. Therefore, we do not need the maximization step which breaks the short cycles and the resulting algorithm has running time $O(np^{D_1+2})$. In addition, the pairwise non-degeneracy condition is automatically satisfied and the extended algorithm can be applied.
1.1. Relation to Prior Work. We focus on computational complexity rather than sample complexity in comparing our algorithm with previous algorithms. In fact, it has been shown that Ω(log p) samples are required to learn the graph correctly with high probability, where p is the size of the graph [19]. For all the previously known algorithms for which analytical complexity bounds are available, the number of samples required to recover the graph correctly with high probability, i.e., the sample complexity, is O(log p).
Not surprisingly, the sample complexity for our algorithm is also O(log p) under reasonable assumptions.
Our algorithm with the probability test reproduces the algorithm in [7,Theorem 3] for MRFs on bounded degree graphs. Our algorithm is more flexible and achieves lower computational complexity for MRFs that are loosely connected but have a large maximum degree. In particular, reference [15] proposed a low complexity greedy algorithm that is correct when the MRF has correlation decay and the graph has large girth. We show that under the same assumptions, we can first perform a simple correlation test and reduce the search space for neighbors from all the nodes to a constant size candidate neighbor set. With this preprocessing step, our algorithm and the algorithms in [7,15,18] have computational complexity $O(np^2)$, which is lower than what we would get by only applying the greedy algorithm [15]. The results in [18] improve over [15] by proposing two new greedy algorithms that are correct for learning small girth graphs. However, the algorithm in [18] requires a constant size candidate neighbor set as input, which might not be easy to obtain in general. In fact, for MRFs with bad short cycles as in Example 1.1, learning a candidate neighbor set can be as difficult as directly learning the neighbor set.
Our analysis of the class of Ising models on sparse Erdős-Rényi random graphs $G(p, c/p)$ was motivated by the results in [2], which studies the special case of the so-called ferromagnetic Ising models defined over an Erdős-Rényi random graph. The computational complexity of the algorithm in [2] is $O(np^4)$. In this case, the key step of our algorithm reduces to the algorithm in [2]. But we show that, under the ferromagnetic assumption, we can again perform a correlation test to reduce the search space for neighbors, and the total computational complexity for our algorithm is $O(np^2)$.
The results in [3] extend the results in [2] to general Ising models and more general sparse graphs (beyond the Erdős-Rényi model). We note that the tractable graph families in [3] are similar to our notion of loosely connected MRFs. For general Ising models over sparse Erdős-Rényi random graphs, our algorithm has computational complexity $O(np^5)$ while the algorithm in [3] has computational complexity $O(np^4)$. The difference comes from the fact that our algorithm has an additional maximization step to break bad short cycles as in Example 1.1. Without this maximization step, the algorithm in [3] fails for this example. The performance analysis in [3] explicitly excludes such difficult cases by noting that these "unfaithful" parameter values have Lebesgue measure zero [3,Section B.3.2]. However, when the Ising model parameters lie close to this Lebesgue measure zero set, the learning problem is still ill-posed for the algorithm in [3], i.e., the sample complexity required to recover the graph correctly with high probability depends on how close the parameters are to this set, which is not the case for our algorithm. In fact, the same problem with the argument that the unfaithful set is of Lebesgue measure zero has been observed for causal inference in the Gaussian case [20]. It has been shown in [20] that a stronger notion of faithfulness is required to get uniform sample complexity results, and the set that is not strongly faithful has non-zero Lebesgue measure and can be surprisingly large.
Another way to learn the structures of MRFs is by solving $\ell_1$-regularized convex optimizations under a set of incoherence conditions [17]. It is shown in [13] that, for some Ising models on a bounded degree graph, the incoherence conditions hold when the Ising model is in the correlation decay regime. But the incoherence conditions do not have a clear interpretation as conditions for the graph parameters in general and are NP-hard to verify for a given Ising model [13]. Using results from standard convex optimization theory [6], it is possible to design a polynomial complexity algorithm to approximately solve the $\ell_1$-regularized optimization problem. However, the actual complexity will depend on the details of the particular algorithm used, so it is not clear how to compare the computational complexity of our algorithm with the one in [17].
We note that the recent development of directed information graphs [16] is closely related to the theory of MRFs. Learning a directed information graph, i.e., finding the causal parents of each random process, is essentially the same as finding the neighbors of each random variable in learning an MRF. Therefore, our algorithm for learning MRFs can potentially be used to learn directed information graphs as well.
The paper is organized as follows. We present some preliminaries in the next section. In Section 3, we define loosely-connected MRFs and show that several previously studied models are examples of this family. In Section 4, we present our algorithm and show the conditions required to correctly recover the graph. We also provide the concentration results in this section. In Section 5, we apply our algorithm to the general Ising models studied in Section 3 and evaluate its sample complexity and computational complexity in each case. In Section 6, we show that our algorithm achieves even lower computational complexity when the Ising model is ferromagnetic. Experimental results are presented in Section 7.

2. Preliminaries.
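2.1. Markov Random Fields (MRFs). Let $X = (X_1, X_2, \ldots, X_p)$ be a random vector with distribution $P$ and $G = (V, E)$ be an undirected graph consisting of $|V| = p$ nodes with each node $i$ associated with the $i$th element $X_i$ of $X$. Before we define an MRF, we introduce the notation $X_S$ to denote the subset of the random variables in $X$ indexed by a set $S \subseteq V$. A random vector and graph pair $(X, G)$ is called an MRF if it satisfies one of the following three Markov properties:
1. Pairwise Markov: $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$ for all $(i, j) \notin E$.
2. Local Markov: $X_i \perp X_{V \setminus (\{i\} \cup N_i)} \mid X_{N_i}$ for all $i \in V$, where $N_i$ is the set of neighbors of node $i$.
3. Global Markov: $X_A \perp X_B \mid X_S$ for all disjoint sets $A, B, S \subseteq V$ such that removing $S$ separates $A$ and $B$ in $G$.
In this case, we say $G$ is an I-map of $X$. Further, if $G$ is an I-map of $X$ and the global Markov property does not hold if any edge of $G$ is removed, then $G$ is called a minimal I-map of $X$.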
In all three cases, $G$ encodes a subset of the conditional independence relations of $X$ and we say that $X$ is Markov with respect to $G$. We note that the global Markov property implies the local Markov property, which in turn implies the pairwise Markov property. When $P(x) > 0$, $\forall x$, the three Markov properties are equivalent, i.e., if there exists a $G$ under which one of the Markov properties is satisfied, then the other two are also satisfied. Further, in the case when $P(x) > 0$, $\forall x$, there exists a unique minimal I-map of $X$. The unique minimal I-map $G = (V, E)$ is constructed as follows:
1. Each random variable $X_i$ is associated with a node $i \in V$.
2. $(i, j) \notin E$ if and only if $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.
In this paper, we consider the case $P(x) > 0$, $\forall x$, and are interested in learning the structure of the associated unique minimal I-map. We will also assume that, for each $i$, $X_i$ takes on values in a discrete, finite set $\mathcal{X}$. We will also be interested in the special case where the MRF is an Ising model, which we describe next.
2.2. Ising Model. Ising models are a type of well-studied pairwise Markov random fields. In an Ising model, each random variable $X_i$ takes values in the set $\mathcal{X} = \{-1, +1\}$ and the joint distribution is parameterized by constants called edge coefficients $J$ and external fields $h$:
$$P(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i \in V} h_i x_i \Big),$$
where $Z$ is a normalization constant to make $P(x)$ a probability distribution.
If $h = 0$, we say the Ising model is zero-field. If $J_{ij} \ge 0$ for all $(i, j) \in E$, we say the Ising model is ferromagnetic. Ising models have the following useful property. Given an Ising model, the conditional probability $P(X_{V \setminus S} \mid x_S)$ corresponds to an Ising model on $V \setminus S$ with edge coefficients $J_{ij}$, $i, j \in V \setminus S$, unchanged and modified external fields $h_i + h_i'$, $i \in V \setminus S$, where $h_i' = \sum_{(i,j) \in E,\, j \in S} J_{ij} x_j$ is the additional external field on node $i$ induced by fixing $X_S = x_S$.
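The conditioning property above can be verified by brute force on a small graph. The following is a minimal sketch; the four-node graph, the coefficient values, and the conditioned values in `S_values` are arbitrary choices made for illustration.

```python
import itertools
import math

def ising_pmf(nodes, J, h):
    """Ising pmf on `nodes`; J maps edges (i, j) to J_ij, h maps nodes to h_i."""
    weights = {}
    for x in itertools.product((-1, 1), repeat=len(nodes)):
        val = dict(zip(nodes, x))
        energy = sum(w * val[i] * val[j] for (i, j), w in J.items())
        energy += sum(h[i] * val[i] for i in nodes)
        weights[x] = math.exp(energy)
    Z = sum(weights.values())  # normalization constant
    return {x: w / Z for x, w in weights.items()}

# Hypothetical example: 4 nodes, mixed-sign couplings, nonzero fields.
nodes = [0, 1, 2, 3]
J = {(0, 1): 0.4, (1, 2): -0.3, (2, 3): 0.5, (0, 3): 0.2, (0, 2): -0.6}
h = {0: 0.1, 1: -0.2, 2: 0.0, 3: 0.3}
S_values = {2: 1, 3: -1}  # fix X_S = x_S for S = {2, 3}
rest = [i for i in nodes if i not in S_values]

# Conditional distribution P(X_{V\S} | x_S) computed from the full model.
full = ising_pmf(nodes, J, h)
cond = {}
for x, pr in full.items():
    val = dict(zip(nodes, x))
    if all(val[s] == v for s, v in S_values.items()):
        cond[tuple(val[i] for i in rest)] = pr
norm = sum(cond.values())
cond = {k: v / norm for k, v in cond.items()}

# Reduced Ising model on V\S: same J on remaining edges, fields h_i + h_i'.
J_rest = {e: w for e, w in J.items() if e[0] in rest and e[1] in rest}
h_rest = {i: h[i] for i in rest}
for (i, j), w in J.items():
    if i in rest and j in S_values:
        h_rest[i] += w * S_values[j]
    if j in rest and i in S_values:
        h_rest[j] += w * S_values[i]
reduced = ising_pmf(rest, J_rest, h_rest)

max_gap = max(abs(cond[k] - reduced[k]) for k in cond)
print(max_gap)
```

Edges with both endpoints in $S$ contribute only a constant factor, which is absorbed into the normalization constant of the reduced model.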

2.3. Random Graphs.
A random graph is a graph generated from a prior distribution over the set of all possible graphs with a given number of nodes. Let $\chi_p$ be a function on graphs with $p$ nodes and let $C$ be a constant. We say $\chi_p \ge C$ almost always for a family of random graphs indexed by $p$ if $P(\chi_p \ge C) \to 1$ as $p \to \infty$. Similarly, we say $\chi_p \to C$ almost always for a family of random graphs if, $\forall \epsilon > 0$, $P(|\chi_p - C| > \epsilon) \to 0$ as $p \to \infty$. This is a slight variation of the definition of almost always in [1].
The Erdős-Rényi random graph $G(p, c/p)$ is a graph on $p$ nodes in which the probability of an edge being in the graph is $c/p$ and the edges are generated independently. We note that, in this random graph, the average degree of a node is approximately $c$. In this paper, when we consider random graphs, we only consider the Erdős-Rényi random graph $G(p, c/p)$.
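As an illustration of this definition, the following minimal sketch samples a graph from $G(p, c/p)$ and computes its average degree. The function names and the seed are hypothetical; the expected average degree is $c(p-1)/p$, which approaches $c$ for large $p$.

```python
import random

def sample_erdos_renyi(p, c, seed=0):
    """Sample the edge set of G(p, c/p): each pair is an edge independently w.p. c/p."""
    rng = random.Random(seed)
    prob = c / p
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < prob:
                edges.add((i, j))
    return edges

def average_degree(p, edges):
    """Each edge contributes to the degree of two nodes."""
    return 2.0 * len(edges) / p

edges = sample_erdos_renyi(p=1000, c=3.0)
print(average_degree(1000, edges))
```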

2.4. High-Dimensional Structure Learning. In this paper, we are interested in inferring the structure of the graph $G$ associated with an MRF $(X, G)$. We will assume that $P(x) > 0$, $\forall x$, and $G$ will refer to the corresponding unique minimal I-map. The goal of structure learning is to design an algorithm that, given $n$ i.i.d. samples $\{X^{(k)}\}_{k=1}^{n}$ from the distribution $P$, outputs an estimate $\hat{G}$ which equals $G$ with high probability when $n$ is large. We say that two graphs are equal when their node and edge sets are identical.
In the classical setting, the accuracy of estimating G is considered only when the sample size n goes to infinity while the random vector dimension p is held fixed. This setting is restrictive for many contemporary applications, where the problem size p is much larger than the number of samples. A more suitable assumption allows both n and p to become large, with n growing at a slower rate than p. In such a case, the structure learning problem is said to be high-dimensional.
An algorithm for structure learning is evaluated both by its computational complexity and sample complexity. The computational complexity refers to the number of computations required to execute the algorithm, as a function of $n$ and $p$. When $G$ is a deterministic graph, we say the algorithm has sample complexity $f(p)$ if, for $n = O(f(p))$, there exist constants $c$ and $\alpha > 0$, independent of $p$, such that $\Pr(\hat{G} = G) \ge 1 - \frac{c}{p^{\alpha}}$ for all $P$ which are Markov with respect to $G$. When $G$ is a random graph drawn from some prior distribution, we say the algorithm has sample complexity $f(p)$ if the above is true almost always. In the high-dimensional setting $n$ is much smaller than $p$. In fact, we will show that, for the algorithms described in this paper, $f(p) = \log p$.
3. Loosely Connected MRFs. Loosely connected Markov random fields are undirected graphical models in which the number of short paths between any pair of nodes is small. Roughly speaking, a path between two nodes is short if the dependence between the two nodes is non-negligible even if all other paths between the nodes are removed. Later, we will more precisely quantify the term "short" in terms of the correlation decay property of the MRF. For simplicity, we say that a set $S$ separates some paths between nodes $i$ and $j$ if removing $S$ disconnects these paths. In such a graphical model, if $i, j$ are not neighbors, there is a small set of nodes $S$ separating all the short paths between them, and conditioned on this set of variables $X_S$ the two variables $X_i$ and $X_j$ are approximately independent. On the other hand, if $i, j$ are neighbors, there is a small set of nodes $T$ separating all the short non-direct paths between them, i.e., the direct edge is the only short path connecting the two nodes after removing $T$ from the graph. Conditioned on this set of variables $X_T$, the dependence of $X_i$ and $X_j$ is dominated by the dependence over the direct edge and hence is bounded away from zero. The following necessary and sufficient condition for the non-existence of an edge in a graphical model, which we have not seen in prior work, shows that both the sets $S$ and $T$ above are essential for learning the graph.
Lemma 3.1. Consider two nodes $i$ and $j$ in $G$. Then, $(i, j) \notin E$ if and only if $\exists S$, $\forall T$, $X_i \perp X_j \mid X_S, X_T$.
Proof. Recall from the definition of the minimal I-map that $(i, j) \notin E$ if and only if $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$. Therefore, the statement of the lemma is equivalent to
$$I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = 0 \iff \min_S \max_T I(X_i; X_j \mid X_S, X_T) = 0,$$
where $I(X_i; X_j \mid X_S)$ denotes the mutual information between $X_i$ and $X_j$ conditioned on $X_S$, and we have used the fact that $X_i \perp X_j \mid X_S$ is equivalent to $I(X_i; X_j \mid X_S) = 0$.
This lemma tells us that, if there is no edge between nodes $i$ and $j$, we can find a set of nodes $S$ such that the removal of $S$ from the graph separates $i$ and $j$. From the global Markov property, this implies that $X_i \perp X_j \mid X_S$. However, as Example 1.1 shows, the converse is not true. In fact, for $S = \emptyset$, we have $X_1 \perp X_2 \mid X_S$, but (1, 2) is indeed an edge in the graph. The above lemma completes the statement in the converse direction, showing that we should also introduce a set $T$ in addition to the set $S$ to correctly identify the edge.
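The two directions of the equivalence in Lemma 3.1 can be sketched as follows. This is a short derivation under the assumption that $S$ and $T$ range over subsets of $V \setminus \{i, j\}$; it is offered as an illustration and is not taken verbatim from the proof.

```latex
% Assumption: S and T range over subsets of V \ {i, j}.
\begin{align*}
&(\Rightarrow)\ \text{If } (i,j) \notin E,\ \text{take } S = V \setminus \{i,j\}.\
  \text{For every } T,\ S \cup T = V \setminus \{i,j\},\ \text{so} \\
&\qquad I(X_i; X_j \mid X_S, X_T) = I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = 0. \\
&(\Leftarrow)\ \text{If some } S \text{ satisfies } I(X_i; X_j \mid X_S, X_T) = 0
  \text{ for all } T,\ \text{take } T = V \setminus (\{i,j\} \cup S): \\
&\qquad I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = 0,\ \text{hence } (i,j) \notin E.
\end{align*}
```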
Motivated by this lemma, we define loosely connected MRFs as follows.
for some conditional independence test ∆.
The conditional independence test $\Delta$ should satisfy $\Delta(X_i; X_j \mid X_S, X_T) = 0$ if and only if $X_i \perp X_j \mid X_S, X_T$. In this paper, we use two types of conditional independence tests:
• Mutual Information Test:
• Probability Test:
Later on, we will see that the probability test gives lower sample complexity for learning Ising models on bounded degree graphs, while the mutual information test gives lower sample complexity for learning Ising models on graphs with unbounded degree. Note that the above definition restricts the size of the sets $S$ and $T$ to make the learning problem tractable. We show in the rest of the section that several important Ising models are examples of loosely connected MRFs. Unless otherwise stated, we assume that the edge coefficients $J_{ij}$ are bounded, i.e., $J_{\min} \le |J_{ij}| \le J_{\max}$.
3.1. Bounded Degree Graph. We assume the graph has maximum degree $d$. For any $(i, j) \notin E$, the set $S = N_i$ of size at most $d$ separates $i$ and $j$, and for any set $T$ we have $\Delta(X_i; X_j \mid X_S, X_T) = 0$. For any $(i, j) \in E$, the set $T = N_i \setminus j$ of size at most $d - 1$ separates all the non-direct paths between $i$ and $j$. Moreover, we have the following lower bound for neighbors from [7, Proposition 2].
Therefore, the Ising model on a bounded degree graph with maximum degree $d$ is a $(d, d - 1, \epsilon)$-loosely connected MRF. We note that here we do not use any correlation decay property, and we view all the paths as short.
3.2. Bounded Degree Graph, Correlation Decay and Large Girth. In this subsection, we still assume the graph has maximum degree d. From the previous subsection, we already know that the Ising model is loosely connected. But we show that when the Ising model is in the correlation decay regime and further has large girth, it is a much sparser model than the general bounded degree case.
Correlation decay is a property of MRFs which says that, for any pair of nodes $i, j$, the correlation of $X_i$ and $X_j$ decays with the distance between $i, j$. When an MRF has correlation decay, the correlation of $X_i$ and $X_j$ is mainly determined by the short paths between nodes $i, j$, and the contribution from the long paths is negligible. It is known that when $J_{\max}$ is small compared with $d$, the Ising model has correlation decay. More specifically, we have the following lemma, which is a consequence of the strong correlation decay property [22,Theorem 1].
Proof. For some given
This lemma implies that, in the correlation decay regime $(d-1) \tanh J_{\max} < 1$, the Ising model has exponential correlation decay, i.e., the correlation between a pair of nodes decays exponentially with their distance. We say that a path of length $l$ is short if $\beta \alpha^{l}$ is above some desired threshold.
The girth of a graph is defined as the length of the shortest cycle in the graph, and large girth implies that there is no short cycle in the graph. When the Ising model is in the correlation decay regime and the girth of the graph is large in terms of the correlation decay parameters, there is at most one short path between any pair of non-neighbor nodes, and no short paths other than the direct edge between any pair of neighboring nodes. Naturally, we can use $S$ of size 1 to approximately separate any pair of non-neighbor nodes and do not need $T$ to block the other paths for neighbor nodes as the correlations are mostly due to the direct edges. Therefore, we would expect this Ising model to be $(1, 0, \epsilon)$-loosely connected for some constant $\epsilon$. In fact, the following theorem gives an explicit characterization of $\epsilon$. The condition on the girth below is chosen such that there is at most one short path between any pair of nodes, so a path is called short if it is shorter than half of the girth.
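Exponential decay along a single path can be seen in the simplest case, a zero-field Ising chain, where the end-to-end correlation is the product of $\tanh J$ over the edges of the path. The following minimal sketch checks this standard identity by enumeration. It illustrates decay along one path only and is not the bound of the lemma; the function name and coupling values are hypothetical.

```python
import itertools
import math

def chain_end_correlation(couplings):
    """E[X_0 X_l] for a zero-field Ising chain with the given edge couplings,
    computed by brute-force enumeration over all 2^(l+1) configurations."""
    l = len(couplings)
    num = 0.0
    Z = 0.0
    for x in itertools.product((-1, 1), repeat=l + 1):
        w = math.exp(sum(J * x[k] * x[k + 1] for k, J in enumerate(couplings)))
        Z += w
        num += w * x[0] * x[l]
    return num / Z

J = 0.3
for l in range(1, 6):
    # The enumerated correlation matches tanh(J)^l, so it decays geometrically in l.
    print(l, chain_end_correlation([J] * l), math.tanh(J) ** l)
```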
Proof. See Appendix A.
3.3. Erdős-Rényi Random Graph $G(p, c/p)$ and Correlation Decay. We assume the graph $G$ is generated from the prior $G(p, c/p)$ in which each edge is in $G$ with probability $c/p$ and the average degree for each node is $c$. For this random graph, the maximum degree scales as $O\big(\frac{\ln p}{\ln \ln p}\big)$ with high probability [1]. Thus, we cannot use the results for bounded degree graphs even though the average degree remains bounded as $p \to \infty$.
It is known from prior work [2] that, for ferromagnetic Ising models, i.e., $J_{ij} \ge 0$ for any $i$ and $j$, when $J_{\max}$ is small compared with the average degree $c$, the random graph is in the correlation decay regime and the number of short paths between any pair of nodes is at most 2 asymptotically. We show that the same result holds for general Ising models. Our proof is related to the techniques developed in [2], but certain steps in the proof of [2] do rely on the fact that the Ising model is ferromagnetic, so the proof does not directly carry over. We point out similarities and differences as we proceed in Appendix C.
More specifically, letting $\gamma_p = \frac{\log p}{K \log c}$ for some $K \in (3, 4)$, the following theorem shows that nodes that are at least $\gamma_p$ hops from each other have negligible impact on each other. As a consequence of the following theorem, we can say that a path is short if it is at most $\gamma_p$ hops.
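To get a feel for how slowly the "short path" radius grows, the following minimal sketch evaluates $\gamma_p = \log p / (K \log c)$ for a few graph sizes. The values of $c$ and $K$ are illustrative assumptions (with $c > 1$ so that $\log c > 0$), and the function name is hypothetical.

```python
import math

def gamma_p(p, c, K=3.5):
    """Path-length threshold gamma_p = log(p) / (K * log(c)); requires c > 1."""
    if c <= 1:
        raise ValueError("c must exceed 1 so that log(c) > 0")
    return math.log(p) / (K * math.log(c))

for p in (10**3, 10**6, 10**12):
    # gamma_p grows only logarithmically: squaring p doubles the threshold.
    print(p, gamma_p(p, c=3.0))
```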
Theorem 3.6. Assume $\alpha = c \tanh J_{\max} < 1$. Then, the following properties are true almost always.
(1) Let $G$ be a graph generated from the prior $G(p, c/p)$. If $i, j$ are not neighbors in $G$ and $S$ separates all the paths shorter than $\gamma_p$ hops between $i, j$, for all Ising models $P$ on $G$, where $\kappa = \frac{\log \frac{1}{\alpha}}{4 \log c}$ and $B(i, \gamma_p)$ is the set of all nodes which are at most $\gamma_p$ hops away from $i$.
(2) There are at most two paths shorter than $\gamma_p$ between any pair of nodes.
Proof. See Appendix C.
The above result suggests that for Ising models on the random graph there are at most two short paths between non-neighbor nodes and one short non-direct path between neighboring nodes, i.e., it is a $(2, 1, \epsilon)$-loosely connected MRF. Further, the next two theorems prove that such a constant $\epsilon$ exists. The proofs are in Appendix C.
Theorem 3.7. For any $(i, j) \notin E$, let $S$ be a set separating the paths shorter than $\gamma_p$ between $i, j$ and assume $|S| \le 3$, then almost always
Theorem 3.8. For any $(i, j) \in E$, let $T$ be a set separating the non-direct paths shorter than $\gamma_p$ between $i, j$ and assume $|T| \le 3$, then almost always

4. Our Algorithm and Concentration Results.
Learning the structure of a graph is equivalent to learning whether there is an edge between each pair of nodes in the graph. Therefore, we would like to develop a test to determine if there exists an edge between two nodes or not. From Definition 3.2, it should be clear that learning a loosely connected MRF is straightforward. For non-neighbor nodes, we search for the set $S$ that separates all the short paths between them, while for neighboring nodes, we search for the set $T$ that separates all the non-direct short paths between them. As the MRF is loosely connected, the sizes of the above sets are small, and therefore the complexity of the algorithm is low.
Given $n$ i.i.d. samples $\{X^{(k)}\}_{k=1}^{n}$ from the distribution $P$, the empirical distribution $\hat{P}$ is defined as follows. For any set $A$,
$$\hat{P}(x_A) = \frac{1}{n} \sum_{k=1}^{n} \mathbb{1}\{X_A^{(k)} = x_A\}.$$
Let $\hat{\Delta}$ be the empirical conditional independence test which is the same as $\Delta$ but computed using $\hat{P}$. Our first algorithm is as follows.
For clarity, when we specifically use the mutual information test (or the probability test), we denote the corresponding algorithm by CondST$_I$ (or CondST$_P$). When the empirical conditional independence test $\hat{\Delta}$ is close to the exact test $\Delta$, we immediately get the following result.
Proof. The correctness is immediate. We note that, for each pair of $i, j$ in $V$, we search $S, T$ in $V$. So the number of possible combinations of $(i, j, S, T)$ is $O(p^{D_1+D_2+2})$ and we get the running time result.
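The max-min search over $(i, j, S, T)$ counted in the proof above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm listing itself: it uses a plug-in empirical conditional mutual information as the test, restricts $S$ and $T$ to subsets of $V \setminus \{i, j\}$, and takes a user-supplied threshold `thresh`; all function names are hypothetical.

```python
import itertools
import math
from collections import Counter

def empirical_cmi(samples, i, j, cond):
    """Plug-in estimate of I(X_i; X_j | X_cond) in nats from a list of tuples."""
    n = len(samples)
    cond = tuple(cond)
    c_ijc = Counter((x[i], x[j]) + tuple(x[k] for k in cond) for x in samples)
    c_ic = Counter((x[i],) + tuple(x[k] for k in cond) for x in samples)
    c_jc = Counter((x[j],) + tuple(x[k] for k in cond) for x in samples)
    c_c = Counter(tuple(x[k] for k in cond) for x in samples)
    total = 0.0
    for key, cnt in c_ijc.items():
        xi, xj, xc = key[0], key[1], key[2:]
        ratio = cnt * c_c[xc] / (c_ic[(xi,) + xc] * c_jc[(xj,) + xc])
        total += (cnt / n) * math.log(ratio)
    return total

def subsets_up_to(items, max_size):
    """All subsets of `items` with at most `max_size` elements."""
    for r in range(max_size + 1):
        yield from itertools.combinations(items, r)

def cond_st(samples, p, d1, d2, thresh):
    """Declare (i, j) an edge iff min over S of max over T of the test exceeds thresh."""
    edges = set()
    for i, j in itertools.combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        min_max = min(
            max(empirical_cmi(samples, i, j, sorted(set(S) | set(T)))
                for T in subsets_up_to(others, d2))
            for S in subsets_up_to(others, d1)
        )
        if min_max > thresh:
            edges.add((i, j))
    return edges

# Samples whose empirical distribution equals Example 1.1 exactly (40 samples).
samples = []
for x1, x2 in itertools.product((0, 1), repeat=2):
    samples += [(x1, x2, x1 ^ x2)] * 9 + [(x1, x2, 1 - (x1 ^ x2))] * 1

print(cond_st(samples, p=3, d1=1, d2=1, thresh=0.05))  # with the maximization step
print(cond_st(samples, p=3, d1=1, d2=0, thresh=0.05))  # without it (T is always empty)
```

On these samples, allowing a nonempty $T$ recovers the triangle, while forcing $T = \emptyset$ finds no edges, which mirrors the role of the maximization step in Example 1.1.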
When the MRF has correlation decay, it is possible to reduce the computational complexity by restricting the search space for the set S and T to a smaller candidate neighbor set. In fact, for each node i, the nodes which are a certain distance away from i have small correlation with X i . As suggested in [7], we can first perform a pairwise correlation test to eliminate these nodes from the candidate neighbor set of node i. To make sure the true neighbors are all included in the candidate set, the MRF needs to satisfy an additional pairwise non-degeneracy condition. Our second algorithm is as follows.
The following result provides conditions under which the second algorithm correctly learns the MRF.
for any node $i, j$ and $x_i, x_j$, and for any node $i, j$ and set $A$ with $|A| \le D_1 + D_2$, then CondSTPre($D_1, D_2, \epsilon, \epsilon'$) recovers the graph correctly. Let $L = \max_i |L_i|$. The running time for the algorithm is $O(np^2 + npL^{D_1+D_2+1})$.
Proof. By the pairwise non-degeneracy condition (1), the neighbors of node $i$ are all included in the candidate neighbor set $L_i$. We note that this preprocessing step excludes the nodes whose correlation with node $i$ is below $\epsilon'/4$. Then in the inner loop, the correctness of the algorithm is immediate. The running time of the correlation test is $O(np^2)$. We note that, for each $i$ in $V$, we loop over $j$ in $L_i$ and search $S$ and $T$ in $L_i$. So the number of possible combinations of $(i, j, S, T)$ is $O(pL^{D_1+D_2+1})$. Combining the two steps, we get the running time of the algorithm.
Note that the additional non-degeneracy condition (1) required for the second algorithm to execute correctly is not satisfied for all graphs (recall Example 1.1).
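The failure of the non-degeneracy condition on Example 1.1 can be seen with a small sketch of the correlation preprocessing step. The pairwise statistic below (the largest change in the empirical conditional probability of $x_i$ as $x_j$ varies) and the threshold `thresh` are assumptions chosen for illustration; the function names are hypothetical.

```python
import itertools
from collections import Counter

def pairwise_dependence(samples, i, j):
    """max over x_i, x_j, x_j' of |P^(x_i | x_j) - P^(x_i | x_j')| (empirical)."""
    c_ij = Counter((x[i], x[j]) for x in samples)
    c_j = Counter(x[j] for x in samples)
    vals_i = {x[i] for x in samples}
    best = 0.0
    for xi in vals_i:
        # Empirical P^(x_i | x_j) for every observed value of x_j.
        conds = [c_ij[(xi, xj)] / c_j[xj] for xj in c_j]
        best = max(best, max(conds) - min(conds))
    return best

def candidate_neighbors(samples, p, thresh):
    """Candidate set L_i: nodes whose pairwise dependence with i exceeds thresh."""
    return {
        i: {j for j in range(p)
            if j != i and pairwise_dependence(samples, i, j) > thresh}
        for i in range(p)
    }

# Samples whose empirical distribution equals Example 1.1 exactly.
xor_samples = []
for x1, x2 in itertools.product((0, 1), repeat=2):
    xor_samples += [(x1, x2, x1 ^ x2)] * 9 + [(x1, x2, 1 - (x1 ^ x2))] * 1

# Every pair looks independent, so all candidate sets are empty and the
# true neighbors would be screened out by the preprocessing step.
print(candidate_neighbors(xor_samples, p=3, thresh=0.05))
```

By contrast, for two perfectly correlated variables the same statistic equals 1, so each node lands in the other's candidate set.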

4.1. Concentration Results.
In this subsection, we show a set of concentration results for the empirical quantities in the above algorithm for general discrete MRFs, which will be used to obtain the sample complexity results in Section 5 and Section 6.
This lemma could be used as a guideline on how to choose between the two conditional independence tests for our algorithm to get lower sample complexity. The key difference is the dependence on the constant $\delta$, which is a lower bound on the probability of any $x_S$ with the set size $|S| \le D_1 + D_2 + 1$. The probability test requires a constant $\delta > 0$ to achieve sample complexity $n = O(\log p)$, while the mutual information test does not depend on $\delta$ and also achieves sample complexity $n = O(\log p)$. We note that, while both tests have $O(\log p)$ sample complexity, the constants hidden in the order notation may be different for the two tests. For Ising models on bounded degree graphs, we show in the next section that a constant $\delta > 0$ exists, and the probability test gives a lower sample complexity. On the other hand, for Ising models on the Erdős-Rényi random graph $G(p, c/p)$, we could not get a constant $\delta > 0$ as the maximum degree of the graph is unbounded, and the mutual information test gives a lower sample complexity.

Computational Complexity for General Ising Models. In this section, we apply our algorithm to the Ising models in Section 3. We evaluate both the number of samples required to recover the graph with high probability and the running time of our algorithm. The results below are simple combinations of the results in the previous two sections. Unless otherwise stated, we assume that the edge coefficients $J_{ij}$ are bounded, i.e., $J_{\min} \le |J_{ij}| \le J_{\max}$. Throughout this section, we use the notation $x \wedge y$ to denote the minimum of $x$ and $y$.

Bounded Degree Graph.
We assume the graph has maximum degree d. First we have the following lower bound on the probability of any finite size set of variables.
Proof. See Appendix A.
Our algorithm with the probability test for the bounded degree graph case reproduces the algorithm in [7]. For completeness, we state the following result without a proof since it is nearly identical to the result in [7], except for some constants.
The running time of the algorithm is $O(np^{2d+1})$.

5.2. Bounded Degree Graph, Correlation Decay and Large Girth. We assume the graph has maximum degree $d$. We also assume that the Ising model is in the correlation decay regime, i.e., $(d-1)\tanh J_{\max} < 1$, and the graph has large girth. Combining Theorem 3.5, Fact 4.1 and Lemma 4.3, we can show that the algorithm $\mathit{CondST}_P(1, 0, \epsilon)$ recovers the graph correctly with high probability for some constant $\epsilon$, and the running time is $O(np^3)$ for $n = O(\log p)$.
We can get even lower computational complexity using our second algorithm. The key observation is that, as there is no short path other than the direct edge between neighboring nodes, the correlation over the edge dominates the total correlation hence the pairwise non-degeneracy condition is satisfied. We note that the length of the second shortest path between neighboring nodes is no less than g − 1.
Lemma 5.3. Assume that $(d-1)\tanh J_{\max} < 1$, and the girth $g$ satisfies

Proof. See Appendix A.
Using this lemma, we can apply our second algorithm to learn the graph.
Therefore, in the correlation test, $L_i$ only includes nodes within distance $l$ from $i$, and the size $|L_i| \le d^l$ since the maximum degree is $d$; i.e., the algorithm $\mathit{CondST\_Pre}_P(1, 0, \epsilon, \epsilon')$ recovers $G$ with probability $1 - \frac{c}{p^\alpha}$ for some constant $c$. The running time of the algorithm is $O(np^2)$.

5.3. Erdős-Rényi Random Graph $G(p, c/p)$ and Correlation Decay. We assume the graph $G$ is generated from the prior $G(p, c/p)$, in which each edge is in $G$ with probability $c/p$ and the average degree of each node is $c$. Because the random graph has unbounded maximum degree, we cannot lower bound the probability of a finite size set of random variables by a constant for all $p$. To get good sample complexity, we use the mutual information test in our algorithm. Combining Theorem 3.7, Theorem 3.8, Fact 4.1 and Lemma 4.3, we get the following result.
, the algorithm $\mathit{CondST}_I(2, 1, \epsilon)$ recovers the graph $G$ almost always. The running time of the algorithm is $O(np^5)$.

Sample Complexity.
In this subsection, we briefly summarize the number of samples required by our algorithm. According to the results in this section and the next section, $C \log p$ samples are sufficient in general, where the constant $C$ depends on the parameters of the model. When the Ising model is on a bounded degree graph with maximum degree $d$, the constant $C$ is of order $\exp(-O(d + d^2 J_{\max}))$. In particular, if the Ising model is in the correlation decay regime, then $dJ_{\max} = O(1)$ and the constant $C$ is of order $\exp(-O(d))$. When the Ising model is on an Erdős-Rényi random graph $G(p, c/p)$ and is in the correlation decay regime, the constant $C$ is lower bounded by some absolute constant independent of the model parameters.

Computational Complexity for Ferromagnetic Ising Models.
Ferromagnetic Ising models are Ising models in which all the edge coefficients J ij are nonnegative. We say (i, j) is an edge if J ij > 0. One important property of ferromagnetic Ising models is association, which characterizes the positive dependence among the nodes.
Definition 6.1. [9] We say a collection of random variables $X = (X_1, X_2, \ldots, X_n)$ is associated, or the random vector $X$ is associated, if $\mathrm{Cov}(f(X), g(X)) \ge 0$ for all nondecreasing functions $f$ and $g$ for which $Ef(X)$, $Eg(X)$, $Ef(X)g(X)$ exist.
Proposition 6.2. [12] The random vector X of a ferromagnetic Ising model (possibly with external fields) is associated.
A useful consequence of the Ising model being associated is as follows.

Corollary 6.3. Assume $X$ is a zero field ferromagnetic Ising model. For any $i, j$, $P(X_i = 1, X_j = 1) \ge \frac{1}{4} \ge P(X_i = 1, X_j = -1)$.

Proof. See Appendix B.
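Corollary 6.3 can be checked numerically on a small graph by exact enumeration. The sketch below assumes the usual zero-field parametrization $P(x) \propto \exp(\sum_{u<v} J_{uv} x_u x_v)$ with $x_u \in \{-1, +1\}$; the function name `pair_marginal` and the example coefficients are illustrative choices, not taken from the paper.

```python
import math
from itertools import product


def pair_marginal(J, i, j):
    """Exact P(X_i = a, X_j = b) for a zero-field Ising model.

    J is a symmetric p x p list of lists; only entries with u < v are read.
    Enumeration costs O(2^p), so this is only meant for tiny examples.
    """
    p = len(J)
    weights = {(a, b): 0.0 for a in (-1, 1) for b in (-1, 1)}
    for x in product((-1, 1), repeat=p):
        energy = sum(
            J[u][v] * x[u] * x[v] for u in range(p) for v in range(u + 1, p)
        )
        weights[(x[i], x[j])] += math.exp(energy)
    z = sum(weights.values())
    return {key: w / z for key, w in weights.items()}


# A ferromagnetic 4-cycle with nonnegative, hypothetical coefficients.
J_example = [
    [0.0, 0.5, 0.0, 0.3],
    [0.5, 0.0, 0.7, 0.0],
    [0.0, 0.7, 0.0, 0.2],
    [0.3, 0.0, 0.2, 0.0],
]
marginal_02 = pair_marginal(J_example, 0, 2)
```

For every pair in this example, including the non-adjacent pair (0, 2), the entry for `(1, 1)` is at least 1/4 and the entry for `(1, -1)` is at most 1/4, as the corollary states.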
Informally speaking, the edge coefficient J ij > 0 means that i and j are positively dependent over the edge. For any path between i, j, as all the edge coefficients are positive, the dependence over the path is also positive. Therefore, the non-direct paths between a pair of neighboring nodes i, j make X i and X j , which are positively dependent over the edge (i, j), even more positively dependent. This observation has two important implications for our algorithm.
1. We do not need to break the short cycles with a set $T$ in order to detect the edges, so the maximization in the algorithm can be removed.
2. The pairwise non-degeneracy is always satisfied for some constant $\epsilon'$, so we can apply the correlation test to reduce the computational complexity.
6.1. Bounded Degree Graph. We assume the graph has maximum degree d. We have the following non-degeneracy result for ferromagnetic Ising models.
Proof. See Appendix B.
The following theorem justifies the remarks after Corollary 6.3 and shows that the algorithm with the preprocessing step, $\mathit{CondST\_Pre}(d, 0, \epsilon, \epsilon')$, can be used to learn the graph, where $\epsilon$, $\epsilon'$ are obtained from the above lemma. Recall that $L_i$ is the candidate neighbor set of node $i$ after the preprocessing step and $L = \max_i |L_i|$.

and $\delta$ be defined as in Theorem 5.2. Let γ = 32 ∧ δ 16 ∧ δ 2 . If $n > \frac{2[(1+\alpha)\log p + (d+1)\log L + (d+2)\log 2]}{\gamma^2}$, the algorithm $\mathit{CondST\_Pre}_P(d, 0, \epsilon, \epsilon')$ recovers $G$ with probability $1 - \frac{c}{p^\alpha}$ for some constant $c$. The running time of the algorithm is $O(np^2 + npL^{d+1})$. If we further assume that $(d-1)\tanh J_{\max} < 1$, then the running time of the algorithm is $O(np^2)$.
Proof. We choose $|S| \le d$ and $T = \emptyset$ in our algorithm, and we have $|N_S| \le d^2$ as the maximum degree is $d$. By Lemma 6.4, we have max for any $|S| \le d$. Therefore, the Ising model is a $(d, 0, \epsilon)$-loosely connected MRF. Note that Lemma 6.4 is applicable to any set $S$ (not necessarily the set $S$ in the conditional independence test). Applying Lemma 6.4 again with $S = \emptyset$, we get the pairwise non-degeneracy condition. Combining Fact 4.2 and Lemma 4.3, we get the correctness of the algorithm. The running time is $O(np^2 + npL^{d+1})$, which is at most $O(np^{d+2})$.
When $(d-1)\tanh J_{\max} < 1$, as the Ising model is in the correlation decay regime, $L = \max_i |L_i| \le d^l$ is a constant independent of $p$, as argued for Theorem 5.4. Therefore, the running time is only $O(np^2)$ in this case.
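To get a feel for the sample requirement, the bound in the theorem above can be evaluated numerically. The sketch reads the bound as $n > 2[(1+\alpha)\log p + (d+1)\log L + (d+2)\log 2]/\gamma^2$, which is an assumption about the intended form; $\gamma$ is passed in directly because its definition depends on constants from earlier results. The function name `min_samples` is hypothetical.

```python
import math


def min_samples(p, d, L, gamma, alpha):
    """Smallest integer n strictly above the assumed sample bound.

    p: number of nodes, d: maximum degree, L: largest candidate set size,
    gamma: accuracy constant from the theorem, alpha: exponent in the
    success probability 1 - c / p**alpha.
    """
    bound = 2.0 * (
        (1.0 + alpha) * math.log(p)
        + (d + 1) * math.log(L)
        + (d + 2) * math.log(2)
    ) / gamma**2
    return math.floor(bound) + 1
```

Squaring $p$ only doubles the $\log p$ term, so the required $n$ grows logarithmically in the graph size while the $1/\gamma^2$ factor dominates the constant.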

Erdős-Rényi Random Graph $G(p, c/p)$ and Correlation Decay. When the Ising model is ferromagnetic, the result for the random graph is similar to that of a deterministic graph. For each graph sampled from the prior distribution, the dependence over the edges is positive. If $i, j$ are neighbors in the graph, having additional paths between them makes them more positively dependent, so we do not need to block those paths with a set $T$ to detect the edge and can set $D_2 = 0$. In fact, we can prove a stronger result for neighbor nodes than in the general case. The following result also appears in [2], but we are unable to verify the correctness of all the steps there, so we present the result here for completeness.

Theorem 6.6. $\forall i \in V$, $\forall j \in N_i$, let $S$ be any set with $|S| \le 2$; then almost always $I(X_i; X_j | X_S) = \Omega(1)$.

Proof. See Appendix C.
Moreover, the pairwise non-degeneracy condition in Theorem 6.5 also holds here. We can thus use the algorithm $\mathit{CondST\_Pre}(2, 0, \epsilon, \epsilon')$ to learn the graph. Without the pre-processing step, our algorithm is the same as in [2]. We show in the following theorem that, using the pre-processing step, our algorithm achieves lower computational complexity in the order of $p$.
Proof. Combining Theorem 3.7, Theorem 3.8, Fact 4.2, Lemma 4.3 and Lemma 6.4, we get the correctness of the algorithm.
From Theorem 3.6 we know that if $j$ is more than $\gamma_p$ hops away from $i$, the correlation between them decays as $o(p^{-\kappa})$. For a constant threshold, these far-away nodes are excluded from the candidate neighbor set $L_i$ when $p$ is large. It is shown in the proof of [14, Lemma 2.1] that for $G(p, c/p)$, the number of nodes in the $\gamma_p$-ball around $i$ is not large with high probability. More specifically, $\forall i \in V$, $|B(i, \gamma_p)| = O(c^{\gamma_p} \log p)$ almost always, where $B(i, \gamma_p)$ is the set of all nodes which are at most $\gamma_p$ hops away from $i$. Therefore we get

7. Experimental Results. In this section, we present experimental results to show the importance of the choice of a non-zero $D_2$ in correctly estimating the edges and non-edges of the underlying graph of an MRF. We evaluate our algorithm $\mathit{CondST}_I(D_1, D_2, \epsilon)$, which uses the mutual information test and does not have the preprocessing step, for general Ising models on grids and random graphs as illustrated in Figure 1.

In a single run of the algorithm, we first generate the graph $G = (V, E)$: for grids, the graph is fixed, while for random graphs, the graph is generated randomly each time. After generating the graph, we generate the edge coefficients uniformly from $[-J_{\max}, -J_{\min}] \cup [J_{\min}, J_{\max}]$, where $J_{\min} = 0.4$ and $J_{\max} = 0.6$. We then generate samples from the Ising model by Gibbs sampling. The sample size ranges from 400 to 1000. The algorithm computes $\hat{I}_{ij}$ for each pair of nodes $i$ and $j$ using the samples. For a particular threshold $\epsilon$, the algorithm outputs $(i, j)$ as an edge if $\hat{I}_{ij} > \epsilon$ and gets an estimated graph $\hat{G} = (V, \hat{E})$. We select $\epsilon$ optimally for each run of the simulation, using the knowledge of the graph, such that the number of errors in $\hat{E}$, including both errors in edges and non-edges, is minimized. The performance of the algorithm in each case is evaluated by the probability of success, which is the percentage of the correctly estimated edges, and each point in the plots is an average over 50 runs.
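The data generation described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the reported experiments: the model is taken to be zero-field with $P(x) \propto \exp(\sum_{(i,j) \in E} J_{ij} x_i x_j)$, and the names `random_coefficients`, `gibbs_samples`, `burn_in` and `thin` are hypothetical. Gibbs sampling yields approximately independent samples only after burn-in and thinning.

```python
import math
import random


def random_coefficients(edges, j_min, j_max, rng):
    """Draw each J_ij uniformly from [-j_max, -j_min] U [j_min, j_max]."""
    return {e: rng.choice((-1, 1)) * rng.uniform(j_min, j_max) for e in edges}


def gibbs_samples(p, coeffs, n_samples, burn_in, thin, rng):
    """Systematic-scan Gibbs sampler for a zero-field Ising model.

    coeffs maps an edge (i, j) to J_ij; returns n_samples tuples of +/-1.
    """
    nbrs = {i: [] for i in range(p)}
    for (i, j), w in coeffs.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    x = [rng.choice((-1, 1)) for _ in range(p)]
    out = []
    for sweep in range(burn_in + n_samples * thin):
        for i in range(p):
            field = sum(w * x[j] for j, w in nbrs[i])
            # P(x_i = +1 | all other nodes) = 1 / (1 + exp(-2 * field)).
            prob_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
            x[i] = 1 if rng.random() < prob_plus else -1
        # Keep every thin-th sweep after the burn-in period.
        if sweep >= burn_in and (sweep - burn_in + 1) % thin == 0:
            out.append(tuple(x))
    return out
```

A run would call `random_coefficients(edges, 0.4, 0.6, rng)` on the grid or random graph edge list and then draw between 400 and 1000 samples with `gibbs_samples`.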
We then compare the performance of the algorithm under different choices of $D_1$ and $D_2$. We omit the results for four-neighbor grids as the performances of the algorithm with $D_2 = 0$ and $D_2 > 0$ are very close. In fact, four-neighbor grids do not have many short cycles, and even the shortest non-direct paths are weak for the relatively small $J_{\max}$ we choose; therefore there is no benefit in using a set $T$ to separate the non-direct paths for edge detection. However, for eight-neighbor grids, which are denser and have shorter cycles, the probability of success of the algorithm significantly improves by setting $D_2 = 1$, as seen from Figure 2. It is also interesting to note that increasing from $D_1 = 2$ to $D_1 = 3$ does not improve the performance, which implies that a set $S$ of size 2 is sufficient to approximately separate the non-neighbor nodes in our eight-neighbor grids.

The experimental results for the algorithm with $D_1 = 0, \ldots, 3$ and $D_2 = 0, 1$ applied to random graphs on 20 and 30 nodes are shown in Figure 3. For a random graph on $n$ nodes with average degree $d$, each edge is included in the graph with probability $\frac{d}{n-1}$, independently of all other edges. In the experiment, we choose average degree 5 for the graphs on 20 nodes and 7 for the graphs on 30 nodes. From Figure 3, the probability of success of the algorithm improves substantially when we increase $D_2$ from 0 to 1, which is very similar to the result for the eight-neighbor grids. We also note that, unlike the previous case, the algorithm with $D_1 = 3$ does have a better performance than with $D_1 = 2$, as there might be more short paths between a pair of nodes in random graphs.

In a true experiment where only the data is available and no prior knowledge of the MRF is available, the choice of $\epsilon$ itself may affect the performance of the algorithm. At this time, we do not have any theoretical results to inform the choice of $\epsilon$. We briefly present a heuristic, which seems reasonable.
However, extensive testing of the heuristic is required before we can confidently state that the heuristic is reasonable, which is beyond the scope of this paper. Our proposed heuristic is as follows.
For a given $D_1$ and $D_2$, we compute $\hat{I}_{ij}$ for each pair of nodes $i$ and $j$. If the choice of $D_1$ and $D_2$ is good, $\hat{I}_{ij}$ is expected to be close to 0 for non-edges and away from 0 for edges. Therefore, we can view the problem of choosing the threshold $\epsilon$ as a two-class hypothesis testing problem, where the non-edge class concentrates near 0 while the edge class is more spread out. If we view $\hat{I}$, the collection of $\hat{I}_{ij}$ for all $i$ and $j$, as samples generated from the distribution of some random variable $Z$, then the hypothesis testing problem can be viewed as one of finding the right $\epsilon$ such that the density of $Z$ has a big spike below $\epsilon$. One heuristic is to first estimate a smoothed density function from $\hat{I}$ via kernel density estimation [10] and then set $\epsilon$ to be the right boundary of the big spike near 0. In order to choose proper $D_1$ and $D_2$ for the algorithm, we can start with $(D_1, D_2) = (0, 0)$. At each step, we run the algorithm with the two pairs of values $(D_1 + 1, D_2)$ and $(D_1, D_2 + 1)$ separately, and choose the pair that produces the more significant change in the density estimated from $\hat{I}$ as the new value for $(D_1, D_2)$. We continue this process and stop increasing $D_1$ or $D_2$ if at some step there is no significant change for either pair of values.
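One possible reading of the thresholding part of this heuristic is sketched below. It assumes a Gaussian kernel, assumes that the spike near 0 is the global maximum of the estimated density (most pairs are non-edges), and takes the right boundary of the spike to be the first grid point where the density stops decreasing. The names `kde` and `spike_threshold`, the bandwidth, and the grid size are all illustrative choices.

```python
import math


def kde(values, grid, bandwidth):
    """Gaussian kernel density estimate of `values`, evaluated on `grid`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


def spike_threshold(values, bandwidth, grid_size=200):
    """Right boundary of the dominant spike of the estimated density."""
    hi = max(values)
    grid = [hi * k / (grid_size - 1) for k in range(grid_size)]
    dens = kde(values, grid, bandwidth)
    # Assumed: the non-edge spike near 0 is the highest mode.
    k = max(range(grid_size), key=lambda idx: dens[idx])
    # Walk right while the density keeps decreasing.
    while k + 1 < grid_size and dens[k + 1] < dens[k]:
        k += 1
    return grid[k]
```

Pairs with $\hat{I}_{ij}$ above the returned value would be declared edges. Comparing the estimated densities for $(D_1 + 1, D_2)$ and $(D_1, D_2 + 1)$ would need an additional distance measure between densities, which the text leaves open.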
Justifying this heuristic either through extensive experimentation or theoretical analysis is a topic for future research.

We thank Anandkumar and Annapureddy for useful discussions. In particular, we would like to thank Anandkumar for suggesting the use of the SAW tree in the proof of Lemma C.7 and Annapureddy for suggesting the proof of Lemma 3.1.

APPENDIX A: BOUNDED DEGREE GRAPH
A.1. Proof of Lemma 5.1. Let N S be the neighbor nodes of S. Note that each node in S has at most d neighbors in N S .
A.2. Correlation Decay and Large Girth. We assume that the Ising model on the bounded degree graph is further in the correlation decay regime. Both Theorem 3.5 and Lemma 5.3 immediately follow from the following more general result, which characterizes the conditions under which the Ising model is $(D_1, D_2, \epsilon)$-loosely connected. We will make the connections at the end of this subsection.
where $A = \frac{1}{1800}(1 - e^{-4J_{\min}})e^{-8(D_1+D_2)dJ_{\max}}$, and let $\epsilon = 48Ae^{4(D_1+D_2)dJ_{\max}}$. Assume that there are at most $D_1$ paths shorter than $h$ between non-neighbor nodes and $D_2$ paths shorter than $h$ between neighboring nodes. Then $\forall (i, j) \in E$,

Proof. First consider $(i, j) \in E$. Without loss of generality, assume $J_{ij} > 0$. By the assumption that there are at most $D_2$ paths shorter than $h$ between neighboring nodes, there exists $T \subset N_i$, $|T| \le D_2$ such that, when the set $T$ is removed from the graph, the length of any path from $i$ to $j$ is no less than $h$. For any $S$, let $T' = T \setminus S$. To simplify the notation, let $R = S \cup T'$ and $W = V \setminus R$. For any value $x_R$, let $Q$ be the joint probability of $X_W$ conditioned on $X_R = x_R$, i.e., $Q(X_W) = P(X_W | x_R)$. $Q$ has the same edge coefficients for the unconditioned nodes, but is not zero-field as conditioning induces external fields. Let $\tilde{Q}$ denote the joint probability when edge $(i, j)$ is removed from $Q$. We note that $Q$ and $\tilde{Q}$ satisfy the same correlation decay property as $P$, so

Using the above inequality, we have the following lower bound on the $P$-test quantity.
Let $\check{Q}$ denote the joint probability when all the external field terms are removed from $\tilde{Q}$; i.e., $\check{Q}(X_W) \propto \tilde{Q}(X_W)e^{h_W^T X_W}$. As there are at most $(D_1 + D_2)d$ edges between $R$ and $W$, we have $\|h_W\|_1 \le (D_1 + D_2)dJ_{\max}$. Hence, for any subset $U \subset W$ and value $x_U$,

Moreover, $\check{Q}$ is zero-field by definition and again has the same correlation decay condition as $P$, hence

which gives a lower bound on $\check{Q}(1, -1)$. Therefore, we have

The same lower bound applies for $\tilde{Q}(-1, 1)$. Hence,

The second inequality uses the fact that $e^{\beta\alpha^h} < 2$. The last inequality is by the choice of $h$.

Next consider $(i, j) \notin E$. By the choice of $h$, there exists $S \subset N_i$, $|S| \le D_1$ such that, when the set $S$ is removed from the graph, the distance from $i$ to $j$ is no less than $h$. Let $T$ be any set with $|T| \le D_2$. As there is no edge between $i, j$, the joint probabilities $Q$ and $\tilde{Q}$ are the same. Then $\forall x_S, x_T, x_i, x_j$,

Similarly to the above, we have

The same bound applies for $\tilde{Q}(-x_j)$. Therefore,

By correlation decay and the fact that $\beta\alpha^h < \ln 2 < 1$,

Hence, by the choice of $h$,

Now we specialize this lemma for large girth graphs, in which there is at most one short path between non-neighbor nodes and no short non-direct path between neighboring nodes. Setting $D_1 = 1$ and $D_2 = 0$ in the theorem, we get Theorem 3.5. For the lower bound on the correlation between neighbor nodes, we set $D_1 = D_2 = 0$ in the theorem and get Lemma 5.3.

APPENDIX B: FERROMAGNETIC ISING MODELS
B.1. Proof of Corollary 6.3. By Proposition 6.2, we apply Definition 6.1 to $X$ with $f(X) = X_i$ and $g(X) = X_j$, and get
$$E[X_i X_j] \ge E[X_i]E[X_j].$$
As there is no external field, $P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$ for any $i$ and $P(X_i = x_i, X_j = x_j) = P(X_i = -x_i, X_j = -x_j)$ for any $i, j$. Therefore, $E[X_i] = 0$ and
$$E[X_i X_j] = 2P(X_i = 1, X_j = 1) - 2P(X_i = 1, X_j = -1) \ge 0.$$
By the above inequality, noticing that $P(X_i = 1, X_j = 1) + P(X_i = 1, X_j = -1) = \frac{1}{2}$, we get the result.
B.2. Proof of Lemma 6.4. For any $i \in V$, $j \in N_i$, $S \subset V$, $Q$, $\tilde{Q}$, $\check{Q}$ are defined as in the proof of Lemma A.1. When $X$ is ferromagnetic but with external field, as in Corollary 6.3, we can show that

for any $i, j$. Therefore, we have max

We note that $\check{Q}$ is zero field, so by Corollary 6.3 we get $\check{Q}(1, 1) = \check{Q}(-1, -1) \ge \frac{1}{4}$. As shown in Lemma A.1,

The same lower bound can be obtained for $\tilde{Q}(-1, -1)$. Plugging the lower bounds into the above inequality, we get the result.

APPENDIX C: RANDOM GRAPHS
The proofs in this section are related to the techniques developed in [2,3]. The key differences are in adapting the proofs for general Ising models, as opposed to ferromagnetic models. We point out similarities and differences as we proceed with the section.
C.1. Self-Avoiding-Walk Tree and Some Basic Results. This subsection introduces the notion of a self-avoiding-walk (SAW) tree, first introduced in [21], and presents some properties of a SAW tree. For an Ising model on a graph $G$, fix an ordering of all the nodes. We say edge $(i, j)$ is larger (resp. smaller) than $(i, l)$ with respect to node $i$ if $j$ comes after (resp. before) $l$ in the ordering. The SAW tree rooted at node $i$ is denoted as $T_{saw}(i; G)$. It is essentially the tree of self-avoiding walks originating from node $i$, except that the terminal nodes closing a cycle are also included in the tree with a fixed value $+1$ or $-1$. In particular, a terminal node is fixed to $+1$ (resp. $-1$) if the closing edge of the cycle is larger (resp. smaller) than the starting edge with respect to the terminal node. Let $A$ denote the set of all terminal nodes in $T_{saw}(i; G)$ and $x_A$ denote the fixed configuration on $A$. For a set $S \subset V$, let $U(S)$ denote the set of all non-terminal copies of nodes in $S$ in $T_{saw}(i; G)$. Notice that there is a natural way to define conditioning on $T_{saw}(i; G)$ according to the conditioning on $G$; specifically, if node $j$ in graph $G$ is fixed to a certain value, the non-terminal copies of $j$ in tree $T_{saw}(i; G)$ are fixed to the same value.
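The construction can be made concrete with a short sketch. It assumes that nodes are labeled by integers and that the fixed ordering is the natural order on the labels; each tree node is represented by the walk from the root that reaches it. The function name `saw_tree` and this representation are illustrative choices, not notation from the paper.

```python
def saw_tree(adj, root):
    """Enumerate the nodes of the SAW tree rooted at `root`.

    adj maps each node to a list of its neighbors. Returns a list of
    (walk, fixed_value) pairs, where walk is the tuple of graph nodes from
    the root to the tree node, and fixed_value is None for ordinary nodes
    and +1 or -1 for terminal nodes that close a cycle.
    """
    nodes = [((root,), None)]
    stack = [(root,)]
    while stack:
        walk = stack.pop()
        last = walk[-1]
        prev = walk[-2] if len(walk) > 1 else None
        for nxt in sorted(adj[last]):
            if nxt == prev:
                continue  # walks do not immediately backtrack
            new_walk = walk + (nxt,)
            if nxt in walk:
                # The walk returns to nxt and closes a cycle there.
                m = walk.index(nxt)
                start_nbr = walk[m + 1]  # cycle leaves nxt via (nxt, start_nbr)
                close_nbr = last         # cycle returns via (nxt, close_nbr)
                # +1 if the closing edge is larger than the starting edge.
                value = 1 if close_nbr > start_nbr else -1
                nodes.append((new_walk, value))
            else:
                nodes.append((new_walk, None))
                stack.append(new_walk)
    return nodes
```

On a triangle with nodes 0, 1, 2 and root 0, the tree has seven nodes: the walk (0, 1, 2, 0) ends in a terminal copy of node 0 fixed to $+1$, and (0, 2, 1, 0) ends in one fixed to $-1$. The tree size can grow exponentially with the number of cycles, so the sketch is meant for small graphs only.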
One important result, [11, Theorem 7], motivated by [21], says that the conditional probability of node $i$ on graph $G$ is the same as the corresponding conditional probability of node $i$ on tree $T_{saw}(i; G)$, which is easier to deal with.
Next we list some basic results which will be used in later proofs. First we have the following lemma about the number of short paths between a pair of nodes from [2]. The second part of Theorem 3.6 is an immediate result of this lemma.
Lemma C.2. [2] For all i, j ∈ V , the number of paths shorter than γ p between nodes i, j is at most 2 almost always.
Let $B(i, l; T_{saw}(i; G))$ be the set of nodes of distance $l$ from $i$ on the tree $T_{saw}(i; G)$. Recall that $A$ is the set of terminal nodes in the tree. Let $\tilde{A}$ be the subset of $A$ consisting of the terminal nodes that are of distance at most $\gamma_p$ from $i$. The sizes of $B(i, l; T_{saw}(i; G))$ and $\tilde{A}$ are upper bounded as follows.

Proof. Each terminal node in $\tilde{A}$ corresponds to a cycle connected to $i$ with the total length of the cycle and the path to $i$ at most $\gamma_p$. Let $OLO_l$ denote the subgraph consisting of two connected circles with total length $l$. This structure has $l - 1$ nodes and $l$ edges. Let $H = \{OLO_l, l \le 2\gamma_p\}$ and let $N_H$ denote the number of subgraphs containing an instance from $H$. Then it is equivalent to show that there is at most 1 such small cycle close to each node, or $N_H = 0$ almost always.

C.2. Correlation Decay in Random Graphs. This subsection proves the first part of Theorem 3.6, which characterizes the correlation decay property of a random graph.
First we state a correlation decay property for tree graphs. This result shows that having external fields only makes the correlation decay faster.
Lemma C.5. Let $P$ be a general Ising model with external fields on a tree $T$. Assume $|J_{ij}| \le J_{\max}$. $\forall i, j \in T$,

Proof. The basic idea in the proof is to get an upper bound that does not depend on the external field. To do this, we proceed as in the proof of Lemma 4.1 in [5]. First, as noted in [5], w.l.o.g. assume the tree is a line from $i$ to $j$. Then, we prove the result by induction on the number of hops in the line.
1. d(i, j) = 1 or j ∈ N i . The graph has only two nodes. We have Hence, This function is even in both J ij and h i . Without loss of generality, assume J ij ≥ 0, h i ≥ 0. It is not hard to see that the RHS is maximized The inequality suggests that, when there is external field, the impact of one node on the other is reduced.
2. Assume the claim is true for d(i, j) ≤ k. For d(i, j) = k + 1, pick any l on the path from i to j, and note that X i -X l -X j forms a Markov chain. Moreover, d(i, l) ≤ k and d(l, j) ≤ k.
The third equality follows by observing that $P(x_l | x_j) - P(x_l | x_j') = -(P(-x_l | x_j) - P(-x_l | x_j'))$. The last inequality is by induction.
Writing the conditional probability on a graph as a conditional probability on the corresponding SAW tree, we can apply the above lemma and show the correlation decay property for random graphs.
Lemma C.6. Let $P$ be a general Ising model on a graph $G$. Fix $i \in V$. $\forall j \notin N_i$, let $S$ be the set that separates the paths shorter than $\gamma$ between $i, j$ and $B = B(i, \gamma; T_{saw}(i; G))$, then $\forall x_i, x_j, x_j', x_S$,

Proof. Let $Z$ be the subset of $U(j)$ on $T_{saw}(i; G)$ that is not separated by $U(S)$ from $i$. By the definition of $S$, $Z$ is of distance at least $\gamma$ from $i$. So the $\gamma$-sphere $B$ separates $Z$ and $i$.
In the above, (a) follows from the property of SAW tree in Prop C.1.
Step (b) is by the choice of S and the definition of Z.
Step (c) uses the fact that $Z$ is separated from $i$ by $B$. In (d), $x_B^M$ and $x_B^m$ represent the maximizer and minimizer, respectively.
Step (e) is by telescoping the sign of $x_B$. Notice that the Hamming distance between $x_B^M$ and $x_B^m$ is at most $|B|$, and we can apply the above lemma to each pair as the conditioning terms differ only on one node. The above proof is similar to the proof of Lemma 3 in [2]. However, in going from step (c) to step (d) above, it is important to note that our proof holds for general Ising models, whereas the proof in [2] is specific to ferromagnetic Ising models.
Proof of Theorem 3.6. As in [2], setting $\gamma = \gamma_p$ in the above lemma and noticing that

C.3. Asymptotic Lower Bound on $P(x_i | x_R)$ When $|R| \le 3$. This subsection proves that $P(x_i | x_R)$ is lower bounded by some constant when $|R| \le 3$. This result comes in handy when proving the other two theorems. It was conjectured in [2], without a proof, to hold for ferromagnetic Ising models on the random graph $G(p, c/p)$. Here we prove that it is also true for general Ising models on the random graph.
The basic idea is that the conditional probability $P(x_i | x_R)$ is equal to some conditional probability on a SAW tree, which in turn is viewed as some unconditional probability on the same tree with induced external fields. Then we apply a tree reduction to the SAW tree until only the root is left, and show that the induced external field on the root is bounded, which implies that the probability of the root taking $+1$ or $-1$ is bounded.
On a tree graph, when calculating a probability which involves no nodes in a subtree, we can reduce the subtree by simply summing (marginalizing) over all the nodes in it. This reduction produces an Ising model on the rest of the tree with the same $J_{ij}$ and $h_i$, except for the root of the subtree, which would have an induced external field due to the reduction of the subtree. The probability we want to calculate remains unchanged on this new tree. Such induced external fields are bounded according to the following lemma.
Lemma C.8. Consider a leaf node 2 and its parent node 1. The induced external field $h_1$ on node 1 due to summation over node 2 satisfies

We first prove an inequality which is used in the proof of the above lemma.
Proof of Lemma C.8.
The last inequality follows from Lemma C.9.
It is easy to see that $|h_1| \le |h_2| \tanh |J_{\max}| < |h_2|$. By induction, we can bound the external field induced by the whole subtree.
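The single-leaf reduction can be checked numerically. The sketch assumes the two-node model $P(x_1, x_2) \propto \exp(J x_1 x_2 + h_1 x_1 + h_2 x_2)$ and uses the standard closed form $\frac{1}{2}\log\frac{\cosh(J + h_2)}{\cosh(J - h_2)}$ for the field induced on the parent, which may differ in presentation from the statement of Lemma C.8. The function names are hypothetical.

```python
import math


def induced_field(J, h2):
    """Field induced on the parent when a leaf (coupling J, field h2) is summed out."""
    return 0.5 * math.log(math.cosh(J + h2) / math.cosh(J - h2))


def parent_marginal(J, h1, h2):
    """P(x_1 = +1) computed by brute force on the two-node model."""
    w = {
        x1: sum(math.exp(J * x1 * x2 + h1 * x1 + h2 * x2) for x2 in (-1, 1))
        for x1 in (-1, 1)
    }
    return w[1] / (w[1] + w[-1])


def field_marginal(h):
    """P(x = +1) for a single node with external field h."""
    return math.exp(h) / (math.exp(h) + math.exp(-h))
```

Under these assumptions, `parent_marginal(J, h1, h2)` agrees with `field_marginal(h1 + induced_field(J, h2))`, and the magnitude of the induced field stays below $|h_2| \tanh |J|$, in line with the inequality above.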
Proof of Lemma C.7. First we have

where $Q$ is the probability on the tree with external fields induced by $x_B^m$, $x_{\tilde{U}(R)}$, $x_{\tilde{A}}$. We only need to consider the external fields on the parent nodes of $B$, $\tilde{U}(R)$, $\tilde{A}$ as the conditional probability is on a tree. The nodes affected by $B$ are all $\gamma_p$ away from $i$ and the total number of them is no larger than $|B|$, which is bounded by Lemma C.3. The number of nodes affected by $\tilde{U}(R)$, $\tilde{A}$ is no larger than $|\tilde{U}(R)| + |\tilde{A}|$. By Lemma C.2 and Lemma C.4, $|\tilde{U}(R)| \le 2|R|$ and $|\tilde{A}| \le 1$ almost always. Applying the reduction technique to the tree until only the root node $i$ is left, by Lemma C.8, we bound the induced external field on $i$ as

When $p$ is large enough, there exists some constant $C$ such that $P(x_i | x_R) \ge C$.
C.4. Proof of Theorem 3.7. Let S be the set that separates all the paths shorter than γ p between nodes i, j with size |S| ≤ 3. It is straightforward to show that I(X i ; X j |X S ) = o(p −2κ ) in a manner similar to [2, Lemma 5]. The only difference is that the correlation decay property in Theorem 3.6 takes a different form, which is easier to apply, therefore the proof there needs to be modified accordingly. We also note that the constant C in Lemma C.7 is referred to as f min (S) in [2]. The details are omitted here.
C.5. Proof of Theorem 3.8. When $j$ is a neighbor of $i$, conditioned on the approximate separator $T$, there is one copy of $j$ which is a child of the root $i$ in the SAW tree and is the only copy that is within $\gamma_p$ from $i$. In Theorem 3.8, we show that the effect of conditioning on $T$ is bounded and this copy of $j$ has a nontrivial impact on $i$. With a little abuse of notation, we use $j$ to denote this copy of $j$ in $T_{saw}(i; G)$. W.l.o.g. assume $J_{ij} > 0$. As P (x i |x B , xŨ (T ) , xÃ; T saw (i; G))P (x B |x Z , x U (T ) , x A ; T saw (i; G))| ≥ min

Using this result, the lower bound $I(X_i; X_j | X_T) = \Omega(1)$ simply follows from the proof of [2, Lemma 7]. Again we note that the constant $C$ in Lemma C.7 is referred to as $f_{\min}(T)$ in [2]. The details are omitted here.
C.6. Proof of Theorem 6.6. The proof of the theorem needs the following lemma.
So flipping one node from $+1$ to $-1$ reduces the conditional probability regardless of what value the rest of the nodes take. Continuing this process until we flip all the nodes in $S$, we get the result $P(x_i = +1 | x_S = +1) \ge P(x_i = +1 | x_S = -1)$.
As the size of Z is only a constant, by the same reasoning, we finish the theorem.
By Lemma D.1,
$$|\hat{H}(X_i, X_j, X_S) - H(X_i, X_j, X_S)| \le -\|\hat{P}(X_i, X_j, X_S) - P(X_i, X_j, X_S)\|_1 \log \frac{\|\hat{P}(X_i, X_j, X_S) - P(X_i, X_j, X_S)\|_1}{|\mathcal{X}|^{D_1+D_2+2}}$$
The last inequality used the fact that $0 < -\sqrt{\gamma} \log \sqrt{\gamma} < 1$ for $0 < \gamma < 1$. Similarly, we have the same upper bound for $|\hat{H}(X_i, X_S) - H(X_i, X_S)|$, $|\hat{H}(X_j, X_S) - H(X_j, X_S)|$ and $|\hat{H}(X_S) - H(X_S)|$. We finish the proof by noticing that $I(X_i; X_j | X_S) = H(X_i, X_S) + H(X_j, X_S) - H(X_i, X_j, X_S) - H(X_S)$.
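The entropy decomposition at the end of the proof also gives a direct way to compute the empirical conditional mutual information from samples. The following plug-in estimator is a minimal sketch; the function names `entropy` and `cond_mutual_info` are illustrative, and natural logarithms are assumed.

```python
import math
from collections import Counter


def entropy(samples, idx):
    """Empirical joint entropy (in nats) of the coordinates listed in idx."""
    n = len(samples)
    counts = Counter(tuple(row[k] for k in idx) for row in samples)
    return -sum(c / n * math.log(c / n) for c in counts.values())


def cond_mutual_info(samples, i, j, S):
    """Plug-in estimate of I(X_i; X_j | X_S) via the four-entropy identity."""
    S = list(S)
    return (
        entropy(samples, [i] + S)
        + entropy(samples, [j] + S)
        - entropy(samples, [i, j] + S)
        - entropy(samples, S)
    )
```

With an empty conditioning set the last term vanishes and the estimate reduces to the empirical mutual information between $X_i$ and $X_j$.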