The Minimax Learning Rate of Normal and Ising Undirected Graphical Models

Let $G$ be an undirected graph with $m$ edges and $d$ vertices. We show that $d$-dimensional Ising models on $G$ can be learned from $n$ i.i.d. samples to within an expected total variation distance that is at most a constant factor of $\min\{1, \sqrt{(m + d)/n}\}$, and that this rate is optimal. We show that the same rate holds for the class of $d$-dimensional multivariate normal undirected graphical models with respect to $G$. We also identify the optimal rate of $\min\{1, \sqrt{m/n}\}$ for Ising models with no external magnetic field.


Introduction
The Ising model is a popular mathematical model inspired by ferromagnetism in statistical mechanics. The model consists of discrete {−1, 1} random variables representing magnetic dipole moments of atomic spins. The spins are arranged in a graph (originally a lattice, but other graphs have also been considered), allowing each spin to interact with its graph neighbors. Sometimes, the spins are also subject to an external magnetic field.
The Ising model is one of many possible mean-field models for spin glasses. Its probabilistic properties have caught the attention of many researchers; see, e.g., the monographs of Talagrand [30,31,32]. The analysis of social networks has brought computer scientists into the fray, as precisely the same model appears there in the context of community detection [6].
In this work we view an Ising model as a probability distribution on {−1, 1}^d, and consider the following statistical inference and learning problem, known as density estimation or distribution learning: given i.i.d. samples from an unknown Ising model I on a known graph G, can we create a probability distribution on {−1, 1}^d that is close to I in total variation distance? If we have n samples, then how small can we make the expected value of this distance? We prove that if G has m edges, the answer to this question is bounded from above and below by constant factors of √((m + d)/n). In the case when there is no external magnetic field, the answer is instead √(m/n).
Our techniques carry over to the continuous case and allow us to prove a similar minimax rate for learning the class of d-dimensional normal undirected graphical models on G. It is surprising that the minimax rate for this class was not known, even when G is the complete graph, corresponding to the class of all d-dimensional normal distributions.
1.1. Main results. We start by stating our result for normal distributions. For precise definitions of all terms mentioned below, see Section 2.
Theorem 1.1 (Main result for learning normals). Let G be a given undirected graph with vertex set {1, . . . , d} and m edges. Let F_G be the class of d-dimensional multivariate normal undirected graphical models with respect to G. Then, the minimax rate for learning F_G in total variation distance is bounded from above and below by constant factors of min{1, √((m + d)/n)}.
The upper bound follows from standard techniques (see Section 3.1) and a lower bound of min{1, √(d/n)} is known (see Section 1.2); our main technical contribution is to show a lower bound of min{1, √(m/n)}, from which Theorem 1.1 follows. This theorem immediately implies a tight result on the minimax rate for learning the class of all d-dimensional normals, if we take the graph G to be complete. In this specific case, the upper bound is already known, so our contribution is the matching lower bound.
Corollary 1.2. The minimax rate for learning the class of all d-dimensional multivariate normal distributions in total variation distance is bounded from above and below by constant factors of min{1, d/√n}.
We remark that for the class of mean-zero normal undirected graphical models, we prove a lower bound of min{1, √(m/n)}, while the best known upper bound is min{1, √((m + d)/n)}. In practice, the underlying graph is typically connected, which means that m ≥ d − 1, so these bounds match.
We prove similar rates as in Theorem 1.1 for the class of Ising models, which resemble discrete versions of multivariate normal distributions. An Ising model in dimension d is supported on {−1, 1}^d and comes with an undirected graph G = (V, E) with vertex set V = {1, . . . , d}, edge set E ⊆ {{i, j} : i ≠ j ∈ V}, interactions w_ij ∈ R for each {i, j} ∈ E, and an external magnetic field h ∈ R^d. Note that our definition has no temperature parameter; we have absorbed it into the weights.

Theorem 1.3 (Main result for learning Ising models). Let G be a given undirected graph with vertex set {1, . . . , d} and m edges, and let I_G be the class of d-dimensional Ising models with respect to G. (i) The minimax rate for learning I_G in total variation distance is bounded from above and below by constant factors of min{1, √((m + d)/n)}. (ii) Let I′_G be the subclass of I_G of Ising models with no external magnetic field. The minimax rate for learning I′_G in total variation distance is bounded from above and below by constant factors of min{1, √(m/n)}.
In all of the above cases, the full structure and labeling of the underlying graph G is known in advance. We next consider the case in which it is only known that the underlying graph has d vertices and m edges.

Theorem 1.4. Let F_{d,m} and I_{d,m} be the classes of all normal and Ising undirected graphical models, respectively, with respect to some unknown graph with d vertices and m edges. The minimax learning rates for F_{d,m} and I_{d,m} are both bounded from above by a constant factor of min{1, √((m + d) log d/n)}, and bounded from below by a constant factor of min{1, √((m + d)/n)}.
The lower bound in this theorem follows immediately from our lower bounds for the case in which the graph is known.
In the next section we review related work. In Section 2 we discuss preliminaries. Theorem 1.1, Theorem 1.3 and Theorem 1.4 are proved in Section 3, Section 4, and Section 5, respectively. We conclude with some open problems in Section 6.
1.2. Related work. Density estimation is a central problem in statistics and has a long history [9,10,19,33]. It has also been studied in the learning theory community under the name distribution learning, starting from [21], whose focus is on the computational complexity of the learning problem. Recently, it has gained a lot of attention in the machine learning community, as one of the important tasks in unsupervised learning is to understand the distribution of the data, which is known to significantly improve the efficiency of learning algorithms (e.g., [15, page 100]). See [12] for a recent survey from this perspective.
An upper bound on the order of d/√n for estimating d-dimensional normals can be obtained via empirical mean and covariance estimation (e.g., [2, Theorem B.1]) or via Yatracos' techniques based on VC-dimension (e.g., [3, Theorem 13]). Regarding lower bounds, Acharya, Jafarpour, Orlitsky, and Suresh [1, Theorem 2] proved a lower bound on the order of √(d/n) for spherical normals (i.e., normals with identity covariance matrix), which implies the same lower bound for general normals. The lower bound for general normals was recently improved to a constant factor of d/√(n log n) by Ashtiani, Ben-David, Harvey, Liaw, Mehrabian, and Plan [2]. In comparison, our result shaves off the logarithmic factor. Moreover, their result is nonconstructive and relies on the probabilistic method, while our argument is fully deterministic.

For the Ising model, the main focus in the literature has been on learning the structure of the underlying graph rather than learning the distribution itself, i.e., how many samples are needed to reconstruct the underlying graph with high probability? See [27,29] for some lower bounds and [16,22] for some upper bounds. Otherwise, the Ising model itself has been studied by physicists in other settings for nearly a century. See the books of Talagrand for a comprehensive look at the mathematics of the Ising model [30,31,32].
Daskalakis, Dikkala, and Kamath [8] were the first to study Ising models from a statistical point of view. However, their goal is to test whether an Ising model has certain properties, rather than estimating the model, which is our goal. Moreover, their focus is on designing efficient testing algorithms. They prove polynomial sample complexities and running times for testing various properties of the model.
An alternative goal would be to estimate the parameters of the underlying model (e.g., [20]) rather than coming up with a model that is statistically close, which is our focus. We remark that these two goals are quantitatively different, although similar techniques may be used for both. In general, estimating the parameters of a model to within some accuracy does not necessarily result in a distribution that is close to the original distribution in a statistical sense. For instance, one can construct two covariance matrices Σ and Σ̃ that are entrywise very close, yet Σ is non-singular while Σ̃ is singular; the two mean-zero normal distributions with covariance matrices Σ and Σ̃ are then at total variation distance 1 from one another. Conversely, if two distributions are close in total variation distance, their parameters are not necessarily close to within the same accuracy.
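The contrast between parameter closeness and statistical closeness can be seen numerically; the following is an illustrative sketch (the specific matrices are our choice, not taken from the text):

```python
import numpy as np

# Illustrative 2x2 example: two entrywise-close covariance matrices,
# one nonsingular and one singular.
eps = 1e-6
Sigma = np.array([[1.0, 1.0 - eps],
                  [1.0 - eps, 1.0]])        # positive definite
Sigma_tilde = np.ones((2, 2))               # singular (rank 1)

# The entries differ by at most roughly eps ...
max_diff = np.abs(Sigma - Sigma_tilde).max()

# ... yet N(0, Sigma_tilde) is supported on the line {x1 = x2}, a
# Lebesgue-null set, so TV(N(0, Sigma), N(0, Sigma_tilde)) = 1.
rank = np.linalg.matrix_rank(Sigma)         # full rank
rank_tilde = np.linalg.matrix_rank(Sigma_tilde)  # rank deficient
```

Here the TV-distance is 1 because the singular normal places all its mass on a set to which the nonsingular normal assigns probability zero.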

Preliminaries
The goal of density estimation is to design an estimator f̂ for an unknown function f taken from a known class of functions F. In the continuous case, F is a class of probability density functions with sample space X = R^d for some d ≥ 1; in the discrete case, F is a class of probability mass functions with a countable sample space X. In either case, in order to create the estimator f̂, we have access to samples X_1, . . . , X_n drawn i.i.d. from f.

Our measure of closeness is the total variation (TV) distance: for functions f, g : X → R, their TV-distance is defined as TV(f, g) = ‖f − g‖_1/2, where the L_1-norm of f is ‖f‖_1 = ∫_X |f(x)| dx in the continuous case, and ‖f‖_1 = Σ_{x∈X} |f(x)| in the discrete case. Further along, we will also need the Kullback–Leibler (KL) divergence or relative entropy [23], which is another measure of closeness of distributions, defined by KL(f‖g) = ∫_X f(x) log(f(x)/g(x)) dx in the continuous case, and KL(f‖g) = Σ_{x∈X} f(x) log(f(x)/g(x)) in the discrete case.
Formally, in the continuous case, we can write f = dF/dµ for a probability measure F and µ the Lebesgue measure on R^d, and in the discrete case f = dF/dµ for a probability measure F and µ the counting measure on the countable set X. In view of this unified framework, we say that F is a class of densities and that f̂ is a density estimate, in both the continuous and the discrete settings. The total variation distance has a natural probabilistic interpretation as TV(f, g) = sup_{A⊆X} |F(A) − G(A)|, where F and G are probability measures corresponding to f and g, respectively. So, the TV-distance lies in [0, 1]. Also, it is well known that the KL-divergence is nonnegative, and is zero if and only if the two densities are equal almost everywhere. However, it is not symmetric in general, and can become +∞.
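The stated properties of the TV-distance and the KL-divergence are easy to verify numerically on a small discrete sample space; a minimal sketch:

```python
import numpy as np

# Two probability mass functions on a three-point sample space.
f = np.array([0.9, 0.05, 0.05])
g = np.array([1/3, 1/3, 1/3])

# TV via the L1-norm: TV(f, g) = ||f - g||_1 / 2, always in [0, 1].
tv = 0.5 * np.abs(f - g).sum()

# KL-divergence in both directions: nonnegative, but not symmetric.
kl_fg = np.sum(f * np.log(f / g))   # KL(f || g)
kl_gf = np.sum(g * np.log(g / f))   # KL(g || f)
```

Running this, `tv` lies in [0, 1], both KL values are nonnegative, and the two KL values differ, illustrating the asymmetry.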
For density estimation there are various possible measures of distance between distributions. Here we focus on the TV-distance, since it has several appealing properties, such as being a metric and having a natural probabilistic interpretation. For a detailed discussion on why TV is a natural choice, see [11, Chapter 5]. If f̂ is a density estimate, we define the risk of the estimator f̂ with respect to the class F as

R_n(f̂, F) = sup_{f∈F} E{TV(f̂, f)},

where the expectation is over the n i.i.d. samples from f, and possible randomization of the estimator. The minimax risk or minimax rate for F is the smallest risk over all possible estimators,

R_n(F) = inf_{f̂} R_n(f̂, F).

For a class of functions F defined on the same domain X, its Yatracos class A is the class of sets defined by

A = {{x ∈ X : f(x) > g(x)} : f, g ∈ F, f ≠ g}.

The following powerful result relates the minimax risk of a class of densities to an old well-studied combinatorial quantity called the Vapnik–Chervonenkis (VC) dimension [34]. Indeed, let A ⊆ 2^X be a family of subsets of X. The VC-dimension of A, denoted by VC(A), is the size of the largest set X′ ⊆ X such that for each Y ⊆ X′ there exists B ∈ A such that X′ ∩ B = Y. See, e.g., [11, Chapter 4] for examples and applications.

Theorem 2.1. There is a universal constant c > 0 such that for any class of densities F with Yatracos class A,

R_n(F) ≤ c √(VC(A)/n).

On the other hand, there are several methods for obtaining lower bounds on the minimax risk; we emphasize, in particular, the methods of Assouad [4], Le Cam [25,26], and Fano [17]. Each of these involves picking a finite subclass G ⊆ F and, using the fact that R_n(G) ≤ R_n(F), developing a lower bound on the minimax risk of G. See [9,11,37] for more details. We will use the following result, known as (generalized) Fano's lemma, originally due to Khas'minskii [17].

Lemma 2.2 (Generalized Fano's lemma). Let f_1, . . . , f_N ∈ F be densities such that ‖f_i − f_j‖_1 ≥ α and KL(f_i‖f_j) ≤ β for all i ≠ j. Then,

R_n(F) ≥ (α/4) (1 − (nβ + log 2)/log N).

In light of this lemma, to prove a minimax risk lower bound on a class of densities F, we shall carefully pick a finite subclass of densities in F, such that any two densities in this subclass are far apart in L_1-distance but close in KL-divergence.
Throughout this paper, we will be estimating densities from classes with a given graphical dependence structure, known as undirected graphical models [24]. The underlying graph will always be undirected and without parallel edges or self-loops, so we will omit these qualifiers henceforth. Let G = (V, E) be a given graph with vertex set V = {1, . . . , d} and edge set E. A set of random variables {X_1, . . . , X_d} with everywhere strictly positive densities forms a graphical model or Markov random field (MRF) with respect to G if for every {i, j} ∉ E, the variables X_i and X_j are conditionally independent given {X_k : k ≠ i, j}.
Often, the problem of density estimation is framed slightly differently than we have presented it: given ε ∈ (0, 1), we can be interested in finding the smallest number of i.i.d. samples m_F(ε) for which there exists a density estimate f̂ based on these samples satisfying sup_{f∈F} E{TV(f̂, f)} ≤ ε. Or, given δ ∈ (0, 1), we might want to find the minimum number of samples m_F(ε, δ) for which there is a density estimate f̂ satisfying TV(f̂, f) ≤ ε with probability at least 1 − δ for every f ∈ F. The quantities m_F(ε) and m_F(ε, δ) are known as sample complexities of the class F. Note that m_F(ε) and R_n(F) are related through the equation

m_F(ε) = min{n : R_n(F) ≤ ε},

so that determining one also determines the other. Moreover, δ is often fixed to be some small constant like 1/3 when studying m_F(ε, δ), since it can be shown that all other values of m_F for smaller δ are within a log(1/δ) factor of m_F(ε, 1/3). Then, there are versions of Theorem 2.1 and Lemma 2.2 for m_F(ε, 1/3), which introduce some extraneous log(1/ε) factors. In order to avoid such extraneous logarithmic factors, we focus on R_n(F) (equivalently, m_F(ε)) rather than m_F(ε, 1/3) or m_F(ε, δ).
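The translation between the minimax rate and the sample complexity can be sketched concretely; here we assume the rate min{1, √(m/n)} of Theorem 1.3 (ii) purely as an illustrative example:

```python
import math

# Illustration of m_F(eps) = min{ n : R_n(F) <= eps }, assuming the
# rate R_n = min{1, sqrt(m/n)} (m = number of edges) as an example.
def rate(n, m):
    return min(1.0, math.sqrt(m / n))

def sample_complexity(eps, m):
    # sqrt(m/n) <= eps  iff  n >= m / eps^2
    return math.ceil(m / eps ** 2)

m, eps = 50, 0.1
n_star = sample_complexity(eps, m)   # smallest n driving the rate below eps
```

So here inverting the rate gives the familiar 1/ε² scaling of the sample complexity in the target accuracy.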
We now recall some basic matrix analysis formulae which will be used throughout (see Horn and Johnson [18] for the proofs). For a symmetric matrix A ∈ R^{d×d} with entries a_ij, we write λ_1(A) ≥ · · · ≥ λ_d(A) for its (real) eigenvalues, so that tr(A) = Σ_i λ_i(A) and det(A) = Π_i λ_i(A). The Frobenius norm of A is ‖A‖_F = (Σ_{i,j} a_ij²)^{1/2} = (Σ_i λ_i(A)²)^{1/2}, and in particular max_i |λ_i(A)| ≤ ‖A‖_F.
Throughout this paper, we let c1, c2, . . . ∈ R denote positive universal constants. We liberally reuse these symbols, i.e., every ci may differ between proofs and statements of different results. From now on, we denote the set {1, . . . , d} by [d].

Learning Normal Graphical Models
Let d be a positive integer, P_d ⊆ R^{d×d} be the set of positive definite d × d matrices over R, and N(µ, Σ) denote the multivariate normal distribution with mean µ ∈ R^d, covariance matrix Σ ∈ P_d, and corresponding probability density

f_{µ,Σ}(x) = ((2π)^d det(Σ))^{−1/2} exp(−(x − µ)^T Σ^{−1} (x − µ)/2), x ∈ R^d.

Let G = ([d], E) be a given graph with m edges. Let P_G ⊆ P_d be the following subset of all positive definite matrices,

P_G = {Σ ∈ P_d : (Σ^{−1})_{ij} = 0 for all i ≠ j with {i, j} ∉ E}.

The main result of this section is a characterization of the minimax risk of

F_G = {f_{µ,Σ} : µ ∈ R^d, Σ ∈ P_G}.

It is known that F_G is precisely the class of d-dimensional multivariate normal graphical models with respect to G [24, Proposition 5.2].
3.1. Proof of the upper bound in Theorem 1.1. We can already prove the upper bound in Theorem 1.1 without lifting a finger. The proof is similar to that of [3, Theorem 13], which is for an upper bound on the minimax risk of all multivariate normals, corresponding to the case in which G is complete. Let A be the Yatracos class of F_G, which, after taking logarithms and simplifying, is easily seen to be contained in the larger class

{{x ∈ R^d : x^T A x + b^T x + c > 0} : A ∈ L_G, b ∈ R^d, c ∈ R},

where L_G is the linear space of symmetric d × d matrices whose off-diagonal entries vanish outside the edges of G. These sets are the positivity sets of a linear space of functions of dimension at most m + 2d + 1, so the VC-dimension of A is at most m + 2d + 1 (see, e.g., [11, Chapter 4]). By Theorem 2.1,

R_n(F_G) ≤ c1 √((m + 2d + 1)/n) ≤ c2 √((m + d)/n),

while the upper bound R_n(F_G) ≤ 1 follows simply because the TV-distance is bounded by 1.

3.2. Proof of the lower bound in Theorem 1.1. Since a lower bound on the order of min{1, √(d/n)} for spherical normals was proved in [1, Theorem 2], the lower bound in Theorem 1.1 follows from subadditivity of the square root after the following proposition.

Proposition 3.1. There exist c1, c2 > 0 such that if n ≥ c1 m, then R_n(F_G) ≥ c2 √(m/n).

Note that if n < c1 m, then R_n(F_G) ≥ R_{c1 m}(F_G) ≥ c2 √(1/c1), which implies the lower bound in Theorem 1.1 in this regime for n. We prove Proposition 3.1 via Lemma 2.2. This involves choosing a finite subset of F_G. Our normal densities will be mean-zero, but the covariance matrices will be chosen carefully. To make this choice, we use the next result which follows from an old theorem of Gilbert [14] and independently Varshamov [35] from coding theory.

Theorem 3.2 (Gilbert–Varshamov). For every m ≥ 1 there exists a set S ⊆ {−1, 1}^m with |S| ≥ 2^{m/5} such that any two distinct elements of S differ in at least m/6 coordinates.

Fix δ > 0 to be specified later. For s ∈ S, index the coordinates of s by the edges of G and let

Σ(s)^{−1} = I + ∆(s), where ∆(s)_{ij} = s_{{i,j}} δ if {i, j} ∈ E, and ∆(s)_{ij} = 0 otherwise.

In other words, Σ(s)^{−1} is symmetric with all ones on its diagonal, ±δ everywhere along the nonzero entries of the adjacency matrix of G according to the signs in s, and 0 elsewhere.

Lemma 3.3. If δ² m ≤ 1/8, then Σ(s)^{−1} is positive definite for every s ∈ S.

Proof. Since Σ(s)^{−1} is symmetric and real, all its eigenvalues are real. Write Σ(s)^{−1} = I + ∆, so that λ_i(Σ(s)^{−1}) = 1 + λ_i(∆). Observe that

max_i |λ_i(∆)| ≤ ‖∆‖_F = δ √(2m) ≤ 1/2.

Then, λ_i(Σ(s)^{−1}) ≥ 1/2 for every 1 ≤ i ≤ d, and so Σ(s)^{−1} is positive definite.
We will assume from now on that δ² m ≤ 1/8. In light of Lemma 3.3, Σ(s)^{−1} is positive definite, so it is invertible, and we let Σ(s) denote its inverse. Since we will always take the mean to be 0, we will write f_Σ for f_{0,Σ} from now on. We define the set W = {Σ(s) : s ∈ S} of covariance matrices, and let F′ = {f_Σ : Σ ∈ W}.
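The positive definiteness guaranteed by Lemma 3.3 can be checked by brute force on a small graph; a sketch (the 4-cycle below is our illustrative choice):

```python
import itertools
import numpy as np

# Check, over all sign patterns s, that Sigma(s)^{-1} = I + Delta(s)
# has eigenvalues >= 1/2 when delta^2 * m <= 1/8, as in Lemma 3.3.
d = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # a 4-cycle, so m = 4
m = len(edges)
delta = (1.0 / (8 * m)) ** 0.5             # largest delta with delta^2 m = 1/8

min_eig = np.inf
for s in itertools.product([-1.0, 1.0], repeat=m):
    Delta = np.zeros((d, d))
    for sign, (i, j) in zip(s, edges):
        Delta[i, j] = Delta[j, i] = sign * delta
    # ||Delta||_F = delta * sqrt(2m) <= 1/2 bounds every eigenvalue of Delta,
    # so the spectrum of I + Delta stays above 1/2.
    min_eig = min(min_eig, np.linalg.eigvalsh(np.eye(d) + Delta).min())
```

Every one of the 2^m signed matrices passes the eigenvalue check, matching the proof of Lemma 3.3.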
In order to prove Proposition 3.1 via Lemma 2.2, it suffices to exhibit upper bounds on the KL-divergence between any two densities in F′, and lower bounds on their L_1-distances.

Lemma 3.4. There exist c1, c2 > 0 such that for any Σ, Σ̃ ∈ P_d satisfying max{‖Σ^{−1} − I‖_F, ‖Σ̃^{−1} − I‖_F} ≤ c1,

KL(f_Σ ‖ f_Σ̃) ≤ c2 ‖Σ^{−1} − Σ̃^{−1}‖_F².

Proof. We consider a symmetrized KL-divergence, often called the Jeffreys divergence [23],

J(f, g) = KL(f ‖ g) + KL(g ‖ f),

which clearly serves as an upper bound on the quantity of interest. It is well known that J(f_Σ, f_Σ̃) = tr((Σ − Σ̃)(Σ̃^{−1} − Σ^{−1}))/2, e.g., by [23, Section 9.1]. By the Cauchy–Schwarz inequality for the inner product ⟨A, B⟩ = tr(A^T B),

J(f_Σ, f_Σ̃) ≤ ‖Σ − Σ̃‖_F ‖Σ̃^{−1} − Σ^{−1}‖_F / 2.

Write Σ^{−1} = I + ∆ just as in the proof of Lemma 3.3. Then ‖Σ‖_op = 1/(1 + λ_d(∆)) ≤ 2 for sufficiently small c1, and the same bound holds for Σ̃. Since Σ − Σ̃ = Σ(Σ̃^{−1} − Σ^{−1})Σ̃, we get ‖Σ − Σ̃‖_F ≤ 4 ‖Σ̃^{−1} − Σ^{−1}‖_F, whence J(f_Σ, f_Σ̃) ≤ 2 ‖Σ^{−1} − Σ̃^{−1}‖_F².

Unfortunately, the L_1-distance between multivariate normals does not have such a nice expression as the Jeffreys divergence does. To control some of the quantities involved in the computation of the L_1-distance, we recall some properties of sub-gaussian random variables.
The sub-gaussian norm of a random variable X is defined to be

‖X‖_ψ2 = sup_{p ≥ 1} p^{−1/2} (E{|X|^p})^{1/p}.

A random variable X is called sub-gaussian if ‖X‖_ψ2 < ∞. Observe in particular that N(0, 1) and any bounded random variable are sub-gaussian. Recall now the following well-known large deviation inequality for quadratic forms of sub-gaussian random vectors, known as the Hanson–Wright inequality.

Theorem 3.5 (Hanson–Wright inequality). Let X = (X_1, . . . , X_d) be a random vector with independent mean-zero components satisfying max_{1≤i≤d} ‖X_i‖_ψ2 ≤ K, and let A ∈ R^{d×d}. Then, for every t ≥ 0,

P{|X^T A X − E{X^T A X}| > t} ≤ 2 exp(−C min{t²/(K⁴ ‖A‖_F²), t/(K² ‖A‖_op)}),

for some universal constant C > 0.
A square matrix is called zero-diagonal if all its diagonal entries are zero.
Lemma 3.6. Let X = (X_1, . . . , X_d) be a random vector with i.i.d. components where E{X_1} = 0, E{X_1²} = 1, and ‖X_1‖_ψ2 ≤ K. Let A ∈ R^{d×d} be symmetric and zero-diagonal. Then,

(i) E{X^T A X} = 0;
(ii) E{(X^T A X)²} = 2 ‖A‖_F²;
(iii) there exists c3 > 0 such that E{|X^T A X|^p} ≤ (c3 K² p ‖A‖_F)^p for every p ≥ 1;
(iv) there exist c1, c2 > 0 such that for any t > 0, if c1 K² t ‖A‖_F ≤ 1, then E{exp(t X^T A X)} ≤ 1 + c2 t² K⁴ ‖A‖_F².

Proof. Observation (i) follows simply by writing out the quadratic form: since A is zero-diagonal and the components of X are independent and mean-zero,

E{X^T A X} = Σ_{i≠j} a_ij E{X_i} E{X_j} = 0.

To prove (ii), we expand the square, and notice that only the monomials of the form E{X_i² X_j²} are nonzero after taking expectations, so

E{(X^T A X)²} = 2 Σ_{i≠j} a_ij² = 2 ‖A‖_F².

For (iii), we integrate the tail bound of Theorem 3.5. To prove (iv), we use the power series representation of the exponential, so by (i) and (iii),

E{exp(t X^T A X)} ≤ 1 + Σ_{p≥2} (t^p/p!) E{|X^T A X|^p} ≤ 1 + Σ_{p≥2} (c3 e K² t ‖A‖_F)^p ≤ 1 + 2 (c3 e K² t ‖A‖_F)²,

as long as 2 c3 e K² t ‖A‖_F ≤ 1.

Lemma 3.7. There exist c1, c2 > 0 such that for any Σ ∈ P_d with tr(Σ^{−1} − I) = 0 and ‖Σ^{−1} − I‖_F ≤ c1, we have

1 ≤ det Σ ≤ 1 + c2 ‖Σ^{−1} − I‖_F².

Proof. Write Σ^{−1} = I + ∆, so that Σ_i λ_i(∆) = tr(∆) = 0. Since log(1 + x) ≤ x for all x > −1,

log det Σ = − log det Σ^{−1} = − Σ_i log(1 + λ_i(∆)) ≥ − Σ_i λ_i(∆) = 0,

and the lower bound follows. Furthermore, observe that log(1 + x) ≥ x − 2x² when |x| ≤ 1/2. If c1 is sufficiently small, then max_i |λ_i(∆)| ≤ ‖∆‖_F ≤ 1/2. Then, since tr(∆) = 0 and tr(∆²) = ‖∆‖_F² by symmetry of ∆,

log det Σ = − log det Σ^{−1} ≤ − tr(∆) + 2 tr(∆²) = 2 ‖Σ^{−1} − I‖_F²,

and again for sufficiently small c1, det Σ ≤ exp(2 ‖Σ^{−1} − I‖_F²) ≤ 1 + c2 ‖Σ^{−1} − I‖_F².

Lemma 3.8. There are c1, c2, c3 > 0 such that for any Σ, Σ̃ ∈ P_d such that Σ^{−1} − I and Σ̃^{−1} − I are zero-diagonal and max{‖Σ^{−1} − I‖_F, ‖Σ̃^{−1} − I‖_F} ≤ c1,

‖f_Σ − f_Σ̃‖_1 ≥ c2 ‖Σ^{−1} − Σ̃^{−1}‖_F − c3 (‖Σ^{−1} − I‖_F² + ‖Σ̃^{−1} − I‖_F²).

Proof. Write ∆ = Σ^{−1} − I and ∆̃ = Σ̃^{−1} − I, and observe that f_Σ(x)/f_I(x) = det(Σ)^{−1/2} exp(−x^T ∆ x/2), where f_I is the standard normal density. Hence

‖f_Σ − f_Σ̃‖_1 = E{|det(Σ)^{−1/2} e^{−X^T ∆ X/2} − det(Σ̃)^{−1/2} e^{−X^T ∆̃ X/2}|},

where the expectation is with respect to X = (X_1, . . . , X_d) ∼ N(0, I), a d-dimensional standard normal vector. Since ∆ and ∆̃ are zero-diagonal, tr(∆) = tr(∆̃) = 0, so by Lemma 3.7 and the triangle inequality,

‖f_Σ − f_Σ̃‖_1 ≥ E{|e^{−X^T ∆ X/2} − e^{−X^T ∆̃ X/2}|} − c4 (‖∆‖_F² + ‖∆̃‖_F²).

Observe now the following chain of elementary inequalities, which holds for all t ∈ R:

|e^t − 1 − t| ≤ t² Σ_{k≥0} |t|^k/(k + 2)! ≤ t² e^{|t|}.   (1)

By the triangle inequality again,

E{|e^{−X^T ∆ X/2} − e^{−X^T ∆̃ X/2}|} ≥ (1/2) E{|X^T (∆ − ∆̃) X|}   (2)
 − E{(X^T ∆ X)² e^{|X^T ∆ X|}}   (3)
 − E{(X^T ∆̃ X)² e^{|X^T ∆̃ X|}}.   (4)

We start with the term (3) in this expression. By Cauchy–Schwarz and Lemma 3.6 (iii), (iv),

(3) ≤ (E{(X^T ∆ X)⁴})^{1/2} (E{e^{2|X^T ∆ X|}})^{1/2} ≤ c5 ‖Σ^{−1} − I‖_F²

for some c5 > 0. A similar computation gives that (4) is also at most c5 ‖Σ̃^{−1} − I‖_F², so to complete the proof we need to bound (2) from below. By Hölder's inequality and Lemma 3.6 (ii), (iii), there exists some c6 > 0 for which

(2) ≥ (E{(X^T (∆ − ∆̃) X)²})^{3/2} / (2 (E{(X^T (∆ − ∆̃) X)⁴})^{1/2}) ≥ c6 ‖Σ^{−1} − Σ̃^{−1}‖_F.

Proof of Proposition 3.1. In the notation of Lemma 2.2, we have by Theorem 3.2, Lemma 3.4, and Lemma 3.8, for some c1, c2 > 0,

α ≥ c1 δ √m and β ≤ c2 δ² m,

as long as δ² m is smaller than some absolute constant.
So, we may pick δ = c3/√n for sufficiently small c3 > 0 for which nβ + log 2 ≤ (log |S|)/2. Then, by Lemma 2.2, R_n(F_G) ≥ α/8 ≥ (c1 c3/8) √(m/n), as long as n ≥ c4 m for some c4 > 0.
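The Jeffreys-divergence identity used in the proof of Lemma 3.4 can be sanity-checked numerically against the closed-form KL-divergence between mean-zero normals (a check, not part of the proof):

```python
import numpy as np

# Verify J(f_Sigma, f_SigmaT) = tr((Sigma - SigmaT)(SigmaT^{-1} - Sigma^{-1}))/2
# for two generic positive definite matrices.
rng = np.random.default_rng(0)
d = 3
B, C = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)
SigmaT = C @ C.T + d * np.eye(d)

def kl_gaussian(S1, S2):
    # KL(N(0,S1) || N(0,S2)) = (tr(S2^{-1} S1) - d + log(det S2 / det S1)) / 2
    return 0.5 * (np.trace(np.linalg.solve(S2, S1)) - S1.shape[0]
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

J_direct = kl_gaussian(Sigma, SigmaT) + kl_gaussian(SigmaT, Sigma)
J_trace = 0.5 * np.trace(
    (Sigma - SigmaT) @ (np.linalg.inv(SigmaT) - np.linalg.inv(Sigma)))
```

The two computations of the Jeffreys divergence agree to floating-point accuracy.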

Learning Ising Graphical Models
The Ising model describes a probability distribution on the binary hypercube {−1, 1}^d for some d ≥ 1, where a particular vector x ∈ {−1, 1}^d is called a configuration. One such distribution is parametrized by a graph G = ([d], E) with a set of edge weights w_ij ∈ R for every edge {i, j} ∈ E called interactions, and some weights h_i ∈ R for 1 ≤ i ≤ d called the external magnetic field. These parameters define the Hamiltonian H : {−1, 1}^d → R,

H(x) = Σ_{{i,j}∈E} w_ij x_i x_j + Σ_{i=1}^d h_i x_i.

Any configuration x ∈ {−1, 1}^d then appears with probability proportional to exp{H(x)}. In fact, we can write H(x) = H_{h,W}(x) = x^T W x + h^T x for a vector h ∈ R^d and a matrix W ∈ M_G, where

M_G = {W ∈ R^{d×d} : W symmetric and zero-diagonal, with W_ij = 0 for all {i, j} ∉ E},

and in particular, W_ij = w_ij/2 for every edge {i, j} ∈ E. The probability mass function of the Ising model with interactions W and external magnetic field h is denoted by f_{h,W}, where

f_{h,W}(x) = exp{H_{h,W}(x)}/Z(h, W), x ∈ {−1, 1}^d,   (5)

where the normalizing factor Z(h, W) is called the partition function, which is defined by

Z(h, W) = Σ_{x∈{−1,1}^d} exp{H_{h,W}(x)}.

Probability distributions whose densities have the form (5) for general Hamiltonians are known as Gibbs distributions or Boltzmann distributions. Given a graph G, let I_G be the class of all Ising models with interactions in M_G, and let I′_G be the subclass with no external magnetic field, I′_G = {f_{0,W} : W ∈ M_G}. As in Section 3, I_G is the class of all d-dimensional Ising models whose components form a graphical model with respect to G, and similarly for I′_G.

4.1. Proof of the lower bound in Theorem 1.3 (ii). We omit detailed proofs of the upper bounds in Theorem 1.3, since they are virtually identical to that of Theorem 1.1 as given in Section 3.1. For I_G, the corresponding vector space of functions has the basis {x_i x_j : {i, j} ∈ E} ∪ {x_i : i ∈ [d]} ∪ {1}, of dimension m + d + 1; for I′_G, the linear terms are dropped. The lower bound in Theorem 1.3 (ii) follows from the next proposition, just as in Section 3.2.

Proposition 4.1. There exist c1, c2 > 0 such that if n ≥ c1 m, then R_n(I′_G) ≥ c2 √(m/n).

As in Section 3.2, the proof goes through Lemma 2.2, via bounds on the KL-divergences and L_1-distances within a finite subclass of I′_G.

Lemma 4.2. There exist c1, c2 > 0 such that for any zero-diagonal symmetric W ∈ R^{d×d} with ‖W‖_F ≤ c1,

2^d ≤ Z(0, W) ≤ 2^d (1 + c2 ‖W‖_F²).

Proof. Write Z(0, W) = 2^d E{exp(X^T W X)}, where X is uniform on {−1, 1}^d, so that its components are i.i.d., mean-zero, of unit variance, and bounded, hence sub-gaussian. On the one hand, by Lemma 3.6 (i) and Jensen's inequality, E{exp(X^T W X)} ≥ exp(E{X^T W X}) = 1, and on the other hand, by Lemma 3.6 (iv), E{exp(X^T W X)} ≤ 1 + c2 ‖W‖_F², as long as ‖W‖_F ≤ c1 for some sufficiently small constant c1 > 0.

Lemma 4.3. There exist c1, c2 > 0 such that for any zero-diagonal symmetric W, W̃ ∈ R^{d×d} satisfying max{‖W‖_F, ‖W̃‖_F} ≤ c1,

KL(f_{0,W} ‖ f_{0,W̃}) ≤ c2 ‖W − W̃‖_F².

Proof. We again prove the result for the Jeffreys divergence J.
It is not hard to see using (1) that a similar elementary inequality holds for |e^t − e^s| for all t, s ∈ R. Using this, we get an upper bound on J consisting of three terms, (6), (7), and (8). The term (6) is 2 ‖W − W̃‖_F² ≤ 2 (‖W‖_F + ‖W̃‖_F)² by Lemma 3.6 (ii) and the triangle inequality for the Frobenius norm. For (7), by two applications of the Cauchy–Schwarz inequality and Lemma 3.6 (iii), (iv), it is at most a constant factor of ‖W − W̃‖_F². A similar bound holds for (8), after which the result follows.
Lemma 4.4. There exist c1, c2, c3 > 0 such that for any zero-diagonal symmetric W, W̃ ∈ R^{d×d} with max{‖W‖_F, ‖W̃‖_F} < c1,

‖f_{0,W} − f_{0,W̃}‖_1 ≥ c2 ‖W − W̃‖_F − c3 (‖W‖_F² + ‖W̃‖_F²).

Proof. By the triangle inequality (together with Lemma 4.2 to control the partition functions), there is c4 > 0 for which ‖f_{0,W} − f_{0,W̃}‖_1 is bounded from below by 2^{−d} Σ_x |e^{x^T W x} − e^{x^T W̃ x}| minus c4 (‖W‖_F² + ‖W̃‖_F²). By (1) and the triangle inequality again, the remaining sum splits into a main term and two error terms. We bound the second term: by Cauchy–Schwarz and Lemma 3.6 (iii), (iv), it is at most a constant factor of ‖W‖_F², and a similar analysis works for the third term. For the first term, by Hölder's inequality and Lemma 3.6 (ii), (iii), there is a c6 > 0 for which it is at least c6 ‖W − W̃‖_F.

The proof of Proposition 4.1 is now identical to that of Proposition 3.1.
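For small d, the pmf (5) can be evaluated by brute-force enumeration of all 2^d configurations, which is a convenient way to build intuition (a sketch only; the enumeration is exponential in d):

```python
import itertools
import numpy as np

# Brute-force evaluation of the Ising pmf f_{h,W}(x) = exp(x^T W x + h^T x)/Z,
# normalizing by the partition function Z(h, W) over all 2^d configurations.
def ising_pmf(h, W):
    d = len(h)
    configs = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    # One unnormalized weight exp(H_{h,W}(x)) per configuration.
    weights = np.exp(np.einsum('ci,ij,cj->c', configs, W, configs)
                     + configs @ h)
    return configs, weights / weights.sum()

d = 3
W = np.zeros((d, d))
W[0, 1] = W[1, 0] = 0.4          # a single interaction, on the edge {1, 2}
h = np.array([0.1, 0.0, -0.2])   # external magnetic field
configs, probs = ising_pmf(h, W)
```

The returned `probs` is a valid pmf over the 2^d configurations, strictly positive everywhere, as (5) requires.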

4.2.
Proof of the lower bound in Theorem 1.3 (i). Let I′_d be the class of d-dimensional Ising models with no interactions. The lower bound in Theorem 1.3 (i) will follow from the next proposition along with Theorem 1.3 (ii) and subadditivity of the square root, just as in Section 3.2.
Proposition 4.5. There exist c1, c2 > 0 such that if n ≥ c1 d, then R_n(I′_d) ≥ c2 √(d/n).

Proof sketch. As in the above arguments, we pick a subclass of 2^{d/5} densities of I′_d and apply Lemma 2.2. The corresponding magnetic fields will have entries ±δ, with the signs specified by Theorem 3.2, so that any two of them differ in at least d/6 components. One can then show that the KL-divergence between any two of these densities is at most a constant factor of δ² d, while the L_1-distances are at least some constant factor of δ √d. The proofs are simpler than those in the previous section; for example, in this case, the partition functions can be computed exactly, and are equal for every density in the subclass. We omit the details.
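As noted in the sketch, with no interactions the coordinates are independent and the partition function factorizes as Z(h, 0) = Π_i 2 cosh(h_i); in particular it takes the same value for every magnetic field with entries ±δ. A quick numerical check:

```python
import itertools
import math

# With W = 0, Z(h, 0) = prod_i 2*cosh(h_i); in particular it is invariant
# under sign flips of the entries of h, as used in the proof sketch.
def Z_brute(h):
    return sum(math.exp(sum(hi * xi for hi, xi in zip(h, x)))
               for x in itertools.product([-1, 1], repeat=len(h)))

delta = 0.3
h1 = [delta, -delta, delta, delta]    # two fields with entries +/- delta
h2 = [-delta, delta, delta, -delta]
closed = (2 * math.cosh(delta)) ** len(h1)
```

Both brute-force sums agree with the closed form, so all densities in the subclass share the same normalizing factor.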

Discussion
Our work raises several open problems.
1. Higher order forms. We have studied estimating densities that are proportional to the exponential of some quadratic form. One can ask for the minimax rate of the class of densities in which this form has a higher order. Namely, let k, d ≥ 1 be given integers, and suppose that F is a class of densities supported on {−1, 1}^d, where each density f ∈ F is parametrized by weights w_{i1,...,ik} ∈ R for each 1 ≤ i1 < i2 < · · · < ik ≤ d, and

f(x) ∝ exp{ Σ_{1≤i1<···<ik≤d} w_{i1,...,ik} x_{i1} x_{i2} · · · x_{ik} }.

Then, just as in the proof of the upper bound of Theorem 1.3 (ii), we have that there is a universal constant c1 > 0 for which

R_n(F) ≤ c1 √(C(d, k)/n),

where C(d, k) = d!/(k!(d − k)!) is the number of weights. Can this be shown to be tight to within a constant factor? It is straightforward to see that the answer is yes for k = 1, and the results of this paper show that the answer is yes for k = 2. However, for k ≥ 3, our techniques seem to fail. Auffinger and Ben Arous [5] noted that when the weights w_{i1,...,ik} are i.i.d. N(0, 1), the random k-th order form g : S^{d−1} → R defined by

g(x) = Σ_{1≤i1<···<ik≤d} w_{i1,i2,...,ik} x_{i1} x_{i2} · · · x_{ik}

blows up in complexity once k ≥ 3. For example, they show that there is a c3 > 0 for which g has at least e^{c3 d} local minima on S^{d−1} in expectation, as long as k ≥ 3. On the other hand, when k ≤ 2, deterministically g has only a constant number of local minima on S^{d−1}. This gap in complexity may indicate that analyzing the case k ≥ 3 for our purposes will require some more sophisticated techniques.
2. Tightness of the VC-dimension bound. We proved that R_n(F) is bounded from above and below by constant factors of √(VC(A)/n), where A is the Yatracos class of F, for F ∈ {F_G, I_G, I′_G}. The upper bound here holds for any class F by Theorem 2.1, and it can be easily seen that there are classes of densities for which this is not tight. Can we characterize the classes of densities F for which R_n(F) is in fact on the order of √(VC(A)/n)?

3. Learning Ising models efficiently. The learning method of Theorem 2.1 is not algorithmically efficient; its time complexity is polynomial in the size of the Yatracos class A. For Ising models, this size is exponential in d, while the sample complexity is polynomial in d. Is there a polynomial time algorithm for learning Ising models on the complete graph? Note that the answer is yes for multivariate normals, since sample mean and sample covariance estimation will do the job (e.g., [3, Theorem B.1]).

4. The minimax rate of unlabeled graphical models. In our setting, the given graph G is labeled, so we are given the specific pairs of coordinates which interact. What if only the structure of the graph G is known, but its labeling is not? What if we know that G is a tree? If only the number of edges of G is known, Theorem 1.4 provides a bound that is tight up to a factor of √(log d). Can this gap be closed?

5. The minimax rate of Ising blockmodels. For a given S ⊆ [d] with |S| = d/2 and parameters α, β ∈ R, an Ising blockmodel has density

f_{S,α,β}(x) = exp{ α Σ_{i∼j} x_i x_j + β Σ_{i≁j} x_i x_j } / Z(α, β), x ∈ {−1, 1}^d,

where i ∼ j means that either i, j ∈ S or i, j ∉ S, and i ≁ j means that one of i, j is in S and one is not, and Z(α, β) is the normalizing factor. This model, as introduced by Berthet, Rigollet, and Srivastava [6], is motivated by social network analysis and the notion of communities in such networks. In their work [6], Berthet et al.
are mainly concerned with the estimation or recovery of S from n independent samples of f S,α,β , but one can also ask for the minimax learning rate for this class of densities, if some or all of α, β and S are unknown.