Efficient Learning of Discrete Graphical Models

Graphical models are useful tools for describing structured high-dimensional probability distributions. Development of efficient algorithms for learning graphical models from the least amount of data remains an active research topic. Reconstruction of graphical models that describe the statistics of discrete variables is a particularly challenging problem, for which the maximum likelihood approach is intractable. In this work, we provide the first sample-efficient method based on the Interaction Screening framework that allows one to provably learn fully general discrete factor models with node-specific discrete alphabets and multi-body interactions, specified in an arbitrary basis. We identify a single condition related to model parametrization that leads to rigorous guarantees on the recovery of model structure and parameters in any error norm, and is readily verifiable for a large class of models. Importantly, our bounds make an explicit distinction between parameters that are proper to the model and priors used as an input to the algorithm. Finally, we show that the Interaction Screening framework includes all models previously considered in the literature as special cases, for which our analysis shows a systematic improvement in sample complexity.


Introduction
Representing and understanding the structure of direct correlations between distinct random variables with graphical models is a fundamental task that is essential to scientific and engineering endeavors. It is the first step towards an understanding of interactions between interleaved constituents of elaborated systems [10]; it is key for developing causal theories [6]; and it is at the core of automated decision making [8], cybersecurity [5] and artificial intelligence [19].
The problem of reconstructing graphical models from samples traces back to the seminal work of [7] on tree-structured graphical models, and as of today it is still at the center of attention of the learning community. For factor models defined over general hypergraphs, the learning problem is particularly challenging in graphical models over discrete variables, for which the maximum likelihood estimator is in general computationally intractable. One of the earlier tractable algorithms suggested to provably reconstruct the structure of a subset of pairwise binary graphical models is based on inferring the sparsity pattern of the so-called regularized pseudo-likelihood estimator, equivalent to regularized logistic regression in the binary case [16]. However, additional assumptions required for this algorithm to succeed severely limit the set of pairwise binary models that can be learned [15]. After it was proven that reconstruction of any discrete graphical model with bounded degree can be done in polynomial time in the system size [4], Bresler showed that it is possible to bring the computational complexity down to quasi-quadratic in the number of variables for Ising models (pairwise graphical models over binary variables); however, the resulting algorithm has non-optimal sample requirements that are double-exponential in other model parameters [3]. The first computationally efficient reconstruction algorithm for sparse pairwise binary graphical models with a near-optimal sample complexity with respect to the information-theoretic lower bound [17], called RISE, was designed and analyzed in [18]. RISE is based on the minimization of a novel local convex loss function, called the Interaction Screening objective, supplemented with an ℓ1 penalty to promote sparsity.
Even though it has been later shown in [13] that regularized pseudo-likelihood supplemented with a crucial post-processing step also leads to a structure estimator for pairwise binary models, strong numerical and theoretical evidence provided in that work demonstrated that RISE is superior in terms of worst-case sample complexity.
Algorithms for learning discrete graphical models beyond pairwise and binary alphabets have been proposed only recently, in [9] and [12]. The method in [9] works for arbitrary models with bounded degrees, but being a generalization of Bresler's algorithm for Ising models [3], it suffers from similarly prohibitive sample requirements growing double-exponentially in the strength of the model parameters. The so-called SPARSITRON algorithm in [12] has the flavor of a stochastic first-order method with multiplicative updates. It has a low computational complexity and is sample-efficient for structure recovery of two subclasses of discrete graphical models: multiwise graphical models over binary variables, or pairwise models with general alphabets. A recent follow-up work [20] considered an ℓ2,1-constrained logistic regression, and showed that it provides a slight improvement of the sample complexity compared to [12] in the case of pairwise models over non-binary variables.
In this work, we propose a general framework for learning general discrete factor models expressed in an arbitrary parametric form. Our estimator, termed GRISE, is based on a significant generalization of the Interaction Screening method of [18,13], previously introduced for pairwise binary models. Our primary insight lies in the identification of a single general condition related to model parameterization that is sufficient to obtain bounds on sample complexity. We show that this condition can be reduced to a set of local identifiability conditions that only depend on the size of the maximal clique of the factor graph and can be explicitly verified in an efficient way. We propose an iterative algorithm called SUPRISE which is based on GRISE, and show that it can efficiently perform structure and parameter estimation for arbitrary graphical models. Existing results in the literature on this topic [18,9,12,20] can be obtained as special cases of our general reconstruction results, which notably include the challenging case of multi-body interactions defined over general discrete alphabets. Our theoretical guarantees can be expressed in any error norm, and explicitly distinguish between bounds on the parameters of the underlying model and the prior parameters used in the optimization; as a result, prior information that is not tight only has a moderate effect on the sample complexity bounds. Finally, we also provide a fully parallelizable algorithmic formulation for the GRISE estimator and the SUPRISE algorithm, and show that they have efficient run times of O(p^L) for a model of size p with L-order interactions, which includes the best-known O(p^2) scaling for pairwise models.

Problem formulation
In this Section, we formulate the general discrete graphical model selection problem that we consider and describe conditions that make this problem well-posed.

Parameterized family of models
We consider positive joint probability distributions over p variables σ_i ∈ A_i for i = 1, . . . , p. The set of variable indices is referred to as the vertex set V = {1, . . . , p}. Node-dependent alphabets A_i are assumed to be discrete and of size bounded by q > 0. Without loss of generality, the positive probability distribution over the p-dimensional vector σ can be expressed as

    µ(σ) = (1/Z) exp( ∑_{k∈K} θ*_k f_k(σ_k) ),   (1)

where {f_k, k ∈ K} is a set of basis functions acting upon subsets of variables σ_k ⊆ σ that specify a family of distributions, and θ*_k are parameters that specify a model within this family. The quantity Z denotes the partition function and serves as a normalization constant enforcing that µ in (1) is a probability distribution. For i ∈ {1, . . . , p}, let K_i ⊆ K denote the set of factors corresponding to basis functions acting upon subsets σ_k that contain the variable σ_i, with |K_i| = K_i. Given any set of basis functions, we can locally center them by first defining, for a given i ∈ [p], the local centering functions

    φ_{ik}(σ_{k\i}) = (1/|A_i|) ∑_{σ_i∈A_i} f_k(σ_k),   (2)

where σ_{k\i} denotes the vector σ_k without σ_i, and then defining the locally centered basis functions

    g_{ik}(σ_k) = f_k(σ_k) − φ_{ik}(σ_{k\i}).   (3)

As their name suggests, the locally centered basis functions sum to zero: ∑_{σ_i∈A_i} g_{ik}(σ_k) = 0. To ensure that the scales of the parameters are well defined, we assume that the θ*_k are chosen or rescaled such that all locally centered basis functions are normalized in the following sense:

    max_{σ_k} |g_{ik}(σ_k)| ≤ 1,   (4)

for all vertices i ∈ V and factors k ∈ K_i. This normalization can always be achieved by choosing bounded basis functions |f_k(σ_k)| ≤ 1/2. An important special case is when the basis functions are already centered, i.e. g_{ik}(σ_k) = f_k(σ_k). In this case the basis functions are directly normalized, max_{σ_k} |f_k(σ_k)| ≤ 1. Note that one of the reasons to define the normalization in (4) in terms of the centered functions g_{ik} instead of f_k is to avoid spurious cases where the functions f_k have inflated magnitudes due to the addition of constants f_k ← f_k + C.
In Appendix A, we show that the other important reason to employ centered functions is that degeneracy of the local parameterization with these functions translates to degeneracy of the global distribution in Eq. (1).
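To make the centering construction concrete, the following Python sketch computes the locally centered functions by brute-force enumeration over small discrete alphabets. The function names and the toy pairwise example are illustrative and not part of the paper's formalism.

```python
import itertools

def locally_center(f, alphabets, i):
    """Return the locally centered version g_i of a basis function f acting on
    variables 0..m-1 with the given discrete alphabets: subtract from f its
    average over the alphabet of variable i (the centering function phi_i)."""
    g = {}
    for sigma in itertools.product(*alphabets):
        phi = sum(f(sigma[:i] + (a,) + sigma[i + 1:]) for a in alphabets[i]) / len(alphabets[i])
        g[sigma] = f(sigma) - phi
    return g

# Pairwise example over binary spins: f(s) = s_0 * s_1 / 2, already centered.
alphabets = [(-1, 1), (-1, 1)]
g = locally_center(lambda s: 0.5 * s[0] * s[1], alphabets, i=0)
```

For this already-centered example, g coincides with f; in general, summing g over the alphabet of the centered variable gives zero for every fixed assignment of the remaining variables, which is the defining property used throughout the paper.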

Model selection problem
For each i ∈ [p], let T_i ⊆ K_i denote the set of target factors that we aim to reconstruct accurately, and let R_i = K_i \ T_i be the set of residual factors for which we do not need learning guarantees. The target and residual parameters are defined accordingly as θ*_{T_i} = {θ*_k | k ∈ T_i} and θ*_{R_i} = {θ*_k | k ∈ R_i}, respectively. Given independent samples from a model in the family in Section 2.1, the goal of the model selection problem is to reconstruct the target parameters of the model.

Definition 1 (Model Selection Problem). Given n i.i.d. samples σ^(1), . . . , σ^(n) drawn from some distribution µ(σ) in Eq. (1) defined by θ*, and prior information on θ* given in the form of an upper bound γ̂ on the ℓ1-norm of the local sub-components,

    ||θ*_i||_1 ≤ γ̂  for all i ∈ [p],   (5)

and a local constraint set Y_i ⊆ ℝ^{K_i} for each i ∈ [p] such that

    θ*_i ∈ Y_i,   (6)

compute estimates θ̂ of θ* such that the estimates of the target parameters satisfy

    ||θ̂_{T_i} − θ*_{T_i}|| ≤ α,   (7)

where ||·|| denotes some norm of interest with respect to which the error is measured.
The bound on the ℓ1-norm in (5) is a natural generalization of the sparse case where θ* has only a small number of non-zero components; in the context of parameter estimation in graphical models, the setting of parameters bounded in the ℓ1-norm has previously been considered in [12]. The constraint sets Y_i are used to encode any other side information that may be known about the model.

Sufficient conditions for well-posedness
We describe conditions on the model in (1) that make the model selection problem in Definition 1 well-posed. We first state the conditions formally.

(C1) Local Learnability Condition: For each i ∈ [p], consider the set of perturbation vectors compatible with the prior information,

    X_i = { x ∈ ℝ^{K_i} | ||x||_1 ≤ 2γ̂, θ*_i + x ∈ Y_i }.   (8)

There exists ρ_i > 0 such that for all x ∈ X_i the following holds:

    x^⊤ I(θ*) x ≥ ρ_i ||x_{T_i}||²,   (9)

where x_{T_i} denotes the components k ∈ T_i of x, and ||·|| is the norm used in Definition 1.
(C2) Finite Maximum Interaction Strength: The following quantity γ is finite,

    γ = max_{i∈V} max_σ | ∑_{k∈K_i} θ*_k g_{ik}(σ_k) |.   (10)

Condition (C1) consists in satisfying the inequality in Eq. (9), involving a quadratic form x^⊤ I x where the matrix I has indices k, k′ ∈ K_i and is explicitly defined as

    I(θ*)_{kk′} = E_µ[ g_{ik}(σ_k) g_{ik′}(σ_{k′}) ].   (11)

This matrix I is in fact related to the conditional Fisher information matrix.
The conditional Fisher information matrix with indices k, k′ ∈ K_i is derived from the conditional distribution of σ_i given the remaining variables and reads

    I^F(θ*)_{kk′} = E_µ[ Cov( g_{ik}(σ_k), g_{ik′}(σ_{k′}) | σ_{\i} ) ].

We immediately see that the matrix I dominates the conditional Fisher information matrix in the positive semi-definite sense, that is, I(θ*) ⪰ I^F(θ*). As a consequence, Condition (C1) is satisfied whenever the conditional Fisher information matrix is non-singular in the parameter subspace x_{T_i} that we care to reconstruct and which is compatible with our priors, i.e. for x ∈ X_i. We would like to add that the conditional Fisher information matrix is a natural quantity to consider in this problem, as we deliberately focus on conditional rather than global statistics in order to bypass the intractability of the global log-likelihood approach. We are convinced that it should appear in any analysis based on conditional statistics.
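Numerically, a condition of this type can be probed by forming an empirical Gram matrix of centered basis evaluations and inspecting its smallest eigenvalue, which lower-bounds the quadratic form over unit vectors in the ℓ2 case. The sketch below is purely illustrative: random numbers stand in for the basis evaluations, and the paper's matrix I additionally involves expectations under the model.

```python
import numpy as np

# G[t, k] stands in for the centered basis evaluation g_{ik}(sigma^{(t)}_k);
# in practice these would be computed from samples of the model.
rng = np.random.default_rng(0)
n, K_i = 500, 4
G = rng.standard_normal((n, K_i))

# Empirical analogue of the matrix I: I_hat[k, k'] ~ E[g_ik * g_ik'].
I_hat = G.T @ G / n

# Smallest eigenvalue lower-bounds x^T I_hat x / ||x||_2^2 over all x != 0,
# an l2-version of the constant rho_i in the Local Learnability Condition.
rho = float(np.linalg.eigvalsh(I_hat).min())
```

A strictly positive rho indicates that the local parameterization is non-degenerate in the ℓ2 sense for this empirical surrogate.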
Condition (C2) is required to ensure that the model can be recovered with finitely many samples. For many special cases, such as the Ising model, the minimum number of samples required to estimate the parameters must grow exponentially with the maximum interaction strength [17]. A more detailed discussion about well-posedness and Conditions (C1) and (C2) can be found in Appendix A.
Conditions (C1) and (C2) differ from the concepts in [18] called the restricted strong convexity property and the bound on the interaction strength, respectively, in a subtle but critical manner. Condition (C1) can be identified with restricted strong convexity only when the ℓ2-norm is used in Eq. (9). We will see later that the notion of restricted strong convexity is not required for the ℓ∞-norm, which appears to be a natural metric for which the local learnability condition can be verified for general models. Moreover, for general models it remains unclear whether restricted strong convexity holds for values of ρ_i that are independent of the problem dimension p. Condition (C2) is a weaker assumption than the bound on the interaction strength from [17], for it does not require an extra assumption on the maximum degree of the graphical model.

Generalized interaction screening
In this Section, we introduce the algorithm that efficiently solves the model selection problem in Definition 1, and we provide rigorous guarantees on its reconstruction error and computational complexity.

Generalized regularized interaction screening estimator
We propose a generalization of the estimator RISE, first introduced in [18] for pairwise binary graphical models, in order to reconstruct general discrete graphical models defined in (1). The generalized interaction screening objective (GISO) is defined for each vertex u separately and is given by

    S_n(θ_u) = (1/n) ∑_{t=1}^{n} exp( −∑_{k∈K_u} θ_k g_{uk}(σ^{(t)}_k) ),   (12)

where σ^(1), . . . , σ^(n) are n i.i.d. samples drawn from µ(σ) in Eq. (1), θ_u := (θ_k)_{k∈K_u} is the vector of parameters associated with the factors in K_u, and the locally centered basis functions g_{uk} are defined as in Eq. (3). The GISO retains the main feature of the interaction screening objective (ISO) in [18]: it is proportional to the inverse of the factor in µ(σ) associated with vertex u, up to the additional centering terms φ_{uk}. The GISO is a convex function of θ_u and retains the "screening" property of the original ISO. The GISO is used to define the generalized regularized interaction screening estimator (GRISE) for the parameters, given by

    θ̂_u ∈ argmin_{θ_u ∈ Y_u : ||θ_u||_1 ≤ γ̂} S_n(θ_u),   (13)

where γ̂ and Y_u are the prior information available on θ*_u as defined in (5) and (6).
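As a minimal numerical sketch (with illustrative names, assuming the centered basis evaluations have been precomputed into a matrix), the empirical GISO is a single exponential average and its convexity in θ_u is immediate:

```python
import numpy as np

def giso(theta_u, G):
    """Empirical interaction screening objective for one vertex u:
    (1/n) * sum_t exp(-sum_k theta_u[k] * G[t, k]), where G[t, k] holds the
    locally centered basis evaluation g_{uk} on the t-th sample."""
    return float(np.mean(np.exp(-G @ theta_u)))

# Stand-in data: 200 samples, 3 basis functions with values in [-1, 1].
rng = np.random.default_rng(1)
G = rng.uniform(-1.0, 1.0, size=(200, 3))
```

GRISE then minimizes this objective over the ℓ1-ball of radius given by the prior, intersected with the constraint set Y_u; note that at θ_u = 0 the objective always equals 1, regardless of the data.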

Error bound on parameter estimation with GRISE
We now state our first main result regarding the theoretical guarantees on the parameters reconstructed by GRISE. We call θ̂_u an ǫ-optimal solution of (13) if

    S_n(θ̂_u) ≤ min_{θ_u ∈ Y_u : ||θ_u||_1 ≤ γ̂} S_n(θ_u) + ǫ.   (14)

Theorem 1 (Error Bound on GRISE Estimates). Let σ^(1), . . . , σ^(n) be i.i.d. samples drawn according to µ(σ) in (1). For some node u ∈ V, assume that the model satisfies Condition (C1) for some norm ||·|| and some constraint set Y_u, and let α > 0 be the prescribed accuracy level. If the number of samples satisfies

    n ≥ C γ̂²(1+γ̂)² e^{2γ} ρ_u^{−2} α^{−4} ln(K_u/δ),   (15)

for a universal constant C, then with probability at least 1 − δ, any estimate that is an ǫ-minimizer of GRISE with ǫ ≤ (ρ_u α² e^{−γ})/(20(1+γ̂)) satisfies ||θ̂_{T_u} − θ*_{T_u}|| ≤ α/2. The proof of Theorem 1 can be found in Appendix B.
The computational complexity of finding an ǫ-optimal solution of GRISE for a trivial constraint set Y_u = ℝ^{K_u} is

    C c_g n K_u ǫ^{−2} ln(1+K_u),

where c_g is an upper bound on the computational complexity of evaluating any g_{ik}(σ_k) for k ∈ K_i, and C is a universal constant independent of all the parameters of the problem; see Proposition 5 in Appendix C. For a certain class of constraint sets Y_u, which we term parametrically complete, the problem can be solved in two steps: first, finding a solution θ̂_u^unc to the unconstrained problem, and then projecting it onto this set. Note, however, that in general the problem of finding ǫ-optimal solutions to constrained GRISE can still be difficult, since the constraint set Y_u can be arbitrarily complicated.
Any θ ∈ Y_u satisfying (16) is called an equi-cost projection of θ̂_u^unc onto Y_u and is denoted by P_{Y_u}(θ̂_u^unc). The computational complexity of finding an ǫ-optimal solution of GRISE with a parametrically complete set is C c_g n K_u ǫ^{−2} ln(1+K_u) + C(P_{Y_u}(θ̂_u^unc)), where C(P_{Y_u}(θ̂_u^unc)) denotes the computational complexity of the projection step; see Theorem 3 in Appendix C.
As we will see, for many graphical models it is often possible to explicitly construct parametrically complete sets for which the computational complexity of the projection step C(P_{Y_u}(θ̂_u^unc)) is insignificant compared to the computational complexity of unconstrained GRISE.
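For intuition on the projection step, the sketch below implements the standard sort-based Euclidean projection onto an ℓ1-ball, a common building block for ℓ1-constrained first-order methods. This is an illustrative stand-in, not the paper's equi-cost projection onto a parametrically complete set Y_u.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} via the
    classical sort-and-threshold algorithm (O(d log d))."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # magnitudes, descending
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    tau = (css[idx] - radius) / (idx + 1.0)   # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

For instance, projecting the vector [3, 0] onto the unit ℓ1-ball gives [1, 0]: the soft-threshold shrinks the dominant coordinate until the constraint is met.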

Structure identification and parameter estimation
In this Section we show that the structure of graphical models, i.e. the collection of maximal subsets of variables that are associated through basis functions, as well as the associated parameters, can be efficiently recovered. The key elements are twofold. First, we prove that for maximal cliques, the Local Learnability Condition (LLC) in (C1) can be easily verified and yields an LLC constant independent of the system size. Second, we leverage this property to design an efficient structure and parameter learning algorithm, coined SUPRISE, that requires iterative calls of GRISE.

The structure of graphical models
The structure plays a central role in graphical model learning, for it contains all the information about the conditional independence, or Markov, properties of the distribution µ(σ) from Eq. (1). In order to reach the definition of the structure presented in Eq. (21), we first introduce graph-theoretic concepts specific to graphical models.
The factor graph associated with the model family is a bipartite graph G = (V, K, E) with vertex set V, factor set K, and edges connecting factors and vertices,

    E = { (i, k) | i ∈ V, k ∈ K, σ_i ∈ σ_k }.   (18)

We see from Eq. (18) that the edge (i, k) exists when the variable σ_i associated with the vertex i is an argument of the basis function f_k(σ_k) associated with the factor k. Note that this definition only depends on the set of basis functions K and does not refer to a particular choice of model within the family. The factor graph G* = (V, K*, E*) associated with a model, as defined in Eq. (1), is the induced subgraph of G obtained from the vertex set V and the factor subset K* = {k ∈ K | θ*_k ≠ 0}. We also use the shorthand notation G* = G[(V, K*)] to denote an induced subgraph of G.
We define the neighborhood of a factor k as the set of vertices linked by an edge to k and denote it by

    ∂k = { i ∈ V | (i, k) ∈ E }.   (19)

The largest factor neighborhood size L = max_{k∈K} |∂k| is called the interaction order. Families of graphical models with L ≤ 2 are referred to as pairwise models, as opposed to the generic denomination of L-wise models when L is arbitrary.
The set of maximal factors of a graph is the set of factors whose neighborhood is not strictly contained in the neighborhood of another factor,

    K_max(G) = { k ∈ K | ∄ k′ ∈ K s.t. ∂k ⊊ ∂k′ }.

Notice that multiple maximal factors may have the same neighborhood. This motivates the definition of the set of maximal cliques, which is contained in the powerset P(V) and consists of all neighborhoods of maximal factors,

    M_cli(G) = { ∂k | k ∈ K_max(G) }.   (20)

The set of factors whose neighborhoods equal a given maximal clique c ∈ M_cli(G) is called the span of the clique, defined as

    [c]_sp = { k ∈ K_max(G) | ∂k = c }.

Finally, the structure S of a graphical model is the set of maximal cliques associated with the factor graph of the model,

    S(G*) = M_cli(G*).   (21)

We stress that the structure of a model is different from the set of maximal cliques of the family of graphical models, M_cli(G), as the former is constructed from the set of factors associated with non-zero parameters while the latter consists of all potential maximal factors.
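The graph-theoretic notions above are straightforward to compute from the factor neighborhoods; the following self-contained Python sketch (with illustrative names) extracts maximal factors, maximal cliques and their spans from a toy factor graph:

```python
def structure(neighborhoods):
    """Given factor neighborhoods {factor: frozenset of vertices}, return the
    maximal factors (neighborhood not strictly contained in another), the set
    of maximal cliques (their neighborhoods), and each clique's span."""
    maximal = {k for k, nb in neighborhoods.items()
               if not any(nb < nb2 for k2, nb2 in neighborhoods.items() if k2 != k)}
    cliques = {neighborhoods[k] for k in maximal}
    span = {c: {k for k in maximal if neighborhoods[k] == c} for c in cliques}
    return maximal, cliques, span

# Toy factor graph: k1 and k3 share the maximal clique {1,2,3}; k2 is dominated.
nbs = {"k1": frozenset({1, 2, 3}), "k2": frozenset({1, 2}),
       "k3": frozenset({1, 2, 3}), "k4": frozenset({4})}
maximal, cliques, span = structure(nbs)
```

Here k2 is not maximal because its neighborhood is strictly contained in that of k1, while k1 and k3 illustrate that several maximal factors can span the same clique.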

From local learnability condition to nonsingular parametrization of cliques
We show that the learning problem of reconstructing maximal cliques is well-posed in general, and especially so for non-degenerate globally centered basis functions. To this end, we demonstrate that the LLC in (C1) is automatically satisfied whenever the target sets T_i consist of factors corresponding to maximal cliques of the graphical model family. Importantly, we prove that the LLC constant ρ_i does not depend on the dimension of the model for the ℓ∞,2-norm, but rather relies on the Nonsingular Parametrization of Cliques (NPC) by the basis functions. Similarly, we also guarantee that the LLC holds for the ℓ2-norm in the case of pairwise colorable models.
We introduce globally centered basis functions defined for any factor k ∈ K through the inclusion-exclusion formula,

    h_k(σ_k) = ∑_{r⊆∂k} (−1)^{|r|} (1/|A_r|) ∑_{σ_r∈A_r} f_k(σ_k),

where A_r = ∏_{j∈r} A_j. It is straightforward to see that globally centered functions sum partially to zero over any of their variables, i.e. ∑_{σ_i∈A_i} h_k(σ_k) = 0 for all i ∈ ∂k. It is worth noticing that when the functions f_k are already globally centered, we have f = g = h. We would also like to point out that, unlike locally centered functions g_{ik}, globally centered functions cannot in general be interchanged with the functions f_k without modifying conditional distributions. However, they play an important role in determining the independence of basis functions around cliques through the Nonsingular Parametrization of Cliques (NPC) constant introduced below. Given a perturbation set X_i, as defined in Eq. (8), the NPC constant is defined through the following minimization,

    ρ_i^NPC = min_{x_c ∈ X_i^c, x_c ≠ 0}  E[ ( ∑_{k∈[c]_sp} x_k h_k(σ_k) )² ] / ||x_c||²,

where the vector x_c ∈ ℝ^{|[c]_sp|} belongs to X_i^c, the projection of the constraint set X_i ⊆ ℝ^{K_i} onto the components k ∈ [c]_sp, and the expectation is with respect to the marginal distribution of σ_i. Note that the NPC constant only depends on L and not on the size of the system, and can be explicitly computed in time O(K). A detailed discussion can be found in Appendix D. The importance of the NPC constant is highlighted by the following proposition, which guarantees that the LLC is satisfied for maximal factors in the ℓ∞,2-norm as long as ρ_i^NPC > 0.

Proposition 1 (LLC in ℓ∞,2-norm for maximal factors). For discrete graphical models with maximum alphabet size q, interaction order L and finite maximum interaction strength γ as defined in Eq. (10), the Local Learnability Condition (C1) is satisfied whenever the Nonsingular Parametrization of Cliques constant ρ_i^NPC is nonzero; an explicit lower bound on ρ_i in terms of ρ_i^NPC, q, L and γ is given in Appendix D.

Proposition 1 thus guarantees that the LLC can be satisfied uniformly in the size p of the model whenever ρ_i^NPC > 0. The proof of Proposition 1 can be found in Appendix D.
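Global centering can be computed without expanding the inclusion-exclusion sum explicitly: applying the mean-removal operator for each variable in turn yields the same result, since the product of these operators expands exactly into the inclusion-exclusion formula. The sketch below (illustrative names, brute-force enumeration over small alphabets) demonstrates this:

```python
import itertools

def globally_center(f, alphabets):
    """Globally center f over all of its variables: for each variable i,
    subtract the average of the current table over the alphabet of i.
    The composition of these mean-removal steps expands into the
    inclusion-exclusion formula for the globally centered functions."""
    table = {s: f(s) for s in itertools.product(*alphabets)}
    for i, A in enumerate(alphabets):
        new = {}
        for s in table:
            avg = sum(table[s[:i] + (a,) + s[i + 1:]] for a in A) / len(A)
            new[s] = table[s] - avg
        table = new
    return table

# Toy factor on a binary and a ternary variable.
alphabets = [(0, 1), (0, 1, 2)]
h = globally_center(lambda s: s[0] + 2 * s[1] + s[0] * s[1], alphabets)
```

By construction, summing h over the alphabet of any one of its variables, with the others fixed, gives zero; this is exactly the partial-sum property stated above for the globally centered functions.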
For families of models whose factors involve at most L = 2 variables, the so-called pairwise models, we can show that the LLC condition for maximal factors also holds for the ℓ2-norm. This LLC condition depends on the vertex chromatic number χ of the model factor graph. We recall that a vertex coloring of a graph G* = (V, K*, E*) is a partition {S_r}_{r∈ℕ} ⊆ P(V) of the vertex set such that no two vertices with the same color are connected to the same factor node, i.e. i, j ∈ S_r ⇒ ∄ k ∈ K* s.t. i, j ∈ ∂k. The chromatic number is the cardinality of the smallest such coloring.
Proposition 2 (LLC in ℓ2-norm for pairwise models). For a specific vertex i ∈ V, let the target set be the maximal factors with i as a neighbor, i.e. T_i = {k ∈ K_max(G) | i ∈ ∂k}, and let the norm in (C1) be the ℓ2-norm over components that are maximal factors, ||x_{T_i}||_2² = ∑_{k∈T_i} x_k². Then for discrete pairwise graphical models with maximum alphabet size q, chromatic number χ and finite maximum interaction strength γ as defined in Eq. (10), the Local Learnability Condition (C1) is satisfied whenever the NPC constant ρ_i^NPC is nonzero; an explicit lower bound on ρ_i, depending on q, χ and γ, is given in Appendix D. The reader will find the proof of Proposition 2 in Appendix D.

[Algorithm 1 (SUPRISE): the pseudocode iterates over rounds t, constructing the induced subgraph G^t and estimating θ̂^t_u with GRISE to a prescribed accuracy (Steps 1-2); identifying the maximal cliques of M_cli(G^t) whose average reconstructed parameters are associated with zero, collecting the corresponding removable factors N^t (initialized as N^t ← ∅) and updating the considered factors K^{t+1} ← K^t \ N^t (Step 3); and finally outputting the structure and the non-zero parameters of maximal factors (Step 4).]

Suppose that we know α > 0, a lower bound on the minimal intensity of the parameters associated with the structure, in the sense that α ≤ min_{c∈S(G*)} ||θ*_{[c]_sp}||. Then we can recover the structure and the parameters associated with maximal factors of any graphical model using Algorithm 1, coined SUPRISE for Structure Unveiling and Parameter Reconstruction with Interaction Screening Estimation. SUPRISE, which makes iterative use of GRISE, is shown to have a sample complexity logarithmic in the system size for models with non-zero NPC constants. Our second main result is the following Theorem 2, proved in Appendix D, which provides guarantees on SUPRISE.
Theorem 2 (Reconstruction and Estimation Guarantees for SUPRISE). Let µ(σ) in (1) be the probability distribution of a discrete graphical model with maximum alphabet size q, interaction order L, finite maximum interaction strength γ, and smallest Nonsingular Parametrization of Cliques constant greater than zero, i.e. ρ^NPC = min_{u∈V} ρ_u^NPC > 0. Let σ^(1), . . . , σ^(n) be i.i.d. samples drawn according to µ(σ), and assume that n satisfies the sample requirement stated in Appendix D, which is logarithmic in the system size and in 1/δ, where K = max_{u∈V} K_u is the maximal number of basis functions in which a variable can appear and γ̂ ≥ max_{u∈V} ||θ*_u||_1 is our ℓ1-prior on the components of the parameters. Then the structure of the general graphical model is perfectly recovered using Algorithm 1, i.e. Ŝ = S, with probability 1 − δ. In addition, the parameters associated with maximal factors are reconstructed with precision α/2. The total computational complexity scales as O(pK), for fixed L, α, γ, γ̂ and δ, if the constraint sets Y_u are parametrically complete.
As an application of Theorem 2, we show the sample and computational complexity of recovering parameter values and the structure of some well-known special cases in Table 1. The parameter α appearing in Table 1 is the precision to which parameters are recovered in the considered norm, χ is the chromatic number of the graph, L is the interaction order, q is the maximum alphabet size, γ is the maximum interaction strength and p is the number of variables. At this point, it is instructive to compare our sample complexity requirements to existing results. A direct application of bounds of [12] and [20] to the case of pairwise multi-alphabet models that we consider below yields O(exp(14γ)) dependence, whereas SUPRISE has a complexity that scales as O(exp(12γ)).
In the case of binary L-wise models, while [12] shows an O(exp(O(γL))) scaling, SUPRISE enjoys a sample complexity of O(exp(4γL)). The algorithm of [9] recovers a subclass of general graphical models with bounded degree, but has a sub-optimal double-exponential scaling in γ, while SUPRISE leads to recovery of arbitrary discrete graphical models with a single-exponential dependence on γ and needs no bound on the degree. In terms of computational complexity, SUPRISE achieves the efficient scaling O(p^L) for models with maximum interaction order L, which matches the best-known O(p^2) scaling for pairwise models [12,20]. In summary, SUPRISE generalizes, improves and extends the existing results in the literature. The proofs for the special cases can be found in Section E of the Supplementary Material.

Conclusion and future work
A key result of our paper is the existence of a computationally efficient algorithm that is able to recover arbitrary discrete graphical models with multi-body interactions. This result is a particular case of the general framework that we have introduced, which considers arbitrary model parametrizations and makes a distinction between the bounds on the parameters of the underlying model and the prior parameters. The computational complexity O(p^L) that we achieve is believed to be efficient for this problem [12]. In terms of sample complexity, the information-theoretic bounds for recovery of general discrete graphical models are unknown. In the case of binary pairwise models, the sample complexity bounds resulting from our general analysis are near-optimal with respect to known information-theoretic lower bounds [17]. It would be interesting to see whether the 1/α⁴ factor in our sample complexity bounds can be improved to 1/α² using an ℓ1-norm penalty rather than an ℓ1-norm constraint, as has been shown for the particular case of Ising models [13,18].
Other open questions left for future exploration include the possibility to extend the analysis to the case of graphical models with nonlinear parametrizations like in [11], and to graphical models with continuous variables. It is particularly interesting to see whether the computationally efficient and nearly sample-optimal method introduced in the present work could be useful for designing efficient learning algorithms that can improve the state-of-the-art in the well-studied case of Gaussian graphical models, for which it has been recently shown that the information-theoretic lower bound on sample complexity is tight [14].

Appendices
Appendix A contains a detailed discussion of Conditions (C1) and (C2). In Appendix B, the reader can find the proofs of the error bound on GRISE estimates. Appendix C contains the proofs relative to the computational complexity of GRISE and its efficient implementation. In Appendix D, the reader can find the proofs connected with the NPC constant and with structure and parameter estimation via SUPRISE. Finally, Section E contains the proofs for the applications of SUPRISE to structure and parameter estimation in iconic special cases.
A About well-posedness and local learnability conditions

A.1 About Condition (C1)

To illustrate further why Condition (C1) is required, we look at a case where the local constraint set is trivial, i.e. Y_i = ℝ^{K_i}, and we consider a model that violates Condition (C1). This implies that there exists a sequence x_n ∈ X_i such that x_n^⊤ I(θ*) x_n / ||x_{n,T_i}||² < ρ_n with ρ_n → 0. In the limit, we can find a vector x such that x^⊤ I(θ*) x = 0 and ||x_{T_i}|| = 1. In other words, for this model there exists a vertex i and a perturbation vector x ∈ X_i such that x_{T_i} ≠ 0 and for which E[ ( ∑_{k∈K_i} x_k g_{ik}(σ_k) )² ] = 0. Since the probability distribution in Eq. (1) is positive, this further implies that for all configurations σ we have the functional equality ∑_{k∈K_i} x_k g_{ik}(σ_k) = 0. This enables us to locally reparameterize the distribution:

    ∑_{k∈K_i} θ*_k f_k(σ_k) = ∑_{k∈K_i} (θ*_k − x_k) f_k(σ_k) + c(x, σ_\i),   (27)

where c(x, σ_\i) = ∑_{k∈K_i} x_k φ_{ik}(σ_{k\i}) is a sum of local centering functions that does not involve the variable σ_i. At this point, we should distinguish between the two cases where the basis functions f_k are centered or not. When the basis functions are centered, i.e. f_k = g_k, the residual in Eq. (27) is identically zero, c(x, σ_\i) = 0. Therefore, the probability distribution of the model in (1) can be reparameterized entirely with θ*_k → θ*_k − x_k for k ∈ K_i. This implies, as x_{T_i} ≠ 0, that there exist two parameterizations of the same model with different target parameters, and the model selection problem as stated in Definition 1 is ill-posed. In the case where the basis functions are not centered, i.e. f_k ≠ g_k, it may not be possible to reparameterize the whole distribution of the model. However, the conditional probability distribution P(σ_i | σ_\i) can be reparameterized, as it is proportional to both exp(∑_{k∈K_i} θ*_k f_k(σ_k)) and exp(∑_{k∈K_i} (θ*_k − x_k) f_k(σ_k)) thanks to Eq. (27).
Thereby, even if the model is uniquely parameterized with θ_{T_i}, local methods based on independent neighborhood reconstructions using conditional distributions will fail at selecting a unique model, as shown in the following example. Example 1. Consider two families of models over two binary variables σ, s ∈ {−1, 1}, parameterized by θ and η, with µ_η(σ, s) ∝ exp(η_1 σs + η_2 σ + η_3 s).

leaves the conditional distribution (31) unaffected. Note that there does not exist a change of parameters that leaves both (30) and (31) unchanged. This is in agreement with the fact that the model in Eq. (28) is uniquely parameterized and can be in principle recovered by looking jointly at both conditional distributions.
For the specific models that we considered in Section E, the basis functions are always centered, which implies that failure to satisfy (C1) means that the model selection problem is ill-posed.

A.2 About condition (C2)
The bound on the interaction strength in (C2) translates directly into a uniform lower bound on the conditional probabilities of the model, as shown in the following lemma.
Lemma 1 (Lower-Bounded Conditional Probabilities). Consider a graphical model with maximum interaction strength bounded by $\gamma = \max_{i\in V} \max_\sigma |\sum_{k\in K_i} \theta^*_k g_{ik}(\sigma_k)|$. Then for any two disjoint subsets of vertices $A, B \subseteq V$, the conditional probability of $\sigma_A$ given $\sigma_B$ is bounded from below as $\mathbb{P}(\sigma_A \mid \sigma_B) \geq \prod_{i\in A} e^{-2\gamma}/|A_i|$, where $|A_i|$ is the alphabet size of $\sigma_i$.
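Before turning to the proof, the single-site form of this bound, $\mathbb{P}(\sigma_i \mid \sigma_{\setminus i}) \geq e^{-2\gamma}/|A_i|$ (our reading of the bound used in the proof below), can be sanity-checked by exact enumeration on a toy model. In the following Python sketch, the three-spin pairwise model and all variable names are illustrative assumptions, not objects from the paper:

```python
import itertools
import math

# Toy pairwise binary model (an illustrative assumption, not the paper's
# general factor model): mu(s) ∝ exp(sum_{i<j} J_ij s_i s_j) on {-1,1}^3.
J = {(0, 1): 0.8, (1, 2): -0.5, (0, 2): 0.3}

def energy(s):
    return sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(math.exp(energy(s)) for s in states)
prob = {s: math.exp(energy(s)) / Z for s in states}

# gamma = max over sites of the largest total interaction strength at that site
gamma = max(
    max(abs(sum(Jij * s[i] * s[j] for (i, j), Jij in J.items() if v in (i, j)))
        for s in states)
    for v in range(3)
)

# Check P(sigma_i | sigma_rest) >= e^{-2*gamma} / |A_i| for every site/context.
lower = math.exp(-2 * gamma) / 2
for v in range(3):
    for s in states:
        flipped = tuple(-x if j == v else x for j, x in enumerate(s))
        cond = prob[s] / (prob[s] + prob[flipped])
        assert cond >= lower
print("single-site conditional bound holds; lower bound =", round(lower, 4))
```

For this toy model $\gamma = 1.3$ (attained at the middle spin), and every conditional indeed exceeds $e^{-2\gamma}/2$.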
Proof of Lemma 1: Lower-bound on conditional probabilities. We start by bounding the conditional probability of one variable $\sigma_i$ given the rest $\sigma_{\setminus i}$. As the centering functions $\phi_{ik}(\sigma_{k\setminus i})$ are independent of $\sigma_i$, this conditional probability is proportional to $\exp(\sum_{k\in K_i} \theta^*_k g_{ik}(\sigma_k))$, and it can be bounded away from zero using $\gamma \geq |\sum_{k\in K_i} \theta^*_k g_{ik}(\sigma_k)|$, which yields $\mathbb{P}(\sigma_i \mid \sigma_{\setminus i}) \geq e^{-2\gamma}/|A_i|$, i.e. Eq. (36). Now consider the conditional probability of one variable $\sigma_i$ given a subset of variables $\sigma_B$, where $B \subseteq V$ and $i \notin B$. Denote the complementary set of $\{i\}$ and $B$ by $S = V \setminus (\{i\} \cup B)$. Using the chain rule and the inequality from Eq. (36), we obtain the lower bound of the lemma.

B Proofs of estimation guarantees for GRISE

Proposition 3 (Gradient Concentration for GRISE). For some node $u \in V$, let $n$ be at least proportional to $\log(2K_u/\delta_1)$. Then, with probability at least $1 - \delta_1$, the components of the gradient of the GISO are uniformly bounded from above.

Proposition 4 (Restricted Strong Convexity for GRISE). For some node $u \in V$, let $n$ be large enough and assume that Condition 1 holds for some norm $\|\cdot\|$. Then, with probability at least $1 - \delta_2$, the residual of the first-order Taylor expansion of the GISO satisfies a restricted strong convexity bound.

We first prove Theorem 1 before proving the propositions.
Proof of Theorem 1: Error Bound on GRISE. For some node $u \in V$, let $n$ be large enough. As the estimate $\hat\theta_u$ is an $\epsilon$-optimal point of the GISO and $\theta^*_u$ lies in the constraint set from Eq. (6), we obtain a basic inequality relating the Taylor residual to the gradient. Using the union bound over Proposition 3 and Proposition 4 with $\delta_1 = \delta_2 = \delta/2$, this inequality holds with probability at least $1 - \delta$. Since by assumption $\|\theta^*_u\|_1 \leq \hat\gamma$ and $\|\hat\theta_u\|_1 \leq \hat\gamma$ with $\gamma \leq \hat\gamma$, as the estimate is an $\epsilon$-optimal point of the $\ell_1$-constrained GISO, the error satisfies $\|\Delta\|_1 \leq 2\hat\gamma$. By choosing $n$ appropriately and after some algebra, we obtain the claimed bound.

B.1 Gradient concentration
The components of the gradient of the GISO are given by $\frac{\partial S_n}{\partial \theta_k}(\theta^*_u) = -\frac{1}{n}\sum_{t=1}^{n} g_{uk}(\sigma^{(t)}_k)\exp\big(-\sum_{k'\in K_u} \theta^*_{k'} g_{uk'}(\sigma^{(t)}_{k'})\big)$. Each term in the summation above is distributed as the random variable $X_{uk} = -g_{uk}(\sigma_k)\exp\big(-\sum_{k'\in K_u} \theta^*_{k'} g_{uk'}(\sigma_{k'})\big)$.

Lemma 2. For any $u \in V$ and $k \in K_u$, we have $\mathbb{E}[X_{uk}] = 0$.

Proof. Simple computation.
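To make the objects above concrete, here is a minimal Python sketch of the GISO, $S_n(\theta_u) = \frac{1}{n}\sum_t \exp\big(-\sum_{k\in K_u}\theta_k\, g_{uk}(\sigma^{(t)}_k)\big)$, and its gradient, for a toy pairwise binary model. The data, the choice of basis $g_{uk}(\sigma) = \sigma_u\sigma_k$, and all names are illustrative assumptions; the analytic gradient is validated against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (an assumption, not the paper's code): node u of a
# pairwise binary model, basis functions g_uk(sigma) = sigma_u * sigma_k.
n, p = 500, 4
samples = rng.choice([-1, 1], size=(n, p))  # i.i.d. placeholders, not model draws
u = 0
neighbors = [1, 2, 3]

def giso(theta):
    """Generalized Interaction Screening Objective at node u."""
    g = samples[:, u:u + 1] * samples[:, neighbors]  # shape (n, K_u)
    return np.mean(np.exp(-g @ theta))

def giso_grad(theta):
    g = samples[:, u:u + 1] * samples[:, neighbors]
    w = np.exp(-g @ theta)                            # per-sample weights
    return -(g * w[:, None]).mean(axis=0)

theta = np.array([0.2, -0.1, 0.0])
# Finite-difference check that the analytic gradient matches the objective.
eps = 1e-6
for k in range(3):
    e = np.zeros(3); e[k] = eps
    fd = (giso(theta + e) - giso(theta - e)) / (2 * eps)
    assert abs(fd - giso_grad(theta)[k]) < 1e-6
print("gradient matches finite differences")
```

Note that each gradient component is an empirical average of bounded terms, which is exactly what the Hoeffding argument below exploits.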

Proof of Proposition 3: Gradient Concentration for GRISE. The random variable $X_{uk}$ is bounded as $|X_{uk}| \leq e^{\gamma}$, since $|g_{uk}(\sigma_k)| \leq 1$ and $|\sum_{k'\in K_u} \theta^*_{k'} g_{uk'}(\sigma_{k'})| \leq \gamma$.
Using Lemma 2 and the Hoeffding inequality, we get the concentration bound in Eq. (56). The proof follows by using Eq. (56) and the union bound over all $k \in K_u$.

B.2 Restricted strong convexity
We make use of the following deterministic functional inequality derived in [18].
Lemma 3. The following inequality holds for all $z \in \mathbb{R}$: $e^{-z} \geq 1 - z + \frac{z^2}{2 + |z|}$.
Proof of Lemma 3. The inequality holds with equality at $z = 0$, and the first derivative of the difference between the left- and right-hand sides is positive for $z > 0$ and negative for $z < 0$.
Let $H_{k_1 k_2}$ denote the correlation between $g_{k_1}$ and $g_{k_2}$, defined as $H_{k_1 k_2} = \mathbb{E}[g_{uk_1}(\sigma_{k_1})\, g_{uk_2}(\sigma_{k_2})]$, and let $H = [H_{k_1 k_2}] \in \mathbb{R}^{|K_u| \times |K_u|}$ be the corresponding matrix. We define $\hat{H}$ similarly based on the empirical estimates of the correlation, $\hat{H}_{k_1 k_2} = \frac{1}{n}\sum_{t=1}^{n} g_{uk_1}(\sigma^{(t)}_{k_1})\, g_{uk_2}(\sigma^{(t)}_{k_2})$. The following lemma bounds the deviation between the above two quantities.

Lemma 4. For $n$ large enough, with high probability we have a uniform bound on $|\hat{H}_{k_1 k_2} - H_{k_1 k_2}|$ for all $k_1, k_2 \in K_u$.
Proof of Lemma 4. Fix $k_1, k_2 \in K_u$. The random variable defined as $Y_{k_1 k_2} = g_{uk_1}(\sigma_{k_1})\, g_{uk_2}(\sigma_{k_2})$ satisfies $|Y_{k_1 k_2}| \leq 1$. Using the Hoeffding inequality, we get the stated concentration. The proof follows by using the union bound over $k_1, k_2 \in K_u$.

Lemma 5. The residual of the first-order Taylor expansion of the GISO satisfies the lower bound stated below.
Proof of Lemma 5. Using Lemma 3, we obtain a pointwise lower bound on the residual. The proof follows by observing that $|\sum_{k\in K_u} \Delta_k\, g_{uk}(\sigma_k)| \leq \|\Delta\|_1$, since $|g_{uk}(\sigma_k)| \leq 1$. We are now in a position to complete the proof of Proposition 4.

Proof of Proposition 4: Restricted Strong Convexity for GRISE. Using Lemma 5, we obtain a chain of inequalities in which step (a) follows from Lemma 4 and step (b) follows from Condition 1.

C Efficient implementation of GRISE and its computational complexity
The iterative Algorithm 2 takes as input a number of steps $T$ and outputs an $\epsilon$-optimal solution of GRISE without constraints, as defined in Eq. (74). This algorithm is an application of the Entropic Descent Algorithm introduced by [1] to a reformulation of Eq. (13) as a minimization over the probability simplex. Note that there exist other efficient iterative methods for minimizing the GISO, such as the mirror gradient descent of [2]. The following proposition provides guarantees on the computational complexity of unconstrained GRISE.
Proposition 5 (Computational Complexity for Unconstrained GRISE). Let $1 \geq \epsilon > 0$ be the optimality gap and $T \geq 6\epsilon^{-2} \ln(2K_u + 1)$ be the maximum number of iterations. Then Algorithm 2 is guaranteed to produce an $\epsilon$-optimal solution of GRISE without the constraint set $\mathcal{Y}_u$ with a number of operations less than $C\, c_g\, n K_u\, \epsilon^{-2} \ln(1 + K_u)$, where $c_g$ is an upper bound on the computational complexity of evaluating any $g_{ik}(\sigma_k)$ for $k \in K_i$, and $C$ is a universal constant that is independent of all parameters of the problem.
Proof of Proposition 5: Computational complexity of unconstrained GRISE. We start by showing that the minimization of GRISE in Eq. (13) in the unconstrained case, where $\mathcal{Y}_u = \mathbb{R}^{K_u}$, is equivalent to the lifted minimization of the logarithm of GRISE in Eq. (70). We first show that for all $\theta_u \in \mathbb{R}^{K_u}$ such that $\|\theta_u\|_1 \leq \hat\gamma$, there exist $x^+, x^-, y$ satisfying the constraints (71), (72), (73). This is easily done by choosing $x^+_k = \max(\theta_k/\hat\gamma, 0)$, $x^-_k = \max(-\theta_k/\hat\gamma, 0)$ and $y = 1 - \|\theta_u\|_1/\hat\gamma$. Second, we trivially see that whenever $\theta_u, x^+, x^-, y$ satisfy the constraints (71), (72), (73), then $\theta_u$ also satisfies $\|\theta_u\|_1 \leq \hat\gamma$. Therefore, any $\theta^{\min}_u$ that is an argmin of Eq. (70) is also an argmin of Eq. (13) without the constraint set $\mathcal{Y}_u$. Moreover, if we find $\theta^{\epsilon}_u$ such that $\log S_n(\theta^{\epsilon}_u) - \log S_n(\theta^{\min}_u) \leq \epsilon/\sqrt{3}$, we obtain an $\epsilon$-minimizer of Eq. (13) without the constraint set $\mathcal{Y}_u$.
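The lifting used in this first part of the proof can be checked mechanically. The sketch below (all variable names are ours) verifies that the proposed $(x^+, x^-, y)$ lies on the probability simplex and recovers $\theta_u$:

```python
import numpy as np

# Sketch of the lifting: given theta with ||theta||_1 <= gamma_hat, set
#   x_plus  = max( theta / gamma_hat, 0)
#   x_minus = max(-theta / gamma_hat, 0)
#   y       = 1 - ||theta||_1 / gamma_hat
# so that (x_plus, x_minus, y) lies on the probability simplex and
# theta = gamma_hat * (x_plus - x_minus).
rng = np.random.default_rng(1)
gamma_hat = 2.0
theta = rng.uniform(-0.4, 0.4, size=5)   # guarantees ||theta||_1 <= 2 here

x_plus = np.maximum(theta / gamma_hat, 0.0)
x_minus = np.maximum(-theta / gamma_hat, 0.0)
y = 1.0 - np.abs(theta).sum() / gamma_hat

assert np.all(x_plus >= 0) and np.all(x_minus >= 0) and y >= 0
assert abs(x_plus.sum() + x_minus.sum() + y - 1.0) < 1e-12  # on the simplex
assert np.allclose(gamma_hat * (x_plus - x_minus), theta)   # recovers theta
print("lifted point lies on the simplex and recovers theta")
```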
The remainder of the proof is a straightforward application of the analysis of the Entropic Descent Algorithm in [1, Th. 5.1] to the above minimization, where $\theta_u$ has been replaced by $x^+, x^-, y$ using Eq. (71). In this analysis, we use the fact that the logarithm of GRISE remains a convex function, as GRISE is a sum of exponential functions, and also that the gradient of our objective function is uniformly bounded, $\|\nabla \log S_n(\theta_u)\|_\infty = \|\nabla S_n(\theta_u)/S_n(\theta_u)\|_\infty \leq 1$, as $|g_k(\sigma_k)| \leq 1$. Note that the computational complexity of the gradient evaluation is proportional to $n K_u c_g$. This is because, for each sample, one has to first compute an exponential containing $K_u$ terms $g_k(\sigma_k)$, each with an evaluation cost of $c_g$, and then multiply the exponential by the factor $-g_k(\sigma_k)$ corresponding to each of the $K_u$ components of the gradient.
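For illustration, here is a minimal sketch of the entropic descent update in the spirit of [1]: a multiplicative update followed by re-normalization onto the simplex. The objective is a simple linear surrogate rather than the actual log-GISO, and the step size and iteration count are illustrative choices, not the ones from the analysis:

```python
import numpy as np

# Entropic (mirror) descent over the probability simplex. The multiplicative
# update x <- x * exp(-step * grad(x)), renormalized, keeps iterates feasible.
def entropic_descent(grad, d, T, step):
    x = np.full(d, 1.0 / d)              # uniform initialization on the simplex
    for _ in range(T):
        x = x * np.exp(-step * grad(x))  # multiplicative (exponentiated) update
        x /= x.sum()                     # re-normalize onto the simplex
    return x

# Minimize f(x) = <c, x> over the simplex; the optimum is the vertex of min c.
c = np.array([0.9, 0.1, 0.5, 0.7])
x = entropic_descent(lambda x: c, d=4, T=2000, step=0.1)
assert x[1] > 0.99                       # mass concentrates on the argmin of c
print("entropic descent converged to", np.round(x, 3))
```

For a linear objective the iterates satisfy $x_j \propto \exp(-\eta T c_j)$, so the mass concentrates exponentially fast on the smallest coordinate of $c$, which is the behavior the assertion checks.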
When the constraint set $\mathcal{Y}_u$ is parametrically complete, an $\epsilon$-optimal solution to Eq. (13) can be found by first solving the unconstrained version of GRISE and then performing an equi-cost projection onto $\mathcal{Y}_u$. We define an $\epsilon$-optimal solution to the unconstrained GRISE problem as in Eq. (74).

Lemma 6. Let $\hat\theta^{\rm unc}_u$ be an $\epsilon$-optimal solution of the unconstrained GRISE problem in Eq. (74). Then an equi-cost projection of $\hat\theta^{\rm unc}_u$ is an $\epsilon$-optimal solution of the constrained GRISE problem.

Proof of Lemma 6. Since unconstrained GRISE is a relaxation of GRISE with the constraint $\theta_u \in \mathcal{Y}_u$, the unconstrained optimal value can only be smaller. Since $\mathcal{Y}_u$ is parametrically complete, by definition the equi-cost projection leaves the objective value unchanged. The estimates $P_{\mathcal{Y}_u}(\hat\theta^{\rm unc}_u)$ are feasible for the constrained GRISE problem, completing the proof.
Lemma 6 implies that the computational complexity of GRISE for parametrically complete cases is the sum of the computational complexity of the unconstrained GRISE and the projection step. Algorithm 3 is an implementation of GRISE for parametrically complete cases. Its computational complexity is obtained easily by combining Lemma 6 and Proposition 5.
Theorem 3 (Computational Complexity for GRISE with P.C. Constraints). Let $\mathcal{Y}_u$ be a parametrically complete set and let $1 \geq \epsilon > 0$ be given. Then Algorithm 3 computes an $\epsilon$-optimal solution to GRISE with a number of operations bounded by $C\, c_g\, n K_u\, \epsilon^{-2} \ln(1 + K_u) + C(P_{\mathcal{Y}_u}(\hat\theta^{\rm unc}_u))$, where $c_g$ is an upper bound on the computational complexity of evaluating any $g_{ik}(\sigma_k)$ for $k \in K_i$, and where $C(P_{\mathcal{Y}_u}(\hat\theta^{\rm unc}_u))$ denotes the computational complexity of the projection step.

D Proofs & algorithms for structure and parameter estimation

D.1 Dimension independence and easier computation of NPC constants
We recall the definition of the NPC constant in Eq. (78). In order to give an intuition for this intricate formula, let us define, for maximal cliques $c \in M_{\rm cli}(G)$, their clique parameterization matrix $G_c$, with indices being maximal factors $k, k' \in [c]_{\rm sp}$.
The clique parameterization matrix is obtained by summing over the variables $\sigma_c$ the globally centered basis functions. Note that the clique parameterization matrices are positive semi-definite by construction. Bounding the expectation over $\sigma_i$ using Lemma 1, we see that the NPC constant is linked to the smallest eigenvalue of the clique parameterization matrix, where in the last line of Eq. (82) we bounded the probability $\mathbb{P}(\sigma_S \mid \sigma_{\setminus S})$ using Lemma 1. We want to rewrite the sum over $\sigma_S$ in Eq. (82) using globally centered functions $h_k$ for factors $k \in [c]_{\rm sp}$ instead of locally centered functions $g_{ik}$. Using the definitions of locally centered functions in Eq. (3) and of globally centered functions in Eq. (22), we obtain the relation in Eq. (83). The sum in Eq. (82) can then be expanded into four contributions, Eqs. (84)-(87). We start by evaluating the contribution from the terms in Eq. (85). For $k \in [c]_{\rm sp}$ and $l \in K_i \setminus [c]_{\rm sp}$, there exists at least one node $u \in c$ such that $u \neq i$ and $u \notin \partial l$. Summing over the variable $\sigma_u$ cancels the expression, as $h_k(\sigma_c)$ is globally centered and $g_{il}(\sigma_l)$ does not depend on $\sigma_u$.
The contribution from Eq. (86) is also null. To see this, we expand the sums using the formula for the remainder in Eq. (83); the sum over $\sigma_S$ vanishes, as $h_k(\sigma_c)$ depends on $\sigma_i$ while the remainder does not. As the contribution from Eq. (87) is non-negative, we can lower-bound Eq. (82) by an expression in which, in the last line, we recognize the definition of the NPC constant from Eq. (23). Since Eq. (91) holds for any $c \in M_{\rm cli}(G)$ that contains the vertex $i$, the Local Learnability Condition is satisfied for a weighted $\ell_{\infty,2}$-norm with LLC constant equal to $\rho^{\rm NPC}$, where the weighted $\ell_{\infty,2}$-norm $\|x_{T_i}\|_{w(\infty,2)}$ is defined in Eq. (93). As the weighted $\ell_{\infty,2}$-norm in Eq. (93) is lower-bounded by the $\ell_{\infty,2}$-norm, the LLC is also satisfied for the $\ell_{\infty,2}$-norm with a corresponding LLC constant. When $\{i\}$ is a maximal clique, then $K_i = [\{i\}]_{\rm sp}$ and the corresponding simplification is straightforward.

Lemma 7. Let $\sigma \in A$ be a discrete random variable with probability distribution $p(\sigma)$. Consider a function $x_\sigma \in \mathbb{R}$ defined over $\sigma$ that is centered, i.e. $\sum_{\sigma\in A} x_\sigma = 0$. The variance of the function $x_\sigma$ is lower-bounded by $\mathrm{Var}(x_\sigma) \geq p_{\min} \sum_{\sigma\in A} x_\sigma^2$, where $p_{\min} = \min_{\sigma\in A} p(\sigma)$.
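The bound of Lemma 7, in the reconstructed form $\mathrm{Var}_p(x) \geq p_{\min} \sum_{\sigma\in A} x_\sigma^2$, can also be checked numerically. The sketch below draws random distributions and random centered functions over random alphabet sizes:

```python
import numpy as np

# Numerical check of Lemma 7 in the reconstructed form
#   Var_p(x) >= p_min * sum_sigma x_sigma^2   whenever sum_sigma x_sigma = 0.
rng = np.random.default_rng(2)
for _ in range(1000):
    q = rng.integers(2, 7)             # alphabet size
    p = rng.dirichlet(np.ones(q))      # random distribution on the alphabet
    x = rng.normal(size=q)
    x -= x.mean()                      # enforce the centering sum_sigma x = 0
    mean = p @ x
    var = p @ (x - mean) ** 2
    assert var + 1e-12 >= p.min() * (x ** 2).sum()
print("Var_p(x) >= p_min * sum(x^2) held in 1000 random trials")
```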
Proof. Writing the variance as $\mathrm{Var}(x_\sigma) = \min_{c\in\mathbb{R}} \sum_{\sigma\in A} p(\sigma)(x_\sigma - c)^2 \geq p_{\min} \min_{c\in\mathbb{R}} \sum_{\sigma\in A} (x_\sigma - c)^2 = p_{\min} \sum_{\sigma\in A} x_\sigma^2$, where in the last equality we used that $\sum_{\sigma\in A} x_\sigma = 0$ and $p_{\min} = \min_{\sigma\in A} p(\sigma)$.

Now suppose that $\{i\}$ is not a maximal clique, i.e. there exists $j \in V$ such that $\{i, j\} \in M_{\rm cli}(G^*)$.
The expectation that arises in the LLC is lower-bounded by its variance, Eq. (104). Let $\{S_r\}_{r=1,\dots,\chi}$ be a minimal coloring of the graph $G^*$. For a given color $r$, define the set $C_r = S_r \setminus \{i\}$ and apply the law of total variance to the right-hand side of Eq. (104), conditioning on $\sigma_{\setminus C_r}$, where the marginal and the conditional probability distributions used to compute the expectation and the variance, respectively, are indicated by subscripts. As the variance on the right-hand side of Eq. (105) is conditioned on $\sigma_{\setminus C_r}$, only basis functions involving a pair $(\sigma_i, \sigma_j)$ with $j \in C_r$ give a non-zero contribution to the conditional variance, Eq. (106). We can rewrite the locally centered functions in terms of globally centered functions using their definitions in Eq. (3) and Eq. (22). We see from Eq. (107) that the difference between locally and globally centered functions only depends on the variable $\sigma_i$. This means that we can interchange locally centered functions with globally centered functions on the right-hand side of Eq. (106), as the variance is conditioned on $\sigma_i$. Since $\{S_r\}_{r=1,\dots,\chi}$ is a vertex coloring, by definition all nodes $j \in C_r$ having the same color do not share a factor node, i.e. $\forall j_1, j_2 \in C_r$, $\nexists k \in K^*$ such that $j_1, j_2 \in \partial k$. This implies that the variables $\sigma_j$ with $j \in C_r$ are independent conditioned on the remaining variables $\sigma_{\setminus C_r}$, and the variance in Eq. (108) can be rewritten as a sum of single-site variances, Eq. (109). The right-hand side of Eq. (109) is centered with respect to $\sigma_j$, and we can apply Lemma 7 and Lemma 1 to find a lower bound that depends only on the random variable $\sigma_i$, Eq. (110). Plugging the results derived in Eq. (108), Eq. (109), and Eq. (110) back into the initial inequality in Eq. (105), we find Eqs. (111) and (112), where in Eq. (112) we used the definition of the NPC constant in Eq. (23) to bound the quadratic form involving $x$.
Finally, we average the inequality in Eq. (112) over the different colors, and hence over the possible conditioning sets $C_r$, to conclude the proof.

D.3 Proofs of estimation guarantees for the SUPRISE algorithm
Proof of Theorem 2: Reconstruction and Estimation Guarantees for SUPRISE. As the NPC constant is non-zero, $\rho^{\rm NPC}_u > 0$ for all nodes $u \in V$, we apply Proposition 1 in conjunction with Theorem 1 to find that, for each step $t \in \{0, \dots, L-1\}$ and with probability at least $1 - \delta/(pL)$, GRISE around a node $u \in V$ recovers the parameters in each maximal clique $c \in M_{\rm cli}(G[(V, K_t)])$ that contains $u$ with precision $\sum_{k\in[c]_{\rm sp}} (\hat\theta_k - \theta^*_k)^2 \leq (\alpha/2)^2$. Therefore, at each step $t \in \{0, \dots, L-1\}$ and with probability at least $1 - \delta/L$, the factor removal procedure is guaranteed to remove all factors of size $L - t$ that are not present in the graph, provided all factors of size bigger than $L - t$ were correctly removed in the previous steps. Since there are at most $L$ removal steps, the overall procedure discovers all maximal cliques with probability at least $1 - \delta$.

E Application to special cases
In this section, we show how to apply Theorem 2 in order to derive the sample and computational complexity of reconstructing graphical models for some common basis functions.

E.1 Binary models on the monomial basis
In this subsection, we consider general models on the binary alphabet $A_i = \{-1, 1\}$. Let the factors be all nonempty subsets of $\{1, \dots, p\} = V$ of size at most $L$, i.e. $K = \{k \subseteq V \mid 1 \leq |k| \leq L\}$ (115). The parameterization uses the monomial basis given by $f_k(\sigma_k) = \prod_{j\in k} \sigma_j$ with $k \in K$. Note that the monomial basis functions are already globally centered, $f_k \equiv g_k \equiv h_k$. The probability distribution for this model is expressed as $\mu(\sigma) \propto \exp\big(\sum_{k\in K} \theta^*_k \prod_{j\in k} \sigma_j\big)$ (116). When $L \leq 2$, the model in Eq. (116) is pairwise and is referred to as the Ising model.
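The global centering claim $f_k \equiv g_k \equiv h_k$ can be verified by brute-force enumeration: each monomial sums to zero over the configuration space, since flipping any spin $j \in k$ pairs configurations with opposite signs. A quick check for $p = 4$ and $L = 3$ (sizes chosen for illustration):

```python
import itertools
import math

# Verify that every monomial f_k(sigma) = prod_{j in k} sigma_j with
# 1 <= |k| <= L sums to zero over all configurations in {-1,1}^p,
# i.e. the monomial basis is globally centered.
p, L = 4, 3
for size in range(1, L + 1):
    for k in itertools.combinations(range(p), size):
        total = sum(math.prod(sigma[j] for j in k)
                    for sigma in itertools.product([-1, 1], repeat=p))
        assert total == 0
print("all monomials with 1 <= |k| <= 3 are globally centered")
```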
For each maximal clique there exists exactly one maximal factor in its span. Therefore, the NPC constant as defined in Eq. (23) is $\rho^{\rm NPC} = 1$, since the corresponding quadratic form can be evaluated explicitly for any clique $c$, and the minimum is achieved for cliques of size one. As every node is involved in at most $K_u \leq p^{L-1}$ factor functions, the structure of binary models can be recovered as a corollary of Theorem 2.

Corollary 1 (Structure recovery for binary graphical models). Let $\sigma^{(1)}, \dots, \sigma^{(n)}$ be i.i.d. samples drawn according to $\mu(\sigma)$ in Eq. (116), and let $\alpha \leq \min_{k:\, \theta^*_k \neq 0} |\theta^*_k|$ be the intensity of the smallest non-zero parameter. If $n$ exceeds the corresponding sample-complexity bound of Theorem 2, then the structure of the binary graphical model is perfectly recovered using Algorithm 1, i.e. the estimated support coincides with the support of $\theta^*$, with probability $1 - \delta$. Moreover, the total computational complexity scales as $O(p^L)$ for fixed $L, \alpha, \gamma, \hat\gamma$ and $\delta$.
For pairwise Ising models that are $\chi$-colorable, we also have guarantees on the $\ell_2$-norm reconstruction of pairwise parameters by SUPRISE.

Corollary 2 ($\ell_2$-parameter estimation for Ising models). Let $\sigma^{(1)}, \dots, \sigma^{(n)}$ be i.i.d. samples drawn according to $\mu(\sigma)$ in Eq. (116) for $L = 2$, and let $\alpha > 0$ be the prescribed estimation accuracy. If $n$ exceeds the corresponding sample-complexity bound, then, with probability at least $1 - \delta$, the parameters are estimated by Algorithm 1 with $\ell_2$-error at most $\alpha$. The computational complexity of obtaining these estimates is $O(p^2)$ for fixed $\chi, \alpha, \gamma, \hat\gamma$ and $\delta$.
As graphs with bounded degree $d$ have a chromatic number $\chi \leq d + 1$, Corollary 2 recovers the $\ell_2$-guarantees for sparse graph recovery of [18], albeit with a slightly worse dependence on $\gamma$ and $\alpha$. The worse $\gamma$-dependence is an artifact of the general analysis presented in this paper: for models over binary variables, one can improve the $e^{8\gamma}$ dependence to $e^{6\gamma}$ by using Bernstein's inequality in Proposition 3 instead of Hoeffding's inequality. The worse $\alpha$-dependence, however, seems to be more fundamental. It is caused by the replacement of the $\ell_1$-penalty used in [18] by an $\ell_1$-constraint.
For graphs with unbounded vertex degree but low chromatic number, such as star graphs or bipartite graphs, Corollary 2 shows that the parameters of the corresponding Ising model can be fully recovered with a bounded $\ell_2$-error using a number of samples that is only logarithmic in the model size $p$.

E.2 L-wise models with arbitrary alphabets on the indicator basis
In this subsection, we consider $L$-wise graphical models over variables taking values in arbitrary alphabets $A_i$ of size $q_i$, parametrized with indicator-type functions. The building block of the set of basis functions is the centered univariate indicator function, $\Phi_{s_i, \sigma_i} = \delta_{s_i, \sigma_i} - 1/q_i$ (121), where $s_i, \sigma_i \in A_i$ are prescribed letters of the alphabet. The univariate indicator functions in Eq. (121) are centered Kronecker delta functions and possess convenient properties, such as symmetry, $\Phi_{s_i, \sigma_i} = \Phi_{\sigma_i, s_i}$, and contraction under summation, $\sum_{\tau_i\in A_i} \Phi_{\tau_i, s_i} \Phi_{\tau_i, \sigma_i} = \Phi_{s_i, \sigma_i}$ (122).
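With the natural reading of the centered indicator in Eq. (121), $\Phi_{s,\sigma} = \delta_{s,\sigma} - 1/q$ (an assumption consistent with the stated symmetry and contraction properties), these properties reduce to matrix identities for $\Phi = I - \tfrac{1}{q}\mathbf{1}\mathbf{1}^\top$, which the following sketch verifies:

```python
import numpy as np

# Centered indicator basis in the reconstructed form
#   Phi[s, sigma] = delta(s, sigma) - 1/q,
# checked for symmetry and the contraction property
#   sum_tau Phi[tau, s] * Phi[tau, sigma] = Phi[s, sigma].
q = 4
Phi = np.eye(q) - np.full((q, q), 1.0 / q)

assert np.allclose(Phi, Phi.T)            # symmetry
assert np.allclose(Phi @ Phi, Phi)        # contraction: Phi is a projector
assert np.allclose(Phi.sum(axis=0), 0.0)  # centered in each argument
print("symmetry, contraction, and centering all hold for q =", q)
```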
The set of factors $K$ consists of pairs associating elements of $R = \{r \in \mathcal{P}(V) \mid |r| \leq L\}$, the subsets of variables of size at most $L$, with an alphabet configuration in $A_r = \prod_{i\in r} A_i$: $K = \{(r, s_r) \mid r \in R,\ s_r \in A_r\}$.
In what follows, we slightly abuse the notation of factors and parameters by shortening $(r, s_r) \equiv s_r$. With these notations, the indicator basis functions are constructed as $f_{s_r}(\sigma_r) = \prod_{i\in r} \Phi_{s_i, \sigma_i}$. Note that the indicator basis functions are globally centered, i.e. $f_{s_r} \equiv g_{s_r} \equiv h_{s_r}$. The probability distribution of an $L$-wise graphical model with arbitrary alphabets is defined in Eq. (124). The family of distributions in Eq. (124) is not uniquely parameterized by the parameters $\theta^*$. To see this, we introduce the linear map $P_r$ acting on arrays $\theta_{s_r}$ as in Eq. (125). Using the contraction property from Eq. (122), it is easy to see that $P_r$ is a projector, i.e. $P_r^2 = P_r$. It is also straightforward to verify that $P_r \theta$ is always a globally centered array, and that if $\theta$ is already globally centered then $P_r \theta = \theta$. Therefore, the maps $P_r$ are projectors onto the space of arrays $\theta_{s_r}$ that are globally centered, i.e. $\sum_{s_i} \theta_{s_r} = 0$ for all $i \in r$. We lift the parametrization degeneracy in Eq. (124) by imposing that the parameters $\theta^*$ lie in the range of the projectors $P_r$. We thus require that the parameters satisfy the following linear constraints at each vertex $u \in V$: $\mathcal{Y}_u = \{\theta_u \in \mathbb{R}^{K_u} \mid \forall r \in R \text{ with } r \ni u,\ \forall i \in r,\ \sum_{s_i\in A_i} \theta_{s_r} = 0\}$ (126).
The constraint set in Eq. (126) is parametrically complete according to Definition 2, as we explicitly exhibited the equi-cost projection $\{P_r\}_{r\in R}$ onto it. The computational complexity of this projection is no more than $O(p^{L-1} q^L)$.
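A realization of $P_r$ consistent with the contraction property (our reconstruction of Eq. (125), an assumption) applies $\Phi$ along every index of the array; for a pair $r = \{i, j\}$ with equal alphabet sizes this is the Kronecker product $\Phi \otimes \Phi$. The sketch below checks that this map is a projector onto globally centered arrays:

```python
import numpy as np

# Reconstructed realization of P_r for r = {i, j}, alphabet size q each:
#   (P_r theta)_{s_i s_j} = sum_{t_i, t_j} Phi[s_i, t_i] Phi[s_j, t_j] theta_{t_i t_j},
# i.e. the Kronecker product Phi ⊗ Phi acting on the flattened q*q array.
q = 3
Phi = np.eye(q) - np.full((q, q), 1.0 / q)
P = np.kron(Phi, Phi)

rng = np.random.default_rng(3)
theta = rng.normal(size=q * q)
proj = P @ theta

assert np.allclose(P @ P, P)                           # P_r is a projector
# P_r theta is globally centered: summing over either index gives zero.
assert np.allclose(proj.reshape(q, q).sum(axis=0), 0.0)
assert np.allclose(proj.reshape(q, q).sum(axis=1), 0.0)
assert np.allclose(P @ proj, proj)                     # centered arrays are fixed points
print("P_r projects onto globally centered arrays")
```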
As the constraint set in Eq. (126) forms a linear subspace, the perturbation set is simply $\mathcal{X}_u = \mathcal{Y}_u \cap B_1(2\hat\gamma)$, the intersection of the constraint set with the $\ell_1$-ball of radius $2\hat\gamma$. Maximal cliques are subsets of vertices and hence are also elements of $R$. Therefore, the NPC constant as defined in Eq. (23) is bounded by $\rho^{\rm NPC} \geq \exp(-2\gamma)/q$, since the quadratic form can be bounded accordingly for each clique, as $x \in \mathcal{X}_u$ is globally centered and thus lies in the range of the projector $P_c$. Every node is involved in at most $K_u \leq p^{L-1} q^L$ factor functions, and the structure of $L$-wise models with arbitrary alphabets can be recovered as a corollary of Theorem 2. For pairwise models with arbitrary alphabets that are $\chi$-colorable, we also have guarantees on the $\ell_2$-norm reconstruction of pairwise parameters by SUPRISE.
The computational complexity of obtaining these estimates is $O(p^2)$ for fixed $\chi, q, \alpha, \gamma, \hat\gamma$ and $\delta$.