Estimation of High-Dimensional Graphical Models Using Regularized Score Matching

Graphical models are widely used to model stochastic dependences among large collections of variables. We introduce a new method of estimating undirected conditional independence graphs based on the score matching loss, introduced by Hyvarinen (2005), and subsequently extended in Hyvarinen (2007). The regularized score matching method we propose applies to settings with continuous observations and allows for computationally efficient treatment of possibly non-Gaussian exponential family models. In the well-explored Gaussian setting, regularized score matching avoids issues of asymmetry that arise when applying the technique of neighborhood selection, and compared to existing methods that directly yield symmetric estimates, the score matching approach has the advantage that the considered loss is quadratic and gives piecewise linear solution paths under $\ell_1$ regularization. Under suitable irrepresentability conditions, we show that $\ell_1$-regularized score matching is consistent for graph estimation in sparse high-dimensional settings. Through numerical experiments and an application to RNAseq data, we confirm that regularized score matching achieves state-of-the-art performance in the Gaussian case and provides a valuable tool for computationally efficient estimation in non-Gaussian graphical models.


Introduction
Undirected graphical models, also known as Markov random fields, are important tools for summarizing dependency relationships between random variables and have found application in many fields, including bioinformatics, language and speech processing, and digital communications.Each such model is associated to an undirected graph G = (V, E), with vertex set V and edge set E ⊂ V × V .For a random vector X = (X j : j ∈ V ) indexed by the nodes of G, the graphical model given by G requires that X j and X k be conditionally independent given all other variables whenever nodes j and k are not joined by an edge in G (Lauritzen, 1996).If G is the smallest graph such that X satisfies this requirement, we term G the conditional independence graph of X.In this case, X j and X k are conditionally independent given all other variables if and only if j and k are non-adjacent in G.We will always take the vertex set to be V = {1, . . ., m}, so m is the number of observed variables in X.
Specific models are obtained from additional distributional assumptions.Particularly, an assumption of multivariate normality gives Gaussian graphical models, for which estimation of conditional independence graphs is equivalent to covariance selection (Dempster, 1972).If X is jointly multivariate normal with mean vector µ and covariance matrix Σ-in symbols, settings.The method also performs well in our applications to non-Gaussian models, which include models that seem rather difficult to handle via other methods.
In the Gaussian setting, regularized score matching is structurally closest to pseudo-likelihood methods with symmetry constraints, such as SPACE (Peng et al., 2009), symmetric lasso (Friedman, Hastie and Tibshirani, 2010) and SPLICE (Rocha, Zhao and Yu, 2008).A thorough discussion of these different methods is given by Khare, Oh and Rajaratnam (2015) who also reformulate the SPACE objective function to ensure convergence of coordinate descent algorithms.They abbreviate their method as CONCORD.For brevity, we refer to these algorithms collectively as SPACE.We note that in contrast to regularized score matching, the SPACE methods do not have piecewise linear solution paths.Furthermore, as remarked before, the computational convenience of regularized score matching carries over to non-Gaussian settings.
A limitation of the original score matching introduced by Hyvärinen (2005) is that it requires the data to be generated from a distribution whose density is twice differentiable on R m .Hyvärinen (2007) proposed a generalization of the approach to the important case of non-negative data.For exponential families, the non-negative score matching loss is again a semidefinite quadratic function.We explore regularization of the non-negative score matching loss as a tool for estimation of conditional independence graphs from high-dimensional non-negative data, and we establish consistency of the method.
The remainder of the paper is organized as follows.Section 2 provides the needed background on score matching and its applications.In Section 3, we describe the proposed method, regularized score matching.Implementation details are given in Appendix A. In Section 4, we present results of numerical experiments to compare the performance of the procedure with existing approaches.An application to RNAseq data is given in Section 5. Section 6 provides sparsistency theory for both basic and non-negative regularized score matching.Proofs are given in Section 7 with details deferred to Appendix B and C. We end with a discussion in Section 8.

Notation
The following notational conventions are used throughout the paper: (i) Random variables/vectors are denoted by upper case letters; lower case letters are used for observed values.So, x ∈ R m is an observed value of the random vector X.Similarly, x = (x ij ) ∈ R n×m is a matrix of observed values, which will typically hold the realizations of n i.i.d.copies of X in its rows.We index the columns of a matrix with subscripts, so x j refers to the jth column of x.Superscripts in parentheses are used to refer to the rows of a matrix, so x (i) is the ith row of x. (ii) For a matrix U = (u ij ) ∈ R m×m , we denote the vectorization obtained by stacking columns by vec(U) = u 11 , u 21 , . . ., u m1 , . . ., u 1m , . . ., u mm T . (

Score Matching
We begin with an overview of Hyvärinen's score matching, discussing first random vectors supported on all of R m and then random vectors supported on the nonnegative orthant.We also review the convenient form of the score matching estimating equations in exponential families.

Basic score matching
Suppose X is a continuous random vector taking values in R m , with joint distribution P .Suppose further that P belongs to the family P that comprises all probability distributions with support equal to R m and a twice differentiable density with respect to Lebesgue measure.We emphasize that in a statistical context the differentiability requirement is with respect to data.We write p to denote the density of P and adopt the usual notation for the gradient and Laplacian For a distribution Q ∈ P with density q, define the divergence function as the expected squared distance between the gradients of the log-densities of the two distributions Q and P .By choosing Q to minimize (2.1), we are matching 'scores' with respect to the data vector x.Hence, (2.1) has been referred to as the score matching loss.It is evident from (2.1) that the score matching loss is uniquely minimized when Q = P .Upon initial inspection, optimization of J(Q) seems to require knowledge of P in an important way.However, Hyvärinen (2005) showed that, under mild regularity conditions, the score matching loss (2.1) can be rewritten as: where 'const' refers to a term independent of Q.The key term in the integrand in (2.2) is the so-called Hyvärinen scoring rule The integral in (2.2) admits an empirical version in which the integration with respect to P is replaced by an average over an observed sample, which we arrange into a data matrix x ∈ R n×m .This leads to the empirical score matching loss and the score matching estimator (SME) The score matching loss J(Q) was motivated by problems involving models whose distributions have an intractable normalization constant.Indeed, evaluating (2.2) and computing the SME Q requires no knowledge of the normalization constant, which is eliminated upon taking logarithmic derivatives with respect to x.Besides the imaging problems considered by Hyvärinen (2005), score matching has been applied to spatial statistics (Dawid and Musio, 2013) and neural networks (Köster and Hyvärinen, 2007;Vincent, 2011;Le et al., 2011).
The statistical properties of SMEs in classical large sample settings have been investigated by Hyvärinen (2005Hyvärinen ( , 2007) ) and Forbes and Lauritzen (2015).In particular, it has been shown that, under the usual regularity conditions, SMEs are asymptotically consistent and normal in large-sample theory.However, SMEs are not necessarily asymptotically efficient.

Extension to non-negative data
The partial integration arguments underlying (2.2) may fail to apply when considering distributions Q that are not supported on all of R m .In particular, when Q is taken to be from P + , i.e. the family of distributions that are supported on R m + = [0, ∞) m with Lebesgue densities that are twice differentiable on (0, ∞) m , then partial integration may not be possible due to discontinuities at points with zero coordinates.We thus consider the non-negative score matching loss, as proposed in Hyvärinen (2007).Here, '•' stands for the Hadamard product, that is, elementwise multiplication.
The score matching loss (2.1) can be thought of as a function of the Euclidean distance between the gradients of the model density q and true density p with respect to a hypothetical location parameter µ, evaluated at 0. That is, we may write (2.1) as Likewise, the non-negative score matching loss compares the gradient of the model density q and true density p with respect to a hypothetical scale parameter σ evaluated at 1, Under suitably adjusted regularity conditions, Hyvärinen (2007) showed that the non-negative score matching loss from (2.4) can be simplified into with scoring rule (2.6)For a data matrix x ∈ R n×m , one obtains the empirical non-negative score matching loss and the non-negative score matching estimator (SME + ) Again, under the usual regularity conditions, the estimator Q+ is asymptotically consistent and normal in traditional large-sample theory.

Score matching in exponential families
Hyvärinen (2007) and Forbes and Lauritzen (2015) have shown that the SME has a convenient closed form as a rational function of the data when P is an exponential family.Hyvärinen (2007) showed the same for SME + for the example of truncated normal distributions.As they provide the basis for our later work, we revisit these results for both SME and SME + .Let P = (Q θ : θ ∈ Θ) be an exponential family with natural parameter space Θ.Suppose that the distributions Q θ have their common support equal to either X = R m or X = R m + , and that P is dominated by Lebesgue measure on R m .Assuming that the sufficient statistics t(x) take values in R s , the log-densities of the distributions Q θ have the form (2.9) Lemma 1.Let x ∈ R n×m be a data matrix, and suppose P = (Q θ : θ ∈ Θ) is an exponential family characterized by (2.8) and (2.9).If P has support X = R m , then the empirical score matching loss Ĵ(x, Q θ ) is a quadratic function in θ with where Γ(x) is a positive semidefinite s × s matrix, and g(x) is an s-vector.The same is true for Ĵ+ (x, Q θ ) when P has support X = R m + .Proof.For j = 1, . . ., m and x ∈ R m , define the s-vectors It then follows from (2.8) that Ĵ(x, Q θ ) can be expressed in the claimed form with (2.13) For non-negative score matching, Ĵ+ (x, Q θ ) admits the claimed form with where the x ij are the entries of the n × m data matrix x.
Lemma 1 implies that, when working with exponential families, both score matching objectives are quadratic functions of the unknown parameter vector θ.A score matching estimator θ thus satisfies a set of linear estimating equations θT Γ(x) + g(x) = 0. (2.17)

Pairwise interaction models
The most basic class of exponential families that appear in graphical modeling are pairwise interaction models with log-densities Here, the t jk are sufficient statistics that depend only on the jth and kth coordinate of x, and the θ jk are interaction parameters.If Q θ denotes the distribution with density given by (2.18), then the Hammersley-Clifford Theorem implies that an edge between nodes j and k exists in the conditional independence graph of Q θ if and only if θ jk is nonzero.The specific models we consider later either exactly have the form in (2.18) or are closely related extensions with log-densities where pairwise interactions may be of A different types and we also include L sets of sufficient statistics t (l) j depending on the individual coordinates.The latter appear, for instance, when allowing distributions to vary in location.The distribution Q θ defined by (2.19) has no edge between j and k in its conditional independence graph if and only if θ In our study of score matching methods for models of the type (2.18) or (2.19), it will be convenient to introduce the symmetric m × m interaction matrix Θ with entries Lemma 2. Let P to be the pairwise interaction model given by (2.18) with symmetric m × m interaction matrix Θ.If P has support X = R m , then the empirical score matching loss for a symmetric m 2 × m 2 matrix Γ(x) that is block-diagonal, with all blocks of size m × m.
The same is true for Ĵ+ (x, Q θ ) when P has support X = R m + .Proof.By (2.11) and (2.14), it suffices to show that there exists a block-diagonal matrix (2.21) where θ = (θ jk : j ≤ k).Now, Define a vector hj (x) ∈ R m 2 , indexed by pairs (k, l) with 1 ≤ k, l ≤ m, by setting the entries to (2.22) Then h j (x) T θ = hj (x)vec(Θ) and (2.21) holds with Γ j (x) = hj (x) hj (x) T , which is blockdiagonal as it is zero with the exception of the m × m block indexed by pairs (k, l) with k = j.
Remark 1.When P is a model as specified in (2.19), then the empirical (non-negative) score matching loss may still be represented as an explicit quadratic form with a block-diagonal symmetric matrix Γ(x) as in (2.20).However, Γ(x) is then of size (Am 2 + Lm) × (Am 2 + Lm), and its m diagonal blocks are of size (Am + L) × (Am + L).The jth block has its rows and columns corresponding to the jth columns of each of Θ (1) , . . ., Θ (A) as well (θ (1) j , . . ., θ (L) j ).Example 1.If the exponential family is taken to be the family of centered multivariate normal distributions with precision matrix K = (κ jk ), then the support is X = R m and and dropping a term that is constant in K, the empirical score matching loss from (2.2) takes the form − tr(K) + 1 2 tr(KKW), (2.24) where is the empirical covariance matrix (under knowledge of zero mean).Lemma 2 applies with t jk (x j , x k ) = x j x k , in which case the matrix Γ j (x) constructed in the proof of the lemma does not depend on j, other than through the location of the nonzero block.Indeed, (2.20) holds with Γ(x) = I m×m ⊗ W and g(x) = vec(I m×m ), where I m×m is the m × m identity matrix.Clearly, Γ(x) is positive definite if and only if W is as well.If W is invertible then SME of K is K = W −1 and coincides with the maximum likelihood estimator.
Example 2. Consider truncated normal densities of the form (2.25) Using κ j to denote the jth column of K, it can be shown that the empirical non-negative score matching objective is (2.26) The loss can be written as in (2.10) with Γ(x) a block diagonal m 2 × m 2 matrix, whose jth block is given by 1 Moreover, g(x) = 2w + w diag , where w = vec(W) and w diag = vec(diag(W)).The maximum likelihood estimator for K has no closed form due to intractable normalizing constants.
Example 3. Finally, consider the family of distributions with densities of the form Here, b = (β 1 , . . ., β m ) T is an m-vector, and B = (β jk ) and jk ) are symmetric m×m interaction matrices, the latter having a zero diagonal.This family is a class of distributions with normal conditionals, with densities that need not be unimodal (Arnold, Castillo and Sarabia, 1999;Gelman and Meng, 1991).This family is intriguing from the perspective of graphical modeling as, in contrast to the Gaussian case, conditional dependence may also express itself in the variances.For conditional independence of X j and X k both β jk and β (2) jk need to vanish.
By Remark 1, the empirical score matching loss for the family from (2.27) can be written as a quadratic function with the quadratic term given by block-diagonal matrix Γ(x) of size (2m 2 + m) × (2m 2 + m).The blocks are of size (2m + 1) × (2m + 1), and the jth block has its rows and columns corresponding to the jth columns of B and B (2) and the jth entry in b.

Regularized Score Matching
In this section, we propose the use of regularized score matching for graphical model selection in the setting of high-dimensional sparse graphical models.We begin by discussing the proposed method and its implementation.Later sections show that, despite the fact that SMEs need not be asymptotically efficient in the sense of traditional large-sample theory, regularized score matching achieves state-of-the-art statistical performance in high-dimensional problems, all the while allowing seemingly complicated non-Gaussian graphical models to be treated in a computationally efficient manner.

Methodology
Building on the ideas underlying methods such as glasso, neighborhood selection and SPACE, we augment the score matching loss with a sparsity-promoting penalty.Our focus is on the most basic case of an 1 penalty but other regularization schemes could be considered instead; see also Example 3 below.
Using the generic representation given in Lemma 1, for an exponential family, the proposed method is based on minimizing the objective where Γ(x) is positive semidefinite and λ ≥ 0 is a tuning parameter that controls the sparsity level.Larger values of λ yield sparser solutions, and λ = 0 gives the unregularized SME.Since Γ(x) is positive semidefinite, the function Ĵλ (θ) is convex but in the settings of interest here Γ(x) will be singular and Ĵλ (θ) will not be strictly convex.
The regularized score matching objective from (3.1) is similar to the lasso objective in linear regression (Tibshirani, 1996), where the function to be minimized takes the special form for a 'response vector' y and a 'design matrix' X.In the applications we have in mind (3.1) cannot be written exactly as in (3.2) because the vector g(x) is generally not in the column span of Γ(x).However, we may adapt existing optimization methods for lasso to solve the regularized score matching problem.Implementation details are given in Appendix A.
If the considered exponential family is supported on X = R m and we use the loss from (2.3), then we call the minimizer of (3.1) the regularized score matching estimator (rSME).If X = R m + and we use the loss from (2.7), then we abbreviate to rSME + .In specific instances of graphical models, we may apply the 1 penalty only to those coordinates of θ whose vanishing corresponds to absence of edges in a conditional independence graph.If the subset E ⊆ {1, . . ., s} holds the relevant coordinates then we use the penalty Example 1 (cont.).For the (centered) Gaussian case considered in Example 1, the target of estimation is the symmetric precision matrix K.The conditional independence graph corresponds to the pattern of zeros in the off-diagonal entries of K and the rSME is where W is the empirical covariance matrix and K 1,off = K 1,E penalizes only the offdiagonal entries indexed by E = {(j, k) : j = k}.We emphasize that while in this example the natural parameter space is the positive definite cone, we propose minimizing simply over the entire space of symmetric m × m matrices, denoted by Sym m .As our interest is primarily in graph selection, we do not enforce positive definiteness of K, which is in line with methods such as SPACE or neighborhood selection; compare Khare, Oh and Rajaratnam (2015).We remark that evaluating the function from (3.3) at a nonsymmetric matrix K as well as its transpose K T gives the same value.By convexity, minimizing over all m × m matrices gives a solution in Sym m , which then must equal K.
Example 2 (cont.).In the truncated normal family from Example 2, the conditional independence graph corresponds again to the zero pattern in the off-diagonal entries of the positive definite interaction matrix K. Proceeding in analogy to the Gaussian case, we define the rSME + as the minimizer K+ of the objective given by (2.26) with the penalty λ K 1,off added on.Again, we ignore the positive definiteness requirement and minimize the penalized non-negative score matching loss with respect to K ∈ Sym m .Example 3 (cont.).For the family of distributions with normal conditionals from Example 3, we would like a penalty to induce joint sparsity in the two symmetric interaction matrices B and B (2) , because an edge between nodes j and k is absent from the conditional independence graph if and only both B and B (2) have their (j, k) entries zero.For this purpose, it is natural to adopt the group lasso penalty (Yuan and Lin, 2006).The rSME is then obtained by minimizing the empirical score matching loss augmented by the penalty Ignoring again any refined constraints from the natural parameter space of the family, we propose minimizing the penalized loss with respect to b ∈ R m and B, B (2) ∈ Sym m .Since the group lasso is applied with small groups (of size 2), the problem would be suitable for application of exact block-coordinate descent as discussed in Foygel and Drton (2010a).
The graphical models we are interested in are pairwise interaction models that have additional special structure in that the matrix Γ(x) is block-diagonal with m blocks of equal size; recall Lemma 2 and Remark 1. Denote the diagonal blocks by Γ 1 (x), . . ., Γ m (x), which in the setup from (2.19) are of size (Am 2 + Lm) × (Am 2 + Lm).Each block is the sum of n symmetric rank one matrices and we have the decomposition (3.5) The n columns of each of the matrices H j (x) were specified in (2.22).It now holds that the regularized score matching problem from (3.1) has a unique minimizer provided each one of the n × (Am + L) blocks H 1 (x), . . ., H m (x) defined in (3.5) has its columns in general position.
Example 1 (cont.).In the Gaussian case, By the Lemma in Okamoto (1973), the set of matrices x that fail to be in general position has measure zero.The rSME K is unique almost surely when data are generated from a continuous joint distribution.
Example 2 (cont.).In the truncated normal case, H j (x) is equal to the matrix obtained from x by multiplying each column element-wise with x j , the jth column of x.The Lemma in Okamoto (1973) implies that the rSME + is unique almost surely.
For the normal conditionals model from Example 3, almost sure uniqueness would have to be derived by appealing to results on uniqueness of group lasso (Roth and Fischer, 2008).

Piecewise linear paths
The rSME depends on the regularization parameter λ.In this section we make this explicit and denote it by θλ .Adopting standard language, we refer to the set of θλ obtained by varying λ as the solution path and call this path piecewise linear if there exists 0 Piecewise linear solution paths have the appeal that the entire solution path can be found by calculating the change points λ r and associated slopes ξ r .
The next lemma is a consequence of the quadratic nature of the score matching objective for exponential families, and holds for the lasso problem as well.
Lemma 3. The solution path θλ for the regularized score matching problem from (3.1) is piecewise linear.
Proof.An s-vector z belongs to ∂ θ 1 , the subdifferential of the 1 norm, if The Karush-Kuhn-Tucker (KKT) conditions characterizing optimality in (3.1) are The linear relationship between θ and λ (for "fixed" ẑ) implies the claim.
While straightforward to show, the property of piecewise linear paths is special to the score matching method we propose.Other methods that give symmetric estimates of precision matrices in Gaussian graphical models, such as glasso or the SPACE-type methods discussed in Khare, Oh and Rajaratnam (2015) do not have piecewise linear solution paths.This said, piecewise linear paths also arise in neighborhood selection (Meinshausen and Bühlmann, 2006), which, however, is a formulation without symmetry.Note also that when using a group lasso penalty as suggested for Example 3, rSME solution paths are no longer piecewise linear.Example 1 (cont.).In the Gaussian model, the KKT conditions state that K is a solution to (3.1) if and only if , which in slight abuse of notation, we take to mean that (3.9) The first case accounts for the fact that the objective is smooth in the diagonal entries of the precision matrix, which are not penalized.Combining (3.8) and (3.9), we have that A Gaussian solution path is shown in Figure 1b, with the horizontal axis transformed to t(λ) = j =k |κ λ jk |.The data were drawn from a multivariate normal distribution with the conditional independence graph from Figure 1a, with sample size n = 12.We note that, as one would hope, the coefficient that last enters the solution corresponds to the absent edge (1, 4).

Tuning
A number of methods have been proposed for selecting the regularization parameter λ in 1 penalization methods and can be applied in our context.On the one hand, a predictive assessment as in cross-validation can be considered, but the selected graphs are typically too dense.Other possibilities include generalized cross validation (GCV) (Tibshirani, 1996), Akaike's Information Criterion (AIC), approaches based on stability under resampling (Meinshausen and Bühlmann, 2010;Shah and Samworth, 2013;Liu, Roeder and Wasserman, 2010), the Bayesian Information Criterion (BIC) (Schwarz, 1978) as well as extensions of BIC proposed to cope with large model spaces (Chen and Chen, 2008;Gao et al., 2012;Foygel and Drton, 2010b;Barber and Drton, 2015).The latter come with some consistency guarantees.
As a demonstration, for the Gaussian case from Example 1, we may consider an extended BIC criterion based on the basic score matching loss (2.2), defined as where Êλ = {(j, k) : κλ jk = 0, j < k} and γ is typically taken to be 1/2 or 1.Alternatively, we could refit, that is, replace K λ by an unregularized SME computed in the submodel given by constraining all κ jk with (j, k) ∈ Êλ to be zero.In either case, we choose λ to minimize (3.12).

Numerical Experiments
We perform numerical experiments comparing regularized score matching to existing methods when data is simulated from (i) a multivariate normal distribution, (ii) a multivariate truncated normal distribution, and (iii) a distribution with normal conditionals.The comparison is made against three methods for estimation of Gaussian graphical models, namely, glasso, neighborhood selection (both implemented in the R packages huge) and SPACE (in its CON-CORD formulation, with R package gconcord).In addition, we consider the nonparanormal SKEPTIC, which applies glasso to a matrix of rank correlations (Kendall's τ or Spearman's ρ) and can be motivated by a Gaussian copula model (Liu et al., 2012).We utilize the version based on Kendall's τ .Finally, we compare to SPACEJAM (Voorman, Shojaie and Witten, 2014), which is based on additive modeling of conditional means and implemented in the R package spacejam.We conclude this section with brief investigations on the robustness of regularized score matching when data is not generated under the assumed model.All results in this section are based on averaging over 100 independently generated datasets.

Gaussian data
We consider a graph with m = 1000 nodes, composed of 10 connected components, each 100 nodes in size and structured as a 10 × 10 2-D lattice (4 nearest neighbors).Each connected component also features three hubs with node degree 20, randomly selected from the subset of nodes in the component.
We follow a procedure similar to the one from Peng et al. (2009) to convert the adjacency matrix of the graph into a sparse diagonally dominant partial correlation matrix.For each nonzero element of the adjacency matrix, we sample a draw from a uniform distribution on [0.5, 1].Each row of this new matrix is then rescaled by 1.5 times the sum of the absolute values of the off-diagonal entries in the row.We average this matrix with its transpose to ensure symmetry, and set its diagonal elements to 1.This matrix is inverted and converted into a correlation matrix to form Σ * .
Data is then generated from a multivariate normal distribution with mean zero and a covariance matrix Σ * .We choose sample size n = 600 and 1000.The setup agrees with that in Peng et al. (2009), except that the number of nodes has been scaled up.
Figure 2 shows the ROC curves obtained under both sample sizes.Since the truth is Gaussian, we do not report results for SKEPTIC or SPACEJAM.For both sample sizes, the curve for regularized score matching almost perfectly aligns with those for neighborhood selection, SPACE, and glasso.The results indicate that regularized score matching estimators achieves state-of-the-art statistical efficiency in Gaussian models.

Non-negative Gaussian data
Glasso, SPACE, neighborhood selection and SKEPTIC all presume some form of underlying Gaussianity.In this and the next subsection, we demonstrate the application of regularized score matching in scenarios where these assumptions do not hold to highlight the versatility of the proposed appraoch.
Similar to the Gaussian setting, we consider a graph with m = 100 nodes, composed of 10 disconnected subgraphs with equal number of nodes.Using the lower triangular elements adjacency matrix of each 10 node subgraph, we construct ten matrices, where in each matrix, the element is drawn independently to be 0 with probability 0.2, and from a uniform distribution on [0.5, 1] with probability 0.8.The matrices, after symmetrization, are combined into a 100 × 100 block matrix.The diagonal elements are set to a common positive number such that the minimum eigenvalue is 0.1 to form the precision matrix of the pre-truncated normal, K * .
Data was then generated from a truncated centered multivariate normal, left-truncated at 0 and with Σ * = (K * ) −1 as normal covariance.We used the Gibbs sampler from the tmvtnorm package in R with a burnin period of 100 samples.We thinned out the remaining samples, keeping one in ten.The sample size n is taken to be either 2500 or 5000.The need for a larger sample size is explained by our theoretical findings in Section 6, specifically Corollary 2.
The ROC curves are shown in Figure 3, where regularized score matching outperforms all competitors considered.The closest competitor to regularized score matching are SKEPTIC The color to method correspondence is as follows: regularized score matching ( ), neighborhood selection ( ), glasso ( ), and SPACE ( ).The curves are almost perfectly aligned.
and SPACEJAM, both of which, objectively, perform well, being capable of capturing some of the non-Gaussianity in the data.
We emphasize that here score matching was applied in its non-negative version from Section 2.2.The basic score matching procedure from Section 2.1 is far less efficient based on experiments not reported here.

Normal conditionals
Next, we take the data-generating distribution to have a density from the class where B = {β jk } is a symmetric matrix with diagonal entries 0. This family is a special case of the distributions with normal conditionals from Example 3. We consider the case m = 625, with the graph being a 25 × 25 2-D lattice (4 nearest neighbors).The true interaction matrix B * is constructed by multiplying the adjacency matrix by −1/25.The coefficients for the terms x 2 j are all set equal to −1 and those for the x j all equal to 8/50, which makes the marginal distributions deviate noticeably from Gaussianity.Data can be generated by Gibbs sampling using the Gaussian full conditionals.We discard the first 100 samples and thin out the remaining samples, keeping one in ten, as in Section 4.2.
We plot the ROC curves for conditional normal data in Figure 4. Regularized score matching outperforms its competitors by a clear margin.This is not surprising, as both glasso and SPACE are derived under normality.A Gaussian copula model as underlying SKEPTIC is of little help.SPACEJAM does best among the competitors but cannot fully extract the available signal about the edge structure as the conditional means are non-additive and the conditional variances are not constant.

A robustness check
It is of interest to see how score matching performs when the data-generating mechanism is misspecified.We consider two scenarios.First, we apply the Gaussian score matching to a contaminated Gaussian setting similar to that explored in Finegold and Drton (2011).That is, a random subset of Gaussian observations is replaced with Gaussian noise.In the second example, we investigate the performance of the regularized Gaussian score matching when the observations are not Gaussian but rather drawn from a multivariate t-distribution.

Contaminated Gaussians
We mimic the setup used in the numerical experiments in Finegold and Drton (2011), who consider these settings to test the robustness of their tlasso.Fixing m = 200, we construct a sparse precision matrix K * according to the following steps: (1) choose each (strictly) lower triangular element of K * to be independently -1, 0, 1 with probability 0.01, 0.98 and 0.01 respectively, (2) symmetrize the matrix (3) for each row, i.e. for j = 1, . . ., m, set κ * jj = 1 + κ * j,−j 0 where κ * j,−j refers to the jth row of K * with the diagonal element in that row removed.To strengthen partial correlations, the diagonal elements are scaled down by a common positive factor such that the minimum eigenvalue of the resulting matrix is approximately 0.6 (close to 0.62 in our setup).The covariance matrix Σ * is obtained by inverting K * .
We generate either n = 150 or n = 200 observations from a multivariate normal distribution with mean zero and a covariance matrix Σ * .We then corrupt 2% of the observations, substituting them with i.i.d.N (0, 0.2) draws.The corrupted observations cannot easily be differentiated from normal observations, and this elevates the difficulty of the estimation problem.
We present the ROC curves in Figure 5. Interestingly, score matching performs reasonably well, on par with SKEPTIC and neighborhood selection.For both sample sizes, the differences, which are subtle, are most apparent in the regime where the number of false positives detected is small: score matching falls slightly short of neighborhood selection, but it also appears to slightly outperform SKEPTIC.Surprisingly, there is a clear margin of difference between the performances of regularized score matching and SPACE, the former outperforming the latter, despite their noted structural similarities.Glasso, which utilizes the full Gaussian likelihood, performs the worst.Overall, we conclude that regularized score matching is competitively robust when compared to its alternatives in the contaminated Gaussian setting.

Multivariate t-distributed observations
In this section, we apply regularized Gaussian score matching to observations arising from a multivariate t-distribution with mean 0 and covariance matrix Σ * .This corresponds to testing the robustness of regularized score matching under model misspecification.Like in the previous section, we consider the case when m = 200.To set up Σ * , we construct a m × m adjacency matrix based on an Erdős-Rényi graph with the probability of drawing an edge between any two arbitrary nodes set to 0.01.We then convert the adjacency matrix into Σ * using the same procedure as in Section 4.1.Samples were drawn from a multivariate t-distribution with covariance matrix Σ * and three degrees of freedom.
The ROC curves are plotted in Figure 6 for n = 100 and n = 150.As expected, SKEP-TIC outperforms all others, owing to its flexibility to accommodate outliers, as previously demonstrated in Liu et al. (2012).In fact, for elliptical distributions, such as the multivariate t-distribution, Kendall's τ allows for consistent estimation of Σ * , so SKEPTIC should perform optimally (Liu, Han and Zhang, 2012).Nonetheless, regularized score matching is reasonably robust under this setting: its performance is comparable to that of SPACEJAM -only falling slightly short -SPACE, and neighborhood selection.Again, glasso yields the poorest results.), glasso ( ), SPACE ( ), SKEPTIC ( ), and SPACEJAM ( ).

Application to RNAseq Data
The American Cancer Society estimates that in 2015 there will be 220,800 new cases of prostate cancer and 27,540 deaths.To understand how the cancer develops, as well as how it may be treated, it is necessary to decipher the genetic machinery which drives it.Since cancer is such a complex disease, it is insufficient to study a single gene at a time, as genes may interact with one another in many ways.Graphical modeling of gene expression data has the potential to aid in discovery of such interactions.RNAseq data from next-generation sequencing technology can be used to identify genes that are activated/transcribed or suppressed at the time of measurement.However, RNAseq data are non-negative and have skewed marginals, which presents a challenge for existing methodologies.Graphical models based on truncated Gaussian models are interesting alternatives to existing approaches that primarily consist of applying Gaussian methods after transformations.Whether truncation models are truly useful scientifically deserves a fuller exploration; here we simply illustrate how different estimates can be obtained from the proposed methodology.
Our case study is based on the RNAseq data from 487 prostate adenocarcinoma samples available in The Cancer Genome Atlas dataset.We focus on 350 genes that belong to "known" cancer pathways in the Kyoto Encyclopedia of Genes and Genomes.Removing genes with more than 10% missing values, we obtained a dataset with m = 333 genes.Remaining missing values were simply set to zero, adding to the challenge.(We will comment on the issue of missing data in the discussion.)In illustration of the regularized score matching methodology, we consider an exponential family of truncated normal distributions with density This generalizes the family of distributions considered in Example 2 by allowing the truncated normal distribution to have nonzero mean.18 The color to method correspondence is as follows: regularized score matching ( ), neighborhood selection ( ), glasso ( ), SPACE ( ), SKEPTIC ( ), and SPACEJAM ( ).
We compare regularized non-negative score matching, SPACE (using CONCORD formulation), glasso, SKEPTIC and SPACEJAM.We apply SPACE and glasso directly to the standardized data.We do not consider any marginal transformations as they are naturally accounted for when comparing to the rank correlation-based SKEPTIC.For each method, we tune the regularization parameter λ in order to obtain |E| = 333 (or 334) edges.Figure 7 depicts the estimated networks, with isolated nodes removed, in layouts optimized for each graph.To allow for easier comparison, we also show the estimated networks in fixed layouts in Figure 8. Node degree distributions are plotted in Figure 9.
By visual inspection, glasso and SKEPTIC give similar topologies, which can be explained by the fact that both are derived from the full Gaussian likelihood.Interestingly, we observe that SPACEJAM and SPACE likewise yield similar graphs, which reinforces findings from Shojaie and Sedaghat (2016).Regularized non-negative score matching yields a graph that is fairly different from the rest.
While the usefulness of these models remains to be further explored, our case study demonstrates that regularized score matching can provide estimates that differ in interesting ways to the estimates generated by other methods.We compile a list of the top ten most highly connected genes in each of the estimated graphs in Table 1 (some lists have more than ten genes due to ties), as there is strong evidence that highly connected nodes play important roles in biological networks (Carter et al., 2004;Jeong et al., 2001;Han et al., 2004).There are slight overlaps between the lists.Upon further inspection, we observe that six of the ten genes listed under regularized score matching have been previously linked to prostate cancer, five of which have not been identified by the competing methods: • CCNE2 (cyclin E2): a protein which is required for transition of the G 1 to S phase of the cell cycle, which determines cell division.Regulated by PTEN, a tumor suppressor, it is over-expressed in metastatic prostate tumor cells (Wu et al., 2009).• BRCA2 (breast cancer 2): mutations in the BRCA2 gene have been associated with early-onset prostate cancer in men; men carrying mutations have a predisposition to more aggressive phenotypes (Gayther et al., 2000;Mitra et al., 2008;Tryggvadóttir et al., 2007;Fan et al., 2006).
• BIRC5 (survivin): a protein which prevents cell death, or apoptosis, and regulates cell division.Heightened expression has been found to be associated with higher final Gleason score, i.e., more aggressive cancer and worse prognosis (Kishi et al., 2004;Shariat et al., 2004).• SKP2 (S-phase kinase-associated protein 2, E3 ubiquitin protein ligase): a positive regulator of the G 1 to S phase of the cell cycle, which determines cell division.SKP2 labelling frequency in cancer was positively correlated with the Gleason score, and shown to be a significant predictor of reduced recurrence-free survival time after radical prostatectomy (Yang et al., 2002;Wang et al., 2008).It has been proposed elsewhere as a promising therapeutic target for prostate cancer (Wang et al., 2012).• STAT5B (signal transducer and activator of transcription 5B): a transcription factor that encourages metastatic behavior of human prostate cancer cells.Its inhibition has been shown to induce apoptosis in human prostate cancer cells (Gu et al., 2010;Ahonen et al., 2003;Moser et al., 2012).Furthermore, via the Kolmogorov-Smirnov test, we fail to reject the hypothesis that the degrees of the nodes for the regularized score matching graph estimate follow a power law distribution, with significance level of 0.05.On the other hand, we reject this hypothesis for all other generated estimates at the same significance level.There is evidence that genetic networks are 'scale-free', which implies that their degree distribution can be approximated by a power law distribution (Albert, 2005;Barabási and Albert, 1999;Jeong et al., 2001).In this aspect, the topology of regularized score matching estimate is most similar to the hypothesized structure of gene networks.
Finally, we would like to emphasize that we do not intend to claim that regularized score matching provides the best estimate of the underlying gene network, as the truth is unknown to us.What we can posit is that truncated Gaussian may be a useful model that provides potentially valid targets for therapy which may be missed by other methods.

Theory
This section establishes high-dimensional model selection consistency (sparsistency) of regularized score matching.We focus on pairwise interaction models as in (2.18), although our results could be extended to more general models.Theorem 1 below identifies general deterministic conditions on data that yield sparsistency of regularized (non-negative) score matching.Two subsequent corollaries make probabilistic statements about sparsistency in the Gaussian and the non-negative Gaussian case.Proofs are given in Section 7. Experiments that corroborate the theoretical findings are shown in Appendix C. Before stating the main results, we describe a key assumption for model selection consistency of 1 -penalized estimators, the irrepresentability assumption, and highlight differences between various estimators of Gaussian graphical models with respect to this assumption.
and g * to be the expected values of Γ and g.The support of θ * , that is, is the edge set of the true conditional independence graph.Similarly, determines the graph inferred by regularized score matching.Finally, we write d for the maximum degree of the m nodes of the conditional independence graph.In other words, d is the maximum number of nonzero off-diagonal entries in any row (or column) of Θ * .

Irrepresentability
We say that the irrepresentability (or mutual incoherence) condition holds with incoherence parameter α if the following assumption holds.
Assumption 1.There exists an α ∈ (0, 1] such that Irrepresentability conditions play a key role in the analysis of 1 regularization techniques (Bühlmann and van de Geer, 2011).For neighborhood selection in Gaussian graphical models, it has been formulated in terms of the covariance matrix Σ * (Meinshausen and Bühlmann, 2006).In the theoretical analysis of the glasso, the constraint is placed on the Hessian of the log-determinant of the precision matrix K * , i.e., (K * ) −1 ⊗ (K * ) −1 (Ravikumar et al., 2011).
Meinshausen showed that for samples drawn from N (0, Σ), glasso can consistently recover G only if ρ ≤ 3/2−1 ≈ 0.23.For neighborhood selection, the corresponding necessary condition is ρ ≤ 0.5.If these conditions fail, then for large sample size, the probability of erroneously including the edge (1, 4), i.e., P (κ 14 = 0) can be shown to be at least 0.5.It turns out that for regularized score matching, the analogous necessary condition gives a bound that falls in between 0.23 and 0.5, specifically, ρ ≤ √ 2 − 1 ≈ 0.41.We observe that glasso, which yields positive definite estimates, requires the most stringent condition.When working with symmetric matrices as in regularized score matching, the condition is markedly relaxed.Allowing non-symmetric matrices in neighborhood selection leads to further relaxation of the condition.Interestingly, the pseudo-likelihood methods classified under SPACE have the same necessary condition as score matching.

Main Results
We define such that the KKT conditions from (3.7) can be written as Theorem 1. Assume that Γ * SS is invertible and the irrepresentability condition holds with incoherence parameter α ∈ (0, 1] (Assumption 1).Furthermore, assume that ) then the following statements hold: (a) The rSME θ is unique, has its support included in the true support ( Ŝ ⊆ S), and satisfies then Ŝ = S and sign( θjk ) = sign(θ * jk ) for all (j, k) ∈ S.
Theorem 1 imposes deterministic conditions on the data, namely, the bounds in (6.6).In the following corollaries, we will consider specific distributional assumptions and impose population conditions that imply bounds of the form (6.6) with high probability.
First, we provide a result for regularized score matching for the Gaussian case (Example 1), which has Γ = I m×m ⊗ W with W being the sample covariance matrix, and g = vec(I m×m ).When the data is generated from a normal distribution with covariance matrix Σ * then Γ = I m×m ⊗ Σ * and, of course, g * = g = vec(I m×m ).
Take any τ 1 > 2. If the sample size satisfies and the regularization parameter is then the following statements hold with probability 1 − 1/m τ1−2 : (a) The rSME K from (3.3) is unique, has its support included in the true support ( Ŝ ⊆ S), and satisfies then Ŝ = S and sign( Kjk ) = sign(κ * jk ) for all (j, k) ∈ S. The corollary is proven in Appendix 7.2.Numerical experiments reported in Appendix C suggest that the sample size n indeed needs to scale at least Ω(d 2 log m) for sparsistency.
From Theorem 1, we can also derive an analogous result for regularized non-negative score matching for the truncated Gaussian case (Example 2).The result requires the sample size to be larger than in the Gaussian case, due to the need to control higher order moments.Recall that here, Γ(x) a block diagonal m 2 × m 2 matrix, with the jth block given by and g = 2w + w diag , where w = vec(W) and w diag = vec(diag(W)).
Corollary 2. Suppose the data is generated from a non-negative Gaussian distribution with parameter K * , i.e., N (0, (K * ) −1 ) is truncated to R m + .Suppose further that Γ * SS is invertible and irrepresentability holds for α ∈ (0, 1].Let where L > 0 is an absolute constant.Take any τ 2 > 3.If the sample size satisfies n > c * * c 2 2 d 2 (log m τ2 + log 2) 8 , (6.10) and the regularization parameter is (6.11) then the following statements hold with probability 1 − 1 m τ 2 −3 : (a) The rSME K+ based on penalizing (2.26) with λ K 1,off is unique, has its support included in the true support ( Ŝ ⊆ S), and satisfies then Ŝ = S and sign(( K+ ) jk ) = sign(κ * jk ) for all (j, k) ∈ S. The proof of the corollary, which is given in Section 7.3, uses general tail bounds that apply to log-concave measures.The lower bound for n given in (6.10) could well be suboptimal and a lower power of log m may be sufficient for sparsistency.However, the experiments in Appendix C suggest that the exponent for log m cannot be taken too much smaller than 8.
We also compared the lower bound we obtained for the non-negative Gaussian case to a result implied by the work of Yang et al. (2013) who treat consistency of neighborhood selection in a general framework that allows node-wise conditional distributions to arise from exponential families.Interestingly, when working out what their general theorem would say about the above non-negative Gaussian model we found that the sample size n would also be required to be at least Ω(d 2 (log m) 8 ).Our result from Corollary 2 is thus at least comparable to existing results in the literature.

Proof of Theorem 1
First, we note that claim (b) is an immediate consequence of claim (a).To show (a), we apply the primal-dual witness method (PDW) from Wainwright (2009).As explained in detail below, PDW entails construction of a pair ( θ, z), with θ ∈ R m 2 and z ∈ ∂ θ 1 , that satisfies the KKT optimality conditions from (6.5) and has the support of θ included in S. If the construction is successful then it ensures that the rSME problem admits a unique solution such that the rSME θ is equal to θ and inherits all the properties the latter has by definition.These properties include the ∞ bound on estimation error in addition to the claim about the support.
Replacing Γ by Γ * and g by g * in the empirical (basic or non-negative) score matching loss recovers the population loss which, in the present exponential family context, is quadratic and minimized when θ = θ * .(Recall that the score matching loss is consistent.)It follows that r 3 from (6.4) is zero as it is the gradient of the population loss.In block form, (6.5) becomes We construct the PDW pair ( θ, z) according to the following steps: (i) Take θ to be the unique solution to the support-restricted problem, that is, θ = arg min (iii) Solving (7.1), set 3) (iv) Check the strict dual feasibility condition that zS c ∞ < 1. (7.4) By step (i), θ has support contained in S. By step (iii), ( θ, z) is guaranteed to fulfill the equations from (7.1).By step (ii), the S-coordinates of z satisfy 'their part' of the subgradient condition.Thus, if the strict dual feasibility from step (iv) holds, then ( θ, z) satisfies the KKT conditions from (6.5).Having a strict inequality in (7.4) ensures that every solution to the original rSME problem has support contained in the true support S and since Γ * SS is assumed invertible, there is then only one solution (Wainwright, 2009, Lemma 1).The invertibility of Γ * SS is also what guarantees the uniqueness in step (i).
If the PDW construction is successful, that is, if the strict dual feasibility condition can be established, then we may conclude the rSME θ possesses all the desired properties.Indeed, θ equals θ which has these properties by construction.
Let ∆ = θ − θ * , where θ is the solution to the support-restricted regularized score matching problem from (7.2).By definition, ∆ ∞ = ∆S ∞ .Furthermore, by step (iii) in the PDW construction, By Assumption 1, and the triangle inequality for the ∞ norm, where the equality in the second to last line follows from the fact that θ * S c = 0. We observe that where is an m 2 × m 3 matrix whose diagonal blocks are given by the rows of the the interaction matrix Θ * , each row being replicated m times.Moreover, vec(R 1,blocks ) refers to the vectorization of the m diagonal blocks of R 1 that are each of size m × m; recall Lemma 2.More precisely, if R 1,1 , . . ., R 1,m are the diagonal blocks of R 1 , then vec(R 1,blocks ) is obtained by concatenating vec(R 1,1 ), . . ., vec(R 1,m ) in that order.Equation (7.6) is the only argument relying on the block-diagonality of Γ and R 1 .From (7.6), we obtain that By the assumption that r 2 ∞ < 2 , we have and it remains to similarly bound G 2 .We treat |||R 1,•S ||| ∞ and ∆S ∞ separately.We note that the rows of R 1,•S have at most d non-zero elements.It follows that where the last inequality holds by assumption.Since Γ SS is assumed invertible, we have from the top block of equations in (7.1) that Note that by assumption, Γ SS is invertible.We obtain that which gives us the following bound in the error in the inverse in the matrix ∞ norm, Application of the triangle inequality, along with our definition of c Γ where the last inequality uses the assumption that d 1 ≤ α/6c Γ * .Substituting (7.8) into (7.7), it is straightforward to show that G 2 < α/3.Therefore, G 1 + G 2 + G 3 < α, which yields that zS c < 1.
Along the way we have also proven the second part of the claim.Indeed, from (7.7) and (7.8), we have

Proof of Corollary 1
We need to show that the conditions in Theorem 1, specifically those in (6.6), hold with the claimed probability.Since r 2 = g − g * = vec(I m×m ) − vec(I m×m ) = 0, the second inequality in (6.6) can be trivially satisfied with any 2 > 0. Thus, we only need to show that we can bound R 1 ∞ by some suitable 1 with sufficiently large probability.To do so, we apply a Bernstein-type concentration inequality for the entries of W that is also used by Ravikumar et al. (2011).Lemma B.1 below states the inequality, as given in their paper.The matrix R 1 features only entries in W − Σ * .By taking a union bound over the m 2 entries of W, plugging in our lower bound for n and observing that σ = 1 in the Gaussian case In addition, each row in |||R •S ||| ∞ features at most d entries from the matrix W − Σ * .Hence, it follows from another union bound, and choosing n at least where c * and c 1 are defined in the corollary statement, that Thus, applying Theorem 1 with 1 = c * (log m τ1 + log 4) n shows that our choices for λ and n give the high probability statement in Corollary 1.
When looking back at the proof of Theorem 1, we see that as a consequence of having r 2 = 0, we need only be concerned with bounding terms G 1 and G 2 .We may thus bound G 1 and G 2 each by α/2 instead of α/3 and ignore the G 3 term entirely, as it is 0. This leads us to having c 1 = (4/α)c Γ * , as opposed to the expected (6/α)c Γ * .

Proof of Corollary 2
We proceed as for the proof of Corollary 1 and use concentration results to satisfy the bounds from (6.6) in Theorem 1.However, we now bound R 1 ∞ and r 2 ∞ using concentration inequalities for general log-concave measures (any truncated multivariate normal density is log-concave).
Let X (i) = (X i1 , . . ., X im ) be i.i.d.according to N (0, (K * ) −1 ) with truncation to R m + .Take for all j, k, = 1, . . ., m.By a union bound over no more than 2m 3 events, we have both R 1 ∞ < 1 and r 2 ∞ < 2 with probability at least 1 − 1/m τ2−3 as m → ∞.Applying Theorem 1 with the chosen 1 and 2 thus shows that our choices for λ and n lead to the claim in Corollary 2.

Discussion
This paper proposes the use of regularized score matching for estimation of conditional independence graphs in high dimensions.The focus is on modifying the score matching loss of Hyvärinen (2005) with an 1 penalty to accommodate underlying sparsity, which is in the spirit of popular existing methods such as glasso and neighborhood selection.This said, any other regularization scheme can be considered instead.For instance, the method from Defazio and Caetano (2012) can be applied to encourage hub structure in the inferred graph.
Our study of the Gaussian example of Meinshausen (2008) suggests that 1 -regularized score matching falls in between neighborhood selection and glasso in terms of conditions for required 30 Algorithm 1 Add variable that attains equality to Ŝ. 12: else 13: Remove variable that attains 0 from Ŝ. 14: 16: end while for graph selection consistency.Here, the glasso requires the most stringent conditions, and the score matching approach appears to be similar to pseudo-likelihood methods that work with symmetric estimates of precision matrices, such as SPACE (Peng et al., 2009) and subsequent reformulations such as CONCORD (Khare, Oh and Rajaratnam, 2015).However, regularized score matching is particularly convenient in that the score matching loss is a quadratic function, even for non-Gaussian exponential families.This brings about piecewise linear solution paths and allows for a simple theoretical analysis.We anticipate that the simple structure of score matching will lead to further advances in graphical modeling, such as computationally efficient techniques to deal with corrupted or missing data, in the spirit of Loh and Wainwright (2012), or new methods to tune regularization parameters, as in Chichignoud, Lederer and Wainwright (2014).Regularized score matching is an interesting method for Gaussian models, as we showed empirically and theoretically.In particular, for consistency (under the usual irrepresentability conditions), the sample n must be on the order Ω(d 2 log m), which matches the conditions for the existing methods mentioned above.However, as our simulation study shows, regularized score matching really shines in the context of non-Gaussian models, where it eliminates the need to deal with computationally intractable normalization constants in a way that the loss continues to be a quadratic function of parameters.This opens a lot of new possibilities for graphical modeling such as the truncated normal model we applied to RNAseq data.
Score matching applies to continuous data.While Hyvärinen (2007) discusses a ratio matching method for discrete data, it is not as computationally convenient as its continuous counterpart.A different approach of adding Gaussian noise to discrete data was proposed for imaging problems by Kingma and LeCun (2010).Exploring the merits of their approach for graphical modeling, and supplying supporting theory, would be an interesting problem for future work.sample covariance matrix W, for n i.i.d.samples, satisfies the bound for any fixed choice of two indices 1 ≤ j, k ≤ m and for all δ ∈ (0, 40 max j=1,...,m Σ * jj ).
Lemma B.2 (Carbery and Wright, 2001).Let X be a Banach space, and let f : R m → X be a polynomial of degree at most z.Suppose 0 < ζ 1 ≤ ζ 2 < ∞ and µ is a log-concave probability measure on R m .Then f (x) ζ2/z dµ(x) where L > 0 is an absolute constant.
From this lemma we may derive the following concentration result.After proving the lemma, we comment on how it is used in the proof of Corollary 2.
Lemma B.3.Consider a degree z polynomial f (X) = f (X 1 , . . ., X m ), where X 1 , . . ., X m are possibly dependent random variables with log-concave joint distribution on R m .Let L > 0 be the constant from Lemma B.2.Then, for all δ such that Hence, by Markov's inequality, for any δ satisfying (B.3), and the proof is complete.
In the proof of Corollary 2, we apply Lemma B.3 with δ = 1 from (7.9) and with δ = 2 from (7.10).It thus needs to be checked that condition (B.3) holds in these two cases.Indeed, the condition holds as long as m ≥ exp 2 √ e − log 2 τ 2 .(B.9) To see this, we substitute 1 and 2 for δ in (B.3), take z = 4 and 2 respectively, to find a term that is lower bounded by (τ 2 log m + log 2)/e 2 .Here, the 1/ √ n factor in 1 and 2 cancels out with the 1/ √ n term generated by the Var[f (X)] term in the denominator.(Recall that in our scenario f (X) is an empirical average).The more stringent condition on m comes from 2 and is stated in (B.9).Thus, if (B.9) holds, (B.3) is satisfied.Since τ 2 > 3, the right-hand side of (B.9) never exceeds exp 1 3 (2 √ e − log 2) < 3.
Hence, in our application of Lemma B.3, the condition from (B.3) holds for m ≥ 3.

Appendix C: Experiments
We perform experiments, similar to those found in related work, that give empirical support for Corollary 1.This corollary treats Gaussian graphical models for which the sample size n ought to be of order d 2 log m.We experiment by varying the number of variables m, the degree d, and the minimum signal strength.Following Ravikumar et al. (2011), we define the 'model complexity' to be In addition, we investigate how the sample size n required for sparsistency for non-negative Gaussian graphical models needs to depend on m.All reported results are based on averaging over 100 trials.

C.1. Gaussian experiments
We conduct our experiments using three graph structures: (a) a chain, (b) a 2-D lattice with 4 nearest neighbors, and (c) a star.We consider (a) and (b) when varying the number of variables m, in which case we vary the length of the chain and the number of nodes in the lattice.This keeps the degree d constant.The effect that d has on the sample complexity is investigated using stars.We let the regularization parameter λ scale with log m/n, a choice corroborated by Corollary 1.

Dependence on number of nodes
Consider first the case where the underlying conditional independence graph is a chain of length m ∈ {64, 100, 225, 375}.The degree d is always 2, and we choose the tridiagonal precision matrix K * to have entries κ * jk = 0.3 if (j, k) ∈ E and κ * jj = 1 for j = 1, . . .m. Here, α, c K * and c Γ * are constant across all m.
Figure 10 shows the probability of correct signed support recovery plotted against the sample size n, with different curves corresponding to different m.As expected, we see from Figure from (C.1).We plot the probability of correct signed support recovery against n for varying C.
In the resulting Figure 13, the curves shift right as C becomes larger so a larger n is needed to attain the same probability of correct signed support recovery when C grows.This is again consistent with the implications of Corollary 1.We do not believe that the lower bound we found for n is sharp enough in terms of its dependence on α, c K * and c Γ * to determine the rescaling we must perform on n to align the curves.

C.2. Non-negative Gaussian experiments
Finally, we experiment with regularized non-negative score matching for normal observations truncated to the positive orthant.According to Corollary 2, a sample size of n = Ω(d 2 (log m) 8 ) is sufficient for signed support recovery.The aim of our experiments is to explore to what extent this scaling is necessary.Specifically, we will consider exponents other than 8 for log m.
For our experiments, we revisit the chain-structured graphs from Section C.1 and choose a triangular matrix K * with κ * jk = 0.3 if (j, k) ∈ E and and κ * jj = 1 for j = 1, . . .m.The degree d is fixed at 2 and we only vary m ∈ {20, 25, 30}.We let the regularization parameter λ to scale with (log m) 8 /n. Figure 14 plots the probability of correct signed support recovery against n, with different curves for the different values of m.
Panel (a) in Figure 14 illustrates that, larger n is needed account for larger m.The other three panels have the x-axis rescaled to n/(log m) a for exponents a ∈ {6, 7, 8}.Panel (b) suggests that n scaling with (log m) 6 is not sufficient for support recovery.Comparing panels (c) and (d), (log m) 8 seems more than what is necessary.It thus appears that the scaling of the sample size we assumed in Corollary 2 is suboptimal but not drastically so.).
iii) Let a, b ∈ [1, ∞].We denote the a norm of a vector u ∈ R m by and write |||U||| a,b = max x a =1 Ux b for the a / b operator norm of a matrix U ∈ R m×m .We let |||U||| ∞ = |||U||| ∞,∞ and U a = vec(U) a .
Fig 2: ROC curves for the Gaussian case.The dashed grey line represents random selection of edges.The color to method correspondence is as follows: regularized score matching ( ), neighborhood selection (), glasso ( ), and SPACE ( ).The curves are almost perfectly aligned.
Fig 3: ROC curves for the non-negative Gaussian case.The dashed line represents random selection of edges.The color to method correspondence is as follows: regularized score matching (), glasso (), SPACE ( ), SKEPTIC ( ), and SPACEJAM ( ).
Fig 4: ROC curves for the normal conditionals case.The dashed line represents random selection of edges.The color to method correspondence is as follows: regularized score matching (), glasso (), SPACE ( ), SKEPTIC ( ), and SPACEJAM ( ).The curve for glasso overlaps with the curve for SPACE.

Fig 7 :
Fig 7: Topology of inferred networks of |E| = 333 or 334 edges for all considered methods.The layout has been optimized for each graph.Isolated nodes are not shown.Red colored nodes have degree greater or equal to 10.

Fig 8 :
Fig 8: Topology of inferred networks of |E| = 333 or 334 edges for all considered methods.Layout of nodes is fixed across graph estimates and was optimized for the SPACE estimate.Isolated nodes have now been included.Red colored nodes have degree greater or equal to 10.

Fig 9 :
Fig 9: Node degree distributions for inferred networks of |E| = 333 or 334 edges for all considered methods.

Fig 11 :
Fig 11: Relative frequencies of signed support recovery for Gaussian observations whose conditional independence graph is a 4-nearest neighbor lattice with m nodes.Panels (a) and (b) differ only in the scaling of the x-axis.The colored lines correspond to m = 64 ( ), m = 100 ( ), and m = 225 ().

Fig 12 :Fig 13 :Fig 14 :
Fig 12: Relative frequencies of signed support recovery for Gaussian observations whose conditional independence graph is a star with varying degree d.Panels (a) and (b) differ only in the scaling of the x-axis.The colored lines correspond to d = 10 ( ), d = 15 ( ), and d = 20 ( ).