On estimation of the diagonal elements of a sparse precision matrix

In this paper, we present several estimators of the diagonal elements of the inverse of the covariance matrix, called precision matrix, of a sample of iid random vectors. The focus is on high dimensional vectors having a sparse precision matrix. It is now well understood that when the underlying distribution is Gaussian, the columns of the precision matrix can be estimated independently form one another by solving linear regression problems under sparsity constraints. This approach leads to a computationally efficient strategy for estimating the precision matrix that starts by estimating the regression vectors, then estimates the diagonal entries of the precision matrix and, in a final step, combines these estimators for getting estimators of the off-diagonal entries. While the step of estimating the regression vector has been intensively studied over the past decade, the problem of deriving statistically accurate estimators of the diagonal entries has received much less attention. The goal of the present paper is to fill this gap by presenting four estimators---that seem the most natural ones---of the diagonal entries of the precision matrix and then performing a comprehensive empirical evaluation of these estimators. The estimators under consideration are the residual variance, the relaxed maximum likelihood, the symmetry-enforced maximum likelihood and the penalized maximum likelihood. We show, both theoretically and empirically, that when the aforementioned regression vectors are estimated without error, the symmetry-enforced maximum likelihood estimator has the smallest estimation error. However, in a more realistic setting when the regression vector is estimated by a sparsity-favoring computationally efficient method, the qualities of the estimators become relatively comparable with a slight advantage for the residual variance estimator.


Introduction
We consider the problem of precision matrix estimation that has been extensively studied in recent years partly because of its tight relation with the graphical models.More precisely, assuming that we observe p features on n individuals, an interesting object to display is the graph of associations between the features, especially when the number of features is large.The associations may be of different type: linear correlations, partial correlations, measures of independence and so on.A measure of association between the features, which is particularly relevant for Gaussian (Lauritzen, 1996) and, more generally, non paranormal distributions (Liu et al, 2009;Lafferty et al, 2012) is the partial correlation.This leads to a Gaussian graphical model in which two nodes are connected by an edge if the partial correlation between the features corresponding to these two nodes is nonzero, which is equivalent to the nonzeroness of the corresponding entry of the precision matrix (Lauritzen, 1996, Proposition 5.2).The graph constructed in such a way relies on the population precision matrix, which is not available in practice.Therefore, an important statistical problem is to infer this graph from n iid observations of the p-dimensional feature-vector.In view of the aforementioned connection with the precision matrix, the estimated graph may be deduced from the estimated precision matrix by comparing its entries with a suitably chosen threshold.
Another important problem for which the precision matrix estimation is relevant 1 is the linear (Fisher, 1936) or quadratic discriminant analysis (Anderson, 2003).Indeed, the decision boundary in the binary or multi-class classification problem-under the assumption that the conditional distributions of the features given the class are Gaussian-is defined in terms of the precision matrix.In order to infer this decision boundary from data, it is therefore relevant to start with estimating the precision matrix.The simplest way of estimating the latter is by inverting the sample covariance matrix or, if the inverse does not exist, by computing the pseudo-inverse of the sample covariance matrix.However, when the dimension p is such that the number of unknown parameters p(p + 1) is comparable to or larger than the sample-size n, the (pseudo-)inversion of the sample covariance matrix leads to very poor results.To circumvent this shortcoming, additional assumptions on the precision matrix should be imposed which should preferably be realistic, interpretable and lead to statistically and computationally efficient estimation procedures.The sparsity of the precision matrix offers a convenient setting in which these criteria are met.
To present in a more concrete fashion the content of the present work, let X be a n × p random matrix representing the values of p variables observed on n individuals.Assume that the rows of the matrix X are independent and Gaussian with mean µ * and covariance matrix Σ * .The inverse of Σ * , called the precision matrix and denoted by Ω * = (ω * ij ), is an object of central interest since-as mentioned earlier-it encodes the conditional dependencies between pairs of variables given the values of all the other variables.Based on the precision matrix, the graph G * of relationships between the p variables is constructed as follows: each node of the graph represents a variable and two nodes i and j are connected by an edge if and only if ω * ij = 0. Estimating this graph from a sample of size n represented by the rows of X is a challenging statistical problem that has attracted a lot of attention in the past decade.In a frequently encountered situation of the dimension p comparable to or even larger than n, a commonly used assumption is the sparsity of the graph G * .Namely, it is assumed that the maximal degree of the nodes of G * is much smaller than p (see, for instance, Meinshausen and Bühlmann (2006); Yuan and Lin (2007) for early references).
Most approaches of estimating sparse precision matrices that gained popularity in recent years rely on weighted 1 -penalization of the off-diagonal elements of the precision matrix; recent contributions on the statistical aspects of this approach can be found in Yuan (2010); Cai et al (2011); Sun and Zhang (2013); Cai et al (2015) and the references therein.The rationale behind this approach is that the weighted 1 -penalty can be viewed as a convexified version of the 0 -penalty, the latter being understood as the number of nonzero elements.The convexity of the penalty in conjunction with the convexity of the data fidelity term leads to estimators that can be efficiently computed by convex programming (Friedman et al, 2008;Banerjee et al, 2008).
To further improve the computational complexity, it is possible to split the problem of estimating p 2 entries of the precision matrix into p independent problems of estimating the p-dimensional columns of it (Meinshausen and Bühlmann, 2006).To this end, the matrix Ω * is written as B * D * , where D * is a diagonal matrix while B * is a p × p matrix with all diagonal entries equal to one.Each columns of the matrix B * can be estimated by regressing one column of the data matrix X 1 In the case of linear discriminant analysis for binary classification, a simpler approach consisting in replacing the sparsity of the precision matrix by the sparsity of the product of the latter with the difference of the class means has been proposed and studied by Cai and Liu (2011).n error 500 1500 2500 3500 0.01 0.06 0.36 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q RV PML SML RML n error 500 1500 2500 3500 0.05 0.1 0.2 0.4 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q RV PML SML RML 1500 2000 2500 3000 3500 n error 0.06 0.08 0.1 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q RV PML SML RML B coincides with B * B estimated by SqRL+OLS B estimated by SqRL+OLS: zoom Fig. 1 The average 2 -error (computed from 50 independent trials) of the four estimators considered in this work as a function of the sample size.The plots concern Model 2 described in Section 4.1 and dimension p = 60.One can observe, in particular, that when B * is estimated without error (left panel), the estimators SML and PML improve on the residual variance and relaxed maximum likelihood estimators.
on all the remaining columns.In the context of high dimensionality and sparse precision matrix, this can be performed by sparsity favoring methods (Bühlmann and van de Geer, 2011) such as the Lasso (Tibshirani, 1996), the Dantzig selector (Candes and Tao, 2007), the square-root Lasso (Belloni et al, 2011), etc.A crucial observation at this stage is that the sparsity patterns, i.e., the locations of nonzero entries, of the matrices B * and Ω * coincide.In particular, the degree of the j-th node in the graph G * is equal to the number of nonzero entries of the j-th column of B * , for every j = 1, . . ., p.
Once the columns of B * successfully estimated, one needs to estimate the diagonal matrix D * , the diagonal entries of which coincide with those of the precision matrix Ω * .This step is necessary for recovering the precision matrix (both diagonal and nondiagonal entries) but it is also important for constructing the graph2 of conditional dependencies.Of course, the latter can be estimated by thresholding the entries of the estimator of B * without resorting to an estimator of D * , but the choice of the threshold is in this case a difficult issue deprived of clear statistical interpretation.In contrast with this, if along with an estimator of B * , an estimator of D * is available, then one may straightforwardly estimate the partial correlations and threshold them to infer the graph of conditional dependencies.In this case, the threshold has a more clear statistical meaning since the partial correlations are in absolute value bounded by one.
It follows from the above discussion that the problem of estimating the matrix D * built from the diagonal entries of the precision matrix is an important ingredient of the estimation of the precision matrix and the graph of conditional dependencies between the features.The purpose of the present work is to propose several natural estimators of D * and to study their statistical properties, essentially from an empirical point of view.Combining standard arguments, we present four estimators, termed residual variance (RV), relaxed maximum likelihood (RML), symmetryenforced maximum likelihood (SML) and penalized maximum likelihood (PML).The first one, residual variance, is the most commonly used estimator when the matrix B * is estimated columnwise by a sparse linear regression approach briefly mentioned in the foregoing discussion.The other three methods considered in this paper are based on the principle of likelihood maximization under various approaches for handling the prior information.In order to give the reader a foretaste of the content of next sections, we present in Figure 1 the accuracy of the four methods of estimating the diagonal elements of the precision matrix on a synthetic data-set.More details are given in Section 4.1.

Notation and preliminaries on precision matrix estimation
This section introduces the main notation used throughout the paper and presents some preliminary material on sparse precision matrix estimation.

Notation
For an unknown parameter θ we note θ * its true value.As usual, N p (µ * , Σ * ) is the Gaussian distribution in R p with mean µ * and covariance matrix Σ * .The expectation of a random vector X is denoted by E(X) and its covariance matrix by V(X).We denote by 1 n the vector from R n with all the entries equal to 1 and by I n the n × n identity matrix.We write 1 for the indicator function, which is equal to 1 if the considered condition is satisfied and 0 otherwise.The smallest integer greater than or equal to x ∈ R is denoted by x .In what follows, [p] := {1, . . ., p} is the set of positive integers from 1 to p.For i ∈ [p], the complement of the singleton {i} in [p] is denoted by i c .For a vector v ∈ R p , D v stands for the p × p diagonal matrix satisfying (D v ) j = v j for every j ∈ [p].The matrix build keeping only the diagonal of a square matrix M is denoted by diag(M).
The transpose of the matrix M is denoted by M .If this matrix is square, we note |M| its determinant.For a n×p matrix M, the vector of the elements of the kth row (resp.the jth column) whose indices are given by the subset J of [p] (resp.K of [n]) is denoted by M k,J (resp.M K,j ).In particular, the vector made of all the elements of the jth column of the matrix M at the exception of the element of the kth row is given by M k c ,j .Moreover, the whole kth row (resp.jth column) of M is denoted by M k,• (resp.M •,j ).We use the following notation for the (pseudo-)norms of matrices: if q 1 , q 2 > 0, then . With this notation, M 2,2 and M 1,1 are the Frobenius and the element-wise 1 -norm of M, respectively.The sample covariance matrix of the data points {X k,• } k∈[n] is defined by where µ is either the sample mean 1 n (1 n X) (when the mean µ * is unknown) or the theoretical mean µ * (when it is considered as known).

Preliminaries
Throughout the paper we will present estimators of the diagonal elements of the precision matrix in the case of a general multidimensional Gaussian distribution, but in all theoretical developments we will assume that the marginals of X are standard Gaussian distributions, i.e., µ * = 0 and Σ * jj = 1 for every j ∈ [p].This assumption is reasonable, since we are concerned with the problems in which the sample size is large enough to consistently estimate the individual means and the individual variances of the variables.So, one can always center the variables by the sample mean and divide by the sample standard deviation to get close to the assumption3 that random variables X 1,j , . . ., X n,j are i.i.d.N (0, 1) for every j.
Let us recall that the precision matrix is closely related to the problem of regression of one feature on all the others.Indeed, there exists a p × p matrix B * and two vectors c * , where ξ j is drawn from N n (0, I n ) and is independent of X •,j c .According to the theorem on normal correlations (Marsaglia, 1964), the regression coefficients B * j c ,j ∈ R p−1 and the variance φ * j ∈ R of residuals can be expressed in terms of the elements of the precision matrix Ω * as follows: If we assume that µ * = 0 then c * j = 0 for any j.With these notation, the precision matrix can be written as Ω * = B * D −1 φ * .Several state-of-the-art methods for estimating sparse precision matrices proceed in two steps (Meinshausen and Bühlmann, 2006;Cai et al, 2011;Sun and Zhang, 2013).The first step consists in estimating the matrix B * and the vector φ * by solving the sparse linear regression problems (1) for each j, while in the second step an estimator of the matrix Ω * is inferred from the estimators of B * and φ * using relations (NC).The goal of the present work is to explore both theoretically and empirically different possible strategies for this second step.
The square-root Lasso is perhaps the method of estimating the matrix B * that offers the best trade-off between the computational and the statistical complexities.It can be redefined as follows: scaled Lasso estimates the matrix B * by solving the convex optimization problem where the first min is over all matrices B having all their diagonal entries equal to 1.The tuning parameter λ > 0 corresponds to the penalty level.The purpose of the penalization is indeed to get a precision matrix estimate which fits the sparsity assumption.As the penalty of a matrix B is its • 1,1 norm, the resulting precision matrix estimator is expected to be sparse in the sense that its overall number of non-zero elements should be small.In addition, one can check that computing a solution to problem (2) is equivalent to computing each column of B separately (and independently) by solving the optimization problem In addition to being efficiently computable even for large p, this estimator has the following appealing property that makes it preferable, for instance, to the column-wise Lasso (Meinshausen and Bühlmann, 2006) and the CLIME (Cai et al, 2011).The choice of the parameter λ in (2-3) is scale free: it can be chosen independently of the noise variance in linear regression (1).This fact has been first established by Belloni et al (2011) and then further investigated in (Sun and Zhang, 2012;Belloni et al, 2014a).In the context of precision matrix estimation, this method has been explored4 by Sun and Zhang (2013).
3 Four estimators of φ * As mentioned earlier, the aim of this work is to compare different estimators of the vector φ * based on an initial estimator of the matrix B * .Clearly, the error of the estimation of B * impacts the error of the estimation of φ * and, therefore, the latter is not easy to assess in full generality.In order to gain some insight on the behavior of various natural estimators, in theoretical results we will consider the ideal situation where the matrix B * is estimated without error.

Residual variance estimator
In view of the regression equation presented in (1), a standard and natural method-used, in particular, by the square-root Lasso of Sun and Zhang (2013)-to deduce estimators φ and Ω from an estimator B is to set Note that the matrix (I n − n −1 1 n 1 n ) present in this expression is the orthogonal projector in R n onto the orthogonal complement of the linear subspace Span(1 n ) of all constant vectors.The multiplication by this matrix annihilates the intercept c * j in (1) and is a standard way of reducing the affine regression to the linear regression.In what follows, we refer to φ defined by (4) as the residual variance estimator and denote it by φ RV .Using the sample covariance matrix S n , the residual variance estimator of φ * can be written as Note also that if we consider the linear regression model (1) conditionally to X •,j c , then the residual variance estimator of φ * j coincides with the maximum likelihood estimator.
Proof.Using equation ( 1) and the assumption B •,j = B * •,j , we get Since ξ j is a standard Gaussian vector, the random variable This completes the proof.
Note that in this result, the case of known means µ * j is considered.The case of unknown µ j can be handled similarly, the estimation bias is then φ * j /n and the resulting mean squared error is (2n − 1)φ * j 2 /n 2 .One may observe that, as expected, the rate of convergence of the quadratic risk is the usual parametric rate 1/n and that the asymptotic variance is 2φ * j 2 .

Relaxed maximum likelihood estimator
One could expect that the global maximum likelihood estimator of φ * would be better than the maximum of the conditional likelihood, since it is well known that under proper regularity conditions, the quadratic risk of the maximum likelihood estimator is the smallest, at least asymptotically.Since the vectors X k,• ∼ N p (µ * , Ω * −1 ) are independent, the log-likelihood is given by (up to irrelevant additive terms independent of the unknown parameters µ * and Ω * ) Maximizing the log-likelihood with respect to µ ∈ R p leads to Recall now that in view of (NC), we have Therefore, the profiled log-likelihood (w.r.t.µ) of X given the parameters B and φ is For a given B, this profiled log-likelihood is a decomposable function of φ and, therefore, can be easily maximized with respect to φ.This leads to arg max Thus, when an estimator B of B * is available, one possible approach for estimating φ * is to set We call this estimator relaxed maximum likelihood (RML) estimator.It will be clear a little bit later why it is called relaxed.The analysis of the risk of the RML estimator is more involved than that of the RV estimator considered in the previous section.This is due to the truncation at the level 0. For this reason, the next result does not provide the precise value of the risk, but just an inequality which is sufficient for our purposes.
Proposition 2. If B estimates B * •,j without error, then the risk of the RML estimator of φ * Before providing the proof of this result, let us present a brief discussion.Note that in view of (1), Σ * jj is always not smaller than φ * j .Furthermore, Σ * jj > φ * j if B * j c ,j has at least one nonzero entry.Therefore, the last proposition, combined with Proposition 1, establishes that the residual variance estimator has an asymptotic variance which is smaller (and, in many cases, strictly smaller) than the asymptotic variance of the maximum likelihood estimator.At a first sight, this is very surprising and seems to be in contradiction with the well established theory (Le Cam and Yang, 2000;Ibragimov and Has minskiȋ, 1981) of asymptotic efficiency of the maximum likelihood estimator for regular models.Our explanation of this inefficiency of φ RML j is that it is not really the maximum likelihood estimator.It maximizes the likelihood, certainly, but not over the correct set of parameters.Indeed, when we defined the RML we neglected an important property of the vector φ * : the fact that Ignoring this constraint allowed us to get a tractable optimization problem but caused the loss of the (asymptotic) efficiency of the estimator.This also explains why we call φ RML relaxed maximum likelihood estimator.
Proof of Proposition 2. Since µ * is assumed to be known and equal to zero, according to (1), we have Denoting Since, in addition, for different ks the random variables X k,j c B * j c ,j are independent, centered and Gaussian, we get that-in view of the independence of ξ j and X •,j c -the conditional distribution of η 1 given ξ j is Gaussian with zero mean and variance ξ j This relation readily implies that E[(S n B) jj ] = φ * j and Furthermore, for the fourth moment, we have To analyze the truncated estimator, we set We have already computed the first expectation in the right-hand side, as well as upper bounded the second one.Let us show that the probability of the event ζ ≤ 0 goes to zero as n increases to ∞.This follows from the Tchebychev inequality, since This completes the proof of the proposition.

MLE taking into account the symmetry constraints
As we have seen in previous sections, the relaxed maximum likelihood estimator is suboptimal; in particular, it is less accurate than the residual variance estimator.To check that this lack of efficiency is indeed due to the relaxation of the symmetry constraints, we propose here to analyze the constrained maximum likelihood estimator in the following idealized set-up.We will consider, as in Propositions 1 and 2, that B estimates B * without error, and that5 there is a column in B * such that all the elements of B * •,i are different from zero.Without loss of generality, we suppose that i = 1 and, consequently, for every j ∈ [p], we have B * j1 = 0 which is equivalent to ω * j1 = 0. Therefore, the symmetry constraint This relation entails that in the case of known matrix B * and unknown vector φ * , only the first entry of φ * needs to be estimated, all the remaining entries can be computed using the first one by the formula φ j = (B * 1j /B * j1 ) φ 1 .Proposition 3.Under the assumption that the rows of X are i.i.d.Gaussian vectors with precision matrix Ω * = B * D −1 φ * , the maximum likelihood estimator of φ * is defined by The quadratic risk of this estimator is given by Proof.To ease notation, we denote by D * the diagonal matrix whose jth element is B * j,1 /B * 1,j .Then, applying (8) for a given B * , the profiled Gaussian log-likelihood can be written as The goal is to maximize the right-hand side over all the vectors φ ∈ R p such that B * D −1 φ is a valid precision matrix.
Let us first check that under the conditions of the proposition, for B * D −1 φ to be a valid precision matrix it is necessary and sufficient that φ 1 > 0 and φ j = (B * 1j /B * j1 )φ 1 for every j ∈ [p].The necessary part follows from that fact that a precision matrix is symmetric and positivesemidefinite, which entails that ( To check the sufficient part, we remark that if φ satisfies φ is symmetric and positive-semidefinite, hence a valid precision matrix.The maximum likelihood estimator φ SML is thus given by φ SML ∈ arg min The aforementioned cost function is continuously differentiable and convex, its minimum is attained at the point where the derivative vanishes, which provides φ SML , this leads to (11).To check (12), we start by noting that Using the well-known commutativity property of the trace operator and setting Y = Σ * −1/2 X , we get trace(X XΣ * −1 ) = trace(Y Y).Since X has iid rows drawn from a N p (0, Σ * ) distribution, Y has iid columns drawn from N p (0, I p ) distribution.Hence, the random variable trace jk is distributed according to χ 2 np distribution.This readily implies that φ SML j is an unbiased estimator of φ * j and, therefore, its quadratic risk coincides with its variance and is given by ( 12).
Assuming that there exists i ∈ [p] such that for any j ∈ [p], ω * ij = 0, put differently that the i-th node of the graph G * is connected by an edge to any other node is quite restrictive.Among other implications, it entails that the graph G * is connected which might be a strong assumption.It is therefore useful to adapt what precedes to the case where the graph G * has more than one connected component.The rest of this subsection is devoted to the description of this adaptation.
We note C the set of the connected components of the graph G * .Each connected component c ∈ C is a subset of vertices of G * whose cardinality is denoted by p c .Clearly, the sum of p c over all c ∈ C equals p.For two vertices i and j, we will write i ∼ G * j for indicating that they belong to the same connected component.Thus, each connected component is a class of equivalence with respect to the relation ∼ G * .Let i ∼ G * j be two vertices from c ∈ C and let C ji be a path connecting these two vertices,i.e., C ji is a sequence of q distinct vertices {v 1 , . . ., v q } such that v 1 = j, v q = i, q ≤ p c and each pair (v h , v h+1 ) is connected by an edge in G * .Recall that the symmetry of the precision matrix To ease notation, we introduce the p × p diagonal matrix ∆ * j the diagonal entries of which are defined by where {v 1 , . . ., v q } = C ji is any path connecting j to i in G * .With this notation, φ * j = (∆ * j ) ii φ * i .One can reproduce the arguments of the proof of Proposition 3 to check that the maximum likelihood estimator of φ * , if B * is known (and therefore so is ∆ * j ), is defined by for j belonging to the connected component c.
Comparing the results of Propositions 1, 2 and 3, we observe that the RV-estimator outperforms the RML estimator, but-at least in the case where there is a column in B * which has only nonzero entries-they are both dominated by the maximum likelihood estimator that takes advantage of the symmetry constraints.Furthermore, using the same type of arguments as those of Proposition 3, one can check that if the vertex j of the graph G * belongs to a connected component of cardinal p c then the risk of the MLE in the ideal case of known B * is equal to 2 npc φ * j 2 .This shows that in the ideal case the MLE systematically outperforms the widely used residual variance estimator, and the gain in the risk may be huge for vertices belonging to large connected components.On the other extreme, all the three estimators discussed in the previous section coincide when the matrix B * is diagonal.
In order to apply equation ( 14) for estimating φ * when an estimator B of B * is available, we need to construct an estimator G of the graph G * .We propose here an original approach for deriving G from B. It is based on the observation that , the square of the partial correlation between the i-th and j-th variables.As mentioned earlier, this quantity is always between 0 and 1 and provides a convenient rule of selection for the edges to keep in the graph.More precisely, we connect i to j if the estimated squared partial correlation B ij B ji is larger than a prescribed threshold t ∈ (0, 1).In our implementation, we chose (somewhat arbitrarily) the threshold t = 0.01 ∧ n −1/2 .Note that when B * is replaced by an estimator, the right-hand side of ( 14) is not necessarily invariant with respect to the choice of the path connecting i to j.Therefore, even when B and G are fixed, if G contains loops there are different ways of estimating φ * based on (14) depending on how the paths are chosen.We have tried two possible approaches: the minimum spanning tree and the shortest path tree based on the following weight function6 defined on the edges: Combining these ingredients, we get the algorithm summarized in Algorithm 1.
The rationale behind the foregoing definition of the weights and the use of the minimum spanning tree or shortest path tree algorithm is to favor the paths that are short and contain edges corresponding to large (in absolute value) partial correlations.The aim is to reduce the risk of Algorithm 1 Estimator φ SML based on shortest path trees or minimum spanning trees Input: matrices X and B, threshold t.Output: vector φ SML .1: compute the matrix of weights W. 2: initialize k to 1. repeat 3: choose the node with the largest degree as root.4: compute the shortest path tree (or the minimum spanning tree) T k from the chosen root.5: estimate φ SML 's elements related to T k using Eq. ( 14) 6: remove all the nodes of the tree T k from the initial graph.7: increment k. until graph is empty propagating the estimation error of B. We have implemented both versions of the algorithm and have observed that the version using the minimum spanning tree leads to better results.More details on the implementation and computational complexity are given in the next section.

Penalized maximum likelihood estimation
We have seen that enforcing symmetry constraints is beneficial when the matrix B has a small error, but raises intricate issues related to the graph estimation and, more importantly, path selection in the graph.A workaround to this issue is to replace the hard constraints by a penalty term that measures the degree of violation of the constraints.This provides an intermediate solution between the SML and the RML.More precisely, we propose a penalized maximum likelihood (PML) estimator of φ * defined by where κ > 0 is a tuning parameter responsible for the trade-off between the likelihood and the constraint violation.The choice κ = ∞ corresponds to enforcing the symmetry constraints: its main shortcoming is that the feasible set might very well be empty.On the other extreme, when κ = 0, PML coincides with the RML.The PML estimator also coincides with the previous ones if B is known to be diagonal.Note that the parameter t appearing in the penalty term of the PML plays the same role as the one used in the SML.The definition of the feasible set in the above optimization problem is justified by the fact that we assume all the individual variances of the features to be equal to one.In other terms, the assumption V(X 1,j ) = 1 in (1) implies that φ * j ≤ 1. Making the change of variable v = (1/φ j ) j∈[p] , the optimization problem of Eq. ( 15) becomes convex with the feasible set v ∈ [1 + ∞) p and the objective function: Furthermore, if we restrict the feasible set to v ∈ V = [1, n 1/2 ] p , the problem becomes strongly convex.In addition, on this restricted feasible set the gradient of the objective function is Lipschitzcontinuous.It is possible to use the standard steepest gradient descent algorithm with a fixed step-size for efficiently approximating the solution φ PML .Indeed, in the optimization problem ( 16), if ∇f is Lipschitz-continuous with constant L < ∞ and strongly convex with constant l > 0, the gradient descent algorithm with a constant step-size t = 2/(l + L) converges at a linear rate (see Nesterov (2004) for a detailed proof).Note that the convergence rate depends on L/l which is an upper bound on the condition number of the Hessian matrix ∇ 2 f (v); this ratio should not be too high for the algorithm to converge fast.Unfortunately, the values of l and L that we manage to obtain in our problem are far too loose.That is why we resort to a steepest descent algorithm with adaptive step-size and scaled descent direction −∇f (v h )/ ∇f (v h ) 2 .More details on the implementation are provided in Section 4.2.

Experimental evaluation
In this section, we describe the experimental set-up and report the results of the numerical experiments performed on synthetic data-sets.We also provide detailed explanation of the implementation used for the symmetry-enforced and the penalized maximum-likelihood estimators.A

Experiments on synthetic datasets
We conducted a comprehensive experimental evaluation of the accuracy of different estimates of diagonal elements of the precision matrix.In order to cover as many situations as possible, we used in experiments our six different forms of precision matrices along with various values for n and p.
In each configuration, we considered several methods of estimating the matrix B * .Let us first describe in a precise manner the precision matrices used in our experiments.It is worthwhile to underline here that all the precision matrices are normalized in such a way that all the diagonal entries of the corresponding covariance matrix Σ * = (Ω * ) −1 are equal to one.To this end, we first define a p × p positive semidefinite matrix A and then set Ω * = (diag(A −1 )) . The matrices A used in the six models for which the experiments are carried out are defined as follows.
Model 1: A is a Toeplitz matrix with the entries A ij = 0.6 |i−j| for any i, j ∈ [p].
Model 2: We start by defining a p × p pentadiagonal matrix with the entries Then, we denote by A the matrix with the entries A ij = ( Ā−1 ) ij 1(|i − j| ≤ 2).One can check that the matrix A defined in such a way is positive semidefinite.
Model 3: We set A ij = 0 for all the off-diagonal entries that are neither on the first row nor on the first column of A. The diagonal entries of A are whereas the off-diagonal entries located either on the first row or on the first column are Model 4: We introduce the integer k = √ p and define a sparse k × k matrix Ā so that its only non-zero elements are Ā11 = k and, for any i ∈ [2; k], Āii = 2k and Ā1i = Āi1 = √ 2.Then, we set Model 5: We introduce k = √ p and define a sparse k × k matrix Ā so that its only non-zero elements are Ā11 = 50 and, for any i ∈ [2; k], Āii = 5 and Ā1i = Āi1 = 5/2.Then, similarly to previous model, we set Model 6: We set k = 6, p = k p/k and define the k × k matrix Ā as in model 5 above.Then, we build the p × p block-diagonal matrix A by Note that, in general, the resulting precision matrix in this model is not of size p × p but of size p × p with p = 6 p/6 .However, since in the experiments reported in this section p is always a multiple of 6, we have p = p .
In this experimental evaluation, we compare the performance of the following four estimatorsintroduced in previous sections-of the diagonal elements of the precision matrix: • RV corresponds to the residual variance estimator defined in Section 3.1.
• RML corresponds to the relaxed maximum likelihood estimator described by equation ( 10).
• SML corresponds to the symmetry-enforced maximum likelihood estimator described in Algorithm 1. • PML corresponds to the penalized maximum likelihood estimator described by equation ( 15).Note that all these algorithms need an estimator of the matrix B * to produce an estimator of the diagonal entries of the precision matrix.We conducted experiments in three different scenarios.The first scenario is when the matrix B * is estimated column-by-column by the square-root Lasso, using the penalization parameter λ = √ 2 log p.This value for λ is commonly called the universal choice and has proved to lead to optimal theoretical results and fairly good empirical results (Dalalyan and Chen, 2012;Sun and Zhang, 2012;Dalalyan et al, 2013).The second scenario is when the matrix B * is estimated column-by-column by the ordinary least squares estimator applied to the covariates that correspond to nonzero entries of the square-root Lasso estimator8 with the aforementioned value of λ.Finally, the third scenario is an unrealistic one; it corresponds to the case of a known matrix B * .This scenario is included in the experimental evaluation in order to check the consistency between the theoretical and the empirical results as well as in order to better understand how the error in estimating B * impacts the quality of estimation of the diagonal entries of the precision matrix.
Thus, each configuration of our empirical study corresponds to choosing • a model out of 6 models described above • a dimension p ∈ {30, 60, 90} • a sample size n ∈ {200, 800, 2000} • a method of estimating B * .
In each configuration, we computed the estimators RV, RML, SML and PML for 50 independent datasets.Using these R = 50 replications, we estimate the expected risk of estimating φ * , E( In Tables 1-6, we report these averages along with the standard deviations of the errors measured by 2 -vector norm.All the experiments were conducted in R, using the Mosek solver (see Andersen and Andersen (2000)) for computing the square-root Lasso estimator by second-order cone programming.
In the ideal case when B * is estimated without error (by itself), the empirical results reflect perfectly the theoretical results of the previous sections.The comparison of the performance of the estimators indicates that the maximum likelihood estimators SML and PML are preferable to the residual variance estimator.The maximum likelihood estimator considering symmetry constraints outperforms all the other estimators.However, in practice when B is obtained by the square-root Lasso without any refinement, φ RV outperforms all the other estimators in the vast majority of configurations.Some exceptions can be observed in models 5 and 6 (see the top part of Tables 5  and 6, where RV is slightly worse than the other procedures for small sample sizes (n = 200).It should be, however, acknowledged that the difference of the quality between the estimators in these cases is not large enough to advocate for using RML, SML or PML.
It is interesting to observe what happens when an additional step of estimation of B * using the ordinary least squares on the sparsity pattern provided by the square-root Lasso is performed.The impact of this step is not the same in all the models under consideration.In particular, the quality of estimation is mostly improved for all the four estimators in models 1 and 2. Furthermore, thanks to this variable selection step, the maximum-likelihood-type estimators perform nearly as well as the residual variance estimator RV.In model 3, the variable selection step deteriorates the quality of estimation in most configurations, whereas in models 4-6 this step has almost no consequence on the estimation accuracy.
The graphics of Figure 1 are drawn for Model 2 with p = 60.The left plot corresponds to the estimation error-measured by 2 -vector norm-as a function of the sample size in the scenario B = B * , whereas the central plot corresponds to the same error when B * is estimated by the OLS on the sparsity pattern furnished by the square-root Lasso.The right plot is just a zoom  on the center plot.These plots illustrate the convergence to zero of the error of estimation for the estimators considered in this paper.The speed of convergence in these empirical results, as expected, is nearly n −1/2 for fixed dimension p.

Details on the implementation
Symmetry-enforced maximum likelihood.As we explained earlier, the product structure of the term ∆ * j in (13) may cause the amplification of the estimation error when passing from B to φ.In order to reduce as much as possible this phenomenon, we suggested to choose the path C by minimizing its length.In addition, the fact that some entries of B * appear in the denominator of ∆ * j , make it unsuitable to include in C edges corresponding to small values of B ij .The combination of these two arguments suggests to define edge weights as decreasing functions of B ij and to look for paths that somehow minimize the overall weight defined as the sum of the weights of the edges contained in C .
The two versions of the SML algorithm that have been implemented and tested in this work make use of the minimum spanning tree (MST) and the shortest path tree in the step of determining  the way of computation the elements of φ belonging to a connected component C of the graph G .
A MST of C is a tree that spans C and has the smallest total weight among all the spanning trees of C .The shortest path tree having a given node r as a root is a spanning tree T of C such that for any node j ∈ C the weight of the path from j to r in T is the smallest among the weights of all possible paths from j to r in C .
We have used the Kruskal (Kruskal, 1956) algorithm for finding the MST and the Jarnik-Prim-Dijkstra algorithm (Jarník, 1930;Prim, 1957;Dijkstra, 1959) for the shortest path tree.The worst-case computational complexities of the construction of these trees are the following (Cormen et al, 2009).When the graph G has p nodes and q edges, the Kruskal algorithm runs in O(q log p) time.Its output is a set of MSTs per connected component.The version of the SML based on the shortest path tree requires O(p + q) operations to find the connected components.In a connected component having p c nodes and q c edges, the node of largest degree can be obtained in O(q c ) operations, while the computational complexity of finding the shortest paths from a node to all the others is O(q c log(p c )).Therefore, determining a shortest path tree per connected component has a complexity of O(p + q log(p)), or O(sp log(p)) where s is the maximal degree of a node of G .Thus, the computational complexities of the two versions of the SML estimator are comparable and, at most, of the order O(sp log(p)).
In our experiments, we have also tried9 a third version consisting in computing the shortest path trees from every node of a connected component and then choosing the one with the minimal overall weight, rather than first choosing the root as the node having largest degree.Several other variants have been tested as well, but the simplest version based on choosing the MST has lead to the best empirical results.
Penalized maximum likelihood.As mentioned earlier, the PML estimator is computed by solving the optimization problem (16).We implement a steepest descent algorithm with adaptive stepsize and scaled descent direction −∇f (v h )/ ∇f (v h ) 2 .At each iteration, one common adaptation for every coordinate of the descent direction is performed.If the objective function increases, the current iteration is done again with a halved step-size.On the opposite, if the objective function decreases, the step-size is increased by a constant factor for the next iteration.Mathematically speaking, the update operations for our gradient descent algorithm are where the descent direction is u h = −∇f (v h )/ ∇f (v h ) 2 and t h is the step-size.Thanks to the convexity, the convergence of this algorithm is guaranteed for any starting point v 0 .The step-size is updated at each iteration according to the following rule: The multiplicative factors we use for adaptive step-size are those propose by Riedmiller and Braun (1992) for the Rprop algorithm.We stop iterating when the gradient magnitude measured in the 2 -norm is below a certain level (10 −5 in our experiments) or when the limit of 5000 iterations is attained.
For the choice of the tuning parameter κ, we did a cross-validation by choosing a geometric grid over the values of κ ranging from 1/p to √ p.The results, for Models 2 and 4, are plotted in Fig. 2 and 3, respectively.We can clearly see that there is a large interval of values of κ for which the error is nearly minimal.Based on this observation, we chose κ = 1 3 √ log p for all the numerical experiments reported in Tables 1-6.

Conclusion
This paper introduces three estimators of the diagonal entries of a sparse precision matrix when n iid copies of a Gaussian vector with this precision matrix are observed.The properties of these estimators are discussed and compared with those of the commonly used residual variance estimator.At a theoretical level, an interesting finding is that the naive maximum likelihood estimator (MLE) that does not take into account the symmetry constraints has a significantly larger risk than the residual variance estimator and, hence, is not optimal even asymptotically.The symmetry-enforced MLE and the penalized MLE circumvent this drawback and are shown in all numerical experiments to outperform the residual variance estimator when the matrix B * is known.Similar but unreported results are obtained when the estimators of the diagonal entries use a noisy matrix B = B * + Ξ, provided the noise matrix Ξ has iid Gaussian entries with zero mean and small variance.However, in a more realistic situation when B * is estimated by the square-root Lasso or by the ordinary least squares conducted over the submodel selected by the square-root Lasso, the accuracies of the four estimators of the diagonal entries become comparable with a slight advantage for the residual variance estimator.
We would like also to mention the introduction of a novel and simple method of estimating partial correlations and of symmetrizing the precision matrix estimator derived from the nonsymmetric matrix B. It is based on the observation that the square of the partial correlation between i-th and j-th variables is equal to B * ij B * ji .In the future, it would be interesting to look for an estimator of B * which is more accurate than the square-root Lasso and could hopefully-in combination with the symmetry-enforced MLE or the penalized MLE-lead to better precision matrix estimate than the one obtained by the association of the square-root Lasso and the residual variance estimator.Another appealing avenue for future research is the investigation of the case when the matrix X is observed with an error.Recent papers (Rosenbaum and Tsybakov, 2013;Belloni et al, 2014b) may provide valuable guidance for accomplishing this task.

Fig. 2
Fig.2The estimation error of the PML as a function of κ.The plots are obtained for the synthetic experiment of Model 2 with various values of p and for n = 200 (left), n = 800 (middle) and n = 2000.Please note that the limits of the y-axis are not the same in the three plots and that the x-axis is presented in logarithmic scale.

Fig. 3
Fig.3The estimation error of the PML as a function of κ.The plots are obtained for the synthetic experiment of Model 4 with various values of p and for n = 200 (left), n = 800 (middle) and n = 2000.Please note that the limits of the y-axis are not the same in the three plots and that the x-axis is presented in logarithmic scale.

Table 1
Performance of the estimators of diagonal elements of the precision matrix in Model 1.The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.
companion R package called SPMDEE (for Sparse Precision Matrix Diagonal Elements Estimation) is created and uploaded on CRAN 7 .

Table 2
Performance of the estimators of diagonal elements of the precision matrix in Model 2. The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.

Table 3
Performance of the estimators of diagonal elements of the precision matrix in Model 3. The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.

Table 4
Performance of the estimators of diagonal elements of the precision matrix in Model 4. The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.

Table 5
Performance of the estimators of diagonal elements of the precision matrix in Model 5.The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.

Table 6
Performance of the estimators of diagonal elements of the precision matrix in Model 6.The number of replications in each case is R = 50.More details on the experimental set-up are presented in Section 4.1.