Partial Correlation Graphical LASSO

Standard likelihood penalties to learn Gaussian graphical models are based on regularising the off-diagonal entries of the precision matrix. Such methods, and their Bayesian counterparts, are not invariant to scalar multiplication of the variables, unless one standardises the observed data to unit sample variances. We show that such standardisation can have a strong effect on inference and introduce a new family of penalties based on partial correlations. We show that the latter, as well as the maximum likelihood, $L_0$ and logarithmic penalties, are scale invariant. We illustrate the use of one such penalty, the partial correlation graphical LASSO, which sets an $L_{1}$ penalty on partial correlations. The associated optimization problem is no longer convex, but it is conditionally convex. We show via simulated examples and two real datasets that, besides achieving scale invariance, the new penalty can deliver important gains in inference.

In Gaussian graphical models, most popular frequentist approaches to sparse estimation of the precision matrix penalise the absolute value of its entries. Gaussian graphical models are invariant to scalar multiplication of the variables; however, it is well known that such penalisation approaches do not share this property. We show that the only scale-invariant strategies, within a large class of precision matrix penalties, are the logarithmic and L0 penalties. It is possible to address this issue via a data preprocessing step that standardises the data to unit sample variances. However, as we illustrate next, this standardisation can adversely affect inference. In this paper we propose a family of methods based on partial correlations and show that they ensure scale invariance without requiring this standardisation step.
As motivation we present a simple example where the goal is to estimate the entries of a p × p precision matrix Θ. We set p = 50 and generate n = 100 independent Gaussian draws with zero mean and covariance Θ^{-1}, where Θ follows the so-called star pattern, with θ_ii = 1, θ_i1 = θ_1i = −1/√p for i = 2, ..., p, and θ_ij = 0 otherwise. This is a setting in which recovering the graphical model is relatively straightforward; see for example Yuan and Lin [2007]. The top left panel in Figure 1 shows the regularisation path of the estimated partial correlations when applying GLASSO [Friedman et al., 2008] to the unstandardised data. For a large range of values of the regularisation parameter ρ the truly zero θ_ij's are completely separated from the non-zeroes. However, the top right panel shows that the quality of the inference suffers when the data are standardised to unit sample variances.
In particular, the true graphical model is not recovered for any ρ. The bottom left panel shows the results obtained by applying a LASSO penalty to the partial correlations, our proposed PC-GLASSO, which as we show is scale invariant. The bottom right panel demonstrates how the estimation accuracy, measured by Kullback-Leibler loss (see Section 7), of GLASSO and two other methods reviewed below suffers in comparison to PC-GLASSO when using standardised data.
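The star-pattern data-generating mechanism above can be reproduced in a few lines of numpy. This is a minimal sketch of the simulation setup only (the GLASSO fit and regularisation path are omitted); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 50, 100
# Star-pattern precision matrix: theta_ii = 1, theta_1i = theta_i1 = -1/sqrt(p).
Theta = np.eye(p)
Theta[0, 1:] = Theta[1:, 0] = -1.0 / np.sqrt(p)

# Theta must be positive definite to be a valid precision matrix.
assert np.linalg.eigvalsh(Theta).min() > 0

# Draw n independent Gaussian samples with mean zero and covariance Theta^{-1}.
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False, bias=True)  # sample covariance (MLE version)
```

A penalised-likelihood method would then be applied to S over a grid of regularisation parameters ρ to trace out the paths shown in Figure 1.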
Lack of invariance is not restricted to the GLASSO, but, as we show later, affects essentially all continuous penalties, as well as standard prior distributions in Bayesian settings.
The paper is organised as follows. Section 1 sets notation and reviews popular classes of likelihood penalties, which we refer to as regular penalty functions, and their Bayesian equivalents, regular prior distributions. Section 2 introduces a class of penalties and prior distributions on partial correlations, and the PC-GLASSO as a particular case. Section 3 shows that the PC-GLASSO, as well as the logarithmic and L0 penalties, are scale invariant, while regular penalty functions are not. Section 4 offers an alternative argument for standardising the data when using regular penalties, related to situations where the likelihood function is exchangeable in two partial correlations, hence one may wish for inference to be exchangeable as well. Section 5 compares the related prior distributions of GLASSO and PC-GLASSO, and Section 6 discusses computational issues for the PC-GLASSO and gives a conditional convexity result. Section 7 shows examples on simulated, gene expression and stock market datasets. We end the paper with a short discussion.

1 Penalised likelihood in Gaussian graphical models

Let X = (X^(1), ..., X^(p)) ∼ N(µ, Σ) be a p-dimensional multivariate Gaussian random vector with unknown mean µ ∈ R^p and p × p positive-definite covariance Σ = (σ_ij)_{1≤i,j≤p}. Suppose we observe n independent samples (X_1, ..., X_n) of X and denote their sample covariance by S. Our goal is to estimate the precision matrix Θ = (θ_ij)_{1≤i,j≤p} = Σ^{-1}.
A common assumption in Gaussian graphical models is that the data generating process is governed by a sparse undirected graph, so that Θ is a sparse matrix with many zero entries, and we have a particular interest in the location of its zero entries. This is due to the equivalence between zero partial covariances and conditional independencies in Gaussian graphical models. The most common frequentist approach to sparse estimation is to maximise a penalised likelihood function of the form

l(Θ | S) − Pen(Θ),    (1)

where l(Θ | S) = (n/2)(log det(Θ) − tr(SΘ)) is the log-likelihood function (up to an additive constant), Pen(Θ) some penalty function and tr(A) the trace of A. Most popular choices (discussed below) consider penalties that are additive and monotone in |θ_ij|, which we refer to as separable penalties, and in particular the subclass of penalties differentiable everywhere other than zero, which we refer to as regular penalties.
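The penalised objective can be sketched in a few lines of numpy. This is a minimal illustration under our own conventions: the function names (`gauss_loglik`, `l1_offdiag_penalty`) and the choice to multiply the penalty by n are assumptions for illustration, not the paper's code.

```python
import numpy as np

def gauss_loglik(Theta, S, n):
    """Gaussian log-likelihood l(Theta | S) = (n/2)(log det Theta - tr(S Theta)),
    up to an additive constant that does not depend on Theta."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return 0.5 * n * (logdet - np.trace(S @ Theta))

def l1_offdiag_penalty(Theta, rho):
    """A separable GLASSO-style penalty: rho * sum_{i<j} |theta_ij|."""
    p = Theta.shape[0]
    return rho * np.abs(Theta[np.triu_indices(p, k=1)]).sum()

S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
n, rho = 100, 0.1
Theta = np.linalg.inv(S)  # unpenalised maximum likelihood estimate
objective = gauss_loglik(Theta, S, n) - n * l1_offdiag_penalty(Theta, rho)
```

A penalised estimator maximises this objective over positive-definite Θ rather than evaluating it at a fixed matrix.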
Definition 1. A penalty function Pen(Θ) is separable if

Pen(Θ) = Σ_{i=1}^p pen_ii(θ_ii) + Σ_{i<j} pen_ij(θ_ij),

where pen_ii : (0, ∞) → R and pen_ij : R → R are non-decreasing in θ_ii and |θ_ij| respectively, for all i and i < j.
A separable penalty is regular if pen_ii = pen_jj for all (i, j) and, for all i < j, pen_ij does not depend on (i, j), is symmetric about 0 and is differentiable away from 0.
Most popular penalty functions used for Gaussian graphical models are regular. The GLASSO is a prominent example: it uses the L1 penalty pen_ij(θ_ij) = ρ|θ_ij| for some given regularisation parameter ρ ≥ 0 to produce the point estimate. See Meinshausen and Bühlmann [2006] for an alternative that places L1 penalties on the full conditional regression of each X^(i) given X^{−(i)}, Banerjee et al. [2008] for computational methods based on parameterising (2) in terms of Σ, and Yuan and Lin [2007] for a variation that omits the diagonal of Θ from the penalty. Other popular regular penalties include the SCAD penalty [Fan and Li, 2001, Fan et al., 2009] and the MCP penalty [Zhang, 2010, Wang et al., 2016], which were proposed to reduce bias in the estimation of large entries of Θ relative to the L1 penalty.

Another notable regular penalty is the L0 penalty pen_ij(θ_ij) = ρ I(θ_ij ≠ 0). The adaptive LASSO [Zhou et al., 2009, Fan et al., 2009] is an important example of a non-regular penalty. It uses an L1 penalty whose weights depend on the data via some initial estimate of Θ, and hence does not satisfy Definition 1. However, as noted by Bühlmann and Meier [2008] and Candès et al. [2008], the adaptive LASSO can be seen as a first-order approximation to the logarithmic penalty pen_ij(θ_ij) = ρ log(|θ_ij|), which is regular. Both papers propose an iterative version of the adaptive LASSO that formally targets this logarithmic penalty.
There is a well-known equivalence between penalised likelihood and maximum a posteriori estimates in Bayesian frameworks. In particular, the estimate under a penalty Pen is equal to the mode of the posterior distribution under the prior density π(Θ) ∝ exp(−Pen(Θ)) I(Θ ∈ S), where S is the set of symmetric, positive definite matrices. With this in mind we define separable and regular prior distributions.
Definition 2. A prior distribution with density π on Θ is separable if

π(Θ) ∝ Π_{i=1}^p π_ii(θ_ii) Π_{i<j} π_ij(θ_ij) I(Θ ∈ S),

where π_ii is a density function with support (0, ∞) and π_ij is a density function with support R, which are non-increasing in θ_ii and |θ_ij| respectively, for all i and i < j.
A separable prior distribution is regular if π_ii = π_jj for all (i, j) and, for all i < j, π_ij does not depend on (i, j), is symmetric about 0 and differentiable away from 0.
The correspondence between penalised likelihoods and prior distributions has been utilised by the Bayesian LASSO regression of Park and Casella [2008] and Hans [2009], and in Gaussian graphical models by Wang [2012] and Khondker et al. [2013]. Of particular interest to this paper, Wang [2012] showed that under the GLASSO prior the marginal prior distribution of the partial correlations does not depend on the regularisation parameter. We explore this further in Section 5. The Bayesian interpretation has also been used to create new penalty functions, for example by Banerjee and Ghosal [2015] and Gan et al. [2018], both of whom set mixture priors on the entries of Θ.

Partial Correlation Graphical LASSO
We propose basing penalties on a reparameterisation of Θ in terms of the (negative) partial correlations

∆_ij = θ_ij / (θ_ii θ_jj)^{1/2} = −Cor(X^(i), X^(j) | X^{−(ij)}),

where X^{−(ij)} denotes the vector X after removing X^(i) and X^(j).
The precision matrix can be decomposed as Θ = θ^{1/2} ∆ θ^{1/2}, where θ = diag(Θ) and ∆ is the matrix with unit diagonal and off-diagonal entries ∆_ij. The penalised likelihood function then becomes

l(θ^{1/2} ∆ θ^{1/2} | S) − Pen(θ, ∆).

We believe that partial correlations are a better measure of dependence than the off-diagonal entries θ_ij, in that they are easier to interpret and invariant to scalar multiplication of the variables. We now introduce a class of additive penalties in this parameterisation, a corresponding prior class, and subsequently state our PC-GLASSO as a particular case.
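The reparameterisation is straightforward to compute. A minimal numpy sketch follows; `pc_decompose` is our own name for this helper, and the example matrix is invented for illustration.

```python
import numpy as np

def pc_decompose(Theta):
    """Decompose Theta = theta^{1/2} Delta theta^{1/2}, where theta = diag(Theta)
    and Delta has unit diagonal; the off-diagonal entries of Delta are the
    (negative) partial correlations Delta_ij = theta_ij / sqrt(theta_ii theta_jj)."""
    d = np.sqrt(np.diag(Theta))
    Delta = Theta / np.outer(d, d)
    return np.diag(Theta), Delta

Theta = np.array([[2.0, -0.5, 0.0],
                  [-0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
theta, Delta = pc_decompose(Theta)

# Reconstruction: Theta == diag(theta)^{1/2} Delta diag(theta)^{1/2}.
recon = np.sqrt(np.diag(theta)) @ Delta @ np.sqrt(np.diag(theta))
assert np.allclose(recon, Theta)
```

Note that ∆ inherits the zero pattern of Θ, so sparsity in the partial correlations corresponds exactly to sparsity in the precision matrix.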
Definition 3. A penalty Pen is partial correlation separable (PC-separable) if it is of the form

Pen(θ, ∆) = Σ_{i=1}^p pen_ii(θ_ii) + Σ_{i<j} pen_ij(∆_ij),

where pen_ii : (0, ∞) → R and pen_ij : [−1, 1] → R are non-decreasing in θ_ii and |∆_ij| respectively, for all i and i < j.
A PC-separable penalty function is symmetric if pen_ii = pen_jj for all (i, j) and, for all i < j, pen_ij does not depend on (i, j) and is symmetric about 0.
Any PC-separable prior can be written as

π(θ, ∆) ∝ Π_{i=1}^p π_ii(θ_ii) Π_{i<j} π_ij(∆_ij) I(∆ ∈ S_1),

where S_1 is the set of symmetric, positive definite matrices with unit diagonal.
PC-GLASSO is the symmetric PC-separable penalty applying the L1 norm to the partial correlations, pen_ij(∆_ij) = ρ|∆_ij|, and a logarithmic penalty to the diagonal, pen_ii(θ_ii) = 2 log(θ_ii). The penalised likelihood function, after removing constants, is given by

(n/2)(Σ_{i=1}^p log(θ_ii) + log det(∆) − tr(S θ^{1/2} ∆ θ^{1/2})) − 2 Σ_{i=1}^p log(θ_ii) − nρ Σ_{i<j} |∆_ij|.    (5)

The logarithmic penalty on the diagonal entries ensures scale invariance of the PC-GLASSO (Section 3). A coefficient of 2 is used since in the univariate p = 1 case this minimises the asymptotic mean squared error of the estimated precision amongst logarithmic penalties (see Appendix B.1). Although many methods use the same penalty forms for diagonal and off-diagonal entries, it seems natural to use different forms, since the former do not aim to induce sparsity. For example, Yuan and Lin [2007] argued for a GLASSO framework in which one does not penalise the diagonal.
As usual, one may calculate the PC-GLASSO estimate for a sequence of regularisation parameters ρ and select the solution maximising some suitable criterion. In Section 7 we use the Bayesian information criterion (BIC), which selects the estimate minimising

−2 l(Θ̂_ρ | S) + log(n) |Ê_ρ|,    (6)

where |Ê_ρ| denotes the number of non-zero off-diagonal entries (i, j), i < j, of Θ̂_ρ. Parameter selection via the BIC has been shown to provide consistent graphical model selection when used with the SCAD and MCP penalties. Other potential criteria that have been explored for GLASSO are cross-validation and the extended Bayesian information criterion (EBIC, Foygel and Drton [2010]), which we also consider in our real data applications. For further discussion see, for example, Vujačić et al. [2015].
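The BIC computation is easy to sketch in numpy. We assume here that the parameter count is the number of nonzero off-diagonal pairs; conventions vary (some counts also include the diagonal), so this is an illustrative choice rather than the paper's exact implementation.

```python
import numpy as np

def bic(Theta_hat, S, n):
    """BIC for a precision-matrix estimate: -2 l(Theta | S) + log(n) * k,
    where k counts the nonzero off-diagonal pairs (i < j) of Theta_hat."""
    sign, logdet = np.linalg.slogdet(Theta_hat)
    loglik = 0.5 * n * (logdet - np.trace(S @ Theta_hat))
    p = Theta_hat.shape[0]
    k = np.count_nonzero(Theta_hat[np.triu_indices(p, k=1)])
    return -2.0 * loglik + np.log(n) * k

# Toy comparison: when S = I, the empty graph fits perfectly and wins.
S = np.eye(2)
n = 100
sparse = np.eye(2)                          # no edges
dense = np.array([[1.0, 0.5], [0.5, 1.0]])  # one edge, worse fit here
```

In practice one evaluates `bic` at the estimate Θ̂_ρ for each ρ on a grid and keeps the minimiser.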
There are some examples of penalty functions for Gaussian graphical models based on partial correlations. Ha and Sun [2014] utilised a ridge penalty. The space method of Peng et al. [2009], similarly to PC-GLASSO, uses an L1 penalty on the partial correlations, but in combination with an objective function other than the log-likelihood. Azose and Raftery [2018] introduced a separable prior on the marginal correlations. They argued that a key benefit of their prior is the ability to specify beliefs about correlations. A similar argument can be made for PC-separable priors, which allow one to specify prior beliefs on partial correlations.

Scale invariance
A key property of graphical models is invariance to scalar multiplication. In the Gaussian case, if we consider the transformation DX for some fixed diagonal p × p matrix D with non-zero diagonal entries, then DX is also Gaussian with precision matrix

Θ_D = D^{−1} Θ D^{−1}.    (7)

In particular, the zero entries of Θ_D are identical to those of Θ.
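The relationship in (7), and the preservation of the zero pattern, can be checked numerically. The precision matrix below is invented for illustration; it has a structural zero pattern that survives the change of units.

```python
import numpy as np

# A precision matrix with structural zeros (variable 4 independent of the rest).
Theta = np.array([[ 2.0, -0.5,  0.0, 0.0],
                  [-0.5,  1.5, -0.3, 0.0],
                  [ 0.0, -0.3,  1.2, 0.0],
                  [ 0.0,  0.0,  0.0, 1.0]])
Sigma = np.linalg.inv(Theta)

D = np.diag([1.0, 2.0, 0.5, 10.0])      # change of units for each variable
Theta_D = np.linalg.inv(D @ Sigma @ D)  # precision of the rescaled variables DX

# (7): Theta_D = D^{-1} Theta D^{-1}, so the zero pattern is unchanged.
Dinv = np.linalg.inv(D)
assert np.allclose(Theta_D, Dinv @ Theta @ Dinv)
assert np.array_equal(np.isclose(Theta_D, 0.0), np.isclose(Theta, 0.0))
```

The same check applied to a penalised estimator, rather than to exact inversion, is precisely what scale invariance (Definition 5) requires.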
We argue that it is desirable for an estimator of Θ to mirror the relationship in (7) under scalar multiplication of the data, a property we call scale invariance. We now show that, among regular penalty functions, only the L0 and logarithmic penalties are scale invariant, whereas symmetric PC-separable penalties are scale invariant under mild conditions. Recall that any estimator can be made scale invariant by standardising the data to unit sample variances prior to obtaining the estimate, but as discussed this affects inference.
We start by defining two notions of scale invariance related to the point estimate and to the recovered graphical structure.
Definition 5. An estimator Θ̂ is scale invariant if, for any sample covariance matrix S and any diagonal p × p matrix D with non-zero diagonal entries,

Θ̂(DSD) = D^{−1} Θ̂(S) D^{−1}.

Θ̂ is selection scale invariant if Θ̂(S) and Θ̂(DSD) have identical zero entries for any such S and D.
Scale invariance ensures that the estimate under the scaled data corresponds to that under the original data as in (7). Meanwhile, selection scale invariance ensures that one recovers the same graphical structure under scalar multiplications. It is clear that scale invariance implies selection scale invariance.
We now present results on the scale invariance of different penalties. Note that the results could equivalently be written in terms of the maximum a posteriori estimate under the corresponding prior distributions. All proofs are in Appendix B.2.
Proposition 1. Let Θ̂ be an estimator based on a regular penalty, and suppose that there exists a sample covariance matrix S such that Θ̂(S) is not a diagonal matrix. Then Θ̂ is scale invariant if and only if pen_ij is either an L0 or logarithmic penalty, and pen_ii is either constant or a logarithmic penalty.
In particular, the GLASSO, SCAD and MCP estimators are not scale invariant. Further, as illustrated in Figure 1, these estimators are not selection scale invariant either. We conjecture that the lack of selection scale invariance holds more widely for regular penalty functions, but settle for the counterexample for these three cases provided by Figure 1.
We present an example to further illustrate how scaling can affect the inferred conditional independence structure. Suppose we observe a given inverse sample covariance matrix S^{−1}. The left panel in Figure 2 shows the associated GLASSO estimates Θ̂_ρ^GLASSO(S). The right panel considers the situation where the data were given on a different scale; specifically, the sample covariance is DSD, where D has diagonal entries 1, 1 and 10, and it provides the estimates D Θ̂_ρ^GLASSO(DSD) D. The estimates set to zero, as well as their relative magnitudes, differ significantly depending on the scale of the data. We observed similar results for the SCAD and MCP penalties (not shown, for brevity).
As shown in Proposition 1, the only scale-invariant regular penalties are the L0 and logarithmic penalties, both of which are also PC-separable. In fact, scale invariance holds more widely among PC-separable penalties, from which it follows that PC-GLASSO is scale invariant.
Proposition 2. Any estimator based on a symmetric PC-separable penalty is scale invariant, provided that pen_ii is either constant or a logarithmic penalty.

An analogous statement holds for posterior distributions: under a suitable symmetric PC-separable prior,

π(Θ ∈ A | S) = π(Θ ∈ D^{−1} A D^{−1} | DSD)    (8)

for all measurable sets A, where D^{−1} A D^{−1} = {D^{−1} Θ D^{−1} : Θ ∈ A}. In particular, (8) implies that the two posterior distributions on the partial correlations ∆ are equal up to appropriate sign changes; i.e., when D has all positive entries, π(∆ | S) = π(∆ | DSD) (since the ∆ associated to Θ is equal to that associated to DΘD).
Proposition 3. Any symmetric PC-separable prior distribution with π_ii(θ_ii) ∝ θ_ii^{−c} for some constant c ≥ 0 leads to scale-invariant posterior inference.

Exchangeable inference
We now discuss an alternative view on the desirability of standardising the data when using regular penalties, based on notions of exchangeable inference. The simplest situation occurs when the likelihood function is exchangeable in two or more ∆_ij's, for example when two rows of the sample correlation matrix R = diag(S)^{−1/2} S diag(S)^{−1/2} are equal (up to the necessary index permutations).
In such a situation the likelihood provides the same information on these ∆_ij's, hence it seems desirable to obtain the same inference for all of them. If the log-likelihood is exchangeable in some parameters, then any symmetric PC-separable penalty or prior trivially leads to exchangeable inference on those parameters. Yet, as illustrated in our example below, regular penalties can lead to significantly different inference (unless one standardises the data).
The left panel of Figure 3 shows the GLASSO path for the partial correlations. The estimate for ∆_13 is fairly different from those for ∆_12 and ∆_14, and so is the range of ρ's for which they are set to 0.
Note, however, that the estimates for the remaining ∆_ij's are close to 0. To address this issue, one may note that the diagonal of S is not equal to 1. Indeed, if one standardises the data, so that the sample covariance is equal to R, one obtains the center panel of Figure 3. We remark that the notion can be extended to conditional exchangeability, i.e. the likelihood being symmetric in (∆_ij, ∆_kl) given the remaining parameters in ∆ and θ. For example, the likelihood is conditionally exchangeable in (∆_ij, ∆_ik) when the sample covariances and precisions are related by the same constant, i.e. S_ij = c S_ik and θ_kk^{1/2} = c θ_jj^{1/2} for some c > 0, and the partial correlations with other variables are equal, i.e. ∆_jl = ∆_kl for all l ∉ {i, j, k}. See Appendix B.3 for additional information and supplementary results. Conditional exchangeability would be relevant in situations where two variables (j, k) have the same estimated partial correlations with all other variables (e.g. zero), as well as the same sample covariances with a third variable i. In such situations, one may wish for equal inference, in particular equal point estimates ∆̂_ij = ∆̂_ik.

GLASSO and PC-GLASSO prior distributions
In this section we provide further insight into the shrinkage induced by GLASSO and PC-GLASSO, by comparing their implied prior distributions in a Bayesian framework.
The GLASSO prior [Wang, 2012] can be written as

π_G(Θ) ∝ exp(−λ Σ_{i≤j} |θ_ij|) I(Θ ∈ S),

where λ = nρ, whereas the PC-GLASSO prior is given by

π_PG(Θ) ∝ exp(−λ Σ_{i<j} |∆_ij|) Π_{i=1}^p θ_ii^{−2} I(Θ ∈ S).

To illustrate the effect of increasing the parameter λ for fixed p (see Wang [2012] for results on growing p for fixed λ), we sampled from each prior via rejection sampling for λ = 1, 2 and 4. Figure 4 plots the densities of ∆_12 and θ_11. The top left panel verifies the claim of Wang [2012] that the GLASSO marginal prior π_G(∆_ij) does not depend on λ, whereas the bottom panel shows that π_G(θ_ii) is shrunk towards 0 as λ increases. In contrast, the PC-GLASSO prior on partial correlations π_PG(∆_ij) (top right panel) concentrates around zero as λ grows, while the marginal priors on the diagonal entries can also be derived. This demonstrates a fundamental difference in how GLASSO and PC-GLASSO induce sparsity: PC-GLASSO achieves sparsity through regularisation of the partial correlations, while GLASSO does so by shrinking the diagonal θ_ii.

Computation
An important feature of GLASSO is that it defines a convex optimisation problem, which significantly facilitates both computation and theoretical study. For example, Friedman et al. [2008] related GLASSO to a sequence of LASSO problems; see also Sustik and Calderhead [2012] for improved algorithms.
Computation for non-convex penalties such as SCAD and MCP poses a harder challenge, but the Local Linear Approximation of Zou and Li [2008] greatly facilitates this task; see also Fan et al. [2009]. The PC-GLASSO optimisation problem is non-convex; however, it is conditionally convex, a property we exploit via a coordinate descent algorithm. Provided the starting point is close to the optimum, the algorithm typically converges in a few iterations. To exploit this observation, when considering a sequence of penalty parameters we used the estimate (θ̂, ∆̂) associated to ρ_i as the starting point for the problem associated to ρ_{i+1}. For ρ_0 = 0 the algorithm is initialised at S^{−1}, or at (S + αI)^{−1}, where I is the identity matrix, if n < p. The matrix S + αI is guaranteed to be invertible and positive definite for any α > 0.
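The n < p initialisation can be checked numerically. This sketch only illustrates why the ridge term makes the starting point well defined; α = 0.1 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 20                      # fewer observations than variables
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False, bias=True)

# With n < p the sample covariance is rank deficient, hence singular.
assert np.linalg.matrix_rank(S) < p

alpha = 0.1
S_reg = S + alpha * np.eye(p)      # adding alpha*I shifts every eigenvalue up by alpha
assert np.linalg.eigvalsh(S_reg).min() > 0   # positive definite, hence invertible

Theta0 = np.linalg.inv(S_reg)      # a valid positive-definite starting point
```

For the warm-start strategy described above, `Theta0` would seed the solve at the smallest ρ, and each solution then seeds the next ρ on the grid.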

Applications
We now assess the performance of PC-GLASSO against GLASSO, SCAD and MCP, setting the regularisation parameters via the BIC in (6). SCAD and MCP have an additional regularisation parameter, which we set to the defaults proposed in Fan and Li [2001] and Zhang [2010], respectively.
For all methods we standardised the data to unit sample variances and rescaled the estimates via (7).
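The standardise-then-rescale recipe can be sketched as follows, with the unpenalised maximum likelihood estimator standing in for a penalised one; `fit_on_correlation` is our own helper name, not the paper's code.

```python
import numpy as np

def fit_on_correlation(S, estimator):
    """Standardise to the sample correlation R = D^{-1} S D^{-1} with D = diag(sd),
    estimate the precision on that scale, then map back via (7): the standardised
    precision is D Theta D, so Theta = D^{-1} Theta_std D^{-1}."""
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)
    Theta_std = estimator(R)
    Dinv = np.diag(1.0 / d)
    return Dinv @ Theta_std @ Dinv

# Sanity check with the (unpenalised) MLE inv(.), which is scale invariant:
# the rescaled estimate then equals inv(S) exactly.
S = np.array([[4.0, 1.2],
              [1.2, 9.0]])
Theta = fit_on_correlation(S, np.linalg.inv)
assert np.allclose(Theta, np.linalg.inv(S))
```

With a non-scale-invariant `estimator` (e.g. a GLASSO solver), the output depends on having standardised, which is exactly the phenomenon studied in this paper.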
GLASSO was implemented using the R package glasso and SCAD and MCP using the package GGMncv (see Williams [2020]).
Our primary interest is comparing PC-GLASSO with GLASSO, as they are directly comparable in the sense of using the same L1 penalty structure. We consider SCAD and MCP as benchmarks designed to ameliorate the estimation bias associated with the L1 penalty. Although not considered here for brevity, it would also be interesting to study the use of SCAD and MCP penalties on partial correlations.

Simulations
We considered four simulation scenarios with Gaussian data, with true mean zero and precision matrix Θ with unit diagonal and off-diagonal entries as follows.
Scenario 1: Star graph - as in the motivating example of the introduction, θ_i1 = θ_1i = −1/√p for i = 2, ..., p, and θ_ij = 0 otherwise.
Scenario 2: Hub graph - partition the variables into 4 groups of equal size, each group associated to a 'hub' variable i. For any j ≠ i in the same group as i we set θ_ij = θ_ji = −2/√p, and θ_ij = 0 otherwise.

Gene expression data
We assessed the predictive performance of the four penalised likelihood methods on the gene expression data of Calon et al. [2012]. The data contain 262 observations of p = 173 genes related to colon cancer progression. We took n = 200 of the samples as training data, left the remaining 62 observations as test data, and assessed the predictive accuracy of each method by evaluating the log-likelihood on the test data.

Stock market data
We analysed the stock market data in the R package huge, investigated in the graphical model context by Banerjee and Ghosal [2015]. The data contain daily closing stock prices of companies in the S&P 500 index between 1st January 2003 and 1st January 2008. We consider de-trended stock market log-returns, to study the dependence structure after accounting for the overall mean market behaviour. Specifically, let Y_jt be the closing price of company j at time t, X̃_jt = log(Y_jt) the log-returns, and X_jt = X̃_jt − X̄_t the de-trended returns, where X̄_t = (1/p) Σ_{j=1}^p X̃_jt. We randomly selected p = 30 companies and, to avoid issues with stock market data exhibiting thicker tails than the assumed Gaussian model, we removed outlying observations more than 5 sample standard deviations away from the mean in any of the p variables. There remained 1,121 observations, of which we randomly selected 1,000 for the training data and 121 for the test data.
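The de-trending step is a one-liner in numpy. The toy prices below are invented for illustration; `detrended_log_returns` is our own name.

```python
import numpy as np

def detrended_log_returns(Y):
    """Y: (p, T) matrix of closing prices Y_jt.
    Compute X~_jt = log Y_jt and X_jt = X~_jt - Xbar_t, where
    Xbar_t = (1/p) sum_j X~_jt removes the cross-sectional market mean
    at each time point t, as in the text."""
    Xt = np.log(Y)
    return Xt - Xt.mean(axis=0, keepdims=True)

Y = np.array([[100.0, 101.0, 99.0],
              [50.0, 52.0, 51.0]])
X = detrended_log_returns(Y)
```

By construction, at every time point the de-trended values average to zero across companies, so X captures only deviations from the common market movement.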
Figure 7 (right) shows the results, which highlight interesting trade-offs between sparsity and predictive accuracy. PC-GLASSO selected a smaller model than GLASSO under both BIC and EBIC, and achieved a higher log-likelihood on the test data for any model with < 200 edges, whereas GLASSO attained a higher log-likelihood at its selected model. Interestingly, the SCAD and MCP penalties provided accuracy similar to PC-GLASSO, albeit slightly higher for models with < 150 edges and slightly lower for larger models.

Discussion
Penalised likelihood methods based on regular penalty functions are a staple of Gaussian graphical model selection and precision matrix estimation. They provide a conceptually simple strategy to obtain sparse estimates of Θ and, particularly in the case of GLASSO, fairly efficient computation, even for moderately large dimensions. However, in this paper we demonstrated that estimates obtained from regular penalties depend on the scale of the variables. Thus a simple change of units (measuring a distance in miles rather than kilometres) can result in different graphical model selection. Further, we showed that notions of exchangeability also motivate the need for standardising the data when using regular penalties.
Standardising the data is not innocuous. First, even when the variables follow a Gaussian distribution, that is no longer the case for the scaled variables, which exhibit thicker tails. Second, as demonstrated in several of our examples, applying regular penalties to scaled data can adversely affect inference. This effect was particularly detrimental in examples where the true underlying graph has a large range of node degrees, as in the Star graph setting.
A wide class of PC-separable penalties, including the PC-GLASSO, overcome these issues: they are scale invariant and do not require standardisation. Using a Bayesian viewpoint, we illustrated that PC-GLASSO induces a different type of shrinkage than standard penalties, in that it shrinks the partial correlations, whereas standard penalties do not. Our examples showed that such differential shrinkage can offer significant improvements both in estimation and in model selection.
A limitation of our work lies in the computation. While the efficiency of the coordinate descent algorithm is reasonable in lower dimensions, the computations become impractical for larger p. However, the conditional convexity of the PC-GLASSO problem opens interesting strategies for future improvements.
Further interesting future work is to investigate the theoretical properties of PC-GLASSO, for example model selection consistency, which holds for GLASSO only under certain non-trivial conditions [Ravikumar et al., 2009]. The wider set of PC-separable penalties also warrants further exploration, most obviously PC-separable versions of the SCAD and MCP penalties. On the Bayesian side, a PC-separable version of the spike-and-slab penalty of Gan et al. [2018] may also be of interest. Beyond the Gaussian case, penalisation of partial correlations also seems natural for partial correlation graphs in elliptical and transelliptical distributions; see Rossell and Zwiernik [2020].
Although no guarantees are made about the convergence of Algorithm S2, results in Patrascu and Necoara [2015] and Wright [2015] suggest that convergence towards a local maximum is guaranteed and give reasonable assurance of convergence towards the global maximum. Their results focus on coordinate descent algorithms that cycle randomly through the indices with replacement, and so are not directly applicable to Algorithm S2. However, we prefer cycling through the indices without replacement, since this provides a simpler and clearer stopping rule for the algorithm. Algorithm S2 assesses convergence after updating each entry of ∆ exactly once, so that the stopping rule at the end of each iteration is made on the same grounds. For an algorithm which selects indices with replacement it is less clear when to enact the stopping rule.
As a final note about Algorithm S2, Step 2, which maximises (5) with respect to (∆_ij, θ_ii, θ_jj) whilst all other variables are held fixed, is non-trivial due to the non-smoothness of the objective function.
The remainder of this section focuses on solving this maximisation problem. To ease notation, write x = ∆_ij and let f(x) denote the objective as a function of x, which contains a term of the form log(ax² + bx + c). This term comes from log det(∆), since the determinant of a symmetric matrix is quadratic in any symmetric pair of off-diagonal entries. The coefficients (a, b, c) do not have a simple closed form, as they depend on the matrix determinant, but they can easily be obtained by evaluating the determinant of ∆ for three different values of ∆_ij (faster methods for computing these determinants are possible, since they only involve changing a single entry) and solving the resulting linear system of equations. The range of values that x can take is given by (l, u) := {x : ax² + bx + c > 0}. Any value of x in this set ensures positive definiteness of ∆. This is because ∆ is positive definite if and only if all its leading principal minors are positive. Without loss of generality, letting ∆_ij be in the bottom row of ∆, if the previous estimate is positive definite then the first p − 1 leading principal minors are positive, and the condition ax² + bx + c > 0 ensures that the final leading principal minor, det(∆), is also positive. The maximisation problem can then be expressed as max_{x ∈ (l,u)} f(x). To solve this problem we consider separately the cases c > 0 and c ≤ 0.
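Recovering (a, b, c) from three determinant evaluations is a small linear-algebra exercise, sketched below in numpy; `quadratic_det_coeffs` is our own name, and the matrix is invented for illustration.

```python
import numpy as np

def quadratic_det_coeffs(Delta, i, j):
    """det(Delta), viewed as a function of the symmetric entry x = Delta_ij = Delta_ji,
    is a quadratic a*x^2 + b*x + c. Recover (a, b, c) by evaluating the determinant
    at three values of x and solving the resulting 3x3 linear system."""
    xs = np.array([-1.0, 0.0, 1.0])
    dets = []
    for x in xs:
        M = Delta.copy()
        M[i, j] = M[j, i] = x
        dets.append(np.linalg.det(M))
    V = np.vander(xs, 3)  # rows [x^2, x, 1]
    return np.linalg.solve(V, np.array(dets))

Delta = np.array([[ 1.0, 0.2, -0.1],
                  [ 0.2, 1.0,  0.0],
                  [-0.1, 0.0,  1.0]])
a, b, c = quadratic_det_coeffs(Delta, 1, 2)

# Check at a fourth value of x: the quadratic reproduces the determinant.
M = Delta.copy(); M[1, 2] = M[2, 1] = 0.5
assert np.isclose(a * 0.25 + b * 0.5 + c, np.linalg.det(M))
```

Given (a, b, c) with a < 0, the feasible interval (l, u) is simply the region between the two roots of ax² + bx + c.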
We begin with the case c > 0, which implies that 0 ∈ (l, u). We split the problem into three parts, finding local maxima at x = 0, in x ∈ (0, u) and in x ∈ (l, 0) separately, and then selecting from these the global maximum.
The constraint x < u results in a condition on a quartic in y_1, which we refer to as q(y_1). We first summarise the range of y_1 values that needs to be considered, depending on the values of (c_12, c_2), and subsequently outline their derivation. If the positive root is taken in (11) for y_2, then constraints 1-6 are required. The negative root in (11) must only be considered if c_2 < 0 and y_1 < −c_1 (also implying that c_1 < 0 and, from constraint 1, c_12 > 0); in this case the inequalities in constraints 5 and 6 must be reversed.
We now outline how to obtain the above constraints. The constraint x > 0 along with (10) restricts the sign of y_1 relative to c_12: if c_12 > 0 then the range of values to consider can be restricted accordingly, giving constraint 1, while if c_12 < 0 then the inequality is reversed, giving constraint 2. (If c_12 = 0 the optimisation problem is simpler, and we omit the details of this case.) For y_2 to take a real value in (11) we must have 4y_1² + 4c_1 y_1 + c_2² ≥ 0, giving constraint 3.
Combining the constraint y_2 > 0 with (11), if c_2 > 0 then we need y_1 ≥ −c_1 in order for there to be a solution for y_2, giving constraint 4. On the other hand, if c_2 < 0 and 0 < y_1 < −c_1 then there are two solutions for y_2, and one must consider both the positive and negative roots in (11); in all other situations one must only consider the positive root. Finally, combining the constraint x < u with (10) and (11) yields an inequality from which constraints 5 and 6 follow.
Combining these constraints gives the range of possible values of y_1 in which to numerically search for a stationary point. Once y_1 is found, (11) and (10) give the corresponding (x, y_2). Note that it is possible that there are no stationary points within x > 0.

B Proofs
In this section we present the proofs for each of the results in this paper as well as some supplementary results.

B.1 Mean squared error of logarithmic penalty
This section addresses the claim of Section 2 regarding the mean squared error of logarithmic penalties in the p = 1 case. Specifically, we show that amongst penalty functions of the form c log(x) for a constant c ≥ 0 on the precision, choosing c = 2 asymptotically minimises the mean squared error of the estimate of the precision.
Suppose we have n observations of X ∼ N(µ, θ^{−1}) with sample variance s. Up to additive constants, the log-likelihood is l(θ | s) = (n/2)(log(θ) − sθ). Consider estimating θ via a penalised likelihood of the form

(n/2)(log(θ) − sθ) − c log(θ).

Setting the derivative (n/2 − c)/θ − ns/2 equal to zero shows that this is maximised at

θ̂ = (n − 2c)/(ns).

The mean squared error E(θ̂ − θ)² can then be computed in closed form from the sampling distribution of s, and it can be shown that it is minimised at c = 2n/(n−1). Letting n → ∞, we get that the MSE is asymptotically minimised amongst logarithmic penalties by taking c = 2.

B.2 Proofs for Section 3
Proof of Proposition 1. Let S be a sample covariance matrix for which Θ̂(S) is not diagonal, and let D be a diagonal matrix with non-zero diagonal entries d_i, i = 1, ..., p. Suppose that Θ̂ is scale invariant. Let θ̂_ij = Θ̂(S)_ij be some non-zero off-diagonal entry of Θ̂(S), and θ̃_ij = Θ̂(DSD)_ij the corresponding entry of Θ̂(DSD). By scale invariance we must have θ̃_ij = θ̂_ij/(d_i d_j).
For these to maximise their corresponding penalised likelihoods, the derivatives of the penalised likelihood function (1) with respect to $\theta_{ij}$ must be equal to 0 at $\hat\theta_{ij}$ and $\tilde\theta_{ij}$ respectively (note that the derivatives exist because $\mathrm{Pen}$ is regular and $\hat\theta_{ij} \neq 0$, $\tilde\theta_{ij} \neq 0$). Since $\Theta$ is scale invariant we have $\Theta(DSD) = D^{-1}\Theta(S)D^{-1}$, and equating the two stationarity conditions gives
$$\mathrm{pen}_{ij}'(\tilde\theta_{ij}) = \mathrm{pen}_{ij}'\!\left(\frac{\hat\theta_{ij}}{d_i d_j}\right) = d_i d_j\, \mathrm{pen}_{ij}'(\hat\theta_{ij}).$$
That is, for scale invariance to hold the penalty must satisfy $\mathrm{pen}_{ij}'(\hat\theta_{ij}/d) = d\, \mathrm{pen}_{ij}'(\hat\theta_{ij})$ for all $d = d_i d_j \neq 0$. The latter requirement can only hold in two scenarios. First, there is the trivial scenario where $\mathrm{pen}_{ij}'(\hat\theta_{ij}) = 0$, so that $\mathrm{pen}_{ij}$ is constant away from zero. Otherwise, write $k = \mathrm{pen}_{ij}'(\hat\theta_{ij})$; treating $\hat\theta_{ij}$, and therefore also $k$, as fixed, we denote $x = \hat\theta_{ij}/d$. Then we have $\mathrm{pen}_{ij}'(x) = \hat\theta_{ij} k / x$. It follows that $\mathrm{pen}_{ij}(x) = \hat\theta_{ij} k \log(|x|) + c$ for some constant $c$ and $x \neq 0$, that is, $\mathrm{pen}_{ij}$ is a logarithmic penalty.
This proves that for a regular penalty to be scale invariant it must have an $L_0$ or logarithmic $\mathrm{pen}_{ij}$. We now turn our attention to the diagonal penalty.
Let $S$ be some diagonal covariance matrix, and $D$ some diagonal matrix as before. Let $\hat\theta_{ii} = \Theta(S)_{ii}$ and $\tilde\theta_{ii} = \Theta(DSD)_{ii}$. By scale invariance we must have $\hat\theta_{ii} = \tilde\theta_{ii} d_i^2$. Since $S$ is diagonal, it is easy to see that both $\Theta(S)$ and $\Theta(DSD)$ must also be diagonal, and that $\hat\theta_{ii}$ maximises a function of $\theta_{ii}$ and $S_{ii}$ alone, while $\tilde\theta_{ii}$ maximises the same function but with $S_{ii}$ replaced by $d_i^2 S_{ii}$. It follows that the corresponding derivatives must both be equal to zero at $\hat\theta_{ii}$ and $\tilde\theta_{ii}$ respectively ($\mathrm{Pen}$ is regular so $\mathrm{pen}_{ii}$ is differentiable). Using this along with $\hat\theta_{ii} = \tilde\theta_{ii} d_i^2$ and arguing as before, it follows that $\mathrm{pen}_{ii}$ must be either constant or logarithmic. This proves that for a regular penalty function to be scale invariant it must have either a constant or logarithmic penalty on the diagonal entries.
To complete the proof we must show that such penalty functions ($L_0$ or logarithmic off-diagonal penalty and constant or logarithmic diagonal penalty) are always scale invariant. This follows from Proposition 2, since the $L_0$ and logarithmic penalties are also symmetric PC-separable.
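The scale-invariance property being characterised here is easy to check numerically in the unpenalised case, where $\Theta(S) = S^{-1}$: rescaling the variables by $D$ maps the estimate to $D^{-1}\Theta(S)D^{-1}$. A minimal sketch with NumPy (the data and $D$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))       # n = 100 draws of p = 4 variables
S = np.cov(X, rowvar=False)             # sample covariance matrix
D = np.diag([2.0, 0.5, -3.0, 1.5])      # arbitrary non-zero rescaling

theta_S = np.linalg.inv(S)              # MLE of the precision matrix
theta_DSD = np.linalg.inv(D @ S @ D)    # MLE after rescaling the variables

# Scale invariance of the MLE: Theta(DSD) = D^{-1} Theta(S) D^{-1}
lhs = theta_DSD
rhs = np.linalg.inv(D) @ theta_S @ np.linalg.inv(D)
```

The identity holds exactly (up to floating point) because matrix inversion commutes with congruence by an invertible diagonal matrix; for penalised estimators it holds only for the scale-invariant penalties characterised above.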
Proof of Proposition 2. Let $S$ be a sample covariance matrix and $D$ be a diagonal matrix with non-zero entries $d_i$. Suppose that the estimate $\Theta(S)$ decomposes as $\hat\theta^{1/2} \hat\Delta \hat\theta^{1/2}$ and that the estimate $\Theta(DSD)$ decomposes as $\tilde\theta^{1/2} \tilde\Delta \tilde\theta^{1/2}$. To prove scale invariance we need that $\tilde\Delta = \mathrm{sign}(D)\, \hat\Delta\, \mathrm{sign}(D)$ and $\hat\theta = D^2 \tilde\theta$.
Since $\Theta(S)$ maximises the penalised likelihood at $S$, the pair $(\hat\theta, \hat\Delta)$ must maximise the penalised likelihood as a function of $(\theta, \Delta)$ given $S$, written as (14), and similarly $(\tilde\theta, \tilde\Delta)$ must maximise the corresponding function given $DSD$, written as (15). By substituting $\hat\theta_{ii} = d_i^2 \tilde\theta_{ii}$ and $\hat\Delta_{ij} = \mathrm{sign}(d_i d_j)\tilde\Delta_{ij}$ into (15), and noting that $\mathrm{pen}_{ij}$ is symmetric about 0, we obtain (16). Since $\log(d_i^2)$ enters only as an additive constant, (16) is of the same form as (14) and they are maximised at the same point. Hence we have that $\tilde\Delta = \mathrm{sign}(D)\, \hat\Delta\, \mathrm{sign}(D)$ and $\hat\theta = D^2 \tilde\theta$.
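The decomposition used in this proof writes a precision matrix as $\Theta = \theta^{1/2}\Delta\theta^{1/2}$, with $\theta = \mathrm{diag}(\Theta)$ and $\Delta$ carrying the (unit-diagonal) partial correlation structure; under rescaling by $D$, $\theta$ picks up the factor $D^{-2}$ while $\Delta$ only changes sign pattern. A small NumPy sketch of these relations (the test matrix is an arbitrary choice):

```python
import numpy as np

def pc_decompose(Theta):
    """Split Theta into its diagonal theta and the scale-free factor
    Delta with unit diagonal: Theta = theta^{1/2} Delta theta^{1/2}."""
    theta = np.diag(Theta)
    inv_half = np.diag(1.0 / np.sqrt(theta))
    Delta = inv_half @ Theta @ inv_half
    return theta, Delta

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Theta = A @ A.T + 4 * np.eye(4)          # a positive definite precision matrix
d = np.array([2.0, -0.5, 3.0, 1.0])
D = np.diag(d)

theta_hat, Delta_hat = pc_decompose(Theta)
theta_til, Delta_til = pc_decompose(np.linalg.inv(D) @ Theta @ np.linalg.inv(D))

# theta scales by d_i^{-2}; Delta only flips signs via sign(d_i d_j).
sgn = np.diag(np.sign(d))
```

This makes concrete why penalising $\Delta$ rather than $\Theta$ yields scale invariance: rescaling leaves $|\Delta_{ij}|$ unchanged.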
Proof of Proposition 3. Let $\pi$ be a prior density as given in Proposition 3, $S$ be some sample covariance and $D$ some diagonal matrix with non-zero entries. Writing $L(\Theta \mid S)$ for the likelihood function, $\Theta = \theta^{1/2}\Delta\theta^{1/2}$, and treating $D$ as a constant, one obtains the posteriors given $S$ and $DSD$. For any measurable set $A$ and $A_D = \{\Theta : D^{-1}\Theta D^{-1} \in A\}$, the probabilities in Definition 6 can be written as integrals of these posteriors. The result follows by noting that expression (17) can be obtained by multiplying (18) by a constant not depending on $\Theta$.

B.3 Supplementary results for Section 4
Suppose the value of an estimator $\hat\theta = \mathrm{diag}(\hat\Theta)$ and all the entries in $\hat\Delta$ are given, except for a pair of partial correlations $(\Delta_{k_1 k_2}, \Delta_{k_1 k_3})$, for some indexes $k_1, k_2, k_3 \in \{1, \dots, p\}$. Suppose that $S$, and the given elements in $\hat\Delta$ and $\hat\theta$, satisfy the following conditions: (C2) $\hat\Delta_{k_2 j} = \hat\Delta_{k_3 j}$ for all $j \notin \{k_1, k_2, k_3\}$.
Proof. Without loss of generality suppose that the variable indexes are $k_1 = 1$, $k_2 = 2$ and $k_3 = 3$.
We shall show that the two terms $\log\det(\Theta)$ and $\mathrm{tr}(S\Theta)$ are symmetric in $(\Delta_{12}, \Delta_{13})$ when (C1)-(C3) hold. Straightforward algebra shows that $\mathrm{tr}(S\Theta)$ can be written in a form that is symmetric in $(\Delta_{12}, \Delta_{13})$ under these conditions. Note that because the log-likelihood is a strictly concave function, and therefore has a unique maximum, symmetry in $(\Delta_{k_1 k_2}, \Delta_{k_1 k_3})$ implies that the MLE will estimate these two partial correlations to be equal.
Proof. From Proposition S5 the penalised likelihood is symmetric if and only if
$$\mathrm{pen}_{k_1 k_2}\Big(\sqrt{\hat\theta_{k_1 k_1}\hat\theta_{k_2 k_2}}\,\Delta_{k_1 k_2}\Big) + \mathrm{pen}_{k_1 k_3}\Big(\sqrt{\hat\theta_{k_1 k_1}\hat\theta_{k_3 k_3}}\,\Delta_{k_1 k_3}\Big)$$
is symmetric in $(\Delta_{k_1 k_2}, \Delta_{k_1 k_3})$. Since $\mathrm{Pen}$ is regular, this only happens when $\hat\theta_{k_2 k_2} = \hat\theta_{k_3 k_3}$ or when $\mathrm{pen}_{ij}$ is either $L_0$ or logarithmic.

B.4 Proofs for Section 6
Proof of Proposition 4. For a fixed $\theta$, optimisation of the penalised likelihood function (5) is equivalent to optimisation of the following function of $\Delta$:
$$\log(\det(\Delta)) - \mathrm{tr}\big(\theta^{1/2} S \theta^{1/2} \Delta\big) - \rho \sum_{i \neq j} |\Delta_{ij}|.$$
The log-determinant function is known to be concave over the space of positive definite matrices. For fixed $\theta$ the second term is simply a sum of linear functions of the entries of $\Delta$, and the third term is a sum of clearly concave functions. Hence the objective function is a sum of concave functions and is therefore concave.
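Conditional concavity can also be probed numerically: for fixed $\theta$, the map $\Delta \mapsto \log\det(\Delta) - \mathrm{tr}(M\Delta) - \rho\sum_{i\neq j}|\Delta_{ij}|$ should satisfy the midpoint concavity inequality over positive definite $\Delta$ with unit diagonal. A sketch in which $M$, the test matrices and $\rho$ are arbitrary illustrative choices ($M$ stands in for $\theta^{1/2} S \theta^{1/2}$):

```python
import numpy as np

def objective(Delta, M, rho):
    """Conditional (fixed-theta) PC-GLASSO objective in Delta."""
    sign, logdet = np.linalg.slogdet(Delta)
    assert sign > 0, "Delta must be positive definite"
    off_diag_l1 = np.abs(Delta).sum() - np.trace(np.abs(Delta))
    return logdet - np.trace(M @ Delta) - rho * off_diag_l1

def random_corr(rng, p):
    """Random correlation matrix: positive definite with unit diagonal."""
    A = rng.standard_normal((p, 2 * p))
    C = A @ A.T
    d = np.diag(1.0 / np.sqrt(np.diag(C)))
    return d @ C @ d

rng = np.random.default_rng(2)
p, rho = 5, 0.3
M = random_corr(rng, p)
A, B = random_corr(rng, p), random_corr(rng, p)
mid = 0.5 * (A + B)   # convex combinations stay PD with unit diagonal

f_mid = objective(mid, M, rho)
f_avg = 0.5 * (objective(A, M, rho) + objective(B, M, rho))
# Concavity implies f(mid) >= average of the endpoint values.
```

Note this is only a spot check of the inequality at one pair of points, not a proof; the proof above covers the general case.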

Figure 1: Top: partial correlation regularisation paths for GLASSO in the p = 50 star graph example on the original data (left) and standardised data (right). Estimates of truly non-zero $\theta_{ij}$ are in black. Bottom: partial correlation regularisation paths for PC-GLASSO in the p = 50 star graph example (left) and KL loss over the regularisation paths for different penalties applied to standardised data (right).

Figure 3: Partial correlation regularisation paths in p = 4 star graph example for GLASSO on the original S (left), standardised S (center) and PC-GLASSO (right).

Figure 4: Marginal prior densities for the partial correlations under GLASSO prior (top left) and PC-GLASSO prior (top right) and for the diagonal entries under the GLASSO prior (bottom).
4: Random graph: randomly select $\frac{3}{2}p$ of the $\theta_{ij}$ and set their values to be uniform on $[-1, -0.4] \cup [0.4, 1]$, and the remaining $\theta_{ij} = 0$. Calculate the sum of absolute values of off-diagonal entries for each column, divide each off-diagonal entry by 1.1 times the corresponding column sum, and average this rescaled matrix with its transpose to obtain a symmetric, positive definite matrix.
For each setting we used p = 20 variables, considered sample sizes n ∈ {30, 100} and performed 100 independent simulations. To assess estimation accuracy we used the Kullback-Leibler (KL) loss
$$\mathrm{KL}(\Theta, \hat\Theta) = -\log\det(\hat\Theta) + \mathrm{tr}(\hat\Theta\Theta^{-1}) + \log\det(\Theta) - p.$$
To assess model selection accuracy we considered the Matthews correlation coefficient (MCC)
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP, TN, FP and FN stand for the number of true positives, true negatives, false positives and false negatives (respectively), measuring the ability to recover the true edges in the graph corresponding to $\Theta$. The MCC combines specificity and sensitivity into a single assessment and ranges between −1 and 1, where 1 indicates perfect model selection. More information on the MCC can be found in, for example, Chicco and Jurman [2020].
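Both assessment metrics are straightforward to implement; a minimal version (using NumPy for the KL loss) is:

```python
import math
import numpy as np

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def kl_loss(Theta, Theta_hat):
    """KL loss between the true precision Theta and an estimate Theta_hat."""
    p = Theta.shape[0]
    return (-np.linalg.slogdet(Theta_hat)[1]
            + np.trace(Theta_hat @ np.linalg.inv(Theta))
            + np.linalg.slogdet(Theta)[1] - p)

# Perfect edge recovery gives MCC = 1; Theta_hat = Theta gives KL loss 0.
```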

Figure 5 summarises the results. More detailed results, including Frobenius norm, sensitivity and specificity, are in Appendix C. PC-GLASSO generally outperformed GLASSO in all scenarios, and either outperformed or was competitive with SCAD and MCP. More specifically, PC-GLASSO strongly outperformed the other methods in the Star graph setting in both estimation and model selection. The Star graph is an example where there is a large range in the node degrees, suggesting that penalising partial correlations can be particularly beneficial in such situations. The AR2 model is the opposite situation, where every node has either 1 or 2 edges. Here PC-GLASSO still improved significantly over GLASSO, and to a lesser extent over SCAD and MCP in the n = 30 case, but for n = 100 the latter two provided better estimation and model selection recovery. PC-GLASSO was also generally better in the Hub and Random graph settings, particularly for n = 30, although SCAD and MCP offered slight improvements for n = 100.

Figure 6 shows the proportion of the 100 simulations in which each edge was selected, illustrating that PC-GLASSO generally selected sparser models than GLASSO, particularly in the Star and Hub scenarios.

Figure 7 (left) plots the model size against the test-sample log-likelihood, and indicates the models chosen by the BIC and EBIC. For both these solutions, PC-GLASSO achieved a significantly higher log-likelihood than the other three methods, and selected a model of roughly comparable size.