Efficient Bayesian Regularization for Graphical Model Selection

There has been an intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse-Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.

measurements obtained from n samples, denoted by x i = (x i1, …, x ip), i = 1, …, n. Here, we are concerned with Gaussian graphical models for continuous data, which are designed to detect conditional dependency relationships by discovering a pattern of zeros in the inverse covariance or precision matrix, a process typically referred to as covariance selection (Dempster, 1972). Our goal is to propose a novel, flexible and efficient Bayesian covariance selection strategy in large dimensional settings (we consider p in the several hundreds) which has theoretical guarantees and encouraging numerical performance. Let X = (x 1, …, x n)^T = (x c1, …, x cp) be the n × p dimensional data matrix, with subscript c denoting the columns. The cornerstone of Bayesian approaches for Gaussian graphical models has been discrete mixture formulations that specify x l | Σ G ~ N(θ, Σ G), Σ G ~ π(Σ G | G), G ~ π(G), l = 1, …, n, (1) where the graph G is defined using a set of nodes or vertices V* = {1, …, p} and an edge set E = (e ij), with e ij = 1 if and only if the (i,j)-th entry of the precision matrix is non-zero. The term discrete mixture comes from the fact that the prior on the covariance in model (1) can be represented as the convex linear combination π(Σ) = ∑ G ∈ G π(G)π(Σ G | G), with G having a discrete probability distribution. For a fixed graph G, the support of Σ G^{−1} is the cone M G^+, the space of all positive definite matrices having exact zeros for off-diagonals corresponding to absent edges. Here θ denotes the mean, which is usually set to zero after centering the measurements.
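As a small numerical illustration of covariance selection (not part of the proposed method), consider a three-node chain x1 -- x2 -- x3: a zero off-diagonal in the precision matrix encodes conditional independence, even though the corresponding marginal covariance is non-zero.

```python
import numpy as np

# Toy covariance selection example: a tridiagonal precision matrix encodes the
# chain graph 1 -- 2 -- 3, so Omega[0, 2] = 0 (x1 and x3 are conditionally
# independent given x2), yet the covariance entry Sigma[0, 2] is non-zero.
Omega = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(Omega)
print(Omega[0, 2], Sigma[0, 2])  # zero precision entry, non-zero covariance
```

This is the pattern of zeros that covariance selection seeks to recover from data.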
Some prominent examples of discrete mixture priors include the hyper inverse-Wishart prior (Dawid and Lauritzen, 1993) for the covariance and the G-Wishart prior for the precision (Diaconis and Ylvisaker, 1979; Roverato, 2000; Atay-Kayis and Massam, 2005). The implementation of most of these approaches relies on reversible jump Markov chain Monte Carlo (MCMC) algorithms (Giudici and Green, 1999; Dellaportas et al., 2003; Wong et al., 2003; Wang and West, 2009; Green and Thomas, 2013). For a comparison of model fitting approaches for decomposable graphical models, refer to Fitch et al. (2014). These algorithms explore the graph space and subsequently select graphs with high posterior probabilities Pr(G|X) or estimate a graph by including edges having a posterior inclusion probability greater than some threshold. Jones et al. (2005) proposed the shotgun stochastic search algorithm designed to efficiently move toward regions of high posterior probability in the model space using a parallel computing approach, whereas Scott and Carvalho (2008) developed a greedy approach called the feature inclusion search algorithm for decomposable Gaussian graphical models. Recently, Mohammadi and Wit (2015) proposed a transdimensional Markov chain Monte Carlo approach based on a continuous-time birth-death process.
As p increases, the cardinality of the graph space increases exponentially, making it computationally intractable if not impossible for many discrete mixture-based approaches to efficiently explore the graph space. This problem is somewhat akin to known difficulties encountered by stochastic search variable selection approaches (George and McCulloch, 1993), but is aggravated in the context of graphical model estimation, as the graph space (having cardinality 2^{p(p−1)/2}) explodes far more quickly. As a result, the usual discrete mixture-based approaches can fail to discover models with high posterior probabilities, while estimates of the edge-specific posterior inclusion probability are susceptible to instability under finite runs of the Markov chain Monte Carlo, as demonstrated in our simulations. Moreover, in large dimensions, the results can be sensitive to the choice of the prior on the graph space. An additional constraint is that the optimal graph is often restricted to lie in the class of decomposable graphs, due to the computationally demanding heuristic approximations required for non-decomposable models (Atay-Kayis and Massam, 2005; Lenkoski and Dobra, 2011).
Motivated by the success of shrinkage methods in Bayesian variable selection, we propose a shrinkage approach for estimating graphical models which bypasses the limitations of discrete mixture approaches and has connections with global-local priors (Carvalho et al., 2010; Polson and Scott, 2011) in regression settings. The proposed approach decouples model fitting and covariance selection. We first fit the full model based on a complete graph under a class of conjugate shrinkage priors, which is denoted as regularized inverse-Wishart priors hereafter. Our approach is novel in assigning suitable priors on the scale matrix of the inverse-Wishart, which can be marginalized to induce adaptive shrinkage on the elements of the Cholesky factor of the precision matrix, thus leading to a Cholesky-based regularization (Pourahmadi, 1999; Smith and Kohn, 2002; Wu and Pourahmadi, 2003; Frühwirth-Schnatter and Tüchler, 2008). However, unlike the usual Cholesky-based regularization approaches, the proposed prior is order invariant and allows for conjugate updates of the precision matrix, leading to efficient posterior computation.
Although shrinkage priors have elegant properties and are routinely used, they do not immediately provide an automated procedure for model selection. While some valid thresholding approaches are available for sparse covariance matrix estimation (Bickel and Levina, 2008; Cai and Liu, 2012), there is a lack of systematic thresholding approaches to obtain sparse precision matrices in the Bayesian paradigm. However, there has been an increasing interest in continuous shrinkage approaches for Bayesian variable selection, which apply decision theoretic methods to reduce unimportant effect sizes to zero for model selection purposes (Fouskakis et al., 2009; Bondell and Reich, 2012; Hahn and Carvalho, 2015). This motivates us to propose a decision theoretic approach for graphical model estimation that uses L 0 penalized joint credible regions to perform neighborhood selection for each node. Our work is distinct from the approach in Hahn and Carvalho (2015), who used expected loss functions involving an L 0 penalization as a post-processing decision theoretic step for model selection. We show that the resulting approach achieves neighborhood selection consistency for fixed and increasing dimensions, and also yields precision matrices which are positive definite and asymptotically consistent for fixed dimensions.
In summary, the proposed approach overcomes several difficulties associated with existing Bayesian alternatives: (i) it obviates having to specify prior graph probabilities, which can adversely affect final inferences under mis-specification; (ii) it does not require long runs of Markov chain Monte Carlo to search over the model space, with the computation involving a straightforward fully Gibbs sampler; (iii) it is computationally efficient and feasible for large dimensions; (iv) it is applicable to a broad class of models, including decomposable and non-decomposable graphs; and (v) it attains selection consistency in fixed p and p n = o(n) settings.

The regularized inverse-Wishart prior
In this section, we propose shrinkage priors on the precision matrix characterized by mixtures of inverse-Wishart priors on the covariance. Without loss of generality we assume a zero mean model, i.e. set θ = 0 in (1), indicating the data matrix X is appropriately centered. The general construction of the prior can be written as Σ | D ~ Inverse-Wishart(b, D), (2) where D = diag(d 1, …, d p), and d k ~ G k(·), with G k(·), k = 1, …, p, denoting mixing distributions allowing for adaptive shrinkage across different scales. By setting G k(·) to different mixing distributions, various types of shrinkage can be obtained.
Model (2) relies on a conjugate inverse-Wishart prior on Σ = Ω^{−1}, and varies from the traditional discrete mixture formulation (1) in having a continuous mixture representation for the covariance as π(Σ) = ∫ Inverse-Wishart(Σ | b, D) dπ(D). The traditional model (1) constrains the support of Ω G to the cone M G^+ which depends on G ∈ G, while the continuous mixture prior in (2) has an unconstrained support M^+ (the space of all positive definite matrices). Our choice of the inverse-Wishart formulation (2) is based on both theoretical and computational considerations: (i) it induces a Gaussian distribution on the off-diagonals of Σ^{−1} (Lemma 1), which is a necessary condition in establishing model selection consistency; and (ii) the associated conjugacy allows us to draw posterior samples of Σ^{−1} in an efficient manner even for large dimensions.
We now state the following well-known result as a Lemma, which serves as a first step toward understanding the regularization properties of the prior in (2). The Lemma captures the distribution of elements in the last row of Σ^{−1} ≡ Ω p conditional on D. The corresponding result for any row can be adapted in a straightforward manner.
Lemma 1. For Σ ~ Inverse-Wishart(b, diag(d 1, …, d p)), we have π(ω p,pp) = Ga(ω p,pp | b/2, d p/2), and π(ω p,12 | ω p,pp) = ∏_{l=1}^{p−1} N(ω p,lp | 0, ω p,pp/d l), where ω p,12 = (ω p,1p, …, ω p,(p−1)p)^T denotes the off-diagonals in the last row.

Lemma 1 shows us that the precision off-diagonals have a scale mixture representation under a prior on d 1, …, d p, and a careful choice of G k(·) is likely to yield a prior on Σ^{−1} with desirable shrinkage properties. We propose the following priors on d 1, …, d p, which achieve joint regularization for all the elements in Σ^{−1} (see Theorem 1):

d k ~ Inverse-Gamma((b + 1)/2, λ k²/2), λ k ~ Ga(a λ,k, b λ,k), k = 1, …, p, (3)

where b is the degrees of freedom of the inverse-Wishart prior in (2). The hyperparameters λ = (λ 1, …, λ p) control the shrinkage under our approach, and one can place hyperpriors on λ as in (3) to achieve a hierarchical specification that lets the data control the degree of shrinkage. We demonstrate the shrinkage properties induced by λ in Figure 1, which plots the density of the precision off-diagonals under formulation (2)-(3) for varying shrinkage parameters. From Figure 1, it is evident that higher values of λ/b encourage greater shrinkage. We provide an analytic justification for such a phenomenon in Remark 3 in the next section.
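The hierarchy (2)-(3) can be simulated directly. The sketch below is illustrative only: the hyperparameter values and the degrees-of-freedom convention (scipy's inverse-Wishart requires df ≥ p, so we take df = b + p − 1) are assumptions, not the paper's exact specification.

```python
import numpy as np
from scipy.stats import invwishart, invgamma

rng = np.random.default_rng(0)
p, b = 5, 3              # dimension and inverse-Wishart degrees-of-freedom parameter
a_lam, b_lam = 1.0, 1.0  # assumed Gamma(shape, rate) hyperparameters for lambda_k

def draw_prior_precision(rng):
    # lambda_k ~ Ga(a_lam, b_lam); numpy's gamma uses a scale parameter
    lam = rng.gamma(a_lam, 1.0 / b_lam, size=p)
    # d_k | lambda_k ~ Inverse-Gamma((b + 1)/2, lambda_k^2 / 2)
    d = invgamma.rvs((b + 1) / 2, scale=lam ** 2 / 2, random_state=rng)
    # Sigma | D ~ Inverse-Wishart with scale matrix D = diag(d); df = b + p - 1
    # is an assumed mapping onto scipy's parameterization, for illustration.
    Sigma = invwishart.rvs(df=b + p - 1, scale=np.diag(d), random_state=rng)
    return np.linalg.inv(Sigma)   # a prior draw of the precision matrix
```

Repeated draws of `draw_prior_precision` can be used to visualize the induced marginal density of the precision off-diagonals, as in Figure 1.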
Due to its regularization properties as elaborated in the next section, we denote the prior formulation in (2)-(3) as the regularized inverse-Wishart prior on the covariance matrix. We note that our specification is related to, but distinct from, Huang and Wand (2013), whose focus was on covariance matrix estimation. They had a similar generic specification as in (2), but used conjugate Gamma priors on d 1, …, d p, which resulted in the elements of the covariance matrix having marginally non-informative distributions for certain hyper-parameter choices. However, such an approach is not equipped to provide adequate regularization for precision off-diagonals, and may perform much worse compared to the regularized inverse-Wishart method in terms of graph estimation, as demonstrated in simulations. We note that while it is also possible to use a fixed choice of D, our hierarchical specification is motivated by global-local priors resulting in adaptive shrinkage.
Remark 1. The prior in Theorem 1 is maximized with respect to λ k when , and similar conclusions hold for any element in λ. The above can be seen by taking the logarithm of π(Ω | λ) as expressed in Theorem 1, and then differentiating both sides with respect to λ k and subsequently equating to zero. Thus, a large value of λ k /b implies shrinkage for the elements of the Cholesky factor T′V −1/2 and supports our earlier observation in Figure 1 that larger values of λ k , k = 1, …, p, encourage greater shrinkage in Σ −1 .
Remark 2. Although Theorem 1 imposes shrinkage due to equivalence with Cholesky-based regularization, it has an important difference in terms of being order invariant.

Posterior computation steps
The Markov chain Monte Carlo sampler for the regularized inverse-Wishart approach proceeds by using a straightforward fully Gibbs approach, and employing conjugacy to sample the precision matrices as a whole. We consider Gamma hyperpriors on λ k ~ Ga(a λ,k , b λ,k ), k = 1, …, p, and fix b = 3 in our computations as in Jones et al. (2005). The posterior computation steps are highlighted in the Supplementary Materials.
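The conjugate update of the covariance can be sketched as follows. This is a minimal illustration under standard inverse-Wishart conjugacy; the degrees-of-freedom convention (nu = b + p − 1 here) is an assumption, and the paper's exact full conditionals, including those for d k and λ k, are in its Supplementary Materials.

```python
import numpy as np
from scipy.stats import invwishart

# With Sigma | D ~ Inverse-Wishart(nu, D) and rows of X i.i.d. N(0, Sigma),
# standard conjugacy gives Sigma | X, D ~ Inverse-Wishart(nu + n, D + X^T X).
def update_sigma(X, D, nu, rng):
    n = X.shape[0]
    return invwishart.rvs(df=nu + n, scale=D + X.T @ X, random_state=rng)
```

A full Gibbs sweep would alternate this step with draws of d k and λ k from their respective full conditionals.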

Model selection
We develop a post-Markov chain Monte Carlo fitting strategy for graphical model estimation, which assigns exact zeros to precision off-diagonals that correspond to absent edges, by using a decision theoretic approach incorporating joint penalized credible regions. We note that the proposed decision theoretic approach is very general since (a) it can be applied to posterior samples of Σ^{−1} under any prior specification; and (b) it does not make any assumptions about the underlying graph structure, which allows for both decomposable and non-decomposable graphs. The decision theoretic approach performs neighborhood selection (Meinshausen and Bühlmann, 2006) for each node in the graph, and the estimated neighborhoods are then combined to obtain estimates for the entire edge set. The neighborhood for node i ∈ V* is defined as ne i = {j ∈ V*\{i}: (i, j) ∈ E}, and is estimated using equivalent L 0 minimization-based approaches in regression settings. In this paper, we adapt the approach proposed by Bondell and Reich (2012) to our context, which uses an approximation to solve the L 0 problem and is briefly summarized below.
For an n × 1 vector of responses y and n × p covariate matrix Z, Bondell and Reich (2012) first fit the full regression model y = Zβ + ϵ, ϵ i ~ N(0, σ²), β j ~ N(0, σ²/τ), σ² ~ π(σ²), i = 1, …, n, j = 1, …, p, (5) and subsequently perform variable selection under a post-MCMC decision theoretic approach. First, they estimate an ordered sequence of models corresponding to a sequence of credible regions C α that have probability content 1 − α, with α ∈ (0, 1) indexing the sequence. The model corresponding to a credible region C α is obtained via a sparse solution for β, induced by minimizing the L 0 norm ‖β‖ 0 subject to β being constrained to lie within C α. In particular, they use the criterion min β ‖β‖ 0 subject to (β − β̄)^T Σ̄^{−1}(β − β̄) ≤ C α, where Pr(β ∈ C α) = 1 − α, and β̄, Σ̄ are the posterior mean and covariance of β, based on the full model (5). Since solving the exact L 0 minimization problem (8) involves a combinatorial search which is computationally infeasible for moderate to high dimensions, they approximate the L 0 criterion by a smooth homotopy between L 0 and L 1, which can be solved using existing algorithms such as least angle regression (Efron et al., 2004). Finally, an optimal value of α is chosen from the sequence, which yields a point estimate comprising exact zero effect sizes for unimportant predictors.
After convergence of the Markov chain Monte Carlo, the posterior samples of (ω p,kk, β k) can be viewed as arising from the stationary distribution π(ω p,kk, β k |X), implied by (7). It is worth pointing out here that although the reverse process of fitting the series of p regression models in (7) may potentially result in similar edge set estimates under suitable choices of hyper-parameters, it is not a valid joint distribution procedure, does not yield positive definite precision matrix estimates, and involves twice the number of parameters compared to the proposed approach. Hence we do not consider this alternate approach further in our work. Finally, we note that the hierarchical formulation (2) specifies a global-local prior (Polson and Scott, 2011; Carvalho et al., 2010) on the conditional regression coefficients in (7), with ω p,kk and d j^{−1}, j ≠ k, j = 1, …, p, being the global and local scale parameters under the k-th regression. The global scale parameter controls the global shrinkage to the origin, and the local scales allow deviations in the degree of shrinkage, which enables a sharp spike at zero along with thick tails where necessary.
The posterior samples of Σ^{−1} can then be used directly to obtain posterior realizations of β k, k = 1, …, p. Further, we note that (7) is very similar to regression model (5), which allows us to adapt the penalized joint credible region approach to obtain a sparse solution of β k corresponding to level α as β k^α = arg min β k ‖β k‖ 0 subject to (β k − β̄ k)^T Σ̄ k^{−1}(β k − β̄ k) ≤ C α, (8) where β̄ k and Σ̄ k are the posterior mean and covariance of β k respectively, under the regularized inverse-Wishart approach. The solution β k^α corresponds to a distinct estimated neighborhood ne k,α = {l ∈ V*: β kl^α ≠ 0, l ≠ k} for node k ∈ V*, since β kj^α = 0 implies that the (k,j)-th precision matrix element is zero under the equivalence β kj = −ω p,kk^{−1} ω p,kj, j ≠ k, k = 1, …, p. As in Bondell and Reich (2012), we solve an approximate version of the L 0 optimization in (8). The proposed solution is unique for both p < n and p > n settings for each regression in (7). The details of the computational procedure for the solution can be found in Section 3.2.
Under our approach, two estimates for the edge set are possible for a given α: E α,∧ = {(k, l): k ∈ ne l,α ∧ l ∈ ne k,α} and E α,∨ = {(k, l): k ∈ ne l,α ∨ l ∈ ne k,α}. Although the two edge sets are not guaranteed to be equal for finite samples, both estimates are equal asymptotically due to the neighborhood selection consistency, as elaborated in Theorem 2.
Hereafter, we suppress the second subscript and denote the estimated edge set for level α as E α. The precision matrix corresponding to level α can be computed as Ω̂ E α = Ω̄ ⊗ ADJ α, (9) where Ω̄ is the posterior mean of the Markov chain Monte Carlo samples under the regularized inverse-Wishart approach, ADJ α is the adjacency matrix corresponding to the edge set E α (taken to have ones on the diagonal, so that the diagonal of the precision matrix is retained), and ⊗ denotes the element-wise product. As demonstrated in the next section, a careful choice of α depending on n leads to asymptotic consistency for Ω̂ E α for fixed p settings, which also implies asymptotic positive definiteness.
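The two combination rules and the masked precision estimate can be sketched as follows. This is a small illustration; the 0/1 neighborhood matrix `N` and the function names are our own conventions, not the paper's.

```python
import numpy as np

# N[k, l] = 1 if node l is in the estimated neighborhood of node k.  The "and"
# rule keeps edge (k, l) only when both directions agree; the "or" rule keeps it
# when either direction does.
def edge_sets(N):
    return np.logical_and(N, N.T), np.logical_or(N, N.T)

# Level-alpha precision estimate: mask the posterior mean element-wise by the
# adjacency matrix, retaining the diagonal.
def masked_precision(Omega_bar, adj):
    mask = adj.astype(float)
    np.fill_diagonal(mask, 1.0)
    return Omega_bar * mask
```

For a symmetric neighborhood matrix the two rules coincide, matching the asymptotic agreement of E α,∧ and E α,∨.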
Noting that the above estimate for the precision matrix is not guaranteed to be positive definite for finite samples, we propose an alternate estimator obtained by fixing the off-diagonals corresponding to absent edges to be zero, while rescaling the elements in the Cholesky factor of Ω̄ in order to ensure positive definiteness. The estimator, and the algorithm needed to obtain it, are described in Section 3.5.

Computation steps for neighborhood selection
Following Bondell and Reich (2012), who noted that solving an exact L 0 minimization problem involves a combinatorial search which is computationally infeasible for moderate to high dimensions, we replace the L 0 criterion in (8) by a criterion proposed by Lv and Fan (2009), which is a smooth homotopy between L 0 and L 1. Instead of optimizing (8) under the L 0 penalty, we use the modified penalty ∑_{j=1, j≠k}^p τ a(|β kj|), where τ a(|t|) = (|t|/(a + |t|)) I(|t| ≠ 0) + (a/(a + |t|)) |t|, a > 0.
As in Bondell and Reich (2012), the above can be solved using a local linear approximation (Zou and Li, 2008), which reduces the modified optimization problem (10) to min β k (β k − β̄ k)^T Σ̄ k^{−1}(β k − β̄ k) + Δ α ∑_{j≠k} |β kj|/β̄ kj², (11) where Δ α corresponds to a penalty parameter having a one-to-one correspondence with α.
The above is a Lagrangian optimization problem and can be equivalently written as min β k* ‖Y k* − X k* β k*‖² + Δ α ‖β k*‖ 1, (12) where Y k* = Σ̄ k^{−1/2} β̄ k, X k* = Σ̄ k^{−1/2} B k, and B k is a diagonal matrix having elements β̄ kj², j ≠ k, j = 1, …, p. Equation (12) is just the usual LASSO problem and can be solved using the efficient LARS algorithm, and the solution for the original quantity of interest can simply be obtained using β k = B k β k*. For p < n settings, the above is a strictly convex problem with X k* having full rank (i.e. rank(X k*) = p − 1), and hence has a unique solution. For p > n, results in Tibshirani (2013) suggest that our solution is unique with probability one, as the responses are continuous under our set-up.
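The transformation to a lasso problem can be sketched as follows. The symmetric inverse square root, the toy coordinate-descent solver, and all variable names are illustrative assumptions; in practice LARS or any standard lasso solver applies.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Minimizes ||y - X b||^2 + 2*lam*||b||_1 by cyclic coordinate descent.
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return beta

def credible_region_solution(beta_bar, Sigma_bar, delta):
    # Build Y* = Sigma_bar^{-1/2} beta_bar and X* = Sigma_bar^{-1/2} B as in (12),
    # solve the lasso in beta*, and map back via beta = B beta*.
    w, V = np.linalg.eigh(Sigma_bar)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T             # symmetric inverse root
    B = np.diag(beta_bar ** 2)
    beta_star = lasso_cd(S_inv_half @ B, S_inv_half @ beta_bar, delta / 2.0)
    return B @ beta_star
```

With Δ α = 0 the solution recovers the posterior mean; larger Δ α zeroes out coefficients whose posterior means are small relative to their uncertainty.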
We note that the proposed neighborhood selection method via penalized credible regions can be interpreted as a decision theoretic approach. In particular, the solution to the neighborhood estimation equation (11) can be approximated as the minimizer of an expected loss function, as in the following Lemma. Let D̄ −k denote the posterior mean for the diagonal matrix D without the k-th diagonal entry, and let ‖β‖² 2,A^{1/2} denote the scaled L 2 norm β^T Aβ.
The proof is presented in the Supplementary Materials. In the above decision theoretic criterion, the first term determines the sparsity of the solution, the second term is based on an expected scaled squared loss function under a posterior predictive distribution, similar to the loss function for regression in Hahn and Carvalho (2015), and the third term imposes an additional scaled squared loss penalty which encourages the solution for the regression coefficients to concentrate around the posterior mean of the regression coefficients. The scaling matrices (1/s k)I and (1/s k)D̄ −k account for the variance corresponding to the prior under the regularized inverse-Wishart specification.

Selection consistency
In this section, we establish that the proposed model selection approach leads to consistent neighborhood selection under some suitable assumptions. Suppose that for a given sample of size n, we estimate the neighborhood corresponding to level α n in the ordered sequence, and denote the corresponding estimated neighborhood for the k-th node as ne k,n. By choosing α n such that 1 − α n → 1 as n → ∞, we attain neighborhood selection consistency, which is mathematically defined as P(ne i = ne i*) → 1 as n → ∞, for all i ∈ V*, where ne i and ne i* refer to the estimated and true neighborhoods respectively. In other words, the probability of the estimated neighborhood for each node being equal to the true neighborhood asymptotically goes to 1 as the coverage increases with n.
For a sample size n, denote the credible region for β k with content 1 − α n as C n,k = {β k : (β k − β̄ k)^T Σ̄ k^{−1}(β k − β̄ k) ≤ C n}. Let E 0 be the true edge set corresponding to an undirected graph, with true neighborhood ne k0 for node k, k = 1, …, p. Consider the following assumptions, where Ω E 0 = (ω 0,ij), i, j = 1, …, p, has exact zeros for the off-diagonals corresponding to absent edges in E 0.
(A3) When p n = o(n), each node has a finite number of neighbors.
Assumption (A1) states that the true model is a Gaussian graphical model with edge set E 0 . Assumption (A2) implies that the true precision off-diagonal elements corresponding to edges in E 0 are sufficiently large but bounded above. Assumption (A3) ensures that the true model is sparse under increasing dimensions. The following Theorem establishes selection consistency.
Theorem 2. For fixed p, suppose assumptions (A1)-(A2) hold, and choose a sequence of credible regions C n,k such that C n → ∞ and n^{−1} C n → 0. For p n = o(n), suppose assumptions (A1)-(A3) hold, and choose C n,k such that C n → ∞, p n^{−1} C n → ∞ and n^{−1} C n → 0. Then neighborhood selection consistency is attained under the regularized inverse-Wishart approach for both cases.
The proof of Theorem 2 is provided in the Supplementary Materials. Let the estimated generic edge set for level α n be denoted as E n , with the corresponding estimated precision matrix denoted as Ω E n . The following result holds for both E n, ∧ and E n, ∨ for the case of fixed p.
The first part of Corollary 1 follows since the conditional independence structure of a multivariate normal can be consistently estimated by combining the neighborhood estimates of all variables, for fixed p. The edge set estimates, E n, ∧ and E n, ∨ , converge asymptotically when the truth is a Gaussian graphical model, due to consistent selection of the neighborhoods in Theorem 2. The proof for the second part of Corollary 1 is in the Supplementary Materials. Since Ω E 0 is positive definite by assumption, Ω E n is asymptotically positive definite for fixed p.
The above results are based on the assumption that the scale matrix D is diagonal. However, the next Corollary shows that consistency still holds under a non-diagonal D ∈ M + under the following additional assumption. The proof is presented in the Supplementary Materials. Let D −k denote the D matrix excluding the k-th row and column, d k, − j T denote the k-th row of D excluding the j-th element, and Ω 0,k denote the true precision matrix of (x 1 , …, x k−1 , x k+1 , …, x p ). Corollary 2. Suppose D ∈ M + is fixed and non-diagonal, and that assumptions (A1)-(A4) hold. Then the neighborhood selection consistency holds as in Theorem 2.

Edge selection based on false discovery rates
Bayesian methods usually obtain point estimates of the graph by including edges that have a posterior probability > 0.5, or reporting the graph that has the highest log marginal likelihood. In lieu of these approaches, we propose an approach based on controlling the false discovery rate, which includes a natural multiplicity correction and can directly control the level of sparsity in edge selection. This approach is based on the strategy that the edges which are strongly supported by the data will likely appear often in the ordered sequence of neighborhoods computed via penalized credible regions, whereas other edges with weaker evidence will likely appear less often. This approach has similarities to the one used to compute brain networks using sliding window correlations (Monti et al., 2014). Instead of specifying a confidence level α and then computing the corresponding neighborhoods, the proposed strategy for edge selection instead relies on fitting a series of neighborhoods for varying sparsity levels corresponding to different values of the penalty parameter Δ α, which has a one-to-one but highly non-linear relationship with α (Bondell and Reich, 2012). After fitting a series of neighborhoods corresponding to varying sparsity levels, we select those edges as important which have pseudo-probabilities greater than some threshold, where these pseudo-probabilities are computed based on the frequency of occurrence of an edge in the ordering. We adapt the approaches in Morris et al. (2008) and Baladandayuthapani et al. (2010) to determine such a threshold, which is designed to control false discovery rates and is described below.
We first compute a pseudo posterior inclusion probability matrix P = (P ij) by computing the proportion of times each edge is included in the ordering of graphs based on the sequence of credible regions having probability content (1 − α m), m = 1, …, R, where R is the chosen number of credible regions in the sequence. Then 1 − P ij can be considered akin to Bayesian q-values, or estimates of the "local false discovery rate" (Newton et al., 2004), as they measure the probability of a false positive if the (i,j)-th edge is called a 'discovery' or is significant. Given a desired global false discovery rate bound η ∈ (0, 1), one can determine a threshold c η that flags the set of important edges as E η = {(i, j): P ij ≥ c η}. This yields a point estimate of the graph.
The threshold c η can be chosen in the following manner. Let vec(P) be the vectorized upper triangular part of P excluding diagonals, containing the pseudo posterior inclusion probabilities of the edges stacked column-wise. We first sort vec(P) in descending order to yield the sorted vector vec(P̃) = (P̃ k, k = 1, …, p(p − 1)/2). Then we estimate c η as the ζ-th entry of vec(P̃), where ζ = max{j*: (1/j*) ∑_{k=1}^{j*} (1 − P̃ k) ≤ η}. A lower value of η leads to sparser graphs, while simultaneously controlling the FDR at a pre-specified level. However, we note that this FDR is based on pseudo-probabilities which may not reflect the true posterior edge inclusion probabilities, and hence these FDR values and associated pseudo-probabilities need to be interpreted with care. Finally, if the upper limit of Δ α is increased drastically, or alternately, if all the Δ α values lie within a very small neighborhood of zero, then the resulting estimated graph may be somewhat sensitive to such choices. However, by choosing the threshold c η appropriately, one can mitigate the sensitivity to the grid values to a large extent and obtain similar graphs with comparable false discovery rates.
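The thresholding rule can be sketched as follows. The average-of-q-values cutoff is our reading of the Morris et al. (2008)-style procedure, and the function names are assumptions for illustration.

```python
import numpy as np

# Stack the upper-triangular pseudo inclusion probabilities, sort in descending
# order, and keep the longest prefix whose average estimated local false
# discovery rate (1 - P) is at most eta; the threshold c_eta is the smallest
# probability retained in that prefix.
def fdr_threshold(P, eta):
    iu = np.triu_indices(P.shape[0], k=1)
    probs = np.sort(P[iu])[::-1]                       # descending order
    avg_fdr = np.cumsum(1.0 - probs) / np.arange(1, probs.size + 1)
    keep = np.nonzero(avg_fdr <= eta)[0]
    if keep.size == 0:
        return np.inf                                  # no edge passes at this level
    return probs[keep[-1]]                             # c_eta: flag edges with P_ij >= c_eta
```

Edges with P ij ≥ c η are then flagged as discoveries.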

Algorithm for estimating positive definite precision matrix
We propose an algorithm for estimating the precision matrix which rescales the elements of the Cholesky factor of the posterior mean of Ω given a set of zero entries under the estimated graph, in a manner that ensures positive definiteness. This algorithm is derived based on the results in Chan and Jeliazkov (2009), who propose a Markov chain Monte Carlo sampling approach for restricted covariance matrices. Our algorithm is different from their approach, and focuses on the inverse covariance matrix while requiring only a series of post-Markov chain Monte Carlo deterministic steps, instead of having to sample restricted precision matrices at each Markov chain Monte Carlo iteration. The resulting approach is thus computationally efficient even for large p. The algorithm takes in the positive definite posterior mean Ω and the estimated edge set E as inputs, and outputs another positive definite estimate Ω* which has exact zeros corresponding to absent edges in E.
Using the form of the Cholesky factor in the proof of Theorem 1, and similar to eqn (18) in Chan and Jeliazkov (2009), it is straightforward to see that setting ω kl = 0 leads to an adjustment in the Cholesky factor matrix as ω̃ k,kl = −ω l,ll^{−1} ∑_{h=1}^{l−1} ω k,kh ω l,lh, where ω̃ k,kl is the modified value for the (k,l)-th element in the Cholesky factor which maintains positive definiteness of the precision matrix under the restriction ω kl = 0. It is understood that the right hand side of the equation is zero when summing over the empty set (i.e. ω̃ k,k1 = 0). We propose the following post-Markov chain Monte Carlo algorithm which takes as inputs the estimated graph G having edge set E, and the posterior mean of the precision matrix Ω̄, and outputs a positive definite matrix with exact zeros for absent edges.
Step 1: Accept Ω = (T * )F (T * ) T , the posterior mean of Ω as the input, along with the estimated edge set E, where T* = (t ij ) is a lower triangular matrix and F = diag(f 1 , …, f p ).
Step 2: For increasing k = 2, …, p, and l = 2, …, k − 1, if edge (k, l) is absent in E, modify the (k, l)-th element in T* as −(1/f l) ∑_{h=1}^{l−1} t kh t lh f h. It is understood that the (k, l)-th element is updated before the (k, l′)-th element in T*, for 1 ≤ l < l′ ≤ k − 1. Note that this step implies that we fix the (k, 1)-th element in the Cholesky factor T* as zero if edge (k, 1) is missing from E.
Step 3: The final estimated precision matrix is given as Ω* = LF(L) T , where L now denotes the modified lower triangular matrix obtained by rescaling all elements in T* corresponding to absent edges in E, and F is unchanged.
This estimate Ω* is guaranteed to be positive definite, and the number of steps required to compute Ω* is equal to the number of absent edges. For greater clarity, we present a toy numerical example of the proposed positive definite estimator in the Supplementary Materials.
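The three steps above can be sketched in code. The following is a minimal NumPy sketch with our own function and variable names (the paper supplies no implementation): it factorizes the posterior mean as Ω = (T*)F(T*) T via a Cholesky decomposition, rescales the entries of T* for absent edges with l increasing, and rebuilds the positive definite estimate Ω*.

```python
import numpy as np

def modified_ldl_estimate(omega, absent_edges):
    """Post-MCMC positive definite estimator (Steps 1-3).

    omega        : positive definite posterior mean of the precision matrix.
    absent_edges : iterable of (k, l) index pairs (0-based) absent from E.
    """
    p = omega.shape[0]
    # Step 1: unit-diagonal factorization omega = T F T' from the Cholesky factor.
    C = np.linalg.cholesky(omega)      # omega = C C'
    f = np.diag(C) ** 2                # F = diag(f_1, ..., f_p)
    T = C / np.diag(C)                 # unit lower-triangular T* (columns rescaled)
    absent = {(max(k, l), min(k, l)) for k, l in absent_edges}
    # Step 2: rescale entries of T* for absent edges; l increasing ensures the
    # (k, l)-th element is updated before the (k, l')-th for l < l'.
    for k in range(1, p):
        for l in range(k):
            if (k, l) in absent:
                # empty sum for l = 0 gives exactly zero, as in the note above
                T[k, l] = -np.dot(T[k, :l] * f[:l], T[l, :l]) / f[l]
    # Step 3: rebuild the precision estimate with F unchanged.
    return T @ np.diag(f) @ T.T
```

Because F is unchanged and T stays unit lower triangular, the output is positive definite by construction, and each absent edge costs one deterministic update.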

A related approach
We note that the idea of fitting a posterior distribution and then performing a post-MCMC decision theoretic step was also proposed recently by Hahn and Carvalho (2015). They primarily focused on variable selection in the context of a Bayesian linear regression model, and extended the approach to graphical models using a post-processing decision theoretic step which involved minimizing the expected loss E[L(Y, γ)]. For graphical model selection, the Hahn and Carvalho (2015) approach reduces to minimizing the expected loss function arg min_Ω E[ λ‖Ω‖_0 − log(det(Ω)) + n^{−1} tr(X T X Ω) ] = arg min_Ω { λ‖Ω‖_0 − log(det(Ω)) + tr(ΣΩ) }, (13) where Ω represents the inverse covariance matrix, Σ denotes the posterior mean of the covariance matrix, and the expectation is taken with respect to the posterior predictive distribution of X. The authors propose to solve a surrogate problem by replacing the L 0 penalty in (13) with the L 1 penalty, which can be solved using a graphical lasso (Friedman et al., 2008) algorithm.
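For concreteness, the decoupled objective in (13) is easy to evaluate for a candidate Ω. The NumPy sketch below uses our own names and makes two labeled assumptions: the trace term enters with a positive sign (as required for a well-posed minimization of the penalized negative log-likelihood), and ‖Ω‖_0 counts the non-zero off-diagonal entries of the upper triangle.

```python
import numpy as np

def hc_objective(omega, sigma_bar, lam):
    """Decoupled selection objective in the spirit of eqn (13):
    lam * ||omega||_0 - log det(omega) + tr(sigma_bar @ omega).
    Assumption: ||.||_0 counts non-zero upper-triangular off-diagonals."""
    l0 = np.count_nonzero(np.triu(omega, k=1))
    sign, logdet = np.linalg.slogdet(omega)   # stable log-determinant
    return lam * l0 - logdet + np.trace(sigma_bar @ omega)
```

Scanning this objective over candidate sparsity patterns (or its L 1 surrogate via a graphical lasso solver) is what the Hahn and Carvalho (2015) post-processing step amounts to.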
Although both approaches operate by fitting a Bayesian model and then performing a post-processing step involving an L 0 penalization to estimate the model, there are important differences. First, the proposed approach is based on neighborhood selection via penalized joint credible regions, whereas the Hahn and Carvalho (2015) approach relies on sparse precision matrix estimation. Second, our article develops a novel prior on the covariance matrix which induces shrinkage on the precision matrix off-diagonals, whereas the Hahn and Carvalho (2015) article did not focus on developing novel priors. Third, the two approaches choose the tuning parameters (Δ α and λ) in completely different ways: the Hahn and Carvalho (2015) method uses the full posterior distribution, while the proposed approach fits a series of models under a sequence of tuning parameter values and then selects the optimal graph using a criterion designed to control the false positive rate. Moreover, the proposed approach has attractive theoretical properties in terms of graphical model selection consistency, even when the number of nodes grows with the sample size, whereas the theoretical properties of the graphical modeling approach in Hahn and Carvalho (2015) have not been vetted, to our knowledge. The two approaches also perform differently in numerical studies, as elaborated in the simulation section.

Description
We present several simulation scenarios comparing our approach to (a) the frequentist graphical lasso (Friedman et al., 2008); (b) the neighborhood selection approach of Meinshausen and Bühlmann (2006); (c) the hyper inverse-Wishart approach employing reversible jump Markov chain Monte Carlo (Giudici and Green, 1999); (d) shrinkage approaches such as the Bayesian graphical lasso and Bayesian adaptive graphical lasso (Wang, 2012); and (e) the unregularized inverse-Wishart prior, which has the same formulation as (2) but with D = dI p and d ~ Ga(1, 1), and resembles the prior in Huang and Wand (2013). The Matlab code for the Bayesian graphical lasso and adaptive graphical lasso was obtained from the supplementary materials of Wang (2012), while the frequentist graphical lasso and the method of Meinshausen and Bühlmann (2006) were implemented using the 'glasso' package in R. We wrote the code for implementing the reversible jump Markov chain Monte Carlo under the model x i ~ N(0, Σ G ), Σ G | G ~ HIW(b, D), i = 1, …, n, where G ~ π(G) is restricted to the class of decomposable graphs, and π(G) is defined by independent Bernoulli(p*) priors on the edge inclusion indicators, with p* ~ U(0, 1); here HIW refers to the hyper inverse-Wishart prior. For the shrinkage procedures, 15,000 MCMC iterations with a burn-in of 5,000 were used, while 100,000 iterations with a burn-in of 10,000 were used for the discrete mixture approach (c), with the initial adjacency matrix corresponding to a null graph.
We considered several cases of data generation, with 50 replicates per case. For p > n settings we fit a slightly modified model (2), specifying D ~ Wishart(b D , I p ), where D ∈ M + is no longer constrained to be diagonal.
Case I. Data are generated from a Gaussian distribution with the covariance matrix given by a fractional Gaussian noise process, having elements σ_ij = 0.5(|i − j + 1|^{2H} − 2|i − j|^{2H} + |i − j − 1|^{2H}), where H ∈ [0.5, 1] is the Hurst parameter, chosen here as H = 0.7.
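Case I can be reproduced with a short NumPy sketch; it assumes the standard fractional Gaussian noise autocovariance γ(k) = 0.5(|k + 1|^{2H} − 2|k|^{2H} + |k − 1|^{2H}) at lag k = i − j, and the names are our own.

```python
import numpy as np

def fgn_covariance(p, H=0.7):
    """p x p covariance with fractional Gaussian noise autocovariance entries."""
    k = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def sample_case1(n, p, H=0.7, rng=None):
    """Draw n Gaussian observations under the Case I covariance."""
    rng = np.random.default_rng(rng)
    return rng.multivariate_normal(np.zeros(p), fgn_covariance(p, H), size=n)
```

Note that the unit lag-zero autocovariance makes the diagonal of Σ equal to one, so its entries are also correlations.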
Case II. We generate data emulating a real data application, using mRNA expression levels for 49 genes that are available from The Cancer Genome Atlas portal. The full details and the corresponding results are available in the Supplementary materials.
Case III. We generate data from a two component mixture of Gaussians, i.e. x i ~ πN(−1 p , Σ) + (1 − π)N(1 p , Σ), where 1 p denotes a p-vector of ones, Σ is defined as in Case I, and π ∈ (0, 1) is the mixing proportion. This case corresponds to a non-Gaussian truth, violating the Gaussian assumption inherent in our formulation.
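A corresponding sketch for Case III takes Σ as an argument (in practice the Case I covariance); the function name and signature are our own.

```python
import numpy as np

def sample_case3(n, sigma, pi=0.5, rng=None):
    """Case III: x_i ~ pi * N(-1_p, Sigma) + (1 - pi) * N(1_p, Sigma)."""
    rng = np.random.default_rng(rng)
    z = rng.random(n) < pi                    # latent component labels
    means = np.where(z[:, None], -1.0, 1.0)   # mean -1_p or +1_p per observation
    p = sigma.shape[0]
    return means + rng.multivariate_normal(np.zeros(p), sigma, size=n)
```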

Case IV. We generate data from a Gaussian graphical model as in Peng et al. (2009), where the edges were generated randomly with probability 0.002, the precision matrix off-diagonals were set to zero for absent edges and generated from a U(−1, 1) distribution otherwise, and the k-th diagonal was then computed as ω kk = 1 + ∑ l≠k |ω kl | to maintain diagonal dominance. This case corresponds to a sparse graphical model.
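The Case IV precision matrix can be sketched as follows (our own code; strict diagonal dominance with a positive diagonal guarantees positive definiteness).

```python
import numpy as np

def peng_precision(p, edge_prob=0.002, rng=None):
    """Sparse precision matrix following the Case IV (Peng et al., 2009) scheme."""
    rng = np.random.default_rng(rng)
    omega = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    edges = rng.random(iu[0].size) < edge_prob
    omega[iu] = np.where(edges, rng.uniform(-1, 1, size=iu[0].size), 0.0)
    omega += omega.T
    # diagonal dominance: omega_kk = 1 + sum_{l != k} |omega_kl|
    np.fill_diagonal(omega, 1 + np.sum(np.abs(omega), axis=1))
    return omega
```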
In addition to the above cases, we also look at another simulation scenario based on a genomics example, the results for which are presented in the Supplementary section. Since Cases I and III considered here correspond to non-sparse precision matrices having few exact zero entries, we adopt a slightly different notion to define the true edge set. In particular, the true edge set includes all edges corresponding to absolute partial correlations greater than a certain threshold c m . We examine point estimates corresponding to true edge sets ES025 and ES005, obtained by choosing c m = 0.025, and c m = 0.005 respectively. The edge set ES005/ES025 essentially treats all edges corresponding to an absolute partial correlation > 0.005/0.025 as important and other edges as unimportant.
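The partial correlation between nodes i and j is ρ_ij = −ω_ij / √(ω_ii ω_jj), so the true edge sets ES025 and ES005 can be computed by thresholding its absolute value. A minimal sketch with our own names:

```python
import numpy as np

def true_edge_set(omega, c_m):
    """Edges whose absolute partial correlation exceeds the threshold c_m."""
    d = np.sqrt(np.diag(omega))
    pcorr = -omega / np.outer(d, d)      # rho_ij = -omega_ij / sqrt(omega_ii omega_jj)
    np.fill_diagonal(pcorr, 1.0)
    mask = np.triu(np.abs(pcorr) > c_m, k=1)
    return set(zip(*np.where(mask)))
```

Calling `true_edge_set(omega, 0.025)` and `true_edge_set(omega, 0.005)` yields ES025 and ES005, respectively.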
For our approach, the point estimate is obtained under a false discovery rate of 0.2, whereas the estimates for the frequentist approaches were obtained by minimizing the Bayesian information criterion (BIC), as in Yuan and Lin (2007). The point estimate of the graph under the discrete mixture approach corresponds to all edges having posterior inclusion probability ≥ 0.5, while the heuristic thresholding method of Wang (2012) is adopted for the Bayesian graphical lasso.

Results
For comparing results, we look at the area under the curve as well as the sensitivity and specificity levels under the point estimate for the graph. We also report the sensitivity corresponding to a specificity of 90% for p > n cases. The results are presented in Tables 1-2, and the ROC curves for Case I are illustrated in Figure 2. In addition, we examined if the reported metrics are significantly better under one particular method compared to others using a permutation test. We note that it was not feasible to obtain results for competing Bayesian approaches under p > n settings due to an unrealistic computational burden.
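The text does not specify the exact permutation scheme; a common choice for comparing a metric (such as the area under the curve) between two methods across paired replicates is a sign-flip permutation test on the per-replicate differences, sketched below with our own names.

```python
import numpy as np

def paired_permutation_pvalue(a, b, n_perm=10000, rng=None):
    """Two-sided paired permutation test for the mean difference of a metric
    recorded for two methods on the same replicates, via random sign flips."""
    rng = np.random.default_rng(rng)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    obs = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm = np.abs((signs * d).mean(axis=1))
    # add-one correction keeps the p-value strictly positive
    return (1 + np.sum(perm >= obs)) / (n_perm + 1)
```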
Under Case I, we find that both the proposed approach and the HC2015 method have a significantly higher area under the curve than the other approaches. Moreover, the proposed approach has significantly higher sensitivity than all approaches except, in some cases, the Bayesian graphical lasso, with the latter reporting denser graphs having low specificity levels. On the other hand, the proposed method has slightly lower specificity than some of the other approaches, which report much sparser graphs. For p > n settings, the sensitivity corresponding to a specificity of 90% is very similar under the proposed method, HC2015, and the graphical lasso, but is significantly lower under the penalized neighborhood selection approach (MB). Similar conclusions hold when the data are generated under Case III, from a bimodal distribution.
For Case IV, involving sparse true graphs, the proposed approach performs significantly better than all other approaches in terms of area under the curve, whereas GL and the proposed approach have equivalent sensitivity levels at a specificity of 90%, significantly higher than both the MB method and the Hahn and Carvalho (2015) approach. Under Case IV, the Hahn and Carvalho (2015) approach shows the least favorable performance in terms of area under the curve and sensitivity at 90% specificity.
In addition, the limitations of the discrete mixture approaches are evident from significantly lower areas under the curve and low sensitivity levels in Case I, in spite of the true graph being decomposable. Moreover, the results are highly sensitive to different choices of the initial adjacency matrices, and somewhat unstable for higher dimensions under finite runs of the Markov chain Monte Carlo. We note that a similar unstable behavior was reported by Scott and Carvalho (2008) for Metropolis-based approaches under discrete mixture priors.
Finally, we note that the proposed approach is several orders of magnitude faster compared to other Bayesian graphical modeling approaches. As noted above, we have difficulty implementing the competing Bayesian approaches for large dimensions due to an unrealistically large net computation time. More details, including the computation times for Bayesian approaches under Case I are presented in the Supplementary Materials.

Comments
From the simulation results, it is clear that the proposed approach (i) compares favorably to, and often dominates, competing Bayesian approaches in terms of true graph recovery in large dimensions; (ii) has a higher area under the curve compared to penalized approaches in several cases, and a similar performance in other scenarios; and (iii) has demonstrably better computational efficiency compared to other Bayesian approaches, with the latter approaches quickly becoming computationally infeasible for increasing p.

Application to cancer genomics
We consider the problem of inferring the association networks between microRNAs, or miRNAs, and the corresponding target genes or messenger RNAs, or mRNAs. MiRNAs are small non-coding RNA molecules, which regulate gene expression levels by silencing the target mRNAs. Our motivating dataset is derived from The Cancer Genome Atlas based study of glioblastoma multiforme, a rapidly growing malignant brain tumor that is the most common such tumor in adults. mRNAs and miRNAs play complementary roles in the development and disease progression of this tumor (Tang et al., 2013). The main scientific question of interest is to find major miRNA regulators of mRNA expression in individuals with tumors, by jointly analyzing mRNA and miRNA data. For our analyses, we focus on a set of 49 genes mapped to core pathways implicated in glioblastoma multiforme such as the receptor tyrosine kinase, phosphatidylinositol-3-OH kinase and retinoblastoma pathways (Cancer Genome Atlas Research Network, 2008).
For inferring the gene regulatory network, we chose the top 200 prognostic miRNAs from a list of 538 candidate miRNAs, in addition to the 49 mRNAs, with the measurements being obtained from n = 280 samples. We fit the regularized inverse-Wishart model to the combined mRNA and miRNA measurements having dimension p = 249. We use η = 0.2 in our false discovery rate based approach for edge selection. In Table 6, we provide a ranked list of important miRNAs having negative partial correlations with their target mRNAs, along with the magnitude and 99% credible intervals of the partial correlation. We omit those mRNAs not having negative associations with any miRNA. In biological terms, the partial correlation between a miRNA and its target mRNA measures the association between the two, after accounting for the remaining mRNAs and miRNAs, with negative associations implying a down-regulation of target mRNAs. The credible intervals for the partial correlation provide a measure of uncertainty for the ranking of important miRNAs, with overlapping credible intervals for two miRNAs implying a potential change in rankings under different experimental and biological conditions for a particular target mRNA. In addition, we find that KSHV-K12-9 down-regulates several mRNA expressions, and this miRNA is known to be associated with glioblastoma (Delfino et al., 2011).
We also identify several hub genes based on mRNA expressions, EGFR, RAF1, NF1, SPRY2, CDKN2A, and PIK3C2G, with such nodes having greater than 8 neighbors. Some of the highly connected genes such as PIK3C2G, EGFR and CDKN2A, with 14, 9 and 9 connections respectively, have been previously shown to be associated with glioblastoma (Dong et al., 2010; Wong et al., 1992). We further explored the biological implications of our results using Ingenuity Pathway Analysis (IPA version 16542223), which identified a number of enriched pathways including GLIOMA, GBM, PTEN signaling and other cancer related molecular mechanisms. Most of these genes encode proteins critical to cellular functions such as DNA recombination and repair, cellular development, cell cycle and connective tissue development, which may be attributed to their highly connected nature. The estimated graph is shown in Figure 4.
The estimated miRNA graph had 9 hub nodes each having greater than 10 neighbors, while 107 nodes did not have any neighbors. These hub nodes were ebv-mir-bart7, hsa-mir-106a, hsa-mir-142-3p, hsa-mir-17-5p, hsa-mir-let7, hsa-mir-181c, hsa-mir-184, hsa-mir-19a, and hsa-mir-20a. Analogous to the gene expression analysis, a similar IPA analysis of the miRNAs with at least 4 neighbors (based on partial correlations) suggests that they are critical for various cellular processes in cancer progression. The selected molecules modulate important transcription factors and signaling molecules including genes such as MYC, CCLE1 and CLDND1, which have been shown to be associated with cancer, inflammatory response and connective tissue disorders. These findings concur with studies that have indicated associations between GBM and significant modulation of listed miRs such as miR-106 (26 connections), miR-184 (15 connections) and miR-let-7a (26 connections), as in Lee et al. (2011).

Discussion
In summary, we propose a novel graphical model estimation approach which fits a conjugate model to the inverse covariance matrix that shrinks the off-diagonal elements corresponding to absent edges to zero, and subsequently uses a post-processing decision theoretic step involving neighborhood selection which infers absent edges based on penalized credible regions. By decoupling model fitting and selection, the proposed method overcomes several difficulties for existing Bayesian graphical modeling approaches such as sensitivity to the prior on the model space, assumptions on the underlying graphical structure, and mixing issues associated with discrete mixture approaches. In addition, it has attractive theoretical properties and is scalable to high dimensional settings. Our work makes a timely and important contribution to the sparse but appealing body of work involving systematic decision theoretic Bayesian approaches for efficient graphical model computation.
We demonstrate the numerical advantages of the proposed approach over commonly used Bayesian and penalized approaches using extensive simulation studies. The method is applied to an important problem of target prediction using genomic data, where the main scientific question of interest is to find major miRNA regulators of mRNA expression in individuals with tumors and to characterize the uncertainty of the resulting estimators, by jointly analyzing mRNA and miRNA data. Our inferential methodology gives a holistic systems-level view of the mRNA-miRNA regulatory mechanisms using graphical modeling by detecting miRNA targets conditional on other miRNAs and mRNAs in the relevant biological pathways. In addition, our framework provides a probabilistic quantification of potential targets that can be used for miRNA target ranking and subsequent biological and experimental validation.
While the approach provides an efficient computational method for graphical model selection, there are some potential limitations. For example, it does not immediately yield a positive definite precision matrix with exact zeros corresponding to absent edges; such an estimate is only obtained after implementing an additional post-processing algorithm. Moreover, the current method scales to high dimensions involving several hundred nodes, but may not be scalable to the extremely high dimensions involving tens of thousands of nodes encountered in genome-wide association studies. In future work, our goal is to extend the methodology, computation, and theory to p >> n settings. In addition, we would like to generalize the regularized inverse-Wishart shrinkage prior to a more general class of priors which induce different types of shrinkage marginally on the elements of the inverse covariance matrix.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure 2:
ROC curves for the true edge set corresponding to an absolute partial correlation threshold of 0.025, for p < n settings under Case I. The thick solid line is the regularized inverse-Wishart, dashes the frequentist graphical lasso, dots the Bayesian graphical lasso, dots and dashes the Bayesian adaptive graphical lasso, long dashes the hyper inverse-Wishart, and the thin black line corresponds to Meinshausen and Bühlmann (2006). Kundu et al. Page 24 Bayesian Anal. Author manuscript; available in PMC 2020 October 28.

Figure 3:
Parent miRNAs depicted with the strength of negative partial correlations along with 99% credible intervals for important mRNA-miRNA associations. The mRNAs are labeled on the horizontal axis, while the miRNAs are labeled on the vertical axis. Refer to Table 5 in the main article.

Figure 4:
Estimated graph for the mRNA-miRNA network. Black nodes represent mRNAs, and green nodes miRNAs. The edges reflect connections between pairs of miRNAs and pairs of mRNAs, and also miRNA-mRNA connections corresponding to negative partial correlations. In biological terms, the partial correlation between a miRNA and its target mRNA measures the association between the two, after accounting for the remaining mRNAs and miRNAs, with negative associations implying a down-regulation of target mRNAs. Our analysis identified several hub genes based on mRNA expressions, EGFR, RAF1, NF1, SPRY2, CDKN2A, and PIK3C2G, with such nodes having greater than 8 neighbors. In addition, there were 9 hub miRNAs each having greater than 10 neighbors, while 107 miRNAs did not have any neighbors. These hub nodes were ebv-mir-bart7, hsa-mir-106a, hsa-mir-142-3p, hsa-mir-17-5p, hsa-mir-let7, hsa-mir-181c, hsa-mir-184, hsa-mir-19a, and hsa-mir-20a. Such hub nodes are known to modulate important transcription factors and signaling molecules including genes such as MYC, CCLE1 and CLDND1, which have been shown to be associated with cancer, inflammatory response and connective tissue disorders.

Area under the curve for true edge sets ES005 (ROC005) and ES025 (ROC025) under Case I for p < n, along with the sensitivity and specificity under the point estimate. The largest squared standard errors across rows corresponding to the area under the curve are 0.01, 0.008, 0.01, 0.01, 0.01, 0.008, 0.07 and 0.03 for p < n. The reported estimates are inflated by 100. RIW, IW, BGLA, BGAD, HIW, GL, MB, and HC refer to the regularized inverse-Wishart, inverse-Wishart, Bayesian graphical lasso, Bayesian adaptive graphical lasso, hyper inverse-Wishart, frequentist graphical lasso, Meinshausen and Bühlmann (2006) method, and the approach by Hahn and Carvalho (2015), respectively. Results based on 50 replicates.