Regression‐based Bayesian estimation and structure learning for nonparanormal graphical models

Abstract A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables in which the variables are assumed to follow a Gaussian graphical model only after some unknown smooth monotone transformations. We consider a Bayesian approach to inference in a nonparanormal graphical model in which we put priors on the unknown transformations through a random series based on B-splines. We use a regression formulation to construct the likelihood through the Cholesky decomposition of the underlying precision matrix of the transformed variables and put shrinkage priors on the regression coefficients. We apply a plug-in variational Bayesian algorithm for learning the sparse precision matrix and compare its performance to a posterior Gibbs sampling scheme in a simulation study. Finally, we apply the proposed methods to a microarray dataset. The proposed methods perform better as the dimension increases; in particular, the variational Bayesian approach has the potential to speed up estimation in the Bayesian nonparanormal graphical model, without the Gaussianity assumption, while retaining the information needed to construct the graph.


INTRODUCTION
The Gaussian graphical model (GGM) is a mathematical model commonly used to describe conditional independence relationships among normally distributed random variables. The estimation of the underlying graph in a GGM is known as structure learning. Zeros in the inverse covariance matrix, or precision matrix, indicate that the corresponding variables are conditionally independent given the rest of the variables in the dataset, and this relationship is represented by the absence of an edge in the graph. Similarly, nonzero entries in the precision matrix are represented by edges in the graph and correspond to conditionally dependent variables. Thus, an assumed sparsity condition is used to learn the conditional dependence structure in a GGM. An extension of the GGM is the nonparanormal graphical model [26], in which the random variables are replaced by transformed variables that are assumed to be normally distributed. Liu et al. [26] use a truncated empirical distribution function to estimate the transformation functions and then estimate the precision matrix of the transformed variables using the graphical lasso. A Bayesian method for the nonparanormal graphical model [36] uses a random series prior based on B-splines to estimate the transformation functions and a Student-t spike-and-slab prior to estimate the resulting precision matrix. These extensions differ from Gaussian copula graphical models [14,25,33,43] in that the nonparanormal graphical model concurrently estimates the transformation functions and the precision matrix. Nonparanormal graphical model approaches have been applied to discrete data models of interactions between genes [38] and to test differential gene networks [57].
Estimation of a sparse precision matrix is necessary to learn the structure in GGMs and nonparanormal graphical models. For unstructured precision matrices, a commonly used algorithm in the frequentist literature is the graphical lasso [16]. Many algorithms have been proposed to solve this problem, including [2,13,16,28,31,32,45,46,54,56].
Analogous methods in the Bayesian literature use priors to aid the edge selection procedure. For instance, off-diagonal entries of the precision matrix may be set to zero by allowing a point mass at zero in the prior [3], but the resulting posterior is harder to compute or sample from. A normal spike-and-slab prior [51] replaces the point mass at zero by a highly concentrated normal distribution around zero; similarly, a Laplace spike-and-slab prior [17] has been used. From a computational point of view, continuous shrinkage priors, such as the horseshoe prior [9], the Dirichlet-Laplace prior [6], and the generalized double exponential prior [1], bring in the effects of both a point mass and a thick tail through a single continuous distribution with an infinite spike at zero.
Ideally, we seek solutions that guarantee a sparse positive definite matrix using continuous shrinkage priors. Since continuous shrinkage priors do not assign exact zeros, a variable selection procedure needs to be used to determine which of the small, nonzero elements should be set exactly to zero. Methods that use spike-and-slab priors naturally incorporate variable selection, whereas methods that use alternative priors need a thresholding procedure. However, post-hoc thresholding procedures do not guarantee a positive definite precision matrix. The methods in [50,51] guarantee a positive definite matrix by way of the sampling algorithm. [42,50] use the double exponential prior and improve on its use for sparsity by allowing each double exponential prior to have its own shrinkage parameter. More recent methods estimate the inverse covariance matrix by using the normal spike-and-slab prior [22,23,41,51] for variable selection in the graphical model context. Lastly, Williams et al. [53] construct Gaussian graphical models by estimating the partial correlation matrix using a horseshoe prior for regularization and, for sparsity, using projection predictive selection, a method that allows for variable exclusion based on predictive utility, with good results.
The purpose of this paper is to explore the use of a Cholesky decomposition, the horseshoe prior, and variational Bayesian (VB) techniques to construct a Bayesian nonparanormal graphical model. Utilizing a Cholesky decomposition is an alternative way to incorporate the positive definiteness constraint on precision matrices, but it depends strongly on the ordering of the variables [44]. We consider a prior based on the Cholesky decomposition of the precision matrix that reduces this dependence. We derive a sparsity constraint that ensures a weak order invariance: it maintains the same degree of sparsity across the rows of the precision matrix by increasing the degree of sparsity down the rows of the lower triangular matrix. We construct a pseudo-likelihood through the regression of each variable on the preceding ones. The approach splits the very high dimensional original problem into several lower dimensional ones. The method in [55] is also based on the Cholesky decomposition, but it uses a noninformative Jeffreys prior and does not address the ordering issue of the Cholesky decomposition.
We consider two different priors, the horseshoe and the Bernoulli-Gaussian [47]. These priors have clear interpretations in terms of the probability of nonzero elements [47,48], which allows us to calibrate sparsity effectively. The strength of the Bernoulli-Gaussian prior is that it leads to a sparse positive definite precision matrix that does not require thresholding, while the strength of the horseshoe prior is that it models sparsity better than the Bernoulli-Gaussian prior due to its heavier tails. Horseshoe priors have not yet been considered for Bayesian nonparanormal graphical models that use transformation functions. We compare the performance of the methods using both a VB algorithm and a full Markov chain Monte Carlo (MCMC) sampling scheme. Mean field variational Bayes [18,49] is an alternative to MCMC that allows for faster fitting by deterministic optimization. A VB method for Gaussian graphical models is developed in [10], and an expectation conditional-maximization approach is used by Li and McCormick [23] in Gaussian copula graphical models. This approach has not yet been explored in the setting of a nonparanormal graphical model. We wish to determine whether we can retain the information learned in a Bayesian nonparanormal graphical model while speeding up the estimation process using VB techniques.
The paper is organized as follows. In the next section, we describe the model and the sparsity constraint. In Section 3, we describe the VB algorithm. In Sections 4 and 5, we discuss particular priors and their corresponding MCMC algorithms. In Section 6, we describe a thresholding procedure, and in Section 7, we detail the tuning procedure. In Section 8, we present a simulation study. In Section 9, we describe a real data application.

Nonparanormal transformation
A random vector $\boldsymbol{X} = (X_1, \ldots, X_p)'$ follows a nonparanormal distribution, written $\boldsymbol{X} \sim \mathrm{NPN}(\boldsymbol{\mu}, \Sigma, f)$, if $f(\boldsymbol{X}) = (f_1(X_1), \ldots, f_p(X_p))' \sim \mathrm{N}_p(\boldsymbol{\mu}, \Sigma)$, a normal distribution with mean $\boldsymbol{\mu}$, covariance matrix $\Sigma$, and dimension $p$, and where $f = (f_1, \ldots, f_p)$ is a vector of unknown smooth monotone transformation functions. We put prior distributions on the unknown transformation functions through a random series based on B-splines. In [36], we have described the prior distributions, including the motivation and support for the choices made, in greater detail. We briefly describe the prior in this section. We represent the transformation functions as $f_d(x) = \sum_{j=1}^{J} \theta_{dj} B_j(x)$, where the $\theta_{dj}$ are coefficients, the $B_j(\cdot)$ are the B-spline basis functions, $d = 1, \ldots, p$, $j = 1, \ldots, J$, and $J$ is the number of B-spline basis functions used in the expansion. We assume that the precision matrix $\Omega = \Sigma^{-1}$ is sparse, in that most of its off-diagonal entries are zero. However, the model is not identifiable, since location-scale changes in the transformation functions and the normal distributions can cancel each other. To resolve the issue, one possibility is to fix the mean vector at zero and assume that the covariance matrix is a correlation matrix, but putting a prior on such a matrix while maintaining sparsity of its inverse appears inconvenient. Therefore, we let the mean and the precision matrix be free parameters while putting restrictions on the transformations. We begin with a normal prior on the coefficients of the B-splines, $\boldsymbol{\theta}_d = (\theta_{d1}, \ldots, \theta_{dJ})' \sim \mathrm{N}_J(\boldsymbol{\zeta}, \sigma_0^2 I)$, where $\sigma_0^2$ is some positive constant, $\boldsymbol{\zeta}$ is some vector of constants, and $I$ is the identity matrix, and impose a monotonicity restriction on them to make the transformation $f_d$ monotone (see below for details). We impose two linear constraints on the coefficients through function values of the transformations, which can be written in matrix/vector form as $A\boldsymbol{\theta}_d = \boldsymbol{c}$ for each $d = 1, \ldots, p$. The linear nature of the constraints allows us to retain the joint normality of the coefficient vectors before the monotonicity restriction, and hence a truncated joint normal distribution after the restriction is imposed.
By the properties of B-spline basis functions, if the B-spline coefficients $\theta_{dj}$ are increasing in $j$, then $f_d$ is an increasing function. We thus impose the monotonicity constraint on the coefficients, which is equivalent to the series of inequalities $\theta_{d2} - \theta_{d1} > 0, \ldots, \theta_{dJ} - \theta_{d,J-1} > 0$. The monotonicity constraint can be expressed in matrix/vector form as $F\boldsymbol{\theta}_d > 0$ for each $d = 1, \ldots, p$. Thus, the prior on the coefficients before the truncation is imposed is given by $\boldsymbol{\theta}_d \mid \{A\boldsymbol{\theta}_d = \boldsymbol{c}\} \sim \mathrm{N}_J(\boldsymbol{\xi}, \Gamma)$, where $\boldsymbol{\xi}$ and $\Gamma$ are the prior mean and covariance conditioned on the linear constraints. To ensure that we have a Lebesgue density on $\mathbb{R}^{J-2}$, we work with a dimension-reduced coefficient vector obtained by removing two coefficients, and we denote this reduction with a bar over the vector and matrix. The final prior on the coefficients is given by the truncated normal distribution $\bar{\boldsymbol{\theta}}_d \mid \{A\boldsymbol{\theta}_d = \boldsymbol{c}\} \sim \mathrm{TN}_{J-2}(\bar{\boldsymbol{\xi}}, \bar{\Gamma}, \mathcal{T})$, where $\bar{\boldsymbol{\theta}}_d$ is the dimension-reduced coefficient vector, $\bar{\boldsymbol{\xi}}$ is the dimension-reduced mean vector, $\bar{\Gamma}$ is the dimension-reduced covariance matrix, and the restriction is $\mathcal{T} = \{\bar{\boldsymbol{\theta}}_d : \bar{F}\bar{\boldsymbol{\theta}}_d > \bar{\boldsymbol{g}}\}$, where $\bar{F}$ is the dimension-reduced matrix of the monotonicity constraints and $\bar{\boldsymbol{g}}$ is the dimension-reduced vector of constants pertaining to the monotonicity constraints. In general, we denote a truncated normal distribution by $\mathrm{TN}_p(\boldsymbol{\mu}, \Sigma, \mathcal{T})$ with mean $\boldsymbol{\mu}$, covariance matrix $\Sigma$, restriction $\mathcal{T}$, and dimension $p$. Any choice of $\boldsymbol{\zeta}$ is acceptable, but we use increasing components built from standard normal quantiles, $\zeta_j = a + b\,\Phi^{-1}(t_j)$ for an increasing grid $t_j$ in $(0,1)$, where $a$ is a constant, $b$ is a positive constant, and $\Phi^{-1}$ is the inverse of the cumulative distribution function of the standard normal distribution. The idea is that with increasing components of the prior mean vector $\boldsymbol{\zeta}$, the truncation set $\mathcal{T}$ in the final prior of the B-spline coefficients will have a substantial prior probability. Finally, we put an improper uniform prior on the mean, $p(\boldsymbol{\mu}) = \prod_{d=1}^{p} p_d(\mu_d) \propto 1$. The resulting transformed variables, $Z_d = Y_d - \mu_d$ with $Y_d = \sum_{j=1}^{J} \theta_{dj} B_j(X_d)$, $d = 1, \ldots, p$, which are assumed to be distributed as $\mathrm{N}(\boldsymbol{0}, \Omega^{-1})$, are used to estimate the precision matrix and learn the structure of the underlying graph.
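To make the construction concrete, the sketch below (in Python) draws one monotone transformation under simplifying assumptions of ours: equally spaced knots on [0, 1], illustrative constants $a$ and $b$, the quantile grid $t_j = (j - 1/2)/J$, and a simple sort of the coefficients standing in for a draw from the truncated normal prior; the linear identifiability constraints are omitted.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import norm

J, degree = 8, 3                          # number of basis functions; cubic B-splines
# Clamped, equally spaced knot sequence on [0, 1]
inner = np.linspace(0, 1, J - degree + 1)
knots = np.concatenate(([0.0] * degree, inner, [1.0] * degree))

# Increasing prior mean from standard normal quantiles (a, b, t_j illustrative)
a, b = 1.0, 1.0
t = (np.arange(1, J + 1) - 0.5) / J
zeta = a + b * norm.ppf(t)

rng = np.random.default_rng(0)
# Sorting enforces increasing coefficients; the paper instead samples from a
# truncated normal restricted to the monotone set {F theta > 0}
theta = np.sort(zeta + rng.standard_normal(J))

f_d = BSpline(knots, theta, degree)       # increasing coefficients => monotone f_d
x = np.linspace(0, 1, 200)
print(np.all(np.diff(f_d(x)) > 0))        # True: the transformation is increasing
```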

Cholesky decomposition reformulated as regression problems
We learn the structure of the precision matrix using a Cholesky decomposition. Denote the Cholesky decomposition of $\Omega$ as $\Omega = LL'$, where $L$ is a lower triangular matrix with elements $l_{kd}$. Define the coefficients $\beta_{kd} = -l_{kd}/l_{dd}$ and the precisions $\psi_d = 1/\sigma_d^2 = l_{dd}^2$, where $d = 1, \ldots, p$. Then, as described in [55], the lower triangular entries of $\Omega$, denoted $\Omega_{kd}$ for $k \ge d$, are given by $\Omega_{kd} = \sum_{j=1}^{d} l_{kj} l_{dj}$. Accordingly, the multivariate Gaussian model $Z \sim \mathrm{N}(0, \Sigma)$ is equivalent to the set of independent regression problems
$$Z_d = \sum_{k=d+1}^{p} \beta_{kd} Z_k + \boldsymbol{\epsilon}_d, \qquad \boldsymbol{\epsilon}_d \sim \mathrm{N}(0, \sigma_d^2 I),$$
where the $\beta_{kd}$ are the regression coefficients for $k = d+1, \ldots, p$ and $d = 1, \ldots, p$, and $Z_d$ and $Z_k$ are, respectively, the $d$th and the $k$th columns of the matrix $Z$. We use the notation $k > d$ to indicate that the columns are greater than the $d$th column. We use a standard conjugate noninformative prior on the variances. We consider two different continuous shrinkage priors on the regression coefficients, the horseshoe prior and the Bernoulli-Gaussian prior. Using these priors, we enforce a sparsity constraint along the rows of the lower triangular matrix, in which the global sparsity parameter of the continuous shrinkage prior is scaled by $\sqrt{k}$, where $k > d$ and $d = 1, \ldots, p$. Using this constraint, we expect the precision matrix to be sparse through weak order invariance. The sparsity constraint is derived in the next section.
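The reparameterization can be checked numerically; the short sketch below (with a small assumed precision matrix) recovers $\beta_{kd}$ and $\psi_d$ from the Cholesky factor and rebuilds $\Omega$ from them.

```python
import numpy as np

# Small example precision matrix, assumed for illustration
Omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.5],
                  [ 0.0, -0.5,  1.5]])

L = np.linalg.cholesky(Omega)        # Omega = L L'
psi = np.diag(L) ** 2                # psi_d = 1 / sigma_d^2 = l_dd^2
beta = -L / np.diag(L)               # beta_kd = -l_kd / l_dd, column d scaled by l_dd

# Rebuild L from (beta, psi): l_kd = -beta_kd * sqrt(psi_d), and recover Omega,
# mirroring the regression Z_d = sum_{k>d} beta_kd Z_k + eps_d, eps_d ~ N(0, 1/psi_d)
L_rebuilt = -beta * np.sqrt(psi)
print(np.allclose(L_rebuilt @ L_rebuilt.T, Omega))   # True
```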

Sparsity constraint
In order to ensure that the probability that an entry is nonzero (i.e., the degree of sparsity) remains roughly the same over different rows, we cannot simply impose the same degree of sparsity on the rows of the Cholesky factor $L$, but need to change it over rows appropriately. Denote probability by $\mathrm{P}(\cdot)$. To see how the Cholesky factor $L$ should depend on the row index, observe that each entry in the $k$th row of $\Omega$ is a sum of products of entries of $L$, so its chance of being nonzero accumulates along the row. Let $\pi_k = \mathrm{P}(\text{nonzero entry in the } k\text{th row of } L)$. Taking $\pi_k \propto 1/\sqrt{k}$ then makes the probability of a nonzero entry in a row of $\Omega$ approximately $1 - \exp(-c_p)$, where $c_p$ depends on $p$ but not on $k$. Furthermore, choosing $c_p$ to be small as $p \to \infty$ makes this probability small, which is essential in higher dimensions. We choose $\pi_k = \mathrm{P}(\text{nonzero in the } k\text{th row}) = c/(p\sqrt{k})$ and tune the value of $c \in \{0.1, 1, 10\}$ to cover a range of three orders of magnitude, that is, $10^{-1}, 10^{0}, 10^{1}$. For example, with $p = 100$ and $c = 1$, the constraint gives $\pi_2 = 1/(100\sqrt{2}) \approx 0.0071$ for the second row and $\pi_{100} = 0.001$ for the last row, so later rows of $L$ are held to a stricter sparsity level.

VARIATIONAL BAYES ESTIMATION
We observe $n$ independent samples, $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$, from the nonparanormal model $\mathrm{NPN}(\boldsymbol{\mu}, \Omega^{-1}, f)$ with a sparse $\Omega$. Based on these observations and the prior described in Section 2.1, we intend to compute the posterior distribution to make inferences about $\Omega$ and its structure, using the transformations $f$. Ideally, we would want to construct a complete VB algorithm in which the B-spline coefficients, mean, and inverse covariance matrix are all estimated in one setting. However, for our problem, there is no closed form solution for the truncated multivariate normal distribution, and closed form solutions are needed for mean field VB algorithms. Instead, we use an exact Hamiltonian Monte Carlo within Gibbs scheme to sample the B-spline coefficients and the mean. We obtain the Bayes estimate of the B-spline coefficients, $\hat{\boldsymbol{\theta}}_d = \mathrm{E}(\boldsymbol{\theta}_d \mid \boldsymbol{X}_1, \ldots, \boldsymbol{X}_n)$, and the Bayes estimate of the mean, $\hat{\mu}_d = \mathrm{E}(\mu_d \mid \boldsymbol{X}_1, \ldots, \boldsymbol{X}_n)$, where $\mathrm{E}(\cdot \mid \boldsymbol{X}_1, \ldots, \boldsymbol{X}_n)$ is the posterior mean operator. We then apply the VB method on the synthetic data obtained by transforming the original observations using the estimated transformations. Thus we estimate the transformed variables using $\hat{Z}_{id} = \sum_{j=1}^{J} \hat{\theta}_{dj} B_j(X_{id}) - \hat{\mu}_d$. Ideally, instead of plugging in, one could obtain samples from the posterior distributions of the transformations, draw samples from the variational distributions of the precision matrix for each generated sample, and accumulate them. However, even in moderately high dimensions, such an approach is extremely computationally intensive. Since the posterior distributions of the transformations are consistent [36], they concentrate near the Bayes estimate. As the main goal is structure learning, the inability of the plug-in approach to assess the posterior variability of the transformations is not a highly deterring issue. Thus, although the proposed algorithm is not fully Bayesian, it utilizes the strength of the VB approach to identify conditional independence relations in a nonparanormal graphical model within a manageable time. While variational inference generally underestimates the posterior variance [8], this affects the quality of uncertainty quantification, whereas that of estimation is hardly compromised. Moreover, since the goal of structure learning is to identify zero or nearly zero elements in the precision matrix, the main purpose is not affected at all. We illustrate the variational method on the Bernoulli-Gaussian prior, following the strategy described in [39]. Let the Bernoulli distribution be denoted by $\mathrm{Ber}$ and the inverse gamma distribution by $\mathrm{IG}(A, B)$ with shape parameter $A$ and scale parameter $B$. We describe the joint model by
$$Z_d \mid \boldsymbol{\beta}_d, \boldsymbol{\gamma}_d, \sigma_d^2 \sim \mathrm{N}\big(Z_{k>d}(\boldsymbol{\gamma}_d \bullet \boldsymbol{\beta}_d), \sigma_d^2 I\big), \quad \beta_{kd} \sim \mathrm{N}(0, g^2), \quad \gamma_{kd} \sim \mathrm{Ber}(\pi^*_{kd}), \quad \sigma_d^2 \sim \mathrm{IG}(A, B), \qquad (2)$$
where $\boldsymbol{\beta}_d = (\beta_{d+1,d}, \ldots, \beta_{p,d})'$ is the vector of regression coefficients, $Z_{k>d}$ is the matrix of transformations, $\boldsymbol{\gamma}_d$ is a binary indicator vector of 0s and 1s modeled by the Bernoulli distribution with elements $\gamma_{kd}$, and $\bullet$ denotes the elementwise (Hadamard) product. The hyperparameters $g^2$, $A$, and $B$ are fixed, and $\pi^*_{kd} \in [0, 1]$ controls the sparsity. This variant of the spike-and-slab prior models sparsity on the regression coefficients indirectly, by putting a binary indicator on the regression coefficients in the likelihood, instead of directly modeling sparsity on the regression coefficients. As such, if $\gamma_{kd} = 0$ for the Bernoulli-Gaussian prior, then $\beta_{kd} \mid \gamma_{kd} \sim \mathrm{N}(0, g^2)$, unlike in usual spike-and-slab priors in which $\beta_{kd}$ would be exactly equal to 0. We select $\pi^*_{kd}$ using a tuning procedure that incorporates the sparsity constraint and is discussed in Section 3.1.
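As a concrete illustration of Equation (2), the following sketch simulates a single regression problem under the Bernoulli-Gaussian model; the inclusion probability $\pi^*_{kd}$ is set to an arbitrary value here, whereas in the paper it is chosen by the tuning procedure of Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 10                         # sample size; number of predictors (columns k > d)
g2, A, B = 10.0, 0.01, 0.01           # hyperparameter values used in the simulation study
pi_star = 0.2                         # illustrative inclusion probability pi*_kd

Z_pred = rng.standard_normal((n, m))            # stand-in for the columns Z_k, k > d
beta = rng.normal(0.0, np.sqrt(g2), size=m)     # beta_kd ~ N(0, g^2): never exactly zero
gamma = rng.binomial(1, pi_star, size=m)        # gamma_kd ~ Ber(pi*_kd)
sigma2 = 1.0 / rng.gamma(A, 1.0 / B)            # sigma_d^2 ~ IG(A, B)

# Sparsity enters only through gamma * beta inside the likelihood
Z_d = Z_pred @ (gamma * beta) + rng.normal(0.0, np.sqrt(sigma2), size=n)
```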
The joint posterior distribution that we aim to compute is proportional to the product of the likelihood and the priors in Equation (2), namely $p(\boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\sigma}^2 \mid Z) \propto \prod_{d=1}^{p} p(Z_d \mid \boldsymbol{\beta}_d, \boldsymbol{\gamma}_d, \sigma_d^2)\, p(\boldsymbol{\beta}_d)\, p(\boldsymbol{\gamma}_d)\, p(\sigma_d^2)$.
By plugging in the estimated transformed variables, we use a VB algorithm to compute the posterior distribution of the sparse precision matrix. Mean field VB inference involves minimizing the Kullback-Leibler divergence between the true posterior distribution and a factorized approximation of the posterior. Let $\boldsymbol{\eta}$ represent the set of parameters in the model and $Z$ represent the matrix of estimated transformed variables.
The approximation takes the form $q(\boldsymbol{\eta}) = \prod_{k=1}^{K} q_k(\eta_k)$, where $\{\eta_1, \ldots, \eta_K\}$ is a partition of $\boldsymbol{\eta}$. The optimal $q_k$ densities satisfy
$$q_k^*(\eta_k) \propto \exp\big\{\mathrm{E}_{\setminus q_k}\big[\log p(Z, \boldsymbol{\eta})\big]\big\},$$
where $\mathrm{E}_{\setminus q_k}(\cdot)$ is the expectation with respect to all densities except $q_k(\eta_k)$ [7]. The variational lower bound (VLB) for the marginal likelihood of $Z$ is then given by
$$\mathrm{VLB} = \mathrm{E}_{q}\big[\log p(Z, \boldsymbol{\eta})\big] - \mathrm{E}_{q}\big[\log q(\boldsymbol{\eta})\big],$$
where $\mathrm{E}_{q}$ is the expectation with respect to the density $q$. Using the coordinate ascent method, optimizing each $q_k$ while holding the others fixed results in the algorithm converging to a local maximum of the lower bound.
Following [39], we factorize the VB approximation over the regression coefficients, the binary indicators, and the residual variances, with normal, Bernoulli, and inverse gamma variational factors, respectively, for some choice of variational parameters. The parameters are obtained by optimizing the VLB with respect to them by coordinate ascent; these optimizations, called variational updates, can be derived as in [39]. Introduce the notations $\mathrm{expit}(x) = \exp(x)/\{1 + \exp(x)\}$ and $\mathrm{logit}(x) = \log(x/(1-x))$, and recall that $\bullet$ denotes the Hadamard product between two matrices. We use the notation $l > k$ to indicate that the columns are greater than the $k$th column, and $\#(k > d)$ means the number of columns $k$ higher than $d$. The resulting VB algorithm (Algorithm 1) is detailed in Appendix C.
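The sketch below shows the coordinate-ascent structure for one regression problem, using generic spike-and-slab updates in the style of Carbonetto and Stephens with $\sigma_d^2$ and $g^2$ held fixed; it is a simplified stand-in for Algorithm 1, which also updates an inverse gamma factor for $\sigma_d^2$ and monitors the VLB for convergence.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def cavi_bernoulli_gaussian(X, y, sigma2=1.0, g2=10.0, pi_star=0.1,
                            tol=1e-6, max_iter=500):
    """Schematic CAVI for y = X (gamma * beta) + eps with fixed sigma2 and g2.
    Variational factors: q(beta_j | gamma_j = 1) = N(mu_j, s2_j), q(gamma_j = 1) = w_j.
    These are generic updates, not the exact ones derived in the paper."""
    n, m = X.shape
    xtx = np.sum(X ** 2, axis=0)
    s2 = sigma2 / (xtx + sigma2 / g2)            # variational variances
    mu, w = np.zeros(m), np.full(m, pi_star)
    for _ in range(max_iter):
        w_old = w.copy()
        r = y - X @ (w * mu)                     # residual at current means
        for j in range(m):
            r_j = r + X[:, j] * w[j] * mu[j]     # leave-one-out residual
            mu[j] = (s2[j] / sigma2) * (X[:, j] @ r_j)
            eta = (np.log(pi_star / (1.0 - pi_star))       # logit update for w_j
                   + 0.5 * np.log(s2[j] / g2)
                   + mu[j] ** 2 / (2.0 * s2[j]))
            w[j] = expit(eta)
            r = r_j - X[:, j] * w[j] * mu[j]
        if np.max(np.abs(w - w_old)) < tol:      # the paper tracks the VLB instead
            break
    return mu, s2, w
```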

Tuning procedure
For each of the $(p-1)$ regression problems, we choose the parameter $\pi^*_{kd}$, used in the prior in Equation (2), by applying the tuning algorithm described in detail in [39, sec. 4], because the authors also describe a way to select the hyperparameter that they use to control sparsity. In this section, we describe the changes that we made to add the sparsity constraint to this tuning parameter. We take the value discussed in [39] and multiply it by $\pi_k = c/(p\sqrt{k})$ to incorporate the sparsity constraint discussed in Section 2.3. Note that, since the dimension $d$ is not changing for $\pi^*_{k}$, we do not need to include $c$ itself in the tuning. The fixed indicator values $w$ discussed in [39] translate, for our work, to the fixed $w_{k>d}$, where $c$ is taken from an equally spaced grid of 50 points between 0.1 and 10, and the tuning parameter of [39] varies over an equally spaced grid of 50 points between $-15$ and $5$. That is, the tuning uses a grid of 50 values of $c$ between 0.1 and 10 instead of the three values $c \in \{0.1, 1, 10\}$ discussed in Section 2.3. The variational lower bound for the tuning procedure is based only on the first $(p-1)$ regressions and not on the regression relation that involves $Z_p$ and $\sigma_p^2$.
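The grids involved are simple to set up; the sketch below builds the candidate values of $c$ and the scaled inclusion probabilities $\pi_k = c/(p\sqrt{k})$ for one regression problem (the name lam_grid for the base tuning parameter of [39] is our own).

```python
import numpy as np

p, d = 50, 1
k = np.arange(d + 1, p + 1)               # column indices k > d
c_grid = np.linspace(0.1, 10, 50)         # 50 equally spaced values of c
lam_grid = np.linspace(-15, 5, 50)        # grid for the base sparsity parameter of [39]

# pi_k[i, j]: inclusion probability for c_grid[i] and column k[j]
pi_k = c_grid[:, None] / (p * np.sqrt(k)[None, :])
```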

Horseshoe prior
We use the horseshoe prior described in [37] to shrink the regression coefficients:
$$Z_d \mid \boldsymbol{\beta}_d, \sigma_d^2 \sim \mathrm{N}\big(Z_{k>d}\boldsymbol{\beta}_d, \sigma_d^2 I\big), \quad \beta_{kd} \mid \lambda_{kd}, \tau_d \sim \mathrm{N}\big(0, \lambda_{kd}^2 \tau_d^2\big), \quad \lambda_{kd} \sim \mathrm{C}^{+}(0, 1), \quad \tau_d \sim \mathrm{C}^{+}(0, 1), \quad \sigma_d^2 \sim \mathrm{IG}(A, B), \qquad (4)$$
where $\mathrm{C}^{+}(0,1)$ denotes the standard half-Cauchy distribution, $\lambda_{kd}$ and $\tau_d$ are the local and global scale parameters, $Z_{k>d}$ is the matrix of transformations, and $A$ and $B$ are fixed hyperparameters.
The global scale parameter $\tau_d$ is roughly equivalent to the probability of a nonzero element [48]. We enforce the sparsity constraint by scaling the global parameter as $\tau_d\, c/(p\sqrt{k})$. Thus, since we are working with the squared parameter, the factor appearing in the variance term for $\beta_{kd}$ is $c^2/(p^2 k)$, where $c \in \{0.1, 1, 10\}$. The joint posterior distribution and the corresponding conditional posterior distributions are provided in Appendix A, and the sampling algorithm (Algorithm 2) is provided in Appendix C.
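For intuition, the sketch below draws one vector of coefficients from this scaled horseshoe prior; the half-Cauchy scales follow the standard horseshoe hierarchy, and the factor $c/(p\sqrt{k})$ implements the sparsity constraint.

```python
import numpy as np

rng = np.random.default_rng(2)
p, d, c = 50, 10, 1.0
k = np.arange(d + 1, p + 1)                    # row indices k > d

lam = np.abs(rng.standard_cauchy(k.size))      # local scales lambda_kd ~ C+(0, 1)
tau = np.abs(rng.standard_cauchy())            # global scale tau_d ~ C+(0, 1)
scale = tau * lam * c / (p * np.sqrt(k))       # global scale shrunk by c / (p sqrt(k))
beta = rng.normal(0.0, scale)                  # beta_kd ~ N(0, scale^2)
```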

Bernoulli-Gaussian prior
We use the same Bernoulli-Gaussian prior described in Equation (2). The joint posterior distribution and the corresponding conditional posterior distributions are provided in Appendix B and the sampling algorithm (Algorithm 3) is provided in Appendix C.

THRESHOLDING
Since the horseshoe prior is a continuous shrinkage prior, it does not assign exact zeros to the elements of the inverse covariance matrix, so we must apply a thresholding procedure that determines which of the elements should be exactly zero. The resulting thresholded matrices are then used to construct the graphical model. The thresholding procedure that we consider for the method using the horseshoe prior (Equation 4) is based on a 0-1 loss function described in [50] for classification under absolutely continuous priors. Although this procedure is heuristic, it seems to perform well in practice. Other thresholding rules may be used, such as those based on posterior credible intervals [19], information criterion [20], clustering [24], posterior model probabilities [3,34], and projection predictive selection [53], but we choose to focus on the 0-1 loss procedure for this study.

0-1 Loss procedure
We find the posterior partial correlations using the precision matrices from the Gibbs sampler for the horseshoe prior (Equation 4) and the posterior partial correlations using the standard conjugate Wishart prior. The posterior samples of the partial correlation using the precision matrices from the Gibbs sampler are defined as
$$\bar{\rho}_{kd,m} = -\frac{\omega_{kd,m}}{\sqrt{\omega_{kk,m}\,\omega_{dd,m}}},$$
where $\omega_{kd,m}$ is an MCMC sample from the posterior distribution of $\Omega$, $m = 1, \ldots, M$, $M$ is the number of MCMC samples, and $k, d = 1, \ldots, p$. The posterior partial correlation using the standard conjugate Wishart prior is found by starting with the latent observations, $Z_m$, which are obtained from the MCMC output. We put the standard Wishart prior on the precision matrix, $\Omega \sim \mathrm{W}_p(3, I)$, which was used in [50] for their thresholding procedure, where $I$ is the identity matrix. Note that this Wishart prior does not assume sparsity, but $Z_m$ is obtained from the MCMC output assuming sparsity of the precision matrix. Through conjugacy, the posterior distribution given $Z_m$ is again Wishart, from which we draw a sample $\Omega_m$ and compute the corresponding partial correlations $\rho_{kd,m}$ for $k, d = 1, \ldots, p$ and $m = 1, \ldots, M$. The idea is that we are comparing the regularized precision matrix from the horseshoe prior to the nonregularized precision matrix from the Wishart prior. If the absolute value of the partial correlation coefficient from the regularized precision matrix is similar in size to, or larger than, the absolute value of the partial correlation coefficient from the Wishart precision matrix, then there should be an edge in the edge matrix. If the absolute value of the partial correlation coefficient from the regularized precision matrix is much smaller than the absolute value of the coefficient from the Wishart matrix, then the entry should not appear in the edge matrix.
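A plausible implementation of this comparison is sketched below: partial correlations are computed from each pair of posterior samples, and an edge is kept when the regularized coefficient is at least as large in absolute value in more than half of the samples (the exact decision rule in [50] may differ in detail).

```python
import numpy as np

def partial_corr(Omega):
    """Partial correlations from a precision matrix: rho_kd = -w_kd / sqrt(w_kk w_dd)."""
    s = np.sqrt(np.diag(Omega))
    rho = -Omega / np.outer(s, s)
    np.fill_diagonal(rho, 1.0)
    return rho

def edge_matrix(omegas_hs, omegas_wishart, cut=0.5):
    """Vote per MCMC sample: |regularized rho| >= |Wishart rho|; keep edges
    whose vote fraction exceeds `cut`."""
    votes = np.mean([np.abs(partial_corr(a)) >= np.abs(partial_corr(b))
                     for a, b in zip(omegas_hs, omegas_wishart)], axis=0)
    E = votes > cut
    np.fill_diagonal(E, False)
    return E
```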

CHOICE OF PRIOR PARAMETERS
For the precision matrix estimated with the horseshoe prior (Equation 4), we need to select the value of the parameter $c$ which controls the sparsity. We solve a convex constrained optimization problem in order to use the Bayesian Information Criterion (BIC), as described in [11,12]. First, we find the Bayes estimate of the inverse covariance matrix, $\hat{\Omega} = \mathrm{E}(\Omega \mid Z)$, and the average of the transformed variables. The optimization is carried out subject to the constraint $\mathcal{C}$ that all elements of $\hat{\Omega}$ at the locations of the zeros of the estimated edge matrix from the MCMC sampler are zero. The estimated edge matrix from the MCMC sampler will be described in more detail in Section 8. For computational simplicity, in the code, we represent this problem as an unconstrained optimization problem, as described in [11,12]. We select the value of $c$ that results in the smallest BIC.
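A common form of the Gaussian BIC makes the selection concrete; the sketch below assumes this standard likelihood-plus-penalty form, which may differ in constants from the exact criterion of [11,12].

```python
import numpy as np

def gaussian_bic(Omega_hat, Z):
    """n [tr(S Omega) - log det Omega] + log(n) * (number of estimated edges)."""
    n = Z.shape[0]
    S = np.cov(Z, rowvar=False, bias=True)       # empirical covariance of transformed data
    _, logdet = np.linalg.slogdet(Omega_hat)
    fit = np.trace(S @ Omega_hat) - logdet
    n_edges = np.count_nonzero(np.triu(Omega_hat, 1))
    return n * fit + np.log(n) * n_edges

# Choose c with the smallest BIC over candidate estimates, e.g. a dict {c: Omega_hat}:
# best_c = min(candidates, key=lambda c: gaussian_bic(candidates[c], Z))
```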

SIMULATION RESULTS
We conduct a simulation study to assess the performance of the proposed methods using the horseshoe MCMC, indicated as Horseshoe, the Bernoulli-Gaussian MCMC, indicated as Bernoulli-Gaussian, and the VB algorithm, indicated as variational Bayes. We choose not to include the Bayesian method for the nonparanormal graphical model described in [36] because we want to compare only the Cholesky decomposition-based Bayesian methods to the empirical method for the nonparanormal graphical model [26] and to a method based on a Bayesian Gaussian copula graphical model [33]. We indicate the Bayesian Gaussian copula graphical model by Bayesian Copula; in this method, the rank likelihood is used to transform the random variables, a uniform prior is placed on the graph, a G-Wishart prior is placed on the inverse correlation matrix, and estimation is carried out with the birth-death MCMC [34]. These competing methods all utilize a transformation of the data to learn the graphical structure.
We assess the performance of these methods by calculating sensitivity, specificity, and the Matthews correlation coefficient (MCC). We assess the effect of the transformation functions of our proposed methods by calculating the scaled L 1 -loss. These metrics are detailed in Section 8.1. In this section, we describe the data generation process used to conduct the simulation study.
The random variables $Y_1, \ldots, Y_p$ are simulated from a multivariate normal distribution such that $(Y_{i1}, \ldots, Y_{ip})' \sim \mathrm{N}_p(\boldsymbol{\mu}, \Omega^{-1})$ for $i = 1, \ldots, n$. The means are selected from an equally spaced grid between 0 and 2 with length p. We consider nine different combinations of n, p, and sparsity for Ω:
• p = 25, n = 25, sparsity = 10% nonzero entries in the off-diagonals;
• p = 50, n = 100, sparsity = 5% nonzero entries in the off-diagonals;
• p = 100, n = 300, sparsity = 2% nonzero entries in the off-diagonals;
• p = 25, n = 25, AR(2) model, sparsity ≈ 16%;
• p = 50, n = 100, AR(2) model, sparsity ≈ 8%;
• p = 100, n = 300, AR(2) model, sparsity ≈ 4%;
• p = 25, n = 25, circle model, sparsity = 8%;
• p = 50, n = 100, circle model, sparsity = 4%;
• p = 100, n = 300, circle model, sparsity = 2%.
We follow the procedure in [36] to estimate the transformation functions. The hyperparameters for the normal prior are chosen to be $a = 1$, $b = 1$, and $\sigma_0^2 = 1$. To choose the number of basis functions, we use the Akaike Information Criterion as described in [36]. Samples from the truncated multivariate normal posterior distributions for the B-spline coefficients are obtained using the exact Hamiltonian Monte Carlo (exact HMC) algorithm [40]. The initial coefficient values, $\theta_{dj,\mathrm{initial}}$, for the exact HMC algorithm are calculated using quadratic programming as described in [36]. After finding the initial coefficient values, we construct initial values $Y_{d,\mathrm{initial}} = \sum_{j=1}^{J} \theta_{dj,\mathrm{initial}} B_j(X_d)$ using the observed variables. These initial values $Y_{\mathrm{initial}}$ are used to find the initial values of $\Sigma$, $\boldsymbol{\mu}$, and $\Omega$ for the algorithm, where $\Sigma_{\mathrm{initial}} = \mathrm{cov}(Y_{\mathrm{initial}})$, $\boldsymbol{\mu}_{\mathrm{initial}} = \bar{Y}_{\mathrm{initial}}$, the average of $Y_{\mathrm{initial}}$, and $\Omega_{\mathrm{initial}} = \Sigma_{\mathrm{initial}}^{-1}$. For the part of the simulation study in which we do not estimate the transformation functions, the initial values for the Horseshoe, Bernoulli-Gaussian, and variational Bayes algorithms are constructed from the observed variables, $X$, with $\Sigma_{\mathrm{initial}} = \mathrm{cov}(X)$, $\boldsymbol{\mu}_{\mathrm{initial}} = \bar{X}$, the average of $X$, and $\Omega_{\mathrm{initial}} = \Sigma_{\mathrm{initial}}^{-1}$. Afterward, the mean and the precision matrix $\Omega$ are estimated using the algorithms described in the previous sections. The hyperparameter $g^2$ for the Bernoulli-Gaussian prior and the variational Bayes algorithm is fixed at 10. The hyperparameters $A$ and $B$ of the inverse gamma distribution for the Bernoulli-Gaussian prior, the variational Bayes algorithm, and the horseshoe prior are fixed at $A = B = 0.01$. The initial value at iteration $t = 0$ for the variational Bayes algorithm is chosen to be 1000. The threshold for stopping the variational Bayes algorithm is set to $10^{-6}$. For the variational Bayes algorithm and the MCMC algorithm using the Bernoulli-Gaussian prior, the tuning procedure described in Section 3.1 is used to find the hyperparameter of the Bernoulli distribution, $\pi^*_{kd}$. Since the vector $w_{k>d}$ from the tuning procedure consists of only 0 and 1 values, it is used as the initial indicator vector $\boldsymbol{\gamma}_d$ for the MCMC algorithm using the Bernoulli-Gaussian prior. The data matrix that is used as input for the tuning procedure is $Z_{\mathrm{initial}} = Y_{\mathrm{initial}} - \boldsymbol{\mu}_{\mathrm{initial}}$, as described in the previous paragraphs.
For the MCMC algorithm for the horseshoe prior, we consider three values of $c$ spanning three orders of magnitude: $c \in \{0.1, 1, 10\}$. The value of $c$ that yields the lowest BIC is selected for the final estimates of the precision matrix and edge matrix. The 0-1 loss procedure described in Section 6.1 is used to threshold the precision matrices and construct the edge matrices.
For the simulation study, we run 100 replications for each of the nine combinations and assess structure learning for each replication. We collect 10,000 MCMC samples for inference after discarding a burn-in of 5000; we do not apply thinning. The Bayesian copula method is implemented in the R package BDGraph [35] using the option "gcgm." Posterior graph selection is done using Bayesian model averaging, the default option in the BDGraph package, which selects the links whose estimated posterior probabilities are greater than 0.5. The nonparanormal graphical model is implemented in the R package huge [58] using the option "truncation." The graphical lasso method is selected for the graph estimation, and the default screening method, lossless screening [30,54], is used. Three regularization selection methods are used to find the estimated precision matrix and select the nonparanormal graphical model: the Stability Approach for Regularization Selection (StARS) [27], the modified Rotation Information Criterion (RIC) [29], and the Extended Bayesian Information Criterion (EBIC) [15]. The default parameters in the huge package are used for each selection method. As in Liu et al. [26], 50 regularization parameters are used, selected on an evenly spaced grid in the interval [0.16, 1.2].
The code for the proposed Bayesian methods is written in MATLAB, and sparse representations of the matrices are used when appropriate. For the variational Bayes algorithm, when calculating $w^*_{kd}$ via the expit function, $w^*_{kd}$ is set to 0 if the exponential of its argument is below $2^{-52}$ (eps, the floating-point relative accuracy in MATLAB), and it is set to 1 if the exponential overflows to infinity in MATLAB; this is done for numerical stability, since infinity results from operations that produce values too large to represent as conventional floating-point numbers. Similar adjustments are also applied for the Bernoulli-Gaussian MCMC. The code is given in Appendix C.

Performance assessment
We compute the Bayes estimate of the precision matrix, $\hat{\Omega} = \mathrm{E}(\Omega \mid Z)$, by averaging all MCMC samples after burn-in, or the variational Bayes estimate by averaging over 500 independent samples from the variational distribution. The median probability model [4] is used to obtain the Bayes estimate of the edge matrix. We find the estimated edge matrix by first using the 0-1 loss procedure discussed in Section 6.1 to threshold the MCMC precision matrix samples, and then taking the mean of the thresholded precision matrices. If an off-diagonal element of this mean is greater than 0.5, the element is registered as an edge in the estimated edge matrix; otherwise, it is registered as no edge. We use 0.5 as the cut-off since an average above 0.5 means that, on average, the matrices included the edge more than half of the time. We compute specificity (SP), sensitivity (SE), and MCC to assess the performance of the graphical structure learning. They are defined as follows:
$$\mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad \mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. For all three metrics, higher values indicate better classification. Models that are estimated to have no edges result in NaNs as MCC values.
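The sketch below computes the three metrics from estimated and true edge matrices, counting only the upper triangle since the graphs are undirected; it returns NaN for MCC when no edges are selected, matching the convention above.

```python
import numpy as np

def structure_metrics(E_hat, E_true):
    """Specificity, sensitivity, and MCC from binary edge matrices."""
    iu = np.triu_indices_from(E_true, k=1)
    e, t = E_hat[iu].astype(bool), E_true[iu].astype(bool)
    TP = np.sum(e & t)
    TN = np.sum(~e & ~t)
    FP = np.sum(e & ~t)
    FN = np.sum(~e & t)
    SP = TN / (TN + FP)
    SE = TP / (TP + FN)
    denom = np.sqrt(float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
    MCC = (TP * TN - FP * FN) / denom if denom > 0 else np.nan
    return SP, SE, MCC
```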
We also look at the effect of the transformation functions on parameter estimation for our methods. We consider the scaled $L_1$-loss function, the average absolute distance, as a measure of parameter estimation. The scaled $L_1$-loss is defined as
$$\frac{1}{p^2} \sum_{k=1}^{p} \sum_{d=1}^{p} \big|\hat{\Omega}_{kd} - \Omega_{\mathrm{true},kd}\big|,$$
where $\Omega_{\mathrm{true},kd}$ denotes the entries of the true precision matrix. Note that for the Bayesian Copula method, we use the estimated inverse correlation matrix and the true inverse correlation matrix in place of the precision matrices in the loss calculation. We review the results of sensitivity, specificity, MCC, and the scaled $L_1$-loss for each method using violin plots. In general, for sensitivity, specificity, and MCC, the closer the violin plots are to one and the tighter they are, the better the performance of the method. For the scaled $L_1$-loss, the closer the violin plots are to zero and the tighter they are, the better the performance.
First, we consider sensitivity. In Figure 1, for the p = 25 dimension and the AR(2) model, the StARS method has the best sensitivity, followed by the Bayesian Copula method. For p = 50 and the AR(2) model, the Bayesian Copula performs the best, followed by the StARS method. Notably, the proposed methods perform better at the p = 50 dimension than at the p = 25 dimension, with the Horseshoe method performing third best. Finally, for the p = 100 dimension and the AR(2) model, the Bayesian Copula is the best, the Horseshoe is the second best, and the Bernoulli-Gaussian and variational Bayes methods are the third and fourth best, respectively. For the p = 25 dimension and the circle model, all methods perform well, but the RIC and StARS methods perform the best and the Bayesian Copula method is the third best. For the p = 50 and p = 100 dimensions and the circle model, all methods perform similarly. For the p = 25 dimension and the 10% model, the StARS method is the best and the Bayesian Copula method is the second best. For the p = 50 dimension and the 5% model, the Bayesian Copula method performs the best. The Horseshoe and StARS methods perform similarly and are the second best, while the variational Bayes and Bernoulli-Gaussian methods perform similarly and are the third best. For the p = 100 dimension and the 2% model, the Bayesian Copula slightly outperforms the Horseshoe method, and the Bernoulli-Gaussian and variational Bayes methods perform similarly at third best.
Next, we review how the methods perform when considering specificity. In Figure 2 for all dimensions and AR(2) model, the three proposed methods, Bernoulli-Gaussian, Horseshoe, and variational Bayes methods, as well as the EBIC method, perform the best. For the p = 25 dimension and circle model, the three proposed methods, Bernoulli-Gaussian, Horseshoe, and variational Bayes methods, as well as the EBIC method, perform the best. For the p = 50 and p = 100 dimensions and the circle model, the three proposed methods, Bernoulli-Gaussian, Horseshoe, and variational Bayes methods, perform the best, outperforming all other methods. For the p = 25 dimension and the 10% model, the three proposed methods, Bernoulli-Gaussian, Horseshoe, and variational Bayes methods, as well as the EBIC method, perform the best. For the p = 50 dimension and 5% model and the p = 100 dimension and 2% model, the three proposed methods, Bernoulli-Gaussian, Horseshoe, and variational Bayes methods, perform the best.
We consider the MCC to compare the overall performance of structure learning. In Figure 3 for the p = 25 and p = 50 dimensions and the AR(2) model, the Bayesian Copula method performs the best and the Horseshoe method performs the second best. No edges were selected by the nonparanormal model using EBIC for the sparsity models of dimension p = 25 and for the p = 50 AR(2) model. For the p = 100 dimension and the AR(2) model, the three proposed methods, Horseshoe, Bernoulli-Gaussian, and variational Bayes methods, perform the best. For all dimensions of the circle model, the three proposed methods, Horseshoe, Bernoulli-Gaussian, and variational Bayes methods, perform the best. Lastly, for the p = 25 dimension and 10% model, the Horseshoe method performs the best, and the Bernoulli-Gaussian and RIC methods perform similarly and are the second best. For the p = 50 and 5% model and p = 100 and 2% model, the three proposed methods, Horseshoe, Bernoulli-Gaussian, and variational Bayes methods, perform the best. Thus, when considering the overall structure learning, the proposed methods outperform all competing methods except in the cases of p = 25 and p = 50 and the AR(2) model.
Finally, in Figure 4, we review the results of parameter estimation, using the scaled L 1 -loss, for the three proposed methods. We consider whether or not the transformation decreases the scaled L 1 -loss. For all three methods, the transformation functions resulted in a smaller scaled L 1 -loss, implying an improvement in parameter estimation. Overall, the Horseshoe method had a higher scaled L 1 -loss than the Bernoulli-Gaussian and variational Bayes methods. In addition, overall, the variational Bayes method had a similar or lower scaled L 1 -loss compared to the Bernoulli-Gaussian method.

REAL DATA APPLICATION
For the real data application, we consider the dataset based on GeneChip (Affymetrix) microarrays for the plant Arabidopsis thaliana, originally analyzed in Wille et al. [52]. This dataset features gene expression levels from isoprenoid pathways. Isoprenoids serve many biochemical functions in plants, for example, as components of membranes (sterols) and as photosynthetic pigments (carotenoids and chlorophylls). The cytosolic pathway, often described as the mevalonate or MVA pathway, is responsible for the synthesis of sterols, while the plastidial (nonmevalonate or MEP) pathway is used for the synthesis of isoprenes, carotenoids, and the side chains of chlorophyll. Although the two pathways operate independently, interaction between them has been discovered [21]. We use n = 118 microarrays and p = 39 genes from the isoprenoid pathways. For pre-processing, the expression levels for each gene, $x_i$ for $i = 1, \ldots, 118$, are log-transformed. We study the associations among the genes using the Bayesian nonparanormal methods, the nonparanormal method of Liu et al. [26], and the method based on the Bayesian copula graphical model of [33]. These data were originally treated as multivariate Gaussian in Wille et al. [52].
Using the same set-up as in the simulation study, we fit the Bayesian copula graphical model using the BDGraph package and the nonparanormal graphical model using the huge package. The BDGraph package selects 211 edges using Bayesian model averaging. The huge package results in 140 edges with RIC selection and 209 edges with the StARS method. The EBIC-selected model results in no edges.
In order to construct the graphical models using our methods, which use B-spline transformations, we converted the observations to lie between 0 and 1 using the equation $(x - \min(x_i))/(\max(x_i) - \min(x_i))$. The variational Bayes method results in 98 edges and the horseshoe prior based method results in 257 edges. Figure 5 shows the graphs of our proposed methods and Figure 6 shows the graphs of the existing methods.
Since we use a sparsity inducing prior for each of the graphs, we compare the graphs through their sparsity. The variational Bayes and Bernoulli-Gaussian prior methods result in the sparsest graphs. The method based on the horseshoe prior results in the densest graph; out of the three proposed methods, it is the most sensitive, so it appears that for this dataset it selects more edges than the other models. The variational Bayes method is the fastest of the three proposed methods. The variational Bayes and Bernoulli-Gaussian prior methods proposed in this paper give sparser graphs than the method based on the Gaussian copula graphical model, which uses a G-Wishart prior on the precision matrix. Sparse graphs can aid in simpler scientific interpretation and could be used for further exploration, such as understanding the mechanisms involved in the isoprenoid pathways.
We also compare features related to the graphs. Wille et al. [52] found three subgroups in their GGM that were nearly or fully connected. They found that the genes DXR, MCT, CMK, and MECPS are nearly fully connected, the genes AACT2, HMGS, HMGR2, MK, MPDC1, FPPS1, and FPPS2 share many edges in the MVA pathway, and the subgroup AACT2, MK, MPDC1, and FPPS2 is completely interconnected [52]. We refer to these subgroups as Subgroup 1, Subgroup 2, and Subgroup 3, respectively. The maximum number of edges in an undirected graph is p(p − 1)/2, where p is the number of nodes, so the maximum numbers of edges for Subgroup 1, Subgroup 2, and Subgroup 3 are 6, 21, and 6, respectively. Table 1 shows the number of edges found by each of the methods within the subgroups. The EBIC-selected method is not shown since it resulted in no edges. The RIC and StARS methods result in subgroups that have the highest number of edges. The Horseshoe and Bayesian Copula methods have the next highest numbers of edges. The variational Bayes and Bernoulli-Gaussian methods have the fewest edges.

DISCUSSION
We have introduced a Bayesian regression method to construct graphical models for continuous data that do not rely on a normality assumption. The method assumes the nonparanormal structure: under some unknown monotone transformations, the original observation vector reduces to a multivariate normal vector. The precision matrix of the transformed observations can be used to learn the graphical structure of conditional independence of the original observations. We use a prior distribution on the underlying transformations through a finite random series of B-splines with increasing coefficients that are given a multivariate truncated normal prior. We incorporate the positive definiteness constraint on the precision matrix of the transformed variables by utilizing the Cholesky decomposition. We consider two different priors based on the Cholesky decomposition, the Bernoulli-Gaussian prior and the horseshoe prior, and we impose a sparsity constraint. We use a VB algorithm to learn the conditional independence relations more efficiently, as well as a traditional Gibbs sampling approach. The VB approach and the approaches using the Bernoulli-Gaussian and horseshoe priors result, in most cases, in better overall structure learning, measured using the Matthews correlation coefficient, than competing methods. The competing methods perform similarly to, or in some cases better than, the proposed methods in smaller dimensions. In addition, the VB algorithm performs similarly to the proposed MCMC methods in terms of overall structure learning and parameter estimation. It appears that information is not lost with the VB algorithm, and we have the potential to speed up the estimation of the Bayesian nonparanormal graphical model. Lastly, when comparing the horseshoe to the Bernoulli-Gaussian prior, the horseshoe prior has higher sensitivity than the Bernoulli-Gaussian prior.
Although the Bernoulli-Gaussian methods perform similarly to the horseshoe in terms of specificity and overall structure learning, they achieve better parameter estimation. In summary, the proposed methods perform best at higher dimensions (p ≥ 50). Thus, for higher dimensional problems, we recommend using the VB algorithm to reduce the computational time while still maintaining good estimation properties. Bayesian nonparanormal graphical models are flexible. They can be used to estimate the elements of the precision matrix directly or via a Cholesky decomposition. Researchers can try different sparsity inducing priors on the precision matrix based on their interests and needs. In addition, researchers can use a fully Bayesian approach to learn the graphical structure or employ a partially Bayesian approach to increase the speed of learning the structure without sacrificing much in quality. The Bernoulli-Gaussian prior, used in the VB method and the traditional Bayesian approach, resulted in the sparsest graphs on the real data, which might be useful for researchers who would like greater variable reduction for data exploration.