Bayesian Variable Selection Regression of Multivariate Responses for Group Data

. We propose two multivariate extensions of the Bayesian group lasso for variable selection and estimation for data with high dimensional predictors and multi-dimensional response variables. The methods utilize spike and slab priors to yield solutions which are sparse at either a group level or both a group and individual feature level. The incorporation of group structure in a predictor matrix is a key factor in obtaining better estimators and identifying associations between multiple responses and predictors. The approach is suited to many biological studies where the response is multivariate and each predictor is embedded in some biological grouping structure such as gene pathways. Our Bayesian models are connected with penalized regression, and we prove both oracle and asymptotic distribution properties under an orthogonal design. We derive eﬃcient Gibbs sampling algorithms for our models and provide the implementation in a comprehensive R package called MBSGS available on the Comprehensive R Archive Network (CRAN). The performance of the proposed approaches is compared to state-of-the-art variable selection strategies on simulated data sets. The proposed methodology is illustrated on a genetic dataset in order to identify markers grouping across chromosomes that explain the joint variability of gene expression in multiple tissues.


Introduction
In this article, we consider the challenging task of developing a fully Bayesian sparse regression analysis for the situation when the numbers of predictors is larger than observations for a multivariate response and covariates grouped by blocks with the sparsity for blocks and within blocks. It is well established that the incorporation of prior knowledge on the structure existing in the data for potential grouping of the covariates is key to more accurate prediction and improved interpretability. In genomics, genes within the same pathway have similar functions and act together in regulating a biological system. These genes can add up to have a larger effect and therefore can be detected as a group (i.e., at a pathway or gene set level). Incorporation of this grouping structure is becoming increasingly common due to the success of geneset enrichment analysis approaches (Subramanian et al., 2005). For instance, the incorporation of group structure in regression analysis has been found to be effective for biomarker identification (Yuan and Lin, 2006;Meier et al., 2008;Puig et al., 2009;Simon and Tibshirani, 2012). Penalized regression methods are a popular approach for incorporating group structure and performing variable selection. Among these methods, Yuan and Lin (2006) proposed the group lasso by placing an L 2 penalty on the size of the regression coefficients. This method has drawn attention due to its ability to simultaneously perform group variable selection and estimate regression coefficients. The method was later extended by Meier et al. (2008) to logistic regression and modified by Puig et al. (2009) and Simon (2013) to consider non-orthonormal predictor matrices. Although the group lasso penalty can improve the quality of the variable selection, it requires a strong group-sparsity (Huang and Zhang, 2010), and cannot yield sparsity within a group. Ma et al. (2007) proposed a supervised group lasso which selects both significant gene clusters and significant genes within clusters for logistic binary classification and Cox survival analysis. Simon (2013) proposed a sparse group lasso penalty by combining an L 1 penalty with a group lasso to yield sparsity at both the group and individual feature level. Zhou (2010) applied this approach to genomic feature identification. Garcia et al. (2014) developed a sparse group-subgroup lasso to accommodate selecting important groups, subgroups and individual predictors. In a regression context, with a multivariate response variable, Li et al. (2015) have recently proposed a multivariate sparse group lasso. A review of group variable selection methods is presented by Huang et al. (2012). Recently, Liquet et al. (2016b) proposed a sparse group partial least squares approach for dealing with structured data in a genomic context.
In a Bayesian framework Xu and Ghosh (2015) proposed a Bayesian group lasso using spike and slab priors for group variable selection. Rockova and Lesaffre (2014) have recently developed rapid computational procedures based on the expectation maximization (EM) algorithm for a hierarchical model incorporating grouping information. Recently, Stingo et al. (2011) proposed a partial least squares approach for pathway and gene selection using variable selection priors and Markov chain Monte Carlo (MCMC) for computation. However, these Bayesian procedures only deal with a univariate response variable.
In some cases, the outcome can be complex and may consist of several correlated measures of continuous variables (e.g., metabolic syndrome). Figure 1 illustrates the most general situation of p predictors (typically OMICs measures) belonging to G groups being analyzed in relation to q correlated outcomes based on n observations. The matrix X can be divided into G sub-matrices (groups) X g : n × m g where m g is the number of covariates in group g. For example, in gene expression data this sub-matrix may represent gene pathways or be factor level indicators for categorical data. The aim is to select only a few groups of X which are related to the multivariate response Y. Further, sometimes we would like sparsity with respect to which groups are selected and which coefficients are nonzero within each group. For exam-Figure 1: General representation of data for modeling multivariate outcomes given a group structure of the covariates.
ple, we might be interested in identifying important genes within selected gene pathways.
The multivariate response and predictor combination can be modelled using the multivariate Gaussian linear model: where Y is a n×q matrix of responses, and B g is a m g ×q matrix of regression coefficients associated to the sub-matrix predictors X g . MN(·, ·, ·) indicates the normal matrixvariate as defined in Dawid (1981) where Σ (q × q) controls the responses' residual correlation (the variance-covariance matrix of Y − G g=1 X g B g ) and the observations are treated as independent (e.g., no familial structure is assumed in the data, with I n denoting the n×n identity matrix). Setting q = 1, the Gaussian linear model simplifies to where Y is a n × 1 vector, β g is a p g × 1 vector of regression coefficients associated with the group g, and σ 2 corresponds to the variance of the error term. We denote with N n (·, ·) the n−variate normal distribution.
One way to analyze the multivariate Gaussian model (1) is to consider a collection of q regression problems in R p . Using this framework, we could fit q different univariate regression problems using any of the previous variable selection methods. However, in many applications the response matrix Y contains variables that are strongly correlated. As such, one would expect that the underlying covariates would be related. One approach for this type of problem would be to posit that there is some underlying subset of the coefficients in X which is related to all components of the response Y.
A frequentist way to tackle this problem is offered by Friedman et al. (2010) who suggest using a group lasso penalty to select variables which are related to all components of the response Y. Extensions of this approach have been proposed to provide simultaneous estimation of the precision matrix (inverse of Σ) and of the regression coefficients (see e.g., Rothman et al. (2010), Lee and Liu (2012), Cai et al. (2013)). However, these methods do not take into account the group structure of the predictors.
In the context of multivariate multiple regression, different approaches have been developed to account for the biological group structure within the predictor matrix (see e.g., Wen (2014), Zhu et al. (2014)). For imaging genomics data, Greenlaw et al. (2016) have recently proposed a Bayesian hierarchical modeling formulation where the posterior mode corresponds to the estimator proposed by Wang et al. (2012). However, their approach is limited to the case of Σ = σ 2 I q corresponding to the independent phenotypes.
Following the ideas of Xu and Ghosh (2015), we develop more general Bayesian hierarchical models for variable selection with a group structure in the context of correlated multivariate response variables. For the univariate response model (2), a Bayesian group lasso model with spike and slab priors has been developed by Xu and Ghosh (2015) providing variable selection at the group level. The authors have also proposed a hierarchical spike and slab prior structure to select variables both at the group level and within each group. The posterior mode of their models was shown to provide shrinkage similar to the group lasso and sparse group lasso. Using this formulation, all groups were shrunk equally. Inspired by the adaptive group lasso (Wang and Leng, 2008;Nardi and Rinaldo, 2008) and the Bayesian adaptive lasso (Leng et al., 2014) we generalize their approach to allow for different amounts of shrinkage for different groups and coefficients.
For the multivariate response model (1), we define in Section 2 a Multivariate Bayesian Group Lasso with Spike and Slab prior (hereafter referred to as MBGL-SS) which enables only variable selection at the group level. Our Bayesian model is connected with penalized regression. We highlight properties of the posterior median estimator such as an Oracle property for group variable selection. We also derive asymptotic distributions under orthogonal designs. Then, we derive efficient Gibbs sampling algorithms for our group Bayesian lasso models with spike and slab priors. In order to select variables at both the group level and the individual level, we define a Multivariate Bayesian Sparse Group Selection with Spike and Slab priors (hereafter referred to as MBSGS-SS) in Section 3. Section 4 presents simulation studies to evaluate the performance of our approaches in terms of variable selection and prediction performance compared to frequentist approaches such as lasso, group lasso, and sparse group lasso. We illustrate the method in Section 5, with a challenging genetic data set where SNPs are used to predict the gene expression data in four tissue types. The final section concludes with a brief discussion.

Multivariate group lasso with spike and slab prior (MBGL-SS)
In this section, we consider the problem of Bayesian shrinkage estimation with group variables. The group Bayesian lasso was first proposed for a univariate response, by Kyung et al. (2010) who showed that a scale mixture of Normals with Gamma hyper priors for β g enables shrinkage of coefficients at the group level. While this method was shown to have shrinkage properties similar to the group lasso, Xu and Ghosh (2015) have stressed that the posterior mean and median do not produce exact zero estimates.
To obtain sparsity at the group level, they proposed a hierarchical Bayesian group lasso model with an independent spike and slab type prior. We exploit and extend their approach for multivariate responses and propose the following hierarchical multivariate Bayesian group lasso model with an independent spike and slab prior for each group variable B g : where δ 0 (V ec(B T g )) denotes a point mass at 0 ∈ R mgq , B g is the m g × q regression coefficient matrix for the group g and denote β g ij (i = 1, . . . , m g and j = 1, . . . , q) as the elements of this matrix. The prior density of Σ is assumed to follow an inverse Wishart distribution (denoted IW) where d and Q are respectively the degrees of freedom and a positive finite scale matrix such that E(Σ) = Q/(d − 2). The matrix Q is defined as Q = kI q with the hyper parameter k being comparable in size with the likely error variance of Y given X and d = 3 representing the smallest integer value ensuring the existence of E(Σ).
Fixing π at 1 2 is often recommended since it assigns equal probabilities to all submodels in the regression. Instead of fixing π 0 at 1 2 , we use a conjugate beta prior on π 0 , π 0 ∼ Beta(a, b) to incorporate potential prior knowledge on the sparsity of the model. This choice has been adopted by Scheipl et al. (2012) and Xu and Ghosh (2015). By setting a = b = 1, it gives a uniform prior for π 0 with mean 0.5 but allows for spread in the prior. However, one can choose an informative prior such as π 0 ∼ Beta(20, 40) (see Scheipl et al. (2012)) to encourage a sparser model for high dimensional data.
The value of λ g controls the amount of shrinkage for the gth group of coefficients. This parameter needs to be carefully tuned to provide the correct amount of shrinkage for the estimation. A large value of λ g will result in parameters that are extremely biased towards zero, whilst small values of λ g will lead to poor variable selection properties.
In Xu and Ghosh (2015), the shrinkage for each group is controlled by a single parameter. Consequently, larger groups of variables will be less affected by the shrinkage and more likely to be selected. In the original group lasso, Yuan and Lin (2006) propose weighting the shrinkage parameters by the size of their group to reduce the effect of different group sizes. We propose a global shrinkage parameterization of λ g by setting λ g = √ m g λ where λ is a global shrinkage parameter, and m g is the size of the group.
We take an empirical Bayes approach to estimate the value of λ from the data using marginal maximum likelihood (Park and Casella, 2008;Xu and Ghosh, 2015). Since the marginal likelihood function and marginal posterior for λ are intractable, a Monte Carlo EM algorithm is used. The kth EM update for λ is: in which the posterior expectation of τ 2 g is replaced by the Monte Carlo sample average of τ 2 g generated in the Gibbs sample based on λ (k−1) . We name this choice of λ the "global shrinkage parameter".
Inspired by the adaptive group lasso (Wang and Leng, 2008;Nardi and Rinaldo, 2008) and the Bayesian adaptive lasso (Leng et al., 2014) we propose an "adaptive shrinkage parameter" λ g for each group. The adaptive shrinkage parameter can be estimated using a Monte Carlo EM algorithm where the kth update for λ g is:

Connection to penalized regression and alternate reformulation of the model
To place our method in a context with the existing Bayesian group lasso and the penalized multivariate regression, we observe the marginal prior for B g . Integrating out the term τ 2 g in (4) using prior (5), the marginal prior distribution is a mixture of a point mass at 0 ∈ R mgq and a m g q-dimensional K-distribution: where MK(α, β, μ, Γ) denotes the Multivariate K-distribution as defined by Eltoft et al. (2006) with parameter set {α, β, μ, Γ}. In general the multivariate K-distribution does not have a closed form; however, for the parameters specified in (8) the density function is given by For a single response, the correlation matrix Σ will be a scalar denoted by σ and the MK distribution from (9) will reduce to the m g -dimensional Multi-Laplace distribution, From (8) and (10), we can observe that the marginal prior for V ect(B g ) with a single response variable reduces to a point mass at 0 ∈ R mg , and a term that matches the Bayesian group lasso with shrinkage parameter λ g (Raman et al., 2009;Kyung et al., 2010;Leng et al., 2014).
In the multivariate setting, the distribution for the slab part (9) can be interpreted as a generalization of the regular Bayesian group lasso that accounts for the correlations between the response variables. We note that there are many possible ways to extend the Laplace distribution for multivariate random variables. A general class of multivariate priors has been considered for group-sparse modeling by Babacan et al. (2014). While they do not consider correlated multivariate responses, they note that the multivariate Laplace distribution is part of a rich family of heavy tailed distributions. Importantly, it has been shown that for spike and slab priors using a heavy tailed distribution for the slab part results in optimal estimation risk with the posterior median estimator (Johnstone and Silverman, 2004).
To see the connection between our method and penalized regression we re-parameterize the regression coefficients: V ec(B T ) = γ g b g where γ g is an indicator taking a value 0 or 1 and Guided by the marginal prior distribution (8) we place an MK distribution on b g using the parameters from (9) and a Bernoulli prior on γ g , for g = 1, 2, . . . , G.
Using the formulation V ec(B T ) = γ g b g , the marginal prior for V ec(B T g ) will match the marginal prior (8). The negative log likelihood of the model and the prior defined by the above formulation is: In the case where the matrix Σ is set to σ 2 I q we have b g Im g ⊗Iqσ 2 = σ −1 b g 2 . In this setting the likelihood becomes Thus the posterior mode of the regression model is equivalent to a penalized regression problem where groups are penalized with an 2 norm (see Li et al. (2015)) and the number of nonzero groups is penalized in an 0 -like penalty. Setting q = 1 we obtain an expression similar to the likelihood found by Xu and Ghosh (2015). Once again our expression differs because we introduced a group spesific λ g to allow for different shrinkage across groups.

Median thresholding estimator
A key point of this section is to highlight the benefits of using the posterior median estimator in spike and slab type models for both selection and estimation at the same time. We generalize to a multivariate response variable the thresholding results of the posterior median estimator proposed by Xu and Ghosh (2015), who have also generalized the thresholding results of Johnstone and Silverman (2004). We first show that the posterior median estimator enables one to perform group variable selection by obtaining a zero coefficient for some groups. Then, we express the posterior median as a soft thresholding estimator. Finally, we show that the median thresholding estimator is consistent in model selection and has optimal asymptotic estimation rate.

Posterior median estimator
Consider one group: where Z is an m-dimensional random variable, and γ(·) and f (·) are both density functions for m-dimensional random vectors. Assume the density function f (t) is maximized at t = 0. Let Med(μ i |z) denote the marginal posterior median of μ i given data. By defining Xu and Ghosh (2015) stated the following theorem: In the case of a block orthogonal design matrix X (i.e., By Theorem 1, assuming π 0 > c 1+c , then there exists t(π 0 ) > 0, such that the marginal posterior median of β g ij under the prior (4) satisfies when ||V ec( B T g )|| 2 < t. As noted by Xu and Ghosh (2015), the marginal posterior median estimator of the gth group of regression coefficients is zero when the norm of the corresponding block least square estimator is less than a certain threshold.

Posterior median as a soft thresholding estimator
We assume now that the design matrix X is orthogonal, i.e., X T X = nI p and consider the model defined by (3) and (4) . Under this model the posterior distribution of B g is a spike and slab, where B LS,g is the least squares estimator of B g , D g = 1 1+nτ 2 g , and .
Thus the marginal posterior distribution of β g ij (1 ≤ i ≤ m g and 1 ≤ j ≤ q) conditional on the observed data is also a spike and slab distribution, where Σ jj is the j-th diagonal element of Σ. The resulting median is a soft thresholding estimator defined bŷ where z + denotes the positive part of z and Q g = φ −1 ( 1 2(1−min( 1 2 ,lg)) ). For a univariate response (q = 1) the matrix Σ reduces to the scalar σ 2 , and our result matches the previous work of Xu and Ghosh (2015). In the multivariate frequentist setting, Li et al. (2015) have proposed an iterative algorithm which utilizes a similar soft thresholding function to incorporate group structure in estimating the regression estimates. Under an orthogonal design, the median thresholding estimator has the oracle property.

Oracle property
Theorem 2. Assume an orthogonal design matrix, i.e., X T X = nI p . Suppose √ nτ 2 g,n → ∞ and log(τ 2 g,n )/n → 0 as n → ∞, for g = 1, . . . , G, then the median thresholding estimator has the oracle property, that is, variable selection consistent estimation, and asymptotic normality, The proof follows the same steps as the proof of Theorem 4 in Xu and Ghosh (2015). For asymptotic normality, the result comes from the fact that

Gibbs sampler
An efficient block Gibbs sampler (Hobert and Geyer, 1998) is used for simulating from the posterior distribution. The full posterior distribution of all the unknown parameters conditional on the data is

Conditional posterior distribution
Let B (g) denote the B matrix without the gth group, and X (g) denote the covariate matrix corresponding to B (g) , that is, where X g is the design matrix corresponding to B g .
• The conditional posterior distribution of B g where .
• The conditional posterior distribution of α 2 g = 1 where the inverse Gaussian distribution is defined in Folks and Chhikara (1978) and the inverse Gamma distribution in Gelman et al. (2014).
• The conditional posterior distribution of Σ • The conditional posterior distribution of π 0 Remark. From the Gibbs sampler, different strategies are used to select models and predictors. The highest posterior probability model (denoted here after HPPM) is estimated by recording at each simulation (iteration of the Gibbs sampler) the generated model. Then, the generated models are tabulated to find the model that has the highest frequency. Thus HPPM defines the selected relevant groups. An alternative choice is the median estimator which is found by taking the element wise median of the samples of B g generated by the Gibbs sampler.

Multivariate sparse group selection with spike and slab prior (MBSGS-SS)
The MBGL-SS is tailored for problems that only require group level sparsity. However, sometimes we would like to combine both sparsity of groups and within each group. For example, if the predictor matrix contains genes, we might be interested in identifying particularly important genes in pathways of interest. For a univariate response, Xu and Ghosh (2015) defined a Bayesian sparse group lasso which offers shrinkage effects at both the group level and also within a group. However, the authors stressed that the model does not produce a sparse model since the posterior mean/median estimators are never exactly set to zero. To overcome this drawback, the authors defined a hierarchical Bayesian sparse group selection with spike and slab prior using spike and slab type priors (named BSGS-SS) for both group variable selection and individual variable selection. We exploit the same idea for a multivariate response variable.

Model specification
First, we reparametrize the coefficient matrices to tackle the two kinds of sparsity separately: whereB g , when nonzero, follows the distribution V ec(B T g ) ∼ N mgq (0, I mg ⊗ Σ). Thus the diagonal element of V 1 2 g control the magnitude of the elements of B g . To select variables at the group level, we assume the multivariate spike and slab prior for each V ec(B T g ): We denote the j-th row of B g by B j g and the j-th row ofB g byB j g . Note that when τ gj = 0, the row B j g is set to zero, even when the corresponding rowB j g is nonzero. In order to choose variables within each relevant group, we assume the following spike and slab prior for each τ gj : τ gj ind ∼ (1 − π 1 )N + (0, s 2 ) + π 1 δ 0 (τ gj ), g = 1, . . . , G; j = 1, . . . , m g , where N + (0, s 2 ) denotes a normal N (0, s 2 ) distribution truncated below at 0. Note that this truncated normal distribution has mean 2 π s and variance s 2 .

Prior specification
• We assume an Inverse Wishart prior for Σ ∼ IW(d, Q) • We assume conjugate beta hyper-priors for π 0 and π 1 : • We use a conjugate inverse gamma prior for s 2 ∼ Inverse Gamma(1, t), and estimate t with the Monte Carlo EM algorithm. For the k-th EM update, where the posterior expectation of 1 s 2 is estimated from the Gibbs samples based on t (k−1) .

The posterior distribution ofB g
g ) −1 , then the conditional posterior distribution ofB g is a spike and slab distribution where .
The conditional posterior distribution of τ gj The conditional posterior distribution of τ gj is a spike and slab distribution: where

The conditional posterior distribution of Σ
The conditional posterior distribution of Σ is an inverse Wishart distribution: The conditional posterior distribution of π 0 and π 1 The conditional posterior distribution of s 2 Remark. The MBGL-SS and MBSGS-SS methods have been designed to tackle the situation where a subset of predictors in X are related to all components of the response Y. Meaning, that our models have not been designed to allow for sparseness within a regressor across the response variables. However, the median estimator does do this, as it were, for free. This nice feature of the median estimator is highlighted in both the simulation study and the case study application.

Simulation studies
To investigate the properties of our approach, our first simulation study was conducted in the univariate setting to show the behavior of the BGL-SS (Bayesian Group Lasso with Spike and Slab prior) proposed by Xu and Ghosh (2015) compared to the proposed extension including the group size effect related to the shrinkage part of our model. We compared different approaches (such as lasso, group lasso, sparse group lasso) in terms of prediction and variable selection accuracy performance. Then, a second simulation study was performed with a multivariate response to demonstrate the good prediction and variable selection accuracy performance of MBGL-SS and MBSGS-SS when compared with BGL-SS, BSGS-SS (Bayesian Sparse Group selection with spike and slab priors defined in Xu and Ghosh (2015)) and two lasso methods (denoted mlasso and MRCE) for a multivariate response. The mlasso method has been implemented in the glmnet R package and assumes that there is some underlying subset of the coefficients in X which are related to all components of the response Y. The multivariate lasso (mlasso) problem is stated as: where · F denotes the Frobenius norm, and B jT denotes the jth row of B. The MRCE method proposed by Rothman et al. (2010) has been implemented in the MRCE R package (Rothman, 2017). The multivariate regression with covariance estimation (MRCE) method producing a sparse estimator of B which depends on the inverse of the covariance matrix Ω = Σ −1 is stated as: where L(B, Ω) is the negative log-likelihood function and Ω = [w j j ].
Note that BGL-SS and BSGS-SS are designed for univariate response but are also used in the multivariate response simulation (viewed as q univariate regressions) and the results are pooled to obtain prediction performance. The different Bayesian methods used have been implemented in our R package MBSGS (Liquet and Sutton, 2017) available on CRAN.
The posterior mean and posterior median are both used as our Bayes estimators and we compare their variable selection and prediction performance.
For our Bayesian methods, we generate data from the full posterior distribution with a Gibbs Sampler, running 20000 iterations in which the first 10000 are burn-ins. For MBGL-SS and BGL-SS, we set a = b = 1 for the hyperparameters relating to π 0 . For MBSGS-SS and BSGS-SS, hyperparameters relating to the Beta distributions of π 0 and π 1 are chosen to be a 1 = a 2 = c 1 = c 2 = 1. As suggested by Brown et al. (1998), a weak prior information requires a small value for the hyperparameter d. We set d = 3 which is just a convenient small value, the smallest integer value for which the expectation of Σ exists. Instead of an arbitrary setting of the hyperparameter k for the expectation of the error variance (E(Σ) = kI q ), we propose a practical way to fix it in the spirit of an Empirical Bayes approach. As adopted by Petretto et al. (2010) and Liquet et al. (2016a), we perform q univariate regressions which enable us to derive an estimate of the error variances. We fix k to be the average of the q residual error variance from the univariate models. In the case of p > n, q univariate forward regressions are performed to derive an estimate of the error variance.
We use the glmnet R package (Friedman et al., 2010) to perform the lasso method for univariate and multivariate responses. The SGL R package  is used to perform the group and sparse group lasso methods for the univariate setting. The MRCE R package (Rothman, 2017) is used to perform the multivariate regression with covariance estimation. The tuning parameters for the frequentist methods are calibrated using 5-fold cross-validation.

Univariate setting
In this simulation setting, we investigate both the effect of correlation between predictors and the group size effect.

True models
The data have been generated from the following univariate model: where each row of the predictor matrix X is generated from a multivariate Normal distribution with zero mean and covariance matrix Σ X = (1 − ρ)I p + ρ1 p 1 T p where the correlation ρ is given according the simulation setting, 1 m is the m-length vector of ones. We consider the following two simulation settings: • Model 1. We simulated data sets with n = 120 observations and p = 20 covariates divided into 4 groups with 5 covariates each. We randomly sampled 80 observations to train the model and used the remaining 40 for comparing performance prediction. Let β T = ((0.3, −1, 0, 0.5, 0.01), 0 5 , 0.8 5 , 0 5 ), where the notation x l denotes a vector of length l with x values. We varied the pairwise correlation ρ ∈ {0, 0.5, 0.75} between covariates. We specify σ = 3.
• Model 2. We simulated a data set with n = 120 observations and p = 130 covariates divided into 5 groups with respectively 5, 5, 20, 50 and 50 covariates. We randomly sampled 80 observations to train the model and used the remaining 40 for comparing performance prediction. Let β T = ((0.3, −1, 0, 0.5, 0.01), 0 5 , 0.8 5 , 0 50 , 0 50 ). We vary the pairwise correlation ρ ∈ {0, 0.5, 0.75} between covariates. We specify σ = 3. This model enables us to investigate both the effect of correlation between predictors and the group size effect. For this model we also investigate the behavior of our methods when the sample size increases (200 and 300 observations for training and 40 observations for comparing prediction performance).
For the simulated data for model 1, Table 1 of the Supplementary Material  presents the model selection accuracy over 50 replications for the different methods designed for univariate response variables. The models are compared with true and false positive rates and with Matthews correlation coefficient.
For both BGL-SS and BSGS-SS, the median thresholding model (MTM) outperforms all other methods including the highest posterior probability model (HPPM) whatever the values of the pairwise correlation. Lasso, group lasso (gLasso) and sparse group lasso (sgLasso) tend to select more variables than the spike and slab methods. A similar pattern for the simulations was noted in Xu and Ghosh (2015). Our extension of the BGL-SS model incorporating the group size effect gives similar results to the traditional one if we use the global shrinkage parameterization of λ g while the adaptive shrinkage parameterization tends to select more variables.
The results regarding the prediction performance (mean square error of prediction) are presented in Table 2 of the Supplementary Material. The medians of the mean squared prediction error are compared for the 12 methods. The bootstrapped standard errors of the medians are given in the parentheses. BSGS-SS and BGL-SS methods gave similar results and outperform the frequentist lasso approaches which are adversely impacted by the correlation between predictors. Note that in this case the posterior mean estimator and posterior median estimator have similar performances.

Tables 3 and 4 of the Supplementary Material present performance results for
Model 2, where the number of predictors in each group varies. This structure of the data clearly affects the BGL-SS model proposed by Xu and Ghosh (2015) especially when the predictors are correlated and with small sample size. Our modifications of the model (including the group size effect to the shrinkage parameter) combined with the "adaptive shrinkage parameter" give better results, especially with the Median Thresholding Model. We can note that the BSGS-SS model is not affected by this structure of the data and out performs all the other methods. Only for high correlation and small sample size are the frequentist approaches competitive in terms of variable selection compared with the BGL-SS model with "adaptive shrinkage parameter". However, regardless of the pairwise correlations between the predictors, the frequentist methods have worse prediction performances.

True models
The data have been generated from the following multivariate model: For all models, we assumed strong levels of correlation between the first and second outcomes, and weaker levels for the other pairwise correlations. Specifically, we defined We considered the following nine simulations setting: • Model 1. We simulated data sets with n = 100 observations and p = 20 covariates divided into four groups with five covariates each. We randomly sampled 60 observations to train the model and used the remaining 40 for comparing performance prediction. Let The pairwise correlation between covariates is set equal to 0.5.
• Model 2. The simulation setting is the same as for the previous model except for the true B. Let In this simulation, some relevant predictors (1,2, 3 and 4) are not associated for the second response variables.
• Model 3. We consider the situation where n < p. We simulated data set with n = 60 and p = 80 covariates divided into 16 groups with 5 covariates each. We use 40 observations to train the model and used the remaining 20 for comparing performance prediction. Let (26) We define the jth predictor in group g as X gj = z g + z gj , where z g and z gj are independent standard normal variates, g = 1, . . . , 16; j = 1, 2, . . . , 5. Thus predictors within a group are correlated with pairwise correlation 1 2 while the predictors in different groups are independent.
• Model 4. The simulation setting is the same as for the previous model except for the true B. Let (27) As in model 2, some relevant covariates are non-zero for two responses and zero for the other one.
Predictors have been simulated in the same way as in model 2.
• Model 6. The simulation setting is the same as for the previous model except for the true B. Let Predictors have been simulated in the same way as in model 2.
• Model 8. We simulated data with n = 240 and p = 1000 covariates divided into 50 groups with 20 covariates each. We randomly sampled 200 observations to train the model and used the remaining 40 for comparing performance prediction. Let Predictors have been simulated in the same way as in model 2.
• Model 9. We simulated data with n = 120 observations and p = 130 covariates divided into 5 groups with respectively 5, 5, 20, 50 and 50 covariates. We randomly sampled 80 observations to train the model and used the remaining 40 for comparing performance prediction. Let We vary the pairwise correlation ρ ∈ {0, 0.5, 0.75} between covariates. This model enables us to investigate both the effect of correlation between predictors and the group size effect. For this model we can also investigate the behavior of our methods when the sample size increases (200 and 300 observations for training and 40 observations for comparing performance prediction).
Note that models 1, 2, 6 and 9 have sparsity at the group level and also sparsity within nonzero groups while models 3, 7 and 8 have only sparsity at the group level. Models 2, 4 and 5 present the case where some relevant covariates are not related to all the responses variables.
For the first 8 models we used the "global shrinkage parameter" version of our MBGL-SS which has better performance in the univariate setting when all the groups have the same size. For model 9, we performed both the "adaptive" and "global" parameterizations of λ g for our MBGL-SS model. Table 5  which does not take into account the information of the data (group) structure. As expected, the MTM model which is more parsimonious has a lower false positive rate than the HPPM model. However, the HPPM model gives better result for the true positive rate when a multivariate sparse group selection model is applied.

Numerical results
Models 2 and 4, correspond to the scenario where some relevant covariates are not associated with all response variables, MBGL-SS has a higher false positive rate since the method produces an estimator which gives non-zero estimates for all coefficients within a selected group regardless of the response variables. However, MBSGS-SS is not impacted by this situation because the method produces an estimator which can give zero estimates for some coefficients within a selected group. This result is highlighted in the application section.
For a large number of predictors (Models 7 and 8) compared to the number of observations, MBGL-SS has poor performance while MBSGS-SS attains very good results. In these simulations, MBGL-SS gives good results for larger sample sizes (n = 500 for Model 7 and n = 900 for Model 8). Note that MRCE methods failed dramatically in these situations and give the worst performances of all simulated models. For the current version of the MRCE optimization problem (22), the diagonal elements of the optimization variable corresponding to the error precision matrix are not penalized. Consequently, when p > n a global minimizer of the penalized likelihood optimization can fail to exist. For this reason, it is not recommended to use the current MRCE approach when p > n. The bootstrapped standard errors of the medians are given in the parentheses. BGL-SS and BSGS-SS have been performed on each response variable and we have derived and reported the median mean squared error of prediction corresponding to the multivariate response. As expected the multivariate models MBGL-SS and MBSGS-SS outperform the univariate model applied to each response variable. The MSBGS-SS also outperforms the lasso models for all simulation settings. Finally, we remark that the MSBGS-SS always performs better than the MBGL-SS and shows very good behavior when there is strong within-group sparsity (Model 6).
Results from Model 9 (corresponding to a model with different group size effect) for different pairwise correlations between predictors and different sample sizes are presented in Tables 7 and 8 of the Supplementary Material. From these tables we observe that: • When the predictors are independent, MBGL-SS performs well both for variable selection and prediction. As expected the results improve when the sample size increases. These approaches always outperform the lasso models in this independent setting. However, the MBGL-SS methods (both the "adaptive" and "global" parameterizations of λ g ) are impacted by moderate and high correlations between predictors especially for small sample size (n = 80). For a larger sample size (n = 200), the "global shrinkage parameter" mostly gives better results for the prediction performance, but the large standard error of the median of the mean squared error of prediction indicates that some of the runs over the 50 replications gave poor results. For larger sample sizes, the models are competitive with the other approaches. We can note that the "global shrinkage parameter" always gives better results than the "adaptive shrinkage parameter" which only gives good results for the large sample size (n = 900, not shown in these tables).
• MBSGS-SS models outperform all the other approaches except in the case of independent predictors and small sample size. For this setting only, the method failed to select the signal of the model which is also the case for lasso models.
• As previously observed the median thresholding model (MTM) which is more parsimonious has slightly lower false positive rate than the HPPM model.

Application to real data
In this section, we present the results of the application of our approaches to investigate genetic regulation. To discover the genetic causes of variation in the expression (i.e. transcription) of genes, gene expression data are treated as a quantitative phenotype while genotype data (SNPs) are used as predictors, a type of analysis known as expression Quantitative Trait Loci (eQTL). Here, we use a dataset which comes from a larger study ) from which we selected the Hopx genes, as in Petretto et al. (2010). This dataset has been also analyzed by Liquet et al. (2016a) who used a Bayesian model to identify a parsimonious set of predictors that explains the joint variability of gene expression in four tissues (adrenal gland, fat, heart, and kidney). However, their model does not take into account the group structure of the predictors. The dataset consists of 770 SNPs in 29 inbred rats as a predictor matrix (n = 29, p = 770), and the 29 measured expression levels in the 4 tissues as the outcome (q = 4). A full description of the dataset is provided in Petretto et al. (2010) and available from the R package R2GUESS (Liquet et al. (2016a)). Table 1 presents the means, variances, and pairwise correlation structure among the four tissues, noting similar means and variances except for heart, which is larger, and moderate positive correlation. The table also shows the partitioning of the SNPs among the 20 chromosomes of the rats. Thus, the chromosome information defines the group structure of the predictor matrix.
We ran our MBGL-SS and MBSGS-SS models using this group structure and the multivariate lasso which does not take into account the group structure. The multivariate lasso selects 69 SNPs which come from the 20 chromosomes. The MBGL-SS selects only the two first groups corresponding to the SNPs from chromosomes 1 and 2. However, the simulation study showed that MBGL-SS methods are impacted by moderate and high correlations between predictors, especially for small sample sizes. Therefore, we focus our analysis on the MBSGS-SS model. Using the thresholding median estimator, the method selects 32 SNPs which belong to only 8 groups/chromosomes. Table 2 presents the posterior median estimators of the selected SNPs (meaning that the median estimator produced an estimate exactly equal to 0 for all others SNPs). Note that some SNPs in the selected chromosomes are not associated (median estimator exactly equal to 0) with all the four tissue types. Although the model does not allow for sparseness within the SNP across tissue types, that is, setting some regression parameters to 0, the median estimator does do this, as it were, for free. We note this in Table 2 for chromosomes 14, 15 and 19 in particular.
The SNP D14Mit3 (from chromosome 10), which has been previously identified by Liquet et al. (2016a) as the most associated with the four levels of expression, is also selected by our method with the highest estimate (0.334) for the heart tissue. The four estimates for SNP D14Mit3 for the four tissue types are substantially larger than estimates for any other SNPs. We can consider the statistical significance, estimate (posterior mean, median) compared with the posterior standard deviation for the SNP regression parameters. We note that the posterior standard deviation of the regression parameter for each selected non-zero median estimate was in the range 0.11 to 0.64. In particular, the SNP D14Mit3 estimate was 0.334 with posterior standard deviation 0.639, a "Z-value" of about 0.5. All other SNPs with non-zero estimates have "Z-values" close to 0.0. Here the study size was small, n = 29, explaining to some extent the lack of power of the analysis but nevertheless when the SNP D14Mit3 estimates are compared with the other SNPs' estimates they are substantially larger. In terms of choosing important chromosomes in addition to chromosome 10, on which SNP D14Mit3 lies, chromosomes 2 and 7 have a larger number of non-zero SNP estimates than other chromosomes. The importance of chromosomes was investigated using an estimate of the probability of inclusion (EPI) presented in Table 3. The EPI is defined as an empirical version of l g defined in (20) and shows the importance of chromosomes 2, 3, 4, 7, 10, and 14, all with EPI equal to 1.0.

Concluding remarks
In this paper, we have proposed Bayesian methods for group-sparse modeling in the context of a multivariate correlated response variable. Our models are based on spike and slab type priors which facilitate variable selection. In the case of the group variable selection, we have shown the importance of including the group size information in the shrinkage part of our model. We have shown that the posterior median estimator could both select and estimate the regression coefficients. Simulation results showed excellent performance of the posterior median estimator for variable selection and prediction error. This estimator obtains similar results as the highest probability model in terms of true and false positive rates. Moreover, this estimator can produce sparseness within the regression coefficient across the response variables even though our models have not been specifically designed for this purpose.
Simulation results also suggest that the multivariate Bayesian group lasso with spike and slab prior is strongly influenced by a combination of different group size structures and high correlation between predictors. The multivariate Bayesian sparse group selection with spike and slab prior does not suffer in this situation. This approach seems to be the most powerful method for variable selection and prediction performance in the presence of group structure data except in the extreme case of independent predictors and small sample size. All numerical results from this article can be reproduced using our R package MBSGS (Liquet and Sutton, 2017) available on CRAN.
We have illustrated our methods on a challenging dataset involving gene expression data (q = 4, n = 29) and SNP explanatory variables (p = 770) with the group structure (G = 20) defined by chromosomes. Our approach effectively found a significant SNP and chromosome while suggesting five other chromosomes could possibly be of interest.
We noted the small sample size, (n = 29), indicating a lack of power for this study.
Our current development is restricted to non-overlapping groups. To use the present approaches with overlapping groups, such as groups of genes (like in Gene Ontology), an extension in the spirit of Stingo et al. (2011) who uses two sets of binary indicators for group and individual level selection would be required.  In terms of computation, these methods are very practical. The current version of our package runs, for example, an MBSG-SS model in around two minutes and an MBGL-SS model in around one minute for the simulated Model 1 (Section 4.2) for a sample size (n = 900) with 20000 iterations including 10000 for the burnin. Further improvements of the code, such as parallelization over the group structure, are in progress to speed  Table 3: Empirical estimation of the Probability of Inclusion of each chromosome (EPI).
up the computational time for tackling Big Data sets which are common for genomics studies where predictors are embedded in a grouping structure such as gene pathways. In this context of genetics studies, some further extensions of our model are under investigation such as integrating different group penalties given a biological prior of the pathways or different distribution priors for each group.