Bayesian Effect Fusion for Categorical Predictors

In this paper, we propose a Bayesian approach to obtain a sparse representation of the effect of a categorical predictor in regression-type models. As the effect of a categorical predictor is captured by a group of level effects, sparsity can be achieved not only by excluding single irrelevant level effects but also by excluding the whole group of effects associated with a predictor or by fusing levels which have essentially the same effect on the response. To achieve this goal, we propose a prior which allows for almost perfect as well as almost zero dependence between level effects a priori. We show how this prior can be obtained by specifying spike and slab prior distributions on all effect differences associated with one categorical predictor and how restricted fusion can be implemented. An efficient MCMC method for posterior computation is developed. The performance of the proposed method is investigated on simulated data. Finally, we illustrate its application on real data from EU-SILC.


Introduction
In many applications, especially in medical, social or economic studies, potential covariates collected for a regression analysis are categorical, measured either on an ordinal or on a nominal scale. The usual strategy for modelling the effect of categorical covariates is to define one level as baseline and to use dummy variables for the effects of the other levels with respect to this baseline. Hence, the effect of a categorical covariate is captured not by a single regression effect but by a group of them. Including categorical variables as covariates in regression-type models can therefore easily lead to a high-dimensional vector of regression effects. Moreover, since only the subset of observations with a specific level contributes information to the estimation of its effect, estimated effects of rare levels will be associated with high uncertainty.
Many methods have been proposed to achieve sparser models by identifying regressors with non-zero effects. Whereas frequentist methods, e.g. the lasso (Tibshirani, 1996) or the elastic net (Zou and Hastie, 2005), rely on penalties, Bayesian variable selection methods are based on the specification of appropriate prior distributions, e.g. shrinkage priors (Park and Casella, 2008; Griffin and Brown, 2010) or spike and slab priors (Mitchell and Beauchamp, 1988; George and McCulloch, 1997; Ishwaran et al., 2001). However, variable selection methods perform selection of single regression effects and are not appropriate for a categorical covariate with more than two categories, as the natural grouping of the dummy variables capturing its effect is not taken into account.
Moreover, a sparser representation of the effect of a categorical covariate can be achieved not only by restricting all of its level effects to zero but also when some of the levels have the same effect. In this paper, we propose a Bayesian approach to achieve a sparser representation of the effects of a categorical predictor, which encourages both shrinkage of non-relevant effects to zero as well as fusion of (almost) identical level effects.
Approaches that explicitly address inclusion or exclusion of groups of regression coefficients associated to one variable are the group lasso (Yuan and Lin, 2006) and the Bayesian group lasso (Raman et al., 2009;Kyung et al., 2010). Chipman (1996) uses spike and slab priors for grouped selection of the set of all dummy variables related to a categorical predictor. Whereas all these methods aim at sparsity for groups of regression coefficients, the recently proposed sparse-group lasso (Simon et al., 2013) addresses also sparsity within groups by shrinking negligible effects to zero, however not by fusing identical level effects.
To encourage both sparsity of regression effects as well as of their differences in regression models with metric predictors, Tibshirani et al. (2005) proposed the fused lasso and Kyung et al. (2010) its Bayesian counterpart, the Bayesian fused lasso. Both methods assume some ordering of effects and shrink only effect differences of subsequent effects to zero. Hence, they are not appropriate for nominal predictors, where any effect difference should be subject to shrinkage. Effect fusion for nominal predictors is considered only by Bondell and Reich (2009), who propose a modification of the fused lasso for ANOVA, by Gertheiss and Tutz (Gertheiss and Tutz, 2009; Gertheiss et al., 2011; Gertheiss and Tutz, 2010; Tutz and Gertheiss, 2016), who specify different lasso-type penalties for ordinal and nominal covariates, and recently by Tutz and Berger (2014), where tree-structured clustering of effects of categorical covariates is performed.
We address the problem of sparsity for effects of categorical predictors from a Bayesian point of view and incorporate structure in the prior on the regression effects. As the goal is to learn whether two level effects are almost equal or considerably different, we do not specify a standard independence Normal prior for the level effects but explicitly model dependence in their joint precision matrix by allowing for either almost perfect or low dependence. We show that the prior can alternatively be achieved by specifying spike and slab prior distributions on all level effects and their differences and taking into account their linear dependence. Spike and slab prior distributions have been applied extensively in Bayesian approaches to variable selection. The mixture structure with a spike at zero and a flat slab allows for intrinsic classification of effects and effect differences: a coefficient is classified as (almost) zero when it is assigned to the spike and as non-zero otherwise. Whereas for a nominal predictor any two level effects will be subject to fusion, it seems natural to exploit the ordering information available for an ordinal predictor by restricting fusion to adjacent categories, which is easily accomplished in our framework. Generally, our proposed method is not limited to categorical predictors but can be applied to all groups of covariates.

The rest of the paper is organised as follows: in Section 2 we introduce the data model and in Section 3 the prior distribution constructed to encourage a sparse representation of covariate effects. Posterior inference is discussed in Section 4 and Section 5 investigates the performance of the method for simulated data. An application of the proposed method for Bayesian effect fusion is illustrated on a real data example in Section 6 and we conclude with Section 7.

Model specification
We consider a standard linear regression model with Normal response y and p categorical covariates C_h, h = 1, ..., p, where covariate C_h has c_h + 1 ordered or unordered levels 0, ..., c_h. To represent its effect on the response y, we define C_h = 0 as the baseline category and introduce dummy variables X_h,k to capture the effect of level C_h = k with respect to the baseline category. The regression model is then given as

y = µ + Σ_{h=1}^{p} Σ_{k=1}^{c_h} β_{h,k} X_{h,k} + ε,

where µ is the intercept, β_{h,k} is the effect of level k of covariate C_h (with respect to the reference category) and ε ∼ N(0, σ²) is the error term.
For an (n × 1) response vector y = (y_1, ..., y_n)' we write the model as

y = 1µ + Σ_{h=1}^{p} X_h β_h + ε,   ε ∼ N(0, σ² I),

where X_h is the (n × c_h) design matrix for covariate C_h, β_h is the (c_h × 1) vector of the corresponding regression effects and ε the (n × 1) vector of error terms. 1 denotes a vector with all elements equal to 1 and I the identity matrix.
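For illustration, the dummy design matrix X_h for a single covariate can be constructed as in the following minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def dummy_design(obs, c):
    """Design matrix X_h for a covariate with levels 0..c, using level 0
    as baseline: column k-1 is the dummy indicator for level k."""
    obs = np.asarray(obs)
    X = np.zeros((len(obs), c))
    for k in range(1, c + 1):
        X[:, k - 1] = (obs == k).astype(float)
    return X

# covariate C_h with c_h = 2 (three levels 0, 1, 2) observed on 5 subjects
C = [0, 1, 2, 1, 0]
X = dummy_design(C, 2)
print(X)
```

Each observation with level 0 yields a zero row, so the baseline is absorbed into the intercept.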

Prior specification
Bayesian model specification is completed by assigning prior distributions to all model parameters. We assume a prior of the structure

p(µ, σ², β_1, ..., β_p, ξ_1, ..., ξ_p) = p(µ) p(σ²) Π_{h=1}^{p} p(β_h | ξ_h) p(ξ_h),

where ξ_h = (τ²_h, δ_h) denotes additional hyperparameters, which are specified below. We assign a flat proper prior µ ∼ N(0, M_0) to the intercept and an Inverse Gamma distribution σ² ∼ G⁻¹(s_0, S_0) to the error variance.
The prior on the regression effects β_h is specified hierarchically as

β_h | τ²_h, δ_h ∼ N(0, B_{0,h})  with  B_{0,h}⁻¹ = 1/(γ_h τ²_h) Q_h(δ_h),   τ²_h ∼ G⁻¹(g_0, G_0),

where γ_h is a fixed constant, τ²_h is a scale parameter and the matrix Q_h determines the structure of the prior precision matrix. To encourage effect fusion, we let Q_h depend on a vector δ_h of indicator variables δ_h,kj, which are defined for each pair of level effects k and j subject to fusion. δ_h,kj = 1 indicates that β_h,k and β_h,j differ considerably and hence two regression parameters are needed to capture their respective effects, whereas for δ_h,kj = 0 the effects are almost identical and the two level effects could be fused. To allow fusion of level effects to 0, i.e. conventional variable selection, we include in δ_h also the indicators δ_h,k0, k = 1, ..., c_h.
The dimension of δ_h and the concrete specification of Q_h depend on which pairs of effects are subject to fusion. We discuss the case where fusion is completely unrestricted and hence any pair of effects might be fused in Section 3.1. Whereas unrestricted effect fusion will be appropriate for a nominal covariate, for an ordinal covariate information on the ordering of levels is available, which suggests fusing only adjacent categories as discussed in Gertheiss and Tutz (2009). We describe effect fusion taking into account restrictions that preclude direct fusion for specified pairs of effects in Section 3.2. For notational convenience we define β_h,0 = 0 and drop the covariate index h in the following.

Prior for unrestricted effect fusion
For unrestricted effect fusion, we introduce an indicator δ_kj for each pair of effects k = 1, ..., c and j = 0, ..., k − 1 (including 0 for the baseline) and hence δ is of dimension d = c(c + 1)/2. We define

κ_kj = r if δ_kj = 0   and   κ_kj = 1 if δ_kj = 1,

where r is a fixed large number (e.g. r = 10000) for k > j, and κ_jk = κ_kj for j > k.
The structure of the prior precision matrix is then specified as

q_kk = Σ_{j=0, j≠k}^{c} κ_kj,   q_kj = −κ_kj for k ≠ j,

and finally, we set γ = c/2.
The structure matrix Q(δ) determines the prior precision matrix of β up to the scale factor γτ² and therefore has to be symmetric and positive definite. Symmetry of Q(δ) is guaranteed by definition, and positive definiteness holds as the quadratic form β'Q(δ)β is strictly positive unless β = 0, see Appendix A.1 for a detailed proof.
Thus, the prior allows for either high (if δ_kj = 0) or low (if δ_kj = 1) positive prior partial correlation. Further, depending on δ_k, different values of the prior precision of β_k are possible, ranging from c/(γτ²) (if all elements of δ_k are equal to 1) to cr/(γτ²) (if all are equal to 0). As an example, consider a covariate with c = 3 levels, where δ_kj = 1 for all pairs except one and r = 10000. If δ_10 = 0 the structure matrix is

Q(δ) = [ r+2  −1   −1 ]
       [ −1    3   −1 ]
       [ −1   −1    3 ]

and the marginal prior on β_1 is concentrated close to zero. For δ_32 = 0,

Q(δ) = [  3   −1   −1 ]
       [ −1  r+2   −r ]
       [ −1   −r  r+2 ]

and hence the joint prior on (β_2, β_3) is concentrated close to β_2 = β_3. Note that, marginally, the prior on the variance of each effect difference is a mixture of two inverse Gamma distributions.
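The construction of Q(δ) can be checked numerically. The following sketch (our own illustration, not code from the paper) assembles the structure matrix for unrestricted fusion and verifies positive definiteness for the c = 3 example with δ_10 = 0:

```python
import numpy as np

def build_Q(c, delta, r):
    """Structure matrix Q(delta) for unrestricted effect fusion.

    delta maps each pair (k, j) with 1 <= j < k <= c, plus the baseline
    pairs (k, 0), to an indicator; kappa_kj = r if delta == 0 else 1.
    Level effects k = 1..c correspond to rows/columns 0..c-1 of Q.
    """
    kappa = {p: (r if delta[p] == 0 else 1.0) for p in delta}
    Q = np.zeros((c, c))
    for k in range(1, c + 1):
        # diagonal: sum of kappa_kj over all j != k, including baseline j = 0
        Q[k - 1, k - 1] = sum(kappa[(max(k, j), min(k, j))]
                              for j in range(c + 1) if j != k)
        for j in range(1, k):
            Q[k - 1, j - 1] = Q[j - 1, k - 1] = -kappa[(k, j)]
    return Q

c, r = 3, 10000.0
pairs = [(k, j) for k in range(1, c + 1) for j in range(k)]
delta = {p: 1 for p in pairs}
delta[(1, 0)] = 0                     # concentrate beta_1 near the baseline
Q = build_Q(c, delta, r)
print(Q)
```

With δ_10 = 0 the first diagonal element becomes r + 2 while the remaining structure is unchanged, and Q stays positive definite by diagonal dominance.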
The structure of the quadratic form in equation (6) suggests a straightforward interpretation of the effect fusion prior in terms of Normal priors on all effect differences θ_kj = β_k − β_j, k = 1, ..., c; j = 0, ..., k − 1. For δ_kj = 0, the effect difference θ_kj is concentrated around zero, whereas it is more dispersed for δ_kj = 1. Actually, as we show in Appendix A.2, the effect fusion prior specified above can be derived by starting from independent spike and slab priors on all effect differences θ_kj and then correcting for the linear restrictions

θ_kj = θ_k0 − θ_j0,   k = 1, ..., c; j = 1, ..., k − 1,

among the differences. Finally, we note that from a frequentist perspective, the effect fusion prior can be interpreted as an adaptive quadratic penalty (see equation (6)), with either heavy or slight penalization of effect differences. In contrast, Gertheiss and Tutz (2010) use a weighted L1 penalty on the effect differences.

Prior for restricted effect fusion
If information on the structure of the level effects is available, it can be exploited by allowing only fusion of specific pairs of level effects. Consider e.g. an ordinal covariate where the ordering of levels suggests allowing only fusion of subsequent level effects β_{k−1} and β_k, i.e. restricting direct fusion of effects to adjacent categories. A restriction that e.g. β_k and β_j should not be fused can be implemented in our prior in two ways: we can either fix the indicator δ_kj = 1 or directly set the corresponding element in the prior precision matrix Q(δ), q_kj = 0. Fixing δ_kj = 1 implies that effects β_k and β_j are still smoothed towards each other and is hence a soft restriction, whereas q_kj = 0 is a hard restriction which implies conditional independence of β_k and β_j.
Whereas implementation of soft restrictions is straightforward, (hard) conditional independence restrictions require slight modifications of the structure matrix Q(δ), the vector of indicators δ and the constant γ. We start by introducing a vector ζ of indicators ζ_kj, which are defined for each effect difference θ_kj. The elements of ζ are fixed and indicate whether an effect difference is subject to fusion (for ζ_kj = 1) or not (for ζ_kj = 0). Deviating from unrestricted effect fusion considered in Section 3.1, we define a stochastic indicator δ_kj only for those effect differences where ζ_kj = 1 and hence the dimension of δ is d = Σ_{k=1}^{c} Σ_{0≤j<k} ζ_kj. To allow off-diagonal elements of the prior precision to be zero, we set

q_kj = −ζ_kj κ_kj for k ≠ j,

and q_jk = q_kj. Thus q_kj takes the value zero if ζ_kj = 0 and −κ_kj otherwise.
Similarly, the diagonal elements are specified as

q_kk = Σ_{j=0, j≠k}^{c} ζ_kj κ_kj.

As noted above, an important special case is an ordinal covariate where it is natural to restrict fusion to adjacent categories as in Gertheiss and Tutz (2009), i.e.

ζ_kj = 1 if j = k − 1 and ζ_kj = 0 otherwise.
Hence, the vector of indicators δ has only d = c elements and Q(ζ, δ) is a tri-diagonal matrix with elements

q_kk = κ_k,k−1 + κ_k+1,k for k < c,   q_cc = κ_c,c−1,   q_k,k−1 = q_k−1,k = −κ_k,k−1.

In this case, the maximum value of a diagonal element q_jj (under the slab, i.e. for δ = 1) is two and therefore we set γ = 1.
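For the ordinal special case, the tri-diagonal structure matrix can be assembled as follows (a sketch with our own function name, assuming κ_{k,k−1} = r for δ_{k,k−1} = 0 and κ_{k,k−1} = 1 otherwise):

```python
import numpy as np

def build_Q_ordinal(c, delta, r):
    """Tri-diagonal structure matrix Q(zeta, delta) for an ordinal
    covariate, where only adjacent levels k-1, k may be fused.
    delta[k-1] is the indicator for the difference beta_k - beta_{k-1}
    (with beta_0 = 0); kappa = r if the indicator is 0, else 1."""
    kappa = np.where(np.asarray(delta) == 0, r, 1.0)  # kappa_{k,k-1}, k = 1..c
    Q = np.zeros((c, c))
    for k in range(c):
        Q[k, k] = kappa[k] + (kappa[k + 1] if k + 1 < c else 0.0)
        if k + 1 < c:
            Q[k, k + 1] = Q[k + 1, k] = -kappa[k + 1]
    return Q

# c = 4 ordinal levels above the baseline, no fusion indicated anywhere
Q = build_Q_ordinal(4, [1, 1, 1, 1], r=10000.0)
print(Q)
```

With all indicators in the slab the diagonal is (2, 2, 2, 1), the band entries are −1, and the matrix is the precision of a random walk anchored at β_0 = 0, hence positive definite.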
It is easy to show that this specification of Q(ζ, δ) corresponds to a random walk prior with initial value β_0 = 0 on the regression effects:

β_k | β_{k−1} ∼ N(β_{k−1}, γτ²/κ_k,k−1),   k = 1, ..., c.

Due to the spike and slab structure, this prior allows for adaptive smoothing, with almost no smoothing for δ_k,k−1 = 1 and pronounced smoothing for δ_k,k−1 = 0.
Another special case is the standard spike and slab prior used for variable selection, where only fusion of level effects to the baseline, i.e. shrinkage of β_k to zero, is considered. The spike and slab prior is recovered in our framework when γ = 1 and

ζ_kj = 1 if j = 0 and ζ_kj = 0 otherwise,

and hence non-diagonal elements of Q(ζ, δ) are zero and q_kk = κ_k0.

Prior on the indicator variables
A standard choice in variable selection is to assume conditional prior independence of the elements of δ with p(δ_kj = 1) = ω, where ω is either fixed or assigned a hyperprior ω ∼ B(v_0, w_0). This would in principle be possible also with our prior; however, from a computational point of view a more convenient choice is to set

p(δ) ∝ |Q(δ)|^(−1/2),

as with this choice the determinant of Q(δ) cancels out in the joint prior p(β, δ | τ²), which results as

p(β, δ | τ²) ∝ exp( −1/(2γτ²) β'Q(δ)β ).

Choice of hyperparameters
The hyperparameters of the effect fusion prior should be chosen to minimize the expected loss of the underlying decision problem: loss occurs if level effects which are different are fused or if effects which are equal are not fused. We call the first case a false negative, as a non-zero effect difference is not detected, and the second a false positive, as a zero effect difference is classified as non-zero. False positives and false negatives have different impacts: if an effect difference is falsely classified as positive, two parameters are included in the model though only one would be sufficient. This results in a loss of estimation efficiency. In contrast, if an effect difference is falsely classified as negative, two effects that are actually different from each other are modelled by only one parameter. This will result in biased estimation and poor prediction performance. Hence, the primary goal will be to avoid false negatives.
From the representation of the prior in terms of spike and slab priors on all effect differences θ_kj, it is evident that the conditional prior fusion probability P(δ_kj = 0 | θ_kj) depends on the slab to spike ratio r and on the parameters g_0 and G_0 of the inverse Gamma distribution for the slab variance.
We propose to set g_0 = 5, a standard choice in variable selection (see e.g. Fahrmeir et al. (2010); Scheipl et al. (2012)), for which the tails of spike and slab are not so thin as to cause mixing problems in MCMC. For fixed θ_kj > 0, the prior fusion probability P(δ_kj = 0 | θ_kj) is lower for a larger slab to spike ratio r and for a smaller scale parameter G_0. This suggests choosing a high value for r and a small value for G_0. However, shrinkage of effect differences to zero is more pronounced with a smaller scale parameter G_0, also under the slab, which might hamper detection of small effect differences. We investigate this issue in more detail in the simulation study in Section 5.3.
MCMC sampling

As the model is a linear Bayesian regression model with a conditionally conjugate prior, MCMC is straightforward. After choosing starting values for ξ = (τ², δ) and σ², MCMC proceeds by iterating between the following steps:

1. Update the prior variance matrix B_0(ξ) and sample the regression coefficients β from the full conditional Normal distribution N(b, B) with moments

B = (X'X/σ² + B_0(ξ)⁻¹)⁻¹,   b = B X'y/σ².

2. Sample the error variance σ² from the Inverse Gamma distribution G⁻¹(s_n, S_n) with s_n = s_0 + n/2 and S_n = S_0 + (y − Xβ)'(y − Xβ)/2.

3. For h = 1, ..., p: sample the scale parameter τ²_h from the Inverse Gamma distribution G⁻¹(g_h, G_h) with parameters g_h = g_0 + c_h/2 and G_h = G_0 + β'_h Q_h(δ_h) β_h / (2γ_h).

4. As the quadratic form in equation (6) as well as the prior p(δ_h) can be factorized with respect to the indicators δ_h,kj, these can be sampled independently from

p(δ_h,kj = 1 | β_h, τ²_h) ∝ exp( −θ²_h,kj/(2γ_h τ²_h) ),   p(δ_h,kj = 0 | β_h, τ²_h) ∝ exp( −r θ²_h,kj/(2γ_h τ²_h) ).
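The sampling steps above can be put together in a toy Gibbs sampler. The following is a minimal sketch on simulated data (our own illustrative code and hyperparameter values, not the authors' implementation), using the prior on the indicators for which the determinant of Q(δ) cancels:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one covariate with c = 3 levels above the baseline and true
# effects beta = (0, 1, 1), so level 1 is zero and levels 2, 3 fuse.
n, c, r, gamma = 200, 3, 1e4, 3 / 2
levels = rng.integers(0, c + 1, size=n)
X = np.eye(c + 1)[levels][:, 1:]            # dummy coding, baseline dropped
beta_true = np.array([0.0, 1.0, 1.0])
y = X @ beta_true + rng.normal(0.0, 0.5, size=n)

pairs = [(k, j) for k in range(1, c + 1) for j in range(k)]

def Q_of(delta):
    """Structure matrix Q(delta) for unrestricted fusion."""
    kap = {p: (r if delta[p] == 0 else 1.0) for p in pairs}
    Q = np.zeros((c, c))
    for k in range(1, c + 1):
        Q[k - 1, k - 1] = sum(kap[(max(k, j), min(k, j))]
                              for j in range(c + 1) if j != k)
        for j in range(1, k):
            Q[k - 1, j - 1] = Q[j - 1, k - 1] = -kap[(k, j)]
    return Q

delta = {p: 1 for p in pairs}
beta, sig2, tau2 = np.zeros(c), 1.0, 1.0
g0, G0, s0, S0 = 5.0, 20.0, 0.0, 0.0        # illustrative hyperparameters
for it in range(500):
    Q = Q_of(delta)
    # 1. regression effects from their Normal full conditional
    B = np.linalg.inv(X.T @ X / sig2 + Q / (gamma * tau2))
    beta = rng.multivariate_normal(B @ X.T @ y / sig2, B)
    # 2. error variance from its inverse Gamma full conditional
    resid = y - X @ beta
    sig2 = 1.0 / rng.gamma(s0 + n / 2, 1.0 / (S0 + resid @ resid / 2))
    # 3. scale parameter tau^2 from its inverse Gamma full conditional
    tau2 = 1.0 / rng.gamma(g0 + c / 2,
                           1.0 / (G0 + beta @ Q @ beta / (2 * gamma)))
    # 4. indicators: with p(delta) prop. to |Q(delta)|^(-1/2) the determinant
    #    cancels and each delta_kj is a simple Bernoulli draw
    b = np.concatenate(([0.0], beta))       # prepend baseline beta_0 = 0
    for (k, j) in pairs:
        th2 = (b[k] - b[j]) ** 2
        w1 = np.exp(-th2 / (2 * gamma * tau2))
        w0 = np.exp(-r * th2 / (2 * gamma * tau2))
        delta[(k, j)] = int(rng.random() < w1 / (w0 + w1))
```

In this sketch, draws of β_1 concentrate near zero while β_2 and β_3 stay close to each other, mirroring the fusion behaviour described in the text.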

Model selection
Effect fusion aims at selecting an appropriate model for a categorical predictor and thus is a particular model selection problem. In a Bayesian approach, model selection is usually based on posterior model probabilities and the goal is to find the model with maximum posterior model probability. Slightly differently, in Bayesian variable selection typically not the maximum probability model but the median probability model, i.e. the model including all covariates that have an estimated posterior inclusion probability larger than 0.5, is selected.
To select a model with potentially fused effects one could use the estimated fusion probabilities ψ̂_h,kj = 1 − δ̄_h,kj, where δ̄_h,kj is the mean of the corresponding MCMC draws, and fuse effects if ψ̂_h,kj > 0.5. However, this strategy could yield a logically inconsistent model where e.g. levels j and l as well as k and l are fused but not levels j and k. Hence, we fuse levels k and j only if ψ̂_h,kj > 0.5 and, for all l ≠ j, k, both ψ̂_h,kl and ψ̂_h,jl are either both larger or both smaller than 0.5. This strategy avoids logically inconsistent models and, as levels are only fused when evidence is clear, takes into account the asymmetry in the loss of false positives and false negatives.
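The fusion rule with its consistency check can be sketched in code (our own illustrative implementation; the probability matrix psi is hypothetical):

```python
import numpy as np

def fuse_pairs(psi, thresh=0.5):
    """Select pairs (k, j) to fuse from a symmetric matrix psi of
    estimated pairwise fusion probabilities (index 0 = baseline).
    A pair is fused only if psi[k, j] > thresh and, for every other
    level l, both psi[k, l] and psi[j, l] fall on the same side of
    the threshold -- this rules out logically inconsistent models."""
    m = psi.shape[0]
    fused = []
    for k in range(m):
        for j in range(k):
            if psi[k, j] <= thresh:
                continue
            consistent = all((psi[k, l] > thresh) == (psi[j, l] > thresh)
                             for l in range(m) if l not in (k, j))
            if consistent:
                fused.append((k, j))
    return fused

# levels 1 and 2 clearly fuse with each other and with the baseline 0;
# level 3 stands alone
psi = np.array([[0.0, 0.9, 0.8, 0.1],
                [0.9, 0.0, 0.9, 0.2],
                [0.8, 0.9, 0.0, 0.1],
                [0.1, 0.2, 0.1, 0.0]])
print(fuse_pairs(psi))   # → [(1, 0), (2, 0), (2, 1)]
```

If instead psi[1, 0] were large but levels 0 and 1 disagreed about a third level, the pair (1, 0) would be dropped despite its high fusion probability.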
After model selection, we estimate the dummy-coded regression coefficients of the selected model by a Bayesian regression under a flat Normal prior N (0, IB 0 ) with B 0 = 10000 on all effects.

Simulation study
We now illustrate the performance of the proposed method in a simulation study and compare our method to various other approaches: the regularization approach in Gertheiss and Tutz (2010) (Penalty), the Bayesian lasso (BLasso), the Bayesian elastic net (BEN) and the group lasso (GLasso). Additionally, we include Bayesian regularization via the graph Laplacian (GLap), proposed in Liu et al. (2014). They also specify the prior directly on the elements of the prior precision matrix, with the goal of identifying conditional independence by shrinking off-diagonal elements to zero.
A list of the used R packages and related papers is given in Appendix B.1. Additionally, we fit the full model (Full) with separate dummy variables for each level and the true model (True), i.e. the model with fused categories according to data generation. We use a set-up similar to that in Gertheiss and Tutz (2010) and compare the methods with respect to parameter estimation, predictive performance and model selection.
To perform effect fusion, we specify a Normal prior with variance B_0 = 10000 for the intercept and the improper prior p(σ²) ∝ 1/σ² (which corresponds to an Inverse Gamma distribution with parameters s_0 = S_0 = 0) for the error variance σ². For each covariate C_h, the hyperparameters are set to G_h0 = 20 and r = 20000, but we also investigate different values in Section 5.3.
MCMC is run for 10000 iterations after a burn-in of 5000 to perform model selection for each data set. Models Full and True and the refit of the selected model are estimated under a flat Normal prior N(0, IB_0) with B_0 = 10000 on the regression coefficients and MCMC is run for 3000 iterations after a burn-in of 1000. The tuning parameters of the frequentist methods Penalty and GLasso are selected automatically via cross-validation in the corresponding R packages. For the Bayesian methods, we use the default prior parameter settings in the code (for GLap) and in the R packages monomvn and EBglmNet and estimate the regression coefficients by the posterior means.

Simulation results
We first compare the suggested method for Bayesian effect fusion to the other approaches with respect to estimation of the regression effects. Figure 1 shows the mean squared estimation error (MSE), defined for each covariate C_h as

MSE_h = (1/c_h) Σ_{k=1}^{c_h} (β̂_h,k − β_h,k)².

Evidently, the means of the MSEs (over all 100 data sets) are lower for Bayesian effect fusion than for all other methods. Bayesian effect fusion performs particularly well for covariates where all levels have an effect of zero (variables 2, 4, 6, 8). For covariates with non-zero effects overall performance is good, but for some data sets the MSE can be higher than for the full model, when levels with actually different effects are fused; see e.g. covariate 5, a nominal covariate with eight levels.
The competitors BLasso and BEN perform very well both for covariates with zero as well as covariates with non-zero effects. Penalty which is designed for effect fusion does not clearly outperform these two methods for covariates with non-zero effects but yields higher MSEs for covariates with no effects. GLasso performs reasonably well for covariates with no effects but worse for covariates with non-zero effects and GLap yields only slight improvements compared to the full model for covariates with no effect.
We would like to remark that also model averaged estimates, which are obtained as posterior mean estimates from the first MCMC run under the effect fusion prior, perform very well with respect to parameter estimation.
To evaluate the predictive performance of Bayesian effect fusion, we generate a new sample of n* = 500 observations z_j, j = 1, ..., n*, from the linear regression model (2) with fixed regressors x̃_j and the same parameters as in the simulated data sets. Predictions for these new observations are computed using the estimates from each of the original data sets as ẑ_j = µ̂ + Σ_h x̃'_j,h β̂_h.

Figure 1: Variables in the right panel (even numbers) have no effect on the response.
The mean squared prediction errors (MSPE), defined for each data set i as

MSPE_i = (1/n*) Σ_{j=1}^{n*} (z_j − ẑ_j^(i))²,   i = 1, ..., 100,

are shown in Figure 2. The predictive performance of our method is almost as good as if the true model were known and considerably better than that of all competing methods in most data sets. BLasso is the second best method, and also Penalty, GLasso and BEN yield slightly smaller prediction errors compared to the full model. Prediction errors using GLap are similar to those from the full model.

Finally, to evaluate and compare the performance of the methods with respect to model selection, we use the true positive rate (TPR), the true negative rate (TNR), the positive predictive value (PPV) and the negative predictive value (NPV); see Appendix B.2 for detailed definitions. If fusion is completely correct, all four values are equal to 100%, but TPR and PPV are not defined for covariates where all effects are zero. For the effect fusion prior we perform model selection as described in Section 4.2; for the other methods we consider two level effects as identical if the posterior mean of their difference is smaller than or equal to 0.01. Results reported in Tables 1 and 2 show that Bayesian effect fusion clearly outperforms all other methods, in particular with respect to identifying categories with the same effect (TNR).
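The four selection measures can be computed from the classification of all effect differences as zero or non-zero. The sketch below uses the standard confusion-matrix definitions (our reading of Appendix B.2, not code from the paper):

```python
def selection_measures(true_nonzero, selected_nonzero):
    """TPR, TNR, PPV, NPV for classified effect differences.
    Both arguments are boolean lists: True = difference non-zero.
    A measure is None when its denominator is empty (e.g. TPR for a
    covariate where all effects are zero)."""
    tp = sum(t and s for t, s in zip(true_nonzero, selected_nonzero))
    tn = sum(not t and not s for t, s in zip(true_nonzero, selected_nonzero))
    fp = sum(not t and s for t, s in zip(true_nonzero, selected_nonzero))
    fn = sum(t and not s for t, s in zip(true_nonzero, selected_nonzero))
    return {"TPR": tp / (tp + fn) if tp + fn else None,
            "TNR": tn / (tn + fp) if tn + fp else None,
            "PPV": tp / (tp + fp) if tp + fp else None,
            "NPV": tn / (tn + fn) if tn + fn else None}

truth    = [True, True, False, False]
selected = [True, False, False, False]
print(selection_measures(truth, selected))
```

In this example one non-zero difference is missed (a false negative), so TPR drops to 0.5 and NPV falls below 1 while TNR and PPV remain perfect.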

Influence of hyperparameters
In this section, we investigate the sensitivity of model selection under the Bayesian effect fusion prior with respect to the hyperparameters. As discussed in Section 3.4, the primary goal is to avoid incorrect fusion of level effects, i.e. false negatives, while keeping false positives at a moderate level. We therefore focus on false negative rates, FNR = 1 − TPR, and false positive rates, FPR = 1 − TNR, and report both rates for various values of G_0 and fixed r = 20000 in Table 3 and for various values of r and fixed G_0 = 20 in Table 4. Table 3 indicates that increasing G_0 from 0.2 to 200 has little effect on FNR but results in lower FPR for ordinal predictors (covariates 1 to 4), whereas it has little effect on FPR but results in higher FNR for nominal effects. Hence, G_0 should not be chosen too high. From our experience, G_0 = 2 is a good choice to detect also small effect differences of nominal predictors, whereas for ordinal predictors a larger value of G_0, e.g. G_0 = 20, is reasonable. Table 4 reports FNR and FPR for values of r from 2·10² to 2·10⁵. Obviously, r has almost no influence for ordinal predictors, but for nominal covariates low values of r encourage too much fusion and hence yield a high FNR. These results indicate that r should not be too small, but still small enough to avoid stickiness of MCMC. We suggest using a value of at least 2·10⁴.

Real data example
As an illustration of Bayesian effect fusion on real data, we model contributions to private retirement pensions in Austria. The data are from the European household survey EU-SILC (SILC = Survey on Income and Living Conditions) 2010 in Austria. We use a linear regression model to analyse the effects of socio-demographic variables on the (log-transformed) annual contributions to private retirement pensions. As potential regressors we consider gender (binary, 1=female/0=male), age group (ordinal with eleven levels), child in household (binary, 1=yes/0=no), income class (in quartiles of the total data set, i.e. ordinal with four levels), federal state of residence in Austria (nominal with nine levels), highest attained level of education (nominal with ten levels) and employment status (nominal with four levels). We restrict the analysis to observations without missing values in regressors and/or response and a minimum annual contribution of EUR 100. Hence, the final data set used for our analysis comprises 3077 persons.
We standardize the response and fit a regression model including all potential covariates. Results reported in Table 5 indicate that several levels of covariate education have a similar effect and most level effects of federal state are close to zero, which suggests that a sparser model might be adequate for these data.
To specify the effect fusion prior, we choose the hyperparameters as r = 50000, G_h0 = 2 for nominal and G_h0 = 20 for ordinal predictors. To perform model selection, MCMC was run for 50000 iterations after a burn-in of 30000. Based on the estimated pairwise fusion probabilities we perform model selection for all covariates as described in Section 4.2. The covariates child, federal state and employment status are completely excluded from the model, and levels of the covariates age group, income class and education are fused. Thus, the selected model has only eleven regression effects compared to 35 in the full model. Results of a refit of the selected model using flat priors are shown in the right panel of Table 5. The posterior mean of the error variance, σ̂² = 0.828, is almost identical to that of the full model, where σ̂² = 0.826.

Conclusion
In this paper, we present a method for sparse modelling of the effects of categorical covariates in Bayesian regression models. Sparsity is achieved by excluding irrelevant predictors and/or by fusing levels which have essentially the same effect on the response. To encourage effect fusion, we propose a Normal prior distribution that allows for almost perfect as well as almost zero partial dependence between level effects. Alternatively, this prior can be derived by specifying spike and slab priors on all level effect contrasts associated with one covariate and taking the linear restrictions among them into account.
An advantage of this prior construction is that it easily allows incorporating prior information on which pairs of levels should not be fused directly. This property is of particular interest for ordinal covariates, where typically fusion would be restricted to subsequent levels.
Posterior inference using MCMC methods is straightforward. Model selection can be based on the estimated posterior means of the pairwise fusion probabilities. To avoid selection of logically inconsistent models we suggest fusing effects only when posterior evidence is clear. Simulation results show that the proposed method automatically excludes irrelevant predictors entirely and outperforms competing methods in terms of correct model selection, coefficient estimation as well as prediction.
Bayesian effect fusion is not restricted to categorical predictors in linear regression models but can also be applied in more general regression models, e.g. generalised linear models. Only little adaptation is required for posterior simulation in any Bayesian regression-type model where posterior inference using MCMC methods is feasible under a Normal prior on the regression effects.
A certain drawback of the method is that to construct the prior covariance matrix all pairwise effect differences have to be assessed in each MCMC sampling step and hence the computational effort can be prohibitive for nominal covariates with a very high number of levels.
Thus, the off-diagonal elements of Q are

q_kj = ℓ'_k Λ ℓ_j,   k ≠ j,

where ℓ_k denotes the k-th column of L and Λ the diagonal matrix of the precisions κ. For each pair of columns k and j, there is exactly one row where both vectors have a non-zero element, which takes the value 1 for one and −1 for the other vector, and therefore we have q_kj = −κ_kj. Finally, the diagonal elements of Q are given as

q_kk = ℓ'_k Λ ℓ_k = Σ_{j=0, j≠k}^{c} κ_kj.


B Details on the simulation study

B.1 Alternative methods
Table 6 lists the methods to which we compare Bayesian effect fusion in the simulation study, together with the names of the corresponding R packages and the references given in the package manuals. The code of the graph Laplacian approach in Liu et al. (2014) was provided directly by the authors.

B.2 Model selection measures
As measures for correct model selection, we use the true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV) and negative predictive value (NPV).