Gibbs Variable Selection using BUGS

In this paper we discuss and present in detail the implementation of Gibbs variable selection as defined by Dellaportas et al. (2000, 2002) using the BUGS software (Spiegelhalter et al. ,'96a,b,c). The specification of the likelihood, prior and pseudo-prior distributions of the parameters as well as the prior term and model probabilities are described in detail. Guidance is also provided for the calculation of the posterior probabilities within BUGS environment when the number of models is limited. We illustrate the application of this methodology in a variety of problems including linear regression, log-linear and binomial response models.


Introduction
In Bayesian model averaging or model selection we focus on the calculation of posterior model probabilities which involve integrals analytically tractable only in certain restricted cases.This obstacle has been overcomed via the construction of efficient MCMC algorithms for model and variable selection problems.
A variety of MCMC methods have been proposed for variable selection including the Stochastic Search Variable Selection (SSVS) of George and McCulloch (1993), the reversible jump Metropolis by Green (1995), the model selection approach of Carlin and Chib (1995) the variable selection sampler of Kuo and Mallick (1998) and the Gibbs variable selection (GVS) by Dellaportas et al. (2000Dellaportas et al. ( , 2002)).
The primary aim of this paper is to clearly illustrate how we can utilize BUGS (Spiegelhalter et al. , 1996a, see also www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml)for the implementation of variable selection methods.We concentrate on Gibbs variable selection introduced by Dellaportas et al. (2000Dellaportas et al. ( , 2002) ) with independent prior distributions.Extension to other Gibbs samplers such as George and Mc-Cullogh (1993) SSVS and Kuo and Mallick (1998) sampler is straightforward; see for example in Dellaportas et al. (2000).Finally, application of Carlin and Chib (1995) algorithm is also illustrated using BUGS by Spiegelhalter et al. (1996c).
The paper is organised into three sections additional to this introductory one.Section 2 briefly describes the general Gibbs variable selection algorithm as introduced by Dellaportas et al. (2002), Section 3 provides detailed guidance for implementation in BUGS and finally Section 4 presents three illustrated examples.

Gibbs Variable Selection
Many statistical models may be represented naturally as (s, γ) ∈ S × {0, 1} p , where the indicator vector γ identifies which of the p possible sets of covariates are present in the model and s denotes other structural properties of the model.For example, for a generalised linear model, s may describe the distribution, link function and variance function, and the linear predictor may be written as where X j is the design matrix and β j the parameter vector related to the jth term.
In the following, we restrict attention to variable selection aspects assuming that s is known and we concentrate on the estimation of the posterior distribution of γ.
We denote the likelihood of each model by f (y|β, γ) and the prior by f (β, γ) = f (β|γ)f (γ), where f (β|γ) is the prior of the parameter vector β conditional on the model structure γ and f (γ) is the prior of the corresponding model.Moreover, β can be partitioned into two vectors β γ and β \γ corresponding to parameters of variables included or excluded from the model.Under this approach the prior can be rewritten as while, since we are using the linear predictor (1), the likelihood can be simplified to From the above it is clear that the components of the vector β \γ do not affect the model likelihood and hence the posterior distribution within each model γ is given by where f (β γ |γ, y) is the actual posterior of the parameters of model γ and f (β \γ |β γ , γ, y) is the conditional prior distribution of the parameters not included in the model γ.We may now interpret f (β γ |γ) as the actual prior of the model while the distribution f (β \γ |β γ , γ) may be called as 'pseudoprior' since the parameter vector β \γ does not gain any information from the data and does not influence the actual posterior of the parameters of each model, f (β γ |γ, y).Although this pseudoprior does not influence the posterior distributions of interest, it influences the performance of the MCMC algorithm and hence it should be specified with caution.
The sampling procedure is summarised by the following steps: 1. We sample the parameters included in the model by the posterior 2. Sample the parameters excluded from the model from the pseudoprior 3. Sample each variable indicator γ j from a Bernoulli distribution with success probability O j /(1 + O j ); where O j is given by The selection of priors and pseudopriors is a very important aspect in model selection.Here we briefly present the simplest approach where f (β|γ) is given a product of independent prior and pseudoprior densities: In such case, a usual and simple choice of f (β j |γ j ) is given by Note that the above prior can be efficiently used in any model selection problem if we orthogonalize the data matrix and then perform model choice using the new transformed data (see Clyde et al. , 1996).If orthogonalization is undesirable then we may need to construct more sophisticated and flexible algorithms such as reversible jump MCMC; see Green (1995) and Dellaportas et al. (2002).
The simplified prior (5) and model formulation such as (1), result in the following full conditional posterior 6) indicating that the pseudoprior, f (β j |γ j = 0) does not affect the posterior distribution of each model coefficient.
Similarly to George and McCulloch (1993), we use a mixture of Normal distribution for model parameters.
The hyperparameters μj and S j are parameters of the pseudoprior distribution; therefore their choice is only relevant to the behaviour of the MCMC chain and do not affect the posterior distribution.Ideal choices of these parameters are the maximum likelihood or pilot run estimators of the full model (as, for example, in Dellaportas and Forster, 1999).However, in the experimental process, we noted that an automatic selection of μj = 0 and S j = Σ j /k 2 with k = 10 has also been proven an adequate choice; for more details see Ntzoufras (1999).This 'automatic selection' uses the properties of the prior distributions with 'large' and 'small' variance introduced in SSVS by George and McCulloch (1993).The parameter k is now only a pseudoprior parameter and therefore it does not affect the posterior distribution.Suitable calibration of this parameter assists the chain to move better (or worse) between different models.
The prior proposed by Dellaportas and Forster (1999) for contingency tables, is also adopted here for logistic regression models with categorical explanatory variables (see Dellaportas et al. , 2000).Alternatively, for generalized linear models, Raftery (1996) has proposed to select the prior covariance matrix using elements from the data matrix multiplied by a hyperparameter.The latter was selected in such way that the effect of the prior distribution on the posterior odds becomes minimal.
When no restrictions on the model space are imposed then a common prior for the term indicators γ j is f (γ j ) = Bernoulli (1/2), whereas in other cases (for example, hierarchical or graphical log-linear models) it is required that f (γ j |γ \j ) depends on γ \j ; for more details see Chipman (1996) and Section 3.4.
Other Gibbs samplers for model selection have also been proposed by George and McCulloch (1993), Carlin and Chib (1995) and Kuo and Mallick (1998).Detailed comparison and discussion of the above methods is given by Dellaportas et al. (2000Dellaportas et al. ( , 2002)).Implementation of Carlin and Chib methodology in BUGS is illustrated by Spiegelhater et al. (1996c, page 47) while an additional simple example of Gibbs variable selection methods is provided by Dellaportas et al. (2000).

Applying Gibbs Variable Selection Using BUGS
In this section we provide detailed guidance for implementing Gibbs variable selection using BUGS software.It is divided into four sub-sections involving the definition of the model likelihood f (y|β, γ), the specification of the prior distributions f (β|γ) and f (γ), and, finally, the direct calculation of posterior model probabilities using BUGS.

Definition of likelihood
The linear predictor of type (1) used in Gibbs variable selection and Kuo and Mallick sampler can be easily incorporated in BUGS using the following code where • N denotes the sample size, • p the number of total variables under consideration, • x[i,j] is the i, j component of the data or design matrix X, • y[i] is i element of the response vector y, • b[j] is the j element of the parameter vector β, • g[j] is the inclusion indicator for j element of γ, • z[i,j] is a matrix used to simplify calculations, • eta[i] is the i element of linear predictor vector η and should be substituted by the corresponding link function, for example logit(p[i]) in binomial logistic regression, • distribution should be substituted by appropriate BUGS command for the distribution that the user prefers (for example dnorm for normal distribution), • parameter1,parameter2 should be substituted according to distribution chosen, for example for the normal distribution with mean µ i and variance τ −1 we may use mu[i], tau.
For the usual normal, binomial and Poisson models the model formulations are given by the following lines of BUGS code where mu[i] is the expected value for the ith observation and tau is the precision of the regression model.
where lambda[i] is the Poisson mean for the ith observation.
where p[i] is the probability of success and n[i] is the total number of Bernoulli trials for the ith binomial experiment.Alternative link functions maybe used by substituting logit(p where Φ is the standardised normal cumulative distribution function.

Definition of Prior Distribution of Parameter Vector
When we use independent priors as given by ( 5) and each covariate parameter vector is univariate, the definition of the prior is straightforward.Our prior is a mixture of independent normal distributions where μj , S j are the mean and variance respectively used in the corresponding pseudoprior distributions and Σ j is the prior variance, when the j term is included in the model.In order to use (8) in BUGS we write for j = 1, 2, . . ., p; where m j and τ j are the prior mean and precision for β j depending on γ j and t[j], se[j], mean[j], bpriorm[j], tprior[j] are the BUGS variables for Σ −1 j , S j , μj , m j and τ j , respectively.When we consider a categorical explanatory variable j with J > 2 categories then the corresponding parameter vector β j will be multivariate with dimension d j = J − 1.In such cases we denote by p and d(> p) the dimensions of γ and the full parameter vector β, respectively.Therefore, we need one variable to facilitate the association between these two vectors.This vector is denoted by the BUGS variable pos.The pos vector, which has dimension equal to the dimension of β, takes values from 1, 2, ..., p and denotes that kth element of the parameter vector β is associated with the γ pos k binary indicator for all k = 1, 2, ..., d.
For illustration, let us consider an ANOVA model with two categorical variables X 1 and X 2 with 3 and 4 categories respectively.Then, the terms under consideration are X 0 , X 1 , X 2 and X 12 ; where X 0 denotes the constant term and X 12 the interaction between the terms X 1 and X 2 .The corresponding dimensions are Then, we set the pos vector equal to pos <-c ( 1, 2,2, 3,3,3, 4,4,4,4,4,4 ) to state that the first parameter corresponds to the first term (X 0 ), parameters 2-3 correspond to the second term (X 1 ), parameters 4-6 correspond to the third term (X 2 ) and parameters 7-12 correspond to the fourth term (X 12 ).Finally, we use another vector called gtemp of dimension d which is given by gtemp[i] <-g[ pos [i] ] for all i = 1, . . ., d.The vector gtemp is used in the likelihood instead of the g vector.For details see example 1 and the associated BUGS code in the Appendix.
Moreover, the definition of the prior distribution when factors or terms with many parameters are considered is more complicated.For example a mixture of multivariate normal prior distributions as given by ( 5) can be expressed as a multivariate normal distribution on the 'full' parameter vector β.Therefore we may write in BUGS where N d is the d-dimensional normal distribution; m T = (m 1 , m 2 , . . ., m d ) and T are the prior mean vector and precision matrix depending on γ; μk is the corresponding pilot run estimate for k element of model parameter vector β; Σ is the constructed prior covariance matrix for the whole parameter vector β when we use for each β j the multivariate extension of prior distribution (8); T kl and [Σ −1 ] kl is the k row and l column elements of T and Σ −1 matrices respectively; and Tau[,], t[,] are the BUGS matrices for T and Σ −1 , respectively.An illustration of usage of such prior distribution is given in example 1.

Implementation of Other Gibbs Samplers for Variable Selection
SSVS and Kuo and Mallick sampler can easily be applied with minor modifications in the above code.In SSVS the prior ( 8) is used with μj = 0 and S j = Σ j /k 2 j , where k 2 j should be large enough in order that β j will be close to zero when γ j = 0.For selection of the prior parameters in SSVS see semiautomatic prior selection of George and McCulloch (1993).The above restriction can easily be applied in BUGS by bpriorm Moreover, the likelihood in SSVS should be slightly modified by substituting the first line of the code in Section 3.1 with Kuo and Mallick sampler uses prior on β that does not depend on model indicator γ.Therefore the specification of the prior is the same as in simple modelling using BUGS.Moreover, the likelihood definition is the same as in Gibbs variable selection described in Section 3.1.

Definition of Prior Term Probabilities
In order to apply any variable selection method in BUGS we need to define the prior probabilities f (γ).When we are vague about models we may set f (γ) = 1/M , where M is the number of all models under consideration.When the explanatory variables do not involve interactions (for example in linear regression) then the number of models under consideration is 2 p .In these situations the latent variables γ j can be treated as a − priori independent and therefore set in BUGS • g[j] ∼ dbern(0.5)denoting that γ j ∼ Bernoulli(0.5).
for all j = 1, 2, . . ., p.This prior results to f (γ) = 2 −p for all γ ∈ {0, 1} p .When we are dealing with models using categorical explanatory variables with interaction terms, such as AN OV A or log-linear models, we usually want to restrict attention to hierarchical models.The conditional distributions of f (γ j |γ \j ) need to be specified in such way that f (γ) = 1/M when γ is referring to hierarchical model and f (γ) = 0 otherwise.
For example, in a two way AN OV A we have three terms under consideration; the main effects A,B and the interaction AB.All possible models are eight, while the hierarchical ones are only five (constant, [A]
From the above it is evident that Using similar calculations we find that f (γ) = 0.2 for all five models under consideration.For further relevant discussion and application see Chipman (1996).For implementation in BUGS see examples 1 and 3.

Calculating Model Probabilities in Bugs
In order to directly calculate the posterior model probabilities in BUGS and avoid saving large MCMC output we may use matrix type variables with dimension equal to the number of models.Using a simple coding such as 1+ p j=1 γ j 2 j−1 we transform the vector γ in a unique, for each model index (noted by mdl) for which pmdl[mdl]=1 and pmdl[j]=0 for all j = mdl.The above statements can be written in BUGS with the code for (j in 1: for (m in 1:mdl) { pmdl[m] < − equals(m,mdl) } Then using the command stats(pmdl) in BUGS environment (or cmd file) we can monitor the posterior model probabilities.This is feasible only if the number of models is limited and therefore applicable only in some simple problems.

Examples
The implementation of three illustrated examples are briefly presented.The first example is a 3 × 2 × 4 contingency table used to illustrate how to handle factors with more than two levels.Example 2 provides model selection details in a regression type problem involving many different error distributions while example 3 is a simple logistic regression problem with random effects.In all examples posterior probabilities are presented while the associated BUGS codes are provided in the appendix.Additional details (for example, convergence plots) are omitted since our aim is just to illustrate how to use BUGS for variable selection.

Example 1: 3 × 2 × 4 Contingency Table
This example is a 3 × 2 × 4 contingency table presented by Knuiman and Speed (1988) where 491 individuals are classified by three categorical variables: obesity (O: low,average,high), hypertension (H: yes,no) and alcohol consumption (A: 1,1-2,3-5,6+ drinks per day); see Table 1 The full model is given by The above model can be rewritten with likelihood given by (1) where β can be divided to β j sub-vectors with j ∈ {∅, O, H, OH, A, OA, HA, OHA}; where ].Each β j is a multivariate vector and therefore each prior distribution involves mixture multivariate normal distributions.We use sum-to-zero constraints and prior variance Σ j as in Dellaportas and Forster (1999).We restrict attention in hierarchical models including always the main effects since we are mainly interested for relationships between the categorical factors.Under these restrictions, the models under consideration are nine (9).In order to forbid moves to non hierarchical models we use the following BUGS code to define the prior model probabilities: • for (i in 5:7) { g[j]∼dbern(pi) } for γ j |γ OHA ∼ Bernoulli(π) for all i ∈ {OH, OA, HA}, • for (j in 1:4) { g[j] ∼ dbern(1) } for γ j ∼ Bernoulli(1) for all i ∈ {constant, O, H, A}.
These priors result to prior probability for all hierarchical models equal to 1/9 and zero otherwise.
Results using both pilot run pseudoprior and automatic pseudoprior with k = 10 are summarised in

Example 2: Stacks Dataset
This example involves stack-loss data analysed by Spiegelhalter et al. (1996b, page 27) using the Gibbs sampling.The dataset features 21 daily responses of stack loss (y) which measures the amount of ammonia escaping with covariates the air flow (x 1 ), temperature (x 2 ) and acid concentration (x 3 ).Spiegelhalter et al. (1996b) consider regression models with four different error structures (normal, double exponential, logistic and Student's t 4 distributions).They also consider the cases of ridge and simple independent regression models.We extend their work by applying Gibbs variable selection on all these eight cases.
The full model will be where D i (µ i , τ ) is the distribution of the errors with mean µ i and variance τ −1 which here is assumed to be normal, or double exponential, or logistic or t 4 ; where z ij = (x ij − xj )/sd(x j ) are the standardised covariates.The ridge regression approach assumes a further restriction that the β j for j = 1, 2, 3 are exchangable (Lindley and Smith, 1972) and therefore we have β j ∼ N (0, φ −1 ).We use 'non-informative' priors with prior precision equal to 10 −3 for the independent regression and for φ in ridge regression we use gamma prior with parameters equal to 10 −3 .Since we do not wish to apply any restriction on the model space we use the prior probabilities γ j ∼ Bernoulli(1/2) for j = 1, 2, 3 which results to prior probability of 1/8 for all possible models.For the pilot run pseudoprior parameters we use the posterior values as given Spiegelhalter et al. (1996b).
Tables 3 and 4 provide the results from all eight distinct cases using pilot run pseudopriors.In all cases flow of air (z 1 ) has posterior probability of inclusion higher than 99%.The temperature (z 2 ) seems to be also an important term with posterior probability of inclusion varying from 39% to 96%.The last term (z 3 ) which measures the acid concentration in air has low posterior probabilities of inclusion which are less than 5% for simple independence models and less than 20% for 'ridge' regression models.

Example 3: Seeds Dataset, Logistic Regression with Random Effects
This example involves the examination of a proportion of seeds that germinated on 21 plates.For these 21 plates we have recorded the seed (bean or cucumber) and the type of root extract.This data set is analysed by Spiegelhalter et al.
(1996b, page 10) using BUGS; for more details see references there in.The model is a logistic regression with 2 categorical explanatory variables and random effects.
The full model will be written for i, l = 1, 2 and k = 1, . . ., 21; where y ilk and n ilk is the number of seeds germinated and total number of seeds respectively for i seed, l type of root extract and k plate; w k is the random effect for the k plate.
We use sum-to-zero constraints for both fixed and random effects.Following Dellaportas and Forster (1999) we use prior variance for the fixed effects Σ = 4 × 2. The prior for the precision of the random effects is considered to be a gamma distribution with parameters equal to 10 −3 .The pseudoprior parameters were taken from a pilot chain of the saturated model.The models under consideration are ten.The prior term probabilities for the fixed effects are assigned similarly as in the example for two-way ANOVA models.For the random effects term indicator we have that γ w ∼ Bernoulli(0.5).Table 5 provides the calculated posterior model probabilities.We used both pilot run proposals and automatic pseudoprior with k = 10.Both chains gave the same results as expected and the type of root extract (B) is the only factor that influences the proportion of germinated gems.The corresponding models with random and fixed effects have posterior probability equal to 51% and 32%, respectively.The marginal posterior probability of random effects is 61% which is about 56% higher than the posterior probability of fixed effects models.

Appendix: BUGS Codes
Bugs code and all associated data files are freely available in electronic form at the Journal of Statistical Software web site www.jstatsoft.org/v07/i07/ or by electronic mail request.

Table 2 .
The data give 'strong' evidence in favour of the model

Table 2 :
3 × 2 × 4 Contingency Table: Posterior Model Probabilities Using BUGS. of independence.Model [OH][A], in which obesity and hypertension are depending on each other given the level of alcohol consumption, is the model with the second highest posterior probability.All the other models have probability lower than 1%.