ABSTRACT

Jaesik Jeong, Marina Vannucci, Kim-Anh Do, Bradley Broom, Sinae Kim, Naijun Sha, Mahlet Tadesse, Kai Yan, and Lajos Pusztai
Indiana University, Indianapolis, IN; Rice University, Houston, TX; The University of Texas MD Anderson Cancer Center, Houston, TX; University of Michigan, Ann Arbor, MI; University of Texas at El Paso, El Paso, TX; and Georgetown University, Washington, DC

In this chapter we review our contributions to the development of Bayesian methods for variable selection. In particular, we review linear regression settings where the response variable is continuous, categorical, or a survival time. We also briefly describe how some of the key ideas of the variable selection method can be used in a different modeling context, i.e., model-based sample clustering. In the linear settings, we use a latent variable selection indicator to induce mixture priors on the regression coefficients. In the clustering setting, the group structure in the data is uncovered by specifying mixture models. In both the linear and the clustering settings, we specify conjugate priors and integrate out some of the parameters to accelerate model fitting. We use Markov chain Monte Carlo (MCMC) techniques to identify the high-probability models. The methods we describe are particularly relevant for the analysis of genomic studies, where high-throughput technologies allow thousands of variables to be measured on individual samples. In such data, the number of measured variables is in fact often substantially larger than the sample size. A typical example with this characteristic, and one that we use to illustrate our methodologies, is DNA microarray data.
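To make the scheme concrete, the sketch below implements a minimal version of this kind of stochastic search in Python. It assumes a Zellner g-prior on the selected coefficients, so that the coefficients and the error variance integrate out analytically and the marginal likelihood depends on the latent indicator alone; a Metropolis sampler that flips one inclusion indicator at a time then explores the model space. The function names, the g and w settings, and the one-flip move are illustrative choices, not the chapter's exact specification.

```python
# Minimal sketch of Bayesian variable selection by MCMC over a latent
# binary inclusion vector gamma, under a conjugate Zellner g-prior so
# that the regression coefficients and error variance integrate out.
# Assumes y and the columns of X have been centered (no intercept).
import numpy as np

def log_marginal(y, X, gamma, g=100.0):
    """log p(y | gamma), up to an additive constant."""
    n = len(y)
    q = int(gamma.sum())
    yty = y @ y
    if q == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, gamma]
    beta_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    ssr = y @ (Xg @ beta_hat)          # y'Py, P = projection onto X_gamma
    return (-0.5 * q * np.log(1.0 + g)
            - 0.5 * n * np.log(yty - (g / (1.0 + g)) * ssr))

def mcmc_select(y, X, n_iter=5000, w=0.1, seed=0):
    """Metropolis search over gamma: propose flipping one randomly
    chosen indicator; w is the prior inclusion probability."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros(p, dtype=bool)

    def log_post(gm):
        q = gm.sum()
        return (log_marginal(y, X, gm)
                + q * np.log(w) + (p - q) * np.log(1.0 - w))

    cur, counts = log_post(gamma), np.zeros(p)
    for _ in range(n_iter):
        prop = gamma.copy()
        j = rng.integers(p)
        prop[j] = not prop[j]          # add/delete move
        new = log_post(prop)
        if np.log(rng.uniform()) < new - cur:
            gamma, cur = prop, new
        counts += gamma
    return counts / n_iter             # posterior inclusion frequencies

# Toy p > n example: 200 candidate variables, 100 samples, 3 active.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 200))
beta = np.zeros(200); beta[:3] = 2.0
y = X @ beta + rng.standard_normal(100)
y -= y.mean(); X -= X.mean(axis=0)
print(np.argsort(mcmc_select(y, X))[-5:])  # top-5 inclusion frequencies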

The practical utility of variable selection is well recognized, and this topic has been the focus of much research. Variable selection can help assess the importance of explanatory variables, improve prediction accuracy, provide a better understanding of the underlying mechanisms generating the data, and reduce the cost of measurement and storage for future data. A comprehensive account of widely used classical methods, such as stepwise regression with forward and backward selection, can be found in Miller [22]. In recent years, procedures that specifically deal with very large numbers of variables have been proposed. One such approach is the least absolute shrinkage and selection operator (lasso) method of Tibshirani [36], which uses a penalized likelihood approach to shrink to zero the coefficient estimates associated with unimportant covariates. For Bayesian variable selection methods, pioneering work in the univariate linear model setting was done by Mitchell & Beauchamp [23] and George & McCulloch [15], and in the multivariate setting by Brown et al. [6]. The key idea of the Bayesian approach is to introduce a latent binary vector to index possible subsets of variables. This indicator is used to induce a mixture prior on the regression coefficients, and variable selection is performed based on the posterior model probabilities.
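In symbols, writing the latent binary vector as a set of per-variable indicators, the induced mixture prior on each coefficient has the familiar spike-and-slab form, shown here with a point mass at zero as in Mitchell & Beauchamp [23] (George & McCulloch [15] replace the point mass with a narrow Gaussian); the hyperparameters w and tau^2 below are generic placeholders, not the chapter's notation:

```latex
p(\beta_j \mid \gamma_j) = (1 - \gamma_j)\,\delta_0(\beta_j)
  + \gamma_j\,\mathrm{N}(\beta_j \mid 0, \tau^2),
\qquad \gamma_j \sim \mathrm{Bernoulli}(w), \quad j = 1, \dots, p.
```

Posterior model probabilities p(gamma | y) then rank subsets of variables, while marginal quantities such as p(gamma_j = 1 | y) rank individual variables.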