Bayesian variable selection for logistic mixed model with nonparametric random effects

https://doi.org/10.1016/j.csda.2011.12.014

Abstract

In analyzing correlated or clustered data with linear or logistic mixed effects models, one commonly assumes that the random effects follow a normal distribution with mean zero. However, this assumption may not be appropriate in many cases. In particular, a substantial violation of the normality assumption can affect which variables are selected in these models. In this article, we address the joint selection of fixed and random effects, together with bias control for the random effects, in a nonparametric setting. An efficient Bayesian variable selection procedure is implemented using a stochastic search Gibbs sampler, which allows both fixed and random effects to be dropped from the model. The approach is illustrated using a simulation study and a real data example.

Introduction

In longitudinal studies, logistic mixed models (Drum and McCullagh, 1993; Noortgate and Boeck, 2005) are widely used for clustered binary data to study the relationship between the response and covariates. Random effects are generally incorporated to account for subject-specific variation and are routinely assumed to follow a normal distribution with mean zero. However, this assumption may not be realistic, and one may question the validity of inferences on the mixed effects when it is violated. Moreover, a flexible specification for the random effects, one that allows multimodality or skewness, may provide insight into heterogeneity and even reveal important covariates that were omitted from the model. Such concerns have motivated many nonparametric approaches for the random effects. Zhang and Davidian (2001) approximated the random effects distribution by the seminonparametric approach of Gallant and Tauchen (1987), and Chen et al. (2002) extended it to generalized linear mixed models (GLMMs). Other frequentist approaches include Lai and Shih (2003) and Ghidey et al. (2004). Alternatively, many Bayesian nonparametric approaches using the Dirichlet process (DP) (Ferguson, 1973) and DP mixtures (DPM) have been proposed; see Bush and MacEachern (1996), Kleinman and Ibrahim (1998), and Ishwaran and Takahara (2002), among many others. However, these methods do not address uncertainty about which predictors to include in the fixed and random effects of the model.

Typically, a random effect is included for a predictor when its coefficient is expected to vary among subjects. However, a practical problem is how to decide which predictors have coefficients that vary among subjects. Standard approaches such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), generalized information criterion (GIC) and Bayes factor (BF) generally compare a small number of enumerated models. Such methods do not work well when the number of potential predictors is large, because the number of possible models increases exponentially with the number of predictors. For example, with $l_1$ fixed effects and $l_2$ random effects, the total number of possible models is $2^{l_1+l_2}$. When $l_1 = l_2 = 10$, the total number of models is $2^{20} = 1{,}048{,}576$, above one million.
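The count quoted above can be checked directly. The following is a minimal sketch; the function name `n_candidate_models` is introduced here for illustration only and assumes that every fixed and every random effect is simply either in or out of the model.

```python
def n_candidate_models(l1: int, l2: int) -> int:
    """Number of candidate models when each of l1 fixed effects and
    l2 random effects can independently be included or excluded."""
    return 2 ** (l1 + l2)


count = n_candidate_models(10, 10)
print(count)  # 1048576
```

Adding a single further candidate predictor doubles this count, which is why exhaustive comparison by AIC, BIC or Bayes factors quickly becomes impractical.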

Unlike in the linear mixed effects (LME) model (Laird and Ware, 1982), the likelihood in logistic mixed models has a rather complicated form. Likelihood-based inference requires integration over the dimensions of the random effects, which is often intractable even under a simple normal distribution. For this reason, researchers have proposed Laplace and other approximations, for example Schall (1991) and Breslow and Clayton (1993). However, such approaches may result in biased estimates of the fixed effects (Breslow and Lin, 1995; Lin and Breslow, 1996). To resolve this difficulty, simulation-based methods have been developed to circumvent the intensive integration. Zeger and Karim (1991) used Gibbs sampling for the random effects. McCulloch (1997) and Booth and Hobert (1999) used Monte Carlo EM algorithms to maximize the likelihood.
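To make the integration problem concrete, the sketch below evaluates the marginal likelihood of a single cluster under the simplest case, a scalar normal random intercept, in two ways: Gauss-Hermite quadrature and plain Monte Carlo. This is an illustration only and not the method of this article; the data `y`, the fixed part of the linear predictor `eta_fixed` and the standard deviation `sigma` are hypothetical values.

```python
import numpy as np


def cluster_likelihood(b, y, eta_fixed):
    """Conditional likelihood of one cluster's binary responses,
    evaluated at each candidate random intercept in the array b."""
    eta = eta_fixed[None, :] + np.asarray(b)[:, None]  # shape (n_b, n_obs)
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.prod(np.where(y[None, :] == 1, p, 1.0 - p), axis=1)


def marginal_gauss_hermite(y, eta_fixed, sigma, n_nodes=40):
    """Integrate the intercept b ~ N(0, sigma^2) out by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes  # change of variables for the N(0, sigma^2) density
    return np.sum(weights * cluster_likelihood(b, y, eta_fixed)) / np.sqrt(np.pi)


def marginal_monte_carlo(y, eta_fixed, sigma, n_draws=200_000, seed=0):
    """Integrate the intercept out by averaging over draws from N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma, size=n_draws)
    return cluster_likelihood(b, y, eta_fixed).mean()


# Hypothetical cluster with five binary observations.
y = np.array([1, 0, 1, 1, 0])
eta_fixed = np.array([0.2, -0.5, 1.0, 0.3, -1.2])
gh = marginal_gauss_hermite(y, eta_fixed, sigma=1.0)
mc = marginal_monte_carlo(y, eta_fixed, sigma=1.0)
print(gh, mc)
```

With a q-dimensional random effect, a tensor-product quadrature rule would need `n_nodes ** q` evaluations per cluster, which illustrates why the integral becomes impractical as the number of random effects grows.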

For mixed effects models, it is desirable to accommodate uncertainty about which predictors to include in the model. Bayesian methods can accommodate such uncertainty and avoid cumbersome integration by using MCMC algorithms. In addition, the variable selection results are easy to interpret, for example through the posterior probabilities of inclusion of individual fixed and random effects and the posterior probabilities of models. Kuo and Mallick (1998) and George and McCulloch (1993, 1997) used Bayesian variable selection for the general linear model. Chen and Dunson (2003) used a Cholesky decomposition of the random effects covariance matrix to select random effects in linear mixed models. Kinney and Dunson (2007) extended the approach to the logistic mixed model. Bondell et al. (2010) proposed a penalized joint likelihood with an adaptive penalty for joint selection of fixed and random effects. Ibrahim et al. (2010) used maximum penalized likelihood estimation for fixed and random effects selection. However, none of these approaches allows a flexible specification of the random effects distribution. Under a nonparametric specification with uncentered random effects, the mean of the random effects is generally not zero, which causes an identifiability problem with the fixed effects and ultimately incurs bias. Cai and Dunson (2010) proposed a nonparametric random effects model without addressing the potential bias, although the approaches of Yang and Dunson (2010), Yang et al. (2010) and Li et al. (2011) could be used to reduce it. Even so, the results of variable selection are difficult to interpret, in particular when a fixed effect is selected but the corresponding random effect is not. To address this, Yang (2010) used centered Dirichlet process mixture models for the random effects. To the best of the author's knowledge, no method has been proposed for GLMMs that simultaneously addresses joint selection of fixed and random effects, flexible prior specification and bias control.
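The identifiability problem with uncentered random effects can be seen in a small numerical sketch. It assumes, for illustration only, that the random effects design equals the fixed effects design (Z = X); all numerical values, including the nonzero mean `mu`, are hypothetical.

```python
import numpy as np


def linear_predictor(X, Z, beta, zeta):
    """Linear predictor X beta + Z zeta for one subject."""
    return X @ beta + Z @ zeta


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))   # five observations, three predictors
Z = X                         # assumption: same design for fixed and random effects
beta = np.array([0.5, -1.0, 2.0])
mu = np.array([1.0, 0.0, -0.5])    # hypothetical nonzero mean of the random effects
zeta = mu + rng.normal(size=3)     # an uncentered random effect

eta_uncentered = linear_predictor(X, Z, beta, zeta)
eta_shifted = linear_predictor(X, Z, beta + mu, zeta - mu)

# Both parameterizations give the same linear predictor, so the data cannot
# separate the fixed effects beta from the mean of the random effects.
print(np.allclose(eta_uncentered, eta_shifted))  # True
```

Because the two parameterizations are indistinguishable from the likelihood alone, a constraint such as centering the random effects distribution at zero is needed before the fixed effects can be interpreted.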

In this article, we address variable selection for the logistic mixed model with nonparametric random effects. The article is organized as follows. Section 2 describes the logistic mixed model. Section 3 describes the approach to joint selection of fixed and random effects and the posterior inference. Sections 4 and 5 present a simulation study and a real data example, respectively. A final discussion concludes the article.

Section snippets

General description

Suppose there are $n$ subjects in a study and subject $i$ has $n_i$ repeated observations, for $i = 1, \ldots, n$. Let $X_{ij}$ denote the $l \times 1$ predictor vector for subject $i$ at observation $j$, let $y_{ij}$ be the corresponding binary response, and let $Z_{ij}$ be a $q \times 1$ predictor vector. The logistic mixed model is then $y_{ij} \sim \mathrm{Bernoulli}(h^{-1}(\chi_{ij}))$, $\chi_{ij} = X_{ij}'\beta + Z_{ij}'\zeta_i$, where $\beta = (\beta_1, \ldots, \beta_l)'$ is the fixed effect coefficient vector, $\zeta_i \sim N(0, \Omega)$ is the $i$th random effect, $h(\cdot)$ is the logistic link
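As an illustration of the model above, the following sketch evaluates the conditional log-likelihood of one subject's binary responses given the fixed effects and that subject's random effect. The function name and the array layout (one row per observation) are assumptions made for this example.

```python
import numpy as np


def subject_log_likelihood(y, X, Z, beta, zeta):
    """Bernoulli log-likelihood for one subject under a logistic link.

    y: (n_i,) binary responses; X: (n_i, l); Z: (n_i, q);
    beta: (l,) fixed effects; zeta: (q,) this subject's random effect.
    """
    chi = X @ beta + Z @ zeta
    # log p = -log(1 + exp(-chi)) and log(1 - p) = -log(1 + exp(chi));
    # logaddexp keeps both terms numerically stable for large |chi|.
    return np.sum(-y * np.logaddexp(0.0, -chi) - (1 - y) * np.logaddexp(0.0, chi))


# With all coefficients at zero, every response has probability 1/2.
y = np.array([1, 0, 1])
X = np.ones((3, 2))
Z = np.ones((3, 1))
ll = subject_log_likelihood(y, X, Z, beta=np.zeros(2), zeta=np.zeros(1))
print(ll)
```

With every coefficient set to zero the result equals three times log(1/2), which gives a quick sanity check of the implementation.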

Posterior inference

We briefly outline the Gibbs sampler for posterior sampling; details are provided in the Appendix. After initial values are assigned to the parameters, the posterior sampling proceeds as follows:

  1. Given the data and the current values of g, ϕ, Λ, Γ, ξ, sample βJ from the full conditional posterior distribution.

  2. Update Jk, for k = 1, …, l, from the full conditional posterior distributions given the data and the current values of ϕ, Λ, Γ, ξ, g.

  3. Sample g from the posterior Gamma distribution given the data and
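The update of an inclusion indicator in a stochastic search Gibbs sampler is a Bernoulli draw whose probability compares the model with and without the predictor. The sketch below shows only these generic mechanics. The two unnormalized log weights are placeholders supplied by the caller; the actual full conditional distributions are given in the Appendix of the article and are not reproduced here.

```python
import numpy as np


def sample_inclusion(log_w_in, log_w_out, rng):
    """Draw an indicator in {0, 1} with P(indicator = 1) = w_in / (w_in + w_out).

    log_w_in and log_w_out are unnormalized log posterior weights of the
    model with and without the predictor (placeholders in this sketch).
    Working on the log scale avoids underflow when both weights are tiny.
    """
    log_p_in = log_w_in - np.logaddexp(log_w_in, log_w_out)
    p_in = float(np.exp(log_p_in))
    return int(rng.random() < p_in), p_in


rng = np.random.default_rng(0)
# Hypothetical weights: both are far below floating point range on the
# natural scale, yet the inclusion probability is still well defined.
indicator, p_in = sample_inclusion(-1000.0, -1001.0, rng)
print(indicator, p_in)
```

Averaging the sampled indicators over the retained iterations gives an estimate of the posterior inclusion probability of the corresponding effect.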

Simulation

To evaluate the performance of the proposed algorithm, we conduct the following simulation. We generate data for 200 subjects, each with 20 observations. There are four covariates in $X_{ij}$, that is, $X_{ij} = (x_{ij1}, x_{ij2}, x_{ij3}, x_{ij4})'$. The first element $x_{ij1}$ is fixed at one and the other three elements are generated from the uniform distribution $U(-2, 2)$. We set the design matrix $Z_{ij} = X_{ij}$ and $\beta = (0, 1, 1, 1)'$. The random effect $\zeta_i$ is generated from a mixture of three multivariate normal distributions $\zeta_i$
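A data set of this shape can be generated along the following lines. This is a sketch under stated assumptions: the covariate range is read as U(−2, 2), and because the mixture specification is cut off in the text above, the mixture weights, component means and covariance used here are hypothetical stand-ins rather than the article's values.

```python
import numpy as np


def simulate(n_subjects=200, n_obs=20, seed=0):
    """Simulate clustered binary data from a logistic mixed model with Z = X."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.0, 1.0, 1.0, 1.0])

    # HYPOTHETICAL three-component normal mixture for the random effects
    # (the article's actual mixture parameters are not available here).
    weights = np.array([0.3, 0.4, 0.3])
    means = np.array([[-1.5] * 4, [0.0] * 4, [1.5] * 4])
    cov = 0.25 * np.eye(4)

    # Intercept column fixed at one; other covariates assumed U(-2, 2).
    X = np.empty((n_subjects, n_obs, 4))
    X[..., 0] = 1.0
    X[..., 1:] = rng.uniform(-2.0, 2.0, size=(n_subjects, n_obs, 3))

    # One random effect vector per subject, drawn from the mixture.
    component = rng.choice(3, size=n_subjects, p=weights)
    zeta = means[component] + rng.multivariate_normal(np.zeros(4), cov, size=n_subjects)

    # Linear predictor X beta + Z zeta_i with Z = X, then Bernoulli responses.
    chi = X @ beta + np.einsum("ijk,ik->ij", X, zeta)
    p = 1.0 / (1.0 + np.exp(-chi))
    y = rng.binomial(1, p)
    return X, zeta, y


X, zeta, y = simulate()
print(X.shape, zeta.shape, y.shape)  # (200, 20, 4) (200, 4) (200, 20)
```

The hypothetical mixture is chosen to be multimodal, so that a normal random effects assumption would be clearly violated in the generated data.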

Application

We take a subset of the ICPSR (Inter-University Consortium for Political and Social Research) data set collected for the World Values Survey (WVS 1981–2004), available at http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/04531. The WVS was designed to enable cross-national, cross-cultural comparison of values and norms on a wide variety of topics and to monitor changes in values and attitudes across the globe. The survey was conducted by researchers in over 80 societies,

Discussion

Mixed effects models have received considerable attention in the literature for their flexibility in characterizing heterogeneity across clusters. In this article, we propose a logistic mixed model for variable selection with nonparametric random effects. Compared with previous work such as Chen and Dunson (2003), Kinney and Dunson (2007), Cai and Dunson (2010) and Bondell et al. (2010), our approach has several advantages: easy implementation, an efficient algorithm and bias reduction due to

Acknowledgments

The author thanks his colleagues and the three anonymous reviewers for their comments and critical reading of the manuscript.

References (53)

  • M. Smith et al., Nonparametric regression using Bayesian variable selection, Journal of Econometrics (1996)
  • M. Yang et al., Semiparametric Bayes hierarchical models with mean and variance constraints, Computational Statistics and Data Analysis (2010)
  • J. Albert et al., Bayesian tests and model diagnostics in conditionally independent hierarchical models, Journal of the American Statistical Association (1997)
  • D. Blackwell et al., Ferguson distributions via Polya urn schemes, Annals of Statistics (1973)
  • H.D. Bondell et al., Joint variable selection for fixed and random effects in linear mixed effects models, Biometrics (2010)
  • J.G. Booth et al., Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm, Journal of the Royal Statistical Society, Series B (1999)
  • N.E. Breslow et al., Approximate inference in generalized linear mixed models, Journal of the American Statistical Association (1993)
  • N.E. Breslow et al., Bias correction in generalized linear mixed models with a single component of dispersion, Biometrika (1995)
  • C.A. Bush et al., A semiparametric Bayesian model for randomised block designs, Biometrika (1996)
  • Cai, B., Dunson, D.B., 2010, Variable selection in nonparametric random effects models, Technical...
  • Z. Chen et al., Random effects selection in linear mixed models, Biometrics (2003)
  • J. Chen et al., A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution, Biostatistics (2002)
  • Y. Chung et al., Nonparametric Bayes conditional distribution modeling with variable selection, Journal of the American Statistical Association (2009)
  • H. Doss et al., Monte Carlo methods for Bayesian analysis of survival data using mixtures of Dirichlet priors, Journal of Computational and Graphical Statistics (2003)
  • M. Drum et al., REML estimation with exact covariance in the logistic mixed model, Biometrics (1993)
  • M.D. Escobar, Estimating normal means with a Dirichlet process prior, Journal of the American Statistical Association (1994)
  • M.D. Escobar et al., Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association (1995)
  • J. Fabius, Asymptotic behavior of Bayes estimates, Annals of Mathematical Statistics (1964)
  • D.A. Freedman, On the asymptotic behavior of Bayes estimates in the discrete case, Annals of Mathematical Statistics (1963)
  • T.S. Ferguson, A Bayesian analysis of some nonparametric problems, Annals of Statistics (1973)
  • T.S. Ferguson, Prior distributions on spaces of probability measures, Annals of Statistics (1974)
  • A.R. Gallant et al., Nonlinear Models for Repeated Measurement Data (1987)
  • A. Gelman, Prior distributions for variance parameters in hierarchical models, Bayesian Analysis (2005)
  • E.I. George et al., Variable selection via Gibbs sampling, Journal of the American Statistical Association (1993)
  • E.I. George et al., Approaches for Bayesian variable selection, Statistica Sinica (1997)
  • W. Ghidey et al., Smooth random effects distribution in a linear mixed model, Biometrics (2004)