Abstract
The goal of this paper is to develop a double penalized hierarchical likelihood (DPHL) with a modified Cholesky decomposition for simultaneously selecting fixed and random effects in mixed effects models. DPHL avoids the use of the data likelihood, which usually involves a high-dimensional integral, to define an objective function for variable selection. The resulting DPHL-based estimator enjoys the oracle properties with no requirement on the convexity of the loss function. Moreover, a two-stage algorithm is proposed to implement this approach effectively. An H-likelihood-based Bayesian information criterion (BIC) is developed for tuning parameter selection. Simulated data and a real data set are examined to illustrate the efficiency of the proposed method.
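As a rough illustration of the ingredients described above, the sketch below builds adaptive-LASSO-type weights, the modified Cholesky covariance that the random-effects penalty acts on, and a toy double penalty. All function names, the weight exponent, and the penalty form are illustrative assumptions, not the paper's implementation; the decomposition is the Chen–Dunson-type form \(\Sigma = \Lambda\Gamma\Gamma^{\tau}\Lambda\), under which setting a scale \(d_k = 0\) removes the \(k\)-th random effect.

```python
import numpy as np

def adaptive_weights(est_init, upsilon=1.0, eps=1e-8):
    """Adaptive-LASSO-type weights w_j = 1/|initial estimate_j|^upsilon
    (illustrative; weights of this flavor appear in the proof of Theorem 3)."""
    return 1.0 / (np.abs(est_init) + eps) ** upsilon

def modified_cholesky_cov(d, gamma):
    """Random-effects covariance Sigma = Lambda * Gamma * Gamma' * Lambda,
    with Lambda = diag(d) and Gamma unit lower triangular holding the
    q(q-1)/2 free entries gamma_{kt}, t < k (Chen-Dunson-type form)."""
    q = len(d)
    G = np.eye(q)
    idx = 0
    for k in range(1, q):
        for t in range(k):
            G[k, t] = gamma[idx]
            idx += 1
    Lam = np.diag(d)
    return Lam @ G @ G.T @ Lam

def double_penalty(beta, d, lam1, lam2, w_beta, w_d):
    """Toy DPHL penalty term: weighted L1 penalties on the fixed effects beta
    and on the Cholesky scales d (so shrinking d_k to 0 drops random effect k)."""
    return lam1 * np.sum(w_beta * np.abs(beta)) + lam2 * np.sum(w_d * np.abs(d))

# Setting d[1] = 0 excludes the second random effect:
# the corresponding row and column of Sigma vanish.
Sigma = modified_cholesky_cov(np.array([1.0, 0.0, 0.5]),
                              np.array([0.3, 0.2, -0.4]))
```

The zero pattern of `Sigma` shows why penalizing the scales \(d_k\) (rather than the whole covariance matrix) suffices for random-effects selection.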
References
Ahn M, Zhang HH, Lu W (2012) Moment-based method for random effects selection in linear mixed models. Stat Sin 22:1539–1562
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Second international symposium on information theory. Akademiai Kiado, Budapest, pp 267–281
Andrews DWK (1992) Generic uniform convergence. Econom Theory 8:241–257
Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MH, White JS (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol Evol 24:127–135
Bondell HD, Krishna A, Ghosh SK (2010) Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics 66:1069–1077
Chen Z, Dunson DB (2003) Random effects selection in linear mixed models. Biometrics 59:762–769
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499
Fan J, Li R (2004) New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J Am Stat Assoc 99:710–723
Foster SD, Verbyla AP, Pitchford WS (2009) Estimation, prediction and inference for the LASSO random effect models. Aust N Z J Stat 51:43–61
Huang J, Wu C, Zhou L (2002) Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89:111–128
Ibrahim JG, Zhu H, Garcia RI, Guo R (2011) Fixed and random effects selection in mixed effects models. Biometrics 67:495–503
Jiang J (2007) Linear and generalized linear mixed models and their applications. Springer, New York
Jiang J, Jia H, Chen H (2001) Maximum posterior estimation of random effects in generalized linear mixed models. Stat Sin 11:97–120
Jiang J, Rao J, Gu Z, Nguyen T (2008) Fence methods for mixed model selection. Ann Stat 36:1669–1692
Kaslow RA, Ostrow DG, Detels R, Phair JP, Polk BF, Rinaldo CR (1987) The multicenter AIDS cohort study: rationale, organization and selected characteristics of the participants. Am J Epidemiol 126:310–318
Lan L (2006) Variable selection in linear mixed model for longitudinal data. PhD Thesis, North Carolina State University
Lange N, Laird NM (1989) The effect of covariance structures on variance estimation in balanced growth-curve models with random parameters. J Am Stat Assoc 84:241–247
Lee Y, Nelder JA (1996) Hierarchical generalized linear models (with discussion). J R Stat Soc B 58:619–678
Lee Y, Nelder JA, Pawitan Y (2006) Generalized linear models with random effects: unified analysis via H-likelihood. Chapman and Hall, London
Li Y, Wang S, Song PX, Wang N, Zhu J (2011) Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Manuscript
Liang H, Wu H, Zou G (2008) A note on conditional AIC for linear mixed-effects models. Biometrika 95:773–778
Meng X (2009) Decoding the H-likelihood. Stat Sci 24:280–293
Ni X, Zhang D, Zhang HH (2010) Variable selection for semiparametric mixed models in longitudinal studies. Biometrics 66:79–88
Peng H, Lu Y (2012) Model selection in linear mixed effect models. J Multivar Anal 109:109–129
Pu W, Niu X (2006) Selecting mixed-effects models based on a generalized information criterion. J Multivar Anal 97:733–758
Rao CR, Wu Y (1989) A strongly consistent procedure for model selection in a regression problem. Biometrika 76:369–374
Schelldorfer J, Buhlmann P, van de Geer S (2011) Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand J Stat 38:197–214
Schelldorfer J, Buhlmann P (2011) GLMMLasso: an algorithm for high-dimensional generalized linear mixed models using ℓ1-penalization. Preprint, ETH Zurich http://stat.ethz.ch/people/schell
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Song PX (2007) Correlated data analysis: modeling, analytics, and applications. Springer, New York
Tierney L, Kadane JB (1986) Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 81:82–86
Vaida F, Blanchard S (2005) Conditional Akaike information for mixed-effects models. Biometrika 92:351–370
Wang H, Leng C (2007) Unified Lasso estimation via least square approximation. J Am Stat Assoc 102:1039–1048
Wang H, Li R, Tsai CL (2007) Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94:553–568
Wang S, Song PX, Zhu J (2010) Doubly regularized REML for estimation and selection of fixed and random effects in linear mixed-effects models. The University of Michigan Department of Biostatistics Working Paper Series. Working Paper 89. http://biostats.bepress.com/umichbiostat/paper89
White H (1994) Estimation, inference, and specification analysis. Cambridge University Press, New York
Yang H (2007) Variable selection procedures for generalized linear mixed models in longitudinal data analysis. PhD Thesis, North Carolina State University
Zipunnikov VV, Booth JG (2006) Monte Carlo EM for generalized linear mixed models using randomized spherical radial integration. Manuscript
Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
Acknowledgements
Xu’s research was supported by the Scientific Research Foundation of Southeast University; Zhu’s research was supported by a grant from the Research Grants Council of Hong Kong. The authors thank the Editor, the Associate Editor and the referees for their constructive suggestions and comments, which led to an improvement of the earlier manuscript. Special thanks go to a referee who pointed out a mistake in the original proof of Theorem 4, giving us the opportunity to correct it.
Appendix
Proof of Theorem 1
First, note that the marginal likelihood is approximated by function (2.5) through Laplace’s method. Hence, we know that
according to the results of Tierney and Kadane [31]. On the other hand, under conditions (C1)–(C5), it follows from White [36] that \(A_{11}(\varpi^{*})\) is positive definite,
and
Therefore, by Slutsky’s theorem, we have
and
where \(\tilde{\varpi}_{a} = (\tilde{\beta}_{a}^{\tau}, \tilde{d}_{a}^{\tau}, \tilde{\gamma}_{a}^{\tau})^{\tau}\) is the unpenalized maximum adjusted profile likelihood estimator of \(\varpi^{*}_{a}\).
Then we prove the estimation consistency of \(\widehat{\varpi}\). It is sufficient to show that for any given \(\epsilon > 0\), there exists a large constant \(C_{\epsilon}\) such that
where \(u = (u^{\tau}_{1}, u^{\tau}_{2}, u^{\tau}_{3})^{\tau}\), in which \(u_{1} = (u_{11}, \ldots, u_{1p})^{\tau}\) is a \(p\)-dimensional vector, \(u_{2} = (u_{21}, \ldots, u_{2q})^{\tau}\) is a \(q\)-dimensional vector, and \(u_{3}\) is a \((q(q-1)/2)\)-dimensional vector. This implies that there exists a local maximizer in the ball \(\{\varpi^{*} + n^{-1/2}u : \|u\|_{2} \leq C_{\epsilon}\}\) and thus \(\|\widehat{\varpi} - \varpi^{*}\|_{2} = O_{p}(n^{-1/2})\).
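The display referred to above did not survive extraction; it is presumably the standard local-maximizer inequality of the Fan–Li type, which in the paper's notation would read (a hedged reconstruction, not the original equation):

```latex
P\Bigl\{ \sup_{\|u\|_{2} = C_{\epsilon}}
  \mathit{DPHL}\bigl(\varpi^{*} + n^{-1/2}u\bigr)
  < \mathit{DPHL}\bigl(\varpi^{*}\bigr) \Bigr\} \;\geq\; 1 - \epsilon .
```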
Consider
By the concavity and monotonicity of the penalty functions, we know that
For \(S_{1}(\cdot)\), using a Taylor expansion around \(\varpi^{*}\), we have
Then, by the Cauchy–Schwarz inequality together with condition (C3) and the fact (7.1), we have
where \(\|u\|_{1}\) is the \(L_{1}\)-norm of the \((p+q(q+1)/2)\)-dimensional vector \(u\), and it can easily be checked that \(\|u\|_{1} \leq \sqrt{p+q(q+1)/2}\,\|u\|_{2}\). For \(S_{2}(u)\), an application of a Taylor expansion around the zero vector yields
Consequently, if \(a_{n} = O_{p}(n^{-1/2})\) and \(b_{n} = o_{p}(1)\), we know that
Similarly, for \(S_{3}(u)\), we have
if \(c_{n} = O_{p}(n^{-1/2})\) and \(d_{n} = o_{p}(1)\). We can see that, by choosing a sufficiently large \(C_{\epsilon}\), \(S_{12}(u)\) dominates the other terms uniformly in \(\|u\|_{2} = C_{\epsilon}\), which completes the proof. □
Proof of Theorem 2
First, we prove the sparsity. Note that if \(d_{k} = 0\), we have \(\gamma_{kt} = 0\) for all \(t = 1, \ldots, k-1\). Therefore, it is sufficient to show \(P(\widehat{\beta}_{j} = 0)\rightarrow1\) for all \(j = p_{1}+1, \ldots, p\) and \(P(\widehat{d}_{k} = 0)\rightarrow1\) for all \(k = q_{1}+1, \ldots, q\). Without loss of generality, we show in detail that \(P(\widehat{\beta}_{p} = 0) \rightarrow1\). Then, the same argument can be used to show that \(P(\widehat{\beta}_{j} = 0) \rightarrow1\) for \(j = p_{1}+1, \ldots, p-1\). Similarly, we can show that \(P(\widehat{d}_{k}=0) \rightarrow1\) for \(k = q_{1}+1, \ldots, q\). Applying a Taylor expansion around \(\beta^{*}\) yields
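The implication \(d_{k} = 0 \Rightarrow \gamma_{kt} = 0\) can be read off the modified Cholesky decomposition of the random-effects covariance (a Chen–Dunson-type parameterization; the exact form used in Sect. 2 is assumed here):

```latex
\Sigma \;=\; \Lambda \Gamma \Gamma^{\tau} \Lambda,
\qquad
\Lambda = \operatorname{diag}(d_{1},\ldots,d_{q}),
\qquad
\Gamma = (\gamma_{kt}) \ \text{unit lower triangular},
```

so \(d_{k} = 0\) zeroes the \(k\)-th row and column of \(\Sigma\), and the corresponding \(\gamma_{kt}\) drop out of the model.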
where \(\varpi_{0}\) lies between \(\varpi^{*}\) and \(\widehat{\varpi}\). From White [36], we know that \(\frac{\partial AP_{\alpha}(\varpi^{*})}{\partial\beta_{p}} = O_{p}(n^{1/2})\), and from Theorem 1 we have \(\|\widehat{\beta}-\beta^{*}\|_{2} = O_{p}(n^{-1/2})\). Hence, under conditions (C3) and (C5), we know
which implies that \(n\varphi'_{\lambda_{1n}}(|\widehat{\beta}_{p}|)\operatorname{sgn}(\widehat {\beta}_{p})\) dominates the first three terms with probability tending to one if \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{p}|) \rightarrow \infty\). Hence, if \(\widehat{\beta}_{p} \neq 0\), the first-order condition \(\partial \mathit{DPHL}(\widehat{\varpi}) / \partial\beta_{p}=0\) cannot hold once the sample size is sufficiently large. Consequently, \(\widehat{\beta}_{p}\) must be exactly 0 with probability tending to one.
Next, we prove the asymptotic normality. By Theorem 1 and part (1) of Theorem 2, there exists a root-\(n\) consistent estimator \(\widehat{\varpi} = (\widehat{\varpi}^{\tau}_{a}, \mathbf{0}^{\tau})^{\tau}\) that satisfies the equation \(\partial \mathit{DPHL}(\widehat{\varpi}) / \partial\varpi_{a} = 0\). Then, an application of a Taylor expansion around \(\varpi^{*}_{a}\) yields
where \(F_{1}(\varpi^{*}_{a}) = ( \varphi'_{\lambda_{1n}}(|\beta^{*}_{j}|)\operatorname{sgn}(\beta^{*}_{j}), j=1,\ldots, p_{1}, \psi'_{\lambda_{2n}}(|d^{*}_{k}|)\operatorname{sgn}(d^{*}_{k}), k=1,\ldots, q_{1}, \mathbf{0}^{\tau})^{\tau}\), and \(F_{2}(\varpi^{*}_{a})\) is the corresponding second-derivative matrix of the penalty function vector \((\varphi_{\lambda_{1n}}(|\beta_{j}|), j=1,\ldots,p_{1}, \psi_{\lambda_{2n}}(|d_{k}|), k=1,\ldots, q_{1}, \mathbf{0}^{\tau})^{\tau}\) evaluated at \(\varpi^{*}_{a}\). Consequently, under the conditions of Theorem 2, we have
Then, by Slutsky’s theorem,
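The displays above were lost in extraction; given the Taylor expansion and the definitions of \(A_{11}\), \(F_{1}\) and \(F_{2}\), the limiting statement is presumably of the usual sandwich form, with \(B_{11}(\varpi^{*}_{a})\) denoting the variance matrix of the corresponding score (notation assumed):

```latex
\sqrt{n}\,\bigl\{A_{11}(\varpi^{*}_{a}) + F_{2}(\varpi^{*}_{a})\bigr\}
\Bigl[\widehat{\varpi}_{a} - \varpi^{*}_{a}
  + \bigl\{A_{11}(\varpi^{*}_{a}) + F_{2}(\varpi^{*}_{a})\bigr\}^{-1}
    F_{1}(\varpi^{*}_{a})\Bigr]
\;\stackrel{d}{\longrightarrow}\;
N\bigl(0,\, B_{11}(\varpi^{*}_{a})\bigr).
```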
□
Proof of Theorem 3
To obtain the conclusion, by Theorems 1 and 2, we only need to verify that \(a_{n}, c_{n} = o_{p}(n^{-1/2})\), \(b_{n}, d_{n} = o_{p}(1)\), \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{j}|) \rightarrow \infty\) and \(n^{1/2} \psi'_{\lambda_{2n}}(|\widehat{d}_{k}|) \rightarrow\infty\).
For any \(j = 1, \ldots, p_{1}\), under conditions (C1)–(C5), we know that \(w_{\beta_{j}} = O_{p}(1)\). Consequently,
which implies that \(a_{n} = o_{p}(n^{-1/2})\) and \(b_{n} = o_{p}(1)\). Similarly, we can easily find that \(c_{n} = o_{p}(n^{-1/2})\) and \(d_{n} = o_{p}(1)\).
For any \(j = p_{1}+1, \ldots, p\), on the other hand, under conditions (C1)–(C5), we know that \(w_{\beta_{j}} = O_{p}(n^{\upsilon_{1}/2})\). Then,
which implies that \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{j}|) \rightarrow\infty\). Similarly, we know that \(n^{1/2} \psi'_{\lambda_{2n}}(|\widehat{d}_{k}|) \rightarrow\infty\) with probability tending to one, for all \(k = q_{1}+1, \ldots, q\). This completes the proof. □
Proof of Theorem 4
We first show that for \(\varpi_{n} \stackrel{P}{\longrightarrow} \varpi\),
First, according to the results in Tierney and Kadane [31] and conditions (C2) and (C3), we have
Then, following the proof of Theorem 2a in [11], it is sufficient to show that
Note that conditions (C1)–(C5) imply \([l(\varpi)-E\{l(\varpi)\}]/n \stackrel{P}{\longrightarrow} 0\) for all ϖ∈Θ. Further, since conditions (C2), (C3) and (C5) satisfy the W-LIP assumption of Lemma 2 of Andrews [3], we have the uniform continuity and stochastic continuity of E{l(ϖ)} and [l(ϖ)−E{l(ϖ)}]/n, respectively. Consequently, according to Theorem 3 of Andrews [3], (7.4) holds by the stochastic continuity and pointwise convergence properties. Therefore, together with (7.3), this implies (7.2).
Then, considering an arbitrary candidate model \(\mathcal{S}\), under condition (C6), it follows from White [36] that the unpenalized estimator \(\tilde{\varpi }_{\mathcal{S}}\) converges in probability to \(\varpi^{*}_{\mathcal{S}}\). Similarly, one can verify that \(\varpi^{*}_{\mathcal{S}} = \varpi^{*}\) for any overfitted model \(\mathcal{S} \supset \mathcal{S}_{T}\), and \(\varpi^{*}_{\mathcal{S}} \neq\varpi^{*}\) for any underfitted model \(\mathcal{S} \nsupseteq\mathcal{S}_{T}\), since the latter shrinks some nonzero parameters to zero. Consequently, we prove the conclusion of the theorem in two cases, corresponding to underfitted and overfitted models, respectively.
Case 1 (Underfitted Model). By the fact that \(\varpi^{*}_{\mathcal{S}} \neq\varpi^{*}_{\mathcal{S}_{T}}\) for any \(\mathcal{S} \nsupseteq\mathcal {S}_{T}\) and \(\varpi^{*}_{\mathcal{S}_{T}} = \varpi^{*}\), we have
where the first inequality follows because \(AP_{\alpha}(\tilde{\varpi }_{\mathcal{S}_{\lambda_{n}}}) \geq AP_{\alpha}(\widehat{\varpi}_{\lambda _{n}})\) for all λ n and the second equality follows from (7.2). Therefore, under condition (C7), we have \(P(\inf_{\lambda_{n} \in R_{-}} \mathit{BIC}(\lambda_{n}) > \mathit{BIC}(\lambda)) \longrightarrow1\).
Case 2 (Overfitted Model). For any \(\lambda_{n} \in R_{+}\), we have \(df_{\lambda_{n}} - (p_{1} + q_{1}(q_{1}+1)/2) \geq 1\). Then
By the fact that \(AP_{\alpha}(\varpi) = l(\varpi)(1+O_{p}(n^{-1}))\) and \(\varpi^{*}_{\mathcal{S}} = \varpi^{*}\) for any \(\mathcal{S} \supset \mathcal{S}_{T}\), an application of a Taylor expansion yields
Therefore, under conditions (C1)–(C5), it follows from White [36] that \(AP_{\alpha}(\widehat{\varpi}_{\lambda}) - AP_{\alpha}(\tilde{\varpi}_{\mathcal{S}}) = O_{p}(1)\), which implies that
As a result, we have \(P(\inf_{\lambda_{n} \in R_{+}} \mathit{BIC}(\lambda_{n}) > \mathit{BIC}(\lambda)) \longrightarrow1\), which completes the proof. □
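To make the tuning criterion concrete, here is a minimal sketch of BIC-based selection over a grid of \(\lambda\) values, with a hard-thresholded OLS fit standing in for the penalized H-likelihood maximizer. The fit, the grid, the toy data, and the degrees-of-freedom count are all illustrative assumptions; the paper's criterion uses the adjusted profile h-likelihood where the sketch uses a residual sum of squares.

```python
import numpy as np

def hard_threshold(z, lam):
    """Keep entries with |z_j| > lam, zero out the rest."""
    return np.where(np.abs(z) > lam, z, 0.0)

def bic_select(X, y, lambdas):
    """Choose lambda minimizing BIC(lambda) = n*log(RSS/n) + df*log(n),
    where df counts nonzero coefficients (a stand-in for the paper's
    effective degrees of freedom df_lambda)."""
    n = len(y)
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    best = None
    for lam in lambdas:
        beta = hard_threshold(beta_ols, lam)   # toy penalized fit
        rss = float(np.sum((y - X @ beta) ** 2))
        df = int(np.count_nonzero(beta))
        bic = n * np.log(rss / n) + df * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best

# Toy data: two active fixed effects among five candidates.
rng = np.random.default_rng(0)
n = 200
beta_true = np.array([2.0, 0.0, 0.0, 1.5, 0.0])
X = rng.standard_normal((n, 5))
y = X @ beta_true + rng.standard_normal(n)
bic, lam, beta = bic_select(X, y, [0.0, 0.05, 0.1, 0.2, 0.3])
# The two large effects survive thresholding; spurious ones are typically zeroed.
```

The `df * log(n)` term plays the role of the dimension penalty in the proof above: any overfitted model pays at least one extra `log(n)`, which dominates the \(O_{p}(1)\) gain in fit.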
Xu, P., Wang, T., Zhu, H. et al. Double Penalized H-Likelihood for Selection of Fixed and Random Effects in Mixed Effects Models. Stat Biosci 7, 108–128 (2015). https://doi.org/10.1007/s12561-013-9105-x