
Double Penalized H-Likelihood for Selection of Fixed and Random Effects in Mixed Effects Models


Abstract

The goal of this paper is to develop a double penalized hierarchical likelihood (DPHL) with a modified Cholesky decomposition for simultaneously selecting fixed and random effects in mixed effects models. DPHL avoids using the data likelihood, which typically involves a high-dimensional integral, to define the objective function for variable selection. The resulting DPHL-based estimator enjoys the oracle properties without requiring convexity of the loss function. Moreover, a two-stage algorithm is proposed to implement the approach effectively. An H-likelihood-based Bayesian information criterion (BIC) is developed for tuning parameter selection. Simulated data and a real data set are examined to illustrate the efficiency of the proposed method.
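
To make the role of the modified Cholesky decomposition concrete, the following minimal sketch (our illustration, not the authors' code) builds a random-effects covariance matrix as \(\Psi = D \Gamma \Gamma^{\tau} D\), where \(D = \operatorname{diag}(d_1, \ldots, d_q)\) and \(\Gamma\) is lower triangular with unit diagonal and free entries \(\gamma_{kt}\); the function name and NumPy layout are assumptions. The property exploited in the proof of Theorem 2 is visible directly: setting \(d_k = 0\) zeroes the \(k\)-th row and column of \(\Psi\) and thereby removes the \(k\)-th random effect.

    import numpy as np

    def random_effects_cov(d, gamma):
        """Psi = D Gamma Gamma^T D with D = diag(d) and Gamma lower
        triangular, unit diagonal, strictly-lower entries gamma."""
        q = len(d)
        Gamma = np.eye(q)
        Gamma[np.tril_indices(q, k=-1)] = gamma  # q(q-1)/2 free entries
        D = np.diag(d)
        return D @ Gamma @ Gamma.T @ D

    # q = 3 random effects; d_3 = 0 removes the third effect entirely,
    # which is the sparsity that the second DPHL penalty encourages.
    d = np.array([1.2, 0.8, 0.0])
    gamma = np.array([0.3, 0.0, 0.5])    # (gamma_21, gamma_31, gamma_32)
    print(random_effects_cov(d, gamma))  # third row and column are zero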


References

  1. Ahn M, Zhang HH, Lu W (2012) Moment-based method for random effects selection in linear mixed models. Stat Sin 22:1539–1562

  2. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Second international symposium on information theory. Akademiai Kiado, Budapest, pp 267–281

  3. Andrews DWK (1992) Generic uniform convergence. Econom Theory 8:241–257

  4. Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MH, White JS (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol Evol 24:127–135

  5. Bondell HD, Krishna A, Ghosh SK (2010) Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics 66:1069–1077

  6. Chen Z, Dunson DB (2003) Random effects selection in linear mixed models. Biometrics 59:762–769

  7. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499

  8. Fan J, Li R (2004) New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J Am Stat Assoc 99:710–723

  9. Foster SD, Verbyla AP, Pitchford WS (2009) Estimation, prediction and inference for the LASSO random effect models. Aust N Z J Stat 51:43–61

  10. Huang J, Wu C, Zhou L (2002) Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89:111–128

  11. Ibrahim JG, Zhu H, Garcia RI, Guo R (2011) Fixed and random effects selection in mixed effects models. Biometrics 67:495–503

  12. Jiang J (2007) Linear and generalized linear mixed models and their applications. Springer, New York

  13. Jiang J, Jia H, Chen H (2001) Maximum posterior estimation of random effects in generalized linear mixed models. Stat Sin 11:97–120

  14. Jiang J, Rao JS, Gu Z, Nguyen T (2008) Fence methods for mixed model selection. Ann Stat 36:1669–1692

  15. Kaslow RA, Ostrow DG, Detels R, Phair JP, Polk BF, Rinaldo CR (1987) The Multicenter AIDS Cohort Study: rationale, organization and selected characteristics of the participants. Am J Epidemiol 126:310–318

  16. Lan L (2006) Variable selection in linear mixed model for longitudinal data. PhD thesis, North Carolina State University

  17. Lange N, Laird NM (1989) The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters. J Am Stat Assoc 84:241–247

  18. Lee Y, Nelder JA (1996) Hierarchical generalized linear models (with discussion). J R Stat Soc B 58:619–678

  19. Lee Y, Nelder JA, Pawitan Y (2006) Generalized linear models with random effects: unified analysis via H-likelihood. Chapman and Hall, London

  20. Li Y, Wang S, Song PX, Wang N, Zhu J (2011) Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Manuscript

  21. Liang H, Wu H, Zou G (2008) A note on conditional AIC for linear mixed-effects models. Biometrika 95:773–778

  22. Meng X (2009) Decoding the H-likelihood. Stat Sci 24:280–293

  23. Ni X, Zhang D, Zhang HH (2010) Variable selection for semiparametric mixed models in longitudinal studies. Biometrics 66:79–88

  24. Peng H, Lu Y (2012) Model selection in linear mixed effect models. J Multivar Anal 109:109–129

  25. Pu W, Niu X (2006) Selecting mixed-effects models based on a generalized information criterion. J Multivar Anal 97:733–758

  26. Rao CR, Wu Y (1989) A strongly consistent procedure for model selection in a regression problem. Biometrika 76:369–374

  27. Schelldorfer J, Bühlmann P, van de Geer S (2011) Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand J Stat 38:197–214

  28. Schelldorfer J, Bühlmann P (2011) GLMMLasso: an algorithm for high-dimensional generalized linear mixed models using ℓ1-penalization. Preprint, ETH Zurich. http://stat.ethz.ch/people/schell

  29. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464

  30. Song PX (2007) Correlated data analysis: modeling, analytics, and applications. Springer, New York

  31. Tierney L, Kadane JB (1986) Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 81:82–86

  32. Vaida F, Blanchard S (2005) Conditional Akaike information for mixed-effects models. Biometrika 92:351–370

  33. Wang H, Leng C (2007) Unified LASSO estimation by least squares approximation. J Am Stat Assoc 102:1039–1048

  34. Wang H, Li R, Tsai CL (2007) Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94:553–568

  35. Wang S, Song PX, Zhu J (2010) Doubly regularized REML for estimation and selection of fixed and random effects in linear mixed-effects models. The University of Michigan Department of Biostatistics Working Paper Series, Working Paper 89. http://biostats.bepress.com/umichbiostat/paper89

  36. White H (1994) Estimation, inference, and specification analysis. Cambridge University Press, New York

  37. Yang H (2007) Variable selection procedures for generalized linear mixed models in longitudinal data analysis. PhD thesis, North Carolina State University

  38. Zipunnikov VV, Booth JG (2006) Monte Carlo EM for generalized linear mixed models using randomized spherical radial integration. Manuscript

  39. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429


Acknowledgements

Xu’s research was supported by the Scientific Research Foundation of Southeast University; Zhu’s research was supported by a grant from the Research Grants Council of Hong Kong. The authors thank the Editor, the Associate Editor, and the referees for their constructive suggestions and comments, which led to an improvement of an earlier version of the manuscript. Special thanks go to the referee who pointed out a mistake in the original proof of Theorem 4, giving us the chance to correct it.

Author information

Corresponding author

Correspondence to Lixing Zhu.

Appendix

Proof of Theorem 1

First, note that the marginal likelihood is approximated by function (2.5) through Laplace’s method. Hence, we know that

$$ AP_\alpha(\varpi) = l(\varpi) \bigl(1 + O_p \bigl(n^{-1} \bigr) \bigr) $$

according to the results of Tierney and Kadane [31]. On the other hand, under conditions (C1)–(C5), it follows from White [36] that \(A_{11}(\varpi^*)\) is positive definite,

$$ n^{-1/2} \sum^n_{i=1} \frac{\partial l_i(\varpi^*)}{\partial\varpi_a} \stackrel{L}{\longrightarrow} N \bigl(\mathbf{0}, B_{11} \bigl(\varpi^* \bigr) \bigr), $$

and

$$ n^{1/2} \bigl(\bar{\varpi}_a - \varpi^*_a \bigr) \stackrel{L}{\longrightarrow} N \bigl(\mathbf{0}, A_{11} \bigl( \varpi^* \bigr)^{-1} B_{11} \bigl(\varpi^* \bigr) A_{11} \bigl(\varpi^* \bigr)^{-1} \bigr). $$

Therefore, by Slutsky’s theorem, we have

$$ n^{-1/2} \frac{\partial AP_\alpha(\varpi^*)}{\partial\varpi _a} \stackrel{L}{\longrightarrow} N \bigl(\mathbf{0}, B_{11} \bigl(\varpi^* \bigr) \bigr), $$
(7.1)

and

$$ n^{1/2} \bigl(\tilde{\varpi}_a - \varpi^*_a \bigr) \stackrel{L}{\longrightarrow} N \bigl(\mathbf{0}, A_{11} \bigl( \varpi^* \bigr)^{-1} B_{11} \bigl(\varpi^* \bigr) A_{11} \bigl(\varpi^* \bigr)^{-1} \bigr), $$

where \(\tilde{\varpi}_{a} = (\tilde{\beta}_{a}^{\tau}, \tilde{d}_{a}^{\tau}, \tilde{\gamma}_{a}^{\tau})^{\tau}\) is the unpenalized maximum adjusted profile likelihood estimator of \(\varpi^{*}_{a}\).

Next, we prove the estimation consistency of \(\widehat{\varpi}\). It suffices to show that, for any given \(\epsilon > 0\), there exists a large constant \(C_{\epsilon}\) such that

$$ P \Bigl(\sup_{\|u\|_2 =C_{\epsilon}} \mathit{DPHL} \bigl(\varpi^* + n^{-1/2}u \bigr) < \mathit{DPHL} \bigl(\varpi^* \bigr) \Bigr) \geq1 - \epsilon, $$

where \(u = (u^{\tau}_{1}, u^{\tau}_{2}, u^{\tau}_{3})^{\tau}\), in which \(u_1 = (u_{11}, \ldots, u_{1p})^{\tau}\) is a \(p\)-dimensional vector, \(u_2 = (u_{21}, \ldots, u_{2q})^{\tau}\) is a \(q\)-dimensional vector, and \(u_3\) is a \((q(q-1)/2)\)-dimensional vector. This implies that there exists a local maximizer in the ball \(\{\varpi^* + n^{-1/2}u : \|u\|_2 \leq C_{\epsilon}\}\) and thus \(\|\widehat{\varpi} - \varpi^{*}\|_{2} = O_{p}(n^{-1/2})\).

Consider

$$\begin{aligned} S(u) =& \mathit{DPHL} \bigl(\varpi^* + n^{-1/2}u \bigr) - \mathit{DPHL} \bigl(\varpi^* \bigr) \\ =& AP_\alpha\bigl(\varpi^* + n^{-1/2}u \bigr) - AP_\alpha\bigl(\varpi^* \bigr) \\ & {}- n\sum^p_{j=1} \bigl( \varphi_{\lambda_{1n}} \bigl(\bigl|\beta^*_j + n^{-1/2}u_{1j}\bigr| \bigr) - \varphi_{\lambda_{1n}} \bigl(\bigl|\beta^*_j\bigr| \bigr) \bigr) \\ & {}- n\sum^q_{k=1} \bigl( \psi_{\lambda_{2n}} \bigl(\bigl|d^*_k+n^{-1/2}u_{2k}\bigr| \bigr) - \psi_{\lambda_{2n}} \bigl(\bigl|d^*_k\bigr| \bigr) \bigr). \end{aligned}$$

By the concavity and monotonicity of the penalty functions, we know that

$$\begin{aligned} S(u) \leq& AP_\alpha\bigl(\varpi^* + n^{-1/2}u \bigr) - AP_\alpha\bigl(\varpi^* \bigr) \\ & {}- n\sum^{p_1}_{j=1} \bigl( \varphi_{\lambda_{1n}} \bigl(\bigl|\beta^*_j + n^{-1/2}u_{1j}\bigr| \bigr) - \varphi_{\lambda_{1n}} \bigl(\bigl|\beta^*_j\bigr| \bigr) \bigr) \\ & {}- n\sum^{q_1}_{k=1} \bigl( \psi_{\lambda_{2n}} \bigl(\bigl|d^*_k+n^{-1/2}u_{2k}\bigr| \bigr) - \psi_{\lambda_{2n}} \bigl(\bigl|d^*_k\bigr| \bigr) \bigr) \\ \triangleq& S_1(u) - S_2(u) - S_3(u). \end{aligned}$$

For \(S_1(\cdot)\), using a Taylor expansion around \(\varpi^*\), we have

$$ S_1(u) = n^{-1/2} u^\tau\frac{\partial AP_\alpha(\varpi^*)}{\partial\varpi} - \frac{1}{2}u^\tau\biggl(-n^{-1} \frac{\partial^2 AP_\alpha(\varpi^*)}{\partial\varpi\partial\varpi^\tau} \biggr) u + o_p \bigl(n^{-1}\|u\|^2_2 \bigr). $$

Then, using the Cauchy–Schwarz inequality together with condition (C3) and (7.1), we have

$$\begin{aligned} S_1(u) \leq& O_p(1)\|u\|_1 - \frac{1}{2}u^\tau\bigl\{ A \bigl(\varpi^* \bigr)+o_p(1) \bigr\} u + o_p \bigl(n^{-1}\|u\|^2_2 \bigr) \\ \leq& \sqrt{p+q(q+1)/2} \|u\|_2 O_p(1) - \frac{1}{2}u^\tau A \bigl(\varpi^* \bigr)u + o_p \bigl(n^{-1}\|u\|^2_2 \bigr) \\ \triangleq& S_{11}(u) + S_{12}(u) + S_{13}(u), \end{aligned}$$

where ∥u1 is the L 1-norm of the (p+q(q+1)/2)-dimensional vector u, and it can be easily checked that \(\|u\|_{1} \leq\sqrt {p+q(q+1)/2} \|u\|_{2}\). For S 2(u), an application of Taylor expansion around zero vector yields

$$ S_2(u) \leq n^{1/2} a_n \|u_1 \|_1 + \frac{1}{2}b_n \|u_1 \|^2_2 + o_p \bigl(\|u_1 \|^2_2 \bigr). $$

Consequently, if \(a_n = O_p(n^{-1/2})\) and \(b_n = o_p(1)\), we know that

$$\begin{aligned} S_2(u) \leq& \sqrt{p}\|u_1\|_2O_p(1) + o_p \bigl(\|u_1\|^2_2 \bigr) \\ \triangleq& S_{21}(u) + S_{22}(u). \end{aligned}$$

Similarly, for S 3(u), we have

$$\begin{aligned} S_3(u) \leq& \sqrt{q} \|u_2\|_2O_p(1) + o_p \bigl(\|u_2\|^2_2 \bigr) \\ \triangleq& S_{31}(u) + S_{32}(u), \end{aligned}$$

if \(c_n = O_p(n^{-1/2})\) and \(d_n = o_p(1)\). By choosing a sufficiently large \(C_{\epsilon}\), the negative definite quadratic term \(S_{12}(u)\) dominates all the other terms uniformly in \(\|u\|_2 = C_{\epsilon}\), since it is of order \(-C_{\epsilon}^2\) while the remaining terms are at most of order \(C_{\epsilon}\). This completes the proof. □

Proof of Theorem 2

First, we prove the sparsity. Note that if \(d_k = 0\), then \(\gamma_{kt} = 0\) for all \(t = 1, \ldots, k-1\). Therefore, it is sufficient to show that \(P(\widehat{\beta}_{j} = 0)\rightarrow 1\) for all \(j = p_1+1, \ldots, p\) and \(P(\widehat{d}_{k} = 0)\rightarrow 1\) for all \(k = q_1+1, \ldots, q\). Without loss of generality, we show in detail that \(P(\widehat{\beta}_{p} = 0) \rightarrow 1\); the same argument gives \(P(\widehat{\beta}_{j} = 0) \rightarrow 1\) for \(j = p_1+1, \ldots, p-1\), and similarly \(P(\widehat{d}_{k}=0) \rightarrow 1\) for \(k = q_1+1, \ldots, q\). Applying a Taylor expansion around \(\beta^*\) yields

$$\begin{aligned} \frac{\partial \mathit{DPHL}(\widehat{\varpi})}{\partial\beta_p} =& \frac {\partial AP_\alpha(\varpi^*)}{\partial\beta_p} + \sum^p_{l=1} \frac{\partial^2 AP_\alpha(\varpi^*)}{\partial\beta_p \partial\beta _l} \bigl(\widehat{\beta}_l - \beta^*_l \bigr) \\ &{} + \frac{1}{2}\sum^p_{l=1}\sum ^p_{s=1} \frac{\partial^3 AP_\alpha(\varpi_0)}{\partial\beta_p \partial\beta_l \partial\beta_s} \bigl( \widehat{\beta}_l - \beta^*_l \bigr) \bigl(\widehat{ \beta}_s - \beta^*_s \bigr) - n\varphi'_{\lambda_{1n}}\bigl(| \widehat{\beta}_p|\bigr)\operatorname{sgn}(\widehat{\beta}_p), \end{aligned}$$

where \(\varpi_0\) lies between \(\varpi^*\) and \(\widehat{\varpi}\). From White [36], we know that \(\frac{\partial AP_{\alpha}(\varpi^{*})}{\partial\beta_{p}} = O_{p}(n^{1/2})\), and from Theorem 1 we have \(\|\widehat{\beta}-\beta^{*}\|_{2} = O_{p}(n^{-1/2})\). Hence, under conditions (C3) and (C5), we know

$$ \frac{\partial \mathit{DPHL}(\widehat{\varpi})}{\partial\beta_p} = n^{1/2} \bigl(O_p(1) - n^{1/2}\varphi'_{\lambda_{1n}}\bigl(|\widehat{ \beta}_p|\bigr)\operatorname{sgn}(\widehat{\beta}_p) \bigr), $$

which implies that \(n\varphi'_{\lambda_{1n}}(|\widehat{\beta}_{p}|)\operatorname{sgn}(\widehat{\beta}_{p})\) dominates the first three terms with probability tending to one whenever \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{p}|) \rightarrow \infty\). Since \(\partial \mathit{DPHL}(\widehat{\varpi}) / \partial\beta_{p}=0\) at the maximizer, this equation cannot hold for a sufficiently large sample size unless \(\widehat{\beta}_{p} = 0\). Consequently, \(\widehat{\beta}_{p}\) must be exactly 0 with probability tending to one.
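
As a one-dimensional caricature of this dominance argument (a sketch under simplifying assumptions: quadratic loss and an adaptive-LASSO-type weight, neither taken from the paper), the stationarity equation below forces the penalized estimate of a truly zero coefficient to be exactly 0 with probability tending to one, because the penalty slope \(\lambda_{1n} w\) eventually exceeds the \(O_p(n^{-1/2})\) magnitude of the unpenalized estimate:

    import numpy as np

    rng = np.random.default_rng(1)

    def soft_threshold(z, t):
        """Maximizer of -(b - z)^2/2 - t*|b|; the stationarity
        equation yields exactly 0 whenever the penalty slope t
        exceeds |z|."""
        return np.sign(z) * max(abs(z) - t, 0.0)

    upsilon1 = 1.0
    for n in (10**2, 10**4, 10**6):
        lam1n = n ** -0.75               # tuning rate (assumed)
        z = rng.normal(scale=n ** -0.5)  # unpenalized estimate of a zero coefficient
        w = abs(z) ** -upsilon1          # adaptive weight (assumed form)
        print(n, soft_threshold(z, lam1n * w))  # 0 once lam1n*w > |z|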

Next, we prove the asymptotic normality. Following Theorems 1 and 2(1), there exists a root-n consistent estimator \(\widehat{\varpi} = (\widehat{\varpi}^{\tau}_{a}, \mathbf{0}^{\tau})^{\tau}\) that satisfies the equation \(\partial \mathit{DPHL}(\widehat{\varpi}) / \partial\varpi_{a} = 0\). Then, an application of Taylor expansion around \(\varpi^{*}_{a}\) yields

$$\begin{aligned} 0 =& \frac{1}{\sqrt{n}}\frac{\partial AP_\alpha(\varpi^*)}{\partial \varpi_a} + \frac{1}{n} \frac{\partial^2 AP_\alpha(\varpi^*)}{\partial\varpi_a \partial\varpi ^\tau_a} n^{1/2} \bigl(\widehat{\varpi}_a - \varpi^*_a \bigr) + o_p \bigl(n^{1/2} \bigl( \widehat{\varpi}_a - \varpi^*_a \bigr) \bigr) \\ & {}- n^{1/2}F_1 \bigl(\varpi^*_a \bigr) - F_2 \bigl(\varpi^*_a \bigr) n^{1/2} \bigl( \widehat{\varpi}_a - \varpi^*_a \bigr) + o_p \bigl(n^{1/2} \bigl(\widehat{\varpi}_a - \varpi^*_a \bigr) \bigr), \end{aligned}$$

where \(F_{1}(\varpi^{*}_{a}) = ( \varphi'_{\lambda_{1n}}(|\beta^{*}_{j}|)\operatorname{sgn}(\beta^{*}_{j}), j=1,\ldots, p_{1}, \psi'_{\lambda_{2n}}(|d^{*}_{k}|)\operatorname{sgn}(d^{*}_{k}), k=1,\ldots, q_{1}, \mathbf{0}^{\tau})^{\tau}\), and \(F_{2}(\varpi^{*}_{a})\) is the corresponding second-derivative matrix of the penalty function vector \((\varphi_{\lambda_{1n}}(|\beta_{j}|), j=1,\ldots,p_{1}, \psi_{\lambda_{2n}}(|d_{k}|), k=1,\ldots, q_{1}, \mathbf{0}^{\tau})^{\tau}\) evaluated at \(\varpi^{*}_{a}\). Consequently, under the conditions of Theorem 2, we have

$$ 0 = \frac{1}{\sqrt{n}}\frac{\partial AP_\alpha(\varpi^*)}{\partial \varpi_a} + \frac{1}{n} \frac{\partial^2 AP_\alpha(\varpi^*)}{\partial\varpi_a \partial\varpi^\tau_a} n^{1/2} \bigl(\widehat{\varpi}_a - \varpi^*_a \bigr) + o_p(1). $$

Then, by Slutsky’s theorem,

$$ \sqrt{n} \bigl(\widehat{\varpi}_a - \varpi^*_a \bigr) \stackrel{L}{\longrightarrow} N \bigl(\mathbf{0}, A^{-1}_{11} \bigl(\varpi^* \bigr) B_{11} \bigl(\varpi^* \bigr) A^{-1}_{11} \bigl(\varpi^* \bigr) \bigr). $$

 □

Proof of Theorem 3

To obtain the conclusion, by Theorems 1 and 2, we only need to verify that \(a_n, c_n = o_p(n^{-1/2})\), \(b_n, d_n = o_p(1)\), \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{j}|) \rightarrow \infty\), and \(n^{1/2} \psi'_{\lambda_{2n}}(|\widehat{d}_{k}|) \rightarrow\infty\).

For any \(j = 1, \ldots, p_1\), under conditions (C1)–(C5), we know that \(w_{\beta_{j}} = O_{p}(1)\). Consequently,

$$\begin{aligned} n^{1/2} \varphi'_{\lambda_{1n}} \bigl(\bigl| \beta^*_j\bigr| \bigr) =& n^{1/2}\lambda_{1n}w_{\beta_j} = o_p(1) \\ \varphi''_{\lambda_{1n}} \bigl(\bigl|\beta^*_j\bigr| \bigr) =& 0, \end{aligned}$$

which implies that \(a_n = o_p(n^{-1/2})\) and \(b_n = o_p(1)\). Similarly, \(c_n = o_p(n^{-1/2})\) and \(d_n = o_p(1)\).

For any \(j = p_1+1, \ldots, p\), on the other hand, we know under conditions (C1)–(C5) that \(w_{\beta_{j}} = O_{p}(n^{\upsilon_{1}/2})\). Then,

$$ n^{1/2} \varphi'_{\lambda_{1n}}\bigl(|\widehat{ \beta}_j|\bigr) = n^{1/2}\lambda_{1n}w_{\beta_j} = O_p \bigl(\lambda_{1n} n^{(1+\upsilon_1)/2} \bigr), $$

which implies that \(n^{1/2} \varphi'_{\lambda_{1n}}(|\widehat{\beta}_{j}|) \rightarrow\infty\). Similarly, we know that \(n^{1/2} \psi'_{\lambda_{2n}}(|\widehat{d}_{k}|) \rightarrow\infty\) with probability tending to one, for all k=q 1+1,…,q. This completes the proof. □
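
The rate conditions verified in this proof can also be checked numerically. In the sketch below we assume, for illustration only, adaptive weights \(w_{\beta_j} = |\tilde{\beta}_j|^{-\upsilon_1}\) built from a root-\(n\) consistent pilot estimate; the proof itself only uses the orders \(w_{\beta_j} = O_p(1)\) for nonzero coefficients and \(w_{\beta_j} = O_p(n^{\upsilon_1/2})\) for zero ones. With \(\lambda_{1n} = n^{-3/4}\) and \(\upsilon_1 = 1\), both \(n^{1/2}\lambda_{1n} \rightarrow 0\) and \(\lambda_{1n} n^{(1+\upsilon_1)/2} \rightarrow \infty\) hold, and the two printed columns tend to 0 and \(\infty\) respectively.

    import numpy as np

    rng = np.random.default_rng(0)
    upsilon1 = 1.0
    for n in (10**2, 10**4, 10**6):
        lam1n = n ** -0.75
        tilde_nonzero = 2.0 + rng.normal(scale=n ** -0.5)  # pilot estimate, beta* = 2
        tilde_zero = rng.normal(scale=n ** -0.5)           # pilot estimate, beta* = 0
        w_nonzero = abs(tilde_nonzero) ** -upsilon1        # O_p(1)
        w_zero = abs(tilde_zero) ** -upsilon1              # O_p(n^{upsilon1/2})
        # n^{1/2} * lambda_1n * w: vanishes for the nonzero coefficient,
        # diverges for the zero coefficient (the sparsity condition).
        print(n, np.sqrt(n) * lam1n * w_nonzero, np.sqrt(n) * lam1n * w_zero)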

Proof of Theorem 4

We first show that for \(\varpi_{n} \stackrel{P}{\longrightarrow} \varpi\),

$$\begin{aligned} AP_\alpha(\varpi_n) - E \bigl\{ l(\varpi) \bigr\} =& o_p(n). \end{aligned}$$
(7.2)

To this end, note first that, according to the results in Tierney and Kadane [31] and conditions (C2) and (C3), we have

$$ \max_{\varpi\in\varTheta} \frac{1}{n}\bigl|AP_\alpha( \varpi) - l(\varpi)\bigr| \stackrel{P}{\longrightarrow} 0. $$
(7.3)

Then, following the proof of Theorem 2a in [11], it is sufficient to show that

$$ \max_{\varpi\in\varTheta} \frac{1}{n}\bigl|l(\varpi) - E \bigl\{ l(\varpi) \bigr\} \bigr| \stackrel{P}{\longrightarrow} 0. $$
(7.4)

Note that conditions (C1)–(C5) imply \([l(\varpi)-E\{l(\varpi)\}]/n \stackrel{P}{\longrightarrow} 0\) for all \(\varpi \in \varTheta\). Further, since conditions (C2), (C3) and (C5) satisfy the W-LIP assumption of Lemma 2 of Andrews [3], \(E\{l(\varpi)\}\) is uniformly continuous and \([l(\varpi)-E\{l(\varpi)\}]/n\) is stochastically continuous. Consequently, by Theorem 3 of Andrews [3], the stochastic continuity and pointwise convergence together yield (7.4), which, combined with (7.3), implies (7.2).

Then, considering an arbitrary candidate model \(\mathcal{S}\), under condition (C6), it follows from White [36] that the unpenalized estimator \(\tilde{\varpi}_{\mathcal{S}}\) converges to \(\varpi^{*}_{\mathcal{S}}\) in probability. One can verify that \(\varpi^{*}_{\mathcal{S}} = \varpi^{*}\) for any overfitted model \(\mathcal{S} \supset \mathcal{S}_{T}\), and \(\varpi^{*}_{\mathcal{S}} \neq \varpi^{*}\) for any underfitted model \(\mathcal{S} \nsupseteq \mathcal{S}_{T}\), since an underfitted model shrinks some nonzero parameters to zero. Consequently, we prove the conclusion of the theorem in two cases, for underfitted and overfitted models respectively.

Case 1 (Underfitted Model). By the fact that \(\varpi^{*}_{\mathcal{S}} \neq\varpi^{*}_{\mathcal{S}_{T}}\) for any \(\mathcal{S} \nsupseteq\mathcal {S}_{T}\) and \(\varpi^{*}_{\mathcal{S}_{T}} = \varpi^{*}\), we have

$$\begin{aligned} n^{-1} \bigl(\mathit{BIC}(\lambda_n) - \mathit{BIC}(\lambda) \bigr) =& 2n^{-1} \bigl\{ AP_\alpha(\widehat{\varpi}_{\lambda}) - AP_\alpha(\widehat{\varpi}_{\lambda_n}) \bigr\} + \frac{\log n}{n} (df_{\lambda_n} - df_{\lambda}) \\ \geq& 2n^{-1} \bigl\{ AP_\alpha(\widehat{\varpi}_{\lambda}) - AP_\alpha(\tilde{\varpi}_{\mathcal{S}_{\lambda_n}}) \bigr\} + o_p(1) \\ =& 2n^{-1} \bigl\{ E \bigl\{ l \bigl(\varpi^* \bigr) \bigr\} - E \bigl\{ l \bigl(\varpi^*_{\mathcal{S}_{\lambda_n}} \bigr) \bigr\} \bigr\} + o_p(1) \\ \geq& 2n^{-1}\min_{\mathcal{S} \nsupseteq \mathcal{S}_T} \bigl\{ E \bigl\{ l \bigl( \varpi^* \bigr) \bigr\} - E \bigl\{ l \bigl(\varpi^*_{\mathcal{S}} \bigr) \bigr\} \bigr\} + o_p(1), \end{aligned}$$

where the first inequality follows because \(AP_{\alpha}(\tilde{\varpi }_{\mathcal{S}_{\lambda_{n}}}) \geq AP_{\alpha}(\widehat{\varpi}_{\lambda _{n}})\) for all λ n and the second equality follows from (7.2). Therefore, under condition (C7), we have \(P(\inf_{\lambda_{n} \in R_{-}} \mathit{BIC}(\lambda_{n}) > \mathit{BIC}(\lambda)) \longrightarrow1\).

Case 2 (Overfitted Model). For any \(\lambda_n \in R_+\), we have \(df_{\lambda_{n}} - (p_{1} + q_{1}(q_{1}+1)/2) \geq 1\). Then

$$\begin{aligned} &\mathit{BIC}(\lambda_n) - \mathit{BIC}(\lambda) \\ &\quad = 2 \bigl\{ AP_\alpha(\widehat{\varpi}_{\lambda}) - AP_\alpha(\widehat{\varpi}_{\lambda_n}) \bigr\} + \bigl(df_{\lambda_n} - \bigl(p_1 + q_1(q_1+1)/2 \bigr) \bigr) \log n \\ &\quad \geq 2 \bigl\{ AP_\alpha(\widehat{\varpi}_{\lambda}) - AP_\alpha(\tilde{\varpi}_{\mathcal{S}_{\lambda_n}}) \bigr\} + \log n \\ &\quad \geq 2 \min_{\mathcal{S} \supset\mathcal{S}_T} \bigl\{ AP_\alpha (\widehat{ \varpi}_{\lambda}) - AP_\alpha(\tilde{\varpi}_{\mathcal{S}}) \bigr\} + \log n. \end{aligned}$$

By the fact that \(AP_{\alpha}(\varpi) = l(\varpi)(1 + O_p(n^{-1}))\) and \(\varpi^{*}_{\mathcal{S}} = \varpi^{*}\) for any \(\mathcal{S} \supset \mathcal{S}_{T}\), an application of a Taylor expansion yields

$$\begin{aligned} & AP_\alpha(\widehat{\varpi}_{\lambda}) - AP_\alpha( \tilde{\varpi}_{\mathcal{S}}) \\ &\quad = \bigl\{ \sqrt{n} \bigl(\widehat{\varpi}_{\lambda} - \varpi^* \bigr) - \sqrt{n} \bigl(\tilde{\varpi}_{\mathcal{S}} - \varpi^* \bigr) \bigr\} ^\tau n^{-1/2} \frac{\partial l(\varpi^*)}{\partial\varpi} \\ &\qquad {} + O_p \biggl( \bigl(\widehat{\varpi}_{\lambda} - \varpi^* \bigr)^\tau\frac{\partial^2 l(\varpi^*)}{\partial\varpi\partial \varpi^\tau} \bigl(\widehat{\varpi}_{\lambda} - \varpi^* \bigr) \biggr) \\ &\qquad {} + O_p \biggl( \bigl(\tilde{\varpi}_{\mathcal{S}} - \varpi^* \bigr)^\tau\frac{\partial^2 l(\varpi^*)}{\partial\varpi\partial \varpi^\tau} \bigl(\tilde{\varpi}_{\mathcal{S}} - \varpi^* \bigr) \biggr). \end{aligned}$$

Therefore, under conditions (C1)–(C5), it follows from White [36] that \(AP_{\alpha}(\widehat{\varpi}_{\lambda}) - AP_{\alpha}(\tilde{\varpi}_{\mathcal{S}}) = O_{p}(1)\), which implies that

$$ \mathit{BIC}(\lambda_n) - \mathit{BIC}(\lambda) \geq O_p(1) + \log n \stackrel{P}{\rightarrow} \infty. $$

As a result, we have \(P(\inf_{\lambda_{n} \in R_{+}} \mathit{BIC}(\lambda_{n}) > \mathit{BIC}(\lambda)) \longrightarrow1\), which completes the proof. □
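
For implementation, Theorem 4 justifies choosing \((\lambda_{1n}, \lambda_{2n})\) by a simple grid search. The sketch below is ours: the closed form \(\mathit{BIC}(\lambda) = -2\,AP_{\alpha}(\widehat{\varpi}_{\lambda}) + df_{\lambda}\log n\) is read off from the displays in Cases 1 and 2, and fit_dphl is a hypothetical stand-in for the paper's two-stage DPHL solver.

    import numpy as np

    def select_tuning(lambda_grid, fit_dphl):
        """Pick the tuning parameter minimizing the H-likelihood BIC.

        fit_dphl(lam) is assumed to return the maximized adjusted
        profile h-likelihood AP_alpha, the number of nonzero
        parameters df, and the sample size n, for a candidate
        lam = (lambda_1n, lambda_2n).
        """
        best_lam, best_bic = None, np.inf
        for lam in lambda_grid:
            ap_alpha, df, n = fit_dphl(lam)
            bic = -2.0 * ap_alpha + df * np.log(n)  # BIC(lambda)
            if bic < best_bic:
                best_lam, best_bic = lam, bic
        return best_lam, best_bic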

Cite this article

Xu, P., Wang, T., Zhu, H. et al. Double Penalized H-Likelihood for Selection of Fixed and Random Effects in Mixed Effects Models. Stat Biosci 7, 108–128 (2015). https://doi.org/10.1007/s12561-013-9105-x
