A Kenward-Roger Approximation and Parametric Bootstrap Methods for Tests in Linear Mixed Models – The R Package pbkrtest

When testing for reduction of the mean value structure in linear mixed models, it is common to use an asymptotic χ² test. Such tests can, however, be very poor for small and moderate sample sizes. The pbkrtest package implements two alternatives to such approximate χ² tests: (1) a Kenward-Roger approximation for performing F tests for reduction of the mean structure and (2) parametric bootstrap methods for achieving the same goal. The implementation is focused on linear mixed models with independent residual errors. In addition to describing the methods and aspects of their implementation, the paper also contains several examples and a comparison of the various methods.


Introduction
In this paper we address the question of testing for reduction of the systematic components in mixed effects models. Attention is restricted to models which are linear and where all random effects are Gaussian. The focus in this paper is on the implementation of these models in the lme4 package (Bates, Maechler, Bolker, and Walker 2014a) for R (R Core Team 2014); specifically as implemented in the lmer() function. The package pbkrtest (Halekoh and Højsgaard 2014) implements the methods described in this paper and is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=pbkrtest.
It is always possible to exploit that the likelihood ratio (LR) test statistic has a limiting χ² distribution as the amount of information in the sample goes to infinity. We shall refer to this test as the asymptotic χ² test. However, the χ² approximation can be poor and lead to misleading conclusions for small and moderate sample sizes. For certain types of studies it is possible to base the inference on an F statistic. Such studies generally need to be balanced in some way, for example with the number of observations in each treatment group being the same, and so on. These balance requirements often cannot be met in practice. Therefore there is a need for tests which, for a large class of linear mixed models, (1) are better than the asymptotic χ² test and (2) are relatively easy to compute in practice.
The paper is structured as follows: Section 2 describes the problem addressed in more detail and sets the notation of the paper. Section 3 illustrates the problems related to tests in mixed models through several examples. In Section 4 we describe the approach taken by Kenward and Roger (1997) to address the inference problem. Section 5 describes an alternative approach based on parametric bootstrap methods. In Section 6 we apply the methods to several data sets. Section 7 contains a discussion and outlines some additional improvements that can be made to the implementation in pbkrtest.
Throughout this paper we will use the CRAN version 1.0-5 of the lme4 package.

Preliminaries and notation
In this paper we focus on linear mixed models which, in the formulation of Laird and Ware (1982), are of the form

Y^{N×1} = X^{N×p} β^{p×1} + Z^{N×q} b^{q×1} + ε^{N×1},  (1)

where Y is an N vector of observables. The superscripts in Equation 1 refer to the dimensions of the quantities and these superscripts will be omitted whenever possible in the following.
In (1), X and Z are design matrices of the fixed and random effects, b is the random effect vector distributed as b ∼ N(0, Γ) and ε ∼ N(0, σ²I) is the vector of residual errors, where I is the N × N identity matrix. It is assumed that b and ε are independent. The covariance matrix of Y is therefore Var(Y) = Σ^{N×N} = ZΓZ⊤ + σ²I. This model is a simplification of the more general model proposed in Laird and Ware (1982), who allow the covariance matrix of ε to be a general positive definite matrix.
We are interested in testing hypotheses about the fixed effects in (1), i.e., testing for the smaller model

Y = X₀β₀ + Zb + ε,  (2)

where C(X₀) ⊂ C(X), with C(X) denoting the column space of X. Let d = dim(C(X)) − dim(C(X₀)). Notice that the structural forms of the random components of the two models are identical.
Testing the reduction of E(Y) = Xβ to E(Y) = X₀β₀ can in some cases be made as an exact F test; one example is given in Section 3. However, in many practical cases such an exact F test is not available and one often resorts to asymptotic tests. One approach is based on the LR test statistic T, which is twice the difference of the maximized log-likelihoods,

T = 2(log L − log L₀).  (3)
Under the hypothesis, T has an asymptotic χ²_d distribution (Wilks 1938). The reduction of the large model to the small model can equivalently be expressed by the equation Lβ = 0 with a non-singular d × p restriction matrix L. In Appendix B it is shown how L can be constructed from X and X₀.
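The construction of L in Appendix B is algebraic; numerically, the idea can be sketched in a few lines of base R. The sketch below is purely illustrative and not the package's implementation; restriction_matrix() is a hypothetical helper. It uses the fact that Xβ ∈ C(X₀) exactly when (I − P₀)Xβ = 0, where P₀ is the projection onto C(X₀), so the rows of L may be taken as an orthonormal basis of the row space of (I − P₀)X.

```r
restriction_matrix <- function(X, X0) {
  P0 <- X0 %*% solve(crossprod(X0)) %*% t(X0)  # orthogonal projection onto C(X0)
  M  <- (diag(nrow(X)) - P0) %*% X             # X beta in C(X0)  iff  M beta = 0
  s  <- svd(M)
  d  <- sum(s$d > 1e-8 * s$d[1])               # rank = dim C(X) - dim C(X0)
  t(s$v[, seq_len(d), drop = FALSE])           # d x p matrix of full row rank
}

## Example: X holds an intercept and a two-level factor; X0 only the
## intercept, so d = 1 and L beta = 0 removes the factor effect.
X  <- cbind(1, rep(c(0, 1), each = 3))
X0 <- X[, 1, drop = FALSE]
L  <- restriction_matrix(X, X0)
```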
A test of the more general hypothesis L(β − β_H) = 0 can be based on the Wald test statistic

W = (L(β̂ − β_H))⊤ (L V̂ L⊤)⁻¹ (L(β̂ − β_H)),  (4)

where β̂ is an estimate of β and V̂ of the covariance matrix of β̂. In this paper we focus on the case where β_H = 0. Under the hypothesis, W also has an asymptotic χ²_d distribution, and the Wald and the LR tests are hence asymptotically equivalent.
The approximation of the null distribution of T or W by a χ²_d distribution can for small samples be quite poor, and this can lead to misleading conclusions. An example of this is given in Section 3. Nonetheless, this approximation is often used in practice, mainly because of the lack of attractive alternatives. This paper is aimed at providing some remedies for this.
(a) Kenward and Roger (1997) provide a modification of W given in (4). They also argue that this modified statistic asymptotically follows an F_{d,m} distribution, and they provide a method for estimating the denominator degrees of freedom m. We have implemented their work in the function KRmodcomp() for models of the form (1); notice in particular that attention is restricted to models for which the residuals are independent and have constant variance. Throughout this paper we shall refer to Kenward and Roger (1997) as KR.
(b) The second contribution of this paper is to determine either the full null distribution or moments of the null distribution of the LR test statistic (3) by a parametric bootstrap approach (Davison and Hinkley 1997, Chapter 4). This has been implemented in the function PBmodcomp().

The degree-of-freedom issue for linear mixed models
In this section we discuss the degree-of-freedom issue on the basis of the beets dataset in the pbkrtest package. The beets data, which to our knowledge have not been published elsewhere, come from a split-plot experiment. Although the classical analysis of split-plot experiments is described in many places in the literature, see e.g., Cochran and Cox (1957, Chapter 7), we treat the topic in some detail in order to put the other parts of the article into context.
In the following i denotes harvesting dates (i = 1, 2), j denotes blocks (j = 1, 2, 3) and k denotes sowing dates (k = 1, . . . , 5). Let I = 2, J = 3 and K = 5. For simplicity we assume that there is no interaction between sowing and harvesting time (an assumption supported by Figure 1). A typical model for such an experiment would be

y_ijk = μ + α_i + β_j + δ_k + U_ij + ε_ijk,  (5)

where U_ij ∼ N(0, ω²) and ε_ijk ∼ N(0, σ²). Notice that U_ij describes the random variation between whole-plots (within blocks), and the presence of this term implies that measurements on the same whole-plot will be positively correlated.

The asymptotic χ 2 test
We can fit the models and test for no effect of sowing and harvesting time using the lmer() function from lme4 (Bates et al. 2014a).
R> library("lme4")
R> data("beets", package = "pbkrtest")
R> sug <- lmer(sugpct ~ block + sow + harvest + (1 | block:harvest),
+    data = beets, REML = FALSE)
R> sug_no.harv <- update(sug, . ~ . - harvest)
R> sug_no.sow <- update(sug, . ~ . - sow)

We then proceed by testing for no effect of sowing and of harvesting time. These tests are based on the limiting χ² distribution of the LR test statistic and suggest a highly significant effect of both sowing and harvesting time. Notice that above we have fitted the models with REML = FALSE, i.e., by maximum likelihood rather than by restricted maximum likelihood. We must do so for the following asymptotic χ² tests to make sense. However, the test for no effect of harvesting time is misleading because the hierarchical structure of the data has not been appropriately accounted for. We shall discuss this important issue in detail below.

The exact F test
Consider a comparison of two sowing dates and of two harvesting dates. From (5) we see that for contrasts between sowing dates the whole-plot variation cancels out, whereas the whole-plot variation prevails for contrasts between harvesting dates. This means that the effect of whole-plot treatments is determined with smaller precision than the effect of split-plot treatments. In some applications (for example if whole-plots are animals and split-plots correspond to an application of a treatment at different time points) it is often the case that ω² is considerably larger than σ². Estimated contrasts for sowing dates hence have smaller variance than estimated contrasts for harvesting dates.

Test for no effect of harvesting time

Next we consider test statistics. We shall use the notation y_i++ = Σ_jk y_ijk and ȳ_i++ = y_i++/(JK), etc. Also we let σ̃² = ω² + σ²/K. The test for no effect of harvesting time is based on the marginal model obtained after averaging over the sowing dates, i.e.,

ȳ_ij+ = μ̃ + α_i + β_j + e_ij,   where e_ij = U_ij + ε̄_ij+ with Var(e_ij) = σ̃².  (10)

Observe that ȳ_ij+ in (10) has the structure of a model for a balanced two-way layout (with factors harvesting time and block) without replicates.
Let SS_I = Σ_ijk (ȳ_i++ − ȳ_+++)² be the sum of squares associated with harvesting time. Direct calculation shows that E(SS_I) = Q_I + (I − 1)Kσ̃², where Q_I = JK Σ_i (α_i − ᾱ_+)². The corresponding mean square MS_I = SS_I/(I − 1) then has expectation E(MS_I) = Q_I/(I − 1) + Kσ̃². Since Q_I ≥ 0, and Q_I = 0 if and only if all α_i are identical, MS_I can be used for constructing a test for no effect of harvesting time.
Hence, when the hierarchical structure of the experiment has been accounted for, the effect of harvesting time is not significant at the 5% level.
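The whole-plot F test described above can be carried out in practice by averaging to whole-plot level and fitting an ordinary linear model; harvesting time is then tested against the whole-plot error on (I − 1)(J − 1) = 2 denominator degrees of freedom. The sketch below uses simulated data mirroring the beets layout (the data and effect sizes are made up for illustration; these are not the beets data):

```r
## Simulated split-plot layout mirroring the beets design: I = 2 harvest
## times, J = 3 blocks, K = 5 sowing dates; one whole-plot error per
## block:harvest combination.
set.seed(1)
dat <- expand.grid(harvest = factor(1:2), block = factor(1:3), sow = factor(1:5))
wp_id <- as.integer(interaction(dat$harvest, dat$block))
U <- rnorm(6, sd = 0.5)                       # whole-plot errors U_ij
dat$y <- 5 + rnorm(nrow(dat), sd = 0.25) + U[wp_id]

## Average over the K split-plots within each whole-plot; harvest is then
## tested against the whole-plot error on (I - 1)(J - 1) = 2 df.
wp <- aggregate(y ~ harvest + block, data = dat, FUN = mean)
anova(lm(y ~ block + harvest, data = wp))
```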

The Mississippi influents example
The Mississippi dataset in the SASmixed package (Bates 2011) contains the nitrogen concentration (in PPM) from several sites at six randomly selected influents of the Mississippi river.

R> data("Mississippi", package = "SASmixed")
R> Mississippi$influent <- factor(Mississippi$influent)
R> Mississippi$Type <- factor(Mississippi$Type)
R> head(Mississippi)

  influent  y Type
1        1 21    2
2        1 27    2
3        1 29    2
4        1 17    2
5        1 19    2
6        1 12    2

The influents were characterized according to watersheds as follows. Type = 1: no farmland in watershed (influents no. 3 and 5); Type = 2: less than 50% farmland in watershed (influents no. 1, 2 and 4); Type = 3: more than 50% farmland in watershed (influent no. 6). Measurements from the same influent are expected to be similar, and there is no particular interest in the individual influents. It is more interesting to investigate the effect of the watershed type on the nitrogen concentration.
A typical model for such data would include a random effect of influent, i.e.,

y_is = β_{k(i)} + U_i + ε_is,

where k(i) denotes the watershed type of influent i, U_i ∼ N(0, ω²) and ε_is ∼ N(0, σ²). Trusting large sample asymptotic results is questionable. If the data had been balanced, such that there were the same number of influents for each watershed type and the same number of recordings for each influent, then we could have made a proper F test along the lines of Section 3.1.
An alternative is to analyze the means for each influent, and this yields a much less clear indication of an effect of watershed type. To calculate the means we employ the doBy package (Højsgaard and Halekoh 2013):

R> library("doBy")
R> Miss.mean <- summaryBy(y ~ influent + Type, data = Mississippi,
+    FUN = mean)
R> detach("package:doBy")
R> miss1_lm <- lm(y.mean ~ Type, data = Miss.mean)
R> anova(miss1_lm)

Approximate F statistic and the KR approximation
In this section we first describe the KR approach to testing the hypothesis L(β − β_H) = 0 for a more general model than (1). We then describe the class of linear mixed models fitted with lmer() for which the function KRmodcomp() of the pbkrtest package provides the KR approach.

A multivariate normal model
KR consider for Y the multivariate normal model

Y ∼ N(Xβ, Σ).

The covariance matrix Σ = Σ(γ) is assumed to be a function of M parameters collected in the vector γ. We denote the REML estimate of these parameters by γ̂. The unbiased REML estimate of β is then, see Kackar and Harville (1984),

β̂ = Φ(γ̂) X⊤ Σ(γ̂)⁻¹ Y,   with Φ(γ) = (X⊤ Σ(γ)⁻¹ X)⁻¹,

where Φ is the covariance matrix of the asymptotic distribution of β̂ and Φ(γ̂) is a consistent estimate of Φ.
A scaled Wald-type statistic for testing the hypothesis L(β − β_H) = 0 is

F = (1/d) (L(β̂ − β_H))⊤ (L V̂ L⊤)⁻¹ (L(β̂ − β_H)),  (14)

where V̂ is some positive definite symmetric matrix. The usual Wald test statistic uses V̂ = Φ(γ̂). In this case F asymptotically has a (1/d)χ²_d distribution (which can be thought of as the limiting distribution of an F_{d,m} distribution when m → ∞). For some models F has an exact F distribution under the hypothesis; one example is a balanced one-way analysis of variance.

The approach of Kenward and Roger
KR modify the statistic F in (14) to improve its small sample properties by approximating the distribution of F by an F_{d,m} distribution, and they provide a method for calculating the denominator degrees of freedom m. The fundamental idea is to calculate the approximate mean and variance of their statistic and then match moments with an F distribution to obtain the denominator degrees of freedom. KR left out some details in the derivation of their method. Alnosaier (2007) provides more details, weakens some of the assumptions of the approach, and extends the list of models for which the approach is known to yield exact F tests.
KR take two steps to improve the small sample distributional properties of F. Firstly, Kackar and Harville (1984) showed that the covariance matrix of β̂ can be written as the sum Var(β̂) = Φ + Λ, where Λ expresses the bias by which the asymptotic covariance matrix Φ underestimates Var(β̂). KR combine a Taylor approximation to Λ with a bias-corrected modification of Φ(γ̂), based on a second order Taylor expansion, to derive a new estimate Φ_A(γ̂). In the statistic F in (14), KR replace the matrix V̂ with V̂ = Φ_A(γ̂). Secondly, KR derive a scaling factor λ (such that the statistic they consider is λF) and a denominator degree of freedom m by matching approximations of the expectation and variance of λF with the moments of an F_{d,m} distribution. In more detail, KR derive an approximation for the expectation E and variance V of F based on a first order Taylor expansion. Then they solve the system of equations

λE = E(F_{d,m}),   λ²V = Var(F_{d,m}),

where E(F_{d,m}) and Var(F_{d,m}) denote the expectation and variance of an F_{d,m} distributed random variable. The E and V are slightly modified, without changing the order of approximation, such that for the balanced one-way ANOVA model and Hotelling's T² model the exact F tests are reproduced (Alnosaier 2007, Chapters 4.1 and 4.2). We shall refer to these two steps as the Kenward-Roger approximation (or KR approximation for short). Details of the computations are provided in Appendix A.1; in particular, the solution to the equations above is given in (27). Recall that the mean of an F_{d,m} distribution exists provided that m > 2 and the variance exists provided that m > 4. The moment matching method does, however, not prevent estimates of m that are less than or equal to 2. KR did not address this problem and neither do we in our implementation.
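The moment matching step can be illustrated in isolation. The ratio V/E² does not involve λ, which yields a single equation for m; λ then follows from the mean equation. The sketch below implements this plain two-moment match; match_Fdm() is our own illustrative helper and not the adjusted solution (27) used in the package:

```r
## Plain two-moment match of lambda*F to an F_{d,m} distribution. Since
## Var(F_{d,m})/E(F_{d,m})^2 = 2(d + m - 2)/(d(m - 4)) is free of lambda,
## setting it equal to V/E^2 gives a closed-form solution for m; lambda is
## then recovered from lambda*E = E(F_{d,m}) = m/(m - 2).
match_Fdm <- function(E, V, d) {
  rho    <- V / E^2
  m      <- 4 + (d + 2) / (d * rho / 2 - 1)
  lambda <- (m / (m - 2)) / E
  c(m = m, lambda = lambda)
}

## Sanity check: feeding in the exact moments of an F_{2,10} distribution
## recovers m = 10 and lambda = 1.
match_Fdm(E = 10 / 8, V = 2 * 100 * 10 / (2 * 64 * 6), d = 2)
```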

Models for which KRmodcomp() provides tests
The KRmodcomp() function of the pbkrtest package provides the KR approximation for linear mixed models of the form (1) where Σ is a linear combination of known matrices,

Σ = Σ_{r=1}^{R} γ_r G_r,

where the γ_r are parameters and the G_r are known, usually very sparse, matrices. Variance component models and random coefficient models have this simplified covariance structure. For details we refer to Appendix A.1.
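As a small illustration of this structure, consider a one-way variance component model with two groups of two observations each; here Σ = ω²G₁ + σ²G₂ with G₁ = ZZ⊤ and G₂ = I (a toy sketch with made-up parameter values):

```r
## One-way variance component model with two groups of two observations:
## Sigma = omega^2 * G1 + sigma^2 * G2, where G1 = Z Z' (block-diagonal
## matrix of ones) comes from the random intercepts and G2 = I from the
## residual term.
Z  <- kronecker(diag(2), matrix(1, 2, 1))  # random-intercept design matrix
G1 <- Z %*% t(Z)
G2 <- diag(4)
gam <- c(0.5, 1)                           # illustrative (omega^2, sigma^2)
Sigma <- gam[1] * G1 + gam[2] * G2
Sigma                                      # compound symmetric within groups
```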

Parametric bootstrap
An alternative approach is based on parametric bootstrap, and this is also implemented in pbkrtest. The setting is the LR test statistic T for which we have an observed value t_obs. The question is in which reference distribution t_obs should be evaluated, i.e., what is the null distribution of T. Instead of relying on the approximation of the null distribution by a χ²_d distribution, one can use parametric bootstrap: First, create B (e.g., B = 1000) bootstrap samples y_1, . . . , y_B by simulating from f̂₀(y) (where f̂₀ denotes the fitted distribution under the hypothesis). Next, calculate the corresponding values T* = {t_1, . . . , t_B} of the LR test statistic. For what follows, let E*_T and V*_T denote the sample mean and sample variance of T*. These simulated values can be regarded as samples from the null distribution, and they can be used in the different ways implemented in the PBmodcomp() function. The labels below refer to the output from PBmodcomp(), see Section 6:

PBtest: Direct calculation of tail probabilities. The values T* provide an empirical null distribution in which t_obs can be evaluated. Let I(x) be an indicator function which is 1 if x is true and 0 otherwise. Following Davison and Hinkley (1997, Chapter 4), the p value then becomes

p_PB = {1 + Σ_{b=1}^{B} I(t_b ≥ t_obs)} / (B + 1).  (18)

Gamma: Approximate the null distribution by a gamma distribution with mean E*_T and variance V*_T.
Bartlett: Improve the LR test statistic by a Bartlett type correction: the LR test statistic T is scaled to better match the χ²_d distribution as

T_B = T / (E*_T / d) = dT / E*_T.

F: Approximate the null distribution of T/d by an F_{d,m} distribution with mean E*_T/d. This yields a single equation for deriving m, namely E(F_{d,m}) = m/(m − 2) = E*_T/d, which gives m = 2E*_T/(E*_T − d).

Notice that the parametric bootstrap approach is not restricted to linear mixed models of the type discussed in this paper. In fact, pbkrtest implements parametric bootstrap also for generalized linear models and for generalized linear mixed models.
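The four calculations in the listing above can be condensed into a few lines of base R. The helper pb_pvalues() below is an illustrative sketch, not the PBmodcomp() code, and the simulated T* (a scaled χ²_d, standing in for bootstrapped LR statistics) is made up for the example:

```r
## Given bootstrap samples Tstar of the LR statistic, an observed value
## t_obs and the degrees of freedom d, compute the four p values described
## above; the names mirror the PBmodcomp() output labels.
pb_pvalues <- function(Tstar, t_obs, d) {
  B  <- length(Tstar)
  ET <- mean(Tstar); VT <- var(Tstar)
  p_PB    <- (1 + sum(Tstar >= t_obs)) / (B + 1)       # PBtest, cf. (18)
  p_gamma <- pgamma(t_obs, shape = ET^2 / VT, scale = VT / ET,
                    lower.tail = FALSE)                # Gamma: match 2 moments
  p_bart  <- pchisq(t_obs * d / ET, df = d,
                    lower.tail = FALSE)                # Bartlett: scale T
  m <- 2 * ET / (ET - d)                               # F: match E(F_{d,m})
  p_F <- pf(t_obs / d, df1 = d, df2 = m, lower.tail = FALSE)
  c(PBtest = p_PB, Gamma = p_gamma, Bartlett = p_bart, F = p_F)
}

set.seed(1)
d <- 2
Tstar <- 1.2 * rchisq(1000, df = d)  # stand-in, heavier-tailed than chi^2_d
pb_pvalues(Tstar, t_obs = 7, d = d)
```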
We shall make the following remarks about the quantities mentioned in the listing above (in Section 6 we also provide graphical illustrations of these approaches):

1. Regarding PBtest, recall that the definition of a p value for a composite hypothesis is (see e.g., Casella and Berger 2002, p. 397)

p = sup_θ P_θ(T ≥ t_obs),

where the supremum is taken over all possible values θ = (β₀, γ) under the hypothesis. When this supremum cannot be evaluated in practice, it is often exploited that for large samples P_θ is approximately the distribution function of a χ²_d distribution, which is independent of θ. Implicit in (18) is therefore a definition of a bootstrapped p value as p̂ = P_θ̂(T > t_obs), and then (18) is used for the calculation. Determining the tail of a distribution as in (18) by sampling requires a large number of samples B (how large B must be depends in practice on the size of t_obs).
2. The quantities Gamma, Bartlett and F are based on assuming a parametric form of the null distribution such that the null distribution can be determined from at most the first two sample moments of T*. It generally requires fewer samples to obtain credible estimates of these moments than to obtain the tail probabilities in (18). We have no compelling mathematical argument why T* should be well approximated by a gamma distribution, but since a χ²_d distribution is also a gamma distribution, it is appealing to approximate the distribution of T* by a gamma distribution whose first two moments are matched to those of T*. In practice this means that we obtain a distribution which can have a heavier tail than the χ²_d distribution. The idea behind adjusting the LR test statistic by a Bartlett type correction as in T_B = dT/E*_T is to obtain a statistic whose distribution is closer to a χ²_d distribution (cf. Cox 2006, p. 130); see also, e.g., Jensen (1993) for a more comprehensive treatment of Bartlett corrections. Approximating the distribution of T/d by an F_{d,m} distribution can be motivated as follows: Under the hypothesis, T is in the limit χ²_d distributed, so T/d has in the limit a χ²_d/d distribution with expectation 1 and variance 2/d. This is, loosely speaking, the same as an F_{d,m} distribution with an infinite number of denominator degrees of freedom m. By estimating m as m = 2E*_T/(E*_T − d) we obtain the increased flexibility of an F distribution with a larger variance than 2/d, i.e., a distribution with a heavier tail than that of a χ²_d/d distribution.
3. A general problem with the parametric bootstrap approach is that it is computationally intensive. However the pbkrtest package allows for the samples to be drawn in parallel by utilizing several processors on the computer.
4. The parametric bootstrap approach may be modified into a sequential scheme as follows: Instead of fixing the number of parametric bootstrap samples B in advance, one may draw samples until h (e.g., h = 20) values of the test statistic more extreme than the observed test statistic have been obtained. If this takes B samples, then the p value to report is (h + 1)/(B + 1). If there is little evidence against the hypothesis, then only a small number B of simulations is needed. This idea is the parametric bootstrap version of the approach of Besag and Clifford (1991) for calculating sequential Monte Carlo p values and is illustrated below.

Applications of the methods
This section contains applications of the methods described in Sections 4 and 5 to the examples in Section 3, together with additional examples. In connection with parametric bootstrap, pbkrtest allows for samples to be drawn in parallel by utilizing several processors on the computer via the facilities provided in the parallel package. To do so we create clusters:

R> nc <- detectCores()
R> clus <- makeCluster(rep("localhost", nc))

The sugar beets example
For the sugar beets example of Section 3.1, the KR approximation provides the following results.

Harvesting time
First, it is noted that the p values reported by both KRmodcomp() and PBmodcomp() generally are (1) within the same order of magnitude and (2) close to the results of the exact F test of Section 3.1. Hence the results would all suggest the same qualitative conclusion, namely that there is little (if any) evidence for an effect of harvesting time and strong evidence for an effect of sowing time. Secondly, it is noticed that KRmodcomp() is much faster than PBmodcomp() in these examples. However, the difference in computing time is much smaller for other types of models/datasets, for example for certain random regression models (not reported in this paper).

Warnings from the optimizers
The built-in optimizers for lmer() are the "bobyqa" method for bound constrained optimization without derivatives from the minqa package (Bates, Mullen, Nash, and Varadhan 2014b), see also Powell (2009), and Nelder-Mead optimization as implemented in the Nelder_Mead() function in the lme4 package.
The default optimization method in lmer() is the "bobyqa" method. When using this method in connection with parametric bootstrap, as for example in

R> PBmodcomp(sug, sug_no.sow)

one may encounter warnings of the following form:

Warning in optwrap(object@optinfo$optimizer, ff, x0, lower = lower,
  control = control$optCtrl, : convergence code 3 from bobyqa:
  bobyqa -- a trust region step failed to reduce q

For the specific case above, this happens in a small number of simulations, say in up to 5% of the cases. One may alternatively use Nelder-Mead optimization as:

R> sugNM <- lmer(sugpct ~ block + sow + harvest + (1 | block:harvest),
+    data = beets, REML = FALSE,
+    control = lmerControl(optimizer = "Nelder_Mead"))
R> sugNM_no.sow <- update(sugNM, . ~ . - sow)
R> PBmodcomp(sugNM, sugNM_no.sow)

Nelder-Mead optimization, on the other hand, can result in the following warning (which for the specific case above happens very rarely, say in 5 out of 1000 simulations):

Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
  Model failed to converge: degenerate Hessian with 1 negative eigenvalues

In either case, the warnings indicate convergence problems, and in practical use of PBmodcomp() one must check that they do not occur in too many simulations.

How good are the parametric reference distributions?
In Section 5 the idea of approximating the bootstrap distribution by an F distribution, a gamma distribution and a scaled χ² distribution (Bartlett correction) was introduced. In this section we illustrate how well these approximations work.
The results of these approximations are obtained using summary().

Figure 3: Comparisons of the p values of the bootstrapped null distribution with the p values of the approximating parametric distributions. Left: Testing for no effect of harvesting time. Right: Testing for no effect of sowing time.

For a range of p values from 0.0001 to 0.200, i.e., those p values which are typically of practical relevance, we have calculated the quantiles q in the bootstrap reference distribution. We have then calculated the tail probabilities corresponding to these quantiles in the approximating gamma distribution, the F distribution, the Bartlett scaled distribution, and the asymptotic χ² distribution of the LR statistic. The results are shown in Figure 3. It is clear from these plots that the Bartlett scaled distribution, the gamma distribution and (to a lesser extent) the F distribution approximate the bootstrap distribution quite well, whereas the χ² distribution approximates the reference distribution poorly.

A sequential version of parametric bootstrap
As mentioned above, one may create a sequential version of the parametric bootstrap scheme (the aim being to speed up computations): Instead of fixing the number of samples B in advance, one may draw samples until h (e.g., h = 20) values of the test statistic more extreme than the observed test statistic have been obtained. If this takes B samples, then the p value to report is (h + 1)/(B + 1). If there is little evidence against the hypothesis, then only a small number B of simulations is needed. This functionality is not implemented in pbkrtest at the time of writing, but it is straightforward to create a minimal implementation. Running PBmodcomp() without parallel computing takes about a minute, so the saving in computing time is significant.
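A minimal implementation of the sequential scheme might look as follows; simulate_stat() is a hypothetical stand-in for simulating a dataset under the fitted null model and recomputing the LR statistic, and the toy usage draws from a χ²₁ null purely for illustration:

```r
## Sequential (Besag-Clifford style) bootstrap p value: draw samples until
## h exceedances of t_obs are seen, then report (h + 1)/(B + 1). B_max
## guards against endless runs when the p value is very small.
seq_pb <- function(t_obs, simulate_stat, h = 20, B_max = 10000) {
  extreme <- 0; B <- 0
  while (extreme < h && B < B_max) {
    B <- B + 1
    if (simulate_stat() >= t_obs) extreme <- extreme + 1
  }
  (extreme + 1) / (B + 1)
}

## Toy usage with a chi^2_1 null and weak evidence against the hypothesis,
## so sampling stops after relatively few draws.
set.seed(1)
p <- seq_pb(t_obs = 1.5, simulate_stat = function() rchisq(1, df = 1))
```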

The Mississippi influents example
For the Mississippi data of Section 3.4, our methods provide the following results:

Hence we obtain p values which are on the order of 10 times the p value provided by the χ² approximation. The p values we obtain are in good accordance with the p value obtained when analyzing the means as done in Section 3.4.

Random coefficient regression -A simulation study
KR performed a small simulation study on a simple random coefficient regression model. We made a simulation using the same model set-up and use it to compare the results between the different tests we provide and the KR approach as implemented in the MIXED procedure of the SAS software system (SAS Institute Inc. 2013). Kenward and Roger (1997, Table 4) consider the following random coefficient model

y_jt = β₀ + β₁ t + A_j + B_j t + ε_jt.

There are j = 1, . . . , 24 observed subjects divided into three groups of eight subjects. For each group, observations are made at the non-overlapping times t = 0, 1, 2; t = 3, 4, 5 and t = 6, 7, 8. The data for the simulation were generated under the assumption that β₀ = β₁ = 0, that (A_j, B_j) and ε_jt are normally distributed with zero expectation, that (A_j, B_j) is independent of ε_jt, and that observations from different subjects are independent.
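One simulated dataset under this set-up can be generated along the following lines (a sketch; the covariance matrix Gamma below is illustrative, not the values of KR's study):

```r
## 24 subjects in three groups of eight, observed at times 0-2, 3-5 and
## 6-8; random intercept A_j and slope B_j per subject, beta0 = beta1 = 0.
set.seed(1)
times <- rep(list(0:2, 3:5, 6:8), each = 8)
Gamma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)          # Var(A_j, B_j), made up
ab <- matrix(rnorm(24 * 2), 24) %*% chol(Gamma)   # rows are (A_j, B_j)
sim <- do.call(rbind, lapply(1:24, function(j) {
  t <- times[[j]]
  data.frame(subject = j, t = t,
             y = ab[j, 1] + ab[j, 2] * t + rnorm(length(t)))
}))
```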
The full model and the reduced models are fitted by:

R> Mod <- lmer(y ~ 1 + t + (1 + t | subject))
R> Mod_no.int <- lmer(y ~ 0 + t + (1 + t | subject))
R> Mod_no.slope <- lmer(y ~ 1 + (1 + t | subject))

The results are shown in the accompanying table. The KR approach from our implementation yields slightly conservative results for the tests on the intercept parameter β₀, and the tests in column F yield conservative results for the lowest nominal level. The difference in the results of the KR approach between our implementation and that of SAS may lie in the different treatment of cases where the covariance matrix Γ is singular.

Testing a hypothesis Lβ = Lβ H
We now present an example of testing a hypothesis L(β − β_H) = 0 with KRmodcomp() via the specification of a matrix L. We illustrate this with the sugar beets data. Assume that one wants to test the hypothesis that the difference between the second and first sowing date and the difference between the third and second sowing date are both equal to 0.1. The parameter vector β from the model fit sug in Section 3.2 has the entries

R> names(fixef(sug))

[1] "(Intercept)"  "blockblock2"  "blockblock3"  "sowsow2"
[5] "sowsow3"      "sowsow4"      "sowsow5"      "harvestharv2"

The restriction matrix L is then given as

R> L <- matrix(0, nrow = 2, ncol = 8)
R> colnames(L) <- names(fixef(sug))
R> L[1, "sowsow2"] <- 1
R> L[2, c("sowsow2", "sowsow3")] <- c(-1, 1)
R> t(L)

If the restriction matrix is not of full row rank, it will be replaced by a matrix of full row rank using Gram-Schmidt orthogonalization, as in the following example on the difference between the third and second sowing date.
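The rank reduction step can be sketched with base R's qr(), whose Householder factorization plays the role of the Gram-Schmidt step; full_row_rank() is an illustrative helper, not the package's code:

```r
## Replace L by a matrix with orthonormal rows spanning the same row space:
## the first rank(L) columns of Q from a QR factorization of t(L) form an
## orthonormal basis of the row space of L.
full_row_rank <- function(L, tol = 1e-8) {
  q <- qr(t(L), tol = tol)
  t(qr.Q(q)[, seq_len(q$rank), drop = FALSE])
}

L  <- rbind(c(1, 0, 0), c(0, 1, 0), c(1, 1, 0))  # three rows, but rank 2
L2 <- full_row_rank(L)                           # 2 x 3, full row rank
```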

Discussion
In this paper we have presented our implementation of a KR approximation for tests in linear mixed models. The implementation involves several matrices of order N × N, where N is the number of observations. We have exploited, via the facilities in the Matrix package (Bates and Maechler 2014), that several of the matrices involved in the computations will in many cases be sparse. Nonetheless, the current implementation of the KR approximation does not always scale to large datasets. As an example, consider a repeated measurement problem in which repeated measurements are made on a collection of subjects. If there are many subjects and the time series for each subject is short, then there is sparseness to be exploited. On the other hand, if there are few long time series, then the matrices involved will have a non-negligible number of non-zero elements. One approach to speeding up the computations is to compute the average of the observed and expected information matrices rather than the expected information matrix. This can lead to substantial improvements in computing time because some of the computationally most intractable terms vanish in the average information; see Gilmour, Thompson, and Cullis (1995) and Jensen, Mantysaari, Madsen, and Thompson (1996) for details. This may become available in later versions of pbkrtest. A very specific issue to which we have no clear answer is how the KR approximation should be modified in case of a singular estimate of the covariance matrix.
Contrary to the KR approximation, the parametric bootstrap approach has the advantage that it is easy to implement; all that is required is a way of sampling data from the fitted model under the hypothesis. Furthermore, parametric bootstrap is straightforward to implement for many types of models. Parametric bootstrap is already implemented for generalized linear models and for generalized linear mixed models in pbkrtest.
A problem with the parametric bootstrap approaches is the randomness of the results; repeated applications to the same dataset do not give entirely identical results. Moreover, calculating the reference distribution by sampling is computationally demanding. However, pbkrtest implements the possibility of parallel computing of the reference distribution using multiple processors via the parallel package. There are various possibilities for speeding up the parametric bootstrap computations: (1) Instead of fixing the number of parametric bootstrap samples B in advance, one may adopt a sequential scheme in which sampling continues until a pre-specified number of extreme samples have been obtained. This idea, which is closely related to the approach of Besag and Clifford (1991) for calculating sequential Monte Carlo p values, has been illustrated in the paper.
(2) Another such possibility is the Bartlett type correction we implemented: the correction depends only on the mean of the simulated null distribution, and likewise the gamma approximation depends only on the mean and the variance of the simulated null distribution. Estimating these two moments will in general require fewer simulations than estimating the tail of the null distribution. Hence, if one chooses to focus on these moment-based approximations, one may obtain credible results with fewer samples.
(3) It may also be possible to devise a sequential sampling scheme such that sampling stops when the estimates of the first or the first two moments have stabilized.
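Ideas (1) and (2) above can be sketched as follows. This is illustrative code, not the pbkrtest implementation: `draw_statistic` is a hypothetical callback that would simulate one dataset under the null and return the refitted LRT statistic, and the moment-based functions assume the vector of simulated null statistics `t_sim` is already available.

```python
import numpy as np
from scipy import stats

def sequential_mc_pvalue(t_obs, draw_statistic, h=20, max_sim=10000, rng=None):
    """Besag-Clifford style sequential Monte Carlo p value: keep sampling
    null statistics until h of them are at least as extreme as t_obs,
    or until max_sim samples have been drawn."""
    rng = rng if rng is not None else np.random.default_rng()
    extreme = n = 0
    while extreme < h and n < max_sim:
        n += 1
        if draw_statistic(rng) >= t_obs:
            extreme += 1
    # Stopped early: the h-th extreme sample occurred on draw n.
    return h / n if extreme == h else (extreme + 1) / (n + 1)

def moment_based_pvalues(t_obs, t_sim, df):
    """P values that use only the first one or two moments of the simulated
    null statistics rather than their tail."""
    m, v = np.mean(t_sim), np.var(t_sim, ddof=1)
    # Bartlett type correction: rescale so the corrected statistic has
    # mean df, then refer it to the chi-square distribution with df df.
    p_bartlett = stats.chi2.sf(t_obs * df / m, df)
    # Gamma approximation: match a gamma distribution to mean and variance.
    shape, scale = m * m / v, v / m
    p_gamma = stats.gamma.sf(t_obs, a=shape, scale=scale)
    return p_bartlett, p_gamma
```

A sequential version of idea (3) would stop `draw_statistic` sampling once the running estimates of `m` and `v` stabilize.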
At the beginning of the paper it was stated that we consider models which are nested with respect to the structure of the mean values but which have the same random effects. For the KR approach this is formally not a requirement, because only the random structure of the large model matters; the small model is used only for the construction of the restriction matrix. Conversely, for the parametric bootstrap approach, only the structure of the random effects of the smaller model matters.
An important final comment is that we do not in any way claim to have an omnibus panacea solution to a difficult problem. Instead we have provided two practically applicable alternatives to relying on large sample asymptotics when testing for the reduction of the mean value structure in general linear mixed models.

A. Details of the implementation of the KR approach

In this appendix more details of the implementation of the approach of KR in KRmodcomp() are given. First we describe the structure of the design matrix Z of the random effects and the related structure of the covariance matrix Γ. Secondly, we give the sequence of computations based on the matrices available from a fitted model object from lmer() and the matrices derived from them.

Structure for Z and Γ
The description of the structure of Z and Γ draws on the description given in a vignette of the lme4 package for versions prior to 1.0 (Bates 2013). Structural changes in these matrices for later versions have been accounted for in the following.
For a linear mixed model fitted with lmer() it is assumed that there are i = 1, ..., f grouping factors, denoted f_i. It is allowed that f_i = f_{i'} for i ≠ i'. The ith grouping factor f_i has g_i levels and there are q_i random effects for each level. The random effects for group level j are collected in the vector b_{ij} = (b_{ij1}, ..., b_{ijq_i})^T, and the random effects for the ith grouping factor are collected in b_i = (b_{i1}^T, ..., b_{ig_i}^T)^T. It is assumed that random effects from different grouping factors and from different levels of a grouping factor are independent. The covariance matrix of the random effects for grouping level j of factor f_i is independent of the grouping level and is denoted Γ_i. We assume that all elements of Γ_i are parameters that vary freely, except that Γ_i must be positive definite. Hence Var(b_i) = I_{g_i×g_i} ⊗ Γ_i, where ⊗ denotes the Kronecker product and I_{g_i×g_i} the identity matrix of dimension g_i.
For the random coefficient model of the simulation example there is one grouping factor, subject, with 24 levels; hence f = 1, g_1 = 24 and there are q_1 = 2 random effects (A_j, B_j) for subject j, such that b_1 = (A_1, B_1, ..., A_24, B_24)^T, and Γ_1 is the matrix in (19). If in the simulation example the two random effects A_j and B_j were assumed to be uncorrelated, the model would be specified with lmer() as lmer(y ~ A + t + (1 | subject) + (0 + t | subject)). Now there are two grouping factors, both equal to subject; hence f = 2, g_1 = g_2 = 24, q_1 = q_2 = 1, b_1 = (A_1, ..., A_24)^T, b_2 = (B_1, ..., B_24)^T, and Γ_1 and Γ_2 are 1 × 1 matrices.
Let γ_i = (γ_{i;11}, γ_{i;21}, ..., γ_{i;q_i 1}, γ_{i;22}, ..., γ_{i;q_i q_i})^T denote the vector of length s_i = q_i(q_i + 1)/2 of the elements of the lower triangle of Γ_i. For the kth element γ_{i;k} of γ_i it holds that γ_{i;k} = γ_{i;rr'} where k = (r' − 1)(q_i − r'/2) + r. Then we may write Γ_i = Σ_{k=1}^{s_i} γ_{i;k} E_{i;k}, where the E_{i;k} are the q_i × q_i symmetric incidence matrices with ones at the positions (r, r') and (r', r). Now, Var(Y) = Σ = Σ_{i=1}^f Z_i (I_{g_i×g_i} ⊗ Γ_i) Z_i^T + σ² I_{N×N}, where f is the number of grouping factors. Let γ denote the vector of length M made by concatenating the vectors γ_i and appending, as the last element, σ². Let G_r = Z_i (I_{g_i×g_i} ⊗ E_{i;r}) Z_i^T, where r refers to the rth element in γ and i is the grouping factor f_i related to the covariance parameter γ_r. Note that G_M = I_{N×N}.
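The decomposition of Γ_i into incidence matrices, and of Σ into the linear combination of the matrices G_r, can be checked numerically. The following is a small self-contained sketch (not pbkrtest code) for one grouping factor with g = 3 levels, q = 2 random effects per level and two observations per level; the numbers are arbitrary.

```python
import numpy as np

# Toy setting: one grouping factor, g = 3 levels, q = 2 random effects
# per level (intercept and slope), two observations per level, N = 6.
g, q, N = 3, 2, 6
rng = np.random.default_rng(0)
Z = np.kron(np.eye(g), np.column_stack([np.ones(2), rng.uniform(size=2)]))

def incidence(q, r, rp):
    """Symmetric q x q incidence matrix with ones at (r, rp) and (rp, r)."""
    E = np.zeros((q, q))
    E[r, rp] = E[rp, r] = 1.0
    return E

# One incidence matrix per free element of the lower triangle of Gamma
# (column-major order), plus the identity for the residual variance.
E_list = [incidence(q, r, rp) for rp in range(q) for r in range(rp, q)]
G = [Z @ np.kron(np.eye(g), E) @ Z.T for E in E_list] + [np.eye(N)]

# gamma = (gamma_11, gamma_21, gamma_22, sigma^2); Sigma is linear in gamma.
gamma = np.array([1.0, 0.3, 0.5, 2.0])
Sigma = sum(c * Gr for c, Gr in zip(gamma, G))

# Agrees with the direct form Z (I kron Gamma) Z^T + sigma^2 I.
Gamma = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma_direct = Z @ np.kron(np.eye(g), Gamma) @ Z.T + 2.0 * np.eye(N)
assert np.allclose(Sigma, Sigma_direct)
```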

Implementation of the KR approach in the KRmodcomp() function
The following estimates are directly provided by lmer(): (1) the parameter estimate β̂, (2) the vector γ̂ of the REML estimated covariance parameters and (3) the estimate Φ(γ̂) of the asymptotic covariance matrix of β̂.
The estimate of the covariance matrix of γ̂ is not directly available from lmer(), but is estimated in (26) from the inverse information matrix (cf. also Kenward and Roger 1997, Equations 4 and 5).
The implementation of the KR approximation in pbkrtest is based on the following quantities.
1. For each covariance parameter γ_r in γ we use the matrix G_r = Z_i (I_{g_i×g_i} ⊗ E_{i;r}) Z_i^T, where i refers to the grouping factor for the covariance parameter γ_r.
2. Then the estimated covariance matrix for Y becomes Σ̂ = Σ_{r=1}^M γ̂_r G_r.
3. For the computations to follow, we define a number of auxiliary matrices. Notice that the matrix Ω_r among these is not used in any computation in the implementation in pbkrtest, but Ω_r appears in the derivations below.
4. For each covariance parameter γ_r let P_r = X^T Σ̂^{-1} G_r Σ̂^{-1} X.
5. For each pair (γ_r, γ_s) of covariance parameters let Q_rs = X^T Σ̂^{-1} G_r Σ̂^{-1} G_s Σ̂^{-1} X. Notice that Q_rs is generally not symmetric, but Q_rs^T = Q_sr and hence Q_rs + Q_sr is symmetric. This symmetry property is exploited below. Moreover, tr(Q_rs) = tr(Q_sr).
8. In terms of the estimated asymptotic covariance matrix W = [W_rs] of the covariance parameter estimates γ̂, define U = Σ_r Σ_s W_rs (Q_rs − P_r Φ P_s).
10. The adjusted estimate of Cov(β̂) is then Φ_A = Φ(γ̂) + 2Λ̂, where Λ̂ = Φ̂UΦ̂ is a p × p matrix, and the adjusted test statistic is F = (1/d) (Lβ̂)^T (LΦ_A L^T)^{-1} (Lβ̂), where d is the rank of L.
11. KR derive a scaling factor λ for the F statistic given above (such that the statistic they finally propose is λF) and a denominator degrees of freedom m by matching approximate first and second moments of the λF statistic with the moments of an F_{d,m} distribution. In this connection KR use a number of auxiliary quantities.

A.2. Some numerical issues
In the calculation of ρ we encountered numerical problems for some models, where a ratio of two numbers that are both equal to zero arises. One can write ρ in terms of the quantities V_0 = 1 + c_1 B, V_1 = 1 − c_2 B, V_2 = 1 − c_3 B and D = 1 − A_2/d. V_1 and D can simultaneously become very small, yielding an unreliable ratio D/V_1. We resolve this problem by setting the ratio to 1 if max(|D|, |V_1|) < 10^{−11}.
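The guard can be expressed as a small helper; the function name and the tolerance argument are ours, but the rule is the one stated above.

```python
def safe_ratio(D, V1, tol=1e-11):
    """Guarded ratio D / V1 as used in the computation of rho: when both
    the numerator and the denominator are numerically zero, return 1
    instead of an unreliable 0/0 result."""
    if max(abs(D), abs(V1)) < tol:
        return 1.0
    return D / V1
```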
For example, for a simple block design, Y_{bt} = μ + α_t + β_b + ε_{bt}, b = 1, ..., n_b, t = 1, ..., n_t, one has for n_t = 2 and n_b = 3, or for n_t = 3 and n_b = 2, an exact F test with m = 2 denominator degrees of freedom. For a specific application of this model we found that D and V_1 were both very close to zero. If we define the ratio to be 1 in this case, then we end up with the correct answer, i.e., with m = 2. For the same design but with n_t = 2 and n_b = 5, or n_t = 3 and n_b = 3, we found for a specific application that V_2 = 0, which leads to ρ = ∞. This caused no problem, since the correct m = 4 degrees of freedom are obtained from Equation 27.

B. From model matrix to restriction matrix and vice versa
Referring to (1) and (2), let X and X 0 be model matrices with full column rank and dimensions N ×p and N ×p 0 . We wish to construct a restriction matrix L such that E(Y) = Xβ ∧Lβ = 0 is equivalent to E(Y) = X 0 β 0 .
Let P and P 0 denote the orthogonal projection matrices onto C(X) and C(X 0 ). One choice of L is L = (I − P 0 )PX = (P − P 0 )X, where (P − P 0 ) is the orthogonal projection onto the orthogonal complement of C(X 0 ) in C(X).
If Lβ = 0 then Xβ = P 0 Xβ ∈ C(X 0 ). On the other hand, if E(Y) = X 0 β 0 , then there exists a β such that X 0 β 0 = Xβ = P 0 Xβ and hence from (29) Lβ = 0. Now L is an N × p matrix but it only contains d = p − p 0 linearly independent rows, and we only need to extract these to obtain a valid restriction matrix.
The computations in pbkrtest are done via the QR decomposition of the augmented matrix D = [X_0 : X], i.e., D = QR. The matrix Q_0 consisting of the first p_0 columns of Q satisfies C(Q_0) = C(X_0). The matrix Q_1 consisting of the following p − p_0 columns of Q is a basis for the orthogonal complement of C(X_0) in C(X). Hence Q_1 Q_1^T is the orthogonal projection onto this complement, and therefore L = Q_1 Q_1^T X. Since Q_1 Q_1^T and Q_1^T have the same nullspace, a (p − p_0) × p restriction matrix L is obtained as L = Q_1^T X.
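The construction can be illustrated as follows. This sketch (the function name is ours) obtains the same basis Q_1 via an explicit projection and an SVD rather than via the single QR decomposition of the augmented matrix used in pbkrtest.

```python
import numpy as np

def restriction_matrix(X0, X, tol=1e-10):
    """Construct a (p - p0) x p restriction matrix L such that
    E(Y) = X beta with L beta = 0 is equivalent to E(Y) = X0 beta0."""
    Q0, _ = np.linalg.qr(X0)                 # orthonormal basis of C(X0)
    R = X - Q0 @ (Q0.T @ X)                  # columns projected off C(X0)
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    Q1 = U[:, s > tol]                       # complement of C(X0) in C(X)
    return Q1.T @ X                          # (p - p0) x p restriction matrix
```

For instance, with X0 an intercept column and X containing intercept, linear and quadratic terms, L has two rows, and any beta whose last two coordinates vanish satisfies L beta = 0.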
Next consider the opposite situation: given X and L, we want to derive X_0. Let W denote a p × p_0 matrix such that C(W) equals the nullspace of L; in pbkrtest, W is found from a QR decomposition of L^T. Hence for m = Xβ with Lβ = 0 we can write β = Wz for some z, so that m = Xβ = XWz, and X_0 can therefore be taken as X_0 = XW.
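The reverse construction can be sketched similarly; again the function name is ours, and the code assumes L has full row rank d = p − p_0.

```python
import numpy as np

def model_matrix_from_restriction(X, L):
    """Given X (N x p) and a full-row-rank restriction matrix L (d x p),
    construct a model matrix X0 for the restricted model: the columns of
    W span the nullspace of L, so beta = W z satisfies L beta = 0 and
    the restricted mean is X W z, i.e., X0 = X W."""
    d = L.shape[0]
    # Full QR of L^T: the first d columns of Q span the row space of L,
    # the remaining p - d columns span the nullspace of L.
    Q, _ = np.linalg.qr(L.T, mode='complete')
    W = Q[:, d:]
    return X @ W
```

With X containing intercept, linear and quadratic columns and L restricting the last two coefficients to zero, X_0 spans the same space as the intercept column alone.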