Balancing Type I Error and Power in Linear Mixed Models

Linear mixed-effects models (LMMs) have increasingly replaced mixed-model analyses of variance for statistical inference in factorial psycholinguistic experiments. Although LMMs have many advantages over ANOVA, setting them up for data analysis, like ANOVAs, requires some care. One simple option, when numerically possible, is to fit the full variance-covariance structure of random effects (the maximal model; Barr et al. 2013), presumably to keep Type I error down to the nominal alpha in the presence of random effects. Although it is true that fitting a model with only random intercepts may lead to higher Type I error, fitting a maximal model also has a cost: it can lead to a significant loss of power. We demonstrate this with simulations and suggest that for typical psychological and psycholinguistic data, higher power is achieved without inflating the Type I error rate if a model selection criterion is used to select a random-effect structure that is supported by the data.


Introduction
During the last ten years, linear mixed-effects models (LMMs; e.g., Pinheiro and Bates, 2000; Demidenko, 2013) have increasingly replaced mixed-model analyses of variance (ANOVAs) for statistical inference in factorial psycholinguistic experiments. The main reason for this development is that LMMs have a number of advantages over ANOVAs. From a pragmatic perspective, the most prominent one is that a single LMM can replace two separate ANOVAs with subjects (F1 ANOVA) and items (F2 ANOVA) as random factors, which does away with ambiguities of interpretation. Fitting the maximal model may provide what looks like an acceptable solution. Obviously, in such a model many of the model parameters may still be very close to zero, and consequently removing such parameters from the model will probably not significantly decrease the goodness of fit of the model. How models are to be selected is itself a highly active field of research, covering all fields of advanced multivariate statistics (i.e., it is not an LMM-specific problem), and, at this point in time, there is no perfect solution, but there are accepted ones (e.g., Vandekerckhove, Matzke and Wagenmakers, 2015). Such reduction in model complexity leads to a simpler model than the maximal model. However, many psychologists do not want to engage in, and possibly have to defend, their model selection. And indeed, as long as the maximal model were to yield the same estimates as a justifiably reduced parsimonious model, there actually would be no need for model reduction. Most importantly, by fitting the maximal model, one avoids the risk of falsely removing a variance component, which may lead to anti-conservative test statistics for fixed effects (e.g., Barr et al., 2013; Schielzeth and Forstmeier, 2009). Similarly, failure to model correlations may increase the Type I error rate of the associated fixed effects above the specified significance criterion.
In other words, the maximal model protects a researcher from falsely reporting a spurious fixed effect that originates from a misspecification of the unknown true random-effect structure.
It was against this background that Barr et al. (2013) concluded on the basis of simulations that maximal LMMs yield appropriate test statistics for fixed effects. We paraphrase their recommendation as follows: [T]he maximal random effect structure should be fitted to the data. This includes a variance component for subject-related and item-related intercepts, for every within-subject and within-item fixed effect, and in the ideal case even all possible correlations between these random effects. The random effect structure should be reduced if and only if the maximal model does not converge.
In many ways, this is a stunning recommendation because it provides a very simple solution to a problem that is quite pervasive in many types of statistical analyses. How is it possible that potentially overfitting an LMM incurs no cost? In the following, we will show that there is a substantial cost: The additional protection against Type I errors¹ implies a significant increase in Type II error rate or, in other words, a loss in statistical power to detect the significance of fixed effects. We will also show that selection of a parsimonious LMM is a promising alternative to the maximal model, balancing the Type I error rate and power.

¹ The significance level, usually 5%, should guarantee a rate of 5% false positives in the long run.

Simulation
If there is any situation where the maximal model approach implies a cost in terms of statistical power, we should be able to demonstrate the problem with a simulation of a simple experiment, estimating the Grand Mean (intercept) and a single fixed effect as described below.

Specification of the simulation models
The generating process is

Y_{c,s,i} = β_0 + S_{0,s} + I_{0,i} + (β_1 + S_{1,s} + I_{1,i}) X_c + ε_{c,s,i}.    (Eq. 1)

Here, Y_{c,s,i} refers to the dependent variable; the subscripts stand for condition (c), subject (s), and item (i). The by-subject and by-item intercepts are S_{0,s} and I_{0,i}, and the corresponding slopes are S_{1,s} and I_{1,i}. The fixed effects are β_0 and β_1, and the residual error is ε_{c,s,i}.
This generating process mimics a very simple experiment where 20 items (i = 1, ..., 20) are presented to 50 subjects (s = 1, ..., 50) under two conditions (c = 1, 2), encoded as X_{c=1} = −0.5 for the first and X_{c=2} = +0.5 for the second condition. For example, we can collect response times under two conditions with a grand mean of 2000 ms and an experimental effect of 25 ms. Accordingly, the model intercept was chosen as β_0 = 2000 and the experimental effect (i.e., the difference between the two experimental conditions) is either β_1 = 0 (assuming H0) or β_1 = 25 (assuming H1). The key concern is how the complexity of the random-effects structure of the LMM affects the estimate and the significance of the fixed effect β_1, with respect to both Type I error and power.
In the generating process, there is a random intercept S_{0,s} for each subject and a random slope (i.e., experimental effect) S_{1,s} for the condition within each subject. The standard deviation of the subject-specific random intercepts is τ_00 = 100, while the standard deviation of the subject-specific random slopes will be varied on the interval τ_11 ∈ [0, 120], to simulate the effect of the size of the random slope on Type I error rate and power. These two subject-specific random effects are chosen to be correlated with ρ_S = 0.6.
Additionally, there is a random intercept I_{0,i} and a random slope I_{1,i} for each item. Again, the standard deviation of the random intercepts is ω_00 = 100. Like the standard deviation of the subject-specific random slopes, the standard deviation of the item-specific random slopes will be varied on the interval ω_11 ∈ [0, 120], and the item-specific random effects are again correlated with ρ_I = 0.6. Finally, the model residuals are independent and identically distributed, ε_{c,s,i} ∼ N(0, 300²).
Note that the item- and subject-related random slope standard deviations and the numbers of items and subjects were chosen to ensure that these variance components will be present, but barely detectable in the simulated data over a wide range of the interval [1, 120].
Given that every item is presented to each subject in each of the two conditions, the total number of data points is 2000. From this generating process, a sample is drawn and five LMMs are fitted to the data that differ only in the structure of the random-effects part. The models are estimated under the null hypothesis of a zero fixed effect and under the alternative hypothesis of a fixed effect with value β_1 = 25.
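To make the setup concrete, the generating process can be sketched in a few lines of code. This is a hypothetical re-implementation, not the authors' simulation code; the function and variable names are ours, and the slope standard deviations are set to an arbitrary mid-range value of 30 for illustration.

```python
# Sketch of the generating process: correlated by-subject and by-item
# intercepts/slopes plus i.i.d. residual noise (assumed parameter values).
import numpy as np

def simulate(beta1=25.0, tau00=100.0, tau11=30.0, omega00=100.0,
             omega11=30.0, rho=0.6, sigma=300.0, n_subj=50, n_item=20,
             beta0=2000.0, seed=1):
    rng = np.random.default_rng(seed)
    # Correlated by-subject random effects (intercept, slope), rho = 0.6.
    cov_s = [[tau00**2, rho * tau00 * tau11],
             [rho * tau00 * tau11, tau11**2]]
    S = rng.multivariate_normal([0.0, 0.0], cov_s, size=n_subj)
    # Correlated by-item random effects.
    cov_i = [[omega00**2, rho * omega00 * omega11],
             [rho * omega00 * omega11, omega11**2]]
    I = rng.multivariate_normal([0.0, 0.0], cov_i, size=n_item)
    rows = []
    for s in range(n_subj):
        for i in range(n_item):
            for x in (-0.5, 0.5):   # the two conditions, centered contrast
                mu = (beta0 + S[s, 0] + I[i, 0]
                      + (beta1 + S[s, 1] + I[i, 1]) * x)
                rows.append(mu + rng.normal(0.0, sigma))
    return np.array(rows)

y = simulate()
print(len(y))   # 50 subjects x 20 items x 2 conditions = 2000 data points
```

Each simulated sample would then be passed to the five model fits described next.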
The first model is the maximal model, including the estimation of the correlation parameters (ρ_I and ρ_S) that by definition are fixed at 0.6 in the generating process. This model matches the generating process exactly unless the variances of the random slopes are set to 0.
Y_{c,s,i} = β_0 + S_{0,s} + I_{0,i} + (β_1 + S_{1,s} + I_{1,i}) X_c + ε_{c,s,i}    (Eq. 1)

Compared to the maximal one, the second model differs only in the two correlation parameters (i.e., it is model (Eq. 1) where ρ_S and ρ_I are set to 0).
The third model is a reduced one which ignores the item-specific random slope (i.e., it is model (Eq. 2) where ω_11 = 0) while keeping the random slope for subject. The fourth model excludes the random slope for subject (i.e., it is model (Eq. 2) where τ_11 = 0) while keeping the random slope for item. The fifth model is the random-intercept-only model, which excludes both random slopes (i.e., it is model (Eq. 2) where τ_11 = ω_11 = 0).
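In lme4 notation, the five random-effect structures could be written as follows. This is a sketch; the variable names dv, cond, subj, and item are illustrative and not taken from the text (the double-bar syntax suppresses correlation parameters).

```python
# The five candidate random-effect structures as lme4-style formula strings.
models = {
    1: "dv ~ 1 + cond + (1 + cond | subj) + (1 + cond | item)",    # maximal
    2: "dv ~ 1 + cond + (1 + cond || subj) + (1 + cond || item)",  # no correlations
    3: "dv ~ 1 + cond + (1 + cond || subj) + (1 | item)",          # no item slope
    4: "dv ~ 1 + cond + (1 | subj) + (1 + cond || item)",          # no subject slope
    5: "dv ~ 1 + cond + (1 | subj) + (1 | item)",                  # intercepts only
}
```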

Two simulation scenarios
There are two simulation scenarios. First, an appropriate model should yield adequate test statistics even if covariance parameters in the true model are set to zero. Obviously, these parameters would be removed in a reduced model (unless they were explicitly expected on theoretical grounds).
In the first scenario, we set the effect-related (co-)variance parameters to zero; that is, we know that the maximal model is overparameterized with respect to these parameters. This scenario illustrates the maximal cost of the maximal model specification.
Second, an appropriate model should not be affected by small (co-)variance components. Although they are present in the generating process, they are at or below the threshold of detectability.
In this second scenario, we are simulating the situation where the maximal model actually matches the generating process, but with a relatively small sample size, this maximal random-effect structure may not be supported by the data. In such cases, as far as we know, it is unknown whether a model with a richer random-effect structure will outperform a parsimonious model without these variance components.

Determination of the error rates
Both simulation scenarios involve two runs. In the first run, the Type I error rate of the models was estimated. In each iteration of the simulation, a sample was drawn from the generating process above, with β_1 = 0 (no fixed effect of condition in the generating process). Then all five models were fitted (using the lme4 package; Bates, Mächler, Bolker and Walker, 2015b; R Core Team, 2014) to these data excluding the fixed effect for condition (H0), and the same models were fitted to the same data including the fixed effect of condition (H1). If any of these ten model fits (including the maximal ones) did not converge, the sample was redrawn. Thus, it is ensured that the simulation results are not influenced by numerical convergence issues.
Then, the false-positive detection of the fixed effect for condition was determined with a likelihood ratio test (LRT). If the difference of the deviances between a model under H0 (excluding a fixed effect for condition) and the same model under H1 (including that fixed effect) is larger than 3.84, the fit was considered a false-positive detection of that fixed effect. This criterion implies a nominal Type I error rate of about α = 0.05, assuming a chi-squared distribution with one degree of freedom for the difference of the model deviances. In the second run, we determined power by drawing samples from the generating process with β_1 = 25, using the criterion χ²₁ ≥ 3.84 for the same models.
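The decision rule amounts to comparing the deviance difference against a chi-squared quantile; a minimal sketch (the function and variable names are ours, not from the original code):

```python
# Likelihood-ratio test for one extra fixed-effect parameter: compare the
# deviance difference of the H0 and H1 fits against a chi-squared quantile.
from scipy.stats import chi2

def lrt_significant(deviance_h0, deviance_h1, alpha=0.05, df=1):
    critical = chi2.ppf(1.0 - alpha, df)   # about 3.84 for alpha = .05, df = 1
    return (deviance_h0 - deviance_h1) > critical

print(round(float(chi2.ppf(0.95, 1)), 2))   # 3.84
```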
Given the Type I error rate and power estimates, the performance of the maximal model specification can be compared with the performance of the parsimonious model (see below). Obviously, the best model is the one providing maximal power while maintaining a Type I error rate at the nominal level (e.g., α = 0.05). The latter, however, depends not only on the model structure but also on the hypothesis test being used. For example, the LRT approximates the sampling distribution of the deviance differences by a chi-squared distribution. This approximation immediately implies that the test statistic of an LRT for a fixed effect in an LMM is not exact, and the obtained Type I error rate will not match the expected α exactly. This discrepancy between the expected and the observed Type I error rate decreases with increasing sample size. The unknown exact Type I error rate of the test statistic, however, increases the difficulty of comparing the model performances.

Model selection
With our simulations, we investigate the selection of a parsimonious, that is, a possibly reduced model for a given data set. Hence a criterion must be chosen to decide whether the complexity of a certain model is supported by the data. Naturally, the most complex model will always provide the best fit for a given data set, but bears the risk of over-fitting the data. Therefore, every model selection criterion tries to balance goodness-of-fit against model complexity.
Popular model selection criteria are the Akaike information criterion (AIC, Akaike, 1998), Bayes or Schwarz information criterion (BIC, Schwarz, 1978) and the aforementioned likelihood ratio test. For our simulations, we use the LRT criterion (an evaluation of the AIC can be found in the Appendix). The BIC is known to put a strong penalty on the model complexity for small sample sizes (e.g., Vandekerckhove et al., 2015).
In contrast to the AIC or BIC, which allow us to compare several models at once, the LRT can only compare two models. Hence, an additional heuristic is needed to choose the best model out of a set of candidate models. For our simulations, we chose the backward-selection heuristic: one starts with the most complex model (i.e., the maximal model, Eq. 1) and reduces the model complexity (i.e., model 1 → 2 → 3 → 4 → 5) until a further reduction would imply a significant loss in goodness-of-fit. The significance level of this model-selection criterion is specified by the chosen α_LRT of the LRT. Within the context of model selection, it is important to resist the reflex of choosing α_LRT = 0.05. The α_LRT cannot be interpreted as the "expected model-selection Type I error rate" but rather as the relative weight of model complexity and goodness-of-fit. For example, choosing α_LRT = 0 implies an infinite penalty on model complexity, and consequently the minimal model is always chosen as the best, irrespective of the evidence provided by the data.
Choosing α_LRT = 1 implies an infinite penalty on the goodness-of-fit, and the maximal model is always chosen as the best. Therefore, choosing α_LRT = 0.05 may imply an overly strong penalty on model complexity and hence select a reduced model even if the data favor a more complex one.
In fact, when comparing two nested models, the LRT with α_LRT ≈ 0.157 is equivalent to the AIC. Hence, one may expect that the LRT with α_LRT = 0.2 puts a slightly larger penalty on the goodness-of-fit compared to the AIC and will therefore choose the more complex model more frequently (see also Appendix).
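This equivalence is easy to verify: when two nested models differ by one parameter, the AIC prefers the larger model exactly when the deviance drops by more than 2, and the chi-squared tail probability at that cutoff gives the implied α_LRT.

```python
# Implied LRT alpha of an AIC comparison between nested models differing by
# one parameter: AIC favors the larger model iff the LRT statistic exceeds 2.
from scipy.stats import chi2

alpha_equiv = float(chi2.sf(2.0, 1))   # tail probability of chi^2_1 at 2
print(round(alpha_equiv, 3))           # 0.157
```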

Results
The simulations were carried out under two scenarios. In the first scenario, we investigated the cost incurred by the maximal model with respect to the power to detect the fixed effect. In this case, the variances of both random slopes were fixed to 0 in the generating process (and, by implication, the correlation parameters were zero as well). In this worst case scenario for the maximal model, the intercept-only model (Eq. 5) matches the generating process; all other models (Eqs. 1-4) were overparameterized by construction.
In the second scenario, we varied the standard deviations of the random slopes from τ_11 = ω_11 = 0 to τ_11 = ω_11 = 120 while keeping the correlation of the within-subject random effects and the within-item random effects fixed at 0.6. The goal was to determine how Type I error rates and power change as random slopes increase. Obviously, the maximal model should be favored for large random slopes. Moreover, the inclusion of models two to four in the simulations allowed us to examine whether model selection with a standard criterion yields better results than the maximal model when weak random slopes are present in the generating process.

Worst case scenario
The worst case scenario allowed us to determine the maximal cost implied by the maximal model. In this case, the variance of both random slopes was set to zero in the generating process.
Hence the intercept-only model (Eq. 5) matched the generating process and the other four models were overparameterized. Unsurprisingly, the intercept-only model (Eq. 5) rather than the maximal model (Eq. 1) performs best in this worst case scenario. Most interestingly, the maximal model even has a substantially smaller Type I error rate than the nominal α = 0.05. Hence it appears to be over-conservative at the cost of power. Similar results with a similar simulation were reported in Barr et al. (2013).
They, however, concluded that the loss in power is negligible.

Table 1: Type I error rates and power of models (Eqs. 1-5) in the case of a generating process without random slopes; 95% confidence intervals are also presented. In this case, the random-intercept-only model (Eq. 5) matches the generating process, while models (Eqs. 1-4) are overparameterized by construction.

Small random slopes
In the first simulation scenario, we showed that the maximal model leads to substantially reduced power compared to the intercept-only model when the latter matches the generating process. Hence, one may expect that even if the maximal model is true, that is, even in the case where the random slope variances of the generating process are non-zero, a reduced model (Eqs. 2-5) may hold a power advantage over the maximal model (Eq. 1) if its complexity is not supported by the data.
Of course, we do not know a priori whether a certain random-effect structure is supported by the data. We can determine a parsimonious model with a standard criterion such as the LRT. This criterion weights the goodness-of-fit of each model against its complexity in order to select the best model for a given data set and guard against over-fitting the data.
Therefore, in the second simulation scenario, we compared the performance (Type I error rates and power) of the maximal model (Eq. 1) with the model selected by the LRT (out of Eqs. 1-5) as a function of the random-slope variances (see also Westfall, Kenny and Judd, 2014). This scenario allowed us to study whether the LRT detects the need for an increased model complexity with increasing random-slope standard deviations and therefore maintains a Type I error rate close to the nominal α.
For each iteration of this simulation we chose the random-slope standard deviations in 20,000 steps from τ_11 = ω_11 = 0 to τ_11 = ω_11 = 120 (leading to a step size of 0.006). All other variance-covariance parameters were kept constant. Along with the false-positive and false-negative detections, we obtained the deviance values for each model. Then, we chose the best of the models 1-5 supported by the data according to the LRT model selection criterion and compared their performance.
The false-positive and false-negative detections of each iteration are samples from a Bernoulli distribution with an unknown rate parameter (the Type I and Type II error rates, respectively). To visualize Type I error rate and power (1 − Type II error rate) as a function of the random-slope standard deviation, we fitted generalized additive models (GAMs; e.g., Hastie and Tibshirani, 1990; Wood, 2006) to the false-positive and false-negative detection data, using the binomial family to describe the response distributions. This not only accurately models the response distribution of the false-positive and false-negative detections, it also provides confidence intervals for the estimated Type I error rate and power. Figure 1 shows Type I error rate and power (left and right column, respectively) as a function of the random-slope standard deviations (τ_11, ω_11) for the maximal model (Eq. 1) and the model selected according to the LRT.
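Because the false-positive indicators are Bernoulli draws, a pointwise binomial interval offers a simpler (if less smooth) alternative to the GAM-based confidence bands. A sketch using the Wilson score interval, which is not the authors' method and uses our own names and illustrative counts:

```python
# Wilson score interval for an error rate estimated from n Bernoulli trials.
import math

def wilson_ci(successes, n, z=1.96):
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(52, 1000)   # e.g., 52 false positives in 1000 iterations
print(round(lo, 3), round(hi, 3))
```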
The top row shows Type I error rate and power for the maximal model (Eq. 1). Its Type I error rate appears to be substantially smaller than the expected α = 0.05 (horizontal solid line) up to τ_11 = ω_11 ≈ 40. This indicates an over-conservative behavior of the maximal model in cases where the random slope standard deviations are too small to support the complex random-effect structure. On the interval [80, 120], the maximal model appears to be slightly anti-conservative with respect to the expected α = 0.05. On that interval, however, the maximal model matches the generating process.
Given that the maximal model is over-conservative even with respect to the expected α = 0.05 for small random slopes and, therefore, has reduced power, we expect that a parsimonious model may provide better power in these cases by choosing a model that is supported by the data: for example, model (Eq. 5) for small random slopes, model (Eq. 1) for larger ones, and one of the models (Eqs. 2-4) in between.
The bottom row in Figure 1 shows Type I error rate and power of the parsimonious model, selected according to LRT, as a function of the random-slope standard deviation. Although the Type I error rate appears to be larger than the expected α = 0.05, it is not substantially larger than the maximal Type I error rate of the maximal model. At the same time, it provides substantially better power on the interval [0, 40]. Thus, the LRT-based model selection approach yields an advantage of statistical power over the maximal model while maintaining a comparable Type I error rate.
Not everyone may consider the gain in power for this setting (shown in Fig. 1) substantial enough to justify the additional effort one has to put into model selection. However, the relative power gain naturally increases when overall power decreases (e.g., when sample size is small).
Hence, we performed the same simulation with a reduced number of subjects (30 instead of 50) and a reduced number of items (10 instead of 20), which reduced the total sample size from 2000 to 600. Here the maximal model shows an anti-conservative behavior on the interval [0, 80]. Consequently, a model selection approach using the LRT backward selection heuristic is able to provide a substantial gain in power over a larger interval [0, 90] while maintaining a Type I error rate comparable with the maximal model. Moreover, the relative power gain is larger for this reduced sample size. Figure 3 shows how the LRT model selection rate changes with increasing random-slope variances (top panel: N_subj = 50, N_item = 20; bottom panel: N_subj = 30, N_item = 10). A similar pattern is obtained for both sample sizes: The intercept-only model wins for small random-slope standard deviations and is less frequently selected in favor of the more complex models as the random-slope standard deviations increase; the maximal model gets selected almost exclusively for large random-effect slopes.
These model selection rates also explain the behavior seen in Figures 1 and 2: Small random slopes are not supported by the data, even in the case of a larger sample size. In this case, a parsimonious model yields the best description of the data and provides a power advantage over the maximal model. As the random-slope standard deviations increase, a more complex model must be chosen to describe the data adequately. Of course, this happens earlier for larger sample sizes than for smaller ones. Hence, a model selection approach using the LRT backward selection heuristic provides a power benefit compared to the maximal model.
In summary, the first simulation scenario (i.e., the simulation of the worst-case scenario) revealed clear deficits of the maximal model. In the second simulation scenario, even when the generating process matched the maximal model, the maximal model performed worse than a parsimonious model when variance components were not well supported by the data.

Discussion
The simulations yielded a clear set of results. First, in agreement with Barr et al. (2013) and others before them, we showed that the maximal model is able to guard against the increased Type I error rate that results from ignoring a significant variance component. However, while the maximal model indeed performs well as far as Type I error rates are concerned, power decreases substantially with model complexity. We have shown that the maximal model may trade off power for some conservatism beyond the nominal Type I error rate, even in cases where the maximal model matches the generating process exactly. In fact, the best model is the one providing the largest power while maintaining the chosen nominal Type I error rate. If more conservatism with respect to the Type I error rate is required, the significance criterion α should be chosen to be more conservative, instead of choosing a possibly over-conservative method with some unknown Type I error rate.
As already stated by Stroup (2012, p. 185), […]. A review of the studies considered by Vasishth, Chen, Li and Guo (2013) and of 26 of the studies reviewed by Engelmann, Jäger and Vasishth (2015) showed that the numbers of subjects used in rating studies and reading studies (self-paced reading and eyetracking) range from 16 to (in one case) 150, and the numbers of items from 12 to 80; the smaller numbers of subjects and items are far more common in psycholinguistics. For such typical sample sizes, it is not necessarily the maximal model, but more likely a model with a parsimonious random-effect structure, that will be most suitable for describing the data of factorial experiments. For experiments with much larger sample sizes, the situation could be different.
There is a looming concern about using model selection for inferential statistics about fixed effects in factorial experiments. Factorial experiments typically implement designs with a limited number of balanced conditions. Usually, full-factorial ANOVAs are specified, providing test statistics for all fixed effects, that is, all main effects and interactions. Rarely are main effects or interaction terms pooled with error terms because there is no a priori expectation for them to be significant.
And almost never is model selection used to pool non-significant main effects and interaction terms simply because they are not significant. Barr et al. (2013) refer to the first two cases as confirmatory and to the last case as exploratory hypothesis testing. How do or should these options influence decisions about the choice of the random-effect structure of LMMs? Barr et al. (2013) provide a very nuanced discussion of these issues and carefully delineate various alternatives, but in the end they come down strongly in favor of their recommendation to keep it maximal, that is, against model selection.² In line with our results, we want to make a strong case for model selection. Barr et al. (2013) contrast confirmatory (design-driven) hypothesis testing (i.e., the LMM is selected before the analysis and does not depend on the data, unless there are convergence problems) and data-driven hypothesis testing (i.e., the final LMM is selected taking into account whether model parameters are supported by the data). If we assume that all within-subject or within-item fixed effects are expected to be different from zero, it is reasonable to assume that all variance components are significantly different from zero as well, because it is unlikely that our experiments will detect a natural constant in psycholinguistic experiments. From this perspective, the maximal model looks like a coherent analysis strategy. This is, however, only very rarely a realistic scenario, at least once we go beyond two-factorial designs; we simply do not see explicit expectations about patterns of means relating to three-factor interactions. Confirmatory hypothesis testing must pool terms that are not expected to be significant, and their associated variance components, with the residual variance. Thus, strictly speaking, traditional ANOVA-based hypothesis testing is more often better characterized as overfitting rather than confirmatory. Barr et al.
(2013) favor the maximal model because they assume that the removal of non-significant terms might be the consequence not of an a priori selection of a model, but of a post-hoc selection with an eye towards the significance of p-values of the fixed effects.
Our starting point is different. We think a strictly confirmatory hypothesis testing is simply not a realistic scenario; Barr et al. (2013) do not provide one either. Especially the strict linkage of fixed-effect parameters and associated variance components deserves some special attention. We go along with fitting all fixed effects of factorial experimental designs. Although it might be useful to re-evaluate this strategy in this perspective, this issue is beyond the current article. Importantly, as is obvious by now, we argue that LMMs should be based on model selection. To be clear: The goal of model selection is not to obtain a significant p-value for a fixed effect; the goal is to identify the most parsimonious model that can be assumed to have generated the data. This procedure can be decided upon in advance, and p-values associated with fixed effects are then a result of this a priori determined procedure.

² Unfortunately, the reception of this article has been much less nuanced than warranted by its content.
There are four arguments in favor of a model-selection strategy. First, as shown with the simulations in this article, fixed-effect estimates do not depend on the random-effects structure as long as the factorial design is balanced (i.e., covariates are uncorrelated). Thus, the invariance of fixed-effect estimates in balanced designs for varying random-effects structures is not a surprise; one may just as well use ordinary least squares to estimate them. The random-effects structure may impact the estimates of the standard errors associated with fixed effects and thereby at least sometimes (not always) the decision about statistical inference. Matters get more complicated once one deals with correlated covariates, non-factorial designs, substantial imbalance due to missing data, or autocorrelated residuals (Baayen, Vasishth, Bates and Kliegl, 2015; Matuschek and Kliegl, 2015). Our simulations showed that, in the long run, the parsimonious model yields the best chances to detect a true fixed effect as significant.
Second, while plausible, the assumption that the true value of all variance components is larger than zero is actually not the critical test during model selection. Rather, the question is for which variance components this assumption is supported by the data. Fitting the maximal model incurs a substantial loss of statistical power if the true value of a variance component is small. Moreover, such overparameterized LMMs often yield nonsensical estimates of correlation parameters (i.e., values of −1 or +1) that clutter journal pages and mislead readers.
Third, we strongly want to encourage a move beyond the interpretation of variance components as mere nuisance parameters that need to be taken care of, for example, to avoid anti-conservative estimates of fixed effects (e.g., Barr et al., 2013; Schielzeth and Forstmeier, 2009). They hold the key for a joint consideration of experimental effects and associated individual or item differences (González et al., 2014; Kliegl et al., 2011). For example, whether a correlation between mean response time and an experimental effect is estimated as reliably positive or negative may have profound consequences for a theoretical consideration. Also, a fixed effect may not be significantly different from zero, but model selection may reveal reliable individual differences in this effect (Kliegl et al., 2011). In this case, the significant variance component points to qualitative differences in the direction of effects (i.e., positive and negative for about 50% of the subjects, respectively).
Fourth, the distinction between design-driven and data-driven, as introduced by Barr et al. (2013), misses an important confirmatory aspect in multivariate statistics: Any hypothesis about the support of variance components by the data requires a model comparison. For example, hypotheses about the random-effect structure (e.g., postulating a correlation parameter to be significantly different from zero) are tested by model selection (i.e., an LRT of models with and without the critical variance component). In multivariate statistics, LRTs are confirmatory, but Barr et al. (2013) call this analysis data-driven because selection of the parsimonious model depends on the data. Finally and obviously, we agree with Barr et al. (2013) that model-selection strategies must be carefully documented.
Having reached this conclusion, we face the discussion about suitable model selection criteria (e.g., AIC, LRT, GCV) and efficient schemes (e.g., forward, backward, full-scale cross-validation), like all other scientific fields using multivariate statistics to guide the accumulation of knowledge (e.g., Vandekerckhove et al., 2015). There are also some specific proposals for the selection of parsimonious mixed models (e.g., Bates et al., 2015a). Actually, in contrast to many other fields, much psycholinguistic research enjoys two key advantages. First, it is usually not difficult to design experiments with sufficient statistical power; that is, to plan a priori for a larger number of subjects and number of items (although in some cases it can be more difficult to increase items than subjects).
Aiming for high power is important for independent reasons, because even a statistically significant result obtained with a low-power experiment could easily have the wrong sign or an exaggerated effect size (Gelman and Carlin, 2014; Colquhoun, 2014). Second, it is not too difficult to repeat an experiment with new samples of subjects and items to validate the results of the best model found on the basis of the data of the first experiment. These are definitely viable and reliable steps to determine the best random-effects structure for the data of a specific design.
Finally, we want to emphasize that we are not proposing a dogmatic alternative to the keep it maximal proposal of Barr et al. (2013), in the sense that now everyone must engage in model selection. There is certainly room for argument about different approaches here, as there is in all of statistical practice. In order to achieve maximum transparency in the analysis, we propose that all data and code should be released with the publication of a paper, so that readers can revisit the analysis, check it, and investigate the robustness of claimed effects.

Appendix: Model selection using the AIC

Throughout the main article, we employed the LRT backward heuristic as the model selection criterion. Another popular means of model selection is the Akaike information criterion (AIC; Akaike, 1998). In contrast to the LRT, the AIC allows for the comparison of several models at once, while the former only compares two models with each other. The simultaneous comparison of several models avoids the need to choose a heuristic for a sequential, pair-wise model selection.
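AIC-based selection among the five candidate fits reduces to minimizing deviance + 2k over the candidate set. A toy sketch: the deviance values below are invented for illustration, while the parameter counts follow from the model definitions in the main text (e.g., the maximal model has 9 parameters: 2 fixed effects, the residual standard deviation, 4 random-effect standard deviations, and 2 correlations).

```python
# Pick the model minimizing AIC = deviance + 2 * (number of parameters).
def select_by_aic(fits):
    # fits maps model id -> (deviance, number of model parameters)
    return min(fits, key=lambda m: fits[m][0] + 2 * fits[m][1])

# Invented deviances for illustration; parameter counts per model 1-5.
fits = {1: (2810.0, 9), 2: (2811.0, 7), 3: (2812.0, 6),
        4: (2813.0, 6), 5: (2813.5, 5)}
print(select_by_aic(fits))   # model 5: AIC = 2813.5 + 10 = 2823.5, the minimum
```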
In this appendix, we present the results (Type I error rates and power) of our simulations using the AIC as the model selection criterion instead of the LRT (compare Fig. 1). For a smaller sample size (i.e., N_sub = 30, N_item = 10; Fig. 5), however, the model selected by the AIC might be considered too anti-conservative with respect to the Type I error rate, although a definite statement is not possible as the exact Type I error rate of the test is unknown.
The difference between the two model selection approaches is most pronounced for the smaller sample size. In general, model selection tries to balance the goodness-of-fit of a model against its risk of over-fitting the data. Therefore, the decision about which model is the most appropriate one for a given data set depends strongly on the amount of evidence provided by the data. Simply taking the best model according to the model selection criterion in cases of very small data sets bears the risk of choosing the wrong model. Hence, model selection gets increasingly difficult with decreasing sample sizes, while allowing for an increased gain in (relative) power.
In summary, for small sample sizes, the model selected by the AIC appears to be slightly anti-conservative even with respect to the maximum Type I error rate of the maximal model. The more conservative LRT with backward-selection heuristic, however, still maintains a Type I error rate