Perils and pitfalls of mixed-effects regression models in biology

Biological systems, at all scales of organisation from nucleic acids to ecosystems, are inherently complex and variable. Biologists therefore use statistical analyses to detect signal among this systemic noise. Statistical models infer trends, find functional relationships and detect differences that exist among groups or are caused by experimental manipulations. They also use statistical relationships to help predict uncertain futures. All branches of the biological sciences now embrace the possibilities of mixed-effects modelling and its flexible toolkit for partitioning noise and signal. The mixed-effects model is not, however, a panacea for poor experimental design, and should be used with caution when inferring or deducing the importance of both fixed and random effects. Here we describe a selection of the perils and pitfalls that are widespread in the biological literature, but can be avoided by careful reflection, modelling and model-checking. We focus on situations where incautious modelling risks exposure to these pitfalls and the drawing of incorrect conclusions. Our stance is that statements of significance, information content or credibility all have their place in biological research, as long as these statements are cautious and well-informed by checks on the validity of assumptions. Our intention is to reveal potential perils and pitfalls in mixed model estimation so that researchers can use these powerful approaches with greater awareness and confidence. Our examples are ecological, but translate easily to all branches of biology.

A simple simulation illustrates how this issue arises in small datasets. Imagine a simple treatment-control experiment that applies a treatment to half of the members of each of six families. Here, family ID is the random intercept term, and the mean value of the response variable for each family is drawn from a Normal distribution with mean = 0 and standard deviation = 1. We then simulate values of the response variable such that the experimental treatment itself has no effect (mean = family mean, standard deviation = 1). Family ID is therefore the dominant source of variation in the response variable, and the null hypothesis of no treatment effect is true and should not be rejected. We first conducted a poorly replicated version of this experiment, in which only one individual from each family was measured in each experimental group (treatment and control). We then tested for statistical significance of the treatment effect using mixed-effects models and: (a) determining whether the 95% confidence interval of the treatment effect in the full model overlapped zero; (b) a likelihood ratio test comparing maximum likelihood models with and without the fixed effect of treatment; and (c) a comparison of the AIC of the models with and without the treatment effect. We used a significance threshold of 0.05 for approach (b), and for approach (c) we considered the effect of treatment important when the AIC of the model including the fixed effect of treatment was more than 2 AIC units lower than that of the null model. All three approaches produced high rates of false positives: 122 out of 1000 (12.2%), 105 out of 1000 (10.5%) and 100 out of 1000 (10%) simulations respectively (with identical results from nlme and lme4). This is around double the nominal 5% rate. When each family instead contributed ten control and ten treated individuals, the error rates were 5.9%, 5.7% and 3.3% (5.1% for nlme) respectively.
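The data-generating process for this simulation can be sketched as follows. This is a minimal Python/numpy sketch of the generating process only (the mixed models themselves were fitted in R with lme4 and nlme and are not reproduced here); the function and variable names are ours, not from the original analysis.

```python
import numpy as np

def simulate_experiment(n_families=6, n_per_arm=1, family_sd=1.0,
                        residual_sd=1.0, rng=None):
    """Simulate one run of the null experiment described above:
    family means are drawn from N(0, family_sd), and the treatment
    has no true effect, so both arms share the family mean."""
    rng = np.random.default_rng() if rng is None else rng
    family_means = rng.normal(0.0, family_sd, size=n_families)
    rows = []
    for fam, mu in enumerate(family_means):
        for treatment in (0, 1):  # 0 = control arm, 1 = treatment arm
            for y in rng.normal(mu, residual_sd, size=n_per_arm):
                rows.append((fam, treatment, float(y)))
    return rows  # (family_id, treatment, response) tuples

# Poorly replicated design: one individual per family in each arm,
# giving 6 families x 2 arms x 1 individual = 12 observations.
data = simulate_experiment(rng=np.random.default_rng(1))
```

In the full simulation study this generation step would be repeated 1000 times, with each dataset passed to the mixed-model fitting and testing procedures (a)-(c).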
Consequently, while this is a pitfall to be aware of when using (G)LMMs with small datasets, it is likely to be less important with adequate replication.

Pseudoreplicated experimental designs are those in which fixed effects vary at a hierarchical scale above that of the measured subject. If, for example, whole families are exposed to an experimental treatment, and family members resemble each other beyond the impact of that treatment, then individual members of families do not represent independent measures of the effect of treatment and should not be considered true biological replicates. In an extreme situation, where there are many non-independent observations within each of a small number of random effect levels, the number of residual degrees of freedom (those used to infer residual variance) can greatly exceed the number of experimental units to which treatments were applied (the number of families). While, traditionally, F-ratio tests would reveal this problem, likelihood ratio tests hide it from the analyst because they use only the degrees of freedom associated with the explanatory variables (Arnqvist, 2020). AIC values also hide the problem by providing no information on degrees of freedom at all. When (G)LMMs are fitted in R, this problem is managed by fitting fixed effects at the correct level of the hierarchy. This deals with the pseudoreplication pitfall, but can result in poor replication of the fixed effects if the number of random effect groups is low (Fig. 3). The model may struggle to partition the variance explained by the random effects from unexplained (residual) variance, resulting in a marked decline in the accuracy of model inference and an increased risk of anti-conservative p-values. A standard "blocked" (G)LMM with planter ID as a random effect would not control for this issue.
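The degrees-of-freedom mismatch can be made concrete with some hypothetical numbers (a sketch using illustrative counts of our own choosing, not figures from the study above):

```python
# Hypothetical pseudoreplicated design: whole families receive the
# treatment, but individual offspring are the measured subjects.
n_families = 6      # experimental units (treatment applied per family)
n_offspring = 10    # measurements taken within each family
n_obs = n_families * n_offspring   # 60 observations in total

n_fixed_params = 2  # intercept + treatment effect

# Residual df computed naively at the observation level,
# as if every offspring were an independent replicate:
naive_residual_df = n_obs - n_fixed_params       # 58

# Replication actually available for testing the treatment effect,
# counted at the level the treatment was applied:
unit_level_df = n_families - n_fixed_params      # 4
```

The naive count (58) dwarfs the true unit-level count (4), which is exactly the gap that F-ratio tests expose but likelihood ratio tests and AIC comparisons conceal.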

The risk of anti-conservatism does not mean that (G)LMMs have no value in pseudoreplicated designs. However, it does mean that the analyst must have a clear understanding of where true replication lies in the design.

The standard definition of a random effect is that it should describe membership of a group that is part of a random sample from a larger population of groups. Random effect hyperparameters infer the variance of means and slopes among groups. Inference of any variance parameter requires several independent replicates, and improves with an increasing number of groups. Hence, accurate estimation of random effects can only occur when several groups are represented in a dataset (Harrison, 2014, 2015; Harrison et al., 2018). In practice, however, mixed-effects models are commonly used to capture and absorb any non-independence due to group membership. When only a few groups have been measured, inference of random effect hyperparameters will be poor; models might be degenerate; and there will be consequences for the correct inference of fixed effect parameters and tests of their importance. For example, fitting sex as a random effect implies calculating a variance from only two values: we wouldn't infer a simple standard deviation from a sample of two individuals, so we shouldn't use the (G)LMM algorithm to make the same crude inference. In general, it is suggested that a categorical variable should only be fitted as a random effect if it has 5-6 levels or more (Bolker et al., 2009; Harrison, 2015). We illustrate this problem in Figure …

PeerJ reviewing PDF | (2020:03:47151:1:0:NEW 14 Jun 2020)

Many studies that use (G)LMMs fit a random intercepts model, in which all groups are assumed to share a common slope. In the subspecies example of Fig. 5, an analysis allowing slopes to vary among subspecies is likely to detect no overall effect, but also to show that the slope for one subspecies was unusual.
The difference between the random slopes and random intercepts models does not arise from differences in their estimates of the effect of body size (Fig. 5c), but because the random intercepts model consistently underestimates the standard error around this estimate (Fig. 5d). Note that random slope models can often return low estimates of the variance in slopes among groups; this is to be expected if only one group shows a strong relationship with the response, especially when the random effects sample size is large. Measuring the 'importance' of random effects based on these variance components is a contentious issue (Harrison et al., 2018), but random effects should not automatically be discounted or removed just because a variance component appears negligible.
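As a quick illustration of how crude a variance estimate from only two levels is (our own simulation, independent of the analyses above), compare the sampling distribution of a standard deviation estimated from two values, as when sex is fitted as a random effect, with one estimated from twenty:

```python
import numpy as np

rng = np.random.default_rng(0)
true_sd, n_sims = 1.0, 10_000

# Sample SD estimated from only 2 values (e.g. a two-level factor
# such as sex treated as a "sample" of groups) vs from 20 values.
sd_from_2 = np.array([rng.normal(0, true_sd, 2).std(ddof=1)
                      for _ in range(n_sims)])
sd_from_20 = np.array([rng.normal(0, true_sd, 20).std(ddof=1)
                       for _ in range(n_sims)])

# With n = 2 the estimate is far more variable and biased low;
# with n = 20 it concentrates near the true value of 1.
```

Across simulations, the two-value estimates scatter wildly (and average well below the true value of 1), while the twenty-value estimates cluster tightly around the truth; the (G)LMM variance-component machinery faces the same arithmetic when given only two group levels.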

#2. Pseudoreplication with group-level predictors
Example: inferring the effect of maternal traits on the performance of several offspring per mother.

#4. Random intercepts when groups vary in their response to a treatment
Example: testing a relationship between body size and competitive advantage in multiple populations of the same species.
Consequence: increased Type I error rate when there is variation in slopes between different random effect levels, regardless of any correlation between the fixed and random effects.
Solution: use a random slopes model instead of random intercepts only.
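A rough sketch of why this matters (our own illustration with hypothetical numbers, not the analysis behind Fig. 5): when slopes genuinely vary among groups, a pooled fit that ignores the grouping reports a standard error far smaller than the uncertainty implied by the spread of group-level slopes, which is the anti-conservatism of the random-intercepts-only model in miniature.

```python
import numpy as np

# Hypothetical groups (e.g. subspecies) whose true slopes vary
# around a mean of zero, i.e. no overall effect exists.
rng = np.random.default_rng(42)
n_groups, n_per_group = 10, 30
group_slopes = rng.normal(0.0, 0.5, n_groups)  # among-group slope SD = 0.5

x = rng.normal(0.0, 1.0, (n_groups, n_per_group))
y = group_slopes[:, None] * x + rng.normal(0.0, 1.0, (n_groups, n_per_group))
x_all, y_all = x.ravel(), y.ravel()

# Pooled no-intercept OLS (valid here: intercepts are zero and x is
# centred by construction) and its naive standard error, which
# treats all 300 observations as independent replicates.
slope = (x_all @ y_all) / (x_all @ x_all)
resid = y_all - slope * x_all
naive_se = np.sqrt(resid.var(ddof=1) / (x_all @ x_all))

# Uncertainty implied by the real among-group variation in slopes:
group_level_se = group_slopes.std(ddof=1) / np.sqrt(n_groups)
```

The naive pooled standard error is a fraction of the group-level one, so confidence intervals built from it are far too narrow; a random slopes model recovers something close to the group-level uncertainty.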