A critical issue in model-based inference for studying trait-based community assembly and a solution

Statistical testing of trait-environment association from data is a challenge as there is no common unit of observation: the trait is observed on species, the environment on sites and the mediating abundance on species-site combinations. A number of correlation-based methods, such as the community weighted trait means method (CWM), the fourth-corner correlation method and the multivariate method RLQ, have been proposed to estimate such trait-environment associations. In these methods, valid statistical testing proceeds by performing two separate resampling tests, one site-based and the other species-based and by assessing significance by the largest of the two p-values (the pmax test). Recently, regression-based methods using generalized linear models (GLM) have been proposed as a promising alternative with statistical inference via site-based resampling. We investigated the performance of this new approach along with approaches that mimicked the pmax test using GLM instead of fourth-corner. By simulation using models with additional random variation in the species response to the environment, the site-based resampling tests using GLM are shown to have severely inflated type I error, of up to 90%, when the nominal level is set as 5%. In addition, predictive modelling of such data using site-based cross-validation very often identified trait-environment interactions that had no predictive value. The problem that we identify is not an “omitted variable bias” problem as it occurs even when the additional random variation is independent of the observed trait and environment data. Instead, it is a problem of ignoring a random effect. In the same simulations, the GLM-based pmax test controlled the type I error in all models proposed so far in this context, but still gave slightly inflated error in more complex models that included both missing (but important) traits and missing (but important) environmental variables. For screening the importance of single trait-environment combinations, the fourth-corner test is shown to give almost the same results as the GLM-based tests in far less computing time.


Introduction
14 15 According to the habitat templet theory (Southwood 1977;Townsend & Hildrew 1994), 16 evolution selects for species characteristics (i.e., traits) appropriate to their environment. Such 17 traits influence community assembly (Ackerly & Cornwell 2007) and a major goal in 18 contemporary ecology has become to identify among a set of traits which ones interact with the 19 environment and which do not. Although most traits may influence the way species are 20 distributed in space, not all environmental features necessarily select for these traits, hence the 21 need to test for trait-environmental interactions and the need to select only the relevant trait-22 environment combinations in predictive models. 23 The first statistical methods to uncover and describe trait-environment associations were 24 correlation-based with as prime examples the Community Weighted trait Means (CWM) 25 approach (Lavorel et al. 2008) that correlates community weighted means with environmental 26 features, the fourth-corner correlation (Dray & Legendre 2008; Legendre et al. 1997) and the 27 multivariate method RLQ (Dolédec et al. 1996;Dray et al. 2014). See Kleyer et al. (2012) for a 28 review of methods. Notwithstanding the availability and wide use of these methods, it took some 29 time to understand the behaviour of these methods and to develop valid statistical tests to assess 30 trait-environment associations. Dray & Legrendre (2008) showed that randomization tests based 31 on either site or species permutations lead to increased type I error rates. The issue of increased 32 type I error rate was solved by ter Braak et al. (2012) and Peres-Neto et al. (2016) who showed 33 that regardless of the method used to assess trait-environment relationships, valid statistical 34 testing requires both a site-based and a species-based analysis, each resulting in a p-value. They 35 showed that correct rates are achieved by assessing significance by the largest of the two p-36 values (the p max permutation test). If it is desired to account for phylogenetic relationships among 37 the species and/or spatial and temporal correlations among the sites, the random permutations 38 should be replaced by restricted permutation or bootstrap ( Warton et al. 2015b). These methods model the abundance (or presence-absence) of 43 multiple species across sites (communities) as a function (linear or non-linear) of species traits 44 and environmental variables. If these are generalized linear (mixed) models, main effects of traits 45 and environmental variables and their interactions are specified on a link-scale, for count data 46 usually the log-scale giving a log-linear model. Interaction terms represent trait-environment 47 associations, each being a product of a single trait with a single environmental variable. The 48 associated (standardized) regression coefficients provide insights regarding the strength and 49 direction of trait-environment associations (positive and negative associations, e.g., large-bodied 50 species tend to occur more often in low-temperature environments than in high-temperature 51 ones, known as Bergmann's rule (Bergmann 1847).
52 The main effects in the GLM model represent the separate effects of environmental variables and 53 traits on species distributions. For example, traits might be used to predict the distribution of 54 species in an 'average' environment; by adding the environmental effects and their interaction 55 with the traits, the distribution of a given species in a specific environment may be predicted. 56 Also, by setting main effects to be polynomial terms of quantitative trait and environment 102 associated with species) as found for correlation-based approaches, and, if so, how can we then 103 explain the difference in outcome between the community-level and species-level analyses?
104 In this paper we investigate these questions by simulating data according to models with and 105 without trait-environment interaction and re-analysing a literature data set as an illustrative case 106 study. We apply four statistical tests which differ in the way resampling is performed 107 (resampling sites, species or both) and how the test statistic is calculated (assuming a negative-108 binomial or a Poisson distribution). We report on the observed type I error rates of these 109 procedures in the data sets simulated without trait-environment interactions and their power in 110 simulated data sets with this interaction. We also apply the predictive modelling approach of 111 Brown et al. (2014) to see how many times trait-environment terms were falsely judged 112 predictive when there was in fact no interaction between the observed traits and the environment.

Statistical tests on trait-environment interaction
139 The trait-environment interaction in equation (1) can be tested by fitting the model with and 140 without the interaction term, the latter by setting , calculating the likelihood ratio (LR) = 0 141 statistic of the two models for the data and assessing its significance via resampling (Warton et 142 al. 2015b). We consider four ways of carrying out the test. The first test uses the LR based on a 143 negative-binomial response distribution and is therefore rather slow. To investigate whether we 144 could improve on speed without sacrificing the type I error rate and loosing (too much) power, 145 we set the response distribution to Poisson in the other three tests, as any issue due to 146 overdispersion is accounted for by the resampling procedure. Moreover, theory tells that Poisson 147 likelihood is the only likelihood for non-negative data and models that gives consistent estimates 148 under misspecification of the distribution; the normal likelihood/least-squares has this feature for 149 unbounded data (Gourieroux et al. 1984a; Gourieroux et al. 1984b;Wooldridge 1999

Simulation models
179 Abundance data on m = 30 species in n = 30 sites were generated by two simulation models, a 180 log-linear simulation model and a one-dimensional Gaussian response model, detailed in 181 Appendix S1. In summary, two traits t and z (both m values) and two environmental variables e 182 and x (both n values) were drawn independently from the standard normal distribution; t and e 183 are taken as the observed trait and the observed environment, respectively, and z and x as 184 unobserved. As such, variables z and x are latent variables that are unrelated to (i.e. independent 185 of) the observed ones; alternatively, z and x represent simply noise, more specifically, variability 186 among species in their response to the environment, as in GLMM models 187 Pollock et al. 2012), and unobserved variability among sites, respectively. Either way, 188 unobserved variation is likely the case in most ecological analysis and needs to be considered 189 more often in simulation studies.
190 The statistical test procedure seeks to detect whether the observed trait t and the observed 191 environment e are associated (i.e., interact), without knowledge about the two latent variables z 192 and x, as these are unobserved/unmeasured. In the null models, abundance data are generated 193 without any interaction between the observed t and e, but with an interaction between, for 194 example, the unobserved trait z and the observed environmental variable e.
195 In the Gaussian response model, this is achieved by generating the expected abundance of 196 species j at site i as a Gaussian response function of e and z: 198 where is the maximum value and is the tolerance or niche breadth of species j that are both ℎ 199 constants or random, with independent of t. As z and are independent of the observed trait t, 200 this model by definition contains no association between t and e. This is the 'environmentally 201 structured but trait random' case, in short the 'trait random' case. Similarly, we can define a 202 'trait structured but environment random' case, in short the 'environment random' case, by 203 replacing by and by , and a 'both random' case by replacing by in Equation (3).
204 The corresponding log-linear simulation model has free main effects ( and ) and one 205 interaction, namely either or respectively, e.g. for the 'trait random' case 219 The log-linear model allows one case that is not available in the one-dimensional Gaussian 220 model. This case has two interaction terms: t interacts with x and e interacts with z, but t does 221 not interact with e,
(5) log ( ) = + + + 223 These are simple models. In the reported simulations, we also included structured and 224 unstructured noise in the expected abundances, respectively (Appendix S1); these are important 225 to make the data more realistic, but are not essential for our main results.
To mimic a situation with many traits and a single environmental variable, in the above 229 equations was taken as the sum of q independent traits, normalized to a theoretical variance of 1.
230 To study the power of different methods to assess parameter significance, once the type I error 231 had been controlled, the interaction between the observed trait and environment, i.e. the term 232 , was added to the log-linear model with non-zero regression coefficient , and was replaced 233 by in the Gaussian response model. Each model was simulated 1000 times. 271 In the Gaussian model simulations, the 'trait random' case is the only one that generates a 272 significantly inflated type I error rate (>0.40) for the anova.traitglm test, whereas the max r/c test 273 provides appropriate rates (Table 1). With six species and five or ten random traits, the type I 274 error rate of anova.traitglm was even more inflated (>0.75), and for twenty species with 10-30 275 traits the type I error rate was even greater than 0.90.  Figure 1, respectively). The highest type I error rate of the p max test is 0.08 (Figure 1). 285 Figure 2 shows the power of max r/c test. Recall that power here is estimated at a higher than 286 0.05 nominal level because the test does not fully control the type I error. The actual size of the 287 test is 0.08 (the maximum observed type I error rate for this test in Fig. 1). As expected, power 288 increases with the trait-environment coefficient , but decreases with increasing , which 289 represents either the interaction of e with an unmeasured trait or noise in the model for the 290 species-specific slopes in Equation Manuscript to be reviewed 293 3.2

Predictive modelling results
294 In the 'trait random' case, site-based cross-validation to select the penalty parameter of the lasso 295 often resulted in models with trait-environment terms that have, in fact, no predictive value; the 296 number of data sets with such terms was 851 (in 1000 simulated data sets) with 463 data sets 297 giving one interaction coefficient that was at least 0.1 in absolute value. In the control 298 experiment using a model without any interaction of a latent trait with the observed variables, we 299 found low numbers of false positive (<20 out of 1000). With m=10 (instead of 30), there were 300 855 false positives numbers with 710 and 403 data sets with a coefficient larger than 0.1 and 0.3, 301 respectively. Applying species-based cross-validation resulted for m=30 in two data sets with 302 nonzero trait-environment coefficients and for m=10 in ten such data sets.