A statistical simulation model for field testing of non-target organisms in environmental risk assessment of genetically modified plants

Genetic modification of plants may result in unintended effects causing potentially adverse effects on the environment. A comparative safety assessment is therefore required by authorities, such as the European Food Safety Authority, in which the genetically modified plant is compared with its conventional counterpart. Part of the environmental risk assessment is a comparative field experiment in which the effect on non-target organisms is compared. Statistical analysis of such trials come in two flavors: difference testing and equivalence testing. It is important to know the statistical properties of these, for example, the power to detect environmental change of a given magnitude, before the start of an experiment. Such prospective power analysis can best be studied by means of a statistical simulation model. This paper describes a general framework for simulating data typically encountered in environmental risk assessment of genetically modified plants. The simulation model, available as Supplementary Material, can be used to generate count data having different statistical distributions possibly with excess-zeros. In addition the model employs completely randomized or randomized block experiments, can be used to simulate single or multiple trials across environments, enables genotype by environment interaction by adding random variety effects, and finally includes repeated measures in time following a constant, linear or quadratic pattern in time possibly with some form of autocorrelation. The model also allows to add a set of reference varieties to the GM plants and its comparator to assess the natural variation which can then be used to set limits of concern for equivalence testing. The different count distributions are described in some detail and some examples of how to use the simulation model to study various aspects, including a prospective power analysis, are provided.


Introduction
An essential element in the environmental risk assessment (ERA) of genetically modified (GM) plants is a comparative field trial in which the effect on non-target organisms (NTO), such as aphids, beetles and bumble bees is compared. Such an experiment ensures that the GM plant and its comparator(s) are grown under the same management and environmental conditions, thus enabling a fair and objective comparison. A basic statistical approach for designing and analyzing such field experiments has been outlined in an EFSA guidance document (EFSA, 2010), in Perry et al. (2009) and Semenov et al. (2013). However, in practice the power of these experiments to detect environmental changes of a given magnitude is often unknown, partly because insufficient prior thought is given to what exact endpoints the experiments are supposed to test, and partly because the complex nature of ecological data complicates the power calculations. One of the aims in the EU-funded project "Assessing and Monitoring the Impacts of Genetically modified plants on Agro-ecosystems" (AMIGA) is to devise statistically well-based protocols for the design and analysis of field trials. To prepare this action an inventory was made of existing field studies in the literature, and a statistical simulation model was developed to mimic ecological data such as found in practice. The aim of the current paper is to describe this statistical simulation model and to show how this can be used in the design of field experiments.
Data collection in experimental fields with genetically modified crops has been conducted for many years and a large variability of experimental designs, sampling techniques, guilds of non-target arthropods and statistical methods have been used (e.g., Marvier et al., 2007). To summarize the different approaches presented in the scientific literature, a non-exhaustive inventory of 33 field studies with insect resistance transgenic plants was compiled among those firstly published, where the detection of possible effects of GM plants on natural enemies was the primary goal of the study (Table 1). The papers were published from 1992 until 2005. This time-span was chosen to include the very first published experiments of this kind, and also to incorporate the first available data from surveys in GM commercial fields. Different crops were included in the selection. The table presents some of the indicators relevant to the experimental design, collection methods and statistical analyses performed on the data. None of the papers provided a prospective power analysis for the experiments described.
Field trials are thus diverse, but an example shows some typical elements. Al-Deeb and Wilde (2003) describe experiments to test the effects of the Cry3Bb1 toxin in Bt corn on aboveground non-target arthropods. The experiments were performed on eight locations in one year and three locations in a second year, and they involved three GM varieties and two isolines in combination with up to nine different seed and spraying treatments. Randomized complete block designs were used with 2-4 blocks and 8-40 plots. Visual inspection provided count data on 15-20 plants per plot. Average counts per plant for five NTOs varied between 0 and 70. Pitfall trap count data observed at 3-7 time points were reported as average numbers per pitfall trap between 0 and 616 for eight NTOs. Based on a statistical analysis using analysis of variance the authors concluded that no significant differences in numbers were detected between Bt corn and its non-Bt isoline. However, there is no mention of the effect sizes that these experiments would have been able to detect with a reasonable statistical power. In fact, the data provided are insufficient to draw any conclusion on the statistical power of the performed experiments, and this is also the case for many other reported studies. Indeed, in a few cases the importance of such an analysis had been singled out (Andow, 2003) and attempts to design field experiments on such bases were done in rare cases (e.g., Squire et al., 2003;Duan et al., 2006). To improve this situation the EFSA guidance asks for prospective power analyses to be performed. This issue is further developed in the present paper.
Typical data in environmental risk assessment of GM plants are counts or presence/absence data of NTOs. The basic distribution for counts is the Poisson distribution, while presence/absence data can usually be modeled by a binomial distribution. Clumping of individuals might give rise to an overdispersed distribution such as the negative binomial for counts and the beta-binomial distribution for presence/absence data. Also the number of zero observations can be larger than predicted by the distribution and this gives rise to so-called excess-zero distributions. In many experiments, NTOs are sampled at different points in time, for example weekly, for all experimental units. The data are thus repeated measurements probably with some form of autocorrelation across time within experimental units. Depending on the species various patterns across time are possible. Moreover experiments are frequently repeated on different locations and in different years.
The statistical analysis of ERA field trials comes in two flavors: difference testing and equivalence testing (van der Voet et al., 2011). The aim of the difference test is to reject the null hypothesis of no difference between the GM plant and its comparator. A significant difference test is then a "proof of difference", but this does not state that the difference is biologically relevant and constitutes a true hazard to the environment. Poorly designed experiments with low levels of replication may have low statistical power of finding a true difference. So the absence of a significant difference is not a proof that there is no difference, or "absence of evidence is not evidence of absence" (Altman and Bland, 1995). An equivalence test on the other hand employs a null hypothesis of non-equivalence, that is, that the difference between the GM plant and its comparator is larger than some pre-described equivalence limit, also called limit of concern (LOC). Rejection of the non-equivalence hypothesis implies that the difference is smaller than the LOC and this can be regarded as a "proof of safety". The advantage of equivalence testing is therefore that the onus is placed back on to those who wish to demonstrate the safety of GMOs to do high quality, well-replicated experiments with sufficient statistical power (Perry et al., 2009). Note that both the difference and equivalence test can be implemented by constructing a single confidence interval for the difference between the GM plant and its comparator. This employs the two onesided tests (TOST) approach of Schuirmann (1987) for equivalence testing.
It is important to know the statistical properties of difference and equivalence tests, for example the power and  robustness of a test and whether the test has the assumed significance level. Such properties are well-known for single experiments using tests based on the normal distribution, such as t-tests. For non-normal distributions, small sample properties of difference and equivalence tests are not straightforward. A simulation approach for sample size calculations for a difference test is employed by many authors, for example, Shieh (2001) and Hrdli ckov a (2006) for the Poisson distribution, Shieh (2001) and Demidenko (2008) for the binomial distribution, Aban et al. (2009) and Friede and Schmidli (2010) for the negative binomial distribution. A general practical approach to computing power for non-normal distributions is given by Lyles et al. (2007). However field testing of environmental effects of GM plants on NTOs is much more complicated as it may not only involve non-normal distributions, potentially with excess-zeros, but also a set of reference varieties in addition to the GM plant and its comparator, randomized blocks within an experiment, multiple experiments across different sites and/or years with possibly genotype by environment interaction, and finally repeated measures in time exhibiting some pattern in time possibly with autocorrelation. The object of this paper is to formalize all these elements in a single statistical simulation model which provides a framework for studying various statistical approaches for data analysis of such experiments. The simulation model was implemented in a userfriendly C# program, using the R package (R Core Team, 2012) for simulating from various distributions. The software is available as Supplementary Material to this paper. This paper first summarizes potentially useful statistical distributions for ecological data. Then the other elements of the statistical simulation model are described, namely block effects, additional varieties, repeated measurements and multiple trials. Some applications of the simulation model for power analysis are described, and possibilities for use and future research needs are discussed.

Statistical Distributions for Counts
The basic distribution for counts is the Poisson distribution. The Poisson distribution arises when events occur independently of each other but at a fixed rate in time or space. The number of events in a fixed time-or spaceinterval then follows a Poisson distribution. The theoretical variance of the Poisson distribution equals the mean l. Examples of three Poisson distributions are given in Figure 1. The Poisson distribution assumes a fixed rate of events in time or space. However frequently this rate might vary in different time-or space-intervals. A common way to model this is to assume inter-subject variability, also called mixing. It is then assumed that a count X follows a Poisson distribution with mean Z, where Z itself is a random variable with mean l and variance say s 2 . The marginal mean of the distribution of X is then given by l and the variance equals l + s 2 . Consequently the resulting distribution has a variance which is larger than the mean and this is termed over-dispersion. Three common ways to specify the mixing distribution of Z result in the overdispersed Poisson distribution, the negative binomial distribution and the Poisson-Lognormal distribution. These are described below.
The overdispersed Poisson distribution arises when Z follows a gamma distribution with variance s 2 = (/À1) l which is proportional to the mean l. The resulting distribution is a special form of the negative binomial distribution, see McCullagh and Nelder (1989), with variance /l which is also proportional to the mean. Figure 1 shows some examples of the overdispersed Poisson distribution. The so-called quasi likelihood approach is commonly used to fit this distribution. This employs the Poisson likelihood, estimates the dispersion parameter / by means of Pearson Chi-squared statistic and adjusts standard errors of estimates accordingly (McCullagh and Nelder, 1989).
The negative binomial distribution arises when the mixing distribution Z follows a gamma distribution with mean l and variance xl 2 . The marginal mean of X is then again l and the variance equals l + xl 2 . Some examples of the negative binomial distribution are given in Figure 1. This shows that the negative binomial distribution, with a large dispersion parameter x, has a large zero probability and a rather flat tail.
Specification of a lognormal distribution for Z, with say mean k and variance r 2 , results in the so-called Poisson-Lognormal distribution. This is equivalent to the introduction of a normally distributed random effect on the scale of the linear predictor in Poisson regression, see Breslow (1984). The mean of the marginal distribution is given by l ¼ expðk þ 1 2 r 2 Þ and the variance by l + [exp (r 2 )À1)]l 2 . This is thus the same variance function as the negative binomial distribution with x = [exp(r 2 )À1]. Despite this the distributions can be quite different for large l and x as is shown in Figure 1.
A different approach was introduced by Taylor (1961) who proposed the power relationship V = al b between the variance V and the mean l for field population counts. This pioneering paper was followed by a series of papers, notably Taylor et al. (1978Taylor et al. ( , 1980, in which it was shown that this relationship fitted well for many species, with varying values of a and b depending on the species at hand. The relationship was subsequently termed Taylor's power law by some authors. Perry et al. (2003) and Clark et al. (2006)  median values of the power b, when considering groups of indicator species, all fall between 1.5 and 2.0, averaging 1.7 overall. There is no statistical distribution associated with Taylor's power law, as it only specifies a relationship between the variance and the mean. Perry et al. (2003) used the negative binomial distribution to simulate according to Taylor's power law by solving x from l + xl 2 = al b for given values of l, a and b. Using the negative binomial is however arbitrary, as, for example, the Poisson-Lognormal has the same variance to mean relationship, but has a different distribution.

Statistical Distributions for Presence/ Absence Data
In field experiments the presence or absence of an organism might be recorded rather than counting the organism. The response X might then be the number of plants on which a specific organism is present for each experimental unit. Assuming independence between the n plants and a fixed presence probability p the response follows a binomial distribution. The mean of the binomial distribution is given by np and the variance equals np(1 À p). Examples of the binomial distribution are given in Figure 2.
Overdispersion arises by assuming that the presence probability itself, now called Z, follows some statistical distribution with mean p and some variance s 2 . It follows that the marginal mean of X itself equals np and the variance equals np(1 À p) + n(n À 1)s 2 which is larger than the variance of the binomial distribution. A popular choice for Z is the beta distribution which results in the so-called beta-binomial distribution. An alternative is to assume that the logit transform of Z follows a normal distribution. Details of both distributions are given below.
The beta-binomial distribution arises when Z follows a Beta distribution which is defined on the interval (0,1). A convenient re-parameterization results in a mean np and variance np(1 À p)[1 + (n À 1)u]. When the number of binomial trials is equal across experimental units, the term between squared brackets is constant and the variance of the beta-binomial distribution is then proportional to the binomial variance. In this case data can be easily analyzed by the quasi likelihood approach, similar to the analyses for the overdispersed Poisson distribution. Some examples of the beta-binomial distribution are given in Figure 2. This shows that the range of possible outcomes is extended as compared to the binomial distribution. However for very large values of x the distribution becomes bath-tub like with large probabilities for outcomes 0 and n and small probabilities for intermediate values.
An alternative is to assume that Z follows a logit-normal distribution. This is equivalent to the introduction of a normally distributed random effect on the scale of the linear predictor in logistic regression. For obvious reasons this distribution can be termed binomial-logitnormal. Unfortunately the mean and variance of the logit-normal distribution cannot be written in analytical form, and this is thus also the case for the binomial-logitnormal distribution itself. Probabilities can be obtained by integrating out the random effect by Gauss-Hermite quadrature. Some examples of this distribution are given in Figure 2; the parameters which are used in this figure are such that the mean and variance of the distribution are given by np and xnp(1 À p). This shows that for p = 0.5 there is hardly a difference with the beta-binomial distribution, while for smaller values of p the binomial-logitnormal distribution has somewhat smaller zero probabilities.

Excess-Zeros Distributions
Although the overdispersed count distributions have a larger zero probability than the corresponding Poisson or binomial distribution, in practice the number of zero observations can still be larger than predicted by the count distribution. This is termed excess-zeros or zeroinflation. Examples of situations with excess-zeros are given by Cunningham and Lindenmayer (2005), Sileshi (2008) and Lewin et al. (2010). Failure to account for zero-inflation in a statistical analysis may results in biased estimation of environmental effects of GM plants. A com-mon model for zero-inflation assumes that a proportion d of the experimental units have a structural zero and the remaining proportion (1 À d) of units follows one of the count distributions given above. The zero-inflated distribution for the resulting count Y is then given by is the distribution of the counts. Note that the probability of observing a zero is given by the probability d of obtaining a structural zero plus the probability of obtaining a zero by chance. Having a lot of zeros in itself does not necessarily mean that a zeroinflated model is needed. An examples of this is given by the negative binomial distribution in Figure 1. The mean of a zero-inflated Poisson distribution equals l(1 À d) and its variance equals l(1 À d)(1 + dl). Regression models based on the zero-inflated Poisson distributions were introduced by Lambert (1992) who considered simultaneous modeling of l and d which are related to possibly different sets of covariates. Greene (1994) brought regression modeling to the zero-inflated negative binomial distribution. Hall (2000) and Vieira et al. (2000) seem to be the first papers which employ a zero-inflated binomial model. Finally, Cheung (2006) uses a zero-inflated beta-binomial model to analyze cognitive function test scores of Indonesian children.

Statistical Simulation Model
Having defined possible probability distributions for counts and presence/absence data, the other elements of field testing of environmental effects of GM plants on NTOs can be introduced. These elements are summarized in Table 2.

Block effects and the transformed scale
An experiment will minimally consist of a GM plant and its comparator. When these are compared in a single trial without blocking, with only a single measurement per experimental unit and no excess-zeros, simulation of such an experiment only requires specification of the mean l of the count distribution for both the GM plant and the comparator. In addition, when overdispersion is required, a common dispersion parameter must be specified. Adding blocking to this experiment requires a random blocking effect. It is natural and common to introduce blocking effects for counts on the log scale, that is, logðlÞ ¼ ðfixed variety effectÞ þ ðrandom blocking effectÞ as this ensures that the mean l of the count is always positive. Note that this also requires that the variety effect of the GM plant and the comparator is specified on the log scale. Likewise, the logit transformation can be used to specify the probability of success p for the presence/absence data and/or the excess-zeros probability d, for example, This ensures that probabilities are always in the interval (0,1). So in the simulation model, all effects are introduced on the natural log scale for counts and on the logit scale for probabilities.
When an experiment with blocking is required, it is assumed that block effects follow a normal distribution with some variance. For each new block, a block effect is simulated according to this distribution and added to the variety effect on the transformed scale. Note that when an excess-zero distribution is used, a block variance must be specified for both the count or presence/absence distribution as well as for the excess-zero probability.

Additional varieties and reference varieties
In addition to the GM plant and the comparator, other varieties can be introduced in the experiment. These might be other comparators or other GM plants, or alternatively, the GM plant and/or comparator itself which receive different agronomical treatments with for instance herbicides. Although the latter are not varieties but rather treatments, for simplicity, these are also termed additional varieties in the simulation model. For each of the additional varieties, a variety effect on the transformed scale must be specified.
A special case is an experiment in which the GM plant and its comparator are to be compared with a group of reference varieties which are assumed to have a history of safe use ( Van der Voet et al., 2011). In this case, the individual reference varieties themselves are not of interest, but they are rather used to derive baselines or equivalence limits. The reference varieties in the experiment might thus be considered as representing a population of reference varieties. It is then natural to assume that the variety effect of each reference variety is drawn from a statistical distribution. For convenience, a normal distribution is used for this. The difference between additional and reference varieties is that for each additional variety, a fixed variety effect must be specified, whereas for reference varieties, only a common variety effect and an associated variance must be specified.
At this stage, it is instructive to go through the different steps in the statistical simulation model for a hypothetical small experiment. Suppose a GM plant, its comparator and three reference varieties are to be compared in a randomized block experiment with 2 blocks. A zero-inflation Poisson distribution is used with the following parameters for the varieties: First the count effect on the log scale of the three reference varieties is drawn from a normal (1,1) distribution; suppose that this results in random draws 0.8, 0.9 and 1.2. Secondly, the excess-zero probability effect on the logit scale for the reference varieties is drawn from a normal (-0.8, 0.5) distribution; suppose that this results in -1.2, -0.9 and -0.6. Furthermore, assume that the between block variance for the counts equals 0.1 with random draws À0.4 and 0.1 for the two blocks and that the between block variance for the excess-zero probabilities equals 0.01 with random draws À0.1 and 0.2. The mean count l and excess-zero probability d for every plot in the experiment are then given by These values are then used to generate a count for each plot. In a simulation study, typically many datasets are simulated with the same settings. For each dataset, new reference variety effects and new blocking effects are simulated, so only the bold values remain the same for each simulated dataset.

Repeated measurements
Non-target organisms at the same experimental unit are frequently sampled at different points in time, see for example Head et al. (2005). The simulation model assumes that time points are equidistant, that is, 1, 2, . . ., T. There are many possible patterns in time. The simulation model accommodates constant, linear or quadratic time patterns, all on the transformed scale, and this can be set separately for the mean of the counts and for the excess-zero probability. Moreover, the repeated observations on the same experimental unit can be independent or can be correlated. The mean l t of each variety at time point t, assuming no block effects, is given by log (l t ) = f p (t) + v t , where f p (t) is a polynomial up to order p = 2. The extra random effect v t specifies the correlation between repeated measures. Absence of a time effect is simply given by f 0 (t) = b 0 , a linear time effect by f 1 (t) = b 0 + b 1 t and a quadratic time effect by f 2 (t) = b 0 +b 1 t+b 2 t 2 . An alternative parameterization for the second-order polynomial with more meaningful parameters is given by f 2 (t) = b max À (t Àb opt ) 2 /(2b tol ), where the minimum or maximum b max is attained for the optimal time point b opt and the parameter b tol represents the width of the quadratic curve, also called the tolerance. This latter parameterization is used in the simulation tool. For a positive tolerance, the parabola has a maximum, while for a negative tolerance, it has a minimum. The vector of random effects v = (v 1 , . . ., v T ) is assumed to follow a multivariate normal distribution, that is, v $ MNð0; r 2 v VÞ where V is a T x T symmetric correlation matrix. The simulation tool implements three options for the random effects: 1 No extra variability as given by r 2 v ¼ 0. 2 Equal correlation across time by setting V kk = 1 and V kl = q for k 6 ¼ l 3 Autoregressive correlation across time by setting V kl = q |kÀl| Note that the second and third option involves an overdispersion mechanism. Consequently, an equal correlation across time with q = 0 combined with the Poisson distribution is equivalent with the Poisson-lognormal distribution with no extra variability across time. A combination with, for example, the negative binomial distribution however involves two levels of overdispersion. The simulation tool requires specification of the b parameters for the GM plant and its comparator and also for the additional varieties. When excess-zeros are desired, a different set of b parameters must be specified for the excess-zero probability model. For the reference varieties, each b parameter is drawn from a normal distribution with specified mean and variance.

Multiple trials
Multiple field trials across environments fall into two main classes. One in which there is no further structure across trials, for example, when trials are conducted at different sites, and secondly, when the trials follow a site 9 year structure. In the first case, there might be random trial effects such that trials will vary in their level of response without affecting differences between varieties. This is in addition to random block effects within trials. In the second case, experiments, carried out at different sites, are replicated for a limited number of years. Then, random site and random year effects are conceivable as well as a random site by year interaction effect. Note that these random effects are added to the other effects on the transformed (log or logit) scale.
In addition to additive multiple trial effects on the transformed scale, variety effects might be different from trial to trial, from site by site or from year to year. This can be termed genotype by environment interaction. This interaction can be accommodated by additional random effects which operate on variety effects. Instead of a fixed variety effect, for e.g. the GM plant, in each separate trial, the variety effect is drawn from a normal distribution with some mean and variance. For the reference varieties, there are two stages of simulation. In the first stage, a reference mean, say M, for the trial is drawn from a normal distribution with specified mean and variance. In the second stage, the effects of the reference varieties in that specific trial are simulated from a normal distribution with mean M and some other specified variance. Similarly for site 9 year trials, three variance components can be distinguished for every variety effect. In case of repeated measurements with say a quadratic time effect, all the time effect b parameters have their associated variance components and many parameters need to be specified. Moreover, the same statistical model can be specified for the excess-zero probability.

Examples of Simple Simulations
The simulation model was implemented in a software tool, available as Supplementary Material, and was used to perform a series of simulations to demonstrate various aspects which can be studied by means of the tool. To this end, single, completely randomized trials were simulated to assess properties of statistical difference and equivalence testing. In addition to the GM plant and its comparator, one additional variety was included in every simulation and the simulated response was a count. The mean of the comparator and the additional variety was assumed to be equal to say l, and the mean of the GMO is denoted by hl such that h is the multiplicative difference between the GM plant and the comparator. The following levels of replication N were employed: N = 4, 6, 8, 10, 15, 20, 30 and 40. A two-sided test with a significance level of a = 0.05 was used throughout. In each example, 1000 datasets were simulated for every parameter combination.

Power of difference test for the negative binomial distribution
The power of a likelihood ratio test for the difference between a GM plant and its comparator was studied. Data were simulated according to the negative binomial distribution. All combinations of the following values were used: l 1, 2, 5, 10, 20, 40 h 1, 1.2, 1.4, 1.6, 1.8, 2.0 x 0.25, 0.5, 1 The negative binomial distribution was fitted to each dataset, first under the restriction that the mean of the GMO equals the mean of the comparator and secondly, without this restriction. A likelihood ratio test statistic is then given by twice the difference between the log likelihoods of the two models. The large sample distribution of this test statistic is v 2 1 , and this distribution was used to calculate P values. The (simulated) power of the difference test is then given by the fraction of the 1000 simulated datasets for which the null hypothesis of no difference is rejected. Examples of resulting power curves are given in Figure 3. The number of replications required to detect a multiplicative difference h = 2 between the mean of the GM plant and the comparator with probability 0.80 is given in Table 3 for the various values of l and x. These values were interpolated from the values of N which are used in the simulation. As expected, the power is larger when there is less overdispersion (smaller values of x) and when the mean of the distribution is large.

Sensitivity analysis for the negative binomial distribution
The true underlying distribution of field count data is generally not known. Furthermore, especially for small samples, it is difficult to discriminate between the various statistical distributions. To assess the sensitivity to the assumed distributional form, each simulated negative binomial dataset was also analyzed by two alternative models. The first alternative employs the quasi-Poisson approach in which the likelihood ratio test was scaled by the mean deviance under the full model, whenever this mean deviance is larger than 1. The second alternative first log transforms the count data, after the addition of 0.5, whenever there is at least one zero in the data, followed by an analysis of variance. Examples of power curves for replication N = 40 for the three models are given in Figure 4. This seems to suggest that the power of the negative binomial and the quasi-Poisson analysis are very similar and that the power of the log-transform approach is somewhat smaller especially for the combination of a large mean l and a large dispersion parameter x.

Properties of equivalence test for the Poisson distribution
Properties of the TOST approach to equivalence testing were assessed for count data which were simulated according to the Poisson distribution. The null hypothesis of non-equivalence is rejected in favor of equivalence when the confidence interval completely lies in the interval determined by fixed lower and upper equivalence limits. The same simulation setting as in the first simulation was used However, as the Poisson distribution was employed to simulate data there is no over dispersion. Hypothetical equivalence limits of ½ and 2 were employed to perform equivalence testing. A 95% likelihood ratio confidence interval for the ratio of the GMO mean and the comparator mean was calculated for each simulated dataset. The number of times this interval lies within the equivalence interval (½, 2) can then be counted. As an example, the confidence interval for 40 simulated datasets is given in Figure 5 with l = 5 for both the GMO and the comparator, so h = 1, and for various values of the number of replications N. In this case, the GMO and comparator have equal means and are thus theoretically equivalent. However, for small numbers of replications, the confidence intervals frequently cross the equivalence limits implying that the null hypothesis of non-equivalence is not always rejected.
The number of replications required to detects a quotient of d between the GM plant and its comparator, as well as the number of replications required to reject the null hypothesis of non-equivalence with equivalence limits of ½ and 2 is given in Table 4. This clearly shows the asymmetry in the requirements for a difference test and an equivalence test. As the multiplicative difference h increases, the number of replicates for a difference test decreases, while those for an equivalence test increases.

The effect of excess-zeros
To evaluate the effect of excess-zeros on the power of the ordinary likelihood ratio test, a separate simulation with the excess-zero negative binomial distribution was executed. Again, a single trial without blocking with a single measurement was assumed. Furthermore, a multiplicative difference of h = 2 was used between the GM plant and the comparator. The excess-zero probability was set to d = 0, 0.1, 0.2 and 0.5. The mean (1Àd)l of the excesszero distribution was set to 1, 5 and 40 ensuring that the means of the distributions are identical for different values of d. The data were analyzed with the negative binomial distribution as if there were no excess-zeros. The power for different levels of replication is given in Figure 6. This indicates that for small means and small excess-zero probabilities, the power is not much affected. However, for larger means, there can be a considerable decline of the power. For an excess probability of d = 0.5 and larger means, the resulting distribution has a spike at zero in combination with larger values with not very Figure 4. Power of a likelihood ratio difference test with a = 0.05 for negative binomial data with dispersion parameter x, a mean l for the comparator, and a mean hl for the GM plant for replication level N = 40 when analyzed employing a negative binomial model (black), a quasi-Poisson model (red), and a log transformation (green). Table 3. Number of replications needed to obtain a significant difference test with probability 80% when the quotient of the mean of the GMO and the comparator equals Θ = 2 for data which have a negative binomial distribution with mean l for the comparator, mean Θl for the GM plant, and dispersion parameter x.
x l= 1 l = 2 l = 5 l = 10 l = 20 l = 40 much in between. In such a situation, the estimate of the dispersion parameter becomes very large so as to "catch" both the zeros and the larger observations. Consequently, the distinction between the means of the comparator and the GMO disappears resulting in very low power values. In such a case, the data should be analyzed by means of an excess-zero distribution.

Repeated measurements
In field studies, observations on the same experimental units are frequently carried out on different points in time. The question then arises whether it is more fruitful to increase the number of observations in time or to increase the number of experimental units to achieve a certain power. Consider the situation where the mean count of a species follows a second-order polynomial on the log scale with a maximum l at t = 0 and a value of 0.6l at t = AE 7 days. This defines a second-order polynomial on the log scale with parameters b opt = 0, b tol = 47.96 and b max = log(l). According to the polynomial, the mean count at t = AE 14 days equals 0.13l. Data were simulated with the Poisson-lognormal model with a dispersion parameter x = 0.25. This was carried out for the situation in which the comparator has a maximum mean count of l and the GM plant a maximum of 2l, so h = 2. Three situations are simulated: a) a single observation for each experimental unit at t = 0, b) repeated observations at t = À14, À7, 0, 7 and 14 assuming that these observations are independent and c) repeated observations at the same five points in time but now assuming an autoregressive correlation across time for the five random effects on the log scale with correlation q = 0.8. Independent observations in time are included as a limiting case because in practice, it is unlikely that repeated observations on the same experimental unit are independent. In cases b) and c), observations on the five time points are summed for every experimental unit, and in all cases, the negative binomial distribution, which has the same variance function as the Poisson-lognormal, was used for analyzing the data. The power curve of the difference test for various replication levels is given in Figure 7. For l = 1, there is a large benefit of repeated measurements for both the independent and the correlated counts in time. This benefit becomes smaller for larger l for the correlated counts, possibly to a point that it is more efficient to increase the replication level rather than sampling at different points in time. This will however also depend on the size of the correlation. Only a minority of the papers listed in Table 1 explicitly considered this issue in analyzing data from field experiments.  The confidence interval for the difference parameter on the log scale was used to perform equivalence testing. The number of replications required to reject the null hypothesis of non-equivalence with equivalence limits of 1/3 and 3 is given in Table 5. This also shows that, for the correlation case, the advantage of multiple sampling times becomes smaller as the mean increases.

Discussion
This paper describes a general framework for simulating data typically encountered in environmental risk assessment of genetically modified plants. It focuses on field testing of GM plants to assess and compare their potential adverse effects on non-target organisms. Based on such field trials, the assessment includes the use of statistical difference and equivalence testing. It is important to know the statistical properties of such tests, for example, the power and robustness of a test and whether the test has the correct significance level. The EFSA guideline (EFSA, 2010) for environmental risk assessment of GM plants for instance requires that each field trial should have sufficient replication to be able to yield a standalone analysis if required. Such and other statistical issues related to the assessment can best be researched by means of a statistical simulation model.
Limits of concern are required for equivalence testing, but these may be difficult to specify for NTO experiments. They can of course be set to a fixed value by authorities, such as a 20% increase or decrease (Hothorn and Oberdoerfer, 2006), but this remains largely arbitrary. Alternatively, a set of reference varieties can be included in the same comparative experiment to allow comparison with the GM plant grown under the same conditions (van der Voet et al., 2011). The natural variation, in, for example, counts of NTOs, between the reference varieties can then be used to set limits of concern.
The simulation model described in this paper can be used to generate data for various endpoints having Figure 6. Power of a likelihood ratio difference test with a = 0.05 for negative binomial data with overdispersion parameter x = 0.25 and additional excess-zeros with probability d = 0 (black), 0.1 (red), 0.2 (blue), and 0.5 (green). The comparator has mean l(1 À d), and the GM plant has a mean of 2l(1 À d).   (Taylor, 1961) suggests that overdispersion, relative to the Poisson distribution, is likely to be normal in environmental studies. Statistical distributions for this phenomena are the overdispersed Poisson, negative binomial and Poisson-lognormal distribution for counts; these have much wider tails and also a larger zero probability. Likewise, overdispersed distributions for binomial data are the beta-binomial and the binomial-logitnormal distribution, which seem to be quite similar for a wide range of parameter values. In addition to overdispersion, excess-zeros might frequently be encountered especially for rather rare species (Cunningham & Lindenmayer, 2005). Failure to account for this in the analysis may result in biased estimation of ecological effects (Sileshi, 2008). The simulation model therefore also addresses zero-inflation for every distribution just described. Field trials for environmental risk assessment may be too small to discriminate between the various statistical distributions. For example, a large number of zeros can be due to structural zeros and thus zero-inflation or can be due to heavy clumping of individuals which gives rise to a negative binomial distribution with a large dispersion parameter. The simulation model can be used to assess the robustness of statistical models to digressions from the model. For instance, an analysis according to Taylor's power law, or possible an analysis of log counts using the normal distribution, might be a good comprise for the various overdispersed count distributions. This can also be researched by means of the simulation model.
In many experiments, non-target organisms are sampled at different points in time. The simulation model accommodates this by the option to specify a constant, linear or quadratic pattern in time on the log-or logittransformed scale. Moreover, repeated measures at the same plot might exhibit autocorrelation which can be modeled by a random effect with equal correlation across time or by an autoregressive process. This gives the opportunity to study various endpoints such as the summed counts over a specified period of time, the time of first occurrence of a species, the time at which maximal abundance of a species occurs, or even the repeated measures themselves.
Genotype by environment interaction is an important issue in the admission of GM plants. In the European context, the EFSA guideline (EFSA, 2010) also focuses on the receiving environment of the GM plant, assuming that different bio-geographical zones, in terms of meteorological, ecological and agricultural conditions, may comprise different risks of growing GM plants. To develop appropriate statistical methods to handle genotype by environment interaction in studies over multiple bio-geographic regions and under varying agronomical conditions, a simulation tool is indispensable. The interaction is accommodated by the simulation model by additional random effects which operate on variety effects. Instead of a fixed variety effect, for e.g. the GM plant, in each separate trial (or environment), the variety effect, on the transformed scale, is drawn from a normal distribution with some mean and variance. This ensures that the difference between the GM plant and its comparator does have a common basis across trials, but might be varying from trial to trial. As an alternative, the simulation model also caters for a site by year interaction in which the "environments" are now structured by sites and years.
All these elements are implemented into a single computer program which can handle non-normal distributions for counts and absence/presence data possibly with excess-zeros, accommodates reference varieties in addition to the GM plant and its comparator, employs completely randomized or randomized block experiments, enables genotype by environment interaction by adding random variety effects, and finally repeated measures in time following a constant, linear, or quadratic pattern in time possibly with some form of autocorrelation. The computer program is available as Supplementary Material to this paper.
Although the tool is quite comprehensive, there are certain restrictions. It is not possible to define different dispersion parameters for different varieties or for different trials. This might be useful when one would like to keep the coefficient of variation constant across varieties rather than the dispersion parameter. Simulation by means of Taylor's power law is not implemented as there is no distribution associated with this law. In the current simulation model, it is not possible to have a linear effect in time for one variety and a quadratic time effect for another variety, but if needed, pseudo-linear behavior can be obtained by specific choices of the quadratic parameters. Also the time dependence is assumed to be equal across varieties. Split-plot experiments, with for example agronomical treatments on the main plot level and varieties on the subplot level, are not supported.
The tool was used to perform a series of simple preliminary simulations to demonstrate various aspects which can be studied by means of the tool. It is shown that the tool can be used for a prospective power analysis for both difference testing and equivalence testing. A simple sensitivity analysis seems to imply that, for simple experiments, the assumed distributional form may not be very critical. The effect of a small proportion of excess-zeros might be small when the mean count itself is also small. Finally, it appears that the benefit of repeated measures, assuming autoregressive correlation in time, becomes smaller when the mean increases.
Based on the simulation model as described in this paper, further work is needed to develop a protocol for prospective power analysis when designing field experiments for environmental non-target effects of GM plants. Requiring the use of such a protocol could avoid literature reports where the conclusion of non-significant differences between GM and non-GM plants is not accompanied by a report on the statistical power of the field experiment.