Enforced symmetry: the necessity of symmetric waxing and waning

A fundamental question in ecology is how the success of a taxon changes through time and what drives this change. This question is commonly approached using trajectories averaged over a group of taxa. Using results from probability theory, we show analytically and using examples that averaged trajectories will be more symmetric as the number of averaged trajectories increases, even if none of the original trajectories they were derived from is symmetric. This effect is not only based on averaging, but also on the introduction of noise and the incorporation of a priori known origination and extinction times. This implies that averaged trajectories are not suitable for deriving information about the processes driving the success of taxa. In particular, symmetric waxing and waning, which is commonly observed and interpreted to be linked to a number of different paleobiological processes, does not allow drawing any conclusions about the nature of the underlying process.

A function g is mirror symmetric (symmetric for short) with respect to the y-axis if g(x) = g(−x). A function g is mirror symmetric with respect to the axis defined by x = a if the function x ↦ g(x + a) is symmetric with respect to the y-axis. For simplicity we only examine the first case; the second case follows mutatis mutandis.
For any function f, its symmetric part is defined as

    f_sym(x) := min( f(x), f(−x) ).   (1)

It is maximal in the sense that it is the biggest symmetric function that is smaller than f. The asymmetric part of f is given by

    f_asy := f − f_sym.   (2)

Let

    asy : f ↦ f_asy   (3)

be the operator that assigns every function f its asymmetric part, so asy(f) = f_asy. It has the following properties:

    asy(λ f) = λ asy(f) for λ ≥ 0   (positive homogeneity)   (4)
    asy(f + g) = asy(f_asy + g_asy) ≤ asy(f) + asy(g)   (sublinearity).   (5)
Combining this operator with any monotone seminorm ‖ · ‖ on the vector space of functions yields the functional

    ‖f‖_asy := ‖ asy(f) ‖

measuring the degree of asymmetry of the function f. We will refer to ‖f‖_asy as quantified asymmetry, short QuAsy. It is slightly weaker than a seminorm in the sense that it is only positively homogeneous and not absolutely homogeneous.

As an example, taking vectors in R^n as functions and using the 1-norm yields the QuAsy measuring the asymmetry of binned data. Its continuous equivalent can be obtained by using the L^1 norm instead of the discrete 1-norm.

In the following, the example from the section "The Effect of Noise" is formalized and the stated result is derived formally.
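As a minimal illustration, the following R sketch computes the QuAsy of a binned trajectory using the 1-norm. The reflection g(x) = g(−x) corresponds to reversing the vector of bin values; the example vector is an arbitrary choice and not taken from any dataset discussed here.

    # QuAsy of a binned trajectory, using the 1-norm.
    # A vector f of bin values is treated as a function; reflection about the
    # centre of the bins corresponds to reversing the vector.
    quasy <- function(f) {
      f_sym <- pmin(f, rev(f))  # symmetric part: biggest symmetric function below f
      f_asy <- f - f_sym        # asymmetric part
      sum(abs(f_asy))           # 1-norm of the asymmetric part
    }

    # Example: a right-skewed trajectory (arbitrary illustrative values)
    f <- c(0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.03, 0.02, 0.00, 0.00)
    quasy(f)                # positive: f is not symmetric
    quasy(rev(f))           # same value: reversing f does not change its QuAsy
    quasy(pmin(f, rev(f)))  # zero: the symmetric part of f is symmetric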

By the assumptions made in the example in the main text, the real-valued random variables Y_n describing the result of a statistical analysis after n samples have been evaluated converge to some deterministic value a, and the random variables Z_n describing the contribution of the noise converge to 0. Let P_n be the distributions of the Y_n and Q_n the distributions of the Z_n. Adding random variables is equivalent to convolving their distributions, so the perturbed result Ỹ_n = Y_n + Z_n has the distribution P_n ∗ Q_n, where ∗ denotes the convolution (Klenke, 2008, p. 277).

We will show that P_n ∗ Q_n(A) ≳ P_n(A) as n → ∞, meaning that in the long run the perturbed analysis is more likely to show results in any set A than the original analysis. Applying this to any set that does not contain a shows that the probability of deviations from the value a is higher in the perturbed analysis than in the original analysis.
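The following sketch illustrates this claim numerically in an assumed setting that is not taken from the main text: Y_n is normal with mean a and variance 1/n, the noise Z_n is normal with mean 0 and variance 1/n, and A is the complement of a small interval around a.

    # Illustration (assumed Gaussian setting): adding noise makes deviations
    # from the limit value a more likely.
    a   <- 1      # limit of the unperturbed analysis (arbitrary)
    eps <- 0.1    # half-width of the neighbourhood of a (arbitrary)
    n   <- c(10, 100, 1000, 10000)

    # P_n: distribution of Y_n ~ N(a, 1/n); Q_n: distribution of Z_n ~ N(0, 1/n).
    # Their convolution P_n * Q_n is the distribution of Y_n + Z_n ~ N(a, 2/n).
    p_unperturbed <- 2 * pnorm(-eps * sqrt(n))      # P_n(|Y_n - a| > eps)
    p_perturbed   <- 2 * pnorm(-eps * sqrt(n / 2))  # P_n*Q_n(|Y_n + Z_n - a| > eps)

    cbind(n, p_unperturbed, p_perturbed, ratio = p_perturbed / p_unperturbed)
    # The ratio grows with n: the perturbed analysis deviates from a ever more
    # often relative to the unperturbed one.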

The main result of the theory of large deviations roughly states that

    P_n(A) ≈ exp( −n inf_{z ∈ A} I(z) )   and   P_n ∗ Q_n(A) ≈ exp( −n inf_{z ∈ A} J(z) )

for two so-called rate functions I, J that determine the rate of decay of the probability of the set A as n increases (Klenke, 2008, ch. 23; Varadhan, 1984). To show the desired inequality, it is therefore sufficient to show that J(z) ≤ I(z) for all z, meaning that P_n ∗ Q_n decays more slowly than P_n.
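In the Gaussian setting sketched above, this decay can be verified directly: for Y_n ~ N(a, 1/n), the rate function is I(z) = (z − a)²/2, and for the set A = (a + ε, ∞) the quantity −(1/n) log P_n(A) converges to inf_{z ∈ A} I(z) = ε²/2. The values in the sketch below are again purely illustrative.

    # Check of the approximation P_n(A) ~ exp(-n inf_A I) in the assumed
    # Gaussian setting Y_n ~ N(a, 1/n) with A = (a + eps, Inf).
    a   <- 1
    eps <- 0.1
    n   <- 10^(2:6)

    log_p <- pnorm(a + eps, mean = a, sd = 1 / sqrt(n),
                   lower.tail = FALSE, log.p = TRUE)  # log P_n(A)
    rate  <- -log_p / n                               # empirical decay rate
    cbind(n, rate, limit = eps^2 / 2)                 # rate approaches eps^2 / 2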

To show this, let P_n and Q_n satisfy a large deviation principle (LDP) with rate functions F_P and F_Q.² By the assumption on the convergence of the noise, we have F_Q(0) = 0, and by shifting the P_n by a, we can without loss of generality assume that F_P(0) = 0. Then the product measures (P_n ⊗ Q_n)_{n ∈ ℕ} on R^2 satisfy an LDP with rate function R(x, y) = F_P(x) + F_Q(y) (under the assumption that both P_n and Q_n are exponentially tight) (Kühn, 2014, lemma 2.7). The image measures of (P_n ⊗ Q_n) under the function f(x, y) = x are the P_n, which, according to the contraction principle (see Klenke, 2008, p. 518) and by definition, satisfy an LDP with rate function I(z) = F_P(z).

² The existence of an LDP is not a strong assumption, since LDPs are known in many cases (e.g. for Brownian motion, empirical measures, and averages of i.i.d. random variables) and are preserved under a number of operations.

Next, take the function g(x, y) = x + y. The image measure of (P_n ⊗ Q_n) under this function is P_n ∗ Q_n by the definition of the convolution. Applying the contraction principle yields the rate function

    J(z) = inf { F_P(x) + F_Q(y) : x + y = z }

for the LDP of P_n ∗ Q_n. By setting x = z, y = 0 and x = 0, y = z, the inequality

    J(z) ≤ min( F_P(z) + F_Q(0), F_P(0) + F_Q(z) ) = min( F_P(z), F_Q(z) ) ≤ F_P(z) = I(z)

follows. Although this estimate is not very elaborate and can certainly be improved, it is enough to establish J(z) ≤ I(z), which is the desired statement.
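The inequality can also be checked numerically in a simple assumed case: with two quadratic rate functions satisfying F_P(0) = F_Q(0) = 0, a grid search over the infimum confirms J(z) ≤ I(z) for all values of z considered.

    # Grid check of J(z) = inf_{x + y = z} (F_P(x) + F_Q(y)) <= I(z) = F_P(z)
    # for two assumed rate functions with F_P(0) = F_Q(0) = 0.
    F_P <- function(z) z^2 / 2
    F_Q <- function(z) z^2

    J <- function(z, grid = seq(-10, 10, by = 0.001)) {
      min(F_P(grid) + F_Q(z - grid))  # infimum over x, with y = z - x
    }

    z  <- seq(-3, 3, by = 0.5)
    Jz <- sapply(z, J)
    all(Jz <= F_P(z) + 1e-9)          # TRUE: the convolution decays more slowly
    cbind(z, J = Jz, I = F_P(z))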

The described approach can also be used to create a nonparametric, distribution-free test. The age-area hypothesis as formulated by Willis (1922) has been rejected as a general ecological pattern (Gaston, 1998) and only serves as a well-known example to demonstrate the described approach. We test a reformulation of it, which states that taxa have a constantly increasing presence throughout their existence.

The trajectories representing the age-area hypothesis were generated in a stochastic model that assumes that, although the abundance of taxa is constantly increasing, it is still subject to random fluctuations.

The presence p of a taxon at time t is assumed to follow a stochastic equation with constantly increasing expectation, with each trajectory ending at t = 1 in accordance with the procedure described in the section "The Way Data is Processed".

The number of trajectories simulated is identical to the number of taxa in the empirical data.

Each trajectory was binned with the bins used above; the value of the bin with borders t_i, t_{i+1} is given by

    ∫_{t_i}^{t_{i+1}} p(t) dt.

Last, for each trajectory, the values of the bins were rescaled so that the combined area of the bins is one, to make them comparable with the bins from the dataset given above.

To demonstrate this test, radiolarian data was downloaded from the NSB database (Lazarus, 1994; Spencer-Cervato, 1999) using the same parameters, options and rescaling as described above. Overall, there were 97811 occurrences from 667 species, with the median number of occurrences per species being 48 and the mean 146.6. For every species, the rescaled ages were binned into n = 10 bins. The bins of each species were then rescaled to have an area that sums up to one.
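The following R sketch outlines the simulation pipeline described above: simulate one trajectory per taxon, bin it into the n = 10 bins, and rescale the bins to unit area. Since the exact equation for p is not reproduced here, a random walk with positive drift, truncated at zero, serves as a stand-in; the drift and noise values are arbitrary.

    # Sketch of the simulation and binning procedure (the random walk with
    # positive drift is an assumed stand-in for the actual equation for p).
    set.seed(1)

    simulate_trajectory <- function(steps = 1000, drift = 0.01, noise = 0.05) {
      p <- pmax(cumsum(rnorm(steps, mean = drift, sd = noise)), 0)
      data.frame(t = seq(0, 1, length.out = steps), p = p)  # rescaled ages in [0, 1]
    }

    bin_trajectory <- function(traj, n_bins = 10) {
      borders <- seq(0, 1, length.out = n_bins + 1)
      idx  <- findInterval(traj$t, borders, rightmost.closed = TRUE)
      bins <- tapply(traj$p, factor(idx, levels = seq_len(n_bins)), sum)
      bins[is.na(bins)] <- 0
      bins / sum(bins)  # rescale so that the combined area of the bins is one
    }

    # One simulated trajectory per taxon in the empirical data set
    n_taxa    <- 667
    simulated <- t(replicate(n_taxa, bin_trajectory(simulate_trajectory())))
    dim(simulated)  # 667 taxa x 10 bins, comparable with the binned NSB data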

Then two sets were created: one with the species whose histories were reversed in time and one with those whose trajectories were left unchanged. Each species was randomly assigned to one of these groups with a probability of 0.5. For these two datasets, the multivariate Cramér test for the two-sample problem (Baringhaus and Franz, 2004) as implemented in the R package "cramer" was used.

In the following, we will call some space E × A combined with a probability distribution P a model. The set E will represent the part of the space on which the conditioning will take place. Recall that if

    ∫ f(e, a) P_1(de, da) = ∫ f(e, a) P_2(de, da)

for all f from some class of functions F, then P_1 = P_2. Assuming that E = {1, …, N} is finite, the unconditioned model can be decomposed according to the value at which a trajectory ends:

    P(de, da) = ∑_{i ∈ E} p(i) δ_i ⊗ Q_i(de, da),

where p(i) is the probability in the unconditioned model that a trajectory ends at i and δ_i ⊗ Q_i(de, da) is the probability distribution describing the model conditioned to end at value i.

Similarly, if the model is conditioned to end with probability distribution q, we obtain

    P_q(de, da) = ∑_{i ∈ E} q(i) δ_i ⊗ Q_i(de, da).

This shows that every conditioned model is a convex combination of the models that deterministically end with value i ∈ E. By defining the simplex

    ∆ = { x ∈ R^N : x_i ≥ 0, ∑_{i=1}^N x_i = 1 },

every conditioned model can be uniquely identified by the mapping

    P_q ↦ x = ( q(1), …, q(N) ),

where x ∈ ∆. Accordingly we get

    ∫ f(e, a) P_q(de, da) = ∑_{i=1}^N x_i a_i,

where a_i = ∫ f(e, a) δ_i ⊗ Q_i(de, da) ∈ R and x_i = q(i). Maximizing (minimizing) the integral for a fixed function f and varying conditioned models is therefore equivalent to maximizing (minimizing) the linear function ∑_{i=1}^N x_i a_i over ∆. This is a linear optimization problem, therefore its optima can be found in the vertices of the simplex, which represent the models conditioned to deterministically end at some value.
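A short numerical illustration of this argument, with arbitrary values a_i: the value of the linear function at any point of the simplex lies between min(a_i) and max(a_i), and the extremes are attained at the vertices, i.e. at the models conditioned to end deterministically at a single value.

    # Linear functions on the simplex attain their optima at the vertices.
    set.seed(3)
    N <- 5
    a <- rnorm(N)  # arbitrary values a_i = integral of f against delta_i x Q_i

    # random points of the simplex (random conditioned models q)
    random_q <- t(replicate(10000, { w <- rexp(N); w / sum(w) }))
    values   <- random_q %*% a  # sum_i x_i a_i for each random conditioned model

    c(max_over_simplex = max(values), vertex_max = max(a))  # max(values) <= max(a)
    c(min_over_simplex = min(values), vertex_min = min(a))  # min(values) >= min(a)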
Assume that the function L does a good job of identifying the model, in the sense that its expectation value is close to l if the model P_l is assumed.

Now transition to the conditioned models derived from P_l, here denoted by δ_j ⊗ Q_l for j ∈ E.

Then, by the line of argument in the subsection above, ∫ L(e, a) δ_j ⊗ Q_l(de, da) will differ from l. So trying to identify unconditioned models on the basis of data derived from conditioned models will lead to the misidentification of the models.
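A toy example, with a model and a function L that are assumed here purely for illustration: trajectories are random walks with drift l, and L estimates the drift from the endpoint. Unconditioned, the expectation of L is close to l; conditioned on the endpoint, it is pulled towards the conditioning value instead.

    # Toy illustration (assumed model): random walks with drift l; L estimates
    # the drift from the endpoint of a trajectory.
    set.seed(4)
    l     <- 0.5  # drift of the unconditioned model P_l
    steps <- 100
    walks <- replicate(20000, cumsum(rnorm(steps, mean = l)))
    L     <- walks[steps, ] / steps  # drift estimate from the endpoint

    mean(L)  # close to l = 0.5: L identifies the unconditioned model

    # Conditioning on trajectories ending near a fixed value j forces the
    # expectation of L towards j / steps, regardless of the true drift l.
    j <- 30
    mean(L[abs(walks[steps, ] - j) < 5])  # close to j / steps = 0.3, not 0.5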