Exploring the effects of ecological parameters on the spatial structure of genetic tree sequences

,

the gap between a lack of theoretical results and the desire to build realistic spatial models 78 of species. 79 Here, we set out to understand properties of geographically annotated sequences of 80 trees along a genome, using a simulation-based approach. We leverage slendr to study 81 how ecological parameters affect the spatial distribution of individuals, and the structure 82 of genealogies relating them over time. 83 First, we explore the effects of varying the mode and scale of mating and dispersal 84 on the realised distances between parents and their offspring. We show that, in some 85 cases, these distances closely match their theoretical distribution. Then, we illustrate a 86 case in which the realised distribution closely matches a theoretical model which explicitly 87 includes the radius of mate choice. Finally, we test the efficacy of a maximum likelihood 88 estimator of the mean distance between parent and offspring, using distances recorded in 89 the branches of a phylogeny under a commonly used Gaussian mode of dispersal. 90 Our work serves to show that a sound understanding of the geographic parameters 91 of a species, with respect to the dispersal distribution and to ecological factors (such as 92 competition for resources and mate choice), is key to carrying out reliable phylogeographic 93 inference in real populations. 3.1 Dispersal patterns in spatially-tagged trees 96 We were interested in determining the extent to which the observed parent-offspring dis- The parent-offspring distances in a population reflect the stages of reproduction and dispersal Over many generations, how do these manifest in a geo-spatially tagged genealogy? What can we learn about the dispersal process from a genealogy? latitude longitude Figure 1: The mechanics of dispersal in our simulations. In our forwards-in-time simulations, two parents p1 and p2 are chosen. The distance between them (red line) must be less than the user-specified mating distance. The offspring ('o') is then dispersed from p1 (blue line) according to a specified mode of dispersal parametrized via a dispersal function (DF ) and distance (which we call σ). These mechanics imply that a given one-generation dispersal may either be a direct observation of a draw from the DF (the p1-o distance, blue line) or it may be a composite of mate choice and dispersal (the p2-o distance, grey line).
sampled all individuals and reconstructed the tree connecting them. For each condition, 117 we ran 3 replicates. We will call the distribution of realised parent-offspring distances in 118 these trees the DD (empirical distance distribution).

119
The shape of the DD was related to that of the theoretical DF . and move cohesively throughout the landscape (as shown in Fig. 4). As the competition 130 distance increased, the population broke into discrete clusters which appeared to "repel" 131 each other. Varying the competition distance had little effect on the DD (Fig. 6), and 132 we found that a competition distance of approximately 0.2 avoided excessive "clumping" 133 or "scattering" -we note that this is not an general threshold, but rather a function of 134 our fleet of parameters, in this section 50 individuals in a world size of 50x50 units. A 135 competition distance of 0.2 also led to the least excess variance in the DD compared to 136 the theoretical DF (Fig. 6).

137
Increasing the mating distance tended to allow a greater level of scattering, and al-138 lowed mating to occur across clustered populations. This is shown by the long branches 139 connecting clusters in Fig. 4 Half-Normal Angle drawn uniformly, distance drawn from N(0,σ 2 ). Distance follows folded normal distribution Uniform Angle drawn uniformly, distance drawn from U(0,σ) σ/2 σ 2 /12 Table 1: The parametrization of parent-offspring distances via the dispersal distance. We parametrized the dispersal distribution through a parameter σ, such that the theoretical variance increased with σ 2 , and the mean with σ (this does not apply to the Cauchy distribution, which has undefined mean and variance; here, σ was the scale parameter). Further details are given in Methods section 5.1.1.  The distribution of parent-offspring distances is an equally weighted mixture of dispersals from a "gestating" parent (often conceptualised as the mother) and a non-gestating parent. If the parent-offspring distance is y, its density given a dispersal distance parameter σ and a mate choice radius r b is therefore

Modelling dispersal patterns
The first term reflects the density given a standard Rayleigh distribution (between the 154 gestating parent and its offspring) with scale σ, while the second term models the distance 155 between the non-gestating parent and the offspring.

156
If we assume a uniformly distributed mate choice radius, then the density function of  Figure 3: Increasing σ leads to longer range parent-offspring dispersal. The figure shows the empirical density of the simulated parent-offspring distances under each of our modes of dispersal, with increasing σ. We chose to parametrize σ for each distribution such that it led to longer parent-offspring distances on average (see Table 1). This is evidenced by the horizontal stretch of the densities with larger σ. These curves were obtained from three replicates of a simulation of 50 individuals over 50 generations. The mating distance was 1 and the competition distance 0.2. The connections between parents and offspring are shown as grey lines. When the competition distance was zero (left column), we observed a strong clumping behaviour -which was somewhat rescued by allowing a large scale of mate choice (bottom row). As we increased the competition and mating distances, the population tended to scatter into discrete clusters. In these simulations, the mode of dispersal was 'Brownian' and σ was 1. mating distance competition distance Figure 5: The effect of mate choice and competition radius on realized parentoffspring distances. Each tile shows the excess variance of the empirical dispersal distribution compared to the theoretical one -as given by Table 1. Since the Cauchy distribution has undefined variance, the 'excess' is relative to zero. Increasing the competition distance tended to have little effect on the variance of parent-offspring distances, but altering the scale of mate choice had a very strong effect. However, when the mating distance was 100, the excess variance was the lowest when the competition distance was 0.2 -and not zero. There was a close match between the theoretical and simulated distributions across a range of mating distances. As the mating distance increased, the distributions acquired a flat shoulder compared to the Rayleigh distribution from long father-offspring dispersals. In all simulations, the dispersal distance parameter σ was 1 and the competition distance was 0.2.
the distance between the non-gestating parent and the offspring is given by Where a is the distance between the gestating parent and its offspring, and b is the distance 157 between parents. This derivation is elaborated in the Methods section 5.3. We verified 158 that these equations matched the simulated distances ( Fig. 6) across the parameter range 159 we examined.

160
If the mate choice distance is instead modeled more simply as a Rayleigh distribution 161 (see Methods section 5.3), the density function between the offspring and the (unknown) 162 parent can be analytically solved: where τ is scale of the Rayleigh distribution governing the mate choice distance.

164
This formulation also leads to a simple result for the mean parent-offspring distance. 165 Since the expected mother-offspring distance is σ π/2 and the expected father-offspring 166 distance is √ σ 2 + τ 2 π/2, the expected parent-offspring distance is were to measure the distances along branches of a genealogy, we would eventually expect 168 to see generation-scaled distances follow a Gaussian with mean . This may be 169 interpreted as a many-generation "effective" dispersal distance parameter.

Estimating dispersal distances from spatially tagged trees
Finally, we sought to test how accurately σ could be estimated, given a perfectly inferred spatial tree sequence. Under a Gaussian mode of dispersal (what we term "Brownian"), the maximum likelihood estimator of σ iŝ where the index i denotes each of N branches in a genealogy, with geographic distance d i 172 and branch-length in generations l i (see Methods section 5.4). 173 We sampled 100 genomes across 5 simulation replicates from a population dispersing 174 with a σ of 1, and obtained maximum likelihood estimates of σ from the set of all parent-175 offpsring distances (Fig. 7). We next emulated a situation where we know the geographic In agreement with this, the estimated σ was larger than 202 the true parameter (Fig. 7), and was also more sensitive to increasing mating distances. were produced from all branch-wiseσ values. All branches are length 1 in the unsimplified tree, so 'cutting' (excluding all branches of more than 100 generations from the estimation procedure) has no effect on the distribution of distances. However, cutting long branches increased the estimated dispersal distance for the simplified tree and the tip-only distances. We show that altering the kernel of parent-offspring dispersal can have profound ef-208 fects on the diffusion captured within a genealogy, and in particular on the tails of this 209 dispersal distribution. We were able to manipulate these distributions in order to produce 210 different diffusion rates across the habitat -for example, the Cauchy, a text-book example 211 of a "heavy-tailed" distribution did indeed produce a greater proportion of long-distance 212 dispersals.

213
There was some difficulty in choosing a common parametrization for these dispersal  The scale of competition had little effect on these distances overall, but caused dis-221 tinctive patterns in the distribution of a population within its landscape. In particular, 222 long-range competition led to a clustered distribution of individuals, which may be a prac-223 tical nuisance to simulation users, and lead to unwanted geographic structure. 224 We observed that the distances within a genealogy increased dramatically if the scale   Simulations will be key to approaching these issues. We simulated under several modes of mother-offspring dispersal, coming under two cate-307 gories: 308 1. Angle-distance dispersal: in these, the absolute distance are controlled by a given 309 distribution. An angle is drawn randomly from a uniform distribution between 0 and 310 2π, and a distance d was drawn from one of the following distributions:

313
• Half-Normal : the p1-offspring distance is Gaussian distributed, with mean 0 and 314 variance σ 2 . When a distance is below zero, the offspring is effectively ejected 315 backwards. The mean of the resulting folded normal distribution (specifically, 316 a half-normal) is σ( 2/π) and the variance is σ 2 .

Tree recording and manipulation
We simulated a single locus in order to focus on fundamental geographic dynamics which act 330 on single trees. After a simulation run, we retrieved the simplified and unsimplified trees.

331
Simplified trees consist of nodes representing coalescence events, and edges connecting 332 them, which implicitly record many individuals. In contrast, an unsimplified tree records 333 all individuals along edges. Such a tree is useful to directly observe the dispersals which 334 occurred at every generation along a long branch. We processed and analysed these via the 335 slendr interface to the tskit library (Kelleher, Thornton, et al. 2018). tskit is a powerful 336 framework for storing and manipulating trees and tree-sequences with close-to-optimal 337 space usage. We also converted these trees to the "phylo" R object class, which allowed us 338 to analyse them via the phylogenetics package ape (Paradis and Schliep 2019).  1. Genetic parent is the "mother", p1. We observe p1-offspring dispersal, (which in 355 slendr is directly encoded).

356
2. Genetic parent is the "father", p2. We observe a convolution of p1-offspring dispersal 357 and the p1 − p2 distance. 358 We can draw a triangle which connects both parents and offspring, as shown in Fig.   359 10. In case (1), we observe sideã. In case (2), we observe sideỹ.b is the distance which 360 separates the two parents, and the angle between sidesã andb isθ.ã ∼ Rayleigh(σ), if we 361 have Brownian dispersal. Since in slendr, parents are chosen with uniform probability from 362 a specified radius r b (the mating distance),b ∼ U nif (0, r b ) where r b is the mating distance.

363
The angle between these sides is free to range between zero and π, soθ ∼ U nif (0, π). 364 We can calculate the length of the side y from a, b and θ: θ ∼ Unif(0,2π) y = a 2 + b 2 − 2ab cos θ Figure 10: A schematic of parent-offspring dispersal. When we observe dispersal from p2, the observed parent-offspring distance (y) is a convolution of the distance between p1 and p2 (b, in red), and the dispersal between p1 and the offspring (a, in blue). The cosine rule gives us an expression for y in terms of a, b and the angle between them θ. If we know the probability distributions of a, b and θ, we can obtain that of y via a change of variables.
We aim to derive the probability density function (pdf) of y, using the pdfs of a, b and 366 θ. This can be achieved with a change of variables: J is the jacobian matrix of partial derivatives: Which goes back into equation (5): This is the joint pdf of the three sides of the triangle. Now, we integrate out the 371 parameters a and b in order to get a fully marginalised f y . a constant probability of 1/π and 1/r b respectively. We also know that a has a Rayleigh 376 pdf of (a/σ 2 )e (−a 2 /2σ 2 ) . Replacing these in the function above: This is the fully marginalised pdf of y. This integral is challenging to solve analytically, 378 but we can obtain the approximate shape of the pdf by numerical integration.

379
Finally, we can write out the pdf of the distance between a randomly chosen parent 380 and its offspring. Let's call this pdf g y (y). With probability P = 0.5, the parent is 381 the mother (p1) and y simply follows a Rayleigh distribution with scale σ. When the 382 genome is inherited from the father (p2), which again occurs with P = 0.5, the pdf of y 383 is the distribution shown above. This leads to the final pdf g y (y) of the parent-offspring 384 distance, 385 g y|σ,r b (y|σ, r b ) = 1 2 y σ 2 e − y 2 2σ 2 + 1 2 f y|σ,r b (y|σ, r b ) Note that we could get the expectation of the distance y from this expression: which is a half-weighted average of the distance expected from the parent-offspring

A simpler model with Gaussian mate choice
There are simple scenarios that lead to a more analytically tractable pdf. For example, 390 let us suppose that the distance between parents is also generated in a similar way to 391 Brownian dispersal, from independent normal distributions in x and y dimensions with 392 variance τ 2 . In this case, the father-offspring distance in each dimension is a sum of two 393 Gaussian random variables and is itself normally distributed with variance σ 2 + τ 2 . This 394 gives rise to a Rayleigh distribution with scale √ σ 2 + τ 2 for the norm of the distance, y.

395
In that case, the final pdf is then: 396 g y|σ,τ (y|σ, τ ) = 0.5 y σ 2 e − y 2 2σ 2 + 0.5 y σ 2 + τ 2 e We may also wish to survey how each branch is contributing to the estimate. Since