Coalescent simulation in continuous space

Many species have an essentially continuous distribution in space, in which there are no natural divisions between randomly mating subpopulations. Yet, the standard approach to modelling these populations is to impose an arbitrary grid of demes, adjusting deme sizes and migration rates in an attempt to capture the important features of the population. Such indirect methods are required because of the failure of the classicalmodelsofisolationbydistance,whichhavebeenshowntohavemajortechnicalflaws.Arecently introducedmodelofextinctionandrecolonisationintwodimensionssolvesthesetechnicalproblems,and provides a rigorous technical foundation for the study of populations evolving in a spatial continuum. The coalescent process for this model is simply stated, but direct simulation is very inefficient for large neighbourhood sizes. We present efficient and exact algorithms to simulate this coalescent process for arbitrary sample sizes and numbers of loci, and analyse these algorithms in detail.


Introduction
1 As Wright noted (1978, p. 54), many species, such as the dom-may also have many lineages descending from a single parent 23 in a event. This process can be formalised as an algorithm very 24 simply (Kelleher et al., 2013); unfortunately, however, such a direct 25 implementation is very inefficient if we wish to simulate the 26 history of populations with large neighbourhood size. 27 Wright's neighbourhood size N (1943; 1946) is widely re-28 garded as the most important parameter describing populations 29 evolving in a two-dimensional continuum, and determines the rel- 30 ative rates of genetic drift and gene flow. Wright (1943) identi- 31 fied the approximate magnitudes of the neighbourhood sizes we 32 might expect to observe in natural populations. Small values, with 33 N < 100, correspond to populations with short range dispersal 34 and a high degree of differentiation. Large neighbourhood sizes, 35 with N > 10 4 , correspond to populations that are ''substantially 36 equivalent to panmixia throughout a range of any conceivable 37 size''. Empirical estimates of N have supported Wright's analysis, 38 with mobile species such as Drosophila having neighbourhood size 39 greater than 10 3 and estimates for plant species being as little as 10 40 or less (Crawford, 1984). These estimates are also consistent with 41 the range of observed F ST values (Morjan and Rieseberg, 2004). 42 In the disc model, neighbourhood size is given very simply as 43 N = ν/u (Barton et al., 2013a), where ν is the number of potential 44 parents and u is the probability an individual dies in an event. 45 Thus, assuming a single parent and a modest neighbourhood size 46 of 100, we arrive at an impact of u = 1/100. Unfortunately, these parameters present serious difficulties to a direct simulation of the 48 coalescent process, in which we expect to generate 1/u events that 49 intersect with a lineage before it jumps. For large neighbourhood 50 size, this represents a heavy computational burden. 51 In this article we develop algorithms to simulate the coalescent 52 process efficiently for large neighbourhood and range sizes. In Sec-53 tion 2 we begin by defining an algorithm to simulate the important 54 special case of the ancestry of a sample of size two. In this case it is 55 simple to calculate the exact distribution of the time that elapses 56 between events in which a lineage jumps, allowing us to simulate 57 only these events. We then generalise these ideas in Section 3 to 58 allow us to simulate the history of a sample of n lineages. Because 59 of the inherent geometric difficulty of the problem we cannot di-60 rectly generalise the pairwise algorithm. By tessellating the torus 61 into square pixels of edge s we can approximate the areas we re-62 quire very quickly and then apply a correction to precisely can-63 cel the errors entailed using rejection sampling. We then model 64 the behaviour of the algorithm and derive an optimal value for the 65 pixel size s to minimise the computational effort required for a lin-66 eage jump. Section 4 continues by generalising this algorithm so 67 that we can simulate the history of a sample of individuals with a 68 large number of loci. We describe a general algorithm to distribute 69 genetic material among ancestors over an arbitrary number of loci, 70 and show that this method is efficient when recombination is fast. 71 In particular, we show that the approach is much more efficient 72 that the method used by the classical ms program (Hudson, 2002). 73 A Python interface to an efficient implementation of this multilo-74 cus algorithm is available at https://pypi.python.org/pypi/discsim 75 under the terms of the GNU General Public License. 76 In this article we study algorithms in much more detail than 77 is customary in the population genetics literature. We provide 78 detailed and unambiguous algorithm listings in a well established 79 format (Knuth, 1997a, Section 1.1), which is important for several  Simulating the ancestry of a sample of size two is an important 98 special case, as we are often interested in pairwise statistics.

99
In the extinction/recolonisation continuum model this involves 100 tracing the history of two lineages as they move around the range 101 before meeting and, eventually, coalescing. Events fall uniformly at by the symmetric difference and the union of the two lineages, 1 respectively. (The symmetric difference of two sets A and B is 2 (A ∪ B) \ (A ∩ B).) Let A r (x) denote the area of the intersection 3 of two discs of radius r with centres distance x apart, i.e., (1) 5 for x < 2r and A r (x) = 0 otherwise. We also define in the interest of brevity.
8 Lemma 1. Given two lineages separated by distance x, the rate at 9 which at least one lineage jumps is Proof. Events fall uniformly at rate λ and hit the area covered by 13 exactly one lineage at rate λα 1 /L 2 . A single lineage subsequently 14 jumps with probability u, and so the rate at which jump events 15 fall in the symmetric difference of the two lineages is λuα 1 /L 2 . 16 Similarly, events fall in the intersection of the two lineages at rate 17 λα 2 /L 2 . Then, since there are two lineages that may jump in this 18 region, the probability that at least one jumps isū 2 .

19
Lemma 2. Conditional on at least one lineage jumping, the probabil- 20 ity that the centre of the event is in the intersection of the two lineages 21 is 1 − α 1 /(α 1 − α 2 (u − 2)).

22
Proof. The rate at which jump events fall in the intersection of the 23 lineages is ω 2 = λα 2ū2 /L 2 ; therefore, the probability that a given 24 jump event is of this class is ω 2 /Ω 2 .

25
Lemma 3. Conditional on the centre of a jump event falling in the 26 intersection of the two lineages, the probability that both lineages 27 jump and coalesce is u/(2 − u).

28
Proof. The probability that both lineages are hit is u 2 , and the 29 probability of at least one lineage being hit isū 2 . Therefore, the 30 probability of both being hit conditional on at least one being hit is To describe the algorithm concisely, we require some notation.

33
Let D r (z) define a disc of radius r centred at z on a two-dimensional 34 torus of diameter L, and let ∥x∥ be the Euclidean norm of the vector 35 x on such a torus (we suppress the dependence on L for simplicity). 36 We also require notation to describe sampling values from random 37 variables drawn from a variety of distributions. Let R ∆ (ξ 1 , . . . , ξ k ) 38 define a single independent sample from a random variable 39 with distribution ∆ and parameters ξ 1 , . . . , ξ k . (Note that each 40 instance of R ∆ (ξ 1 , . . . , ξ k ) within an algorithm listing represents 41 an independent random sample from the specified distribution.) events with radius r and impact u occur at rate λ on a two- ] Set α 2 ← A r (∥y 1 − y 2 ∥) and α 1 ← 2π r 2 − 2α 2 .

101
Having calculated the rates at which each of these classes of event 102 are occurring, we increment time by setting Ω ← ω 1 +· · ·+ω k and 103 t ← t + R E (Ω). We then choose an event class j with probability 104 ω j /Ω, and set r ← r j and u ← u j . After this, we can proceed with 105 P2 as before, choosing the location of the event in either the union  which k out of a given n discs intersect is not a trivial problem. 116 We approach the problem instead by calculating these areas 117 approximately using a tessellation of the torus into square pixels 118 and then correct for the errors introduced precisely using rejection 1 sampling.

2
The output of the algorithm is the simulated history of the 3 sample (π , τ ), where π is an oriented tree and τ the corresponding 4 node times. An oriented tree (Knuth, 2011, p. 461) is a sequence 5 π 1 π 2 . . . , such that π j is the parent of node j and j is a root if π j = 0.

6
For example, the oriented trees   of lineages, we obtain an overall jump rate of λ  n k=1 α kūk /L 2 as 33 required.

34
Lemma 5. Conditional on at least one lineage jumping, the probabil- 35 ity that the centre of the event is in the intersection of exactly k lineages 36 is α kūk /  n j=1 α jūj . 37 Proof. The rate at which the centre of jump events fall in the 38 intersection of k lineages is ω k = λα kūk /L 2 ; therefore, the 39 probability that a given jump event is of this class is ω k /Ω n .

40
Lemma 6. Given that z falls in the intersection of k lineages and that 41 at least one of them jumps, the probability that exactly j jump is into pixels of edge s, and then use this tessellation to quickly 51 calculate an over-estimate of the area in which k lineages intersect.

52
The errors introduced by this over-estimate can then be precisely 53 cancelled out by discarding potential jump events with a certain 54 probability using rejection sampling.

55
The key idea of Algorithm N is to divide the range into pixels 56 of edge s; we assume that L/s is an integer so that pixels do not We also require some further notation to describe sampling from 64 random variables. Firstly, we extend the notation R U (A) for sampling an element of a set A uniformly, by letting random sample of k elements chosen from A without replacement).

103
As illustrated in Fig. 1, Algorithm N maintains a spatial index via 104 the matrix P. Each entry in P corresponds to a pixel, and is the set 105 of lineages that intersect with that pixel. The list Q is used to keep 106 track of the pixels with a given occupancy (defined as the number 107 of lineages that intersect with it), such that Q h is the set of pixels 108 with occupancy h. These data structures serve several purposes.

109
Firstly, Q allows us to quickly determine the number of pixels with  numerical methods for L = 100, s = 2, λ = 1, r = 1 and two neighbourhood sizes corresponding to u = 1/8 and u = 1/80. The mutation rate to new alleles µ = 10 −6 . In Algorithm P, we take 10 6 replicates for every sampling distance x, and in Algorithm N we take 10 6 replicates of a simulation in which we sample individuals at 100 regularly spaced locations (of which a subset is shown).
a given occupancy when we calculate the rate at which events fall 1 in N2, and then gain access to these pixels when we choose one 2 uniformly. We then choose the centre of the event z uniformly 3 within the pixel v, and we know that all lineages that can possibly 4 be affected by this event are in the set P v .

5
Step N1 initialises the simulation, by first setting up the 6 required data structures and then adding each location x j into the 7 spatial indexes P and Q , updating the maximum occupancy h * as 8 necessary. Once this initialisation is complete, we calculate the rate 9 at which events fall based on the current state of the sample in N2, 10 after first adjusting h * downwards, if necessary. Once we know the 11 current rate at which jump events may occur, we then increment we move on to step N4; otherwise, we return to the start of N3 and 18 generate a new event according to the same rates and probabilities. 19 It is this process of rejection sampling that ensures we simulate 20 precisely the correct process, even though the areas we calculate 21 are approximate.

22
In step N4, we know that at least one lineage of the set S 23 was born in this event and so we choose the exact number born 24 according to Lemma 6. The variable k is used represent the parental 25 lineage. If exactly one child is born in the event, then there is no 26 coalescence, and the lineage simply jumps; thus, we set k to be the 27 child lineage (through a slight abuse of notation). Otherwise, when 28 more than one child is born in the event, a coalescence occurs, and 29 k is set to the new lineage η. 30 Step N5 then removes the child lineages from P and Q . 31 Afterwards, in step N6 we choose the location of the parent lineage, 32 and then update P and Q to reflect the insertion of this new lineage, 33 revising the maximum occupancy h * upwards, as required. 34 If C contains more than one lineage, we then proceed on to step 35 N7, where these lineages coalesce and we update the oriented tree 36 and node time structures to reflect this. We set π j ← η for each 37 j ∈ C to signify that the parent of lineage j is the new lineage η, 38 and also set τ η ← t to record the fact that lineage η entered the 39 sample at time t. Finally, we calculate the number of remaining 40 lineages, and if this is greater than 1, return to step N2.

41
In the previous section we discussed the process by which the 42 pairwise coalescent can be extended to multiple event classes.  the rates of these events we obtain Cancelling terms, we see that the rate at which an event falls in  of computational effort that is required to simulate the coalescent. 30 If s is too large we will generate many events that miss all of the 31 lineages, and so spend a great deal of time looping around step N3.

32
On the other hand, if s is too small, we spend our time in steps N5 33 and N6 updating the indexing structures P and Q . 34 One approach to choosing the value of s is to simply run

79
We also assume a large neighbourhood size, so u ≪ 1. These 80 assumptions are reasonable, since they reflect the biological reality 81 that we wish to model. The following lemmas provide us with the 82 key quantities for our analysis. given pixel is s 2 + 4rs + π r 2 , as shown in Fig. 3. Therefore, the 91 probability that the disc intersects with a given pixel is (s 2 + 4rs + 92 π r 2 )/(Ns 2 ) since the total area of the torus is Ns 2 . Then, summing 93 this probability over all N pixels gives us the expected number of 94 pixels the disc intersects with, and the required result. ψ r (s) ≥ π r 2 s 2 + 4rs + π r 2 .
(4) 1 therefore the area of these pixels is s 2 σ r (s). The area of the disc 2 is π r 2 , and therefore the probability that a point falling uniformly 3 at random in a covered pixel is within the disc is ≥π r 2 /(s 2 σ r (s)) 4 by Jensen's inequality. 5 We analyse Algorithm N by counting the approximate number  therefore know that the expectation of the maximum occupancy h * 20 is equal to 1. 21 Step N2 is executed once for each jump, and since E[h * ] = 1, 22 the total contribution of this step is constant. 23 Step N3 loops a certain number of times, and involves non-  Assuming a balanced tree data structure (Knuth, 1998, Section 27 6.2.3) for Q h , each pass through N3 requires log 2 (nσ r (s)) time.

28
Since the expected occupancy of a pixel is 1, we therefore have 29 |S| = 1 with probability ψ r (s), and consequently expect to repeat 30 this step 1/ψ r (s) times for each lineage jump. Therefore, we have 31 a contribution of log 2 (nσ r (s))/ψ r (s) per jump for this step.

32
Upon reaching N4, the expected value of |S| = 1, and this step 33 therefore requires a constant amount of time per lineage jump. In 34 step N5 we remove each lineage in C from the σ r (s) pixels that it 35 covers. Since E[h * ] = 1, the time required to update P v is constant, 36 and the time required to update Q h is log 2 (nσ r (s)) as before. Then, 37 since E[|C|] = 1, the contribution of this step is σ r (s) log 2 (nσ r (s)). 38 Similarly, step N6 adds the parental lineage to σ r (s) pixels, and 39 therefore also requires σ r (s) log 2 (nσ r (s)) time.

40
Step N7 is executed at most n − 1 times, and requires a constant gives us (5), as required.

45
Lemma 10 provides us with a prediction for the expected time 46 required to perform a single lineage jump in Algorithm N in terms of the sample size n. It is more useful, however, to consider the limit 48 of large sample sizes, and to derive a prediction that is independent 49 of n. In general, we expect s > r, and so σ r (s) is between 1 50 and 10; hence, for realistic sample sizes, the dominant term is 51 log 2 (n)(2σ r (s) + 1/ψ r (s)). As a result, the optimal pixel size is 52 approximately independent of n and can be found by minimising 53 2σ r (s) + 1/ψ r (s). Performing this minimisation numerically we 54 find a minimum at s ≈ 2.24r, providing us with a useful prediction 55 for the optimal pixel size in Algorithm N. Fig. 4 shows an excellent 56 agreement between the predicted and observed optimal pixel size 57 over a wide range of sample sizes.

Multilocus coalescent 59
There are two practical approaches to including the effects of It may seem unusual to allow a child lineage to descend from 7 an arbitrary number of parents ν. In events modelling the steady 8 reproduction of individuals in a sexually reproducing species we 9 have ν = 2 as one might expect. We must also consider events 10 that model large-scale demographic shifts, however, where several 11 generations may elapse before the local population returns to 12 the stationary distribution. In this scenario, the eventual children 13 of the event may descend from a number of founders, with 14 recombination mixing the contributions of each parent arbitrarily. 15 More sophisticated models of recombination may also be required 16 for these large-scale events, which can be readily incorporated. 17 In the interest of simplicity, however, we assume that all events 18 follow the recombination process outlined above. 19 In this section we describe Algorithm M, the extension of Q4 20 Algorithm N to simulate the ancestry of m-locus individuals.

21
The algorithms share the same overall structure, and, besides 22 the process of transferring ancestral material from offspring to 23 parents, do not differ in significant ways. Rather than writing out 24 a full listing of Algorithm M, which would obscure the key points 25 and result in needless notational complexity, we concentrate on 26 describing the differences between the two algorithms. The first 27 change we require is to generalise the data structures used to track 28 the history of m-locus individuals. We also require an algorithm to 29 transfer the ancestral material from children to parents under the 30 effects of recombination, recording any coalescences that occur. 31 The latter part is far more complex, and we therefore examine the 32 process in detail.

33
Algorithm M operates in the same manner as N, where we have 34 n individuals sampled on the torus and simulate until the history 35 of the sample is complete. Each locus ℓ has an oriented tree π ℓ 36 and list of node times τ ℓ . We also have a sequence η such that 37 η ℓ is the next available node in the oriented tree π ℓ . The state between G3 and G7 doing nothing other than choosing parents 10 for loci with nonancestral material. This can be greatly improved 11 using the sparse representation, and we outline the ways to modify 12 Algorithm G to generate parents under this approach. 13 Assuming that the mapping pairs (ℓ, α) are sorted in increasing following lemma allows us to accomplish this.

27
Lemma 11. Given a gap of length k between two loci with ancestral 28 material in a system with ν parents and a probability ρ of recombi-29 nation between adjacent loci, the probability that they descend from 30 different parents is Proof. Consider the following analogue of the recombination 33 process over a gap of length k. Suppose we set a 0 ← 1 and then for 34 each 1 ≤ j ≤ k set a j ← R U ({1, . . . , ν}\{a j−1 }) with probability ρ 35 or set a j ← a j−1 otherwise. Let φ(k) be the probability that a k ̸ = 1. 36 Clearly, φ(1) = ρ. Suppose that φ(k − 1) is the probability that 37 a k−1 ̸ = 1 and consider φ(k). 38 Three possibilities exist in which a k ̸ = 1: a k−1 = 1 and 39 recombination occurs; a k−1 ̸ = 1 and recombination does not 40 occur; or, a k−1 ̸ = 1 and we recombine to a value not equal to 1.

41
Writing these probabilities down, we have to occur at different rates. For example, the effect of rare large 10 events modelling demographic shifts along with frequent small 11 events modelling reproduction is analysed in detail by Barton et al. 12 (2010a). 13 In the interest of simplicity we have given detailed algorithm 14 listings for a single class of event occurring at a fixed rate, 15 and outlined the changes required to generalise to an arbitrary 16 number of event classes from the disc model. Incorporating events 17 from different models, however, is not so straightforward, as 18 our methods here depend entirely on the geometry of the disc approach. This is not a major drawback, however. It is only the most 22 frequent reproduction events that must be from the disc model; 23 rarer events can have any geometry that we wish, provided we 24 follow the method outlined in Section 3.1 for incorporating large- 25 scale events. Since these events are rare with respect to regular 26 reproduction, there is little point in conditioning on the events 27 affecting individuals and is in fact more efficient to simulate their 28 effects directly. 29 We have discussed and analysed the algorithms in this article Q6 30 in a great deal more detail than is customary in the literature.  where φ(x) = 2µ/Λ + 2uπ r 2 − u 2 A r (x), the kernel K (x, y) is given with 97 Q 1 (x, y, θ ) = A r   x 2 − 2xy cos θ + y 2  98 Q 2 (x, y, θ ) = A r  x, (−x/2 + y cos θ , y sin θ )  .

99
Here, f (x) is the probability density function of the distance 100 between two points sampled independently and uniformly at 101 random within the unit disc (Alagar, 1976), with f (x) = 0 for x > 2. The value of φ(x) is found by observing 104 that  R 2 A r (|z|) dz = (πr 2 ) 2 and  R 2 A r (x, z) dz = π r 2 A r (x).