Evolutionary stochastic dynamics of speciation and a simple genotype-phenotype map for protein binding DNA

Speciation is of fundamental importance to understanding the huge diversity of life on Earth. In contrast to current phenomenological models, we develop a biophysically motivated approach to study speciation involving the co-evolution of protein binding DNA for two geographically isolated populations. Our results predict that, despite neutral diffusion of hybrids in trait space, smaller populations have a higher rate of speciation, due to sequence entropy poising populations more closely to incompatible regions of phenotype space. A key lesson of this work is that non-trivial contributions of sequence entropy give rise to a strong population size dependence on speciation rates.

Speciation is of fundamental importance to understanding the huge diversity of life on Earth. In contrast to current phenomenological models, we develop a biophysically motivated approach to study speciation involving the co-evolution of protein binding DNA for two geographically isolated populations. Our results predict that, despite neutral diffusion of hybrids in trait space, smaller populations have a higher rate of speciation, due to sequence entropy poising populations more closely to incompatible regions of phenotype space. A key lesson of this work is that non-trivial contributions of sequence entropy give rise to a strong population size dependence on speciation rates.
Speciation is thought to underly much of the diversity of life on Earth today. The development of quantitative models that can predict speciation rates will thus allow better understanding of the different factors that maintain bio-diversity along with the processes of extinction and environmental change [1,2]. Yet the detailed genetic mechanisms by which distinct species arise is still largely not understood. Darwin [3], despite the title of his magnus opus, struggled to understand how natural selection could give rise to hybrid inviability or infertility. If the hybrid inviability were due to a single locus, how could two species evolve from a common ancestor, since at least one of these species would have to evolve past an inviability bottleneck; for example, if a pair of species, which share common ancestry, are fixed for aa and AA alleles, respectively, and the genotype aA is inviable, how did one of these populations evolve the Aa allele? It has since been understood that non-linear or epistatic interactions between different loci can give rise to so-called Dobzhansky-Müller incompatibilities [4][5][6] between independently evolving lineages. For example, two lines evolving independently through geographic isolation (allopatric evolution) from a common ancestor ab, can fix the allelic combinations aB and Ab respectively, yet the hybrid genotype AB is inviable.
Field data [1,7] suggest that the most dominant form of speciation does indeed involve geographically isolated populations with no or very little gene flow and that the mechanism is commonly via Dobzhansky-Müller incompatibilities [8,9]. There are a number of models of allopatric speciation based on Dobzhansky-Müller incompatibilities, which consider independent lineages evolving neutrally or under varying selection pressures on each line [10][11][12][13]. However, all of these models are essentially phenomenological in their assumptions about the genetic basis of speciation. This is understandable without recourse to very detailed models; however, in this paper we develop a generic approach, which explores universal properties that the genetic basis of traits may have on the process of speciation. In particular, much recent work has shown that in general mappings from genotype to phenotype are non-trivial [14][15][16][17], giving rise to a number of previously unpredicted effects, due to the increasing prominence that sequence entropic effects have for small populations. In this paper, we examine the process of how incompatibilities arise in allopatry, for an abstract, yet biophysically motivated model of transcription factor binding, which accounts for the sequence entropy of binding and ask what population size effects such a genotype-phenotype maps induces. The binding of transcription factors to DNA to control gene expression is arguably one of the most important co-evolving systems for organisms and crucial for their correct development and so makes an ideal first case study to study the affects of genotype-phenotype maps on speciation.
To tackle this question we develop the tools of stochastic dynamics for evolution in the regime where µN ≪ 1, where µ is the mutation rate and N is the population size, and use it to ask how hybrid fitness and the probability of incompatibilities increases with divergence time between a pair of lines under the same stabilising selection pressure. We find, in contrast to current models, which predict no population size effect [10][11][12][13], that despite the two lines diverging neutrally, smaller populations have a higher rate of speciation as the effect of sequence entropy is to bias populations more closely to incompatible regions of phenotype space. We suggest a key lesson of this work, is that even though co-evolving loci on different lineages may diverge in a population-size independent neutral manner, non-trivial contributions of sequence entropy in a stabilising fitness landscape give rise to strong population size dependence.

A SMOLUCHOWSKI EQUATION FOR EVOLUTIONARY STOCHASTIC DYNAMICS
Natural selection acts on phenotypes, however, in general, there will not be a one-to-one mapping between phenotypes and genotypes, but instead many genotypes can code for the same phenotype. As has been shown, for small population sizes (µN ≪ 1), where populations are largely monomorphic, genotype-phenotype maps can give rise to a bias in evolution towards phenotypes with larger sequence entropy and in equilibrium is described by a balance between a tendency of phenotypes to increase their fitness and at the same time maximise their sequence entropy [18,19]. This is embodied by the free fitness, Φ(ξ) = F (ξ)+S(ξ)/ν, where F is the Malthusian fitness [19] and S = − ln p(ξ) ξ is the sequence entropy, where p is the equilibrium distribution of ξ, the vector of traits and ν is the Lagrange multiplier or effective inverse temperature of the canonical ensemble, which is linear in population size N ; for the Wright-Fisher process ν = 2(N − 1) and the Moran process ν = N − 1 [19]. In equilibrium, the free fitness is maximised and the probability distribution of traits is Boltzmann distributed, p(ξ) = 1 Z e νΦ(ξ) , where Z = dξe νΦ(ξ) . Out of equilibrium, we expect that stochastic dynamics will give rise to: 1) diffusion in phenotype space with diffusion constant µ, where µ is the mutation rate of all sites contributing to the phenotype; 2) directed motion driven by gradients in the free fitness function with respect to changes in phenotype ξ. The flux of probability in phenotype space will then be given by where ζ is a coefficient representing the strength of evolutionary change in response to an evolutionary force (or gradient in free fitness) and the factor of a 1 2 for the mutation rate comes from converting from a discrete random walk to a continuous one. In equilibrium, detailed balance requires this flux to be zero, from which it is simple to show that which is the evolutionary equivalent of the Einstein relation that relates the friction constant to the diffusion constant of a Brownian particle; here the evolutionary friction constant ζ is inversely proportional to the mutation rate and population size. We can now express the Smoluchowski Equation in its final form using the continuity equation, ∂ t p(ξ) = −∇ · J(ξ) [20]: This Smoluchowski Equation can be converted to an equivalent set of stochastic differential equations [21,22] where i corresponds to the i th trait of ξ and where η i is a white noise Gaussian process with moments , is a generalisation of the Ornstein-Uhlenbeck process for phenotypic evolution described in [23], but for an arbitrary free fitness landscape and including the correct population size dependence of the strength of the drift term via the Einstein relation Eqn.2.

A SIMPLE CONTINUOUS MODEL OF TRANSCRIPTION FACTOR BINDING DNA
The two-state approximation [24,25] for transcription factor (TF) binding assumes that amino acid base pair hydrogen-binding energies are approximately additive and that each nonoptimal interaction increases the energy of binding by the same amount. This suggests replacing DNA and amino acid sequences by binary strings. The binding energy is simply proportional to the Hamming distance r = (g 1 − g 2 ) · (g 1 − g 2 ) between a pair of binary sequences, g 1 , g 2 . We can assign a fitness F r to each value of r, either arbitrarily or using some knowledge of the biophysics and function of a particular system. On the other hand, the entropy is given directly by the nature of the binary model; there are many sequences that give the same Hamming distance, precisely Ω r = ℓ r ≈ 2 ℓ 2/πℓ exp(− 2 ℓ (r − ℓ/2) 2 ), when ℓ, the sequence length, is large. So to a good approximation the sequence entropy is quadratic in Hamming distance r: We see that entropy is maximised for r = ℓ/2, which corresponds to the Hamming distance that has the largest number of genotypes coding it. In order to take advantage of the effective stochastic dynamics described in the previous section, we replace each sequence g i with a continuous variable x i and define a Hamming distance like or binding energy variable as ξ = |x 1 − x 2 | [26]. Further, we assume a simple quadratic fitness landscape F (ξ) = − 1 2 κ F ξ 2 . With sequence entropy given by Eqn. 5, we have (to within a constant) the free fitness given by Φ(ξ) = − 1 2 κ F ξ 2 − 2 ℓν (ξ − ℓ/2) 2 . Using Eqn. 4, treating x 1 and x 2 as independent variables, we can write down a pair of stochastic differential equations describing the dynamics of the sequence-like variables in a quadratic landscape: where κ = κ F + 4 ℓν , which is the curvature in the free fitness landscape and is a sum of the curvatures due to fitness and sequence entropy. In addition, we have used the fact that ∂ z (|z| − c) 2 = 2(z − c sgn(z)), where sgn is the sign function that returns the sign of the argument.

ANALYTICAL CALCULATION OF HYBRID FITNESS AND THE PROBABILITY OF AN INCOMPATIBILITY
From Eqn.6, fitness and sequence entropy balance (maximum in free fitness) for x 1 − x 2 = ±ξ ∞ = ± 2 κF ν , so the free fitness is doubled peaked with a cusp valley at x 1 − x 2 = 0, except in the limit of an infinite population size (ν → ∞). For the sake of analytical tractability, we can assume that each lineage is always confined to one or the other maximum, such that without loss of generality the initial condition x 1 (0) > x 2 (0) remains true. This implies we can make the simplification ξ = x 1 − x 2 , from which it follows: where the characteristic relaxation rate of the system is given by λ = νµκ/2. Note that this system of equations is analogous to the dynamics of a pair of overdamped beads connected by spring with a temperature-dependent elastic constant. To solve these equations, it is straightforward to take the Laplace transform of these equations, solve the resulting matrix equation to give solutions in Laplace space and find the inverse Laplace transform to give

the matrix J is given by
and it is understood that the integral of the vector above is an element by element operation. There will be an equivalent set of stochastic differential equations for the second lineage of the same form as Eqn.7 with solution x ′ (t) given by Eqn.8 and binding energy ξ ′ = |x ′ 1 − x ′ 2 | with an identically distributed noise vector, but uncorrelated with the first lineage. If we let w = x 1 − x ′ 2 and w ′ = x ′ 1 − x 2 and the hybrid binding energies be h = |w| and h ′ = |w ′ |, then it is straightforward to show using Eqns.8, 9 and the moments of the Gaussian process that We see that for long times (λt ≫ 1), the average binding energy ξ(∞) decays from the initial condition to its equilibrium value ξ ∞ = 2 κν on the timescale 1/(2λ); in reality, using the full dynamics in Eqn.6 this variable would have zero mean in the long time limit, but as we see below the results are insensitive to this error. If we define the vector w = (w, w ′ ) T , we find that the covariance matrix Σ = (w − w ) T (w − w ) is symmetric and has elements (1 − e −4λt ), We see for short times (λt ≪ 1), the off-diagonal terms are zero, which suggests the binding energies of the two hybrids are uncorrelated. However, for long times (λt ≫ 1) they become anti-correlated, which shows, unlike at short times, the probability of both hybrids being incompatible are not independent. Average hybrid fitness is then given by F h (t) = − 1 2 κ F h 2 = − 1 2 κ F w 2 and we see that on long times it decreases like ∼ µt.
A similar calculation of the variance on each lineage gives, which shows that, as expected, the co-evolutionary constraint of the free fitness landscape bounds the variance of binding energies on each lineage. So the variance in the hybrid binding energies is due to a pure diffusive term µt, which represents how the two lineages diffuse apart by independent mutations, plus a term which represents the saturating growth of variance of each lineage after divergence.
The dynamics of the probability of a DMI for each hybrid, irrespective of whether the other hybrid has a DMI or not, is simply, where ξ * = 2|F * |/κ F , where F * is the threshold fitness below which an incompatibility arises in a hybrid. The variable w is given by the sum of a number of Gaussian processes, so p(w, t) itself must be Gaussian (with p(h, t) = p(w, t)+p(−w, t)), which is completely specified by its mean (Eqn.10) and variance Σ 11 = w 2 − w 2 , which can be directly calculated from Eqns.10 & 11. From Eqn 14, the probability of a DMI is then simply an integral of a Gaussian, which is expressed in terms of complementary error functions: Note that both the average hybrid fitness and the probability of incompatibilities (Eqn.14) collapse to the same curves as a function of µt, for the same value of κν = κ F ν + 4/ℓ and when fitness is measured relative to the curvature F/κ F .

RESULTS
We can compare the analytical calculations of the previous section with numerical simulations of Eqn. 6 for a pair of lines diverging from a common ancestor, where no approximation is made regarding the values of x 1 and x 2 . The results of the simulations are averaged over 10 4 independent realisations and all results assume an effective sequence length ℓ = 10, F * /κ F = −25 and ξ(0) = ξ ∞ = 2 κν , which is the mean value of ξ in equilibrium.
In Fig.1, we have plotted the probability of an incompatibility, for various values of κ F ν, where solid lines are the analytical calculation and the dotted lines are from the numerical integration of Eqn.6. Firstly, we see that the analytical predictions compare very well to integrating the full stochastic differential equations, validating our simplifying assumptions. Secondly, we see there is a large population size effect for the probability of an incompatibility; the characteristic time for incompatibilities to arise is shorter as the population size decreases, where in addition for small populations (κ F ν ≪ 0.1) there are two such characteristic timescales. In addition, we see that the dynamics of P I (t) become insensitive to changes in population size both for small population sizes (κ F ν ≪ 0.1) and large population sizes (κ F ν ≫ 0.1). Note that, as we have not bounded the binding energies to a maximum value of ℓ, we would expect real sequences to have a smaller plateau value of P I in the long time limit of µt ≫ 1; as we argue below it is the short time limit that is most relevant to speciation.
To understand this general behaviour, we can consider what happens in the phase space of x 1 and x 2 for the first lineage, versus the phase-space for x 1 and x ′ 2 for the hybrids, as shown in Fig.2. Initially, both lineage and hybrid populations diffuse in a neutral and spherically symmetric manner (like 2µt and 4µt, respectively) up to the time ∼ λ −1 , when the change in free fitness is of order the mean fitness ∼ 1/ν and the accumulated variance of the hybrid population approaches the characteristic width of the potential δξ 2 ∼ 1 κν . After this time the co-evolutionary constraint of the free fitness landscape is felt on each lineage and the probability density is then squeezed along a tube whose axis is defined by x 1 = x 2 + ξ ∞ (assuming an initial condition x 1 (0) > x 2 (0)) and width δξ. As the marginal pdfs for x 1 and x ′ 2 will be identical, the result is that in the hybrid phase-space, we have that the p(x 1 , x ′ 2 , t) still grows in a spherically symmetric manner, as indicated by the term 2µt in Eqn.11; incompatibilities arise when hybrid populations have diffused to one or the other critical binding energy at x 1 − x ′ 2 = ±ξ * . From Eqn.14 and Fig.2, we see that there will in general be two characteristic times for DMIs to arise, given by the condition that (ξ * ± 2 κν ) 2 ∼ Σ 11 (t). It is then simple to see that in the limit of κ F ν ≫ 4/ℓ, ξ 0 → 0 and the time to diffuse to each boundary is the same as observed in Fig.1. Finally, for small population sizes (κ F ν ≪ 4/ℓ) κν → 4/ℓ, which shows the dynamics of DMIs becomes independent of population size in this limit as well. It is clear that the population size dependence described here is likely to be also seen in more complex models of co-evolving loci as for a general free fitness landscape, the balance between sequence entropy and fitness is population size dependent, poising populations nearer or further away from such incompatible regions.
Diagram showing the evolution of the probability density of p(x1, x2, t) (left) and p(x1, x ′ 2 , t) (right), as a function of time.
For the actual process of speciation, it will typically be the short-time behaviour that will dominate. Given a large number M ∼ 1000 → 10000 of similar noninteracting clusters of loci, we would expect speciation when M P I ≫ 1 and so the critical probability of a DMI will be ∼ 1/M ∼ 10 −4 → 10 −3 . As shown in the inset of Fig.1, which is a plot of P I (t) on a log-log scale together with the threshold probability 1/M = 10 −4 , there is strong population size dependence on the time to speciation. In particular, we see this time is relatively insensitive to the exact value of M , compared to the population size effect we describe and arises due to the steep gradient of P I (t) at short times.
The process of speciation underlies the vast diversity of life on Earth. However, conventional models of speciation via Dobzhansky-Müller incompatibilities for geographically isolated populations with neutral divergence predict that speciation is independent of population size [10][11][12][13]. However, such models are phenomenological with respect to the genetic basis of incompatibilities. Here using the tools of stochastic dynamics and a simple biophysically motivated model for the genotype-phenotype map of protein binding DNA, we have shown that there is a significant population size dependence of the time for hybrid incompatibilities to arise for independently evolving isolated populations under the same stabilising selection pressure. Protein binding DNA to control gene expression is a prototypical co-evolving system and critical for the proper development of organisms, thus these results have strong implications for the speciation rates and diversity of populations at small population sizes. We should note that our model does not account for differential selection across lineages, although it can easily be extended to do so, which can reinforce or have an opposite dependence on population size dependent on whether selection is beneficial or mildly deleterious [13]; the work here highlights a previously unanticipated population size dependence, which is necessary for a fully quantitative model of speciation.
We acknowledge useful discussions with David Pollock, University of Colorado and funding from the Medical Research Council, U.K.