Generalizing Bayesian phylogenetics to infer shared evolutionary events

Significance Phylogenetic models have long assumed that lineages diverge independently. Processes of diversification that are of interest in biogeography, epidemiology, and genome evolution violate this assumption by affecting multiple evolutionary lineages. To relax the assumption of independent divergences and infer patterns of divergences predicted by such processes, we introduce a way of conceptualizing, modeling, and inferring phylogenetic trees. We apply the approach to genomic data from geckos distributed across the Philippines and find support for patterns of shared divergences predicted by repeated fragmentation of the archipelago by interglacial rises in sea level.

. The accuracy and precision of the M G and M IB models at estimating the tree length (the sum of all branch lengths in units of expected substitutions per site) from data sets with 50,000 biallelic characters simulated on species trees randomly drawn from the M G and M IB tree distributions. Each plotted circle and associated error bars represent the posterior mean and 95% credible interval. Estimates for which the potentialscale reduction factor was greater than 1.  Figure S5. The accuracy and precision of the M G and M IB models at estimating the effective population size (N e µ) across the tree from data sets with 50,000 biallelic characters simulated on species trees randomly drawn from the M G and M IB tree distributions. Each plotted circle and associated error bars represent the posterior mean and 95% credible interval. Estimates for which the potential-scale reduction factor was greater than 1. Unlinked SNPs 500 loci, 100 linked sites each Figure S6. The M G model has a low false positive rate (FPR; the proportion of incorrectly merged divergence times with a posterior probability > 0.5) when applied to data simulated on trees drawn from M G with all (Row 1) or only variable (Row 2) unlinked characters, and all characters from linked loci (Row 3). Support for incorrectly merged divergence times is high only when the difference between the times is small (Column 2), and is not correlated with the age of the merged nodes (right). (Column 3; P = 0.109, 0.106, 0,053, and 0.068, from top to bottom for a t-test that Pearson's correlation coefficient = 0 using all points with posterior probability > 0). When data sets with linked loci are reduced to only one variable site per locus (Row 4), the FPR increases (left) and precision decreases (right). The top row is the same as Figure

Midpoint divergence time Posterior probability
Unlinked SNPs 500 loci, 100 linked sites each Figure S7. The M G model has a low false positive rate (FPR; the proportion of incorrectly merged divergence times with a posterior probability > 0.5) when applied to data simulated on trees drawn from M IB (no shared or multifurcating divergences) with all (Row 1) or only variable (Row 2) unlinked characters, and all characters from linked loci (Row 3). Support for incorrectly merged divergence times is high only when the difference between the times is small (Column 2), and is not correlated with the age of the merged nodes (right). (Column 3; P = 0.25, 0.29, 0,11, and 0.11, from top to bottom for a t-test that Pearson's correlation coefficient = 0 using all points with posterior probability > 0). When data sets with linked loci are reduced to only one variable site per locus (Row 4), the FPR increases (left) and precision decreases (right). The top row is the same as Figure 5D      Posterior probability Figure S10. Results of analyses of 100 data sets with linked loci simulated along the species tree shown in (A).
Each simulated data set comprised 500 loci with 100 biallelic characters. P-values are shown for Mann-Whitney U tests Mann and Whitney (1947) comparing the differences in tree distances between methods. For each simulation, the mutation-scaled effective population size (N e µ) was drawn from a gamma distribution (shape = 20, mean = 0.001) and shared across all the branches of the tree; this distribution was used as the prior in analyses. Each simulated data set comprised 500 loci with 100 biallelic characters. P-values are shown for Mann-Whitney U tests (Mann and Whitney, 1947) comparing the differences in tree distances between methods. For each simulation, the mutation-scaled effective population size (N e µ) was drawn from a gamma distribution (shape = 20, mean = 0.001) and shared across all the branches of the tree; this distribution was used as the prior in analyses. Tree plotted using Gram ( Posterior probability Figure S12. The performance of the M G and M IB tree models when applied to 100 data sets with 500 loci (each with 100 linked characters) simulated on species trees randomly drawn from the M G tree distribution. For each simulation, the mutation-scaled effective population size (N e µ) was drawn from a gamma distribution (shape = 20, mean = 0.001) and shared across all the branches of the tree; this distribution was used as the prior in analyses. Plots created using the PGFPlotsX (Version 1. Posterior probability Figure S13. The performance of the M G and M IB tree models when applied to 100 data sets with 500 loci (each with 100 linked characters) simulated on species trees randomly drawn from the M IB tree distribution. For each simulation, the mutation-scaled effective population size (N e µ) was drawn from a gamma distribution (shape = 20, mean = 0.001) and shared across all the branches of the tree; this distribution was used as the prior in analyses. Plots created using the PGFPlotsX (Version 1.   Figure S15. The maximum a posteriori (MAP) topology from Figure 6C for Gekko shown without branch lengths to make it clearer which clades are involved. Shared divergences indicated by dashed lines, with labels shown along the top that correspond to rows in

Rejected trees
Retained trees 2 Tables   Table S1. The data for all Cyrtodactylus samples included in our phylogenetic analyses are included in a tab-delimited text file available from the project repository and archived on Zenodo (Oaks and Wood, Jr., 2021): https://raw.githubusercontent.com/phyletica/ gekgo/master/phycoeval-msg-assemblies/ipyrad-assemblies/sample-data/Cyrt_ localities.tsv. Below is a summary of the data from each sample.  Table S3. Summary of shared divergences in the maximum a posteriori (MAP) phylogeny estimated under the generalized tree model for Cyrtodactylus ( Figure 6A) and Gekko ( Figure 6C). See Figures S14 & S15

The generalized tree model
Let T represent a rooted, potentially multifurcating tree topology with N tips and n(t) internal nodes t = t 1 , t 2 , . . . t n(t) , where n(t) can range from 1 (the "comb" tree) to N −1 (fully bifurcating, independent divergences). Each internal node t is assigned to one divergence time τ, which it may share with other internal nodes in the tree. We will use τ = τ 1 , . . . , τ n(τ) to represent n(τ) divergence times, where n(τ) can also range from 1 to N − 1, and every τ has at least one node assigned to it, and every node maps to a divergence time more recent than its parent ( Figure S17). Note, the number of divergence times is constrained by the number of internal nodes; n(τ) ≤ n(t). For convenience, we will index each τ from youngest to oldest. We assume the tree is ultrametric; all tips are at time zero, which we will denote as τ 0 . Figure S17. An illustration of the generalized tree model implemented in ecoevolity. The prior distributions of the divergence times are shown to the right, and "splittable" divergence times are indicated with an asterisk to the left. We assume all possible topologies (T ) are equally probable; see Figure S1 for an example of the sample space of topologies under the generalized tree model. We assume the age of the root node follows a parametric distribution (e.g., a gamma distribution), and each of the other divergence times is beta-distributed between the present (τ 0 ) and the age of the youngest parent of any node mapped to the divergence time. For example, in Figure S17, the parents of the nodes mapped to τ 1 are t 3 (parent of t 1 ) and t 6 (parent of t 5 ), which are mapped to τ 3 and τ 4 , respectively, the younger of which is τ 3 . Thus, divergence time τ 1 in Figure S17 follows a beta distribution that is scaled to the interval between 0 and τ 3 , resulting in the prior probability density where α τ and β τ are the two positive shape parameters of the beta prior probability distribution, and B(α τ , β τ ) is the beta function that serves as a normalizing constant. If we use y(τ i ) to denote the divergence time of the youngest parent of any node mapped to the non-root divergence time, τ i , we can generalize this beta prior probability density as For additional flexibility, we allow a distribution to be placed on the alpha parameter of the beta distributions on the non-root divergence times (α τ ). However, we constrain β τ = 1, simplifying the prior probability density of each non-root divergence time to For all of our analyses presented in this paper, we constrained α τ = 1, which further simplifies the prior probability of each non-root divergence time to the uniform density For simplicity below, we denote the probability density of a divergence time as f (τ i | y(τ i )), but our work generalizes to any proper probability distributions on the divergence times, including the full beta distribution (Equation 4).

Approximating the posterior of the generalized tree model
We use Markov chain Monte Carlo (MCMC), specifically Metropolis-Hastings (MH) (Metropolis et al., 1953; Hastings, 1970) algorithms, to sample from the generalized tree distribution. An MH algorithm works by stochastically proposing changes to parameters of a model, and using the following rule to determine the probability of accepting these moves: Acceptance probability = min 1, likelihood ratio × prior ratio × Hastings ratio (7) The product of the first two terms (likelihood and prior ratios) gives us the posterior density of the state of the model being proposed divided by the posterior density of the current state of the model. The Hastings ratio corrects for asymmetry in the proposal distribution by dividing the probability density of proposing the move that would reverse the move being proposed by the probability density of the move being proposed. Below we describe MH moves for sampling from the generalized tree distribution, and derive the prior and Hastings ratio for each. These moves can be coupled with any likelihood function to calculate the likelihood ratio and sample from the posterior distribution of generalized trees using MH (Equation 7). In Sections 4.1-4.4, we describe a pair of reversible-jump moves, "split-time" and "merge-times," for moving between trees with different numbers of divergence times. These are the core moves that allow the full space of the generalized tree distribution to be traversed and sampled. In Section 4.5, we describe moves that propose changes to the topology, but do not change the number of divergence-time parameters, nor their values. In Section 4.6, we describe a move that updates the values of divergence times without changing the topology (though we do describe an extension of this move that does update the topology).

Split-time move
To generalize the space of trees, we introduce two reversible-jump moves: "split-time" and "merge-times." When a reversible-jump move is to be attempted, the merge-times or split-time move is chosen with probability 0.5, except for two special cases: 1. If the current state of the chain is the most general tree model (n(τ) = N − 1), then the merge-times move is chosen with probability 1.
2. If the current state of the chain is the "comb" tree (n(τ) = 1), then the split-time move is chosen with probability 1.
The basic idea is to randomly divide a "splittable" divergence time into two nonempty sets of nodes, and assign one of the sets to a new, more recent divergence time. A divergence time is considered splittable if it has (1) more than one node mapped to it, or (2) a single multifurcating node mapped to it. For example, in Figure S17, divergence times τ 1 , τ 2 , and τ 4 are splittable. The first step is to randomly choose a divergence time, τ i , from among the splittable divergence times. After dividing τ i into two sets of nodes, we need to randomly select a time more recent than τ i to assign one of the sets.

Drawing the new divergence time
To get the new, proposed divergence time, we randomly draw a new divergence time between τ i−1 and τ i from a proposal distribution, where is the conditional probability density of proposing the new time τ given the times from the current values of τ i and τ i−1 . In our implementation, we use a beta probability distribution scaled and shifted to the interval τ i−1 -τ i , so that the probability density of the new time is where α and β are the two positive shape parameters of the beta distribution, and B(α, β) is the beta function . For generality and simplicity, we will use g τ to denote this probability density of the proposed divergence time below.

Prior ratio
The prior ratio for the split-time move is where f (T , τ ) is the prior probability of the proposed tree topology and divergence times. In our implementation, we assume that all possible tree topologies (across n(τ) = 1, 2, . . . , N −1) are equally probable a priori. We also assume the divergence time of the root (τ n(τ) ) is gamma-distributed, and each of the other divergence times is beta-distributed between the present (τ 0 ) and the divergence time of the youngest parent of a node mapped to τ i , which we denote as y(τ i ) ( Figure S17; Equation 4). Given these assumptions, the prior ratio becomes If τ i = 1 (i.e., the divergence time selected to split was the most recent divergence), then τ i−1 is the present, and so f (τ i−1 | y(τ i−1 ) ) = f (τ i−1 | y(τ i−1 )) = 1. Also, if none of the nodes assigned to τ i−1 has a parent assigned to the newly proposed divergence time (i.e., y(τ i−1 ) = τ ), then y(τ i−1 ) = y(τ i−1 ); e.g., in Figure S17, if τ 2 is split, the prior probability density of τ 1 is not affected, because the youngest parent of a node mapped to τ 1 is mapped to τ 3 . In both of these special cases, the prior ratio further simplifies to

Hastings ratio
The probability of proposing a split-time move involves several components. First, we have to choose to split rather than merge. We will account for the probability of this toward the end of this section. Next, we randomly choose a splittable divergence time τ i with probability 1 ns(τ) , where n s (τ) is the number of splittable divergence times. As described in Section 4.1.1 above, we randomly choose a new divergence time τ more recent than τ i with probability density g τ .
When we divide τ i into two sets of nodes, if any polytomies get broken up, new branches will get added to the tree. Under certain models, each of these new branches will need values randomly drawn for parameters. For example, if using a "relaxed-clock" model, each new branch will need a substitution rate. Or, if using a multi-species coalescent model where each branch has its own effective population size, a value for this will need to be drawn. Note, this does not involve divergence-time parameters, because all nodes split from τ i will be assigned to τ . We will use g z to represent the product of all the probability densities of the proposed values for the new branches. If no polytomies get broken up, or new branches created from broken polytomies do not require parameter values, then g z = 1.
We will deal with how the nodes assigned to τ i are divided into two sets below. For now, we will use Ξ to represent the probability of the proposed division of τ i . The probability density of the proposed split move is then where Θ and Θ represent the full state of the model before and after the proposed split move, respectively. The move that would exactly reverse this split move would simply entail randomly selecting the proposed divergence time from all divergence times except the root, which would then be deterministically merged with the next older divergence time. The probability of this reverse move is where n(τ) and n(τ) is the number of divergence times before and after the proposed split move, respectively. The Hastings ratio for the split move is then where γ S represents the probability of choosing to merge in the reverse move divided by the probability of choosing to split in the forward move, which is 0.5 if current tree is the "comb" and proposed tree has n(τ) < N − 1 2.0 if proposed tree has n(τ) = N − 1 and the current tree is not the "comb" 1.0 otherwise.
When we are working with a tree with more than three tips, the first case occurs whenever the current tree is the comb tree (n(τ) = 1), and the second case occurs whenever the proposed tree has no shared divergences nor multifurcations (n(τ) = N − 1). However, for full generality we need to include the second condition in both of these cases to account for the situation where we are working with a tree with only three tips.

Merge-times move
In the "merge-times" move, we randomly choose τ x from one of the n(τ) − 1 non-root divergence times. Then, we merge τ x with the next older divergence time, τ x+1 . This will create shared divergence times among nodes and/or multifurcating nodes. We will use τ x+1 to refer to the newly merged divergence time proposed by the move.

Prior ratio
Generally, the prior ratio for the merge-times move is the same as Equation 10. Assuming (1) all topologies are equally probable, (2) the divergence time of the root (τ n(τ) ) is gamma-distributed, and (3) each of the other divergence times is beta-distributed between the present (τ 0 ) and the age of the youngest parent to the nodes mapped to the divergence time ( Figure S17; Equation 4), the prior ratio becomes .

Hastings ratio
The probability of the forward merge-times move is simply where n(τ) and n(τ) is the number of divergence times before and after the proposed merge move, respectively. Borrowing from Equation 13, the probability density of the split move that would exactly reverse the proposed merge move is where n s (τ) is the number of splittable divergence times after the proposed merge-times move. The Hastings ratio for the merge move is then where γ M represents the probability of choosing to split in the reverse move divided by the probability of choosing to merge in the forward move, which is 2.0 if proposed tree is the "comb" and current tree has n(τ) < N − 1 0.5 if current tree has n(τ) = N − 1 and the proposed tree is not the "comb" 1.0 otherwise (21)

Expanding Ξ
Up to this point, we have not dealt with how, during the split-time move, we divide the nodes mapped to τ i into two sets, one of which gets assigned to the new divergence time drawn between τ i−1 and τ i . This has to be done with care to ensure that every possible configuration of two divergence times derived from the nodes assigned to τ i can be proposed, such that it properly balances the reverse merge-times move. As above, we use Ξ to represent the probability of the proposed division of τ i 's nodes.
In the next two sections, we show how this is done for two special cases. The first special case illustrates how we first choose which nodes currently mapped to τ i will get moved to the new divergence time. The second special case shows how we handle any multifurcating nodes that have been chosen to be moved to the new divergence time. In the third section, we build on these special cases to show a general solution for Ξ.

The case of all bifurcating nodes mapped to τ i
We will use n(t → τ i ) to represent the number of nodes mapped to τ i . If all n(t → τ i ) nodes mapped to τ i are bifurcating, we randomly divide these nodes into two non-empty sets and then randomly choose one of the two sets of nodes to move to the new divergence time. For example, this would be the case if τ 2 is chosen to split from the tree shown in Figure S17.
The number of ways n(t → τ i ) can be divided into two non-empty subsets is given by the Stirling number of the second kind, which we denote as S 2 (n(t → τ i ), 2). We uniformly choose among these, such that the probability of randomly selecting any set partition of the n(t → τ i ) nodes mapped to τ i is 1 n(t →τ i ) . After partitioning the nodes into two sets, there is a 1/2 probability of choosing one set to move to the new, more recent divergence time. Thus, when all of the nodes mapped to τ i are bifurcating the probability of each possible splitting of τ i is 4.3.2 The case of a single polytomy mapped to τ i Next, let's consider another special case where the number of nodes mapped to τ i is one (i.e., a single polytomy). For example, this would be the case if τ 4 is chosen to split from the tree shown in Figure S17. In this case, we randomly resolve the polytomy, by randomly (uniformly) choosing a set partition of the descending branches into non-empty subsets. Any subsets with only one branch remain attached to the original polytomy node, while each subset with multiple branches get split off to form a new node (clade) that descends from the original polytomy node. These new nodes are assigned to the new, more recent divergence time τ . The number of ways to partition the descending branches of the polytomy are thus B b − 2, where B b is the Bell number (Bell, 1934)-the number of possible set partitions of the b branches descending from the polytomy. We have to subtract 2 from B b , because we do not allow the two "extreme" set partitions with one or b subsets. The former would move the whole polytomy to the new divergence time, and the latter would leave the polytomy as is. Neither of these scenarios adds a dimension (divergence time) to the model. We avoid these two scenarios using rejection so that the remaining partitions of the b descending branches are chosen uniformly. Thus, when only a single polytomy is mapped to τ i the probability of each possible splitting of τ i is 4.3.3 The case when multiple nodes, including at least one polytomy, are mapped to τ i When multiple nodes are mapped to τ i , and at least one is a polytomy, we need to do some more accounting to ensure that we can reach every possible arrangement of two divergence times that can be merged to form the current configuration of nodes mapped to τ i . Similar to the case with all bifurcating nodes, we will first divide the n(t → τ i ) nodes mapped to τ i into two subsets and randomly choose one of these subsets to move to the proposed, more recent divergence time, τ . For each multifurcating node that ends up in the subset to be moved to τ (if any), we need to either break up the polytomy, as we did in the case of the single-polytomy case above, or move the entire polytomy to τ .
Unlike in the case of only bifurcating nodes mapped to τ i , when we partition the n(t → τ i ) nodes mapped to τ i into two sets, we must allow for the case where all n(t → τ i ) end up in the set to move to τ . This is because, if any of the polytomy nodes get broken up, they will leave at least one node at τ i , and the dimension of the model will change (i.e., the number of divergence times will increase by one). So, we have to allow an empty subset when we randomly partition the n(t → τ i ) into two subsets. However, we cannot allow the empty subset to be chosen to move to τ . There are S 2 (n(t → τ i ), 2) + 1 ways to partition the n(t → τ i ) nodes mapped to τ i into two subsets if we allow the set partition with one empty subset. For each of these, there are two ways to choose the subset to move to τ , and of all of these, there is one scenario we will reject: if the empty set gets selected to move to τ . Thus, there are (2(S 2 (n(t → τ i ), 2) + 1)) − 1 = (2S 2 (n(t → τ i ), 2)) + 1 = 2 n(t →τ i ) − 1 ways to choose a subset of the nodes assigned to τ i for moving to the new divergence time, and the probability of each is For each polytomy mapped to τ i that ends up in the set of nodes to move to τ (if any), we randomly choose one of the B b possible set partitions of the b branches descending from the polytomy. However, we will reject the set partition with b subsets (i.e., all branches end up in their own subset). We reject this, because no subclades get broken off from the polytomy to move to τ , and this scenario is already taken into account by the polytomy node not ending up in the set of nodes to move to τ in the first place. However, we need to allow the scenario where all b branches descending from a polytomy get assigned to a single set, which results in the entire polytomy node getting moved to τ , as long as at least one node remains assigned to τ i (we will handle this in a bit). Thus, for each polytomy in the set of nodes to be moved to τ , there are B b − 1 ways to move it. Using n p (t ⇒ τ ) to represent the number of polytomies in the subset of nodes to be moved to τ , the total number of ways these polytomies can be moved to τ is and the probability of each is equal to If no polytomies end up in the subset of nodes to move to τ , then Φ = 1. However, if all n(t → τ i ) nodes mapped to τ i end up in the set of nodes to be moved to τ , we need to reject the case where none of the polytomy nodes gets broken up (i.e., for every polytomy, all the descending branches get partitioned into a single set), because no nodes would remain assigned to τ i , and the move would simplify to changing the value of τ i . Thus, if all n(t → τ i ) nodes mapped to τ i end up in the set of nodes to move to τ , the total number of ways all n p (t ⇒ τ ) polytomies can be moved to τ is and the probability of each is equal to Given all of this, the probability of choosing a subset of nodes from τ i to move to the new divergence time across all possible cases is if no polytomies mapped to τ i 1 (2 n(t →τ i ) −1)(Φ−1) if ≥ 1 polytomy nodes mapped to τ i , and all n(t → τ i ) assigned to move set 1 (2 n(t →τ i ) −1)Φ otherwise. (30) Notice, the case of n(t → τ i ) = 1 (i.e., only a single polytomy node mapped to τ i ) is simply a special case of the second condition above, where all the nodes assigned to τ i end up in the set to move to the new divergence time, including at least one polytomy.

Validation of Split-time and Merge-times moves
To validate the split-time/merge-times moves, we used them to sample from the prior distribution of trees with 5, 6, and 7 leaves. If working correctly, we should sample all n(T ) tree topologies with an equal frequency of 1 n(T ) . If we collect N MCMC samples from the prior distribution, the number of times a topology is sampled should be approximately distributed as Binomial(n = N , p = 1 n(T ) ); i.e., binomially distributed where the number of "trials" is equal to the number of samples, and the probability of sampling each topology is 1 n(T ) . We found a close match between the number or times each tree was sampled by our reversible-jump MCMC chain and the expected number, and failed to reject the expected binomial distribution using χ 2 goodness-of-fit test ( Figure S18; p = 0.742, 0.464, and 0.172 for the test with a 5, 6, and 7-leaved tree).

Nested-neighbor-node-swap move
The split-time and merge-times moves can sample all of the space of the generalized tree distribution (assuming the age of the root node is fixed). However, we implemented additional topology moves that do not jump between tree models with different numbers of divergence times.
The goal of this move is to change the topology without changing the number or timing of divergences. We start by randomly picking a non-root divergence time, τ i , with probability 1 n(τ)−1 . Next, we find the divergence time, τ j , that contains the node that is the youngest parent of nodes mapped to τ i . We then randomly pick one of the nodes mapped to τ j that has children mapped to τ i , we will call it t a . Each child of t a that is mapped to τ i will randomly contribute one of its children to a "swap pool" of nodes. If t a has children that are not mapped to τ i , we randomly pick one of these children and add it to the swap pool if it is younger than τ i . If the selected child of t a is older than τ i , we randomly sample one of its children and continue to do so until we have chosen a descendant node that is younger than τ i , which we then add to the swap pool. Lastly, we randomly pick two nodes from the swap pool and we swap their parents.
After the proposed move, the structure of the tree rootward of the swapped nodes is the same. Because of this, the move that would reverse the proposed move would be equally probable; it would involve (1) choosing the same non-root divergence time τ i , (2) choosing the same t a , (3) choosing the children that swapped parents in the forward move to enter the swap pool, and (4) picking the nodes that swapped parents in the forward move from the pool and swapping their parents back. In #3, all the parent nodes involved have the same number of children as before the proposed forward move, and so the probability of the reverse move will be equal. As a result, the Hastings ratio for the move is 1.
For example, for the tree in Figure S17 if we randomly selected τ 1 , the divergence containing the youngest parent of the nodes mapped to τ 1 is τ 3 . Divergence time τ 3 only has one node that is a parent of nodes assigned to τ 1 , which is t 3 . Node t 3 only has one child mapped to τ 1 (t 1 ), which will randomly contribute one of its children, Leaf H or I, to the swap pool. Node t 3 also has children that are not mapped to τ 1 , one child t 2 that is mapped to τ 2 . Node t 2 is considered for the swap pool, but it is too old (it is older than τ 1 and thus could not become a child of t 1 . So, we randomly consider one the children of t 2 , Leaf F or G, for the swap pool, either of which is young enough to be added to the swap pool. Next, we randomly choose two nodes from the swap pool, which has exactly two nodes in this case, a child of t 1 (Leaf H or I) and t 2 (Leaf F or G), and these nodes swap parents. If we assume that Leaves G and H were swapped, it is clear that the probability of the reverse move that would swap them back is equally probable.
We also implemented variations of this move that make larger changes to the tree topology. For example, we can perform the swap for all of the nodes mapped to τ j that have children mapped to τ i (instead of randomly choosing one of them). Another option is to randomly permute the parents of all of the nodes in the swap pool, rather than swap the parents of just two of the nodes. By chance, when doing this permutation of the nodes in the swap pool, it is possible to end up with the same topology we started with. To avoid proposing the same state, we iteratively permute the parents of the nodes in the swap pool until we have a new topology (the parents of at least some of the nodes in the swap pool have changed). Just like with the swap move, this permutation move can be performed on one randomly selected node of τ j that has children mapped to τ i , or to all of them.

Validation of move
To validate this move, we used it to sample from a uniform distribution over the topologies of a 6-leaved bifurcating tree. There are 945 topologies for a rooted, bifurcating tree (Felsenstein, 1978). If the move is working correctly, the number of times we sample each of them should follow a Binomial(n = N , p = 1 945 ) distribution, where N is the total number of MCMC samples. From an MCMC sample of 100,000 trees, we found a close match to this expected distribution, and were unable to reject it using a χ 2 goodness-of-fit test ( Figure S19; p = 0.51).  Figure S19. Comparing the expected to the observed number of times each topology of a rooted, 6-leaved, bifurcating tree is sampled by our nested-neighbor-node-swap move. Our MCMC sample of 100,000 trees closely matched the expected Binomial(n = 100, 000, p = 1 945 ) distribution.

Divergence time slide bump move
To begin this move, we randomly pick one of the divergence times, τ i . Next, we draw a uniform deviate, u ∼ Uniform(−λ, λ), where λ is a tuning parameter that can be adjusted to improve the acceptance rate of the proposal. Then, we get a new divergence time value by τ i e u . We will index our randomly selected divergence time, τ i , as τ 1 . We then use τ 1 , τ 2 , . . . , τ n to represent the selected time, τ 1 , and all the divergence times between τ 1 and τ 1 e u that contain nodes ancestral or descendant to the nodes mapped to τ 1 . Note, that incrementing indices count younger or older divergence times, depending on whether τ i e u < τ 1 or τ i e u > τ 1 , respectively.
The simplest case is that we do not have any intervening divergence times, and so we only have τ 1 . This will happen when τ 1 e u is older than the oldest node that is a child of the nodes mapped to τ i and younger than the youngest node that is parent of the nodes mapped to τ i In that case, we propose a new time to which to slide τ 1 as To reverse this move (slide τ 1 back) would be To solve for the uniform deviate that would exactly reverse the move (u ), we take the log of Equation 32 and solve for u , To get the Hastings ratio for this move, we use the formula of Green (1995), which is the ratio of the probability of drawing the random deviate that would reverse the proposed move to the probability of drawing the random deviate of the proposed move, multiplied by the absolute value of the determinant of a Jacobian matrix. Because the forward and reverse random deviates are uniform, g (u ) g(u) = 1, and the Hastings ratio reduces to just the Jacobian term, In the next simplest case, there is one intervening divergence time τ 2 . In this case, τ 1 will slide to τ 2 and "bump" it to the new time τ 1 e u . More formally, the move will be τ 2 = τ 1 e u τ 1 = τ 2 .
Again, the uniform deviate that would exactly reverse this move would be u = −u and the reverse move would be τ 2 = τ 1 τ 1 = τ 2 e u .
Again, the Hastings ratio reduces to just the Jacobian term, To generalize this to an arbitrary number of intervening divergence times that will be bumped, we have τ n = τ 1 e u τ n−1 = τ n τ 2 = τ 3 τ 1 = τ 2 .
The Hastings ratio reduces to just the Jacobian term, We avoid values of τ 1 e u that are less than zero, by rejecting the proposed move. There is no upper limit for the move, because the root of the tree can be moved to an arbitrarily old divergence time. However, in our implementation, the prior on the divergence time of the root is different than the other divergence times, and can be much more informative. In such cases, we might be able to improve mixing and tuning of the move be excluding the root divergence time from the move. We do this by selecting only non-root divergence times, and rejecting any proposed moves where τ 1 e u is older than the root.

An extension to this move
We can easily extend this move to also propose new topologies. Whenever we have a "bump" that involves a node and its children, we can propose a node swapping or permuting move described above (see the nested-neighbor-node-swap move and its extensions). Because we are sliding the nodes to take the position of the nodes they bump, the swap or permute moves are simplified a bit. We do not have to worry about τ j contributing a child that is older than its potential new parents, so we never need to randomly choose descendants until we find a node that is younger than τ i . We implemented this move, but jointly proposing changes to continuous divergence time parameters and changing the topology might lead to poor acceptance rates (Yang, 2014), so using separate moves to update divergence times and the topology is likely a better strategy.

Validation of this move
We used this move to sample from the prior distribution to ensure that the distribution of sampled divergence times matched the gamma-distributed prior we placed on the root age and the beta priors we placed on all other divergence times. To validate the extension of this move that also incorporates node swapping when nodes "bump," we used it to sample from a uniform distribution over the topologies of a 6-leaved bifurcating tree. If the move is working correctly, the number of times we sample each of the 945 topologies should be follow a Binomial(n = N , p = 1 945 ) distribution, where N is the total number of MCMC samples. We found a close match between our samples and this expected distribution, and could not reject it using a χ 2 goodness-of-fit test ( Figure S20; p = 0.165,).  Figure S20. Comparing the expected to the observed number of times each topology of a rooted, 6-leaved, bifurcating tree is sampled by the node-swapping extension to our divergencetime-slide-bump move. Our MCMC sample of 1 million trees closely matched the expected Binomial(n = 1 × 10 6 , p = 1 945 ) distribution.