Adaptive Metropolis-coupled MCMC for BEAST 2

With ever more complex models used to study evolutionary patterns, approaches that facilitate efficient inference under such models are needed. Metropolis-coupled Markov chain Monte Carlo (MCMC) has long been used to speed up phylogenetic analyses and to make use of multi-core CPUs. Metropolis-coupled MCMC essentially runs multiple MCMC chains in parallel. All chains are heated except for one cold chain that explores the posterior probability space like a regular MCMC chain. This heating allows chains to make bigger jumps in phylogenetic state space. The heated chains can then be used to propose new states for other chains, including the cold chain. One of the practical challenges of using this approach is to find optimal temperatures for the heated chains so that state spaces are explored efficiently. Here we provide an adaptive Metropolis-coupled MCMC scheme for Bayesian phylogenetics, in which the temperature difference between heated chains is automatically tuned to achieve a target acceptance probability for states being exchanged between individual chains. We first show the validity of this approach by comparing inferences of adaptive Metropolis-coupled MCMC to MCMC on several datasets. We then explore where Metropolis-coupled MCMC provides benefits over MCMC. We implemented this adaptive Metropolis-coupled MCMC approach as an open-source package, licensed under GPL 3.0, for the Bayesian phylogenetics software BEAST 2, available from https://github.com/nicfel/CoupledMCMC.


Introduction

Phylogenetic methods are being used to study increasingly complex processes. Analyses using such methods, however, also require increasingly large amounts of computational resources. One way to still be able to perform these analyses is by making use of multiple CPUs, which requires calculations that can run in parallel. Tree likelihood calculations often assume independent evolutionary processes on different branches and nucleotide sites and can be easily parallelised (Suchard and Rambaut, 2009). This can, however, be complex or even impossible for many other parts of such analyses, most notably tree prior calculations, which are used to infer demographic processes from phylogenetic trees. A lot of recent development in the field of phylogenetics has been focused on developing such tree priors that allow us to infer complex population dynamics from genetic sequence data (e.g., Müller et al. (2018);

In each iteration of an MCMC analysis, we propose a new state x′ from a proposal distribution g(x′|x) given the current state x. At this new state, we calculate the likelihood P(D|x′) of the data D given the state and the prior probability P(x′) of the new state, and compare them to those of the old state. The probability of accepting this new state is then calculated as follows:

R = min(1, (P(D|x′) P(x′)) / (P(D|x) P(x)) × g(x|x′) / g(x′|x)).   (1)

If R is greater than a randomly drawn value in [0, 1], the new state x′ is accepted as the current state; otherwise it is rejected and we remain in the same state. If we keep proposing new states x′ and accepting them using Equation (1), we eventually explore parameter space such that the frequency at which the values of a parameter are visited corresponds to its marginal probability (Geyer, 1991). One issue with this approach is that acceptance probabilities can be quite low, which makes it hard to move between different states in parameter space.
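As a minimal sketch of the Metropolis-Hastings accept step in Equation (1), working in log space to avoid numerical underflow (the function name and interface are illustrative, not taken from the CoupledMCMC package):

```python
import math
import random

def mh_acceptance(log_post_current, log_post_proposed, log_hastings_ratio=0.0):
    """Acceptance probability R of Equation (1), computed in log space.

    log_post_* are log[P(D|x) P(x)] for the current and proposed states;
    log_hastings_ratio is log[g(x|x') / g(x'|x)] (zero for symmetric proposals).
    """
    log_r = log_post_proposed - log_post_current + log_hastings_ratio
    # min(1, exp(log_r)), with the min taken in log space to avoid overflow
    return math.exp(min(0.0, log_r))

# Accept the proposed state when a Uniform[0, 1] draw falls below R:
def mh_accept(log_post_current, log_post_proposed, log_hastings_ratio=0.0):
    return random.random() < mh_acceptance(
        log_post_current, log_post_proposed, log_hastings_ratio)
```

A proposal that improves the unnormalised posterior is always accepted (R = 1), while a worse proposal is accepted with probability equal to the posterior ratio, which is what lets the chain escape and re-enter regions of low probability.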
Alternatively, an MCMC chain can be heated by using a temperature scalar β_i = 1 / (1 + (i − 1)Δt), with i being the number of the chain (Altekar et al., 2004). Heating an MCMC chain changes its acceptance probability R_heated to:

R_heated = min(1, [(P(D|x′) P(x′)) / (P(D|x) P(x))]^β_i × g(x|x′) / g(x′|x)).   (2)

For a heated chain, however, the frequency at which a value of a parameter is visited no longer corresponds to its marginal probability. Heated chains can nevertheless be used as proposals to update the cold chain by performing what is essentially an MCMC move. This move proposes to swap the current states of two random chains i and j with temperatures β_i and β_j such that β_i < β_j. Exchanging the states of chains i and j is accepted with acceptance probability R_ij:

R_ij = min(1, ([P(D|x_j) P(x_j)]^β_i [P(D|x_i) P(x_i)]^β_j) / ([P(D|x_i) P(x_i)]^β_i [P(D|x_j) P(x_j)]^β_j)).   (3)

As for a regular MCMC move, swapping the states of the two chains is accepted when a randomly drawn uniformly distributed value in [0, 1] is smaller than R_ij.
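The temperature ladder and the swap acceptance of Equation (3) can be sketched as follows; this is an illustrative reimplementation under the definitions above, not the BEAST 2 code, and the function names are ours:

```python
import math

def beta(i, delta_t):
    """Temperature scalar beta_i = 1 / (1 + (i - 1) * delta_t); chain 1 is cold."""
    return 1.0 / (1.0 + (i - 1) * delta_t)

def swap_acceptance(log_post_i, log_post_j, beta_i, beta_j):
    """Acceptance probability R_ij of Equation (3) for exchanging the states
    of chains i and j, where log_post_* = log[P(D|x) P(x)] of each chain's
    current state. Equation (3) simplifies in log space to
    (beta_i - beta_j) * (log_post_j - log_post_i)."""
    log_r = (beta_i - beta_j) * (log_post_j - log_post_i)
    return math.exp(min(0.0, log_r))

# Ladder for 4 chains with delta_t = 0.1: 1.0, then ~0.909, ~0.833, ~0.769
ladder = [beta(i, 0.1) for i in range(1, 5)]
```

Note that a swap is always accepted (R_ij = 1) when the hotter chain currently holds the state with the higher unnormalised posterior, which is exactly how good states found by heated chains propagate down to the cold chain.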

In addition to randomly swapping states between chains, we also implemented the possibility to swap states only between neighbouring chains; that is, we condition on i = j + 1 instead of randomly sampling both i and j. In this implementation of the parallel MC³, we run n different MCMC chains, and the adaptive scheme targets the same average swap acceptance probability between chains.

We next compared the inference of trees on dataset DS1, which has previously proved problematic.
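The adaptive tuning of Δt toward a target swap acceptance probability can be sketched as a stochastic-approximation update. This is a hypothetical Robbins-Monro-style rule written by us for illustration, not necessarily the exact update used in the CoupledMCMC package:

```python
import math

def update_delta_t(delta_t, accepted, target, step):
    """One adaptive update of the temperature difference delta_t.

    Increases delta_t when a swap was accepted (observed acceptance above
    the target allows hotter chains) and decreases it otherwise, with a
    decaying step size so the empirical swap acceptance converges toward
    `target`. Hypothetical sketch, not the package's exact rule.
    """
    scale = 1.0 / (1.0 + step)  # decaying adaptation weight
    delta_t *= math.exp(scale * ((1.0 if accepted else 0.0) - target))
    return max(delta_t, 1e-8)   # keep the heating strictly positive
```

Because a larger Δt spreads the temperatures further apart and thus lowers the swap acceptance, nudging Δt up after accepted swaps and down after rejected ones drives the long-run acceptance toward the target (e.g., 0.234).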

We ran the dataset using MCMC for 5 × 10^7 iterations and MC³ for 5 × 10^7 iterations with 4 different chains. We ran MC³ targeting three different acceptance probabilities.

Figure: Here we compare inferred clade probabilities between one run (y-axis) and four replicates from different starting points (x-axis) using MCMC (A) and adaptive MC³ run with target acceptance probabilities of 0.468 (B), 0.234 (C) and 0.117 (D). In E, we show how well tree space is explored when the temperatures of the heated chains are assumed to be distributed according to the quantiles of a beta distribution, with a target acceptance probability of 0.234. In F, we only allow swaps between neighbouring chains, and in G and H, we show the results when using 8 and 16 chains, respectively, but with only half and a quarter of the iterations, respectively.

MCMC chains can become stuck in local optima (Brown and Thomson, 2018).

In all other scenarios using MC³, the KS values steadily decrease, indicating convergence (see Figure S6).

This suggests that adding more chains is less and less beneficial, and that running several replicate analyses and then combining the runs might be a better use of computational resources. For datasets where heating of the chains is not needed to explore the posterior probability space, it might be more computationally efficient to run N independent MCMC analyses and combine them instead of running an MC³ analysis with N chains.

An added benefit is that it is easier to detect convergence issues with MCMC compared to MC³. In practice, this means that using MC³ is most beneficial in cases where regular MCMC shows convergence issues, such as not being able to retrieve the same posterior distribution starting from different initial values.

The adaptive MC³ algorithm is compatible with other BEAST 2 packages and therefore works with any implemented model that does not directly affect the MCMC machinery. This will help to analyze larger datasets with more complex evolutionary and