Exchanging Replicas with Unequal Cost, Infinitely and Permanently

We developed a replica exchange method that is effectively parallelizable even if the computational cost of the Monte Carlo moves in the parallel replicas is considerably different, for instance, because the replicas run on different types of processor units or because of algorithmic complexity. To prove detailed balance, we make a paradigm shift from the common conceptual viewpoint, in which the set of parallel replicas represents a high-dimensional superstate, to an ensemble-based criterion in which the other ensembles represent an environment that might or might not participate in the Monte Carlo move. In addition, based on a recent algorithm for computing permanents, we effectively increase the exchange rate to infinity without the steep factorial scaling as a function of the number of replicas. We illustrate the effectiveness of this replica exchange methodology by combining it with a quantitative path sampling method, replica exchange transition interface sampling (RETIS), in which the cost of a Monte Carlo move can vary enormously, as paths in a RETIS algorithm do not have the same length and the average path lengths tend to vary considerably between the path ensembles that run in parallel. This combination, coined ∞RETIS, was tested on three model systems.

The Markov chain Monte Carlo (MC) method is one of the most important numerical techniques for computing averages in high-dimensional spaces, like the configuration space of a many-particle system. The approach has applications in a wide variety of fields ranging from computational physics and theoretical chemistry to economics and genetics. The MC algorithm effectively generates a selective random walk through state space in which the artificial steps are designed to ensure that the frequency of visiting any particular state is proportional to the equilibrium probability of that state. The Metropolis [1] or the more general Metropolis-Hastings [2] algorithms are the most common approaches for designing such random steps (MC moves) based on the detailed-balance principle. That is, the MC moves should be constructed such that the number of transitions from an old state s^(o) to a new state s^(n) is exactly balanced by the number of transitions from the new to the old state: ρ(s^(o)) π(s^(o) → s^(n)) = ρ(s^(n)) π(s^(n) → s^(o)), where ρ(·) is the state-space equilibrium probability density and π(·) are the probabilities to make a transition between the two states given the set of possible MC moves. Further, the transition is split into a generation and an acceptance/rejection step such that π(s → s′) = P_gen(s → s′) P_acc(s → s′). In the case that the sampled state space is the configuration space of a molecular system at constant temperature, P_gen might relate to moving a randomly picked particle in a random direction over a small random distance, and ρ(s) is proportional to the Boltzmann weight e^{−βE(s)}, with β = 1/k_B T the inverse temperature and E(s) the state's energy. The Metropolis-Hastings algorithm takes a specific solution for the acceptance probability:

P_acc(s^(o) → s^(n)) = min[1, (ρ(s^(n)) P_gen(s^(n) → s^(o))) / (ρ(s^(o)) P_gen(s^(o) → s^(n)))].  (1)

The generation probabilities cancel in the above expression if they are symmetric, P_gen(s → s′) = P_gen(s′ → s), as in the less generic Metropolis scheme. At each MC step, the new state is either accepted or rejected based on the probability above. In case of a rejection, the old state is maintained and resampled. This scheme obeys detailed balance and, if in addition the set of MC moves is ergodic, equilibrium sampling is guaranteed. When ergodic sampling, even if mathematically guaranteed, is slowed down by a rough (free) energy landscape, replica exchange MC becomes useful.

* titus.van.erp@ntnu.no
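As a minimal illustration of these ingredients, the following sketch samples a single coordinate from a Boltzmann distribution using a symmetric proposal. This is our own toy example, not code from the paper; the function names and the harmonic energy are our choices.

```python
import math
import random

def metropolis_step(x, energy, beta=1.0, max_step=0.5):
    """One MC move with a symmetric proposal: accept with min(1, exp(-beta*dE))."""
    x_new = x + random.uniform(-max_step, max_step)   # symmetric P_gen
    d_e = energy(x_new) - energy(x)
    if d_e <= 0.0 or random.random() < math.exp(-beta * d_e):
        return x_new          # accepted
    return x                  # rejected: the old state is maintained and resampled

def sample(energy, n_steps=200_000, x0=0.0, beta=1.0, seed=7):
    """Generate a Markov chain; visit frequencies approach exp(-beta*E)."""
    random.seed(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        x = metropolis_step(x, energy, beta)
        chain.append(x)
    return chain

# For E(x) = x^2/2 the chain samples exp(-x^2/2), so <x^2> should approach 1/beta.
chain = sample(lambda x: 0.5 * x * x)
```

Note that a rejection appends the old state again, which is the "resampling" mentioned above.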
Replica exchange MC (or replica exchange molecular dynamics) is based on the idea of simulating several copies of the system with different ensemble definitions [3-5], most commonly ensembles with increasing temperature (parallel tempering). By performing "swaps" between adjacent replicas, the low-temperature replicas gain access to the broader regions of space that are explored by the high-temperature replicas. The detailed-balance condition and corresponding acceptance/rejection step can be derived by viewing the set of states in the different ensembles (replicas) as a single high-dimensional superstate S = (s_1, s_2, ..., s_N) representing the system in a set of N independent "parallel universes". The Metropolis scheme applied to the superstate yields

P_acc(S^(o) → S^(n)) = min[1, ρ(S^(n))/ρ(S^(o))],  (2)

in which the probability of the superstate equals

ρ(S) = ∏_{i=1}^{N} ρ_i(s_i),  (3)

where ρ_i(·) is the specific probability density of ensemble i. For example, the move that attempts to swap the first two states, implying S^(o) = (s_1, s_2, ..., s_N) and S^(n) = (s_2, s_1, ..., s_N), will be accepted with a probability

P_acc = min[1, (ρ_1(s_2) ρ_2(s_1)) / (ρ_1(s_1) ρ_2(s_2))].  (4)

In a replica exchange simulation, swapping moves and standard MC or MD steps are applied alternately. Parallel computing will typically distribute the same number of processing units per ensemble to carry out the computationally intensive standard moves. The swapping move is cheap, but it requires that the ensembles involved in the swap have completed their previous move. If the standard moves in each ensemble require different computing times, then several processing units have to wait for the slow ones to finish. If the imbalance per move is relatively constant, the replicas could effectively be made to progress in cohort by differentiating the number of processing units per ensemble or the relative frequency of replica exchange versus standard moves per ensemble. However, in several MC methods this imbalance is not constant, as with configurational bias MC [6-8] or path sampling [9]. The number of elementary steps to grow a polymer in configurational bias MC obviously depends on the length of the polymer being grown, but early rejections also lead to a broad distribution of the time it takes to complete a single MC move, even in uniform polymer systems. Analogously, the time required to complete a MC move in path sampling simulations depends on the length of the path being created. Other examples of complex Monte Carlo methods with a fluctuating CPU cost per move are cluster Monte Carlo algorithms [10] and event-chain Monte Carlo [11, 12].
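As a concrete (hypothetical) instance of the swap acceptance above, consider parallel tempering, where ρ_i(s) is proportional to exp(−β_i E(s)); the acceptance ratio then reduces to exp[(β_1 − β_2)(E_1 − E_2)]. A sketch with names of our own choosing:

```python
import math
import random

def swap_accept(e1, e2, beta1, beta2):
    """Swap acceptance for two temperature ensembles with rho_i(s)
    proportional to exp(-beta_i * E(s)):
    min(1, exp[(beta1 - beta2) * (E1 - E2)])."""
    return min(1.0, math.exp((beta1 - beta2) * (e1 - e2)))

def try_swap(states, energies, i, j, betas):
    """Attempt to exchange the states (and energies) of ensembles i and j in place."""
    if random.random() < swap_accept(energies[i], energies[j], betas[i], betas[j]):
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False
```

A swap that moves the higher-energy state to the higher temperature is always accepted; the reverse direction is accepted with the Boltzmann-like factor above.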
We will show that the standard acceptance Eqs. 1 and 4 can be applied in a parallel scheme in which ensembles are updated irregularly in time and the average frequency of MC moves differs between the ensembles. In addition, we show that we can apply an infinite swapping [13] scheme between the available ensembles. For this, we develop a new protocol based on the evaluation of permanents that circumvents the steep factorial scaling. This last development is also useful for standard replica exchange.

METHODS
Finite swapping. In the following, we will assume that we have two types of MC moves: one that is CPU intensive and can be carried out within a single ensemble, and replica exchange moves between ensembles, which are relatively cheap to execute. The CPU-intensive move will be carried out by a single worker (one processing unit, one node, or a group of nodes), and these workers perform their tasks in parallel on the different ensembles. One essential part of our algorithm is that we have fewer workers than ensembles, such that whenever a worker has finished and produced a new state for one ensemble, this state can directly be swapped with the states of any of the available ensembles (the ones not occupied by a worker). After that, the worker will randomly switch to another unoccupied ensemble to perform a CPU-intensive move.
In its most basic form, the algorithm consists of the following steps:

1. Define N ensembles and let ρ_i(·) be the probability distribution of ensemble i. We also define P_RE, the probability for a replica exchange move.
2. Assign K < N "workers" (processing units) to K of the N ensembles for performing a CPU-intensive MC move. Each ensemble is at all times occupied by either 1 or 0 workers. The following steps are identical for all workers.
3. If the worker is finished with its MC move in ensemble i, the new state is accepted or rejected according to Eq. 1 (with ρ_i for ρ). Ensemble i is updated with the new state (or by resampling the old state in case of rejection) and is then considered to be free.
4. Take a uniform random number ν between 0 and 1. If ν > P_RE, go to step 7.
5. Among the available ensembles, pick a random pair (i, j).
6. Try to swap the states of ensembles i and j using Eq. 4 (with labels i, j instead of 1, 2). Update ensembles i and j with the swapped states, or with the old states in case of a rejection. Return to step 4.
7. Select one of the free ensembles at random and assign the worker to that ensemble for performing a new standard move. Go to step 3.
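The steps above can be sketched as a small event-driven simulation. This is our own toy rendering (Gaussian toy ensembles, exponentially distributed move costs, and made-up function names), not the authors' implementation:

```python
import heapq
import math
import random

def rho(i, x):
    """Toy unnormalized density of ensemble i: a Gaussian with shifted mean."""
    return math.exp(-0.5 * (x - 0.5 * i) ** 2)

def run(n_ens=6, n_workers=3, p_re=0.9, n_moves=2000, seed=3):
    random.seed(seed)
    states = [0.5 * i for i in range(n_ens)]      # one state per ensemble (step 1)
    free = set(range(n_ens))                      # ensembles not occupied by a worker
    events = []                                   # (finish_time, worker, ensemble, proposal)
    t = 0.0

    def launch(worker, t_now):
        # Step 7: assign the worker to a random free ensemble.
        ens = random.choice(sorted(free))
        free.discard(ens)
        prop = states[ens] + random.uniform(-1.0, 1.0)    # symmetric proposal
        cost = random.expovariate(1.0 / (1.0 + ens))      # unequal CPU cost per ensemble
        heapq.heappush(events, (t_now + cost, worker, ens, prop))

    for w in range(n_workers):                    # step 2: K < N workers
        launch(w, t)
    for _ in range(n_moves):
        t, w, ens, prop = heapq.heappop(events)   # next worker to finish
        # Step 3: Metropolis accept/reject of the CPU-intensive move.
        if random.random() < min(1.0, rho(ens, prop) / rho(ens, states[ens])):
            states[ens] = prop
        free.add(ens)
        # Steps 4-6: replica exchange attempts among *free* ensembles only.
        while random.random() < p_re and len(free) >= 2:
            i, j = random.sample(sorted(free), 2)
            acc = (rho(i, states[j]) * rho(j, states[i])) / (
                rho(i, states[i]) * rho(j, states[j]))
            if random.random() < min(1.0, acc):
                states[i], states[j] = states[j], states[i]
        launch(w, t)                              # step 7
    return states
```

The priority queue plays the role of wall-time: workers finish at irregular moments, and only the ensembles without a pending worker participate in swaps.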
In this algorithm, ensembles are not updated in cohort as in standard replica exchange; instead, updates occur at irregular intervals. In addition, the different ensemble conditions can result in systematic differences in the number of states that are created over time. Proving that the above scheme actually samples the correct distributions requires a fundamentally new conceptual view, as the superstate picture is no longer applicable. Although the algorithm uses the same Eqs. 1 and 4 as one would use in standard replica exchange, it does not rely on Eqs. 2 and 3, which are no longer valid. In the Supplementary Information (SI) we provide a proof from the individual ensemble's perspective, in which the other ensembles provide an "environment" E that might, or might not, participate in the move of the ensemble considered. By doing so, we no longer require that the number of transitions from old to new, S^(o) → S^(n), is the same as from new to old, S^(n) → S^(o). Instead, by writing S = (s_1, E), from ensemble 1's perspective we require that the number of (s_1^(o), E) → (s_1^(n), aE^(n)) transitions equals the number of (s_1^(n), E) → (s_1^(o), aE^(n)) transitions when the standard move is applied, where aE^(n) refers to any new environment. The SI shows a similar detailed-balance condition for the replica exchange moves. At step 6 we sample only ensembles i and j; alternatively, all free ensembles can get a sample update. This would mean resampling the existing states of those not involved in a swap (a "null move"). This makes the approach more similar to the superstate sampling, albeit using only free ensembles, as described in the SI. The null move does not reduce the statistical uncertainty, but we mention it here as it makes it easier to explain the infinite swapping approach. For the detailed-balance conditions to be valid, however, it is imperative that occupied ensembles are not sampled.
An essential aspect of the efficiency of our algorithm is that the number of workers K is less than the number of ensembles N. The case K = N is valid but would reduce the number of replica exchange moves to zero, as at most one ensemble is free at any time. Reducing the K/N ratio will generally imply a higher acceptance of the replica exchange moves, as we can expect a higher number of free ensembles whose distributions have significant overlap. The optimum number of workers is therefore a non-trivial question that we will further explore in the Results and Discussion section. However, for the case K < N we can maximize the effect of the replica exchange moves by taking the P_RE parameter as high as possible. In fact, we can simulate the effect of the limit P_RE → 1 without having to do an infinite number of replica exchange moves explicitly. This leads to an infinite swapping [13] version of our algorithm.
Infinite swapping. If in the previously described algorithm we take P_RE = 1 − δ, we will loop through steps 4-6 for many iterations (n_it = Σ_{n=0}^{∞} n (1 − δ)^n δ = (1 − δ)/δ → 1/δ in the limit δ → 0) before getting to step 7. When δ vanishes and n_it becomes infinitely large, we expect that all possible swaps will be executed an infinite number of times. Since the swaps obey detailed balance between unoccupied ensembles, they will essentially sample the distribution of Eq. 3 (for the subset S* of unoccupied ensembles). Hence, when the loop is exited, each possible permutation σ ∈ S* has been sampled n_it × ρ(σ)/Σ_{σ′} ρ(σ′) times. By lumping together all the times that the same permutation was sampled and normalizing by division by n_it, we simply sample all the possible permutations in one go using fractional weights that sum up to 1. This is then the only sampling step, as the single update in step 3 can be skipped due to its negligible 1/n_it weight.
The idea of doing an "infinite number" of swapping moves has been proposed before [13-15], but here we give a different flavor to this approach by a convenient reformulation of the problem in terms of permanents, which allows us to beat the steep factorial scaling reported in earlier works [13]. The permanent formulation goes as follows. Suppose that after step 3 there are 4 free ensembles (we name them e_1, e_2, e_3, e_4) containing 4 states (s_1, s_2, s_3, s_4). Which state is in which ensemble after this step is irrelevant. We can now define a weight matrix W with elements W_ij ∝ ρ_j(s_i). Essential to our approach is the computation of the permanent of the W-matrix, perm(W), and that of the W{ij}-matrices in which row i and column j are removed.
The permanent of a matrix is similar to the determinant, but without alternating signs. We can hence write perm(W) = Σ_{j=1}^{4} W_1j perm(W{1j}). As the permanent of a 1 × 1 matrix is obviously equal to the single matrix value, the permanent of a matrix of arbitrary dimension could in principle be computed recursively using this relation. Based on the permanents of W, we will construct a probability matrix P:

where P_ij is the chance to find state s_i in ensemble e_j. As for each permutation each state is in one ensemble and each ensemble contains one state, the P-matrix is bistochastic: both the columns and the rows sum up to 1. If we denote by S*_ij the set of permutations in which state s_i is in e_j, we can write P_ij = Σ_{σ ∈ S*_ij} ρ(σ) / Σ_{σ ∈ S*} ρ(σ). We can, however, also use the permanent representation, in which

P_ij = W_ij perm(W{ij}) / perm(W).  (5)

So far we have not won anything, as computing the permanent via the recursive relation mentioned above still has factorial scaling. The Gaussian elimination approach, which allows an order O(n^3) computation of determinants of n × n matrices, does not work for permanents, as only some, but not all, row and column operations have the same effect on a permanent as on a determinant. One can, for instance, swap rows and columns without changing the permanent. Multiplying a row by a nonzero scalar multiplies the permanent by the same scalar; hence, this will not affect the P-matrix based on Eq. 5. Unlike for the determinant, however, adding or subtracting a scalar multiple of one row to another, an essential part of the Gaussian elimination method, does change the permanent. This makes the permanent computation of a large matrix excessively more expensive than the computation of a determinant. Yet, recent algorithms based on the Balasubramanian-Bax-Franklin-Glynn (BBFG) formula [16-18] scale as O(2^n). This means that the computation of the full P-matrix scales as O(2^n × n^2), which seems still steep but is nevertheless a dramatic improvement compared to factorial scaling. For a target time of 1 second, for instance, we could only run the algorithm up to N = 7 in the factorial approach, while we reach N = 12 with the BBFG method using a mid-to-high-end laptop (DELL XPS 15 with an Intel Core i7-8750H). If a matrix size of N = 20 is the target, the BBFG method can perform a full P-matrix determination in ∼711 seconds, while it would take ∼15.3 × 10^6 years in the factorial approach. The BBFG method is the fastest
completely general solution for the problem of computing a P-matrix from an arbitrary W-matrix. For several algorithms, however, the W-matrix has special characteristics that can be exploited to further increase efficiency. For instance, if by shuffling rows and columns the W-matrix can be brought into a block form, where square blocks on the diagonal have only zeros to their right and above, the permanent is equal to the product of the blocks' permanents. For instance, if W_14 = W_24 = W_34 = 0, we have two blocks, 3 × 3 and 1 × 1. If W_13 = W_14 = W_23 = W_24 = 0, we can identify two blocks of 2 × 2, etc. Identification of blocks can hugely decrease the cost of computing a large permanent. Another speed-up can be made if each row of the W-matrix is a sequence of ones followed by zeros, or can be brought into that form by the previously mentioned column and row operations. This makes an order O(n^2) approach possible. We will further discuss this in the section Application: ∞RETIS.
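A direct transcription of the Glynn (BBFG) formula might look as follows; this is a sketch without the Gray-code optimization, so each of the 2^(n−1) terms costs O(n^2) work, but it already replaces the n! permutations of the naive expansion:

```python
from itertools import product

def permanent(a):
    """Permanent via the Glynn (BBFG) formula:
    perm(A) = 2^(1-n) * sum over delta in {+1,-1}^n with delta_1 = +1 of
              (prod_k delta_k) * prod_j (sum_i delta_i * A[i][j])."""
    n = len(a)
    if n == 0:
        return 1.0
    total = 0.0
    for tail in product((1.0, -1.0), repeat=n - 1):
        d = (1.0,) + tail                 # fix delta_1 = +1
        sign = 1.0
        for dk in d:
            sign *= dk
        term = 1.0
        for j in range(n):                # product over columns of signed row sums
            term *= sum(d[i] * a[i][j] for i in range(n))
        total += sign * term
    return total / 2 ** (n - 1)
```

For example, perm([[1, 2], [3, 4]]) = 1·4 + 2·3 = 10, whereas the determinant would be −2.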
The infinite swapping approach changes the aforementioned algorithm from step 3 onward:

3. If the worker is finished with its MC move in a specific ensemble, the new state is accepted or rejected (but not yet sampled) according to Eq. 1. The ensemble is then free.
4. Determine the W-matrix based on all unoccupied ensembles, calculate the P-matrix based on Eq. 5, and update all the unoccupied ensembles by sampling all free states with the fractional probabilities corresponding to the columns of the P-matrix.
5. Randomly pick one of the free ensembles e_j.
6. Pick one of the available states (s_1, s_2, ...) via a weighted random selection in which state s_i has probability P_ij of being selected.
7. The worker is assigned to perform a new standard move in ensemble e_j based on the previous state s_i. Go to step 3.
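Step 4 can be sketched as follows, computing Eq. 5 naively via minors. The recursive permanent is used here only for readability (it has factorial scaling), so this sketch is practical only for small matrices; the names are our own.

```python
def perm(m):
    """Permanent by expansion along the first row (like a determinant, no signs)."""
    if not m:
        return 1.0
    return sum(m[0][j] * perm([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)) if m[0][j])

def p_matrix(w):
    """P_ij = W_ij * perm(W with row i and column j removed) / perm(W)  (Eq. 5)."""
    n, total = len(w), perm(w)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        minor_rows = w[:i] + w[i + 1:]
        for j in range(n):
            if w[i][j]:
                minor = [row[:j] + row[j + 1:] for row in minor_rows]
                p[i][j] = w[i][j] * perm(minor) / total
    return p
```

A useful sanity check is bistochasticity: every row and every column of the resulting P-matrix must sum to 1.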

APPLICATION: ∞RETIS
Replica exchange transition interface sampling (RETIS) [19, 20] is a quantitative path sampling algorithm in which the sampled states are short molecular trajectories (paths) with certain start- and end-conditions, and a minimal progress condition. New paths are generated by a Monte Carlo move in path space, such as the shooting move [21], in which a randomly selected phase point of the previous path is randomly modified and then integrated backward and forward in time by means of molecular dynamics (MD). The required minimal progress increases with the rank of the ensemble, such that the final ensemble contains a reasonable fraction of transition trajectories. The start- and end-conditions, as well as the minimal progress, are administered by the crossings of interfaces (λ_0, λ_1, ..., λ_M) with λ_{k+1} > λ_k, which can be viewed as non-intersecting hypersurfaces in phase space having a fixed value of the reaction coordinate. An MC move that generates a trial path not fulfilling the path ensemble's criteria is always rejected. RETIS defines different path ensembles based on the direction of the paths and the interface that has to be crossed, but all paths start by crossing λ_0 (near the reactant state, state A) and end by either crossing λ_0 again or reaching the last interface λ_M (near the product state, state B). There is one special path ensemble, called [0−], that explores the left side of λ_0, the reactant well, while all other path ensembles, called [k+] with k = 0, 1, ..., M−1, start by moving to the right from λ_0 and reach at least λ_k.
A central concept in RETIS is the so-called overall crossing probability: the chance that a path that crosses λ_0 in the positive direction reaches λ_M without recrossing λ_0. It provides the rate of the process when multiplied by the flux through λ_0 (obtained from the path lengths in [0−] and [0+] [20]) and is usually an extremely small number. The chance that any of the sampled paths in the [0+] path ensemble crosses λ_M is generally negligible, but a decent fraction of them (∼0.1-0.5) will cross λ_1, and some even λ_2. Likewise, paths in the [k+], k > 0, path ensembles have a much higher chance to cross λ_{k+1} than a [0+]-path, as they already cross λ_k. This leads to the calculation of M local conditional crossing probabilities, the chance to cross λ_{k+1} given that λ_k was crossed, for k = 0, 1, ..., M−1, whose product gives an exact expression for the overall crossing probability at an exponentially reduced CPU cost compared to direct MD.
The efficiency is further hugely improved by executing replica exchange moves between the path ensembles. These swaps are essentially cost-free, since there is no need to simulate additional ensembles that are not already required. An accepted swapping move in RETIS provides new paths in two ensembles without the expense of having to do MD steps. The enhancement in efficiency is generally even larger than one would expect based on those arguments alone, as path ensembles higher up the barrier provide a similar effect as the high-temperature ensembles in parallel tempering. In addition, point-exchange moves between [0−] and [0+] are performed by exchanging the end- and start-points of these paths, which are then continued by MD on the opposite side of the λ_0 interface.
While TIS [22] (without replica exchange) can run all path ensembles embarrassingly parallel, the RETIS algorithm increases the CPU-time efficiency but is difficult to parallelize, and open-source path-sampling codes, like OpenPathSampling [23] and PyRETIS [24], implement RETIS as a fully sequential algorithm. The path length distributions are generally broad, with an average path length that increases with the ensemble's rank. This becomes increasingly problematic the more ensembles one has, as they all have to wait for the slowest ensemble. This means that while RETIS will give the best statistics per CPU-hour, it might not give the best statistics in wall-time. With the continuous increase in computing power, trading some CPU-time efficiency for wall-time efficiency, i.e. getting the answer faster while spending more CPU cycles, might be preferable. Our parallel scheme can effectively deal with the unequal CPU cost of the replicas, which allows us to increase the wall-time efficiency with no or minimal reduction in CPU-time efficiency.
The W-matrix in RETIS. If there are M + 1 interfaces, λ_0, λ_1, ..., λ_M, there are also M + 1 path ensembles. The largest W-matrix that can occur is therefore of size (M + 2 − K) × (M + 2 − K), as swappings are executed when 1 of the K workers is free, while the remaining K − 1 workers occupy path ensembles that are locked and do not participate in the swap. The smallest matrix occurs when one worker occupies both [0−] and [0+] during the point-exchange move, as described in the simulation methods.
Paths can be represented by a sequence of time slices, the phase points visited by the MD trajectory. For a path of length L + 1, X = (x_0, x_1, ..., x_L), the plain path probability density ρ(X) is given by the probability of the initial phase point times the dynamical transition probabilities to go from one phase point to the next:

ρ(X) = ρ(x_0) ∏_{i=1}^{L} p(x_{i−1} → x_i).

Here, the transition probabilities p(x → x′) depend on the type of dynamics (deterministic, Langevin, Nosé-Hoover dynamics, etc.). The weight of a path within a specific path ensemble, ρ_j(X), can be expressed as the plain path density times the indicator function 1_{e_j} and possibly an additional weight function w_j(X): ρ_j(X) = ρ(X) × 1_{e_j}(X) × w_j(X). The indicator function equals 1 if the path X belongs to ensemble e_j; otherwise it is 0. The additional weight function w_j(X) is part of the high-acceptance protocol that is used in combination with the more recent path-generation MC moves such as stone skipping [25] and wire fencing [26]. Using these "high-acceptance weights", nearly all the CPU-intensive moves can be accepted, as they are tuned to cancel the P_gen-terms in the Metropolis-Hastings scheme, Eq. 1, and the effect of the non-physical weights is undone in the analysis by weighting each sampled path with the inverse of w_j(X).
While the path probability ρ(s_i = X) is difficult to compute, determining 1_{e_j}(s_i) and w_j(s_i) is trivial. It is therefore a fortunate coincidence that we can replace W_ij = ρ_j(s_i) with

W_ij = 1_{e_j}(s_i) × w_j(s_i),

because the P-matrix does not change if we divide or multiply a row by the same number, as mentioned in the Methods section. Except for [0−], all path ensembles have the same start and end condition and differ only with respect to the interface-crossing condition. A path that crosses interface λ_k automatically crosses all lower interfaces λ_{l<k}. Conversely, if the path does not cross λ_k, it will not cross any of the higher interfaces λ_{l>k}. This implies that if the columns of W are ordered such that the 1st column (e_1) corresponds to the first available ensemble in the sequence [0−], [0+], [1+], ..., the 2nd column (e_2) to the second available ensemble, etc., most rows will end with a series of zeros.
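As an illustrative sketch (with our own helper names, and with the ensemble membership test reduced to a single crossing condition, ignoring start- and end-point checks), such a W-matrix could be assembled as:

```python
def crosses(path_lambdas, lam):
    """Indicator for a [k+]-style condition: the path reaches order parameter lam.
    Start/end conditions are omitted here for brevity."""
    return max(path_lambdas) >= lam

def w_matrix(paths, interfaces, weights=None):
    """W_ij = 1_{e_j}(s_i) * w_j(s_i): cheap to evaluate, unlike rho(X) itself.
    paths: list of order-parameter sequences (one per free state);
    interfaces: lambda_k of the free [k+] ensembles, in increasing order;
    weights: optional high-acceptance weight functions, one per ensemble."""
    if weights is None:
        weights = [lambda path: 1.0] * len(interfaces)
    return [[(1.0 if crosses(p, lam) else 0.0) * wf(p)
             for lam, wf in zip(interfaces, weights)]
            for p in paths]
```

With the ensembles ordered by increasing interface, each row of the result is a block of nonzero entries followed by zeros, which is exactly the structure exploited below.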
Reordering the rows with respect to the number of trailing zeros almost always ensures that the W-matrix can be brought into a block form such that the permanent can be computed faster from smaller matrices. In particular, if [0−] is part of the free ensembles, it will always form a 1 × 1 block, as there is always one, and no more than one, available path that fits in this ensemble.
If high acceptance is not applied, we have w_j(X) = 1, and each row in the W-matrix (after separating the [0−] ensemble if it is part of the free ensembles) is a sequence of ones followed by zeros. The W-matrix can hence be represented by an array (n_1, n_2, n_3, ..., n_n), where each integer n_i indicates the number of ones in row i. As we show in the SI, the permanent of such a W-matrix is simply the product of the factors (n_i + 1 − i): perm(W) = ∏_i (n_i + 1 − i). Further, the P-matrix can be constructed with the following order O(n^2) method.
The first step is to order the rows of the W-matrix such that n_1 ≤ n_2 ≤ ... ≤ n_n. We then fill in the P-matrix from top to bottom, row by row, using

P_ij = W_ij × (1 − Σ_{k<i} P_kj) / (n_i + 1 − i).

The approach is extremely fast and allows the computation of P-matrices from large W-matrices, up to several thousands of rows and columns, within a second of CPU time. The above method applies whenever the rows of the W-matrix can be transformed into sequences of ones followed by zeros. Besides RETIS without high acceptance, this applies to other MC methods like subset sampling [27] or umbrella sampling [28] with semi-infinite rectangular windows.
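A sketch of this special case (our own code; the row-by-row fill rule implemented here is a reconstruction that we verified against the permanent-of-minors definition on small matrices):

```python
def staircase_permanent(ones_per_row):
    """perm(W) = prod_i (n_i + 1 - i), with rows sorted ascending and i 1-based,
    for a 0/1 W-matrix whose row i is n_i ones followed by zeros."""
    result = 1
    for i, ni in enumerate(sorted(ones_per_row), start=1):
        result *= max(ni + 1 - i, 0)
    return result

def staircase_p_matrix(ones_per_row):
    """O(n^2) P-matrix for the same staircase structure. Requires n_i >= i after
    sorting (otherwise the permanent vanishes and no matching exists). Fills
    P top to bottom: P_ij = (1 - sum_{k<i} P_kj) / (n_i + 1 - i) for j <= n_i."""
    ns = sorted(ones_per_row)
    n = len(ns)
    col_used = [0.0] * n               # running column sums: sum_{k<i} P_kj
    p = [[0.0] * n for _ in range(n)]
    for i, ni in enumerate(ns):        # i is 0-based, so the divisor is n_i - i
        slots = ni - i
        if slots <= 0:
            raise ValueError("permanent is zero: no complete matching exists")
        for j in range(ni):
            p[i][j] = (1.0 - col_used[j]) / slots
            col_used[j] += p[i][j]
    return p
```

For the array (2, 3, 3), for instance, perm(W) = 2 · 2 · 1 = 4, and the resulting P-matrix is bistochastic, as required.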

RESULTS AND DISCUSSION
To test our algorithms we ran 3 types of simulations. First, a memoryless single-variable stochastic (MSVS) process was simulated in order to mimic a RETIS simulation in which the average path length increases linearly with the rank of the ensemble. A "path" is created by drawing 2 random numbers, where the first determines how much progress a path makes and the second determines the path length. These two outcomes are variable and depend on the rank of the ensemble, such that the fictitious path in ensemble [k+] has a 0.1 probability to cross λ_{k+1} and has an average path length of approximately k/10 seconds (see the Materials and Methods section). The worker is paused for a number of seconds equal to the path length before it can participate in replica moves, to mimic the time it would take to do all the necessary MD steps. While this artificial simulation allows us to investigate the potential strength of the method to tackle extremely rare events, it cannot reveal the effect of correlations between accepted paths when fast exploration of the directions orthogonal to the reaction coordinate is crucial. To analyze this effect, we also ran a 2D membrane permeation system with two slightly asymmetric channels [29]. Lastly, to study our algorithm with a more generic W-matrix that needs to be solved via the BBFG formula, we also ran a set of underdamped Langevin simulations of a particle in a double-well potential [30] using the recent wire fencing algorithm with the high-acceptance protocol [26]. All simulations were performed using 5 independent runs of 12 hours. Errors were based on the standard deviations of these 5 simulations, except for the MSVS process, where a more reliable statistical error was desired for the comparison with analytical results. Here, block errors were determined for each of the five simulations based on the running average of the overall crossing probability. The block errors were finally combined to obtain the statistical error in the average of the five
simulations.

Memoryless single variable stochastic (MSVS) process
Table I reports the overall crossing probabilities and their statistical errors for a system with 50 interfaces and 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 workers. All values are within a 50% deviation from the true value of 10^{−50}, with the more accurate estimates obtained for the simulations with a large number of workers. Also, the true value is within one standard deviation of the reported averages for 70% of the data points, as expected from standard Gaussian confidence intervals. Figure 1a shows the scaling of the MD time (solid lines) and the number of MC moves (dashed lines) of the MSVS simulations (orange) compared to linear scaling (black) and to the expected scaling for standard replica exchange (REPEX), in which ensembles are updated in cohort (purple).
Whereas the number of "MD steps" and MC moves quickly levels off to a nearly flat plateau in the standard approach, due to workers being idle while they wait for the slowest worker, the replica exchange approach developed in this article shows perfect linear scaling with respect to the MD time. The number of MC moves in the new method shows even better than linear scaling, because the ensembles with shorter "path lengths" are simulated relatively more often when there are more workers, resulting in more MC moves per second. This in itself does not necessarily mean that the simulations converge much faster, because the additional computational effort may not be targeted at the sampling where it is needed. If we neglect the fact that the path ensemble simulations are correlated via the replica exchange moves, the relative error in the overall crossing probability ε follows from the relative errors ε_i in each path ensemble via

ε^2 ≈ Σ_i ε_i^2.

It is hence clear that additional computational power should not aim to lower the error in a few path ensembles whose errors were already low compared to the other path ensembles. We therefore measure the effectiveness of the additional workers by calculating computational efficiencies. The efficiency of a specific computational method is here defined as the inverse of the computer time, CPU- or wall-time, needed to obtain an overall relative error equal to 1: ε = 1.
In figure 1d the efficiencies based on wall-time (solid) and CPU-time (dashed) are plotted for the MSVS process. These plots depend on the ability to compute reliable statistical errors in the overall crossing probability, which is an extremely small number, 10^{−50}. The somewhat fluctuating behavior of these curves should hence be viewed as statistical noise, as the confidence interval of these efficiencies depends on the statistical error of this error. Despite that, clear trends can be observed: the CPU-time efficiency is more or less flat, while the wall-time efficiency shows an upward trend. If we neglect the effect of replica exchange moves on the efficiency, we can relate these numerical results to theoretical ones [20, 31] for any possible division of a fixed total CPU-time over the different ensembles. A common-sense approach would be to aim for the same error ε_i in each ensemble (which implies doing the same number of MC moves per ensemble) or to divide the total CPU-time evenly over the ensembles. These two strategies correspond to the cases K = 1 (standard RETIS) and K = N (standard TIS), respectively. Ref.
[31] showed that these two strategies provide the same efficiency, and in the SI we derive that this leads to a wall-time efficiency as a function of the number of workers K equal to K/56250, which is the continuous purple line in figure 1d. The optimum division, however, would give a slightly better wall-time efficiency equal to K/50000, which is the continuous black line in this figure. Also shown in figure 1d are the expected theoretical efficiencies based on the numerical distribution of MC moves in each ensemble. This hybrid numerical/theoretical result is shown by the small purple dots. It shows that ∞RETIS, at least for a system in which the path length grows linearly with the ensemble's rank, naturally provides a division of the computational resources that is even better than TIS (K = N) or RETIS (K = 1). Yet, due to statistical inaccuracies, this is only evident for the K = 15 case. The best wall-time efficiency is obtained for the case K = N, which is essentially equivalent to running independent TIS simulations (i.e. without doing any replica exchange moves). We do not expect this to apply to more complex systems, where the replica exchange move is a proven weapon for efficient sampling.

Two-channel simulations
In the middle column of table I we report the calculated crossing probabilities and permeabilities of 5 simulations for every number of workers. All values are somewhat higher than, though still in good agreement with, the previous simulation from Ref. [29]. We also evaluated the approximate result based on Kramers' theory (see SI), which seems to confirm the results obtained in this paper.
Figure 1b shows the scaling of the MD time (solid lines) and number of MC moves (dashed lines) of the two-channel simulations (blue) compared to linear scaling (black). We see a slightly worse than linear scaling of the MD time, which might just be due to a small positive fluctuation of the 1-worker data point. We also see a similar more than linear scaling in the number of MC moves as with the MSVS simulations, for the same reason. In figure 1e the efficiencies based on wall-time (solid) and CPU-time (dashed) are plotted for the two-channel system. The CPU-time efficiency is more or less flat until 8 workers, after which it starts to drop off. The wall-time efficiency shows an upward trend until 10 workers, after which it starts to drop off as well. We assign this drop to the reduction of replica exchange moves, which are an essential aspect of sampling this system efficiently [29]. This is tangible from figure S1 in the SI, where we plot the fraction of trajectories passing through λ_{M-1} that are in the lower barrier channel. While from the average fraction it still looks like the simulations sampled both channels for any number of workers, 4 out of the 5 simulations in the K = N = 12 case solely visited one of the two channels. This is in agreement with previous TIS results [29]. The K = 11 case already provides a dramatic improvement, but is still expected to be suboptimal due to the relatively low frequency of replica exchange moves compared to K < 11. As reported in Ref. [29], this fraction requires many MC moves to converge to the theoretical value of 0.71 without the added MC moves introduced in that paper. We did not simulate with these added moves and thus see the same slow convergence for all of our simulations. This 2D system suggests that having half as many workers as ensembles is a safe bet for optimum efficiency.

Double well 1D barrier using wire fencing
In the right column of table I we report the calculated crossing probabilities and rates for the underdamped Langevin particle in the 1D double well potential. All simulations are in reasonable agreement with each other and with the results of Refs. [26] and [30], as well as with the approximate value based on Kramers' theory. However, while these results confirm the soundness of the method, the scaling and efficiency are less convincing. Figure 1c shows a significantly worse than linear scaling. On further inspection we found that the average time per MC move was significantly smaller than our infinite-swapping goal of 1 s when the simulation was run with more than 2 workers. This results in a bottleneck on how many MC moves can be started per second, which is the reason for the observed bad scaling. The scaling is still slightly positive instead of flat, as the infinite swapping procedure becomes quicker with more workers due to the smaller W-matrix.

The same bottleneck can be seen in figure 1f, where both efficiencies plummet with more than 2 workers. The reported scaling deficiency is of little significance for actual molecular systems, where the creation of a full path takes minutes to hours rather than subseconds.

FIG. 1. The average scaling of the total MD time (cumulative time spent by all the workers; solid) and MC moves (dashed) (a-c), and the wall-time (solid) and CPU-time (dashed) efficiencies (d-f), for each number of workers. This is shown for the memoryless single variable stochastic (MSVS) process (a, d, orange), the two-channel system (b, e, blue), and the double well with wire fencing (c, f, green). Each of the data points is based on 5 independent simulations. For the scaling plots, the black lines are guides for linear scaling from the 1-worker data point. The purple lines in the scaling plot for the MSVS simulations (a) show what the scaling would be if we had to wait for the slowest ensemble to finish for each MC move. The black line, purple line, and points in the efficiency plot of the MSVS process (d) show the optimal, optimal TIS/RETIS, and hybrid wall-time efficiency, respectively, as computed in the SI.

CONCLUSIONS
We developed a new generic replica exchange method that can effectively deal with MC moves of varying CPU cost, for instance due to the algorithmic complexity of the MC moves. An essential aspect of the method is that the number of workers, which execute the ensembles' specific MC moves in parallel, is less than the number of ensembles. Once a worker has finished its move, replica exchange moves are carried out solely between those ensembles that are not occupied by a worker. This implies that the ensembles are updated at irregular intervals and that a different number of MC moves is executed for each ensemble. As a result, the conceptual viewpoint in which the set of replicas is viewed as a single superstate is no longer valid, and the existence of some kind of detailed-balance relation is no longer trivial. To prove the exactness of our approach, we introduced new conceptual views on the replica exchange methodology that differ from the common superstate principle. Instead, we show that the distributions in the new approach are conserved for each ensemble individually via a twisted detailed-balance relation in which the other ensembles constitute an environment that is potentially actively involved in the MC move of the ensemble considered. In addition, the method can be combined with an infinite swapping approach without the factorial scaling, based on a mathematical reformulation using permanents.
We applied the novel replica exchange technique to a path sampling algorithm, RETIS, which is a prototype of an algorithm in which the costs of a Monte Carlo move can vary enormously. The resulting new path sampling algorithm, coined ∞RETIS, was thereafter tested on three model systems. The results of these simulations show that the number of MD steps increases linearly with the number of workers invoked, as long as the ensemble's MC move has a higher computational cost than the replica exchange move carried out by the scheduler. The number of executed MC moves shows an even better than linear scaling. Moreover, the efficiency increases linearly with the number of workers for a low-dimensional system in which replica exchange has little effect, while it has an optimum in more complex systems, as the number of successful replica exchange moves decreases when the number of workers is close to the number of ensembles.
In summary, the replica exchange method discussed in this paper has a clear potential to accelerate present path sampling simulations, but it can also be combined with many other complex algorithms, including those that are yet to be invented. With the continuing trend to run progressively more massively parallel computing jobs, our algorithm is likely to gain importance and will open up new avenues in the field of molecular simulations and beyond.

Simulation methods
The implementation of ∞RETIS was structured as follows. We start 1 ≤ K ≤ N worker processes and 1 scheduler process. Each of the worker processes executes ensemble-specific MC moves, while the scheduler process performs all the replica exchange moves and submits new jobs to the workers. All ensembles/trajectories that are currently being updated by a worker process are not considered for MC moves by the scheduler, essentially being 'locked'. This means that no data is written for those ensembles and that they are not valid targets for swapping moves. After a worker is done, it submits the result to the scheduler; the scheduler then unlocks the returned ensemble/trajectory and executes the replica exchange moves on all ensembles/trajectories that are not locked. It then submits a new job to the freed worker for performing a new MC move in a randomly chosen free ensemble (or two ensembles in case of a point exchange move) and locks the involved ensembles/trajectories.
In the ∞RETIS method there are two kinds of ensemble moves that involve MD steps. The first is the shooting move (either standard shooting [21] or the more recent sub-trajectory moves [25, 26]), in which a new path is generated from an old path within a single ensemble. The second is the point exchange move between [0-] and [0+]. If a worker is assigned to this task, both [0-] and [0+] are occupied by this worker. The scheduler ensures that there is never more than 1 worker considered free at a given time. When the free worker is assigned to perform a new MC move, each of the ensembles has an equal probability to be selected. If [0+] or [0-] is selected and the other is also free, there is a 50% chance to perform a [0-] ↔ [0+] point exchange move instead of a shooting move in the selected ensemble.
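For illustration, the locking logic described above can be sketched as a small event-driven toy model in Python. All names and the cost model are hypothetical; a real implementation would launch separate worker and scheduler processes rather than simulate them serially.

```python
import heapq
import random

def infretis_schedule(n_ensembles, n_workers, n_moves, seed=0):
    """Toy sketch of the scheduler: K workers run costly ensemble moves;
    an ensemble is 'locked' while a worker updates it, and only unlocked
    ensembles would take part in replica exchange moves."""
    rng = random.Random(seed)
    locked = set()
    updates = [0] * n_ensembles        # number of MC updates per ensemble
    events = []                        # heap of (finish_time, worker, ensemble)

    def submit(worker, now):
        free = [e for e in range(n_ensembles) if e not in locked]
        ens = rng.choice(free)         # pick a random free ensemble
        locked.add(ens)                # invisible to swaps until it returns
        cost = rng.uniform(0.1, 1.0) * (ens + 1)   # unequal move costs
        heapq.heappush(events, (now + cost, worker, ens))

    for w in range(n_workers):
        submit(w, 0.0)

    for _ in range(n_moves):
        clock, worker, ens = heapq.heappop(events)  # next worker to finish
        locked.discard(ens)            # result returned: unlock and count it
        updates[ens] += 1
        # a real implementation would now perform replica exchange moves
        # among the currently unlocked ensembles
        submit(worker, clock)          # keep the worker busy

    return updates, locked

updates, locked = infretis_schedule(n_ensembles=8, n_workers=3, n_moves=500)
```

Because cheap ensembles return more often, they accumulate more updates; this is exactly the irregular updating pattern that the detailed-balance analysis in the SI has to accommodate.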

Memoryless single variable stochastic (MSVS) process
No actual MD is run for the MSVS simulations. Instead, we directly sample two random numbers, r1 and r2, from a uniform distribution ∈ [0, 1) to set the path's progress and the path length. A path in ensemble [k+] is assumed to cross interface λ_{k+l} if r1 < (0.1)^l. After this, we wait a random time, t = 0.2 r2 k + 0.1 seconds. This was done to simulate both the increasing average simulation time and the increasing variance for outer ensembles. This setup has no history dependence, which allows us to compute the theoretical values shown in figure 1. 5 independent ∞RETIS simulations were run with 1, 5, 10, 15, . . ., 45, 50 workers.
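As a sketch, the MSVS move described above can be written in a few lines (function and variable names are ours, not those of the actual implementation):

```python
import random

def msvs_move(k, rng):
    """One MC move in ensemble [k+] of the memoryless single-variable
    stochastic process: r1 sets the path's progress, r2 its artificial cost."""
    r1, r2 = rng.random(), rng.random()
    # the path is assumed to cross interface lambda_{k+l} if r1 < 0.1**l
    crossings = [k + l for l in range(1, 20) if r1 < 0.1 ** l]
    wait = 0.2 * r2 * k + 0.1          # simulated move duration in seconds
    return crossings, wait

# the local crossing probability of the next interface should be close to 0.1
rng = random.Random(42)
frac = sum(1 for _ in range(50000) if msvs_move(5, rng)[0]) / 50000
```

The construction makes every interface crossing an independent factor-10 event, which is what makes the theoretical efficiency analysis in the SI tractable.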

Double channel simulations
In order to investigate the effect of our algorithm on the ergodicity of the sampling, a 2D two-channel simulation was run as described in reference [29]. The new RETIS moves introduced in that paper (the mirror move and the target-swap move) were not used. Instead, MD was only run to perform shooting moves or the [0-] ↔ [0+] point exchanges. As the MD for this system completed too quickly, every worker was set to wait 9 times the time it took to run the MD before returning the result. 5 independent ∞RETIS simulations were run with 1, 2, . . ., 11, 12 workers.

1D double well with wire fencing
In order to investigate the accuracy with a W-matrix that contains numbers other than 0s and 1s, we simulated a 1D double-well system [30] together with the high-acceptance version of a novel path-sampling algorithm, wire fencing. The algorithm is described in reference [26], but for us the relevant part is that the high-acceptance weight is the number of frames that a path has outside the interface for each ensemble, times an extra factor 2 if the path ends at the last interface. As for the two-channel system, a worker was set to wait 9 times the time it took to complete the MD move before returning the result. 5 independent ∞RETIS simulations were run with 1, 2, . . ., 7, 8 workers, with interfaces placed at [-0.99, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, 1.0].

This Supplementary Information contains the following data, derivations, and numerical examples. In Sec. II, we provide a proof that the replica exchange method with cost-unbalanced replicas conserves the equilibrium distribution at the individual ensemble level. Instead of the superstate principle, the derivation is based on the individual ensemble's perspective, where the other ensembles serve as an environment, which finally leads to a twisted detailed-balance relation. In Sec. III, we show an O(n^2) algorithm for computing the P-matrix from a W-matrix for the case that the W-matrix consists of rows with a series of ones followed by zeros. This is the type of matrix that is relevant for RETIS simulations based on the standard shooting move. Sec. IV presents the derivations of the theoretical results on the crossing probabilities, rate constant, and permeability via Kramers' theory that are shown in table 1 of the main article. In Sec. V the computational efficiencies, including the derivations of the most optimal efficiencies, are discussed. Finally, in Sec. VI we provide some additional simulation results on the relative transition probabilities through the lower and higher barrier channels.

II. DETAILED-BALANCE RELATIONS
In this section, we will derive detailed-balance relations for parallel replicas that are not based on the common superstate viewpoint. These alternative relations can be used to validate the replica exchange algorithm for replicas with unequal CPU cost. Our derivation is based on the finite swapping approach, though the infinite swapping version follows automatically when the probability to perform a swap goes to unity (P_RE → 1), as explained in the main text. To simplify matters, we assume that we have one type of replica exchange move that is low in CPU cost and one type of ensemble move that operates within one ensemble and has a high CPU cost. The relations that we derive are, however, by no means limited to that. In fact, in the RETIS algorithm there is also a point exchange move between the [0-] and [0+] ensembles. In previous publications this move, annotated as [0-] ↔ [0+], was categorized as a special type of swapping/replica exchange move. In this article we reserve the names swap and replica exchange for an operation that involves the swapping of full paths, which does not require any MD steps. In contrast, the [0-] ↔ [0+] point exchange implies the exchange of time slices at the end and start of the paths, which are then extended at the other side of the λ0 interface. In our implementation, this [0-] ↔ [0+] move is carried out by a single worker that locks both the [0-] and [0+] ensembles during the move. As the [0-] paths can never be swapped with any of the other paths, we can view the point exchange move as an ensemble move in ensemble [0+].
As explained in the main article, the replica exchange algorithm that we propose is based on a set of workers and a set of ensembles. The number of workers K is less than the number of ensembles N. Most of the time, a worker is performing a CPU-intensive single-ensemble move. The ensemble in which the worker operates is considered occupied/locked. Once a worker has completed a CPU-intensive move, the move is accepted or rejected, after which either a replica exchange move is carried out with any of the unoccupied ensembles or the worker is assigned to a new single-ensemble move in a randomly picked free ensemble.
In order to indicate the difference between occupied and unoccupied ensembles, we introduce a new state vector that indicates both the available ensembles, as in the main text, and the occupied ensembles with a bar, e.g. S = (s1, s2, s̄3, s4, s̄5) to show that there are 5 ensembles of which ensembles 3 and 5 are occupied by a worker. For both occupied and unoccupied ensembles, the s_i terms reflect the most recent state that was sampled in the i-th ensemble. Now our sole aim is to ensure that, if we just count the instances at which an ensemble i is updated with a new sample (which could be a copy of the previous sample in case of a rejected move), these samples are distributed according to the correct probability density ρi. It is important to note that the time between two updates can vary and depends on the state that was most recently sampled. However, the waiting time between an update of a specific ensemble and the point in time at which this ensemble gets occupied by a worker depends on the states of all other ensembles, but not on the state of the ensemble considered. Since the ensembles are independent, this waiting time will be the same on average irrespective of this sampled state. This has as a consequence that, if we take "photographs" of the state vector, at intervals or randomly, evenly distributed over time, we should again obtain the correct distribution ρi, for all i, of the states in ensemble i, as long as we ignore the instances at which this ensemble is occupied. In other words, we can write for the previous example state vector

ρ(S) = ρ1(s1) ρ2(s2) ρ3^u(s3) ρ4(s4) ρ5^u(s5), (1)

where ρi(•) is the statistically correct distribution of ensemble i, and ρj^u(•) an unknown distribution for occupied ensemble j that has no clear physical interpretation. For instance, it can happen that a state s is relatively unlikely to exist in ensemble i, low ρi(s), but that any MC move starting from that state takes a very long time, resulting in a high ρi^u(s). Now, let's consider the Markov chain from the perspective of ensemble 1, where we monitor its state at the point that a new MC move is initiated from an old state s1. From the viewpoint of ensemble 1, the other ensembles are viewed as an "environment" (E = (s2, s̄3, s4, s̄5) in the aforementioned example) that might or might not influence the MC move. The probability of state s1 in ensemble 1 can be written as an integral of the conditional probability given an environment:

ρ1(s1) = ∫ dE ρ1(s1|E) ρ(E). (2)

As the ensembles are independent we can write

ρ1(s1|E) = ρ1(s1), (3)

but we temporarily keep the condition to clarify the logical structure of the upcoming derivation. As stated, we assume that we employ two types of moves: 1) a CPU-intensive move that modifies s1 without using the environment E, and 2) a swapping move. In addition, the environment might influence the relative selection probabilities for choosing either 1) or 2). Typically, this selection probability will depend on Na(E), the number of unoccupied ensembles in E. Further, we need to keep in mind that during the execution of the MC move in ensemble 1, the environment changes. How much the environment changes will depend on how long it takes to fully execute the move involving ensemble 1.
To derive detailed-balance relations for the replica exchange method for cost-unbalanced ensembles, we start with the more general balance concept: if we have an infinite number of states distributed according to the equilibrium distribution, all of which make an MC move at the same time, then we have to obtain the equilibrium distribution again. This means that the flux out of s1 should be equal to the flux into s1, which can be written as

∫ ds′1 dE dE′ ρ1(s1|E) ρ(E) π(s1, E → s′1, E′) = ∫ ds′′1 dE′′ dE′′′ ρ1(s′′1|E′′) ρ(E′′) π(s′′1, E′′ → s1, E′′′). (4)

The transition probability π(•) can be split into the transitions via the different types of moves (which we will indicate with the Greek letter α), which are selected with a probability P_sel_α(E) that can depend on the environment E:

π(s1, E → s′1, E′) = Σ_α P_sel_α(E) π_α(s1, E → s′1, E′). (5)

This reveals another complicating factor, as in standard detailed balance we need to consider the probability that the exact reverse move will be executed once the new state has been established. However, as the environment could have changed, the reverse move might involve different selection probabilities. By substituting Eq. 5 into Eq. 4, we get an extra summation over α in addition to the integrals:

Σ_α ∫ ds′1 dE dE′ ρ1(s1|E) ρ(E) P_sel_α(E) π_α(s1, E → s′1, E′) = Σ_α ∫ ds′′1 dE′′ dE′′′ ρ1(s′′1|E′′) ρ(E′′) P_sel_α(E′′) π_α(s′′1, E′′ → s1, E′′′). (6)

But at this point, we apply the first level of "detailedness" by requiring the equation to hold for each α:

∫ ds′1 dE dE′ ρ1(s1|E) ρ(E) P_sel_α(E) π_α(s1, E → s′1, E′) = ∫ ds′′1 dE′′ dE′′′ ρ1(s′′1|E′′) ρ(E′′) P_sel_α(E′′) π_α(s′′1, E′′ → s1, E′′′). (7)

So now we can evaluate the different moves separately. We further simplify this expression by integrating out the variables E′ and E′′′ using the following relation:

∫ dE′ π_α(s1, E → s′1, E′) = π_α(s1, E → s′1, aE), (8)

where aE refers to any possible environment. Substitution of Eq. 8 into Eq. 7 gives:

∫ ds′1 dE ρ1(s1|E) ρ(E) P_sel_α(E) π_α(s1, E → s′1, aE) = ∫ ds′′1 dE′′ ρ1(s′′1|E′′) ρ(E′′) P_sel_α(E′′) π_α(s′′1, E′′ → s1, aE). (9)

First, we consider α = 1, referring to the CPU-intensive move that only operates in ensemble 1. For this move we substitute α = 1 in Eq. 9 and replace E′′ and s′′1 with E and s′1, respectively, which is allowed since these are dummy integration variables. Then, we fix another level of detailedness by requiring the integrands at the left and right side of the equality sign to be identical for any E and s′1. As a result, ρ(E) P_sel_1(E) cancels out, such that we can write

ρ1(s1|E) π_1(s1, E → s′1, aE) = ρ1(s′1|E) π_1(s′1, E → s1, aE). (10)

Since in move 1) the ensembles progress independently from each other, we have

π_1(s1, E → s′1, aE) = π_1(s1 → s′1) π_1(E → aE). (11)

The subscript "1" in π_1(E → aE) might seem contradictory to the previous statement on independent progression, but it just indicates that the points in time at which the environment is evaluated relate to the duration of the MC move in ensemble 1: E is the environment at the start of the MC move in ensemble 1, and aE is that when the move is completed. As the time for a s1 → s′1 move is likely not the same as the time for a s′1 → s1 move, the final environments are likely not the same. However, aE refers to any environment. Hence, by substituting Eq. 11 into Eq. 10, π_1(E → aE) not only cancels, as it appears at both sides of the equals sign; it is also equal to one. We therefore have not just one but two very good reasons to eliminate this term, such that:

ρ1(s1|E) π_1(s1 → s′1) = ρ1(s′1|E) π_1(s′1 → s1), (12)

or, via Eq. 3:

ρ1(s1) π_1(s1 → s′1) = ρ1(s′1) π_1(s′1 → s1). (13)

This equation is essentially the same as the standard detailed-balance equation, such that we can adapt our acceptance according to

P_acc(s1 → s′1) = min[1, ρ1(s′1) P_gen(s′1 → s1) / (ρ1(s1) P_gen(s1 → s′1))], (14)

which is exactly the same as in standard Metropolis-Hastings. Still, the underlying philosophy is different from a superstate perspective, as the number of transitions from old to new, S(o) → S(n), is not the same as from new to old, S(n) → S(o). Instead, by writing S = (s1, E), we have that the number of (s1(o), E) → (s1(n), aE) transitions should be equal to the number of (s1(n), E) → (s1(o), aE) transitions. In addition, as at the end of the move we only update ensemble 1, and not those that are here considered as environment, the number of sampled states in the ensembles does not increase in cohort. Sampling all states simultaneously like in a true superstate move
would imply that the distributions get mixed with the unknown and unphysical ρi^u distributions.

For the swapping move we just consider the example of an attempted 1 ↔ 2 swap, as all other swaps i ↔ j are completely analogous. We start again at Eq. 7 with α = 1 ↔ 2, and further we split the environment E = {s2, E✄2} into the part that participates in the swap move, s2, and the rest, E✄2; this yields Eq. 15. Here, we assume that the selection probability P_sel_{1↔2} depends on E✄2. The chance to perform a replica exchange move equals P_RE, but once it is decided to perform a replica exchange move, all possible swaps i ↔ j compete to be selected with equal probability. Hence, the probability for the 1 ↔ 2 swap to be selected depends on the number of available ensembles, which is the total number of ensembles minus the number of occupied ones. This latter information is contained in E✄2. The swapping transition probability π_{1↔2} relates to a move that has only one possible outcome, namely the one in which the states in ensembles 1 and 2 are exchanged. We can therefore express the outcome with delta functions, Eq. 16, where the transition probability with the hat, π̂_{1↔2}, differs from the transition probability without the hat, π_{1↔2}, in that the latter considers any potential (even if impossible) result of the swapping operation, while the former relates to the probability of successfully executing the move in practice, in which s1 and s2 change places. Substitution of Eq. 16 into Eq. 15 allows us to eliminate the integrals over s′1, s′2, s′′1, and s′′2 via the delta-function integration property.
We then eliminate the integrals over E′✄2 and E′′′✄2 using an expression similar to Eq. 8.
In the next step, we change some of the dummy integration variable names, s′′′2 to s2 and E′′✄2 to E✄2, and use a detailed-balance principle by stating that the equality does not only hold when integrated, but is true for any pair s2, E✄2.
We further simplify ρ1(s1|s2, E✄2) to ρ1(s1) using Eq. 3, and split π̂_{1↔2} into a part acting on the states (s1, s2) and a part acting on the environment, where the latter term cancels like before. Since π̂_{1↔2}(s2, s1 → s1, s2) is the transition probability from (s1, s2) to (s2, s1) in the first two ensembles given that the 1 ↔ 2 swap move was selected, and given that there are no other possible outcomes of this swap (P_gen = 1), the transition probability equals the acceptance probability, which leads to the condition of Eq. 23. To satisfy this relation, Eq. (4) of the main article suffices.
So also here, the standard replica exchange acceptance rule applies. The main difference is that the ensembles are not updated in cohort: after the 1 ↔ 2 swap move, we only update ensembles 1 and 2. Alternatively, after the 1 ↔ 2 swap, all other free ensembles could be updated as well with "null moves". In the example of Eq. 1 this would mean that, besides ensembles 1 and 2, also ensemble 4 would be updated. As the state in this ensemble does not change in a 1 ↔ 2 swap, this would imply recounting the existing s4 state. Hence, this could be viewed as a superstate move, but then without the occupied states. Resampling s4 is allowed, as the chance for resampling is independent of the content of ensemble 4. However, the sampling of ensembles 3 and 5 should, while they are occupied, be avoided at all cost, since the time that ensembles 3 and 5 remain occupied can correlate with the values of s3 and s5, respectively.
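That the standard swap acceptance rule preserves each ensemble's distribution individually can be checked numerically in a minimal two-ensemble toy model with discrete states (the target distributions below are arbitrary illustrations, not taken from the simulations in this work):

```python
import random

def swap_preserves_marginals(steps=200000, seed=1):
    """Two discrete 'ensembles' with different target distributions rho1, rho2
    over states {0, 1}. Moves: independent Metropolis updates in each ensemble,
    plus swaps accepted with the standard replica-exchange rule
    min(1, rho1(s2)*rho2(s1) / (rho1(s1)*rho2(s2))). The sampled marginals
    should still match rho1 and rho2 individually."""
    rng = random.Random(seed)
    rho1 = [0.7, 0.3]
    rho2 = [0.2, 0.8]
    s1, s2 = 0, 0
    count1 = [0, 0]
    count2 = [0, 0]
    for _ in range(steps):
        if rng.random() < 0.5:                 # single-ensemble moves
            t1, t2 = rng.randrange(2), rng.randrange(2)
            if rng.random() < rho1[t1] / rho1[s1]:
                s1 = t1
            if rng.random() < rho2[t2] / rho2[s2]:
                s2 = t2
        else:                                   # replica exchange move
            acc = (rho1[s2] * rho2[s1]) / (rho1[s1] * rho2[s2])
            if rng.random() < acc:
                s1, s2 = s2, s1
        count1[s1] += 1
        count2[s2] += 1
    p1 = count1[0] / steps                      # should approach rho1[0] = 0.7
    p2 = count2[0] / steps                      # should approach rho2[0] = 0.2
    return p1, p2

p1, p2 = swap_preserves_marginals()
```

The same check still passes if the swap attempts happen at irregular times, as long as the decision to attempt a swap does not depend on the current states, which is the essence of the locking argument above.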
Like in Eq. 14, the acceptance rule of Eq. 23 is based on a twisted detailed-balance relation: we require that, given an equilibrium distribution, the number of (s1, s2, E✄2) → (s2, s1, aE✄2) transitions should be equal to the number of (s2, s1, E✄2) → (s1, s2, aE✄2) transitions. So in this section, we proved that standard acceptance/rejection rules can be applied in a parallel scheme in which replica exchange moves occur only between unoccupied ensembles, such that the ensembles are not updated in cohort.

III. MATRICES WITH CONSECUTIVE ONES AND ZEROS
If the high-acceptance approach is not applied, w_i(X) = 1 in Eq. (6) of the main article and the W-matrix has rows consisting of a sequence of ones followed by a sequence of zeros. The P-matrix can then be determined from Eq. (7) of the main article, which has an O(n^2) scaling. In this section we provide the proof of this equation.
Let n_i be the number of ones in row i. The first step is to order the rows in increasing order of n_i. For instance, in the following 5 × 5 matrix

e1 e2 e3 e4 e5
s1: 1 1 0 0 0
⋮

we see that s2, originating from an MC move in ensemble e2, is also valid for e3 and e4. State s3, which was created in e3, only reaches the minimal condition for that ensemble. In path sampling, where s2 and s3 are paths and e2, e3 and e4 refer to path ensembles [k+], [l+] and [m+] with m > l > k, it would mean that path s3 crosses λ_l but not λ_m, while path s2 crosses at least m - k more interfaces than strictly needed for being a valid trajectory in e2 = [k+]. As a result, the third row has fewer ones than the second row. After reordering, the W-matrix looks as follows:

e1 e2 e3 e4 e5
s′1 = s1: 1 1 0 0 0
⋮

where we introduced the bracket notation W[•], indicating the number of ones in each row, in which 1 ≤ n1 ≤ n2 ≤ n3 ≤ . . . ≤ n_n = n. Likewise, we always have n_i ≥ i.
Based on the recursive relation perm(W) = Σ_j W_{1j} perm(W{1j}), and the fact that the matrix after removing row 1 and column j, W{1j}, is identical for any j ≤ n1, we can write perm(W[n1, n2, . . ., n_n]) = n1 × perm(W[n2 - 1, n3 - 1, . . ., n_n - 1]). The permanent of the remaining matrix W[n2 - 1, n3 - 1, . . ., n_n - 1] can again be written as (n2 - 1) × perm(W[n3 - 2, . . ., n_n - 2]), and so on. The permanent is hence equal to

perm(W[n1, n2, . . ., n_n]) = ∏_{i=1}^{n} (n_i - i + 1).

The P-matrix follows from Eq. (5) of the main article: P_ij = W_ij perm(W{ij})/perm(W). This means that P_ij = 0 whenever W_ij = 0. If W_ij = 1, and n_{i-1} < j or i = 1, the matrix W[n1, . . ., n_{i-1}, n_{i+1} - 1, . . ., n_n - 1] remains after removal of row i and column j, and therefore for this case we have

P_ij = 1/(n_i - i + 1).

If for some k < i, n_k ≥ j, while n_{k-1} < j or k = 1, the matrix W[n1, . . ., n_{k-1}, n_k - 1, . . ., n_{i-1} - 1, n_{i+1} - 1, . . ., n_n - 1] remains after removal of row i and column j. This gives for P_ij:

P_ij = (1/(n_i - i + 1)) ∏_{m=k}^{i-1} (n_m - m)/(n_m - m + 1).

Comparing this result with that of one row below (row i + 1) yields the recursive relation

P_{i+1,j} = P_ij (n_i - i)/(n_{i+1} - i).

For the example given above, this relation gives the corresponding P-matrix. This O(n^2) algorithm can be executed within a second for n ≤ 3500, larger than any foreseeable RETIS simulation, without even leveraging the block diagonalization. One could swap the second and third rows again to order them according to the original s_i states, though in principle there is no need for this, as it is irrelevant to connect the existing states to the ensembles in which they were originally created.
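The closed-form permanent and the definition of the P-matrix can be verified by brute force for a small staircase matrix. The sketch below uses our own helper names and an illustrative choice of row lengths; the brute-force permanent is only feasible for small n, in contrast to the O(n^2) expressions above.

```python
from itertools import permutations
from math import prod

def permanent(W):
    """Brute-force permanent, O(n!); fine for small test matrices."""
    n = len(W)
    return sum(prod(W[i][p[i]] for i in range(n)) for p in permutations(range(n)))

def staircase(ns):
    """W-matrix whose row i consists of ns[i] ones followed by zeros."""
    n = len(ns)
    return [[1 if j < ni else 0 for j in range(n)] for ni in ns]

def p_matrix(W):
    """P_ij = W_ij * perm(W with row i and column j removed) / perm(W)."""
    n = len(W)
    total = permanent(W)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if W[i][j]:
                minor = [row[:j] + row[j + 1:] for k, row in enumerate(W) if k != i]
                P[i][j] = permanent(minor) / total
    return P

ns = [2, 3, 4, 5, 5]                 # sorted row counts, n_i >= i, n_n = n
W = staircase(ns)
# closed form: product of (n_i - i + 1) with 1-based i (0-based: n_i - i)
closed_form = prod(ni - i for i, ni in enumerate(ns))
P = p_matrix(W)
```

By expanding the permanent along a row or a column, every row and every column of P sums to one, so P is doubly stochastic, as required for a selection matrix.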

IV. KRAMERS' THEORY
For Langevin dynamics, Kramers' relation provides a way to improve upon transition state theory via an approximate expression for the transmission coefficient:

κ = (√(γ²/4 + ω_b²) - γ/2)/ω_b. (34)

Here, γ is the friction coefficient of the Langevin dynamics and ω_b = √(k/m), with m the particle's mass and k the curvature along the reaction coordinate at the transition state. The rate constant is then the product of the transmission coefficient and the transition state theory expression for the rate:

k = κ k_TST. (35)

For one-dimensional motion along a coordinate z, the transition state theory expression can be expressed as [1]:

k_TST = (2πβm)^{-1/2} e^{-βV(0)} / ∫_{z<0} e^{-βV(z)} dz, (36)

where V(•) is the underlying potential, T the temperature, k_B the Boltzmann constant, and β = 1/k_B T. The transition state is here assumed to be located at z = 0, and the system is in state A, the reactant state, if z < 0. The Kramers approximation for the rate constant k follows from Eqs. 34-36. However, other properties, like crossing probabilities and the permeability through a membrane, can be derived from the transmission coefficient as well.
The crossing probability P_A(λ_B|λ_A) from interface λ_A to interface λ_B follows from the main TIS/RETIS rate equation:

k = f_A P_A(λ_B|λ_A), (37)

where f_A is the conditional flux through λ_A given that the system is in state A. Here, λ_A and λ_B correspond to the first, λ_0, and last interface, λ_M, respectively. The flux f_A through λ_A is similar to k_TST, the flux through the transition state without recrossing correction, as it counts all positive crossings and is based on the same normalization (integration over state A):

f_A = (2πβm)^{-1/2} e^{-βV(λ_A)} / ∫_{z∈A} e^{-βV(z)} dz. (38)

From Eqs. 34-38 we end up with an equation for the crossing probability:

P_A(λ_B|λ_A) = k/f_A = κ e^{-β[V(0) - V(λ_A)]}. (39)

Hence, based on the underlying potential and Kramers' expression, Eq. 34, one can obtain an approximate value for the crossing probability. Likewise, for a membrane system we can derive a Kramers expression for the permeability P, starting from Eq. 18 in Ref. [2]:

P = f_A P_A(λ_B|λ_A)/(ρ_ref)_A, (40)

where ρ_ref refers to the probability density for a permeant at a location away from the membrane, z_ref, where V(•) is considered to be flat, and the subscript (•)_A indicates that it is normalized over the reactant state region A:

(ρ_ref)_A = e^{-βV(z_ref)} / ∫_{z∈A} e^{-βV(z)} dz. (41)

Note that the integral in the denominator of Eqs. 38 and 41 is usually diverging, since the underlying potential V(•) is generally flat away from the barrier in a membrane system. Fortunately, this integral term cancels in Eq. 40:

P = (2πβm)^{-1/2} e^{-β[V(λ_A) - V(z_ref)]} P_A(λ_B|λ_A) = κ (2πβm)^{-1/2} e^{-β[V(0) - V(z_ref)]}, (42)

where in the second equality we substituted P_A(λ_B|λ_A) using Eq. 39. Hence, based on Eq. 34 and Eq. 42, we can obtain a value for the permeability based on Kramers' theory.
The aforementioned equations can be generalized to multidimensional systems by replacing the V(z) terms with the Landau free energy F(z). That is, for one additional degree of freedom y:

e^{-βF(z)} = ∫ e^{-βV(y,z)} dy. (43)

In addition, if multiple reaction channels yield competing parallel saddle points in the potential energy surface, these need to be summed up, as we will do in the next section.
A. Kramers' relation for the crossing probability of a two-channel system

The potential energy surface is described in Ref. [2]. The potential is periodic along the y-direction, such that V(y, z) = V(y + L_y, z), and it is zero in the limit |z| → ∞. Further, the following mass, Langevin friction coefficient, and thermodynamic parameters were set in dimensionless reduced units: γ = 5, T = m = k_B = β = 1. The first and last interfaces were set at λ_A = -1.5 and λ_B = 1.2. In this case, we have two saddle points, at (-L_y/4, 0) and at (+L_y/4, 0), where the former is lower in potential energy by 1 k_B T (V_1 and V_2, respectively). The curvatures can be obtained by applying a second-order Taylor expansion around z = 0, which gives ω_{b,1} = √20 and ω_{b,2} = √22. As a result, κ_1 = 0.5866 and κ_2 = 0.6002 via Eq. 34. From this we can compute the crossing probability based on essentially Eq. 39, but using the Landau free energy, F(•), of Eq. 43 instead of the potential energy, V(•), and using both transmission coefficients for the parts along the orthogonal coordinate, y, where they are relevant; the integrals over y are taken over one period. Note that the system in Ref. [2] actually contains 3 particles that move in this 2D potential energy surface, such that the dimension of the system is actually 6. However, since we follow one single target permeant and the other particles are assumed to have no influence on the target (the interparticle interaction was set to 0 [2]), the effective dimension for our analysis is 2, with coordinates y and z.
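The two transmission coefficients quoted above follow directly from Eq. 34; a few lines suffice to reproduce them (the function name is ours):

```python
from math import sqrt

def kramers_kappa(gamma, omega_b):
    """Kramers transmission coefficient for Langevin dynamics at a
    parabolic barrier with curvature frequency omega_b (Eq. 34)."""
    return (sqrt(gamma**2 / 4 + omega_b**2) - gamma / 2) / omega_b

gamma = 5.0
kappa1 = kramers_kappa(gamma, sqrt(20))   # lower-barrier channel
kappa2 = kramers_kappa(gamma, sqrt(22))   # higher-barrier channel
```

Rounding to four digits recovers κ_1 = 0.5866 and κ_2 = 0.6002 as stated in the text.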
The permeability then follows from Eq. 42 with V(•) replaced by F(•), where we used the expression based on the crossing probability so that the effect of the two different transmission coefficients is directly included, and where we assumed that z_ref is taken far away from the membrane at z = 0, such that z_ref ≪ 0 and V(y, z_ref) ≈ 0.
well. Based on the expected error in the local crossing probability, the CPU-based efficiency time of ensemble [k+] can be expressed as [5] τ^eff_k = N_k ξ_k L_k (1 − p_k)/p_k, where p_k is the local crossing probability of ensemble [k+], L_k is the average path length (expressed in MD steps or CPU seconds), and ξ_k is the ratio of the average cost of an MC move to L_k. In other words, ξ_k L_k is the average computational cost of an MC move (the creation of a trial path that might then be accepted or rejected). Finally, N_k is a measure of the effective correlations between MC moves, also called the "statistical inefficiency". Paths can be correlated due to rejections, which imply that the old path is recounted, or because of similarities between accepted paths. In practice, N_k tends to be significantly larger than 1, while ξ_k is often smaller than 1, as many rejections occur without the trial path needing to be fully completed. In addition, some MC moves, like the replica exchange move or the time-reversal move, do not require any MD steps.
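The per-ensemble efficiency time of Ref. [5] — the squared relative (binomial) error of the local crossing probability multiplied by the CPU cost of obtaining it — can be coded directly. This is a sketch under that assumed form; the parameter values below are arbitrary illustrations, not taken from the simulations in the text.

```python
def ensemble_eff_time(p_k, L_k, xi_k=1.0, N_k=1.0):
    """CPU-based efficiency time of path ensemble [k+]:
    tau_eff_k = N_k * xi_k * L_k * (1 - p_k) / p_k,
    i.e. statistical inefficiency x cost per MC move x relative binomial variance."""
    return N_k * xi_k * L_k * (1.0 - p_k) / p_k

# Illustrative numbers: local crossing probability 0.1, average path length 100 steps,
# rejection-free sampling (xi_k = 1) and uncorrelated moves (N_k = 1).
print(ensemble_eff_time(p_k=0.1, L_k=100.0))
```

A smaller efficiency time is better: it is the CPU time needed to reach a unit relative error, so cheap moves (small ξ_k L_k) and high acceptance of decorrelated paths (small N_k, large p_k) both help.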
In the following, we will neglect the effect that the replica exchange moves have on the errors and on the CPU time. Under this assumption, the successive MC moves are completely independent. In addition, the ensemble moves are memoryless (hence N = 1). The overall error can thus be computed from the errors in the individual ensembles using standard error propagation rules for independent estimates. Except for the replica exchange part, the MSVS simulation is rejection-free, such that we also have ξ = 1. In addition, the random artificial MD time for a path in ensemble [k+] was on average 0.1k + 0.1 seconds. To simplify our analysis, we neglect the final 0.1 addition and state that L_k = ak with a = 0.1. Finally, we fixed the local crossing probability to p_k = p = 1/10 for all ensembles [k+], such that the overall crossing probability is the product of the local ones. The relative error in the estimate of the local crossing probability of ensemble [k+] follows from Eq. 49: ǫ² = Σ_k ǫ_k², with ǫ_k² = ak(1 − p)/(p τ_k). The first expression is the standard error propagation rule for the error in a final estimate that is obtained from a product of independent estimates. Now let us first consider standard TIS or the N = K case. In this simulation we would have as many workers as ensembles. Each worker is solely designated to a single ensemble, such that an equal amount of CPU time is spent per ensemble when the simulation is stopped. So we can simply put τ_k = 1, such that τ = N and ǫ² = (a(1 − p)/p) Σ_k k = (a(1 − p)/p) N(N − 1)/2 ≈ a(1 − p)N²/(2p), where in the last equality we assumed N ≫ 1. The efficiency time for TIS is hence τ^eff = ǫ²τ = a(1 − p)N³/(2p). This is in agreement with Ref.
[5], which stated that an equal division of CPU time or aiming for the same error in each ensemble gives the same efficiency. Since the local crossing probability is the same for each ensemble, p_k = p, aiming for the same error in each ensemble is equivalent to having the same number of MC moves per ensemble (if the statistical inefficiencies N_k are the same). The optimal division of CPU time over the different ensembles is, however, τ_k ∝ √(τ^eff_k) [5]. By taking τ_k = √k, the total CPU time becomes τ = Σ_k √k ≈ (2/3)N^{3/2} and the total error ǫ² = (a(1 − p)/p) Σ_k √k ≈ (a(1 − p)/p)(2/3)N^{3/2}, which by Eq. 49 results in a slightly lower efficiency time than for TIS/RETIS: τ^eff = ǫ²τ = (4/9) a(1 − p)N³/p, for an optimal division (61). Based on a = p = 0.1 and N = 50, the efficiency times are τ^eff = 56250 for TIS/RETIS and τ^eff = 50000 for the optimal division. Naturally, the corresponding CPU-time efficiencies by Eq. 48 are 1/56250 and 1/50000. Furthermore, based on Eq. 50, the optimal wall-time efficiency and the optimal TIS/RETIS wall-time efficiency are given by K/50000 and K/56250, respectively. These are the continuous black and purple curves in Fig. 1d of the main article.
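The two quoted efficiency times can be reproduced numerically. This sketch uses our reconstruction of the large-N closed forms for the expressions referenced in the text (τ^eff = a(1 − p)N³/(2p) for an equal division and (4/9)a(1 − p)N³/p for the optimal one), together with the direct finite-N sums of ǫ_k² = ak(1 − p)/(p τ_k):

```python
import math

a, p, N = 0.1, 0.1, 50

# Large-N closed forms for the total efficiency time eps^2 * tau:
tis_eff = a * (1.0 - p) * N**3 / (2.0 * p)        # equal CPU time per ensemble
opt_eff = 4.0 * a * (1.0 - p) * N**3 / (9.0 * p)  # tau_k proportional to sqrt(k)

def eff_time(tau_of_k):
    """Finite-N total efficiency time for a given CPU-time division tau_k,
    summing eps_k^2 = a*k*(1-p)/(p*tau_k) over the ensembles (k = 0 contributes 0)."""
    tau = sum(tau_of_k(k) for k in range(1, N))
    eps2 = sum(a * k * (1.0 - p) / (p * tau_of_k(k)) for k in range(1, N))
    return eps2 * tau

print(round(tis_eff), round(opt_eff))  # 56250 50000
```

The finite-N sums approach these closed forms as N grows, and the square-root division always beats the equal division, consistent with the 12.5% difference discussed below.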
It is interesting to observe that the optimal TIS/RETIS CPU-time efficiency is only 12.5% lower than the optimal CPU-time efficiency. This seems to suggest that it is difficult to improve the CPU-time efficiency of TIS and RETIS unless the division of CPU time is exactly targeted to do so. On the other hand, one can easily get a much worse CPU-time efficiency when the errors in some ensembles are reduced to unnecessarily small values while the errors in the other ensembles are ignored. Based on the fact that τ_k ∝ √(τ^eff_k) gives the optimum, the optimal division of MC moves is obtained when the number of MC moves in ensemble [k+] is proportional to √(τ^eff_k)/L_k. For the MSVS system this means that the number of executed MC moves in each ensemble should optimally be taken ∝ 1/√k for k = 1, 2, . . ., M − 1 (to account for k = 0 we should have kept the neglected 0.1 addition in the path length to avoid divergence). This means that it is actually good to execute more MC moves in the lower-ranked ensembles (low k) than in the higher-ranked ones (high k). However, this should not be exaggerated, since too many MC moves in the low-ranked ensembles will just result in inefficient use of CPU time, as discussed above. Based on the numerical sampling ratios, we determined the CPU time spent in each ensemble, τ_k, by multiplying these ratios by L_k = ak. We then estimated the error based on Eqs. 54 and 52. The resulting efficiency, based on the actual sampling ratios of ∞RETIS, turned out to give a slightly better CPU-time efficiency than that of TIS/RETIS. The resulting wall-time efficiencies of this hybrid theoretical/numerical result are shown by the purple dots in Fig. 1d as well. This shows that ∞RETIS can actually improve both the CPU- and wall-time efficiency compared to TIS/RETIS. The latter is expected based on the brute-force principle that more CPU power is used per second. The former is more subtle and is related to the fact that ∞RETIS leads to a more efficient distribution of the CPU time among the different ensembles compared to TIS or RETIS.
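The optimal allocation of MC moves over the ensembles can be illustrated with a short sketch. Assuming L_k = ak and the square-root CPU-time division τ_k ∝ √k discussed above, the fraction of moves per ensemble falls off as 1/√k (the value of M below is an arbitrary illustration):

```python
import math

M = 11    # number of plus-ensembles [1+], ..., [M-1+] (illustrative choice)
a = 0.1   # path-length prefactor, L_k = a*k

# Optimal CPU time tau_k ~ sqrt(k), so the number of moves n_k = tau_k / L_k ~ 1/sqrt(k).
weights = [math.sqrt(k) / (a * k) for k in range(1, M)]
total = sum(weights)
fractions = [w / total for w in weights]
for k, f in enumerate(fractions, start=1):
    print(f"[{k}+]: {100 * f:.1f}% of MC moves")
```

The fractions decrease monotonically with k, reflecting the point made above: the cheap, low-ranked ensembles should receive more MC moves, but only in proportion to 1/√k, not arbitrarily many.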

A. Ratios of channel crossings

ǫ² = Σ_{k=0}^{N−1} ǫ_k², with τ_k the CPU time spent on ensemble [k+]. Given a certain division of the total simulation time τ into the times (τ_0, τ_1, . . ., τ_{N−1}), we can compute the total efficiency time by Eq. 49 with this expression for ǫ².
τ^eff = a(1 − p)N³/(2p) for TIS or K = N. (56)

For serial RETIS, each ensemble is updated by an MC move before the next cycle of moves is started. As a result, the same number of MC moves is carried out in each ensemble, such that τ_k ∝ L_k ∝ k. By taking τ_k = k, we get τ = (N − 1)N/2 ≈ N²/2 and ǫ² = a(1 − p)N/p, such that the efficiency time is again a(1 − p)N³/(2p).

FIG. 1. The ratio of first-crossing points for the last ensemble in the more favorable channel. The blue icons show the sampled ratio for each simulation, the blue line is the average of 5 simulations for each number of workers, and the black line is the expected value from a direct integration of exp(−βV(y, z)) over y with z fixed at z = λ_10 = −0.2. The 3 icons overlapping at 0.0 for 12 workers result from the known ergodicity issues of the TIS algorithm due to the lack of replica exchange moves.