The Apogee to Apogee Path Sampler

Abstract. Among Markov chain Monte Carlo algorithms, Hamiltonian Monte Carlo (HMC) is often the algorithm of choice for complex, high-dimensional target distributions; however, its efficiency is notoriously sensitive to the choice of the integration-time tuning parameter. When integrating both forward and backward in time using the same leapfrog integration step as HMC, the set of apogees, local maxima in the potential along a path, is the same whatever point (position and momentum) along the path is chosen to initialize the integration. We present the Apogee to Apogee Path Sampler (AAPS), which uses this invariance to create a simple yet generic methodology for constructing a path, proposing a point from it and accepting or rejecting that proposal so as to target the intended distribution. We demonstrate empirically that AAPS has a similar efficiency to HMC but is much more robust to the setting of its equivalent tuning parameter, the number of apogees that the path crosses. Supplementary materials for this article are available online.


Introduction
Markov chain Monte Carlo (MCMC) is often the method of choice for estimating expectations with respect to complex, high-dimensional targets (e.g. Brooks et al., 2011; Gilks et al., 1996). Amongst MCMC algorithms, Hamiltonian Monte Carlo (HMC, also known as Hybrid Monte Carlo; Duane et al., 1987) is known to offer a performance that scales better with the dimension of the state space than many of its rivals (Beskos et al., 2013; Neal, 2011b).
Given a target density π(x), x ∈ R^d, with respect to Lebesgue measure, and a current position, at each iteration HMC samples a momentum and numerically integrates Hamiltonian dynamics on a potential surface to create a proposal that will either be accepted or rejected. As such, HMC has two main tuning parameters: the numerical integration step size, ε, and the total integration time, T. Given T, guidelines for tuning ε have been available for some time (Beskos et al., 2013); however, the integration time itself is notoriously difficult to tune (e.g. Neal, 2011b), with algorithm efficiency often dropping sharply following only slight changes from the optimum T value, and usually exhibiting approximately cyclic behaviour as T increases.
The sensitivity is illustrated in the left panel of Figure 1, in which HMC is applied to a modified Rosenbrock distribution (see Section 4.1) of dimension d = 40. In the top half of this plot, the efficiency measure, which will be described in detail in Section 4, is given as a function of ε and the number of numerical integration steps, L. The bottom half of the panel shows the analogous plot for a modification of the HMC algorithm (Mackenzie, 1989; Neal, 2011b), which we will call blurred HMC, where at each iteration the actual step size is sampled (independently of all previous choices) uniformly from the interval [0.8ε, 1.2ε]. This step was designed to mitigate the near-reducibility of HMC that can occur when T is some rational multiple of the integration time required to return close to the starting point, but, as is visible from the plots, it also makes the performance of the algorithm more robust to the choice of T and, we have found, often leads to a slightly more efficient algorithm. In both cases, the optimal tuning choice appears as a narrow ridge of roughly constant T.
Motivated by the difficulty of tuning T, Hoffman and Gelman (2014) introduce the no U-turn sampler. This uses the same numerical integration scheme as standard HMC, the leapfrog step, to integrate Hamiltonian dynamics both forward and backward from the current point, recursively doubling the size of the path until the distance between the points farthest forward and backward in time stops increasing. Considerable care must be taken to ensure that the intended posterior is targeted, making the algorithm relatively complex; however, the no U-turn sampler is the default engine behind the popular Stan package (Stan Development Team, 2020), which has a relatively straightforward user interface.
Integration of Hamiltonian dynamics can be thought of as describing the position and momentum of a particle as it traverses the potential surface U. During its journey, provided T is not too small, the particle will reach one or more local maxima in the potential. We call each of these maxima an apogee. The leapfrog scheme creates a path which is a discrete set of points rather than a continuum, and so (with probability 1) apogees occur between consecutive points in the path; however, they are straightforward to detect. We call the set of points (positions and momenta) between two apogees a segment. The discrete dynamics using the leapfrog step share several properties with the true dynamics, including the following: if we take the position and momentum of any point along the path, and integrate forward and backward for appropriate lengths of time, we will create exactly the same path, and hence the same set of apogees and the same set of segments. This invariance is vital to the correctness and flexibility of the algorithm presented in this article.
In Section 3 we present the Apogee to Apogee Path Sampler (AAPS). Like HMC, AAPS is straightforward to implement and has two tuning parameters, one of which is an integration step size, ε. As with the no U-turn sampler, given a current point (position and momentum), AAPS uses the leapfrog step to integrate forwards and backwards in time. However, the integration stops when the path contains the segment in which the current point lies as well as K additional segments, where K is a user-defined tuning parameter. A point is then proposed from this set of K + 1 segments and either accepted or rejected. The positioning of the current segment within the K + 1 segments and the accept-reject probability are chosen precisely so that the algorithm targets the intended density. The invariance of the path to the starting position and momentum leads to considerable flexibility in the method for proposing a point from the set of segments, which in turn allows us to create an algorithm which enjoys a similar efficiency to HMC yet is extremely robust to the choice of the parameter K. Both of these properties are evident from the right-hand panel of Figure 1 and analogous plots for other distributions in Appendix E.
2 Hamiltonian dynamics and Hamiltonian Monte Carlo

Hamiltonian dynamics
The position, x, and momentum, p, of a particle moving across a frictionless potential surface U(x) evolve according to Hamilton's equations:

dx/dt = M^{-1} p,   dp/dt = −∇U(x).   (2)

For a real object, M is the mass of the object, a scalar, but the properties of the equations themselves that we will require hold more generally, when M is a symmetric, positive-definite mass matrix. We define z_t = (x_t, p_t) and the map φ_t which integrates the dynamics forwards for a time t, so z_t = φ_t(z_0). The map φ_t has the following fundamental properties (e.g. Neal, 2011b): 1. It is deterministic.
2. It has a Jacobian of 1.
3. It is skew-reversible: negating the momentum, integrating forwards for a time t and negating the momentum again returns the system to its state t time units earlier.

4. It preserves the total energy (the Hamiltonian), H(x, p) = U(x) + (1/2) p^T M^{-1} p.
Except in a few special cases, the dynamics are intractable and must be integrated numerically, with a user-chosen time step which we will denote by ε. The default method for Hamiltonian Monte Carlo, and the method which will be used throughout this article, is the leapfrog step; the leapfrog step itself is detailed in Appendix A. Throughout the main text of this article we denote the action of a single leapfrog step of length ε on a current state z as LeapFrog(z; ε). The leapfrog step satisfies Properties 1-3 above (see Appendix A), but it does not preserve the total energy.
Consider using L leapfrog steps of size ε = t/L to approximately integrate the dynamics forward for time t. Since each individual leapfrog step satisfies Properties 1-3, so does the composition of L such steps, which we denote φ_t(z_0; ε).
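To make this concrete, the sketch below implements one leapfrog step and the composition of L such steps. It is an illustrative Python fragment, not code from the article; the names leapfrog_step, compose, grad_U and M_inv are our own, and a standard Gaussian potential is used purely as an example.

```python
import numpy as np

def leapfrog_step(z, eps, grad_U, M_inv):
    """One leapfrog step of length eps: z = (x, p) -> LeapFrog(z; eps)."""
    x, p = z
    p = p - 0.5 * eps * grad_U(x)   # half step in momentum
    x = x + eps * (M_inv @ p)       # full step in position
    p = p - 0.5 * eps * grad_U(x)   # half step in momentum
    return (x, p)

def compose(z, eps, L, grad_U, M_inv):
    """phi_t(z; eps): the composition of L leapfrog steps, with t = L * eps."""
    for _ in range(L):
        z = leapfrog_step(z, eps, grad_U, M_inv)
    return z
```

Because each step is deterministic, volume preserving and skew-reversible, so is the composition: negating the momentum at the end of a trajectory and integrating again for the same number of steps recovers the starting state up to floating-point error, which is the property exploited repeatedly in Section 3.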

Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) creates a Markov chain which has a stationary distribution with density π(x) ∝ exp{−U(x)}. Given a current position, x_curr, and tuning parameters ε and L, a single iteration of the algorithm proceeds as follows: 1. Sample a momentum p_0 ∼ N(0, M) and set z_0 = (x_curr, p_0). 2. Apply L leapfrog steps of size ε to obtain a proposal z_prop = φ_{Lε}(z_0; ε). 3. With probability min{1, π̃(z_prop)/π̃(z_0)}, set x_curr to the position component of z_prop; otherwise leave x_curr unchanged.
Here, with ρ(p) denoting the density of the N(0, M) random variable, π̃(x, p) := π(x)ρ(p). If x_curr is in its stationary distribution then π̃ is the density of z_0 = (x_curr, p_0).
3 The Apogee to Apogee Path Sampler

Apogees and segments
The left panel of Figure 2 shows L = 50 leapfrog steps of size ε = 0.1 from a current position, x_0, simulated randomly from a two-dimensional target with potential U(x) (indicated by the contours), and momentum p_0 simulated from a N(0, I_2) distribution.

Figure 2: Left panel: L = 50 leapfrog steps of size ε = 0.1 from the current point. Right panel: the current segment, S_0, and two segments forward and one segment backward using ε = 0.1. The current point, x_0, is simulated from a target which has a density of π(x) ∝ exp(−x_1^2/2 − 6x_2^2); p_0 is simulated from N(0, I_2). Different colours and symbols are used for each segment along the path.

Different symbols and colours are used along the path, with both of these changing from step l to step l + 1 if and only if

p_l^T M^{-1} ∇U(x_l) > 0 ≥ p_{l+1}^T M^{-1} ∇U(x_{l+1}).   (4)

The condition (4) is fundamental to all that follows. Intuitively it indicates when the "particle" has switched from moving "uphill" to moving "downhill". By (2), p^T M^{-1} ∇U ≡ dx/dt · ∇U ≡ dU/dt, so (4) indicates a local maximum in U(x_t) between x_l and x_{l+1}.
Between such a pair of points at l and l + 1 there is a hypothetical point at some time a, with l < a < l + 1, where the particle's potential reaches a local maximum; we call this point an apogee. Under the exact, continuous dynamics this point would, of course, be realised, but under the discretised dynamics the probability of this is 0. We call each of the realised sections of the path between a pair of consecutive apogees (i.e., each portion with a different colour and symbol in Figure 2) a segment. Each segment consists of the time-ordered list of the positions and momenta at the points between two consecutive apogees.
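Detecting apogees and labelling segments along a stored path can be done directly from condition (4). The following is a minimal Python sketch (the names segment_labels, grad_U and M_inv are ours, not from the article):

```python
import numpy as np

def segment_labels(path, grad_U, M_inv):
    """Label each point of a discrete path with its segment index.
    The index increments between steps l and l+1 exactly when
    p_l . M^{-1} grad_U(x_l) > 0 >= p_{l+1} . M^{-1} grad_U(x_{l+1}),
    i.e. when an apogee lies between the two points (condition (4))."""
    dUdt = [p @ (M_inv @ grad_U(x)) for x, p in path]   # dU/dt at each point
    labels, seg = [0], 0
    for l in range(len(path) - 1):
        if dUdt[l] > 0 >= dUdt[l + 1]:   # uphill -> downhill: apogee crossed
            seg += 1
        labels.append(seg)
    return labels
```

Because the test involves only a sign change of dU/dt between consecutive stored points, apogees can be detected on the fly as the path is extended, without ever locating the hypothetical point a itself.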
Instead of integrating forward for L steps, one can imagine integrating both forwards and backwards in time from z_0 = (x_0, p_0) until a certain number of apogees have been found. The right panel of Figure 2 shows the segment to which the current point belongs, which we denote S_0(z_0), together with two segments forward and one segment backward. We denote the jth segment forward by S_j(z_0) and the jth segment backward by S_{−j}(z_0). We abbreviate the ordered collection of segments from S_a(z_0) to S_b(z_0) as S_{a:b}(z_0). Thus, the right panel of Figure 2 shows the positions from S_{−1:2}(z_0). For a particular point z' = (x', p') ∈ S_{a:b}, we denote the segment to which it belongs by S_#(z'; z_0).
The following segment-invariance property is vital to both the simplicity and correctness of our algorithm. For a ≤ b, any z and any z' ∈ S_{a:b}(z),

S_{a−s:b−s}(z') = S_{a:b}(z), where s = S_#(z'; z).   (5)

For the right panel of Figure 2, for example, picking any z' = (x', p') from S_1(z_0), S_{−2:1}(z') would give the same ordered set of segments as illustrated in the figure. This is because the numerical integration scheme is deterministic and skew-reversible, so the apogees would all occur in exactly the same positions with exactly the same momenta.

The AAPS algorithm
We now introduce our algorithm, the Apogee to Apogee Path Sampler, AAPS. The algorithm requires a weight function w : R^{4d} → [0, ∞), where d is the dimension of the target. Weight functions are investigated in more detail in Section 3.3, but for now it might be helpful to keep in mind the simplest that we consider: w(z, z') = π̃(z').
Given a step size ε, a non-negative integer K, a mass matrix M and a current position x_curr, one iteration of AAPS proceeds as follows: 1. Sample a momentum p_0 ∼ N(0, M) and set z_curr = z_0 = (x_curr, p_0). 2. Sample c uniformly from {0, . . . , K} and set a = −c and b = K − c.
3. Create S_{a:b} by stepping forward from (x_0, p_0) and then backward from (x_0, p_0).
4. Sample a proposal z_prop from the points of S_{a:b} with a probability proportional to w(z_curr, z_prop).
5. With a probability of

α = min{1, [π̃(z_prop) w(z_prop, z_curr) Σ_{z∈S_{a:b}} w(z_curr, z)] / [π̃(z_curr) w(z_curr, z_prop) Σ_{z∈S_{a:b}} w(z_prop, z)]},   (6)

set z_curr ← z_prop; else z_curr ← z_curr.
6. Discard p_curr and retain x_curr.
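The iteration above can be sketched end-to-end for the simplest weight function, Scheme 1 of Section 3.3, w(z, z') = π̃(z'), for which the acceptance probability is 1. The fragment below is an illustrative Python sketch, not the article's implementation: it assumes an identity mass matrix and stores the whole path rather than using the O(1)-memory scheme of Appendix B, and all names are ours.

```python
import numpy as np

def leapfrog(x, p, eps, grad_U):
    p = p - 0.5 * eps * grad_U(x)
    x = x + eps * p
    p = p - 0.5 * eps * grad_U(x)
    return x, p

def extend(x, p, eps, grad_U, n_apogees):
    """Integrate from (x, p), returning the subsequent points up to (but
    excluding) the first point lying beyond the n_apogees-th apogee."""
    pts, crossed = [], 0
    d_old = p @ grad_U(x)                     # sign of dU/dt at the start
    while True:
        x, p = leapfrog(x, p, eps, grad_U)
        d_new = p @ grad_U(x)
        if d_old > 0 >= d_new:                # apogee between the two points
            crossed += 1
            if crossed == n_apogees:
                return pts
        pts.append((x, p))
        d_old = d_new

def aaps_step(x, eps, K, U, grad_U, rng):
    p = rng.standard_normal(x.shape)                  # 1. refresh the momentum
    c = int(rng.integers(K + 1))                      # 2. position of segment 0
    fwd = extend(x, p, eps, grad_U, K - c + 1)        # 3. forward to end of S_{K-c} ...
    bwd = extend(x, -p, eps, grad_U, c + 1)           #    ... and back to start of S_{-c}
    pts = [(xb, -pb) for xb, pb in bwd] + [(x, p)] + fwd
    logw = np.array([-U(xi) - 0.5 * pi @ pi for xi, pi in pts])
    w = np.exp(logw - logw.max())
    j = int(rng.choice(len(pts), p=w / w.sum()))      # 4. propose prop. to pi-tilde
    return pts[j][0]                                  # 5-6. alpha = 1; keep the position
```

Backward integration simply flips the momentum, and apogees on the reversed trajectory are detected by the same sign change; this is the segment invariance that makes the collected point set independent of the starting point.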
Proposition 1. The AAPS algorithm satisfies detailed balance with respect to the extended posterior π̃(x, p).

Proof.
Step 1 preserves π̃ because p is independent of x and is sampled from its marginal.
To understand why the remainder of the algorithm produces a chain with the intended stationary distribution it will be helpful to define a' = a − S_#(z_prop; z_curr) and b' = b − S_#(z_prop; z_curr), and note that segment invariance (5) is equivalent to S_{a':b'}(z_prop) = S_{a:b}(z_curr). Further, since S_{a:b} is only defined for integer a and b, the pair (a', b') satisfies a' ≤ 0 ≤ b' with b' − a' = K, and so has the same probability, 1/(K + 1), of being selected at step 2 as (a, b). The resulting chain satisfies detailed balance with respect to π̃ because π̃(z_curr) multiplied by the probability of proposing z_prop and accepting it is

[π̃(z_curr)/(K + 1)] × [w(z_curr, z_prop) / Σ_{z∈S_{a:b}(z_curr)} w(z_curr, z)] × α,

which is

(1/(K + 1)) min{ π̃(z_curr) w(z_curr, z_prop) / Σ_{z∈S_{a:b}(z_curr)} w(z_curr, z), π̃(z_prop) w(z_prop, z_curr) / Σ_{z∈S_{a':b'}(z_prop)} w(z_prop, z) }.
The final expression is invariant to (z_curr, a, b) ↔ (z_prop, a', b').
Remark 1. The validity of the AAPS algorithm does not rely on the Jacobian of the leapfrog step being 1; however, a Jacobian other than 1 will lead to greater variability in π̃ along a path and, hence, to a reduced efficiency.
The path S_{a:b} may visit parts of the state space where the numerical solution to (2) is unstable and the error in the Hamiltonian may increase without bound, leading to wasted computational effort as large chunks of the path may be very unlikely to be proposed and accepted. Indeed it is even possible for the error to increase beyond machine precision. The no U-turn sampler suffers from a similar problem and we introduce a similar stability condition to that in Hoffman and Gelman (2014). We require that

|H(z) − H(z_0)| ≤ ∆ for all z ∈ S_{a:b},   (7)

for some prespecified parameter ∆. This criterion can be monitored as the forward and backward integrations proceed, and if at any point the criterion is breached, the path is thrown away and the proposal is automatically rejected. Segment invariance means that z ∈ S_{a:b}(z_curr) ⇐⇒ z ∈ S_{a':b'}(z_prop), so the same rejection would have occurred if we had created the path from any proposal in S_{a:b}(z_curr); thus it is immediately clear that detailed balance is still satisfied. For the experiments in Section 4 we found that a value of ∆ = 1000 was sufficiently large that the criterion only interfered when something was seriously wrong with the integration.
Step 3 of the algorithm then becomes: 3. Create S_{a:b} by stepping forward from (x_0, p_0) and then backward from (x_0, p_0). If condition (7) fails then go to step 6.
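A pointwise check of this kind can be evaluated as each new point is generated. The sketch below assumes that condition (7) bounds the drift of the Hamiltonian H(z) = U(x) + p^T p/2 from its value at the starting point, with an identity mass matrix; the name stable is ours (illustrative Python, not the article's code):

```python
import numpy as np

def stable(z, z0, U, Delta=1000.0):
    """Assumed form of condition (7): the Hamiltonian at z must lie
    within Delta of its value at the starting point z0."""
    (x, p), (x0, p0) = z, z0
    H = U(x) + 0.5 * p @ p
    H0 = U(x0) + 0.5 * p0 @ p0
    return abs(H - H0) <= Delta
```

If stable returns False at any point, the whole path is discarded and the proposal automatically rejected; by segment invariance the same decision would be reached from any starting point in S_{a:b}, so detailed balance is unaffected.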

Choice of weight function
The weight function w(z, z') = π̃(z') mentioned previously is a natural choice, and substitution into (6) shows that this leads to an acceptance probability of 1; however, it turns out not to be the most efficient choice. For example, intuitively the algorithm might be more efficient if points on the path that are further away from the current point are more likely to be proposed. To investigate the effects of the choice of w we examine six possibilities: 1. w(z, z') = π̃(z'), which leads to α = 1.
6. w(z, z') = π̃(z') 1{z' ∈ H(z)}, where H(z) is defined below; this also leads to α = 1.
For Scheme 6, denote the points in S_{a:b}(z_curr) from furthest back in time to furthest forward in time as z_B, . . . , z_0, . . . , z_F, and for l ∈ {B, . . . , F} let T_l := Σ_{i=B}^{l} π̃(z_i). The points in the first "half" of S_{a:b} are {z_l : T_l ≤ T_F/2} and in the second "half", {z_l : T_{l−1} > T_F/2}. We define H(z) to be the opposite "half" of S_{a:b}(z) to z. Care must be taken with the boundary value h where T_{h−1} < T_F/2 ≤ T_h; Appendix G gives a complete description of the sampling mechanism for Scheme 6. Figure 3 shows, for a particular choice of ε and target, the efficiency of AAPS as a function of K for each of the six weight schemes. Scheme 6 is the most efficient; however, Schemes 3 and 5, each of which involves some measure of jumping distance modulated by π̃, are not far behind. Indeed there is only a factor of two between the least and most efficient. Very similar relative performances were found for all the other toy targets from Section 4.1 and across a variety of choices of ε, except that when ε becomes small, modulation of SJD or AJD by π̃ makes little difference since relative changes in π̃ are small.
A simple heuristic can explain the difference of a factor of nearly 2 between Scheme 1 and Scheme 6 in the case where π̃ is approximately constant, for example when ε is small. S_{a:b} is constructed prior to making the proposal, so the computational effort does not depend on the scheme; thus efficiency can be measured crudely in terms of the squared jumping distance of the proposal (e.g. Roberts and Rosenthal, 2001; Sherlock and Roberts, 2009; Beskos et al., 2013), since it is accepted with probability 1. Without loss of generality, we rescale S_{a:b} to have unit length. For two independent Unif(0, 1) variables U_1 and U_2, Scheme 1 is equivalent to an expected squared jumping distance of E[(U_1 − U_2)^2] = 1/6, whereas under Scheme 6 the proposal is, in effect, uniform on the half of the interval opposite the current point, giving an expected squared jumping distance of 7/24, a factor of 1.75 larger.

A naive implementation of each of the above weighting schemes would require storing each of the points in S_{a:b}(z_curr), which has an associated memory cost of O(K) and an exact cost that varies from one iteration to the next as the number of points in each segment is not fixed. For all schemes except the last there is a simple mechanism for sampling z_prop with a fixed O(1) memory cost; however, this would be useless if calculation of the acceptance ratio α in (6) still required storage of O(K). For Schemes 1, 2 and 3, however, it is also possible to calculate α with a fixed O(1) memory cost, and in practice we have observed this to improve the efficiency measured in terms of effective samples per CPU second by a factor of between 2 and 5. Since there is little otherwise to choose between the schemes, we opt for the most efficient of these, Scheme 3, w(z, z') = ||x' − x||^2 π̃(z'), and apply this throughout the remainder of this article. Appendix B details the implementation of Schemes 1-3 with O(1) memory cost.
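Appendix B's implementation is not reproduced here, but the fixed-memory sampling itself can be illustrated with a standard single-pass device: keep one candidate point and, as each new point arrives, replace the candidate with probability equal to the new weight over the running total. Each point then ends up selected with probability proportional to its weight, while only the candidate and the total are ever stored. This is our own generic sketch in Python, not the article's scheme:

```python
import numpy as np

def streaming_weighted_choice(items, weight, rng):
    """Single-pass sampling of one item with probability proportional to
    weight(item), storing only the current candidate and the running total."""
    total, chosen = 0.0, None
    for it in items:
        w = weight(it)
        total += w
        if w > 0 and rng.uniform() < w / total:
            chosen = it               # replace candidate with prob w / total
    return chosen, total
```

The running total returned alongside the choice is the kind of accumulated quantity that lets the normalising sums in (6) be tracked during the same pass for suitably structured weight functions.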
Scheme 3 also contributes to the robustness of the efficiency to the value chosen for the tuning parameter K, as is visible in Figure 1 and similar figures in Appendix E. As can be seen from the left panel of Figure 2, as well as Figure 9 in the Appendix, the distance between the current position, x_curr, and the HMC proposal at time T, as a function of T, is approximately periodic. Furthermore, π̃ varies in an approximately periodic pattern and, typically, the optimal ε value is such that the amplitude is substantial. The sensitivity of HMC to the choice of T derives from the combination of these two approximate periodicities. Even with Scheme 1, AAPS removes both sensitivities as the proposal is taken from all points along the path, with probability proportional to π̃. Scheme 3 is more robust still: when K is too large and the end point of a path is much closer to x_curr than the furthest point on the path is, the proposal is much more likely to come from near that furthest point.

Diagnostic
We now describe a diagnostic that we have found useful for choosing the K parameter, the number of segments to use over and above segment 0.
Each iteration, given a current position x_curr, AAPS samples a value c uniformly from {0, . . . , K} and proposes a point x' from the set of points in the K + 1 segments numbered −c, . . . , K − c, with a probability proportional to our chosen weight function, w(x_curr, x') ∝ π̃(x') ||x' − x_curr||^2. It will be helpful in the sequel to consider the case where the segment number, j, was, instead, sampled uniformly from {−c, . . . , K − c}. This would be equivalent to sampling c and e independently and uniformly from {0, . . . , K} and setting j = e − c. Thus, the marginal distribution from which j arose would be P(j = j_0) = (K + 1 − |j_0|)/(K + 1)^2 for j_0 ∈ {−K, . . . , K}. To decide upon a suitable K, we first perform a relatively short run of AAPS with a large K, K*. Each iteration, we note the segment number, j ∈ {−c, . . . , K* − c}, from which the proposal arose. By the symmetry of the sampling of both c and the momentum, p, the distribution of j is symmetric about 0, so we track k = |j| and keep a running total of the number of times, n_{K*}(k), that there has been a proposal from a segment with an index whose absolute value is k, for each k from 0 to K*.
If the weights had been irrelevant and all segments contained the same number of points, then the expected value of n_{K*}(k) would be proportional to K* + 1 − k for k ≥ 1, and to (K* + 1)/2 for k = 0. The method for placing the set of segments over the current segment inherently causes lower values of k to appear more often, whatever the weights might suggest. To account for this, we calculate m_{K*}(k), the count n_{K*}(k) divided by this reference value, and choose the optimal number of additional apogees, K, as the maximiser of m_{K*}(k) over k ∈ {0, . . . , K*}. Appendix D gives a further heuristic for why this tuning mechanism is reasonable and several plots from empirical studies which demonstrate it working in practice. In practice, during the tuning run we monitor m̂_{K*}(k) := 100(K* + 1) m_{K*}(k) / Σ_{i=0}^{K*} m_{K*}(i), since it stabilises as the number of iterations increases.
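The triangular reference distribution of |j|, and a correction of the observed counts by it, can be sketched as follows. We assume here that the correction divides each observed count by its expected frequency under irrelevant weights, which by construction leaves a flat profile when the weights really are irrelevant (illustrative Python; the names are ours):

```python
from collections import Counter

def null_counts(K):
    """Frequency of each |j| = |e - c| when e and c are independent and
    uniform on {0, ..., K}: the 'weights irrelevant' reference shape."""
    return Counter(abs(e - c) for e in range(K + 1) for c in range(K + 1))

def corrected(n_k, K):
    """Observed proposal counts n_k[k] divided by the reference counts,
    removing the built-in bias toward small k; K would then be chosen
    as the maximiser of the corrected profile."""
    null = null_counts(K)
    return {k: n_k.get(k, 0) / null[k] for k in range(K + 1)}
```

For K* = 4, for example, |j| = 0 arises from 5 of the 25 equally likely (c, e) pairs while |j| = 4 arises from only 2, which is the bias the division removes.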

Tuning
If ε is too small then the simulated dynamics will be very close to the true dynamics (2), and π̃ will vary little across the trajectory; however, for any given number of apogees, the number of leapfrog steps, and hence of gradient evaluations, is ∝ 1/ε, so the computational cost becomes very large. On the other hand, if ε is large, the Hamiltonian can vary considerably, leading to low π̃ at otherwise desirable positions, those furthest from the current point, and even leading to growth without bound and a breakdown of the integration. These heuristics suggest the existence of an optimal ε.
A further, minor consequence of using a large ε is that the numerical integration may skip (not discover) apogees that would exist under the true dynamics. This simply redefines the detailed meaning of K.
For AAPS using weighting schemes other than Schemes 1 and 6, the fraction of proposals that are accepted, the acceptance rate, is a natural diagnostic. Empirically, across all the targets in Section 4, AAPS using Scheme 3 achieved optimal performance, as measured over a grid of ε values, at acceptance rates of between 75% and 87%. The optimal value of ε and the optimal acceptance rate were relatively insensitive to the choice of K, except when K = 0, where a large ε can give only one or two points, including z_curr, within S_0.
To tune ε and K we recommend the following procedure: 1. Repeatedly run AAPS with K = 0 for multiple iterations to approximately identify the largest ε such that (7) always holds.
2. Run AAPS for a large enough number of iterations that the pattern in the set of normalised counts, m̂_{K*}(k), is relatively stable.
3. If the maximiser, K, of the normalised counts is close to K* then increase K* and return to 2; else choose K for the main run.
As with HMC and the no U-turn sampler, the mass matrix, M, used by AAPS can also be tuned, and mixing will be optimised if M^{-1} ≈ Var_π[X], as approximated from a tuning run. However, the matrix-vector multiplication required for simulating p_0 and in each leapfrog step can be expensive, so it is usual to choose a diagonal M instead.

Gaussian process limit
Intuitively, the more eccentric a target the more apogees one might expect per unit time.
The culmination of this section is an expression which makes this relationship concrete for a product target where each component has its own length scale. With the initial condition (X_0, P_0) sampled from π̃, P_t^T M^{-1} ∇U(X_t) can be considered as a random function of time.
We first show that when the true Hamiltonian dynamics are used, subject to conditions, a rescaling of this dot-product tends to a stationary Gaussian process as the number of components in the product tends to infinity. A formula for the expected number of apogees per unit time then follows as a corollary.
We consider a d-dimensional product target with a potential of the form given in (8). HMC using a diagonal mass matrix, M, and a product target is equivalent to HMC using an identity mass matrix and a target of π(M^{-1/2}x) (e.g. Neal, 2011b), which, in the case of (8), is also a product target, but with different ν_i. For simplicity, therefore, we assume the identity mass matrix throughout this section. We also consider the true Hamiltonian dynamics, which are approached in the limit as ε ↓ 0, but which approximate the discretised dynamics reasonably well for small to moderate ε.
Assumptions 1. We assume that g ∈ C^1, that there is a δ > 0 such that (10) holds, and that for each y_0 ∈ R there is a unique, non-explosive solution a(t; y_0, y'_0) to the initial value problem: d^2y/dt^2 = −g'(y); y(0) = y_0, y'(0) = y'_0.
Theorem 1 is proved in Appendix C.1.
Theorem 1. Let the potential be defined as in (8), where g satisfies the assumptions around (10) and (11). Further, let µ be a distribution with support on R^+ satisfying a moment condition for some δ > 0, and let ν_i ∼iid µ. Define the expectation b and covariance function V over the independent variables X ∼ π_1, P ∼ ρ_1 and ν ∼ µ, and assume that for any finite sequence of n distinct times (t_1, . . . , t_n), the n × n matrix Σ with entries Σ_{jk} = V(t_j − t_k) is positive definite. Let D^{(d)} be the scaled dot product defined in (9), and let (X_0, P_0) ∼ π̃; then as d → ∞, D^{(d)} ⇒ Y, where Y ∼ SGP(b, V) denotes that Y is a one-dimensional stationary Gaussian process with an expectation of b and a covariance function of Cov(Y_s, Y_{s+t}) = V(t).

Ylvisaker (1965) shows that the expected number of zeros over a unit interval of a stationary Gaussian process with a unit variance is (1/π) (−C''(0))^{1/2}, where C is its covariance function. Zeroes can be either apogees or perigees (a perigee occurs when U(x) achieves a local minimum along the path), and these alternate. Hence, with some work (see Appendix C.2), we obtain the following:

Corollary 1. The expected number of apogees of D over a time T is a product of two terms. Since ν is a squared inverse scale parameter, the first term in the product simply relates the time interval to the overall length scale of the target, given that P_0 ∼ N(0, I_d) and an identity mass matrix is used. The second part of the product is more interesting and shows how the expected number of apogees per unit time interval increases with variability in the squared inverse length scales of the components of the target. The second term also makes it plain that the tuning parameter K relates to properties intrinsic to the target; unlike L, its impact is relatively unaffected by a uniform redefinition of the length scale of the target or by the choice of ε, the latter providing yet another reason for the robustness of AAPS.
4 Numerical comparisons

In this section we compare AAPS with HMC, blurred HMC (see Section 1) and the no U-turn sampler over a variety of targets. For fairness we use the basic no U-turn sampler from Hoffman and Gelman (2014), without its adaptive tuning of ε; instead we tune ε using a fine grid of values. For both varieties of HMC we use a grid of ε and L values, and for AAPS we use a grid of ε and K values. To ensure that the various toy targets are close to representing the difficulties encountered with targets from complex statistical models, all algorithms use the identity mass matrix.
Efficiency for each target component is measured in terms of effective samples per leapfrog step. All code (AAPS, HMC and no U-turn sampler) was written in C++; the effective sample size was estimated using the R package coda, and the mean number of leapfrog steps per iteration (for AAPS and the no U-turn sampler) was output from each run as a diagnostic.

Toy targets
Here we investigate performance across a variety of targets with a relatively simple functional form, and across a variety of dimensions. Many are product densities with independent components: Gaussian, logistic and skew-Gaussian. We consider different, relatively large ratios ξ between the largest and smallest length scales of the components in a target, as well as different sequences of scales from the smallest to the largest. We also consider a modification of the Rosenbrock banana-shaped target.
For a target of dimension d, given a sequence of scale parameters, σ_1, . . . , σ_d, we consider the following four forms: π_G, π_L, π_SG and π_MR, where Φ is the distribution function of a N(0, 1) variable, α = 3, β = 1 and s_i^2 = 99(i − 1)/(d/2 − 1) + 1. The targets π_G, π_L and π_SG are products of one-dimensional Gaussian, logistic and skew-Gaussian distributions respectively. The target π_MR is a modified version of the Rosenbrock function (Rosenbrock, 1960), a well-known, difficult target for optimisation algorithms, which is also a challenging target used to benchmark MCMC algorithms (e.g. Pagani et al., 2021; Heng and Jacob, 2019). The tails of the standard Rosenbrock potential increase quartically, but all algorithms which use a standard leapfrog step and Gaussian momentum are unstable in the tails of a target where the potential increases faster than quadratically. We have, therefore, modified the standard Rosenbrock target to keep the difficult banana shape whilst ensuring quadratic tails.
For π_G, π_L and π_SG, we denote the largest ratio of length scales by ξ := max_{1≤i≤j≤d} σ_i/σ_j and define four different patterns between the lowest and highest scaling, depending on which of σ_i, σ_i^2, 1/σ_i^2 or 1/σ_i increases approximately linearly with component number. To minimise the odd behaviour that HMC (but not AAPS or the no U-turn sampler) can exhibit when scalings are rational multiples of each other (a phenomenon which is rarely seen for targets in practice) we jitter the scales for all the intermediate components.
We start by investigating products of Gaussians in d = 40 using each of the above four progressions. Of the four possibilities, the relative performance of AAPS to HMC, HMC-bl and the no U-turn sampler is worst for π_G^{VAR}, and so we investigate this regime further using alternative component distributions, and choices of dimension and ξ.
For each target, empirical efficiencies were calculated as the quotient of the minimum effective sample size over all components and the number of leapfrog steps. These were divided by the efficiency for AAPS to obtain efficiencies relative to AAPS. All effective sample sizes were at least 1000. Table 1 shows the efficiencies of HMC, blurred HMC and the no U-turn sampler relative to AAPS. For some targets some of the other algorithms are slightly more efficient than AAPS, and for others they are slightly less efficient, but across the range of targets, no algorithm is more than 1.7 times as efficient as AAPS. Empirical acceptance rates at the optimal parameter settings are provided in Appendix H.

Stochastic volatility model
Consider the following model for zero-centred, Gaussian data y = (y_1, . . . , y_T), where the variance depends on a zero-mean, Gaussian AR(1) process started from stationarity.

Table 1: Relative efficiency compared with AAPS of HMC, blurred HMC (HMC-bl) and the no U-turn sampler (NUTS); raw efficiencies were taken as the minimum over the d components of the number of effective samples per leapfrog step. (1) This is not a typographic error: the values differ at the 5th decimal place. (2) ξ for odd-numbered components of π_MR. (3) The tuning surface for HMC was so uneven that we were unable to ascertain the optimal tuning parameters with any degree of certainty; instead we picked the least unreasonable of the combinations we tried.
As in Girolami and Calderhead (2011) and Wu et al. (2019) we generate T = 1000 observations using parameters φ = 0.98, κ = 0.65 and σ = 0.15. We then apply blurred HMC, AAPS and the no U-turn sampler to perform inference on the 1003-dimensional posterior. We ran AAPS using two different K values, one found by optimising the choice of ( , K) over a numerical grid and one by using the tuning mechanism in Section 3.5. Tuning standard HMC (not blurred HMC) on such a high-dimensional target was extremely difficult. Due to the algorithm's sensitivity to the integration time, we could not identify a suitable range for L. Widely used statistical packages such as Stan (Stan Development Team, 2020) and PyMC3 (Salvatier et al., 2016) perform the blurring by default, and so we only present the results for HMC-bl.
Each algorithm was then run for ten replicates of 10^5 iterations using the optimal tuning parameters. Efficiency was calculated for each parameter, and the minimum efficiency over the latent variables and over all 1003 components were also calculated. For each algorithm the mean and standard deviation (over the replicates) of these efficiencies were ascertained.

Table 2: Relative efficiency compared with AAPS_g (AAPS tuned using a grid) of AAPS_a (AAPS tuned using the advice from Section 3.5), blurred HMC (HMC-bl) and the no U-turn sampler (NUTS) for the stochastic volatility model using 10 replicates of 10^5 iterations. Raw efficiencies were the number of effective samples per leapfrog step; these were then normalised by the mean efficiency from AAPS_g; normalised standard deviations are reported in brackets.

Table 2 reports these values normalised by the mean efficiency for AAPS_g for that parameter or parameter combination. Overall, on this complex, high-dimensional posterior, AAPS is slightly less efficient than blurred HMC, and slightly more efficient than the no U-turn sampler.

Discussion
We have presented the Apogee to Apogee Path Sampler (AAPS) and demonstrated empirically that it has a similar efficiency to HMC but is much easier to tune. From a current point, AAPS uses the leapfrog step to create a path consisting of a fixed number of segments; it then proposes a point from these segments and uses an accept-reject step to ensure that it targets the intended distribution.
We investigated six possible mechanisms for proposing a point from the path; for the numerical experiments we made the probability of choosing a point proportional to the product of the extended target density at that point and the proposal's squared distance from the current point, which allowed an O(1) memory cost. However, the flexibility in the proposal mechanism allows other possibilities, such as (‖x_prop − µ‖² − ‖x_curr − µ‖²)² for some central point, µ, with a similar motivation to the ChEES diagnostic of Hoffman et al. (2021).
Choosing the current segment's position uniformly at random from the K + 1 segments is not the only way to preserve detailed balance with respect to the intended target. For example, the current segment could be fixed as segment 0 and proposals could only be made from segment K, a choice which bears some resemblance to the window scheme in Neal (1992); in early experiments, however, we found that this had a negative impact on the robustness of the efficiency to the choice of K.
Because of its simplicity, many extensions to the AAPS algorithm are clear. For example, if the positions along the path are stored, then a delayed-rejection step may increase the acceptance probabilities. A cheap surrogate for π could be substituted within π̃ in any weighting scheme. Indeed, given c, the next value could be chosen conditional on z_curr using any Markov kernel reversible with respect to π̃ (see also Neal, 2011a). A non-reversible version of the algorithm could set K = 1 and always choose a path consisting of the current segment and the next segment forward; instead of completely refreshing the momentum at each iteration, the momentum at the start of a new step could be a Crank-Nicolson perturbation of the momentum at the end of the previous step, as in Horowitz (1991). The properties of the leapfrog step required for the validity of AAPS are a subset of those required for HMC (see Section 2.2), so any alternative momentum formulation (e.g. Livingstone et al., 2019) or numerical integration scheme that can be used within HMC could also be used within AAPS.
The AAPS algorithm with K = 0 is reducible on a multimodal one-dimensional target, and this might lead to concerns about the algorithm's performance on multimodal targets in general. However, in d dimensions the dot product p · ∇U(x) is a sum of d components, so even with K = 0, AAPS is not reducible on multimodal targets with d > 1. This is illustrated in Appendix F, which also details a short simulation study on three 40-dimensional bimodal targets where AAPS is more efficient than the no U-turn sampler and is never less than two-thirds as efficient as blurred HMC.
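To make the role of p · ∇U(x) concrete, the following minimal sketch (an illustration, not the paper's implementation) integrates leapfrog dynamics on a standard Gaussian target and counts apogees as the points where p · ∇U(x) changes sign from positive to negative, i.e. local maxima of the potential along the path:

```python
import numpy as np

def grad_U(x):
    # Potential U(x) = ||x||^2 / 2 for a standard Gaussian target, so grad U(x) = x.
    return x

def leapfrog(x, p, eps):
    # One leapfrog step with unit mass matrix.
    p = p - 0.5 * eps * grad_U(x)
    x = x + eps * p
    p = p - 0.5 * eps * grad_U(x)
    return x, p

def count_apogees(x0, p0, eps, n_steps):
    # An apogee is a local maximum of U along the path: dU/dt = p . grad U(x)
    # changes sign from positive to negative.
    x, p = x0.copy(), p0.copy()
    count = 0
    prev = np.dot(p, grad_U(x))
    for _ in range(n_steps):
        x, p = leapfrog(x, p, eps)
        cur = np.dot(p, grad_U(x))
        if prev > 0 and cur <= 0:
            count += 1
        prev = cur
    return count

x0 = np.array([1.5, -0.5])
p0 = np.array([0.3, 1.0])
n = count_apogees(x0, p0, eps=0.1, n_steps=300)
# For this target, U along the exact trajectory oscillates with period pi,
# so over integration time 30 we expect roughly 30/pi apogees.
```

The same sign-change test, applied while the path is built, is what delimits segments in any dimension.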

A The leapfrog step
From a current z_0 = (x_0, p_0), the leapfrog step with step size ε maps this to z_1 = (x_1, p_1) via three sub-steps:

p_{1/2} = p_0 − (ε/2)∇U(x_0),
x_1 = x_0 + ε p_{1/2},
p_1 = p_{1/2} − (ε/2)∇U(x_1).

The transformation is clearly deterministic, and starting at (x_1, −p_1) and applying the three sub-steps leads to (x_0, −p_0). Further, each of the three transformations has a Jacobian of 1, so the composition of the three also has a Jacobian of 1.
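The skew-reversibility just described can be checked numerically; the sketch below uses an arbitrary non-quadratic potential (an assumption for illustration, not one from the paper):

```python
import numpy as np

def grad_U(x):
    # Arbitrary non-quadratic potential: U(x) = ||x||^2 / 2 + sum(x^4) / 4.
    return x + x**3

def leapfrog_step(x, p, eps):
    # Half-step momentum, full-step position, half-step momentum.
    p_half = p - 0.5 * eps * grad_U(x)
    x_new = x + eps * p_half
    p_new = p_half - 0.5 * eps * grad_U(x_new)
    return x_new, p_new

rng = np.random.default_rng(1)
x0, p0 = rng.normal(size=3), rng.normal(size=3)
x1, p1 = leapfrog_step(x0, p0, eps=0.2)

# Start from the end point with the momentum negated...
x_back, p_back = leapfrog_step(x1, -p1, eps=0.2)

# ...and recover the starting point with its momentum negated.
ok = np.allclose(x_back, x0) and np.allclose(p_back, -p0)
```

This reversibility holds step by step, which is why the set of apogees along a path does not depend on which point of the path initialises the integration.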

B Weighting Schemes 1-3 with O(1) memory cost
As in Section 3.3, we label the points in S_{a:b}(z_curr) from furthest back in time to furthest forward in time as z_B, . . . , z_0, . . . , z_F. If z′ has been chosen from {z_0, . . . , z_F} with a probability of w(z′)/Σ_{i=0}^{F} w(z_i), and z″ has been chosen from {z_B, . . . , z_{−1}} with a probability of w(z″)/Σ_{i=B}^{−1} w(z_i), then we obtain the desired proposal by setting z_prop = z′ with a probability of Σ_{i=0}^{F} w(z_i) / Σ_{i=B}^{F} w(z_i), and otherwise setting z_prop = z″. We, therefore, use the following algorithm, which samples z from an ordered list of vectors z_c, . . . , z_e with a probability proportional to w(z_curr, z) and without storing the intermediate values; we then apply it to {z_0, . . . , z_F} and to {z_{−1}, . . . , z_B}.
1. Set z = z_c.
2. For j in {c + 1, . . . , e}:
• With probability w(z_curr, z_j) / Σ_{i=c}^{j} w(z_curr, z_i), set z = z_j; otherwise leave z unchanged.
As required, at the end of the loop, P(z = z_j) = w(z_curr, z_j) / Σ_{i=c}^{e} w(z_curr, z_i) for each j ∈ {c, . . . , e}.
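The algorithm above is a weighted analogue of reservoir sampling. A minimal sketch (with illustrative variable names) that keeps only the running total and the current selection, yet samples each item with probability proportional to its weight:

```python
import numpy as np

def stream_sample(items, weights, rng):
    """Sample one item with probability w_j / sum_i w_i in a single pass,
    storing only the running total and the current selection."""
    total = 0.0
    chosen = None
    for z, w in zip(items, weights):
        total += w
        # Replace the current selection with probability w / (running total);
        # the first item with positive weight is always selected.
        if rng.random() < w / total:
            chosen = z
    return chosen

# Empirical check: selection frequencies should match the normalised weights.
rng = np.random.default_rng(0)
weights = [1.0, 2.0, 3.0, 4.0]
counts = {z: 0 for z in "abcd"}
for _ in range(20000):
    counts[stream_sample("abcd", weights, rng)] += 1
freqs = [counts[z] / 20000 for z in "abcd"]
# freqs should be close to [0.1, 0.2, 0.3, 0.4]
```

The correctness argument is the telescoping product in the text: item j is selected at step j and then survives each later replacement with probability (running total at j) / (running total at j + 1), and the intermediate totals cancel.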
This allows us to propose z_prop with the correct probability for the first five weighting schemes, as well as to evaluate the numerator term Σ_{z ∈ S_{a:b}(z_curr)} w(z_curr, z) in (6). However, if all of the individual z and evaluations of π̃(z) were forgotten, then they would need to be recalculated from scratch to evaluate the denominator term Σ_{z ∈ S_{a:b}(z_curr)} w(z_prop, z). We now show how this can be evaluated using only 3 × d summary statistics when x ∈ R^d. Firstly,

Σ_{z ∈ S} π̃(z)‖x − x_prop‖² = Σ_{k=1}^{d} (A_k − 2 x_{prop,k} B_k + x_{prop,k}² C),    (15)

where, for k = 1, . . . , d, A_k = Σ_{z ∈ S} π̃(z)x_k², B_k = Σ_{z ∈ S} π̃(z)x_k and C = Σ_{z ∈ S} π̃(z). We create these 3d summary statistics for {z_0, . . . , z_F} and for {z_{−1}, . . . , z_B}, apply (15) in each case, and add the two final totals together.
In the case of Scheme 2, π̃(z_i) is replaced with 1, and for Scheme 1 the acceptance probability is 1, so (15) is unnecessary.
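As an illustration of the summary-statistic trick (a sketch with made-up points and weights, using the squared-distance weight w(z′, z) = π̃(z)‖x − x′‖²), the denominator can be recovered exactly from running totals of π̃(z), π̃(z)x_k and π̃(z)x_k²:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 50
xs = rng.normal(size=(n, d))          # positions along the path
pis = rng.uniform(0.5, 1.5, size=n)   # stand-ins for pi-tilde(z) evaluations

# Accumulate the summary statistics in one pass (O(d) memory, not O(n d)).
C = 0.0                  # running sum of pi-tilde(z)
B = np.zeros(d)          # running sum of pi-tilde(z) * x_k
A = np.zeros(d)          # running sum of pi-tilde(z) * x_k^2
for x, w in zip(xs, pis):
    C += w
    B += w * x
    A += w * x**2

x_prop = rng.normal(size=d)

# Denominator via the summary statistics, expanding ||x - x_prop||^2 componentwise.
via_stats = np.sum(A - 2.0 * x_prop * B + x_prop**2 * C)

# Direct O(n) recomputation for comparison.
direct = sum(w * np.sum((x - x_prop) ** 2) for x, w in zip(xs, pis))
```

The two quantities agree to floating-point precision, so the individual path points never need to be retained.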
C Proofs of results in Section 3.6

C.1 Proof of Theorem 1

Proof. When U has the form (8) and ρ ≡ N(0, I_d), the components are independent and, from Hamilton's equations (2), they evolve independently. We, therefore, start by considering an individual component, initiated at a scalar position X_0 ∼ π_ν and momentum P_0 ∼ ρ_1, and with (x_t, p_t) = φ(t; x_0, p_0) as in Section 2.1. The deterministic map (x_0, p_0) → (x_t, p_t) has a Jacobian of 1, so the density of (X_t, P_t) is f(x_t, p_t) = π̃_ν(φ(−t; x_t, p_t)) = π̃_ν(x_0, p_0). However, the conservation of energy under Hamiltonian dynamics is equivalent to (d/dt)π̃_ν(x_t, p_t) = 0, so f ≡ π̃_ν and (X_t, P_t) is stationary. Now consider the contribution to the dot product from this single component: applying Hamilton's equations (2) and then stationarity gives the expectation in (14).
Next, by Hamilton's equations (2), we may rewrite D in an integrated form. By the stationarity of (X_t, P_t) when (X_0, P_0) ∼ π̃_ν, the joint distribution of (X_s, P_s, X_t, P_t) for two times s and t is the same as that of (X_0, P_0, X_{t−s}, P_{t−s}). Moreover, (x_t, p_t) = φ(t; x_0, p_0) = φ(t − s; x_s, p_s); in particular, p_t from a start at (x_0, p_0) is identical to p_{t−s} from a start at (x_s, p_s). Define V(s, t; ν) := Cov_{(X_0,P_0)∼π̃_ν}[D(s; X_0, P_0), D(t; X_0, P_0)]; since E[D] = 0, V(s, t; ν) = E[D(s; X_0, P_0)D(t; X_0, P_0)], and, by the above, it depends on (s, t) only through t − s. Then, by the strong law of large numbers, the average converges to the limit defined in (13).
Hence, for any integer n, all n-dimensional distributions tend to a multivariate Gaussian with covariances given according to V , and the result follows.
This proof would carry through for a non-Gaussian momentum formulation, subject to a moment condition on p_0.

D.1 Heuristic explanation
For a particular segment, j segments from the current point, let s*(j; x, p) represent the average (over the segment) squared distance in position between the current point, (x, p), and the points in the segment. To explain, heuristically, how the diagnostic works, we make three simplifying assumptions: 1. s*(j; x, p) = c(x, p)s(j), for some functions c and s with s(0) = 0.
2. The number of points in segment j does not depend on j; i.e., |S j (X, P )| = N (X, P ) for some integer-valued function N .
3. Acceptance probabilities are generally large enough that variation in these is a secondary effect.
The first assumption seems reasonable as, at least for low j, s*(j; x, p) ∝ j², approximately, although, strictly, s*(0; x, p) > 0 unless there is a single point in the initial segment. The second assumption is strictly incorrect, but in the limit as ε ↓ 0, E[|S_j(X, P)|] does not depend on j since, at stationarity, all points on the Hamiltonian path have the same density as the initial point (see the proof of Theorem 1). Thus, the second assumption is reasonable provided segment sizes do not vary too much from N(X, P). The third assumption is certainly correct in the limit as the step size ε ↓ 0, but is reasonable empirically more generally.
Since proposals are made in proportion to squared jumping distance, the probability that the proposal lies in a given segment is proportional to the number of points in that segment multiplied by the mean squared distance from the current point to its points. We now assume that the tuning parameter has been set to K. The absolute segment number k is proposed with a probability proportional to p_K(k)N(X, P)c(X, P)s(k). The mean squared jumping distance resulting from a tuning parameter K is, therefore, J_K = c(X, P) Σ_k p_K(k)s(k)² / Σ_k p_K(k)s(k).
Denoting E[c(X, P)] by c̄, the expectation over all initial values is proportional to c̄ Σ_k p_K(k)s(k)² / Σ_k p_K(k)s(k), where the k = 0 term vanishes since s(0) = 0. J_K does not take into account the computational effort, which is proportional to the number of segments, K + 1. Hence, we define the efficiency as Eff_K := J_K/(K + 1). Intuitively, for small j, s(j) ∝ j², approximately, which motivates the assumptions in the following.
Proof. Jensen's inequality gives J_K ≥ s(E[J]). Further, as s is convex, for any b ≥ a > 0, the slope of the chord from 0 to b is at least as large as that of the chord from 0 to a:

s(b)/b ≥ s(a)/a,    (18)

since s(0) = 0. Thus, for j ≤ (K + 1)/2, setting b = K + 1 − j and a = j, we have s(K + 1 − j) ≥ s(j), and so E[J] ≥ (K + 1)/2. Since s′(j) ≥ 0 on {0, . . . , K}, we have J_K ≥ s(E[J]) ≥ s((K + 1)/2), proving the first part. Reusing (18) with b = (K + 1)/2 and a = K/2 gives, as required, s((K + 1)/2) ≥ ((K + 1)/K)s(K/2) ≥ s(K/2).

Indeed, straightforward algebra shows that if s(j) = λj then Eff_K = λ/2; the inequality is tight for convex functions. However, in practice, s(j) is strictly convex only initially, which suggests the maximum efficiency may occur after the first point at which s is no longer convex. Plots, from various starting points, of the mean squared Euclidean distance from the start to points in segment j show behaviour approximately similar to s(j) ∝ 1 − cos(πj/b), for some b, although there is typically less oscillation between peaks and troughs once the first peak has been passed. For small j, this formulation gives, approximately, s(j) ∝ j², which fits with intuition. Figure 4 plots s(j) against j when b = 15, and the resulting Eff_K against K. The efficiency is maximised at a value of K close to the first maximiser of s, inf arg max_{j≥0} s(j), and the relative difference between the efficiencies at the two points is small (here less than 0.5%). The 1/(K + 1) penalty term means that damping of the oscillations of s(j) after the first peak will make no difference to the point of maximum efficiency, only to the tail of the efficiency curve.

Figure 5 compares the optimal K found by using a fine grid of (K, ε) values with the value suggested by our diagnostic. All plots are for the skewed-Gaussian product targets described in Section 4.1, but using a wider variety of eccentricity parameters, ξ, and all plots show good agreement between the diagnostic and the empirical estimate of the truth. Similar agreement was found for other targets from Section 4.1 that we investigated.
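The shape of the efficiency curve can be sketched numerically. The toy model below uses s(j) = 1 − cos(πj/b) with b = 15 and the simplifying assumption (ours, for illustration) that the proposed offset j is drawn from {0, . . . , K} with probability proportional to s(j), so the mean squared jump is Σ_j s(j)² / Σ_j s(j); dividing by the cost K + 1 gives the efficiency:

```python
import numpy as np

b = 15.0

def s(j):
    # Stylised mean squared distance to segment j; first peak at j = b.
    return 1.0 - np.cos(np.pi * j / b)

def efficiency(K):
    # Offsets 0..K proposed with probability proportional to s(j)
    # (a simplification of the scheme in the text);
    # efficiency = mean squared jump divided by cost K + 1.
    j = np.arange(K + 1)
    w = s(j)
    mean_sq_jump = np.sum(w * s(j)) / np.sum(w)
    return mean_sq_jump / (K + 1)

effs = {K: efficiency(K) for K in range(1, 61)}
best_K = max(effs, key=effs.get)
# best_K lands near the first peak of s at j = b = 15, and the second peak of
# s near j = 45 does not produce a second efficiency maximum because of the
# 1/(K + 1) penalty.
```

Even under this crude model, the maximiser sits close to the first peak of s and the curve decays thereafter, matching the qualitative behaviour described above.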
Figure 5: The optimal K found by testing over a fine grid of possible values, compared with predictions based on the diagnostic using K* = 60. All targets had dimension d = 40 and were products of skewed Gaussians. Each run of AAPS was for at least 10^5 iterations.

D.2 Empirical verification

The targets are products of one-dimensional distributions, respectively Gaussian, logistic and skewed Gaussian: π^VAR_G, π^VAR_L and π^VAR_SG with ξ = 40, as described in Section 4.1. For each target, over all components, the smallest scale parameter is σ_1 = 1. For a Gaussian target this implies that the leapfrog step is only stable provided ε < 2σ_1 = 2 (e.g. Neal, 2011b). We, therefore, stop at ε = 1.9, and even here undesirable behaviour can be seen in Figure 6. For a skewed normal distribution with σ_1 = 1 and α = 3, the density of the narrowest component in the left tail is proportional to φ(x)Φ(3x) ≈ φ(x)φ(3x)/(3|x|) ∝ exp(−5x²)/|x|. So, in the left tail, it is approximately equivalent to a Gaussian with σ_1 = 1/√10. Thus stability might only be hoped for when ε < 2/√10, as borne out in Figure 7. The plots all show that, for any sensible ε, AAPS performance is much more robust to the choice of K than HMC (whether blurred or not) is to the choice of L or, equivalently, T.
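The claimed effective scale can be checked numerically: in the left tail, the curvature of the log-density of the α = 3 skew normal approaches −1/σ² = −10, i.e. that of an N(0, 1/10). A quick finite-difference sketch (ours, not from the paper):

```python
import math

def log_skewnorm(x, alpha=3.0):
    # log of phi(x) * Phi(alpha * x), up to an additive constant;
    # erfc keeps the tail evaluation accurate for large negative x.
    return -0.5 * x * x + math.log(0.5 * math.erfc(-alpha * x / math.sqrt(2.0)))

# Second derivative of the log-density at x = -3 by central differences.
x, h = -3.0, 1e-3
curv = (log_skewnorm(x + h) - 2.0 * log_skewnorm(x) + log_skewnorm(x - h)) / h**2
# curv is close to -10, matching a Gaussian with sigma = 1/sqrt(10).
```

The residual difference from −10 comes from the slowly varying 1/|x| factor in the tail approximation and shrinks as x moves further into the tail.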

F Bimodal Targets
The AAPS algorithm with K = 0 is reducible on a multimodal, one-dimensional target as it cannot travel between the modes; however, Figure 9 illustrates that even in d = 2, AAPS with K = 0 is not reducible. The figure depicts the path under Hamiltonian dynamics for the target in (19) with d = 2 and a = 3.5, except that the variance of the second mixture component is 4I_2. The particle starts at the blue square and ends at the yellow triangle, with apogees marked as red dots. The figure clearly shows paths which cross from one potential well into another between apogees. The approximate dynamics using the leapfrog step can allow movement between the two wells between apogees by this mechanism; moreover, when ε is large, the dynamics sometimes miss an apogee entirely, which again allows movement between the modes.
To compare algorithm efficiencies on bimodal targets, we ran AAPS, blurred HMC and the no U-turn sampler on targets of the form (19) (Pompe et al., 2020, rotated), with three different values of a, from 7 (modes barely separated) to 15 (substantial separation). Table 3 shows the efficiencies of the optimised algorithms in each case.
Blurred HMC is more efficient than AAPS, which is more efficient than the no U-turn sampler; however, AAPS is always at least half as efficient as blurred HMC.

G Weighting Scheme 6
The description of Scheme 6 in the main text does not explain how to deal with the point z_h, where T_{h−1} < T_F/2 ≤ T_h. Of course, it may be that T_F/2 = T_h, in which case the issue does not arise, but this event has a probability of 0.

Table 4: Acceptance rates at the optimal parameter settings for AAPS, HMC, blurred HMC (HMC-bl) and the no U-turn sampler (NUTS). * ξ for odd-numbered components of π_MR. ** The acceptance rate for HMC on π_RNG was extremely sensitive to T.
We split the point z_h into two identical points: z_h^B = z_h, which has a weight of w_h^B = T_F/2 − T_{h−1}, and z_h^F = z_h, which has a weight of w_h^F = T_h − T_F/2. Thus w_h^B + w_h^F = π̃(z_h). If h > 0 then z_curr is in the "first half" of S_{a:b}; if h < 0 then z_curr is in the "second half". If h = 0 then we set z_h = z_h^B with a probability of w_h^B/π̃(z_h) and say z_curr is in the "first half"; otherwise we set z_h = z_h^F and say z_curr is in the "second half". When sampling from the opposite half to z_curr, if this is the first half then we sample z_prop from {z_B, . . . , z_{h−1}, z_h^B}; otherwise we sample from {z_h^F, z_{h+1}, . . . , z_F}. In each case the sampling probability is proportional to π̃(z_prop).

Table 4 gives the empirical acceptance rates at the optimal parameter settings for each of the four algorithms and for the set of targets described in Section 4.1. These correspond to the efficiencies reported in Table 1.