Utilising Partial Momentum Refreshment In Separable Shadow Hamiltonian Hybrid Monte Carlo

Sampling using integrator-dependent shadow Hamiltonians has been shown to produce improved sampling properties relative to Hamiltonian Monte Carlo. Shadow Hamiltonians are typically non-separable, requiring the expensive generation of momenta, with the recent trend being to utilise partial momentum refreshment. Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) employs a canonical transformation which renders the Hamiltonian separable and makes use of a processed leapfrog integrator. In this work, we combine the sampling benefits of S2HMC with partial momentum refreshment to create the Separable Shadow Hamiltonian Hybrid Monte Carlo with Partial Momentum Refreshment (PS2HMC) algorithm, which leaves the target distribution invariant. Numerical experiments across various targets show that the proposed algorithm outperforms S2HMC and Shadow Hamiltonian Monte Carlo with partial momentum refreshment. A comprehensive analysis is performed on the banana-shaped distribution, multivariate Gaussian distributions of various dimensions, Bayesian logistic regression and Bayesian neural networks.


I. INTRODUCTION
Markov Chain Monte Carlo (MCMC) methods have been successfully employed to sample from complex statistical and machine learning models [1,2,3,4]. The first MCMC method to be introduced into the literature was the Metropolis-Hastings [5] method, which has since been enhanced using gradient-free MCMC techniques [6,7,8] as well as approaches that incorporate the gradient information of the target posterior [9,10,11]. MCMC methods have been applied in various contexts including health, renewable energy, finance and inverse problems [12,13,14,15,16].
A popular MCMC algorithm is Hamiltonian Monte Carlo (HMC) [9,17,18], which utilises Hamiltonian dynamics to sample from the target posterior. HMC improves on random walk samplers such as the Metropolis-Hastings algorithm by utilising the first-order gradient information of the unnormalised posterior distribution to guide its exploration. This results in lower auto-correlations between the generated samples when compared to random walk samplers.
There have been various improvements to the HMC algorithm first introduced by Duane et al. [9]. These variations include Magnetic Hamiltonian Monte Carlo [19,11,20,21], which adds a magnetic field to HMC and leads to lower autocorrelations in the generated samples; Riemannian Manifold Hamiltonian Monte Carlo [10], which employs second-order gradient information to explore the target; Wormhole Hamiltonian Monte Carlo [22], which efficiently samples from isolated modes of the target distribution by exploiting its Riemannian geometric properties; the No-U-Turn Sampler [23,14], which addresses a key impediment of HMC, namely the step size and trajectory length parameters that are very difficult to tune manually; as well as methods that use modified or shadow Hamiltonians to sample from high-dimensional targets [1,14,24,13].
Since the seminal work of Izaguirre and Hampton [1] on integrator-dependent shadow Hamiltonians, there has been a proliferation of shadow Hamiltonian Monte Carlo methods in the literature. The shadow Hamiltonian methods are premised on the fact that shadow Hamiltonians are better conserved than the true Hamiltonians [1]. This allows one to use larger step sizes, or to perform sampling on problems with larger dimensions, without a significant decrease in the acceptance rates when compared to Hamiltonian Monte Carlo methods [25,26,27]. The authors introduce a constant, which determines how close the true and the shadow Hamiltonians are, to control the generation of the momentum. This increases the overall computational time of the method.
Sweet et al. [24] improve on the work of Izaguirre and Hampton [1] by applying a canonical transformation to the parameters and momenta. This canonical transformation is substituted into the non-separable Hamiltonian introduced in Izaguirre and Hampton [1] so that it becomes separable. The result is a processed leapfrog integration scheme which is more computationally efficient than the original shadow Hamiltonian Monte Carlo method of Izaguirre and Hampton [1], as the computationally expensive momentum generation for the non-separable Hamiltonian is no longer required.
Partial momentum refreshment has been utilised by Radivojevic and Akhmatskaya [26] and Akhmatskaya and Reich [25] to generate momenta in the context of non-separable Hamiltonians. Radivojevic and Akhmatskaya [26] also consider higher-order integrators and their corresponding shadow Hamiltonians and propose the Mix and Match Hamiltonian Monte Carlo algorithm, which provides better sampling properties than HMC.
Heide et al. [27] derive a non-separable shadow Hamiltonian for the generalised leapfrog integrator used in Riemannian Manifold Hamiltonian Monte Carlo (RMHMC), which results in improved performance relative to sampling from the true Hamiltonian. The authors employed partial momentum refreshment to generate the momenta. Partial momentum refreshment has also been used by Horowitz [28] to improve the sampling properties of Hamiltonian Monte Carlo and by Mongwe et al. [21] to enhance the performance of Magnetic Hamiltonian Monte Carlo. The results showed the significant benefits that can be obtained by utilising partial momentum refreshment in Hamiltonian Monte Carlo methods [21,28]. Employing partial momentum refreshment in S2HMC is yet to be explored in the literature; this manuscript aims to fill that gap.
In this work, we show that the proposed algorithm outperforms S2HMC and Shadow Hamiltonian Monte Carlo with partial momentum refreshment on all the performance metrics considered. The remainder of this manuscript proceeds as follows: Section II discusses the Markov Chain Monte Carlo methods considered in this work, Section III presents the proposed method, Section IV outlines the target densities considered, Section V outlines the experiments conducted, Section VI presents and discusses the results of the experiments, and Section VII concludes.

II. SHADOW HAMILTONIAN MONTE CARLO
Hamiltonian Monte Carlo (HMC) employs Hamiltonian dynamics to efficiently explore the parameter space [9,18,29]. HMC adds an auxiliary momentum variable p ∈ R^D to the parameter space w ∈ R^D, where D is the number of dimensions. The resultant Hamiltonian H : R^{2D} → R from this dynamical system is written as [18]:

H(w, p) = U(w) + K(p),    (1)

where U(w) is the negative log-likelihood of a differentiable target posterior distribution and K(p) is the kinetic energy defined by the kernel of a Gaussian with covariance matrix M [21,18,29]:

K(p) = (1/2) pᵀ M⁻¹ p.    (2)

The trajectory vector field is defined by considering the parameter space as a physical system that follows Hamiltonian dynamics [21,18]. The equations governing the trajectory of the chain are then given by Hamilton's equations at a fictitious time t as follows [18]:

dw/dt = ∂H/∂p = M⁻¹ p,    dp/dt = −∂H/∂w = −∇U(w).    (3)

The evolution of this Hamiltonian system must preserve both volume and total energy [21]. As the Hamiltonian in equation (1) is separable, we traverse the space using the leapfrog integrator [21,9,18]. The leapfrog integration scheme proceeds as follows: the next point on the trajectory is reached by taking a half step in the momentum direction, followed by a full step in the parameters and concluding with a half step in the momentum direction [21,18]. Mathematically, for step size ε this is expressed as:

p(t + ε/2) = p(t) − (ε/2) ∇U(w(t)),
w(t + ε) = w(t) + ε M⁻¹ p(t + ε/2),    (4)
p(t + ε) = p(t + ε/2) − (ε/2) ∇U(w(t + ε)).

A Metropolis-Hastings acceptance step is then performed in order to account for the discretisation errors introduced by the numerical integration scheme.

The leapfrog integrator for HMC only preserves the Hamiltonian up to second order [1,24]. In order to increase accuracy and maintain the acceptance rate for larger systems, one could decrease the step size or design more accurate numerical integrators that preserve the Hamiltonian to a higher order [26,27]. However, these approaches tend to be computationally expensive [26,27]. The approach in this work instead relies on backward error analysis to derive a shadow Hamiltonian, whose energy is more accurately conserved by the leapfrog algorithm.
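The leapfrog scheme above can be sketched as follows. This is a minimal NumPy illustration (not the paper's implementation), assuming a unit mass matrix and a standard Gaussian target for the energy check; the function names are ours:

```python
import numpy as np

def leapfrog(w, p, grad_U, eps, L, Minv):
    """Leapfrog integration: half momentum step, full position step,
    half momentum step, repeated L times."""
    w, p = w.copy(), p.copy()
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(w)   # half step in momentum
        w = w + eps * (Minv @ p)        # full step in parameters
        p = p - 0.5 * eps * grad_U(w)   # half step in momentum
    return w, p

# Example on a standard Gaussian target: U(w) = 0.5 * w @ w.
grad_U = lambda w: w
H = lambda w, p: 0.5 * w @ w + 0.5 * p @ p
w0, p0 = np.array([1.0, -0.5]), np.array([0.3, 0.2])
w1, p1 = leapfrog(w0, p0, grad_U, eps=0.1, L=25, Minv=np.eye(2))
# The Hamiltonian is only approximately conserved (second order in eps),
# while the map is exactly time-reversible: negating the momentum and
# integrating again returns to the starting point.
```

The time-reversibility and volume preservation of this map are what make the subsequent Metropolis-Hastings correction valid.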
We thus instead target the corresponding modified density and employ importance sampling to correct the generated samples towards the true density [26,27].
Shadow or modified Hamiltonians are perturbations of the Hamiltonian that are by design conserved to higher accuracy by the numerical integrator [26,1,3,13]. In the case of shadow Hamiltonian hybrid Monte Carlo, we sample from the importance distribution defined by the shadow Hamiltonian:

π̃(w, p) ∝ exp(−H̃[k](w, p)),

where H̃[k] is the shadow Hamiltonian defined using backward error analysis of the numerical integrator up to the k-th order [13,3]. When performing backward error analysis, the shadow Hamiltonian can be defined by an asymptotic expansion in powers of the discretisation step size ε around the Hamiltonian:

H̃ = H + ε² H₂ + ε⁴ H₄ + ⋯ .

This asymptotic expansion diverges in practice; however, a k-th order truncation of the expansion is used:

H̃[k] = H + ε² H₂ + ⋯ + ε^{k−2} H_{k−2},

which is conserved by the integrator to O(ε^k). The H_i terms can be determined by matching the corresponding components of the Taylor series in terms of ε with the expanded exact flow of the modified differential equation of the Hamiltonian. These modified equations can be proved to be Hamiltonian for symplectic integrators such as the leapfrog integrator [24,1,13,3].
In this work, we focus on a fourth-order truncation of the shadow Hamiltonian under the leapfrog integrator. Since the leapfrog is second-order accurate (O(ε²)), the fourth-order truncation is conserved with higher accuracy (O(ε⁴)) than the true Hamiltonian. In Theorem II.1, we derive the fourth-order shadow Hamiltonian under the leapfrog integrator.
Theorem II.1 (Shadow Hamiltonian function). The fourth-order shadow Hamiltonian function Ĥ : R^D × R^D → R corresponding to the leapfrog integrator of HMC is given by:

Ĥ(w, p) = H(w, p) + (ε²/12) pᵀ M⁻¹ ∇²U(w) M⁻¹ p − (ε²/24) ∇U(w)ᵀ M⁻¹ ∇U(w).    (13)

Proof. The Hamiltonian vector field:

ż = J ∇H(z),  z = (w, p),  J = [[0, I], [−I, 0]],

generates the exact flow corresponding to exactly simulating the HMC dynamics. We obtain the shadow density by exploiting the separability of the Hamiltonian. The leapfrog integration scheme in equation (4) splits the Hamiltonian as:

H(w, p) = U(w) + K(p),

and exactly integrates each sub-Hamiltonian. Via repeated application of the Baker-Campbell-Hausdorff formula we obtain [30]:

Ĥ = H + (ε²/12) {K, {K, U}} − (ε²/24) {U, {U, K}} + O(ε⁴),

where the canonical Poisson brackets are defined as:

{A, B} = ∇_w A · ∇_p B − ∇_p A · ∇_w B.

Evaluating the brackets with K(p) = (1/2) pᵀ M⁻¹ p gives {K, {K, U}} = pᵀ M⁻¹ ∇²U(w) M⁻¹ p and {U, {U, K}} = ∇U(w)ᵀ M⁻¹ ∇U(w), which yields precisely (13). It is worth noting that the shadow Hamiltonian in (13) is conserved to fourth order [27,24,26].
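The improved conservation of the fourth-order shadow Hamiltonian can be checked numerically. The sketch below (a minimal NumPy illustration with M = I and a Gaussian target; function names are ours) evaluates the commonly used fourth-order form for the leapfrog integrator and compares its drift against that of the true Hamiltonian:

```python
import numpy as np

def true_H(w, p, U):
    return U(w) + 0.5 * p @ p

def shadow_H(w, p, U, grad_U, hess_U, eps):
    """Fourth-order shadow Hamiltonian of the leapfrog integrator with
    M = I: H + (eps^2/12) p' H_ww p - (eps^2/24) |grad U|^2."""
    return (true_H(w, p, U)
            + (eps**2 / 12.0) * (p @ hess_U(w) @ p)
            - (eps**2 / 24.0) * (grad_U(w) @ grad_U(w)))

# Gaussian target U(w) = 0.5 w' A w, whose Hessian is constant.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
U = lambda w: 0.5 * w @ A @ w
grad_U = lambda w: A @ w
hess_U = lambda w: A

eps = 0.2
w, p = np.array([1.0, 0.0]), np.array([0.0, 1.0])
H0 = true_H(w, p, U)
S0 = shadow_H(w, p, U, grad_U, hess_U, eps)
drift_true, drift_shadow = [], []
for _ in range(50):                      # one leapfrog step per iteration
    p = p - 0.5 * eps * grad_U(w)
    w = w + eps * p
    p = p - 0.5 * eps * grad_U(w)
    drift_true.append(abs(true_H(w, p, U) - H0))
    drift_shadow.append(abs(shadow_H(w, p, U, grad_U, hess_U, eps) - S0))
# The shadow Hamiltonian drifts far less than the true Hamiltonian.
```

The O(ε⁴) drift of the shadow energy, versus O(ε²) for the true energy, is what allows larger step sizes without collapsing acceptance rates.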
Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) utilises a processed leapfrog integrator to create a separable Hamiltonian [24,13,3]. The separable Hamiltonian in S2HMC is given as:

H̃(ŵ, p̂) = U(ŵ) + (1/2) p̂ᵀ M⁻¹ p̂ + (ε²/24) ∇U(ŵ)ᵀ M⁻¹ ∇U(ŵ),

which is obtained by substituting a canonical transformation (ŵ, p̂) = X(w, p) into (13). The map should commute with the reversal of momenta and should preserve phase space volume so that the resulting S2HMC ensures detailed balance [24,13,3]. Propagation of positions and momenta on this shadow Hamiltonian is performed after applying this reversible mapping (ŵ, p̂) = X(w, p). The canonical transformation X(w, p) is defined implicitly [24,13,3], and (ŵ, p̂) is found through fixed-point iterations in a pre-processing step. After the leapfrog is performed, this mapping is reversed in a post-processing step via analogous fixed-point iterations. Once the samples are obtained from S2HMC, importance weights are calculated to allow for the use of the shadow canonical density rather than the true density. These weights are based on the differences between the true and shadow Hamiltonians as b_m = exp[−(H(w, p) − Ĥ(w, p))]. Mean estimates of observables f(w), which are functions of the parameters w, can then be computed as a weighted average.
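The importance re-weighting step above can be sketched as follows (a minimal NumPy illustration; the log-domain shift by the maximum is a numerical-stability detail we add, not part of the original formulation):

```python
import numpy as np

def importance_weights(H_true, H_shadow):
    """Weights b_m = exp(-(H - H_shadow)), stabilised by shifting the
    log-weights by their maximum before exponentiation."""
    log_b = -(np.asarray(H_true) - np.asarray(H_shadow))
    return np.exp(log_b - log_b.max())

def weighted_mean(f_vals, weights):
    """Importance-weighted estimate of E[f(w)] under the true density."""
    return np.sum(weights * f_vals) / np.sum(weights)

# When the shadow and true Hamiltonians agree, the weights are uniform
# and the estimate reduces to the ordinary sample mean.
f_vals = np.array([1.0, 2.0, 3.0, 4.0])
H = np.array([0.5, 1.0, 1.5, 2.0])
w_uniform = importance_weights(H, H)
```

Because the weights depend only on energy differences, the unknown normalising constants of both densities cancel.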

III. SEPARABLE SHADOW HAMILTONIAN HYBRID MONTE CARLO WITH PARTIAL MOMENTUM REFRESHMENT
We now introduce the Separable Shadow Hamiltonian Hybrid Monte Carlo With Partial Momentum Refreshment (PS2HMC) algorithm. This method combines the Separable Shadow Hamiltonian Hybrid Monte Carlo (S2HMC) algorithm with the sampling benefits of utilising partial momentum refreshment. The benefits of employing partial momentum refreshment in general have already been established in [31,25,21], while the advantages of S2HMC are presented in [24,13,3]. In this work, we combine these two concepts with the aim of creating a new sampler that outperforms S2HMC across various performance metrics. This approach is yet to be explored in the literature.
An analysis of the shadow Hamiltonian corresponding to HMC in equation (13) shows that the shadow density's conditional density π_Ĥ(p|w) is no longer Gaussian. This implies that simply sampling new Gaussian momenta and accepting with probability one no longer leaves the conditional density invariant [27,25]. This necessitates computationally intensive momentum generation [1] or partial momentum refreshment [26,25].
In this work, we utilise the partial momentum refreshment technique outlined in [21,25,27], in which an auxiliary noise vector u ∼ N(0, M) is drawn and a momentum proposal is generated via the mapping:

p′ = ρ p + √(1 − ρ²) u,    u′ = −√(1 − ρ²) p + ρ u.

The new parameter, which we refer to as the momentum refreshment parameter, ρ = ρ(w, p, u), takes values between zero and one and controls the extent of the momentum retention [21,27,26]. When ρ is equal to one, the momentum is never updated, and when ρ is equal to zero, the momentum is always updated [21]. The momentum proposals are then accepted according to the modified non-separable shadow density given as H̄(w, p, u) = Ĥ(w, p) + (1/2) uᵀ M⁻¹ u. The updated momentum is then taken to be ρ p + √(1 − ρ²) u with probability:

min{1, exp(−H̄(w, p′, u′) + H̄(w, p, u))}.

This process produces a Markov chain that conserves some of the dynamics between consecutive generated samples [21,31,27,25,26]. The effect of ρ is that it adds an extra degree of freedom to the algorithm, and it can be constructed so that it depends on the momentum and the position [25,27,21]. In Section V-C, we assess the sensitivity of the sampling results to the user-specified value of ρ.
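The refreshment step can be sketched as follows. This is a minimal NumPy illustration assuming M = I, with the rotation and Metropolis test written out explicitly; the helper names are ours, not from the cited works:

```python
import numpy as np

def refresh_momentum(w, p, rho, shadow_H, rng):
    """Partial momentum refreshment with M = I: rotate (p, u) by the
    angle implied by rho, then accept or reject via a Metropolis test
    on the extended density H_bar(w, p, u) = shadow_H(w, p) + 0.5 u'u."""
    u = rng.standard_normal(p.shape)
    c = np.sqrt(1.0 - rho**2)
    p_new, u_new = rho * p + c * u, -c * p + rho * u
    H_old = shadow_H(w, p) + 0.5 * u @ u
    H_new = shadow_H(w, p_new) + 0.5 * u_new @ u_new
    if rng.random() < min(1.0, np.exp(H_old - H_new)):
        return p_new
    return p

# With a quadratic shadow Hamiltonian the rotation preserves the
# extended energy exactly, so every proposal is accepted and repeated
# refreshment keeps p distributed as a standard Gaussian.
rng = np.random.default_rng(0)
quad_H = lambda w, p: 0.5 * p @ p
w = np.zeros(1)
p = rng.standard_normal(1)
draws = []
for _ in range(2000):
    p = refresh_momentum(w, p, rho=0.5, shadow_H=quad_H, rng=rng)
    draws.append(p[0])
```

Note that with ρ = 1 the rotation is the identity, so the momentum is returned unchanged, matching the limiting behaviour described above.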
An algorithmic description of the PS2HMC sampler is provided in Algorithm 1. The algorithm proceeds by first generating a proposal of the momentum, which is accepted or rejected via a Metropolis-Hastings step, and concludes by generating the positions before applying another Metropolis-Hastings step. Thus, the method employs a composition of two reversible Metropolis-Hastings steps, which implies that the resulting Markov chain is no longer reversible [27,25,26,21]. This raises the question of whether or not the method converges to the correct target distribution [27,26,21]. Theorem III.1 provides this guarantee.
Theorem III.1. The PS2HMC sampler leaves the target density invariant.
The proof of Theorem III.1 follows from [32,26] after observing that the explicit form of the shadow density is not required for the proof [27,21]. This means that the shadow density can be of any form.
It is worth noting that the momentum regeneration scheme used in Shadow Hamiltonian Monte Carlo with partial momentum refreshment (SHMC) in this work is the same as the one outlined in Algorithm 1.
The proposed method involves a minimum (depending on how the gradients are calculated) of 14 likelihood or target posterior function evaluations to generate a single sample, which is much larger than the minimum of 4 evaluations required by standard HMC. However, it should be noted that the proposed PS2HMC method only adds two more likelihood evaluations compared to S2HMC (on which the method is based), and S2HMC has already been shown in [24,14] to outperform HMC. In this work, we show that our proposed PS2HMC method outperforms S2HMC and consequently outperforms HMC.

IV. THE TARGET POSTERIORS
In this section, we outline the target distributions that were considered in this work. These target distributions are in line with those used in Mongwe et al. [21], with the additional target considered being Bayesian neural networks.

A. BANANA SHAPED DISTRIBUTION
The banana-shaped density is a two-dimensional non-linear target which was first presented in Haario et al. [8]. The likelihood and prior distributions are given as:

y | w ∼ N(w₁ + w₂², σ_y²),    w₁, w₂ ∼ N(0, σ_w²).

We generated one hundred data points for y with σ_y² = 4 and σ_w² = 1. Due to the independence of the data and parameters, the posterior distribution is proportional to:

p(w | y) ∝ exp( −(1/(2σ_y²)) Σ_{i=1}^{N} (y_i − w₁ − w₂²)² − (w₁² + w₂²)/(2σ_w²) ),

where N = 100 is the number of observations.
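Assuming the standard banana likelihood y ∼ N(w₁ + w₂², σ_y²) with independent zero-mean Gaussian priors, the unnormalised log-posterior can be sketched as (a minimal NumPy illustration; the synthetic data draw is only a stand-in for the generated observations):

```python
import numpy as np

def banana_log_posterior(w, y, sigma_y2=4.0, sigma_w2=1.0):
    """Unnormalised log-posterior of the banana-shaped target:
    Gaussian likelihood with mean w1 + w2^2 and independent
    zero-mean Gaussian priors on w1 and w2."""
    mu = w[0] + w[1] ** 2
    log_lik = -0.5 * np.sum((y - mu) ** 2) / sigma_y2
    log_prior = -0.5 * (w[0] ** 2 + w[1] ** 2) / sigma_w2
    return log_lik + log_prior

# One hundred synthetic observations, mirroring the setup described above.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=100)
```

The quadratic dependence on w₂ makes the density symmetric in the sign of w₂, which is what produces the characteristic banana-shaped ridge.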

B. MULTIVARIATE GAUSSIAN DISTRIBUTIONS
We follow the approach of Mongwe et al. [21] and sample from D-dimensional Gaussian distributions N (0, Σ) with mean zero and covariance matrix Σ. For our purposes, we set the covariance matrix Σ to be diagonal. We sample the standard deviations from a log-normal distribution with zero mean and unit standard deviation. We assess the case where the number of dimensions D is in the set {10, 50, 100}.
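This construction can be sketched as follows (a minimal NumPy illustration; the seed and helper names are ours):

```python
import numpy as np

def make_gaussian_target(D, rng):
    """Diagonal-covariance Gaussian target whose standard deviations are
    drawn from a log-normal with zero mean and unit standard deviation."""
    stds = rng.lognormal(mean=0.0, sigma=1.0, size=D)
    Sigma = np.diag(stds ** 2)
    Sigma_inv = np.diag(1.0 / stds ** 2)

    def U(w):            # negative log-density up to a constant
        return 0.5 * w @ Sigma_inv @ w

    def grad_U(w):       # gradient used by the leapfrog integrator
        return Sigma_inv @ w

    return U, grad_U, Sigma

U, grad_U, Sigma = make_gaussian_target(10, np.random.default_rng(0))
```

Returning U and its gradient as closures keeps the target in the form the samplers above consume.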

C. BAYESIAN LOGISTIC REGRESSION
We utilise Bayesian logistic regression to model the real-world binary classification datasets in Table 1. The log-likelihood l(D|w) function for logistic regression is given as:

l(D|w) = Σ_{i=1}^{N} [ y_i ln p_i + (1 − y_i) ln(1 − p_i) ],    p_i = 1 / (1 + exp(−wᵀ x_i)),

where D is the data and N is the number of observations. The log of the unnormalised target posterior distribution is given as:

ln p(w|D) = l(D|w) + ln p(w|α),

where ln p(w|α) is the log of the prior distribution on the parameters given the hyperparameters α. The parameters w are modelled as having Gaussian prior distributions with zero mean and standard deviation α = 10.
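The log-posterior above can be implemented directly (a minimal NumPy sketch; `np.logaddexp(0, z)` gives a numerically stable log(1 + e^z), and the identity y·z − log(1 + e^z) = y·log p + (1 − y)·log(1 − p) avoids computing the probabilities explicitly):

```python
import numpy as np

def logistic_log_posterior(w, X, y, alpha=10.0):
    """Unnormalised log-posterior for Bayesian logistic regression with
    a zero-mean Gaussian prior of standard deviation alpha on each weight."""
    z = X @ w
    # y*z - log(1 + e^z) equals y*log(p) + (1 - y)*log(1 - p)
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum(w ** 2) / alpha ** 2
    return log_lik + log_prior
```

At w = 0 every p_i is 1/2, so the log-likelihood reduces to −N ln 2, which is a convenient sanity check.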

D. BAYESIAN NEURAL NETWORKS
Artificial neural networks are learning machines that have been extensively employed as universal approximators of complex systems with great success [13,3]. This work focuses on Multilayer Perceptrons (MLPs) with one hidden layer, an example of which is shown in Figure 1. In this paper, MLPs are used to model the real-world datasets outlined in Table 2. These datasets are regression datasets, with the negative log-likelihood being the sum of squared errors. It was shown in [35] that MLPs can be utilised to approximate any arbitrary function if the MLP has enough hidden units [36]. The outputs of a network with a single output as depicted in Figure 1 are defined as:

o_k = Σ_j v_{jk} Ψ( Σ_i w_{ij} x_i ),

where w_ij is the weight connection from the i-th input to the j-th hidden unit and v_jk is the weight connection from the j-th hidden unit to the k-th output. Note that in this work k = 1. The activation function Ψ provides the non-linearity required to approximate complex non-linear relationships between inputs and outputs. The sigmoid and ReLU activation functions are examples of common activation functions used in practice [37,13].
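A single-hidden-layer forward pass matching this description can be sketched as follows (a minimal NumPy illustration; the bias terms b and c are a common addition and an assumption on our part, as is the tanh default):

```python
import numpy as np

def mlp_forward(x, W, b, v, c, act=np.tanh):
    """One-hidden-layer MLP: W[i, j] connects input i to hidden unit j,
    v[j] connects hidden unit j to the single output (k = 1)."""
    h = act(x @ W + b)       # hidden-unit activations
    return h @ v + c         # linear output for regression

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # 4 inputs
W = rng.standard_normal((4, 8)) * 0.1     # 8 hidden units
b = np.zeros(8)
v = rng.standard_normal(8) * 0.1
c = 0.0
out = mlp_forward(x, W, b, v, c)
```

In the Bayesian setting, W, b, v and c are stacked into the parameter vector w over which the samplers operate.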
The Bayesian framework provides a principled approach for the inference of neural network model parameters. The Bayesian framework is strongly tied to Bayes' theorem. Employing Bayes' theorem for a neural network model with architecture H, weights w and training dataset D, we have [38,39,13]:

P(w|D, H) = P(D|w, H) P(w|H) / P(D|H),

where P(w|D, H) is the target posterior probability of the weights given the data and model architecture, P(D|w, H) is the likelihood of the data given the model and P(w|H) is the prior probability of the weights. P(D|H) is the probability of the data given the model [13,38].

V. EXPERIMENTAL SETUP
In this section we outline the experimental setup. We present the settings used for the experiments, the performance metrics employed in the analysis, and the sensitivity analysis of the sampling results to the user-chosen value of ρ.

A. EXPERIMENT SETTINGS
In the experiments, we assess the performance of PS2HMC when compared to S2HMC and Shadow Hamiltonian Monte Carlo with partial momentum refreshment. The MCMC methods are compared using the effective sample size, effective sample size per second and acceptance rate metrics. We set the momentum refreshment parameter ρ to 0.9 across all the targets. We further assess the effect of ρ on the sampling results in Section V-C. We set the trajectory length for the three MCMC methods considered in this work to 100 across all the target densities. A step size of 0.1 was used for the Banana distribution, 0.02 for the Bayesian logistic regression datasets and 0.005 for the Bayesian neural network datasets, while step sizes of 0.1, 0.07 and 0.05 were used for each value of D in that order for the multivariate Gaussian distributions.
A total of ten independent chains were run for each MCMC algorithm across all the target posterior distributions. For the Bayesian neural network targets, 5 000 samples were generated after 2 500 samples of burn-in. For the other targets, we generated 3 000 samples for each target, with the first 1 000 being discarded as burn-in. These settings were sufficient for the considered MCMC methods to converge across all the posteriors. All experiments were conducted on a 64-bit CPU using PyTorch.

B. EFFECTIVE SAMPLE SIZE
This work employs the multivariate effective sample size (ESS) metric developed by Vats et al. [40] instead of the minimum univariate ESS metric typically used in analysing MCMC results. The minimum univariate ESS measure is not able to capture the correlations between the different parameter dimensions, while the multivariate ESS metric is able to incorporate this information [3,40,10,4]. The minimum univariate ESS calculation results in the estimate of the ESS being dominated by the parameter dimensions that mix the slowest, ignoring all other dimensions [40,3]. The multivariate ESS is calculated as:

mESS = N ( |Λ| / |Σ| )^{1/D},

where N is the number of generated samples, D is the number of parameters, Λ is the sample covariance matrix, Σ is the estimate of the asymptotic covariance matrix in the Markov chain central limit theorem, and |·| denotes the determinant. When D = 1, the multivariate ESS is equivalent to the univariate ESS measure.

We now address the effective sample size calculation for Markov chains that have been re-weighted via importance sampling, as is the case for the MCMC algorithms considered in this paper [26,27,3,13]. For N samples re-weighted by importance sampling weights b_m, the common approach is to use the approximation of [41] given by:

ESS_IS = ( Σ_{m=1}^{N} b_m )² / Σ_{m=1}^{N} b_m².

This accounts for the possible non-uniformity in the importance sampling weights. In order to account for both the effects of sample auto-correlation and re-weighting via importance sampling, we approximate the effective sample size under importance sampling by:

ESS = mESS × ESS_IS / N.
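The two ingredients, the multivariate ESS of Vats et al. and the weight-based correction of [41], can be sketched as follows (a minimal NumPy illustration; the batch-means estimator of Σ and the multiplicative combination are standard choices, though the paper's exact implementation may differ):

```python
import numpy as np

def multivariate_ess(chain):
    """Multivariate ESS of Vats et al.: N * (|Lambda| / |Sigma|)^(1/D),
    with Sigma estimated by non-overlapping batch means."""
    N, D = chain.shape
    Lam = np.cov(chain.T)                      # sample covariance
    b = int(np.sqrt(N))                        # batch size
    a = N // b                                 # number of batches
    means = chain[: a * b].reshape(a, b, D).mean(axis=1)
    Sig = b * np.cov(means.T)                  # asymptotic covariance estimate
    return N * (np.linalg.det(Lam) / np.linalg.det(Sig)) ** (1.0 / D)

def weight_ess(bm):
    """ESS correction for importance weights: (sum b)^2 / sum b^2."""
    bm = np.asarray(bm)
    return bm.sum() ** 2 / np.sum(bm ** 2)

# Combined approximation: mESS scaled by the weight-based ESS fraction.
rng = np.random.default_rng(0)
chain = rng.standard_normal((4096, 3))         # i.i.d. chain -> ESS close to N
ess = multivariate_ess(chain) * weight_ess(np.ones(4096)) / 4096
```

For an i.i.d. chain with uniform weights, the combined estimate is close to the raw sample count N, while auto-correlation or uneven weights shrink it.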

C. IMPACT OF PARTIAL MOMENTUM REFRESHMENT
In this section, we assess the impact of the momentum refreshment parameter ρ on the sampling results. We ran a total of ten independent chains of SHMC and PS2HMC on the ten-dimensional multivariate Gaussian distribution for ρ ∈ {0.1, 0.3, 0.5, 0.6, 0.7, 0.9}. Each chain utilised the same simulation parameters as outlined in Section V. The results are displayed in Figure 2. The results show that SHMC and PS2HMC produce the same acceptance rates regardless of the value of ρ. PS2HMC outperforms SHMC across all the considered performance metrics. The MCMC methods show a general trend of increasing ESS and normalised ESS with increasing ρ. The optimal ρ depends on the target distribution and requires manual tuning by the user. The tuning of this parameter is still an open research problem [21]. We plan to address the automatic tuning of this parameter in future work. As a guideline, higher values of ρ seem to be correlated with higher effective sample sizes. These results were also observed by Mongwe et al. [21] in the context of Magnetic Hamiltonian Monte Carlo.

VI. RESULTS AND DISCUSSION
We now present and discuss the results of the experiments outlined in Section V. The performance of the MCMC methods across the various metrics is presented in Figure 3 and Tables 3 to 6. Note that the results in Figure 3 are presented as follows: the plots on the first row for each target distribution show the effective sample size, while the plots on the second row show the effective sample size normalised by execution time t. The displayed results are for a total of ten independent runs of each MCMC algorithm.
The execution time t in Figure 3 and Tables 3 to 6 is in seconds. The results in Tables 3 to 6 are the mean results over the ten runs for each MCMC algorithm. Note that we use the mean values over the ten runs in Tables 3 to 6 to form our conclusions about the performance of the MCMC algorithms.
The results in Figure 3 and Tables 3 to 6 show that the proposed PS2HMC method outperforms the other MCMC methods considered in this work on an effective sample size basis across all the target posteriors considered. Furthermore, it outperforms all the other methods on a normalised effective sample size basis, or produces similar results to the other methods.
The proposed method has a higher computational burden when compared to the other MCMC methods due to the extra evaluations of the shadow Hamiltonian. This affects the normalised effective sample size performance, although the method still outperforms or matches the other MCMC methods on this metric. We intend to address this shortcoming in future work by utilising surrogate methods such as Gaussian processes, learning the shadow Hamiltonian during the burn-in phase [42]. These empirical results show the significant benefit that can be derived from utilising partial momentum refreshment in S2HMC. However, the full benefits of employing partial momentum refreshment in S2HMC are only obtained when the momentum refreshment parameter ρ has been optimally tuned.

A limitation of the proposed method is the larger execution time when compared to S2HMC. The larger execution time is due to the evaluation of the shadow Hamiltonian when performing the momentum refreshment. We plan to address this in future work by utilising surrogate approaches for the shadow Hamiltonian, which should reduce the execution time without a large reduction in the performance of the algorithm. Another limitation of the new algorithm is the need for the user to manually specify the momentum refreshment parameter; the user would be required to perform trial-and-error runs to select an appropriate value. The results in this work suggest that larger values of the momentum refreshment parameter tend to improve the effective sample size. However, a more theoretically robust approach to the selection of the parameter is required.
This work can be improved upon by investigating an automated approach to optimally tune the momentum refreshment parameter, thus removing the need for the user to perform trial and error. As future work, we plan to compare the performance of the proposed algorithm against manifold-based shadow Hamiltonian methods on larger datasets. Furthermore, we plan on utilising surrogate model approaches to reduce the execution time of the proposed method.