Bilby-MCMC: An MCMC sampler for gravitational-wave inference

We introduce Bilby-MCMC, a Markov-Chain Monte-Carlo sampling algorithm tuned for the analysis of gravitational waves from merging compact objects. Bilby-MCMC provides a parallel-tempered ensemble Metropolis-Hastings sampler with access to a block-updating proposal library including problem-specific and machine learning proposals. We demonstrate that learning proposals can produce over a 10-fold improvement in efficiency by reducing the autocorrelation time. Using a variety of standard and problem-specific tests, we validate the ability of the Bilby-MCMC sampler to produce independent posterior samples and estimate the Bayesian evidence. Compared to the widely-used dynesty nested sampling algorithm, Bilby-MCMC is less efficient in producing independent posterior samples and less accurate in its estimation of the evidence. However, we find that posterior samples drawn from the Bilby-MCMC sampler are more robust, never failing to pass our validation tests. Meanwhile, the dynesty sampler fails the difficult-to-sample Rosenbrock likelihood test, over-constraining the posterior. For CBC problems, this highlights the importance of cross-sampler comparisons to ensure results are robust to sampling error. Finally, Bilby-MCMC can be embarrassingly and asynchronously parallelised, making it highly suitable for reducing the analysis wall-time using a High Throughput Computing environment. Bilby-MCMC may be a useful tool for the rapid and robust analysis of gravitational-wave signals during the advanced detector era and we expect it to have utility throughout astrophysics.


INTRODUCTION
Gravitational-wave astronomy has enabled the first measurements of masses of merging binary black holes (Abbott et al. 2016), new constraints on the equation of state of nuclear matter (Abbott et al. 2018), and offers a new opportunity to measure the expansion rate of the Universe (Abbott et al. 2017) and break the existing measurement tension (Feeney et al. 2019). In the coming years, the Advanced-LIGO (Aasi et al. 2015), Virgo (Acernese et al. 2015), and KAGRA (Aso et al. 2013) detectors are expected to start a fourth observing run which will see the rate of observed signals from binary mergers increase by up to an order of magnitude. This brings with it challenges in data analysis: we need software that is rapid, reliable, and can take advantage of available large-scale computing.
Many of the science goals of gravitational-wave astronomy rely on the ability to robustly draw samples from the posterior distribution p(θ|d), where θ are the model parameters and d is the data, and estimate the Bayesian evidence Z. (For an introduction to Bayesian inference for gravitational-wave astronomy, see, e.g., Thrane & Talbot (2019).) For Compact Binary Coalescence (CBC) signals, the posterior distribution is highly non-Gaussian with complicated correlations. Because of this complicated structure, stochastic sampling is one of the only viable processes which can robustly generate these quantities for the full complexity of the model (though see Green et al. (2020); Gabbard et al. (2019) for machine-learning based approaches and Pankow et al. (2015); Lange et al. (2018) for iterative-fitting based approaches).
★ Contact e-mail: gregory.ashton@ligo.org
Stochastic sampling algorithms to analyse CBC signals have predominantly applied either a Markov-Chain Monte-Carlo (MCMC) approach (Hastings 1970; Metropolis et al. 1953), as introduced by Christensen & Meyer (1998), or Nested Sampling (Skilling 2006), as introduced by Veitch & Vecchio (2008). The dominant software used since the first observing run (O1) has been LALInference (Veitch et al. 2015), which provides three independent stochastic samplers: two Nested Sampling algorithms, LALInference-Nest and BAMBI (Graff et al. 2012), and a Metropolis-Hastings MCMC (LALInference-MCMC) algorithm. Veitch et al. (2015) provided a variety of standard analytical test cases to demonstrate the validity of each of the samplers. But, in the absence of analytical posterior distributions for CBC signals, cross-sampler comparisons, especially between different sampling algorithms, are an important check that results are robust. That LALInference offered multiple samplers was critical to its success.
The LALInference software has been widely used, well-tested, and has become a benchmark for other packages developed since. In the intervening years, a number of high-quality general-use stochastic sampling software packages have become available, and modern gravitational-wave data-analysis software has been developed which provides an interface to use these off-the-shelf samplers for gravitational-wave astronomy. For example, both Bilby (Ashton et al. 2019) and PyCBC Inference (Biwer et al. 2019) utilise the dynesty (Speagle 2020) Nested Sampling algorithm and the ptemcee (Vousden et al. 2016) MCMC algorithm (amongst others). In Romero-Shaw et al. (2020), a detailed cross-sampler comparison was performed between the Bilby implementation of the dynesty sampler and LALInference, and the two were found to agree to within statistical uncertainties. However, such samplers rarely work out of the box and often need some customisation and validation to handle CBC signals. For example, a study by Kulkarni & Capano (2020) compared two off-the-shelf samplers and found one of them unable to produce unbiased results for binary black hole systems.
The Bilby package provides a modular interface to several stochastic samplers and the ability to implement arbitrary likelihoods and priors. This flexibility has made Bilby a popular choice across astrophysics. However, our testing with Bilby has revealed that the implemented MCMC algorithms do not produce results which match those of either the dynesty or LALInference packages for CBC-like use cases. This has a significant future impact: the ability to cross-check between samplers remains a critical test for robustness. In addition, MCMC-based methods are nearly embarrassingly parallelisable, making them ideal for use in a High Throughput Computing (HTC) environment.¹ With these two motivations, we develop an MCMC-based algorithm implemented within Bilby as of version 1.1.3. We began by looking at off-the-shelf options. These are preferable as they are well-tested and often actively maintained and improved by the open-source community. An obvious choice which has demonstrated performance for CBC inference (Biwer et al. 2019) is the ptemcee MCMC sampler, an adaptive parallel-tempered version of the emcee (Foreman-Mackey et al. 2013) algorithm. (The multimodal posterior distributions inherent to CBC problems necessitate the use of the parallel-tempering approach (Gilks et al. 1998; Earl & Deem 2005).) Both of these algorithms use ensemble sampling, in which n_ens MCMC chains evolve in tandem and new points are proposed based on the position of other chains in the ensemble (cf. Section 2.7). However, in testing we found ptemcee to be inefficient compared to the dynesty sampler (by up to a factor of 100). In comparison, the LALInference-MCMC sampler has a demonstrated efficiency similar to that of the LALInference-Nest algorithm (Veitch et al. 2015). Unlike the off-the-shelf samplers, LALInference-MCMC utilises a proposal distribution (cf. Section 2) which takes advantage of knowledge about the problem at hand (in our case, CBC signals). This suggests the need for a parallel-tempered MCMC sampler with access to problem-specific proposals.
In this work, we develop Bilby-MCMC, a from-scratch MCMC algorithm with adaptive parallel tempering and ensemble sampling. We develop Bilby-MCMC to take advantage of a wide variety of proposal distributions, including standard, problem-specific, and machine-learning based proposals. We validate the Bilby-MCMC sampler against the dynesty nested sampler (as described in Romero-Shaw et al. (2020)) for its use in both standard validation problems and CBC inference. In this paper, we discuss the tuning and validation of Bilby-MCMC for efficient inference of CBC signals. However, as part of the Bilby package, Bilby-MCMC can be used as a sampler for any inference problem, and includes an interface to define custom problem-specific proposal distributions.
We introduce the Bilby-MCMC sampler in Section 2, then discuss its validation on a set of standardised tests in Section 3. In Section 4, we apply a set of CBC-specific validation tests and describe the performance before concluding in Section 5.

¹ HTC environments differ from High Performance Computing in that the interconnect speeds between nodes are slow. This makes HTC environments sub-optimal for algorithms which require regular inter-node communication (e.g., the massively parallel methods explored in Smith et al. (2020)). As discussed later in Section 2.9, multiple independent MCMC runs can be performed which continuously produce independent samples, making them ideal for a HTC environment.

BILBY-MCMC
MCMC algorithms generate correlated samples from the target distribution, in our case the posterior distribution, by a sequential stepping process. We now describe the details of the algorithm relevant to the Bilby-MCMC implementation; for a more thorough introduction to MCMC algorithms in astrophysics, we refer the reader to recent reviews (Sharma 2017; Hogg & Foreman-Mackey 2018).
We apply the Metropolis-Hastings algorithm (Hastings 1970; Metropolis et al. 1953) to draw samples from the target density

p(θ|d) ∝ L(d|θ) π(θ) ,   (1)

where L(d|θ) is the likelihood of the data d given the model parameters θ and π(θ) is the prior probability of the model parameters. Throughout, we assume a fixed model M, though formally we note that both the likelihood and prior are model-dependent [i.e. L(d|θ) is more completely written as L(d|θ, M), and similarly for the prior].
Given a current sample θ, a proposed sample θ′ is generated from a proposal distribution Q(θ′|θ). (We discuss proposal distributions in Section 2.1.) We accept the proposed sample θ′ with probability

α = min{1, [L(d|θ′) π(θ′) Q(θ|θ′)] / [L(d|θ) π(θ) Q(θ′|θ)]} ,   (2)

and append θ′ to a chain of samples. If the proposal is rejected, the current sample θ is appended to the chain. In practice, we implement the Metropolis-Hastings step by drawing a random number u from a uniform distribution on the unit interval: if u < α the proposal is accepted, otherwise it is rejected. We initialize the chain with a random draw from the prior distribution π(θ) and iterate the Metropolis-Hastings algorithm to generate a chain of n samples {θ_j} where j ∈ [0, n). Samples in the chain are generally correlated. Independent samples can be obtained from {θ_j} by selecting a subset of n/τ samples, where τ is the autocorrelation time (ACT) of the chain; we select the subset by taking a sample every τ steps. We iterate the algorithm until reaching the stopping criterion

n ≥ n_burn + f τ n_samples ,   (3)

where n_burn is the number of samples discarded to remove the influence of the chain initialization, known as the burn-in period, and f ≤ 1 is a thinning factor (in the python interface, f is thin_by_nact).
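The stepping process above can be sketched in a few lines. This is the textbook Metropolis-Hastings update with a symmetric Gaussian proposal (so H = 1) on a toy standard-normal target, not the Bilby-MCMC implementation; `log_target` and all tuning values are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Toy target: ln[L(d|theta) * pi(theta)] for a 2-D standard normal.
    return -0.5 * np.sum(theta**2)

def mh_step(theta, log_p_current):
    """One Metropolis-Hastings step with a symmetric Gaussian proposal."""
    theta_prime = theta + rng.standard_normal(theta.shape)
    log_p_prime = log_target(theta_prime)
    # Accept with probability min(1, p(theta')/p(theta)), via u ~ U(0, 1).
    if np.log(rng.uniform()) < log_p_prime - log_p_current:
        return theta_prime, log_p_prime  # accepted: append proposed sample
    return theta, log_p_current  # rejected: append current sample again

theta = np.zeros(2)
log_p = log_target(theta)
chain = []
for _ in range(5000):
    theta, log_p = mh_step(theta, log_p)
    chain.append(theta)
chain = np.asarray(chain)
```

The resulting `chain` is correlated sample-to-sample, which is why the thinning and ACT machinery described above is needed before the draws can be treated as independent.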
To estimate τ, we use the autocorrelation module provided by the emcee v3.0 package (Foreman-Mackey et al. 2013). This method improves on the traditional approach by adding an automated function to choose the window (cf. Hogg & Foreman-Mackey (2018) along with the software documentation https://emcee.readthedocs.io/en/stable/tutorials/autocorr/).
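The windowed estimator can be written in a few lines of numpy. The following is a minimal sketch of the automated-windowing (Sokal criterion) approach used by emcee, not the library code itself; the AR(1) test chain, its coefficient, and the chain length are illustrative choices.

```python
import numpy as np

def autocorr_time(x, c=5.0):
    """Integrated autocorrelation time with automated windowing: a sketch."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocorrelation function via FFT (zero-padded to avoid circular wrap).
    f = np.fft.fft(x, n=2 * n)
    acf = np.fft.ifft(f * np.conjugate(f))[:n].real
    acf /= acf[0]
    # Running estimate tau(M) = 1 + 2 * sum_{k=1}^{M} rho(k).
    taus = 2.0 * np.cumsum(acf) - 1.0
    # Automated window: the smallest M satisfying M >= c * tau(M).
    m = np.arange(n) < c * taus
    window = int(np.argmin(m)) if not m.all() else n - 1
    return taus[window]

# AR(1) test chain with known ACT (1 + a) / (1 - a) = 19 for a = 0.9.
rng = np.random.default_rng(42)
a, n_steps = 0.9, 100_000
noise = rng.standard_normal(n_steps)
chain = np.empty(n_steps)
chain[0] = noise[0]
for i in range(1, n_steps):
    chain[i] = a * chain[i - 1] + noise[i]
tau = autocorr_time(chain)
```

For the AR(1) chain the estimate should land close to the analytic value of 19, up to the Monte Carlo noise of a finite chain.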
We provide a number of automated approaches to estimate the burn-in period n_burn. The primary method is a simple scaling: we discard ν_burn autocorrelation times, i.e. n_burn = ν_burn τ. By default, we use ν_burn = 10, but this scaling factor can be varied by the user through the burn_in_nact option. In addition, proposal methods which violate the assumptions of the MCMC algorithm (e.g., those using dynamic tuning to improve convergence) set minimum values for n_burn, and the user can also specify n_burn directly if these automated approaches fail.
To produce independent samples (matching the behaviour of the LALInference-MCMC sampler), we can set the thinning factor f = 1. However, thinning is inefficient in the sense that unthinned samples (f < 1) are unbiased and provide greater precision for summary statistics (Link & Eaton 2012). In cases where f < 1, we differentiate between the number of samples n_samples, Eq. (3), and the effective number of samples n_eff = f n_samples. For the standard validation tests in this work, we use f = 1. For the CBC validation tests, we use f = 1/5 and n_samples = 25000. This ensures a minimum of 5000 independent samples while smoothing posterior plots and providing more accurate summary statistics.
Having introduced the simple Metropolis-Hastings algorithm, we now turn to the specifics of the Bilby-MCMC implementation which enable it to efficiently perform CBC parameter estimation. In Section 2.1, we define the standard library of proposal distributions, then introduce the learning proposals in Section 2.2 and the gravitational-wave specific proposals in Section 2.3. We note that Bilby-MCMC provides a flexible interface to define and use new proposals; as such, this list is not an exhaustive set of all available proposals. In Section 2.4, we describe how the proposals are used together in a block-updating sampling approach. In Section 2.5, we describe the extension to a parallel-tempered sampler required to analyse multimodal distributions, then describe how ensemble sampling is implemented in Section 2.7. In Section 2.8, we provide a model for the efficiency of the sampler, then introduce a timing model and discuss computational parallelisation in Section 2.9.

Standard proposal distributions
Broadly speaking, the performance of an MCMC sampler is determined by the ACT of the chains it produces. Chains with smaller ACTs take fewer steps to traverse the target distribution and hence produce more independent samples for a fixed number of MCMC steps (or, equivalently, computational cost). The ACT itself depends on how efficient the proposal distribution Q(θ′|θ) is in proposing points which enable the chains to traverse the posterior.
For the Metropolis-Hastings algorithm, there are two ways to optimize a stochastic sampler to reduce the ACT. First, we can choose a parameterisation which reduces the complexity of the parameter space. If, under a transformation f, the posterior distribution has a simpler form (e.g., if f maps a banana-like distribution to a multivariate Gaussian, or softens hard edges), then sampling in f(θ) rather than θ is the most straightforward approach to improving the algorithm performance (Hogg & Foreman-Mackey 2018). In Section 4.1, we discuss the best known parameterisation for CBC signals. Second, once the best known parameterisation is chosen, we optimize the choice of proposal distributions.
In this section, we introduce the standard library of proposals implemented in Bilby-MCMC and discuss their performance and utility. For each proposal, we also provide the Hastings factor which ensures detailed balance is met (Hastings 1970) and enables unbiased sampling using asymmetric proposals.

FG: Fixed Gaussian
The Fixed Gaussian proposal implemented in Bilby-MCMC is a generalisation of the zero-mean multivariate Gaussian proposal (Gelman et al. 1996) in which a proposal for the i-th parameter is generated from

θ′_i = θ_i + σ_i Δ_i ε ,   (5)

where σ_i is a user-defined scale parameter, Δ_i is the width of the prior support for θ_i (if the prior has infinite support, we set Δ_i = 1), and ε is a draw from a standard normal distribution. The introduction of the scaling by the prior support enables some automatic tuning to the anticipated scale of the problem, while σ_i enables the user to define varying length scales for each parameter. Of note, our implementation does not allow the user to change the spatial orientation of the proposal (i.e. through correlations between parameters). For the Fixed Gaussian proposal, which is symmetric, H = 1.
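A minimal sketch of this axis-aligned proposal (the scale, prior widths, and number of draws below are illustrative values, not Bilby-MCMC defaults):

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_gaussian_proposal(theta, sigma, prior_widths):
    """theta'_i = theta_i + sigma_i * Delta_i * epsilon, epsilon ~ N(0, 1).

    Axis-aligned: each parameter is perturbed independently, so no
    correlations between parameters can be proposed."""
    epsilon = rng.standard_normal(len(theta))
    return theta + sigma * prior_widths * epsilon

# Illustrative usage: 3 parameters, prior width 2, scale sigma = 0.1,
# so proposed jumps have standard deviation 0.1 * 2 = 0.2 per parameter.
theta = np.zeros(3)
prior_widths = np.array([2.0, 2.0, 2.0])
draws = np.array(
    [fixed_gaussian_proposal(theta, 0.1, prior_widths) for _ in range(10000)]
)
```

Because the proposal is symmetric in θ and θ′, its Hastings factor is unity and it drops out of the acceptance ratio.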
In practice, the Fixed Gaussian proposals have limited use and require manual tuning (through the choice of σ_i) to achieve meaningful performance on realistic problems. As such, we do not enable this proposal by default.

AG: Adaptive Gaussian
To circumvent the tuning requirements of a Fixed Gaussian proposal, Haario et al. (2001) introduced the notion of an adaptive proposal, which uses the past performance of the chain to drive the sampler towards a target acceptance rate.
Such adaptive proposals are non-Markovian and may lead to the generated samples not being representative of the posterior. However, as discussed in Haario et al. (2001) and Veitch et al. (2015) (in the context of CBC signals), if the adaptation rate decays throughout the run, or the adaptation is halted sufficiently early in the run, the equilibrium distribution may be sufficiently close to the posterior. We verify that, to within the statistical uncertainties relevant for typical CBC problems, this is true for our Adaptive Gaussian proposal.
To dynamically adapt the proposal, we use the acceptance ratio, the fraction of proposed points which are accepted, to quantify the current performance (Gelman et al. 1996). If the acceptance ratio is close to 1, proposals are accepted too often: this suggests slow exploration of the posterior. If it is much less than 1, proposals are infrequently accepted: the proposed points tend to jump away from areas of high posterior support. In both cases, this leads to large autocorrelation times. For well-tuned proposals (which reduce the ACT relative to poorly-tuned proposals) and under idealised settings, Roberts et al. (1997) demonstrated that an acceptance ratio of ∼0.234 is optimal. We set this as a target acceptance rate and dynamically adapt the proposal to achieve it.
Our implementation of the Adaptive Gaussian proposal extends Eq. (5) by adding a factor of ρ, an adapting scale parameter. Initially, ρ = 1; then, on each iteration at which the proposal is applied, we update ρ following Veitch et al. (2015): increasing it if the previous proposed point was accepted and decreasing it if the previous proposed point was not accepted. Here the target acceptance rate is α* = 0.234, and the size of each update is set by an adaptation decay rate which decreases with n, the number of points proposed; adaptation stops after n₀ steps, a user-specified number with default value n₀ = 10⁵. We also set a minimum value for the scale ρ. For the Adaptive Gaussian proposal, which is symmetric, H = 1.

DE: Differential Evolution
We implement the Differential Evolution proposal (Ter Braak 2006; ter Braak & Vrugt 2008) as described in Veitch et al. (2015). Two samples θ_a and θ_b are drawn at random from the chain. Then

θ′ = θ + γ (θ_a − θ_b) ,

where γ is chosen randomly from either γ = 1 or γ ∼ N(0, 2.38/√(2 n_dim)). When γ = 1, this acts as a mode-hopping proposal, improving the performance of the sampler on multi-modal problems. When γ is drawn from the normal distribution [as proposed by Ter Braak (2006); Roberts & Rosenthal (2001)], the proposed points lie along the line passing through θ_a and θ_b. As such, the proposal is well suited to posterior distributions with linear correlations, in which the line passing through θ_a and θ_b lies along the principal axes. As with the Adaptive Gaussian proposal, this proposal formally makes the chain non-Markovian. Later, in Section 3, we verify that the equilibrium distribution is statistically identical to the posterior, i.e. the posterior is unbiased. Like the Gaussian proposals, the Differential Evolution proposal is symmetric, such that H = 1.
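A minimal sketch of a differential-evolution jump of this form; the probability of choosing the mode-hopping branch (`mode_hop_prob`) and the toy past chain are illustrative assumptions, not the Bilby-MCMC settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def differential_evolution_proposal(theta, chain, mode_hop_prob=0.5):
    """theta' = theta + gamma * (theta_a - theta_b) with theta_a, theta_b
    drawn from the past chain; gamma = 1 gives a mode-hopping jump,
    otherwise gamma ~ N(0, 2.38 / sqrt(2 * n_dim))."""
    n_dim = len(theta)
    a, b = rng.choice(len(chain), size=2, replace=False)
    if rng.uniform() < mode_hop_prob:
        gamma = 1.0  # jump by the full difference vector: mode hopping
    else:
        gamma = rng.normal(0.0, 2.38 / np.sqrt(2 * n_dim))
    return theta + gamma * (chain[a] - chain[b])

# Illustrative usage with a toy 2-D "past chain".
past_chain = rng.standard_normal((100, 2))
theta_prime = differential_evolution_proposal(np.zeros(2), past_chain)
```

Because the difference vector θ_a − θ_b inherits the correlations present in the past chain, the proposal automatically aligns with linearly correlated posteriors.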

PR: Prior Proposal
The Prior proposal draws θ′ from the prior π(θ). For well-measured parameters, in which the posterior is much narrower than the prior, this proposal is highly inefficient. However, we find it to be effective when used as part of a block-updating set of proposals applied to poorly measured parameters (e.g., the spin and tidal parameters of the secondary, lower-mass object in a CBC inference problem). It also aids mode-mixing in high-temperature chains (see Section 2.5). For the Prior proposal, the Hastings factor is H = π(θ)/π(θ′).

UN: Uniform Proposal
A simplification of the Prior proposal, this proposal draws points uniformly within the prior bounds. We utilise it as a robust and simpler variant of the Prior proposal with similar performance. For the Uniform proposal, which is symmetric, the Hastings factor is H = 1.

Machine learning proposal distributions
In Bilby-MCMC, we introduce a class of learning proposals which, as we show in Section 3, dramatically decrease the ACT while producing statistically identical posterior distributions. Learning proposals use random draws from the past MCMC chain to learn the distribution and then generate new samples. All learning proposals fall back to an Adaptive Gaussian proposal during an initialization stage (during which the MCMC chain has not yet explored the distribution). Once the initialization stage is complete, they draw samples from the MCMC chain, use those samples to fit the proposal distribution, and then use this distribution to propose new points. Periodically, the proposal distribution is refitted using new samples from the MCMC chain to guard against locking in a prematurely learnt distribution. As with the Adaptive Gaussian proposal, the use of the past chain again breaks the Markovian property of the chain, but we verify in Section 3, using validation tests, that the resulting posterior remains unbiased.

KD: Gaussian Kernel Density Estimate
We fit a Gaussian Kernel Density Estimate (KDE) (Rosenblatt 1956; Parzen 1962) to a random draw of samples from the MCMC chain. We find this non-parametric multivariate density estimate to be both rapid to learn (typical fitting times are fractions of a second) and flexible enough to fit complicated features. KDE methods have previously been used in the context of CBC inference by the kombine (Farr & Farr 2015) ensemble sampler.
When used to estimate a probability density from a set of samples, KDE methods suffer a subtle dependence on a tuneable "bandwidth" parameter and typically over-smooth hard edges and multi-modal distributions. However, when used as a learning proposal density, these issues only result in a loss of efficiency and do not bias results. To understand why, consider a parameter with a hard edge (e.g., the lower bound on the spin of a black hole, which cannot be negative). A KDE proposal fitted to a chain will over-smooth the hard edge and propose non-physical points with negative spin. However, the MCMC algorithm will never accept these points. This results in a small loss of efficiency, but no bias.
We utilise the standard implementation of the Gaussian KDE in the scipy (Virtanen et al. 2020) package with the bandwidth estimated using "Scott's rule" (Scott 2015). Once the KDE K(θ) is fitted, proposal samples can be drawn directly and the Hastings factor calculated as H = K(θ)/K(θ′). We find that fitting the KDE takes fractions of a second, while the proposal time is negligible compared to typical CBC likelihood evaluation times.
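The fit-then-propose pattern can be sketched with scipy (Scott's rule is the `gaussian_kde` default); the correlated toy "past chain" and the current point below are illustrative, not drawn from a real analysis.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

# Toy "past chain": 2000 correlated 2-D samples; scipy expects (n_dim, n_samples).
past_chain = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000
)
kde = gaussian_kde(past_chain.T)  # Scott's-rule bandwidth by default

# Propose independently of the current point; the Hastings factor
# H = K(theta) / K(theta') restores detailed balance.
theta = np.array([0.1, -0.2])
theta_prime = kde.resample(1).flatten()
hastings_factor = kde.pdf(theta) / kde.pdf(theta_prime)
```

Because the proposal density is independent of the current point, a proposed point in a low-density region is penalised through the Hastings factor rather than through the likelihood alone.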

GM: Gaussian Mixture Model
While KDEs smooth a set of samples with a Gaussian centered on each sample, in a Gaussian mixture model (GMM) the density is estimated using a finite number of Gaussian distributions. As with KDE methods, this model is not good at fitting distributions with hard edges. The means and covariance matrices of the Gaussian distributions are chosen using an expectation-maximisation algorithm. We use the scikit-learn (Pedregosa et al. 2012) package to fit the GMM. In this work, we use 10 components in the mixture. Fitting the GMM takes slightly more time than fitting the KDE; however, it is typically <1 s, and sampling from or evaluating the GMM is faster than for the KDE as there are fewer components. As with the KDE proposal, the Hastings factor is calculated from H = G(θ)/G(θ′), where G(θ) is the fitted GMM density.
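The analogous GMM version can be sketched with scikit-learn; the toy chain and current point are illustrative placeholders (10 components matches the choice stated above).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Fit a 10-component GMM to a draw of toy past-chain samples, shape (n, n_dim).
past_chain = rng.standard_normal((2000, 2))
gmm = GaussianMixture(n_components=10, random_state=0).fit(past_chain)

# Propose a new point and compute the log Hastings factor
# ln H = ln G(theta) - ln G(theta') from the fitted density G.
theta = np.array([0.0, 0.0])
theta_prime, _ = gmm.sample(1)
theta_prime = theta_prime[0]
log_H = gmm.score_samples(theta.reshape(1, -1)) - gmm.score_samples(
    theta_prime.reshape(1, -1)
)
```

Working with `score_samples` (log-densities) avoids underflow when the fitted density is very small at either point.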

NF: Normalizing Flows
The normalizing flows class of machine learning algorithms (Papamakarios et al. 2019) learns a bijective map from the target density (the set of training samples drawn from the MCMC sampler) to a latent space, in our case a multivariate Gaussian. Normalizing flows have previously been used in gravitational-wave astronomy to directly sample the CBC posterior distribution (Green et al. 2020; Green & Gair 2021) and as a way to propose new points in a nested sampler (Williams et al. 2021). Following the work of Hoffman et al. (2019) and Moss (2020), we use the nflows package (Durkan et al. 2020), which implements the normalizing flows algorithm in PyTorch (Paszke et al. 2019), to learn the proposal distribution. We periodically optimize the normalizing flow using the Jensen-Shannon divergence test (cf. Appendix A) between samples drawn from the learnt map and a set of independent validation samples. Unlike the KDE and GMM proposals, the normalizing flows proposals can take several tens of seconds to minutes to train. In practice (cf. Section 3), we find the normalizing flows method has a similar performance to the GMM method, but at an increased computational cost. Therefore, we do not utilize it for CBC inference problems. The Hastings factor is again given by the ratio of the normalizing flow density at the initial and proposed points.

Gravitational-wave proposal distributions
We implement the gravitational-wave specific polarization and phase correlation, phase reversal, and phase and polarization reversal proposals as described in Veitch et al. (2015). We find these proposals dramatically improve the sampling for analyses which do not utilize analytic marginalization of the binary phase (cf. Section 4.2). We do not implement the sky reflection, extrinsic-parameter, and Gibbs sampling of distance proposals described in Veitch et al. (2015) and Raymond & Farr (2014). While we expect these to be general improvements to the algorithm, the use of distance marginalization (cf. Section 4.2) and our choice of parameterisation (cf. Section 4.1) diminish their expected utility.

Block sampling
Each of the proposal distributions described in the last three sections can update either all parameters in the set of model parameters θ, or only a subset of those parameters. The Bilby-MCMC sampler is initialised with a list of individual proposals, the subset of θ which they are to update, and their unnormalised weighting. We then use the weighting to create a proposal cycle. At each step of the sampler, the next proposal in the cycle is chosen and a point is proposed and accepted or rejected based on the condition described in Eq. (2). The proposal cycle enables weighted block-updating of proposals and ensures the detailed balance condition is met, as described in Veitch et al. (2015).
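A weighted proposal cycle of this kind can be sketched as follows; the proposal names, weights, and cycle length are illustrative placeholders, not the Bilby-MCMC defaults or the Table 1 weights.

```python
import numpy as np

rng = np.random.default_rng(6)

def build_proposal_cycle(proposals, weights, n_cycle=100):
    """Build a cyclic proposal list: each proposal appears in proportion to
    its unnormalised weight, in a randomised order that is then applied
    deterministically, one proposal per MCMC step."""
    weights = np.asarray(weights, dtype=float)
    counts = np.maximum(1, np.round(n_cycle * weights / weights.sum())).astype(int)
    cycle = [p for p, c in zip(proposals, counts) for _ in range(c)]
    rng.shuffle(cycle)
    return cycle

# Illustrative: Adaptive Gaussian and Differential Evolution weighted twice
# as heavily as the Uniform, KDE, and GMM proposals.
cycle = build_proposal_cycle(["AG", "DE", "UN", "KD", "GM"], [2, 2, 1, 1, 1], n_cycle=70)
# At MCMC step n, the proposal applied is cycle[n % len(cycle)].
```

Fixing the cycle in advance (rather than drawing a proposal at random each step) keeps the per-step proposal choice independent of the chain's state, which is one simple way to preserve detailed balance across the block updates.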
For non-CBC inference problems (i.e., the standard tests considered in Section 3), we default to an equal-weighted set of the Adaptive Gaussian, Differential Evolution, Uniform, KDE, GMM, and Normalizing Flow proposals, though this can be customized by users. For CBC inference problems, we define the proposal cycle described in Table 1. We arrived at this choice by hand tuning: running analyses on simulated signals and identifying opportunities for improvement. As such, we do not anticipate that the proposals selected in Table 1 are optimal, and we expect improvements to be made in the future. Users can modify and extend the proposal cycle using the flexible interface.

Parallel-tempering
The standard Metropolis-Hastings algorithm does not produce estimates of the evidence and fails when attempting to sample from multi-modal distributions. Parallel-tempering (Gilks et al. 1998; Earl & Deem 2005) addresses both of these issues.
As the name suggests, n_temps parallel MCMC chains are run. (In practice, these can be updated sequentially, i.e. stepping each chain in turn, or using the parallelisation techniques described later in Section 2.9; however, the chains must remain pseudo-synchronised to enable swaps between chains.) For the i-th chain, the likelihood in Eq. (2) is modified to

L(d|θ) → L(d|θ)^{1/T_i} ,

where T_i ≥ 1 is the chain "temperature". Note that the ladder of temperatures {T_i} is ordered such that T_{i+1} > T_i. The T_0 = 1 "cold" chain samples from the target posterior distribution, but for "hot" chains with T_i > 1, the likelihood is flattened out and easier to sample.
Periodically, swaps are proposed between adjacent chains and accepted with probability

min{1, [L(d|θ_{i+1}) / L(d|θ_i)]^{1/T_i − 1/T_{i+1}}} ,

where θ_i and θ_{i+1} are the current samples of the chains at temperatures T_i and T_{i+1}. These swaps provide a mechanism for the cold-temperature chain (which generates posterior samples) to explore multi-modal likelihoods. We utilize the dynamic temperature adaptation methods described in Vousden et al. (2016) to optimize the choice of the temperature ladder {T_i}. Samples taken during this optimization period are automatically labelled as part of the burn-in epoch.
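The swap acceptance rule above can be sketched directly in log space (a generic parallel-tempering sketch, not the Bilby-MCMC code; the log-likelihood values used below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(7)

def swap_accept(log_L_i, log_L_j, T_i, T_j):
    """Accept an i <-> j swap between adjacent chains with probability
    min(1, [L_j / L_i]^(1/T_i - 1/T_j)), with T_j > T_i the hotter chain.
    Working with log-likelihoods avoids overflow for large |ln L|."""
    log_alpha = (1.0 / T_i - 1.0 / T_j) * (log_L_j - log_L_i)
    return np.log(rng.uniform()) < log_alpha
```

Note that when the hotter chain holds the higher-likelihood sample (log_L_j > log_L_i), log_alpha > 0 and the swap is always accepted, which is precisely how good points found at high temperature migrate down to the cold chain.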

Evidence calculation
In addition to resolving the problem of sampling multimodal distributions, parallel-tempering also enables an estimate to be made of ln Z, the natural logarithm of the Bayesian evidence. To estimate the evidence in Bilby-MCMC, we implement thermodynamic integration (Goggans & Chi 2004; Lartillot & Philippe 2006), as described in Littenberg & Cornish (2009) and Veitch et al. (2015), and the stepping-stone algorithm (Xie et al. 2010; Maturana-Russel et al. 2019). In testing, we verify the findings of Maturana-Russel et al. (2019): the stepping-stone method is superior, producing more accurate results for the same computational cost. As such, while Bilby-MCMC calculates both estimates, we report only the stepping-stone evidence throughout this work.
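To illustrate how a temperature ladder yields ln Z, the following is a minimal thermodynamic-integration sketch on a toy 1-D problem with a known analytic evidence. The Gaussian prior/likelihood, the number of ladder rungs, and the sample counts are all illustrative assumptions; here the tempered posteriors are Gaussian, so we draw from them directly instead of running MCMC chains.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy problem: prior N(0, 1), likelihood N(theta; 0, sigma^2), for which
# ln Z = ln integral L(theta) pi(theta) dtheta is known in closed form.
sigma = 0.5
ln_Z_true = -0.5 * np.log(2 * np.pi * (sigma**2 + 1.0))

# Inverse-temperature ladder beta = 1/T from the prior (beta = 0) to the
# full posterior (beta = 1).
betas = np.linspace(0.0, 1.0, 32)
mean_log_L = np.empty_like(betas)
for k, beta in enumerate(betas):
    # The tempered posterior ~ L^beta * prior is Gaussian here, so sample it
    # exactly; in a real run these draws come from the chain at temperature T.
    var_beta = 1.0 / (beta / sigma**2 + 1.0)
    samples = rng.normal(0.0, np.sqrt(var_beta), size=200_000)
    log_L = -0.5 * np.log(2 * np.pi * sigma**2) - samples**2 / (2 * sigma**2)
    mean_log_L[k] = log_L.mean()

# Thermodynamic integration: ln Z = integral_0^1 <ln L>_beta d(beta),
# evaluated with the trapezoid rule over the ladder.
ln_Z = np.sum(0.5 * (mean_log_L[1:] + mean_log_L[:-1]) * np.diff(betas))
```

The accuracy of the estimate is limited by the trapezoid discretisation of the ladder, which is one reason a denser ladder (more temperatures) is needed for a refined evidence, as discussed in Section 2.8.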

Ensemble sampling
In recent years, ensemble-sampling algorithms have been highly successful in astrophysics (see, e.g., Foreman-Mackey et al. (2013); Vousden et al. (2016); Farr & Farr (2015)). These algorithms use an ensemble of interacting MCMC samplers: new points are proposed based on the current distribution of the ensemble of points, enabling automatic tuning of the proposals to the target density. That these algorithms self-tune has been paramount to their versatility and use throughout astrophysics.
In Bilby-MCMC, an ensemble of n_ens chains can be utilised, with inter-chain swaps proposed by an ensemble stretch proposal (Goodman & Weare 2010). In comparison to the emcee and ptemcee samplers, Bilby-MCMC is poorly vectorised and does not scale to many hundreds of chains.
If used in conjunction with parallel-tempering, one can either use an ensemble at every temperature (for a total of n_temps × n_ens samplers) or combine the ensemble with a single parallel-tempered chain (for a total of n_temps + n_ens − 1 samplers). The former configuration mirrors how the ptemcee sampler operates, while the latter configuration may be useful, for example, to provide an estimate of the evidence at a reduced computational cost. For thermodynamic integration, each set of parallel-tempered chains is used to calculate an estimate of the evidence, and the results are then averaged between chains. In the validation tests described in Section 3 and Section 4, we do not utilise the ensemble sampler as it was found to provide no practical improvement in efficiency.

Efficiency
Throughout this work, we quantify and compare the posterior sampling efficiency of samplers by the ratio of the number of independent samples to the number of likelihood evaluations:

ε = n_eff / n_likelihood .   (14)

For a simple MCMC sampler, the number of steps is equal to the number of likelihood evaluations. However, in Bilby-MCMC we do not evaluate the likelihood if the proposed sample is outside the prior bounds. Nevertheless, the number of likelihood evaluations is the relevant weighting as it is the dominant computational operation.
We calculate the efficiency directly for the validation tests in Section 3, but here we first derive an efficiency model. For a parallel-tempered ensemble sampler with n_temps × n_ens chains, where n_burn = ν_burn τ samples are discarded from each chain for burn-in (in practice, there are several alternative methods which can determine n_burn, as described in Section 2), the efficiency is

ε = (1 − ν) / (n_temps τ) ,

where

ν = n_ens ν_burn / (n_eff + n_ens ν_burn)

is the burn-in inefficiency, the fraction of 'wasted' samples due to the burn-in process.
When configuring the sampler, care should be taken to ensure ν ≪ 1 to avoid significant wasted computation. For example, when drawing 1000 independent samples using n_ens = 1 and the default ν_burn = 10, the burn-in inefficiency is a reasonable ∼1%. However, if we attempt to use 10 co-evolving ensemble chains (n_ens = 10), the burn-in inefficiency increases to ∼10%. (The same logic applies to configurations which combine independent runs, as discussed in Section 2.9, replacing n_ens with the number of independent runs and n_eff with the number of samples per run.)
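This arithmetic can be checked with a two-line helper, assuming the burn-in inefficiency takes the form ν = n_ens ν_burn / (n_eff + n_ens ν_burn), i.e. each chain wastes ν_burn autocorrelation times; this form is an assumption consistent with the numbers quoted above, not a formula taken from the Bilby-MCMC documentation.

```python
def burn_in_inefficiency(n_eff_samples, n_ens=1, nu_burn=10):
    """Fraction of 'wasted' samples: each of the n_ens chains discards
    nu_burn autocorrelation times of samples during burn-in (an assumed
    form, chosen to reproduce the worked example in the text)."""
    wasted = n_ens * nu_burn
    return wasted / (n_eff_samples + wasted)

nu_single = burn_in_inefficiency(1000, n_ens=1, nu_burn=10)    # ~1%
nu_ensemble = burn_in_inefficiency(1000, n_ens=10, nu_burn=10)  # ~10%
```

The scaling makes the trade-off explicit: every extra co-evolving chain pays its own burn-in cost, so ensembles only pay off if they reduce τ by more than they inflate ν.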
Provided ν ≪ 1, the efficiency is determined by the ACT τ and the number of parallel-tempered chains n_temps. The ACT is a property of the sampling algorithm which can be reduced using the methods discussed in Section 2.1. Naively, reducing n_temps appears to improve the posterior sampling efficiency. However, n_temps > 1 is required to sample from multimodal distributions and calculate the evidence, and can itself reduce τ. In Section 3, we demonstrate this with a specific example, but as a rough rule of thumb, n_temps = 8 is sufficient for the multi-modal posteriors of CBC inference problems and provides a reasonable estimate of the evidence. However, if a refined estimate of the evidence is required, more temperatures are needed, decreasing the posterior sampling efficiency.
We develop a resampling approach to reclaim some of this lost efficiency. For the ith chain with inverse temperature β_i, we define {θ}(β_i) as the set of posterior samples it produces from the tempered posterior distribution. For each sample θ from the tempered posterior distribution, we calculate a weight from the ratio of the cold (β = 1) likelihood to the tempered likelihood. Then, we rejection sample (MacKay 2003) the tempered posterior samples, resulting in a set of posterior samples from the cold posterior distribution. For low-dimensional problems, we find this produces a modest gain in efficiency at no additional cost. As an example, analysing a CBC signal using a non-spinning model and the analytic marginalization of the distance, phase, and time (cf. Section 4.2), rejection sampling the hot chains produces a ∼20% efficiency improvement. However, for fully-precessing CBC problems we find the rejection sampling does not accept any new points (i.e. the efficiency remains unchanged). In Section 3 and Section 4, we do not utilise the rejection sampling method.
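As an illustrative sketch (not the production implementation), the tempered-chain rejection sampling can be written as follows, assuming the tempered posterior at inverse temperature β is proportional to L^β times the prior, so the resampling weight is L^(1−β); the function name and interface are ours.

```python
import numpy as np

def resample_tempered(log_likelihoods, beta, rng):
    """Rejection-sample draws from a tempered posterior (proportional to
    L**beta * prior) down to the cold (beta = 1) posterior.

    The weight of each sample is L**(1 - beta); normalising by the
    maximum weight, each sample is accepted with probability w / max(w).
    Returns a boolean acceptance mask.
    """
    log_w = (1.0 - beta) * np.asarray(log_likelihoods)
    log_w -= log_w.max()  # largest weight -> 1
    return np.log(rng.uniform(size=log_w.size)) < log_w

rng = np.random.default_rng(42)
logl = rng.normal(-10.0, 2.0, size=1000)  # toy tempered-chain log-likelihoods
keep = resample_tempered(logl, beta=0.5, rng=rng)
```

The acceptance fraction falls as the weight distribution broadens, which is consistent with the observation above that high-dimensional, fully-precessing problems accept essentially no new points.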
In this work, we will compare the efficiency of Bilby-MCMC with that of the dynesty sampler using the random-walk proposal method described in Romero-Shaw et al. (2020). This proposal method has a tuning parameter n_act which determines the number of internal MCMC steps to take based on the estimated autocorrelation time. Following Speagle (2020) (in which the dynesty sampler was shown to be more efficient than the emcee sampler), we calculate the efficiency from Eq. (14), with n_eff calculated from the effective sample size as estimated from the nested sampling weights. We note that this assumes that new points proposed during nested sampling are independent; however, this is neither required (Salomone et al. 2018) nor guaranteed in practice. If the points are correlated, n_eff will significantly overestimate the efficiency of the dynesty sampler. To guard against this (and to investigate the potential impact on posterior estimation), for the Rosenbrock and unimodal Gaussian validation tests, we run the dynesty sampler with two different values of n_act and verify that the efficiency approximately scales with n_act. This demonstrates that the n_act value chosen is sufficiently large to generate independent samples and hence that the efficiency of the dynesty sampler is not overestimated.

Timing model and parallelisation
We distinguish two levels of computational parallelisation which can reduce the wall time: combining independent runs and multiprocessing an individual run.
Combining independent runs is embarrassingly parallel: we simply repeat N copies of the analysis using identical data and configuration, but a different random seed. If each analysis produces n_eff independent samples, then in total we end up with N n_eff samples. Such a configuration is ideal for use in a HTC environment and has the added advantage that one can cross-compare between chains. However, this approach is limited by the increase in burn-in inefficiency (cf. Eq. (16)): each independent run has to burn in. Even so, it may be worthwhile to decrease the wall-time for important and time-sensitive results.
Before discussing the use of multiprocessing, we introduce a timing model to understand the wall-time required to produce n_eff independent samples from the posterior of a single serial run. For CBC inference problems, the compute-time is determined by T_ℓ, the time required to evaluate the likelihood,3 and the number of likelihood evaluations required. In a serial-processing model, the total time T can be estimated as T = n_eff T_ℓ / ε, where we have taken a typical efficiency from Section 4.3 for a CBC analysis using n_temps = 8.
If either n_temps > 1 or n_ens > 1, Bilby-MCMC can be trivially parallelised, leveraging the multiple cores typically available in modern processors. We implement this parallelisation using the Python standard-library multiprocessing module. In this model of multiprocessing, there is an overhead cost to transferring the data (i.e. any data products required by the likelihood). For typical CBC problems, this can be as much as a few milliseconds. [We note that the LALInference MCMC sampler (Veitch et al. 2015) mitigates this by the use of a distributed computing model with a Message Passing Interface.] This overhead time is comparable to the likelihood evaluation time T_ℓ and results in imperfect scaling of the timing model Eq. (19). We model this by introducing S ≤ 1, a parallelisation inefficiency which we measure empirically. Then, the timing model for an analysis parallelised over n_cores cores is T = n_eff T_ℓ / (ε S n_cores). The number of cores should be matched to the number of parallelisable jobs, i.e. n_cores = n_temps n_ens / k, where k is a non-zero natural number. If the number of cores is mismatched with the number of parallelisable jobs, this will always leave one or more cores idle. When n_cores = n_temps n_ens, we refer to this as perfect matching.
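The serial and parallel timing models can be combined into one helper. This is a toy sketch with hypothetical names, assuming T = n_eff T_ℓ / (ε S n_cores); setting n_cores = 1 and S = 1 recovers the serial case. The numbers in the example are illustrative, not measurements from the paper.

```python
def wall_time(n_eff, t_like, efficiency, n_cores=1, s_factor=1.0):
    """Estimated wall-time in seconds to draw n_eff independent samples:
    T = n_eff * t_like / (efficiency * s_factor * n_cores).
    With n_cores = 1 and s_factor = 1 this reduces to the serial model."""
    return n_eff * t_like / (efficiency * s_factor * n_cores)

# Illustrative numbers only: 5000 samples, a 10 ms likelihood, and an
# efficiency of 0.1% give ~50,000 s serially; 16 cores at 75% scaling
# efficiency cut this by a factor of 12.
serial = wall_time(5000, 10e-3, 1e-3)
parallel = wall_time(5000, 10e-3, 1e-3, n_cores=16, s_factor=0.75)
```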
We can measure S empirically by looking at the speed-up for an identical analysis as a function of n_cores. We find that direct parallelisation results in values of S which are as small as (or, in the worst case, smaller than) 1/n_cores; i.e., the parallelisation can be slower than a serial run. This is because of the substantial data-transfer overhead.

[Footnote 3: Typically T_ℓ ranges from a few to many hundreds of milliseconds and is dominated by the cost of evaluating the waveform. Longer-duration signals typically take longer to evaluate. However, when calculating the likelihood during an MCMC analysis, cached waveform evaluations can be used, e.g. when proposing a move only in the extrinsic parameters. The discussion in this section assumes a fixed T_ℓ, resulting in a conservative timing estimate which ignores this potential computational saving.]
To mitigate the data-transfer overhead in parallel analyses, we transfer data and then take a fixed number, L1, of 'internal' MCMC steps. To further improve the efficiency, we do not store these internal steps. In effect, this pre-thins the MCMC chains by a factor of L1. When L1 > 1, the ACT and other associated quantities are calculated on the stored chain, but can be re-scaled. In Fig. 1, we determine the speed-up factor S for a test case in which n_temps = 16 and the per-likelihood evaluation time T_ℓ is held fixed at 10 ms. We run the experiment twice. First, we use L1 = 10, which demonstrates poor parallelisation scaling with an overall speed-up factor of ∼1/8 for perfect matching. Then, we increase L1 to 100 and see improved scaling, with S ∼ 3/4 for perfect matching. For n_cores = 8 or fewer, the performance is near-optimal (S ∼ 1). The marginal improvement in speed-up between n_cores of 10, 12, and 14 demonstrates the effect of imperfect matching.
Using the empirically measured S from Fig. 1, for an analysis using n_temps = 8 and n_ens = 1, Eq. (20) gives the rough timing for perfectly-matched parallelised runs. It is worth pointing out two caveats to this timing model. First, while increasing L1 improves S, if L1 is greater than the typical ACT, this itself introduces a new type of inefficiency (namely, an over-thinned chain). Second, these quantities are not independent. For example, if one wants to combine a large number of independent runs, each producing n_eff = 10 samples, it may appear that Eq. (21) would predict a ∼3 minute analysis time. However, the burn-in inefficiency would be increased, leading to a decrease in the efficiency and hence an increase in the overall run time. This may still be preferable, but we urge users to give consideration to the efficiency before starting analyses. The inefficiencies noted above are specific to the naive approach of combining multiple independent runs. The fundamental issue is that a standard MCMC chain produces unbiased samples from the target density only as the number of steps tends to infinity. There are sophisticated approaches which overcome these inefficiencies by instead aiming to obtain unbiased samples as the number of MCMC chains tends to infinity (Jacob et al. 2020). These approaches could be used in future to improve the efficiency of Bilby-MCMC when parallelised over many cores.

In this section, we have seen that Bilby-MCMC can be parallelised both by combining independent runs and by utilising multiprocessing. By comparison, the run-time of the dynesty nested sampler can only be reduced by the use of multiprocessing. This is because it is not possible to configure a nested sampler to run part of the full analysis (i.e. only to produce a small subset of the required total number of independent samples). In this work, we utilize multiprocessing of the dynesty sampler in Section 4, which can take advantage of multi-core processors. We note that dynesty can also be used in High Performance Computing (HPC) environments by multiprocessing using many multi-core processors (Smith et al. 2020). We discuss the relative merits of these two approaches with reference to a specific example in Section 4.3.

STANDARD VALIDATION TESTS
In this section, we outline a suite of tests designed to validate the Bilby-MCMC package for standardised problems. These tests build on previous validation tests of gravitational-wave samplers (Veitch et al. 2015; Biwer et al. 2019) and tests of the dynesty sampler (Speagle 2020) implemented in Bilby (Ashton et al. 2019; Romero-Shaw et al. 2020). Though not reported here, we additionally perform integration checks on individual aspects of the sampler and verify that when the likelihood is uninformative the prior is properly recovered. The scripts used to perform all verification checks, along with additional figures, are available from git.ligo.org/gregory.ashton/bilby_mcmc_validation; in Table 2, we also link to the individual tests.

Standard Normal distribution
As an initial test, we evaluate a one-dimensional standard-normal likelihood, L(μ) = exp(−μ²/2)/√(2π), where the prior on μ is uniform between −10 and 10, π(μ) = 1/20. In this case, the evidence can be estimated as Z = ∫ L(μ) π(μ) dμ ≈ 1/20 (the likelihood mass outside the prior bounds is negligible), i.e. ln Z ≈ −3.00, and the posterior p(μ) is a standard-normal distribution.
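This evidence can be checked numerically. The snippet below (our own helper, pure standard library) integrates the likelihood times the uniform prior over the prior support with the trapezoidal rule.

```python
import math

def log_evidence_uniform_normal(prior_width=20.0, lo=-10.0, hi=10.0, n=200001):
    """Trapezoidal-rule estimate of ln Z for a standard-normal likelihood
    with a uniform prior of width prior_width over [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        f = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) / prior_width
        total += 0.5 * f if i in (0, n - 1) else f
    return math.log(total * dx)

# Essentially all of the likelihood mass lies inside the prior bounds,
# so ln Z is very close to ln(1/20), i.e. about -3.00.
```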
Running the Bilby-MCMC and dynesty samplers on this problem, in Table 2 we report the configurations, the difference in log-evidence, and quantities related to the performance.
To verify the posterior sampling, we calculate the Jensen-Shannon divergence (JSD; see Appendix A for an extended discussion) between 5000 independent posterior samples drawn using each sampler and samples drawn directly from the known posterior. For all configurations, we report JSD values below a threshold of 2 mb (where mb is shorthand for a milli-bit of information): this demonstrates the posteriors are statistically identical. As such, we conclude that both the dynesty and Bilby-MCMC samplers are able to sample this simple inference problem without bias and report accurate estimates of the uncertainty.
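A histogram-based JSD estimate in milli-bits can be sketched as follows (our own helper, not the Bilby implementation). Note that finite sample sizes and bin counts bias the estimate upward, so thresholds must be calibrated as discussed in Appendix A.

```python
import numpy as np

def jsd_millibits(samples_p, samples_q, bins=100):
    """Estimate the Jensen-Shannon divergence between two sets of
    one-dimensional posterior samples, in milli-bits (base-2 logs),
    using histograms over a shared binning."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, edges = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl_bits(a, b):
        mask = a > 0  # m > 0 wherever a > 0, so the ratio is finite
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 1000.0 * 0.5 * (kl_bits(p, m) + kl_bits(q, m))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)  # same distribution, different draws
```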
To verify the estimates of the Bayesian evidence, we compare with the known evidence calculated in Eq. (24). Both the dynesty sampler and the Bilby-MCMC sampler using n_temps = 32 produce estimates of the evidence that agree with Eq. (24) to within the stated uncertainties. However, it is known that parallel-tempered evidence estimates have a bias which can be reduced by increasing the number of temperatures (Xie et al. 2010; Maturana-Russel et al. 2019). This point is demonstrated by the Bilby-MCMC analyses with n_temps = 16, which do not produce a result consistent with the known evidence.

Rosenbrock likelihood
We analyze the Rosenbrock likelihood (Rosenbrock 1960), taking the explicit form and priors from Eqn. (C2) of Fowlie et al. (2020). The banana-shaped posterior is challenging to sample from and representative of the types of posteriors seen in CBC inference problems. This makes it an ideal validation test. Results for several configurations of both samplers are listed in Table 2.
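For reference, a sketch of a Rosenbrock log-likelihood in its common two-dimensional form; the exact constants and priors used in the paper follow Eqn. (C2) of Fowlie et al. (2020) and may differ from this standard scaling.

```python
def log_likelihood_rosenbrock(x, y):
    """Negative 2-D Rosenbrock function used as a log-likelihood
    (standard scaling; the constants in the paper follow Fowlie et al.
    2020 and may differ)."""
    return -((1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2)

# The maximum sits at (1, 1); high-likelihood points trace the curved
# 'banana' ridge y = x**2, which is what makes the posterior hard to
# sample with simple proposals.
```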
We sample directly from the posterior distribution of the Rosenbrock likelihood using a reparameterization. This enables us to calculate the maximum JSD between samples drawn using different configurations of the Bilby-MCMC and dynesty samplers and the directly-sampled posterior. The maximum JSDs for the Bilby-MCMC analyses all fall below the 2 mb threshold for statistically identical posteriors. However, we find that the samples from the dynesty sampler using n_act = 10 are marginally above this threshold, while the analysis with n_act = 50 is below it. In re-running these analyses, we find variations in the JSD value of order ∼50%: this indicates the analyses are subtly biased. n_act is a user-controlled parameter described in Romero-Shaw et al. (2020) which determines the number of internal MCMC steps to take based on the estimated autocorrelation time. A value of 10 was previously found to be sufficient for binary black hole analyses (Romero-Shaw et al. 2020), but this test demonstrates larger values may be necessary to ensure convergence for the Rosenbrock likelihood. The dependence on n_act indicates the cause is likely to be the MCMC-within-nested-sampling algorithm itself (we used the version in Bilby v1.1.3 for the analyses in this work); investigation is needed to determine if this is failing and to resolve this bias.
We visualise the results in Fig. 2: Bilby-MCMC and the "direct" samples agree, but samples from the dynesty analyses (with n_act = 50) are overly constrained. This is a typical failure mode of posterior samples generated by nested sampling methods which use bounding ellipsoids to improve the performance of the sampler. We note that we do not see similar issues for CBC inference problems (see Section 4.3 and Section 4.4). This failure requires further investigation and highlights the need for cross-sampler comparisons.
The various MCMC configurations in Table 2 enable a comparison of the impact of the learning proposals. Using all three learning proposals (AG-DE-UN-GM-NF-KD) reduces the ACT by a factor of 10 with respect to the analysis without any learning proposals (AG-DE-UN). By running each of the learning proposals individually, we see that the normalizing-flow and GMM proposals each provide a similar performance improvement (with respect to the AG-DE-UN proposals alone) to all three together. Meanwhile, the KDE proposal alone provides only a factor of ∼2 reduction in the ACT. This demonstrates that learning proposals are a powerful tool in improving the efficiency of the MCMC algorithm.

[Table 2 caption (partial): For the gravitational-wave validation tests, we use the proposal set described in Table 1. In the Evidence column, we report Δ ln Z = ln Z_est − ln Z for the standard validation tests where the exact evidence ln Z is known; for the gravitational-wave validation tests (gray rows), where the evidence is not known, we report the natural logarithm of the signal vs. Gaussian-noise Bayes factor. Where the posterior distribution can be directly sampled from, we report the maximum JSD (see Appendix A) in milli-bits [mb]. For the MCMC configurations, we list the final estimated ACT τ; this is always given in raw steps (i.e. we rescale runs which use L1 > 1). For the gravitational-wave validation tests, τ is given by the mean value averaged over all independent runs; typically this varies by several tens of percent. We also report the posterior sampling efficiency described in Section 2.8, the total number of likelihood evaluations, and the number of independent samples the analysis produced.]
Finally, we turn to evidence estimation. The Rosenbrock likelihood used in this work has an analytically approximated evidence of ln Z = −5.804 (Fowlie et al. 2020). In Table 2, we provide evidence estimates for the dynesty and Bilby-MCMC samplers with n_temps = 16. For the dynesty sampler, the evidences agree to within the stated uncertainties. For the Bilby-MCMC sampler, the evidence estimate disagrees at the level of 1 standard deviation. This performance is consistent with the findings of Veitch et al. (2015), in which the LALInference MCMC sampler similarly struggled to consistently estimate the evidence of the Rosenbrock likelihood.

15-dimensional unimodal Gaussian
We analyze the 15-dimensional unimodal multivariate Gaussian distribution originally proposed in Veitch et al. (2015), using the specific configuration from Romero-Shaw et al. (2020). We report the results in Table 2, varying the number of parallel-tempered chains but utilising the standard proposal sets. For all samplers and configurations, the maximum JSD falls below the nominal 2 mb threshold for statistically identical samples. To visualise the results, in Fig. 3 we plot a subset of 5 posteriors in a violin plot. This shows strong agreement between the samplers and with samples drawn directly from the posterior.
The evidence for this 15-D unimodal Gaussian test case can be estimated directly (Romero-Shaw et al. 2020) as ln Z ≈ −34.54. The dynesty sampler correctly estimates the evidence to within the stated uncertainty in both configurations. Meanwhile, the Bilby-MCMC sampler gets close to the true evidence with n_temps = 32, but suffers the previously-discussed bias when n_temps is small. We study the performance of Bilby-MCMC while varying n_temps. In this unimodal case, even a single-temperature sampler can sample the posterior, and increasing the number of parallel temperatures from 1 to 16 only marginally reduces the ACT. As n_temps is increased, the uncertainty on the evidence estimate decreases, but the ACT does not significantly change. This is expected, since the target density is unimodal and does not require parallel tempering to hop between modes. As such, increasing the number of temperatures (to achieve a meaningful estimate of the evidence) results in a reduction in the efficiency, as predicted by Eq. (16).

15-dimensional bimodal Gaussian
We analyse a bimodal Gaussian distribution consisting of two copies of the unimodal Gaussian distribution (cf. Section 3.3) with means separated by 8 standard deviations in each dimension (as used in Romero-Shaw et al. (2020)). This test probes the ability of the sampler to efficiently hop between modes. With a single cold chain, the MCMC sampler is unable to find both modes (in other words, the ACT is infinite). With n_temps = 16, Bilby-MCMC is able to sample from both modes. Comparing to samples drawn directly from the posterior, the maximum JSDs for the dynesty sampler and the Bilby-MCMC sampler both fall below the threshold for statistically identical samples. Romero-Shaw et al. (2020) noted that the dynesty sampler tends to overweight one or other of the two modes in this test, but that, combining over many runs, the effect averages out. We confirm this in our individual runs of the dynesty sampler. For the Bilby-MCMC sampler, we find that the effect is weaker. Quantifying the effect by the number of samples in each mode, Bilby-MCMC tends to produce more equally-weighted posteriors (in agreement with the true posterior). This can be understood because the MCMC sampler is proposing jumps between modes, while for the dynesty sampler the relative weights of the two modes are determined by the bounding ellipsoids.

[Figure caption (partial): The maximum JSD (cf. Table 2) quantifies that the Bilby-MCMC sampler produces statistically identical posterior samples, while the dynesty sampler produces JSD values at the failure threshold. Visually, we see that the posterior samples produced by the dynesty sampler are overly constrained.]
The evidence for the 15-D bimodal Gaussian can be directly estimated (Romero-Shaw et al. 2020) as ln Z ≈ −34.54. Comparing the evidence estimated by the samplers to this direct estimation, we find similar performance to that of the 15-D unimodal Gaussian studied in Section 3.3: namely, the dynesty sampler outperforms Bilby-MCMC in both accuracy and uncertainty.

GRAVITATIONAL-WAVE VALIDATION TESTS
In this section, we discuss the specifics and validation of the Bilby-MCMC sampler for CBC gravitational-wave inference. The inference of CBC signals has been well-studied in the literature. The fundamentals can be found in Veitch et al. (2015), a recent review in Thrane & Talbot (2019), and the specifics of the Bilby interface in Ashton et al. (2019) and Romero-Shaw et al. (2020). In Section 4.1, we introduce the basics of the CBC model and describe the best-known parameterisation of θ to reduce the ACT. Then, in Section 4.2, we discuss the use of analytic marginalization methods which reduce the dimensionality of θ in sampling.

Models, optimal parameterisation, and priors
A circularised gravitational-wave signal from a CBC can be described by a set of 17 model parameters θ. We can partition θ into 11 intrinsic parameters (two mass parameters, six spin parameters, the binary phase, and up to two tidal deformability parameters) and 6 extrinsic parameters (the 3-D localisation, polarisation, merger time, and the angle between the total angular momentum and the line of sight).
There are many ways to choose these 17 parameters in the literature; the different parameterisations offer varying levels of computational convenience and interpretability. In this work, we use the following parameterisation for CBC analyses, chosen because these parameters lead to the shortest autocorrelation lengths in our tests.
Mass. Labelling the detector-frame masses of the two objects in the binary m1 and m2, we sample in the detector-frame chirp mass M = (m1 m2)^(3/5) / (m1 + m2)^(1/5) and mass ratio q = m2/m1. We apply prior cuts, discussed below, such that q ≤ 1. This is the standard choice employed for compact binary analyses, as the chirp mass is the best-measured parameter for binary inspirals, followed by the mass ratio (Cutler & Flanagan 1994).
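The chirp-mass/mass-ratio reparameterisation is a one-line calculation; the helper below is a minimal sketch (our own function, not the Bilby conversion utility), with illustrative masses.

```python
def chirp_mass_and_ratio(m1, m2):
    """Convert detector-frame component masses (m1 >= m2) to the
    detector-frame chirp mass and mass ratio q = m2 / m1 <= 1."""
    chirp_mass = (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2
    return chirp_mass, m2 / m1

mc, q = chirp_mass_and_ratio(36.0, 29.0)  # illustrative BBH-scale masses
```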
Spin. The spins of the compact objects contribute six degrees of freedom to the problem. Following Farr et al. (2014), we sample in the magnitudes and tilts parameterised in spherical coordinates with the z-axis aligned with the total angular momentum, using the magnitude a_i and tilt θ_i (where i ∈ {1, 2} labels the primary and secondary objects), along with two azimuthal parameters {φ_JL, φ_12}.
Tides. Tidal deformability of neutron stars is typically described in terms of either the two dimensionless deformability parameters Λ_i or combinations of these parameters that directly determine the contribution to the phase, {Λ̃, δΛ̃} (Flanagan & Hinderer 2008; Favata 2014; Wade et al. 2014). We sample in the latter set as this reduces the ACT.
Location. The location of the binary is uniquely described by four parameters: the distance to the source, the two-dimensional sky location, and the merger time. We choose these parameters as in LALInference. We specify the merger time as the time of arrival of the merger signal at one of the detectors, ideally the one where we expect the highest signal-to-noise ratio. We characterize the sky location using a reference frame based on the separation vector of two of the detectors; see Romero-Shaw et al. (2020), Sec. 3.1.7, for an explicit definition. For all of the results presented here, we marginalize over the distance to the source (see Section 4.2), so the specific choice of distance parameter is irrelevant.
Orientation. Finally, we require three Euler angles to convert the binary frame to the galactic reference frame. These are the inclination angle between the binary angular momentum and the line-of-sight from the source to the observer, θ_JN; the binary phase φ at a reference frequency (typically the merger frequency); and the polarization of the source, ψ. Throughout, we sample in cos(θ_JN) and ψ. In practice, there is a strong correlation between the phase and the polarization, and so we sample in a phase-offset parameter δ = φ + sgn(cos θ_JN) ψ. The change of sign is due to a change in the direction of the degeneracy when observing from above/below the orbital plane. This parameterisation introduces a discontinuity in the likelihood at θ_JN = π/2. However, the proposal schemes outlined in Section 2.1, including the machine-learned proposals, do not depend on assumptions of smoothness. In practice, we find that this parameterisation improves the performance relative to analyses which use the phase directly.
Following LALInference, we apply a prior uniform in the component masses m1 and m2 with cuts in the chirp mass and mass ratio. We then apply non-informative priors on all other parameters and a prior uniform in the source frame for the luminosity distance (Romero-Shaw et al. 2020).

Analytic likelihood marginalization
Of the 17-dimensional parameter space described in Section 4.1, there are three parameters, the luminosity distance, geocentric time, and binary phase, over which we are able to efficiently marginalize the gravitational-wave likelihood [see Veitch & Del Pozzo (2013); Farr (2014); Veitch et al. (2015); Singer & Price (2016); Singer et al. (2016); and Thrane & Talbot (2019) for a review]. In the context of an MCMC sampler, the marginalized likelihood has a shorter ACT relative to the non-marginalized likelihood. This is due both to the reduction in dimensionality and to the reduction in the complexity of the posterior. Since it is possible to reconstruct the marginalized parameters after the analysis (Thrane & Talbot 2019), marginalized likelihoods are strongly recommended where possible. For the luminosity distance, we always marginalize the likelihood. For the geocentric time, we marginalize the likelihood [and add the time jitter, as described in Romero-Shaw et al. (2020)], except in instances where the reduced-order-quadrature (ROQ) method is used, for which time marginalization has not yet been implemented. The assumptions made in marginalizing the binary phase are invalid for precessing CBC systems or models that include higher-order emission modes. Therefore, we do not marginalize the binary phase in this work. But, in future use cases where a non-precessing waveform without higher-order emission modes is considered, we do recommend using phase marginalization.

Fiducial binary black hole: BBH A
We simulate a fiducial (reference) binary black hole (BBH) signal observed by the LIGO Hanford and Livingston detectors (Aasi et al. 2015) at their design sensitivity (Abbott et al. 2020). The simulation parameters, labelled as BBH A, are given in Table 3. We use the IMRPhenomPv2 (Hannam et al. 2014; Schmidt et al. 2012) waveform approximant to both simulate and analyse the signal. In this noise realisation, the simulated signal has a network matched-filter signal-to-noise ratio (SNR) of ∼13. We analyse 4 s of simulated data with the dynesty and Bilby-MCMC samplers using the configurations described in Table 2, the priors described in Section 4, and distance and time marginalization. For the Bilby-MCMC sampler, we use 13 independent chains, a thinning factor of 0.2, and run each chain until it produces 2000 samples. In total, this produces 25000 samples with n_eff = 5000. For the dynesty sampler, we use the standardised configuration listed in Table 2, but use two independent runs to enable a robustness check.

[Table 3 caption: Simulation parameters for the three fiducial events analysed in Section 4. In the left two columns, we provide the parameter group names as described in Section 4.1.]
It is not possible to sample directly from the posterior in this case, so we resort to cross-sampler comparisons to verify posterior sampling. Across all CBC parameters, we find that the maximum JSD between the samplers falls below the 2 mb threshold: i.e. we find statistically identical posteriors between dynesty and Bilby-MCMC. To visualise these differences, in Fig. 4 we plot histograms of several quantities of astrophysical interest, along with their individual JSDs.
In the evidence column of Table 2, we report the Bayes factor between the signal evidence and the Gaussian-noise evidence (for a fixed realisation of the noise and power spectral density, this is a fixed quantity). The evidence estimates disagree at the 1-σ level of the quoted uncertainties. The difference is likely explained by the known bias in parallel-tempered evidence estimates (cf. Section 2.6) when n_temps is small. Here, we tune n_temps for efficient sampling of the posterior rather than for evidence estimation. In such a configuration, we recommend that the evidence estimate be used only as a rough guide and not for quantitative analysis. To reduce the bias, n_temps can be increased at the cost of posterior sampling efficiency.
Comparing the performance of the two samplers, the dynesty sampler is an order of magnitude more efficient than the Bilby-MCMC sampler. However, this efficiency does not directly translate into an order-of-magnitude reduction in wall-time. To understand why, we need to discuss the parallelisation strategies available.
As discussed in Section 2.9, we have two available levels of parallelisation: combining independent runs and multiprocessing using n_cores processors.

[Fig. 4 caption (partial): Vertical lines mark the edges of the 90% credible interval for each sampler, and a black dotted line marks the value used to simulate the data. Note that we do not expect the posteriors to peak at the simulation values due to the influence of the simulated noise and the Bayesian prior. In the title of each panel, we give the JSD; across all parameters this JSD is found to be below the 2 mb threshold we use to determine if the two sets of posteriors are statistically different.]

For the dynesty sampler, reductions in wall-time
can only be achieved via multiprocessing. This is because it is not possible to configure a nested sampler to run part of the full analysis (i.e. to only produce a small subset of the required total number of independent samples). A simple model for the wall-time of the dynesty run, which agrees with our measured wall-time, is T ≈ N_ℓ T_ℓ / n_cores, where N_ℓ is the number of likelihood evaluations (cf. Table 2) and T_ℓ is the approximate per-likelihood evaluation time for the BBH A likelihood. Here, we use 16-core processors; below, we discuss the potential scaling to larger multiprocessing pools. On the other hand, for Bilby-MCMC we can parallelise using both independent runs and multiprocessing. We run several independent runs, each producing 400 independent samples. In a HTC environment (and assuming access to resources is not limited), these can be run at the same time, so that the total analysis wall-time is given by the wall-time of any individual run. This follows from Eq. (20), perfectly matching n_cores to n_temps and using the actual efficiency from Table 2 and the multiprocessing speed-up factor from Section 2.9. Both Eq. (27) and Eq. (28) agree with the empirically measured values (up to errors expected for varying access to resources in a HTC environment).
The net result is that the Bilby-MCMC sampler is less efficient, but can be set up to enable a shorter wall-time by utilising independent runs. Some of this inefficiency arises from the sampler itself, some from the burn-in inefficiency. For this configuration, the burn-in inefficiency, Eq. (16), is a few percent; further parallelisation (in terms of more independent runs) would increase this inefficiency.
For the dynesty sampler, reducing the wall-time can only be achieved via access to a larger multiprocessing pool. The ability to do this is restricted by the available hardware: n_cores of 8 to 16 is typical in most HTC environments, though modern CPUs with up to 128 cores do exist, which could provide significant speed-ups. Beyond this, massively parallelised nested sampling can leverage multiple CPUs in a HPC environment: in Smith et al. (2020), processing pools including several hundred cores have been used, providing two orders of magnitude of speed-up. (We caution that we have not verified the validity of Eq. (27) for such massively-parallel environments.) However, access to such resources requires synchronised usage of a dedicated HPC environment.
To investigate the potential for bias in the Bilby-MCMC sampler, in Fig. 5 we show the results of a parameter-parameter (PP) test (Cook et al. 2006; Talts et al. 2018) for BBH systems. This is an important test; typically, it fails when one or more of the proposal distributions does not respect detailed balance. In this test, we simulate 100 BBH signals drawn from an astrophysical prior distribution, analyse each using the Bilby-MCMC sampler, and then check the consistency of the reported credible intervals. Specifically, Fig. 5 shows the fraction of events found within a given confidence interval as a function of the confidence interval. We find that the Bilby-MCMC sampler is unbiased at the level probed by this test.
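The logic of a PP test can be illustrated with a toy self-consistency check (our own sketch, not the Bilby implementation): draw truths and unbiased posterior samples from the same distribution, record the credible level at which each truth lies, and compare the empirical distribution of levels to the uniform diagonal.

```python
import numpy as np

def pp_test_statistic(n_events=100, n_post=1000, seed=1):
    """Toy PP self-consistency check: for each simulated event, draw a
    true value and an unbiased set of posterior samples from the same
    distribution, record the credible level at which the truth lies,
    and measure the largest deviation of the sorted levels from the
    uniform diagonal (an approximate KS statistic)."""
    rng = np.random.default_rng(seed)
    levels = np.empty(n_events)
    for i in range(n_events):
        truth = rng.normal()
        posterior = rng.normal(size=n_post)  # unbiased posterior samples
        levels[i] = np.mean(posterior < truth)
    levels.sort()
    ecdf = np.arange(1, n_events + 1) / n_events
    return float(np.max(np.abs(ecdf - levels)))

deviation = pp_test_statistic()  # small for an unbiased sampler
```

A biased sampler (e.g. one whose posteriors are systematically shifted or over-constrained) would skew the credible levels away from uniformity and inflate this statistic; the paper's version converts the deviation for each parameter into a p-value.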

Fiducial binary neutron star: BNS A
We simulate a fiducial binary neutron star (BNS) merger using the IMRPhenomPv2_NRTidal waveform (Dietrich et al. 2017, 2019), which includes matter effects from the two neutron stars. The simulation parameters of the system, BNS A, listed in Table 3, correspond to a much lower mass than that of the BBH systems previously studied. The result of this lower mass is that the signal spends a longer duration in the observable band of the detectors (typically, above 20 Hz). To capture this, we analyse 128 s of data. Necessarily, this results in a significant increase in the time required to evaluate the likelihood and hence the overall wall-time. To mitigate this, we use the Reduced-Order-Quadrature (ROQ) method (Antil et al. 2012; Canizares et al. 2013, 2015; Smith et al. 2016; Qi & Raymond 2020) with the basis provided by Baylor et al. (2019) to decrease the per-likelihood evaluation cost.
The simulated signal has small spin components aligned along the angular momentum axis, an arbitrarily selected choice of tidal deformability parameters, and nearly equal mass components. In the specific noise realisation used, the network matched-filter SNR is ∼18. We analyse the signal using both the dynesty and Bilby-MCMC samplers with the configurations described in Table 2. The analyses are identical to those of the BBH A analysis, except that we use the IMRPhenomPv2_NRTidal waveform model (through the ROQ basis), use only distance marginalisation, and restrict the spins to a low-spin configuration (dimensionless spin magnitude less than 0.05; Abbott et al. 2019).
As with the BBH case, the Bayesian evidence estimates (see Table 2 for the signal vs. noise Bayes factor) disagree. Again, we conclude this is due to the known bias in the parallel-tempered evidence estimate. The posterior distributions from the dynesty and Bilby-MCMC samplers are statistically identical, except for the inclination parameter. In Fig. 6, we reproduce histograms for selected parameters of typical astrophysical inference, visually demonstrating the agreement and the inclination difference. The cause of the difference in the inferred inclination is not yet fully understood, but we note that the difference is only marginally above our threshold for statistically identical posteriors. Comparing individual re-analyses between the two samplers, the difference persists, suggesting it is systematic and not a random fluctuation. We note this is an instance where the posterior is bimodal and speculate this could be a symptom of the nested sampling failing to fully explore both modes. However, the difference is sufficiently small for us to conclude that the underlying conclusions about the source (i.e. the 90% credible intervals) are robust, while the posterior shape is subject to some sampling error (from one or both samplers).
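The "statistically identical" criterion used in these comparisons is based on the Jensen-Shannon divergence (JSD) between the two sets of posterior samples. A minimal sketch of such a comparison is given below; this is our own histogram-based estimator with an assumed bin count, not the paper's exact implementation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_millibits(samples_a, samples_b, bins=50):
    """Estimate the Jensen-Shannon divergence (in millibits, base 2)
    between two sets of 1D posterior samples via a shared histogram."""
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_a, bins=edges, density=True)
    q, _ = np.histogram(samples_b, bins=edges, density=True)
    # scipy returns the JS *distance* (square root of the divergence)
    return 1000 * jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(1)
a = rng.normal(size=50_000)
b = rng.normal(size=50_000)
print(jsd_millibits(a, b))  # small for samples from the same posterior
```

Two samplers targeting the same posterior give a JSD near zero (well under the 2 mb threshold), while a systematically shifted posterior gives values orders of magnitude larger.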
Due to the larger SNR of the fiducial BNS and the increase in the dimension of the prior, the efficiency of both the dynesty and Bilby-MCMC samplers is reduced compared to that of the fiducial BBH. The ratio of efficiencies is also increased: the dynesty sampler is ∼30 times more efficient in this case. As with the BBH analyses, this efficiency does not directly translate into wall-time savings due to the different parallelisation approaches. However, the difference in efficiency is significant. In future work, we aim to improve the choice of parameterisation and proposals to improve the efficiency of the Bilby-MCMC sampler.
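The sampling efficiency discussed here is governed by the integrated autocorrelation time τ: a chain of length n yields roughly n/τ independent posterior samples. A minimal sketch of the standard windowed estimator follows (our own implementation using Sokal's window criterion, not Bilby-MCMC's internal code):

```python
import numpy as np

def autocorrelation_time(chain, c=5.0):
    """Integrated autocorrelation time of a 1D chain, using the
    FFT-based ACF and a self-consistent window (Sokal's criterion)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, n=2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n]
    acf /= acf[0]
    tau = 1.0
    for m in range(1, n):
        tau += 2.0 * acf[m]
        if m >= c * tau:  # stop once the window exceeds c * tau
            break
    return max(tau, 1.0)

# AR(1) chain with known tau = (1 + rho) / (1 - rho) = 19
rng = np.random.default_rng(7)
rho, n = 0.9, 100_000
noise = rng.normal(size=n)
chain = np.empty(n)
chain[0] = noise[0]
for i in range(1, n):
    chain[i] = rho * chain[i - 1] + noise[i]

tau = autocorrelation_time(chain)
n_independent = n / tau  # effective number of independent samples
print(tau)
```

Proposals that reduce τ, such as the learning proposals discussed in this work, directly increase the number of independent samples per likelihood evaluation.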

SUMMARY
We introduce Bilby-MCMC, a parallel-tempered ensemble sampler with problem-specific and machine-learning based proposals. The Bilby-MCMC sampler is the first MCMC sampler implemented in the Bilby (Ashton et al. 2019) inference package with demonstrated performance for analysing CBC events observed by ground-based gravitational-wave detectors. We demonstrate, using both comparisons to known results and cross-sampler comparisons, that the posterior samples are unbiased. Compared to the dynesty nested sampling algorithm, Bilby-MCMC suffers a known bias in its estimation of the Bayesian evidence when the number of parallel-tempered chains, n_temps, is small. Increasing n_temps reduces the bias, but at the cost of posterior-sampling efficiency. We introduce a method to resample from the tempered chains, recovering some of this inefficiency, but find it provides little improvement for typical CBC inference problems. We conclude that Bilby-MCMC is ideal for problems in which only the posterior distribution is of interest, but that nested sampling approaches should be preferred when evidence calculations are required. This makes Bilby-MCMC unsuitable for model comparison via a Bayes factor (MacKay 2003). Instead, one may wish to develop a hyper-model in which the model is treated as a random variable (see, e.g., the Reverse Jump Markov Chain Monte Carlo approach described in Cornish & Littenberg 2007).
Bilby-MCMC can be trivially and asynchronously parallelised. This enables it to be configured to leverage High-Throughput Computing environments to reduce the wall-time. That the parallelisation is asynchronous makes it ideal for utilising non-interacting distributed computing such as the Open Science Grid (Pordes et al. 2007; Sfiligoi et al. 2009). By comparison, nested sampling approaches can be parallelised solely through the use of multiprocessing. Smith et al. (2020) demonstrated massive scaling of the dynesty (Speagle 2020) sampler to many hundreds of cores; Bilby-MCMC cannot similarly be scaled due to the fundamental limit of the burn-in inefficiency. However, the Smith et al. (2020) approach requires synchronised access to a High-Performance Computing environment in which the communication times between cores are rapid.
Bilby-MCMC provides the user access to a modular library of proposal distributions which can be chained together. The choice of parameterisation and proposals has a significant effect on the efficiency of the sampler. We anticipate that further development in both these aspects will improve the sampler efficiency, resulting in reduced wall-time. Users adapting Bilby-MCMC to other astrophysical inference problems can define their own sets of proposal distributions and easily implement new problem-specific proposals by sub-classing the existing software.
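As a schematic of what such a sub-classed proposal might look like, consider the sketch below. The class and method names here are illustrative stand-ins, not the actual Bilby-MCMC API:

```python
import numpy as np

class BaseProposal:
    """Minimal stand-in for a sampler's proposal interface; the real
    Bilby-MCMC base class and method signatures may differ."""
    def propose(self, sample, rng):
        raise NotImplementedError

class CorrelatedJumpProposal(BaseProposal):
    """A problem-specific proposal: jump along a known correlated
    direction in parameter space (e.g. a mass-parameter degeneracy)
    with a tunable scale sigma."""
    def __init__(self, direction, sigma=0.1):
        self.direction = np.asarray(direction, dtype=float)
        self.sigma = sigma

    def propose(self, sample, rng):
        # Symmetric proposal, so the log Hastings factor
        # log q(x|x') - log q(x'|x) is zero
        step = self.sigma * rng.normal() * self.direction
        return np.asarray(sample, dtype=float) + step, 0.0

proposal = CorrelatedJumpProposal(direction=[1.0, -0.5])
new_sample, log_factor = proposal.propose([0.0, 0.0], np.random.default_rng(3))
```

A proposal like this can then be assigned a weight and mixed into the proposal cycle alongside the default set, as described for Table 1.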

Figure 1. Empirically measured speed-up for a test analysis with n_temps = 16 and n_ens = 1. Solid lines indicate the mean while the shaded region indicates the standard deviation as measured over 3 identical runs. The right-hand axis provides the speed-up factor at perfect matching.

Figure 2. Comparison of the dynesty sampler (with n_live = 2000 and n_act = 50), the Bilby-MCMC sampler (with n_temps = 1 and the full set of learning proposals), and samples drawn directly from the posterior for the Rosenbrock test. The JSD test between each of the samplers and the direct samples (see Table 2) quantifies that the Bilby-MCMC sampler produces statistically identical posterior samples, while the dynesty sampler produces JSD values at the failure threshold. Visually, we see that the posterior samples produced by the dynesty sampler are overly constrained.

Figure 3. Violin plot showing posteriors for 5 parameters of the 15-dimensional unimodal Gaussian test. Each violin shows 5000 samples drawn directly from the posterior (gray), from the dynesty analysis with n_act = 50 (blue), and from the Bilby-MCMC analysis with n_temps = 1 (orange). Vertical lines denote the median and 90% credible interval.

Figure 4. Histograms of the posteriors from the dynesty (blue) and Bilby-MCMC (orange) analyses of the fiducial BBH. Configurations and summary statistics are given in Table 2. Vertical lines mark the edges of the 90% credible interval for each sampler and a black dotted line marks the value used to simulate the data. Note that we do not expect the posteriors to peak at the simulation values due to the influence of the simulated noise and the Bayesian prior. In the title of each panel, we give the JSD; across all parameters, this JSD is found to be below the 2 mb threshold we use to determine if the two sets of posteriors are statistically different.

Figure 5. A parameter-parameter test for the Bilby-MCMC sampler for simulated BBH systems. We plot the fraction of simulated events found within the confidence interval (C.I.) as a function of the C.I. For an unbiased sampling from the posterior distribution, the lines of this plot are diagonal: we add three gray shaded regions showing the 1-, 2-, and 3-σ quantiles. To quantify whether the results are consistent with unbiased sampling, we calculate a p-value of the probability that they are unbiased. The p-value for each individual parameter is given in the legend and a combined p-value is given in the title. For an unbiased result, we would expect each p-value to be a draw from a uniform distribution on [0, 1]. Since all individual parameters (and the combined result) are greater than 1/15 (a nominal threshold based on the number of parameters), we conclude the sampler is unbiased, at least at the level probed by 100 simulations.

Figure 6. Histograms of selected posteriors from the dynesty (blue) and Bilby-MCMC (orange) analyses of the fiducial BNS A. Configurations and summary statistics are given in Table 2. Vertical lines mark the edges of the 90% credible interval for each sampler and a black dotted line marks the value used to simulate the data. Note that we do not expect the posteriors to peak at the simulation values due to the influence of the simulated noise and the Bayesian prior. The largest JS-divergence reported across all parameters occurs for the inclination, which is above the threshold of 2 mb (see Appendix A).

Table 1. The gravitational-wave proposal set used in this work. For a description of the proposals themselves, see Section 2.1. In cases where the parameter subset is "all", the whole parameter set is updated. Where a subset is listed (see Section 4.1), only that subset is updated by the proposal. The weights are unnormalised and determine the relative frequency of each proposal. In the final row of "Prior" proposals, each parameter is updated individually, not as a set.

Table 2. Validation tests reported in this work. White-shaded rows are those from the standard validation tests (Section 3) while gray-shaded rows are from the gravitational-wave validation tests (Section 4). For the standard validation tests we give the Bilby-MCMC configuration by the set of proposals (described in Section 2.1), while for the dynesty sampler we give n_live and n_act (cf. Romero-Shaw et al.).