Reversible Jump PDMP Samplers for Variable Selection

A new class of Markov chain Monte Carlo (MCMC) algorithms, based on simulating piecewise deterministic Markov processes (PDMPs), has recently shown great promise: these algorithms are non-reversible, can mix better than standard MCMC algorithms, and can use subsampling ideas to speed up computation in big data scenarios. However, current PDMP samplers can only sample from posterior densities that are differentiable almost everywhere, which precludes their use for model choice. Motivated by variable selection problems, we show how to develop reversible jump PDMP samplers that can jointly explore the discrete space of models and the continuous space of parameters. Our framework is general: it takes any existing PDMP sampler and adds two types of trans-dimensional moves that allow for the addition or removal of a variable from the model. We show how the rates of these trans-dimensional moves can be calculated so that the sampler has the correct invariant distribution. Simulations show that the new samplers can mix better than standard MCMC algorithms. Our empirical results show they are also more efficient than gradient-based samplers that avoid model choice through the use of continuous spike-and-slab priors, which replace a point mass at zero for each parameter with a density concentrated around zero.


Introduction
There is currently much interest in developing MCMC algorithms based on simulating piecewise deterministic Markov processes (PDMPs). These are continuous-time Markov processes whose dynamics are deterministic between a set of random event times. We demonstrate how to adapt existing PDMP samplers to variable selection problems. Specifically, they evolve as the PDMP sampler when exploring the posterior associated with a given model, but with two additional types of event: if any parameter value hits 0, the PDMP jumps to the smaller model where the corresponding variable is removed; whilst with some rate there are events that re-introduce variables into the model. We show in Section 3 how to calculate the rate and transition for these new types of event so that the sampler has the correct invariant distribution. To calculate these we need different techniques from those used for existing PDMP samplers, as we need to account for the behaviour of the process when parameters hit zero. The techniques we use are most similar to those in Bierkens et al. (2018), which considers PDMPs with restricted domains. However, in that paper the dynamics at the boundary of the domain could be chosen so that the net flow of probability at the boundary is zero; whereas we need to balance the probability flow out of a model, which occurs when a parameter hits zero, with the flow into the model caused by the events that re-introduce variables.
The approach we present is generic, in that it can take any current PDMP sampler and produce a version that can be applied to the variable selection problem. We call the new class of samplers reversible jump PDMP samplers, due to the analogy with reversible jump MCMC (Green, 1995). We show how to derive reversible-jump versions of both ZigZag and the Bouncy Particle Sampler in Section 4, before empirically investigating these algorithms on both logistic regression and robust linear regression models.
Proofs of all theorems are relegated to the appendix. Code for implementing the new reversible jump PDMP samplers, and for replicating our examples, is available from https://github.com/matt-sutton/rjpdmp.

Variable Selection
We will consider model selection problems that arise from variable selection. The general framework is that we have a vector of parameters, θ = (θ_1, ..., θ_p), and each model is characterised by setting some subset of the θ_j to 0. This is a common setting across linear models, generalised linear models and various extensions.
To make ideas concrete, consider a linear model Y = Σ_{j=1}^p θ_j X_j + ε, where Y is a vector of response variables, each X_j is a vector of covariates, and ε is an additive noise vector. When p is large it is common to fit such a model under a sparsity assumption, namely that many of the θ_j are 0.
In a Bayesian analysis, such a sparsity assumption is encapsulated in our choice of prior on θ. To aid interpretation of the variable selection priors it is common to introduce a latent variable γ = (γ_1, ..., γ_p) where γ_j = 1 if the covariate X_j is included in the model, i.e. if θ_j ≠ 0. We let |γ| = Σ_j γ_j denote the number of covariates included in the corresponding model. Writing θ_γ for the sub-vector of θ containing only the selected variables, any prior can be written in a hierarchical form where we have a prior on γ, then conditional on γ we have a prior on θ_γ, and set all remaining entries of θ to 0.
A special case is where each component of θ is independent of the others, in which case the prior can be written as θ_j ∼ w_j g_j(θ_j) + (1 − w_j) δ_0(θ_j), j = 1, ..., p, where w_j ∈ (0, 1) is the prior probability that γ_j = 1, δ_0 is a Dirac measure at zero and g_j(θ_j) is a distribution that models our prior beliefs for θ_j conditional on that variable being included in the model. Bayesian approaches to variable selection that put a probability mass on θ_j = 0 in this way will be referred to as Dirac spike-and-slab methods. Notable examples of these methods include Mitchell and Beauchamp (1988); Kuo and Mallick (1998); Geweke (1996); Smith and Kohn (1996); Bottolo and Richardson (2010). While this formulation is natural from a modelling perspective, sampling from the resulting posterior distribution can be challenging, with, for example, MCMC samplers that use gradient information, such as Hamiltonian Monte Carlo (Neal, 2011), not being applicable. To circumvent this issue it is common to use an approximation to this prior which replaces the point mass at 0 with a density that is peaked around 0, such as

θ_j ∼ w_j g_j(θ_j) + (1 − w_j) N(0, c_j² τ_j²), j = 1, ..., p,   (1)

where c_j is taken small so that N(0, c_j² τ_j²) approximates the Dirac spike. We will refer to Bayesian variable selection methods that replace the Dirac in the prior with a continuous approximation as continuous spike-and-slab methods. This prior was originally proposed in linear regression, where it is commonly referred to as the stochastic search variable selection procedure (George and McCulloch, 1993).
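To illustrate the difference between the two priors, the following sketch (our own illustration, assuming standard Normal slabs g_j and made-up values for w_j and c_j) draws from a Dirac spike-and-slab prior and its continuous approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
p, w, c = 5, 0.3, 0.05        # illustrative values, not from the paper

# Dirac spike and slab: theta_j ~ w * g_j + (1 - w) * delta_0, with g_j = N(0, 1)
gamma = rng.random(p) < w                   # inclusion indicators gamma_j
slab = rng.normal(0.0, 1.0, p)
theta_dirac = np.where(gamma, slab, 0.0)

# Continuous approximation: replace delta_0 by a narrow N(0, c^2)
theta_cont = np.where(gamma, slab, rng.normal(0.0, c, p))

print(theta_dirac)    # excluded coordinates are exactly 0
print(theta_cont)     # excluded coordinates are only close to 0
```

The key distinction for what follows is that under the Dirac prior the excluded coordinates are exactly zero, whereas the continuous approximation keeps a differentiable density everywhere.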

PDMP Samplers
Piecewise deterministic Markov processes (PDMPs) are an emerging class of non-reversible continuous-time samplers. To be consistent with commonly used notation for PDMP samplers, we will consider sampling from a distribution with density π(x) defined on some space X; this is a slight change of notation relative to our regression model of the previous section. Current samplers augment the state to include a velocity vector and sample from a distribution on E = X × V. In the following, for z ∈ E we will use the notation z = (x, v), with x ∈ X a position and v ∈ V a velocity.
The PDMP can be defined by (i) the deterministic dynamics between a set of random event times; (ii) the state-dependent rate at which events occur, λ(z); and (iii) a probability distribution for the change in state at each event, with density q(z′|z). We will consider PDMPs whose deterministic trajectories follow a differential equation of the form dx_t/dt = Φ(z_t), with Φ : E → V a smooth function and the velocity constant between events. This setting contains the usual PDMP samplers such as ZigZag (Bierkens et al., 2019), the Bouncy Particle Sampler (Bouchard-Côté et al., 2018), and the Coordinate Sampler (Wu and Robert, 2020).
For example, the ZigZag algorithm to simulate from π(x) for x ∈ R^p would introduce a velocity vector v ∈ {−1, 1}^p, with deterministic dynamics given by the constant velocity model, dx_t/dt = v. Events occur at a rate that depends on the gradient of π(x) in each component of the velocity, and at an event one component of the velocity is switched. The rate at which the jth component of the velocity is flipped is λ_j(x, v) = max{0, −v_j ∂_j log π(x)}. These rates depend on the target distribution only through the gradient of the log of the target, which importantly means that we need only know the target distribution up to proportionality.
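For a standard Normal target, −∂ log π(x) = x, so the flip rate along a trajectory is max{0, v x + t}, which can be integrated and inverted in closed form to give exact event times. A minimal one-dimensional sketch (our own illustration, not code from the paper's repository):

```python
import numpy as np

rng = np.random.default_rng(42)

def flip_time(a):
    """First event time of the rate max(0, a + t), a = v * x, by inversion."""
    e = rng.exponential()
    return np.sqrt(max(a, 0.0) ** 2 + 2.0 * e) - a

T, t = 5000.0, 0.0
x, v = 0.0, 1.0
int_x, int_x2 = 0.0, 0.0      # time integrals of x and x^2 along the path

while t < T:
    tau = min(flip_time(v * x), T - t)
    # trajectory is linear in between events: accumulate exact integrals
    int_x += x * tau + v * tau**2 / 2.0
    int_x2 += x**2 * tau + x * v * tau**2 + tau**3 / 3.0
    x += v * tau
    t += tau
    v = -v                    # flip the velocity at the event

print(int_x / T, int_x2 / T)  # time averages: approximately 0 and 1 for N(0, 1)
```

Because the trajectory is piecewise linear, expectations are computed as exact time integrals along the path rather than from discrete samples.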
We can apply current PDMPs, such as ZigZag, to the Bayesian variable selection problem if we use the continuous spike-and-slab prior (1). Realisations of such a sampler are shown in the left two plots of Figure 1 as we vary how concentrated the spike distribution is. For the more concentrated case, the sampler becomes inefficient, as it involves many switching events when the state variable is close to 0.
Intuitively, as we make the variance of the spike component of the prior tend to 0, the prior converges to a prior with a point mass at 0; furthermore, we can observe the output of our PDMP sampler "converging" to the process shown in the right-hand plot of Figure 1. Here, rather than the state having periods where it oscillates around 0, it has periods of time where θ_j = 0, and thus has many fewer events.

Reversible Jump PDMP Samplers
Let M = (M_1, M_2, ...) be a set of models indexed by a parameter k. To each model M_k corresponds a state space X_k of dimension p_k. For the sake of clarity, we will limit ourselves to variable selection, though it is straightforward to apply the results below to more general model selection problems. In our case X_k has a specific form: X_k = R^{γ_1^k} × ... × R^{γ_p^k}, where we abuse notation by using R^0 = {0}, and where (γ_i^k)_i is a sequence of numbers in {0, 1} representing whether a variable is enabled for model M_k. Let π be the target posterior probability defined on X = ∪_k X_k. We further assume that the restriction of π to each X_k has a density, and denote this by π_k(x). The first ingredient of our sampler is a collection of PDMPs, one defined for each model. Each PDMP sampler adds a velocity space V_k to the space X_k and samples from the space E_k = X_k × V_k. Finally, for each model M_k, the associated PDMP sampler has an extended infinitesimal generator A_k (Davis, 1993) with invariant distribution proportional to ν_k, with ν_k(x, v) = π_k(x) p_k(v|x) for some set of conditional densities p_k(v|x).
Remark 1. For samplers such as ZigZag, p_k(dv|x) is a measure with support on a discrete set. By choosing V_k to be the support of p_k(dv|x), p_k(dv|x) still has a density, and integrals can be interpreted as sums over the support points. This allows us to treat all samplers within the same framework.
The second ingredient of our reversible jump PDMP sampler is a set of jumps between models. For our case of variable selection, we only allow adding or removing one variable at a time. Hence, let T be a set of pairs of transitions between models, with the pairs ordered (i, j) such that model M_j is obtained from M_i by removing one of the variables in M_i. For transition i → j, we define an active boundary Γ_{i,j} = X_j × V_i, a subspace of E_i. Each trajectory passing through Γ_{i,j} has some probability p_{i,j} of jumping to E_j using a deterministic jump function g_{i,j}. We assume that the jump function does not change the position. So, if a trajectory z_t of our process has a left limit at time t, z_{t−}, in Γ_{i,j}, then with probability p_{i,j}, z_t = g_{i,j}(z_{t−}) ∈ E_j with x_t = x_{t−}. For the transition j → i, we introduce a Poisson rate β_{i,j}(z) and a jump kernel Q_{i,j}(·, z) such that if the trajectory is in E_j, then with rate β_{i,j}(z_{t−}), z_t is drawn from Q_{i,j}(·, z_{t−}) ∈ Γ_{i,j}. We impose symmetry in the jumps between models, so that for any z′ ∼ Q_{i,j}(·, z), g_{i,j}(z′) = z.

Invariant distribution
Let ν = Σ_k ν_k be a measure on ∪_k E_k. The x-marginal distribution of ν is π, and this section provides conditions on Q_{i,j}, p_{i,j} and β_{i,j} that have to be satisfied if ν is to be the invariant distribution of the process. To avoid additional technicalities that complicate the proof, we give results for PDMPs with bounded velocity spaces.
Theorem 1. Assume V_k is bounded for each k, and that we can bound the event rate over any compact region; i.e. if K ⊂ E is compact then max_{z∈K} λ(z) < ∞. Suppose a measure µ is a stationary distribution with a density on each E_i. Then for each function f in a suitable set F:

Σ_k ∫_{E_k} A_k f(z) µ(z) dz + Σ_{(i,j)∈T} ∫_{Γ_{i,j}} (Qf(z) − f(z)) |⟨Φ(z), n_{i,j}⟩| µ(z) dz = 0,   (2)

where Q is the jump kernel at the boundary and n_{i,j} is the normal to Γ_{i,j}.
The set F in this theorem is the domain of the generator of our PDMP, and is described precisely in Davis (1993).
To obtain directly usable conditions on β_{i,j}, p_{i,j} and Q_{i,j}, some additional notation must be introduced. When jumping from z = (x, v) ∈ E_j to z′ = (x, v′) ∈ E_i, the dimension of the velocity vector needs to be increased by one, but the position is unchanged. Hence g_{i,j}^{−1}(z) is a one-dimensional manifold, which can be related to a subset, U say, of R by introducing a function G_{i,j} : U × E_j → E_i such that for any α ∈ U we have G_{i,j}(α, z) ∈ g_{i,j}^{−1}({z}), and for a fixed z ∈ E_j, α → G_{i,j}(α, z) is a one-to-one mapping from U to g_{i,j}^{−1}(z) (similar ideas are seen in reversible jump MCMC; see Green, 1995). For a given z ∈ E_j, it is natural to rewrite the jump kernel Q_{i,j} in terms of α ∈ U, and henceforth we abuse notation and write Q_{i,j}(·, z) as a density on U. That is, we can simulate the transition from E_j to E_i by simulating α ∼ Q_{i,j}(·, z) and then setting z′ = G_{i,j}(α, z).
Finally, let ν_{i,j}(z) = p_{i,j} |⟨Φ(z), n_{i,j}⟩| ν_i(z) be an unnormalised density on Γ_{i,j}, and let ν̃_{i,j} be the pushforward measure on E_j of ν_{i,j} by g_{i,j}: ν̃_{i,j}(B) = ∫_{g_{i,j}^{−1}(B)} ν_{i,j}(z) dz for any measurable B ⊂ E_j.
Informally, this is the measure under ν_{i,j} associated with values of z ∈ Γ_{i,j} that would be mapped by g_{i,j} into the set B.
We are now in a position to give simple conditions for (2) of Theorem 1 to hold. To do this it is helpful to consider separately the case where the space of velocities is continuous and the case where it is discrete.
Theorem 2. Assume the space of velocities is continuous. A sufficient condition for (2) of Theorem 1 to hold for ν is that, for every (i, j) ∈ T, the following condition holds:

β_{i,j}(z) Q_{i,j}(α, z) ν_j(z) = ν_{i,j}(G_{i,j}(α, z)) |J_{G_{i,j}}(α, z)|,   (3)

where J_{G_{i,j}} denotes the Jacobian associated with the transformation G_{i,j}.
Intuitively, this result can be understood as a detailed balance condition: we balance the probability flow for each jump z′ → z from E_i to E_j with that of the reverse jump z → z′.
For discrete velocity spaces the result is slightly simpler, as the Jacobian term is not required.
Theorem 3. Assume the velocity space is discrete. A sufficient condition for (2) of Theorem 1 to hold for ν is that, for every (i, j) ∈ T, the following condition holds:

β_{i,j}(z) Q_{i,j}(α, z) ν_j(z) = ν_{i,j}(G_{i,j}(α, z)).   (4)

Example Samplers

ZigZag Sampler
We first derive the jump rates and transitions for the ZigZag sampler described in Section 2.2. For this sampler the velocity of each active component is either 1 or −1, which simplifies the computation. We choose g_{i,j} to be the projection that sets the disabled variable to 0. Let (i, j) ∈ T be a transition. For any v ∈ V_i we have |⟨v, n_{i,j}⟩| = 1, and therefore on E_i, ν_{i,j}(z) = p_{i,j} ν_i(z). For the jump to E_i we change the velocity for the component of the model that is added, whilst the other components of the velocity are unchanged. Denoting the new velocity of the added component by α, we have from Theorem 3 that α is uniform on {−1, 1} and

β_{i,j}(z) = p_{i,j} π_i(x, 0) / π_j(x),

where (x, 0) denotes the point in X_i whose added component is set to 0. For our variable selection problem, the ratio of posterior densities that appears in β_{i,j} simplifies to the ratio of the priors, as the likelihood terms are common and cancel. If we have independent priors on the parameters for each variable, this term will be a constant, which simplifies the simulation of the events at which we add new variables into our model. This comment also applies to the rates for the Bouncy Particle Sampler, which we derive next.
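To see the sticking mechanism in isolation, consider the one-dimensional target w N(0, 1) + (1 − w) δ_0: a trajectory hitting zero sticks with probability p, and while stuck is revived at the rate β = p w g(0)/(1 − w) that balances the probability flow out of and into the null model, with a uniform ±1 velocity. The sketch below is our own illustration, with made-up values of w and p; in stationarity the fraction of time spent at zero should be close to 1 − w.

```python
import numpy as np

rng = np.random.default_rng(1)

w = 0.5                       # prior inclusion probability (illustrative)
p_stick = 0.6                 # sticking probability when hitting zero
g0 = 1.0 / np.sqrt(2.0 * np.pi)          # N(0, 1) density at 0
beta = p_stick * w * g0 / (1.0 - w)      # revival rate balancing the flows

T, t = 20000.0, 0.0
theta, v, active = 1.0, 1.0, True
time_at_zero = 0.0

while t < T:
    if active:
        a = v * theta
        tau_flip = np.sqrt(max(a, 0.0) ** 2 + 2.0 * rng.exponential()) - a
        tau_hit = -theta / v if a < 0.0 else np.inf   # time until theta = 0
        tau = min(tau_flip, tau_hit, T - t)
        theta += v * tau
        t += tau
        if tau == tau_hit:
            theta = 0.0
            if rng.random() < p_stick:
                active = False        # jump to the null model
            # otherwise pass through zero with the same velocity
        elif tau == tau_flip:
            v = -v                    # ordinary ZigZag flip
    else:
        tau = min(rng.exponential(1.0 / beta), T - t)
        time_at_zero += tau
        t += tau
        if t < T:
            active = True             # re-introduce the variable
            theta, v = 0.0, rng.choice([-1.0, 1.0])

frac_zero = time_at_zero / T
print(frac_zero)                      # should be close to 1 - w = 0.5
```

The empirical fraction of time at zero matching the prior exclusion probability 1 − w provides a quick sanity check that the sticking and revival rates are balanced.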

Bouncy Particle Sampler
We consider two versions of the Bouncy Particle Sampler. The first version has velocities on the unit sphere, so V_k = S^{|γ_k|−1}, the unit sphere in R^{|γ_k|}. Like the ZigZag sampler, the deterministic dynamics are given by a constant velocity model. The event rate for sampling from a density π_k(x) is λ(x, v) = max{0, −⟨v, ∇ log π_k(x)⟩}, with the velocity reflecting in the normal to ∇ log π_k(x) at an event. The Bouncy Particle Sampler also often has refresh events, at which a completely new velocity is sampled.
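The reflection at an event is the standard Bouncy Particle Sampler operator: it flips the component of v along g = ∇ log π_k(x) while preserving the norm of v. A small sketch with illustrative values:

```python
import numpy as np

def bps_reflect(v, g):
    """Reflect velocity v in the hyperplane orthogonal to the gradient g."""
    return v - 2.0 * (v @ g) / (g @ g) * g

v = np.array([0.6, 0.8])      # unit-speed velocity
g = np.array([1.0, 0.0])      # gradient of the log-density at the event
v_new = bps_reflect(v, g)
print(v_new)                  # [-0.6  0.8]: the component along g flips
```

Since the reflection is norm-preserving, a velocity on the unit sphere stays on the unit sphere after each event.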
Extending the Bouncy Particle Sampler to the variable selection problem requires a more careful analysis than for the ZigZag sampler, due to the geometry of the velocity space. For the jump to E_i, we will let α denote the new velocity for the component added to the model. To ensure the new velocity lies on the unit sphere, we re-scale the other components of the velocity: if the old velocity is v, the new velocity is (√(1 − α²) v, α). The following proposition states how to choose the jump rate β_{i,j} and the density for α.
Proposition 1. For the Bouncy Particle Sampler with velocities on the unit sphere, (3) and (4) are satisfied if, for all (i, j) ∈ T with |γ_j| > 0, the jump rate β_{i,j} and the density for α are chosen with normalising constant given by the area of the unit sphere of R^{|γ_i|}; and if, for |γ_j| = 0, where the Bouncy Particle Sampler and ZigZag are equivalent, we use the ZigZag rates.
The second version of the Bouncy Particle Sampler has velocities in R^{|γ_k|}, with their density being standard Gaussian and independent of x. The dynamics are as previously. This sampler does not satisfy our requirement that the velocity space is bounded, though the argument behind Theorem 1 can be extended to this sampler, and our results would apply if we replaced the Gaussian density by a truncated Gaussian (with an arbitrarily large truncation value).
For the jump to E_i, we again let α denote the new velocity for the component added to the model, whilst for this version the other components of the velocity remain unchanged.
Proposition 2. For the Bouncy Particle Sampler with Gaussian velocities, (3) and (4) are satisfied if, for all (i, j) ∈ T:

β_{i,j}(z) = p_{i,j} √(2/π) π_i(x, 0) / π_j(x), with α drawn from the density Q_{i,j}(α, z) = (1/2) |α| e^{−α²/2}.

Simulation Study
In this section we demonstrate the potential advantage of our new samplers compared to alternative approaches for Bayesian variable selection. To compare different samplers we consider the Monte Carlo estimates of the posterior probabilities of inclusion, the posterior means for the regression coefficients, and the posterior means conditioned on the model. Similar comparison methods are used in Zanella and Roberts (2019) for comparing the efficiency of MCMC methods on Bayesian variable selection problems. For a given sampler the statistical efficiency is measured by the mean squared error of the sampler, denoted σ²_sampler, and is calculated using R runs of the sampler as

σ²_sampler = (1/R) Σ_{r=1}^R (q̂_r − q)²,

where q̂_r, for r = 1, ..., R, are the estimates of a quantity of interest from the R runs, and q is either the exact posterior quantity of interest, if available, or otherwise the estimate from an independent long run of an MCMC method. In multiple dimensions the statistical efficiency is measured as the median σ²_sampler over all dimensions. To compare the performance of different samplers we also consider a measure of efficiency relative to a reference sampler. If we denote the reference sampler by ref, then we define the Relative Statistical Efficiency (RSE) and Relative Efficiency (RE) by

RSE = (σ²_ref n_ref) / (σ²_sampler n_sampler),  RE = (σ²_ref t_ref) / (σ²_sampler t_sampler),

where n_sampler and n_ref are the numbers of iterations of the algorithms and t_sampler and t_ref are the computation times of the algorithms. The RSE measures the relative efficiency of the algorithms per iteration, whereas the RE measures the efficiency per second. For interpretation, an RSE or RE value of 2 implies that the sampler is 2 times more efficient than the reference method. The sensitivity of the methods to the choice of reversible jump parameters p_{i,j} and regular PDMP tuning parameters is explored empirically on a simple model in the Supplementary Material. Based on these results, we fixed p_{i,j} = 0.6 for all i and j in all reversible jump PDMP samplers, and fixed the refreshment rate at 0.1 for the reversible jump BPS methods, for the following results.
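These efficiency measures are straightforward to compute. The sketch below uses made-up estimates, and assumes per-iteration and per-second definitions of RSE and RE consistent with the description above:

```python
import numpy as np

def mse(estimates, truth):
    """sigma^2_sampler: mean squared error over R independent runs."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - truth) ** 2)

# Hypothetical estimates of a posterior inclusion probability from R = 4 runs.
q_true = 0.80
q_sampler = [0.79, 0.82, 0.81, 0.78]    # candidate sampler
q_ref = [0.75, 0.86, 0.83, 0.72]        # reference sampler

s2_sampler, s2_ref = mse(q_sampler, q_true), mse(q_ref, q_true)

# Efficiency per iteration (RSE) and per second (RE); a value of 2 means the
# candidate is twice as efficient as the reference.
n_sampler, n_ref = 10_000, 10_000       # iterations used by each sampler
t_sampler, t_ref = 4.0, 8.0             # wall-clock seconds for each sampler
rse = (s2_ref * n_ref) / (s2_sampler * n_sampler)
re = (s2_ref * t_ref) / (s2_sampler * t_sampler)
print(rse, re)
```

With these numbers the candidate has much smaller spread around the truth, so both ratios come out well above 1, and RE is twice RSE because the candidate also runs in half the time.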

Logistic regression
First we compare PDMP-based samplers with a collapsed Gibbs sampler and a reversible jump sampler on a logistic regression problem. The Gibbs sampler is based on the Polya-Gamma sampler of Polson et al. (2013); details of this sampler are given in the Supplementary Material. We implemented reversible jump MCMC using the NIMBLE software package (de Valpine et al., 2017, 2020), using independent Normal proposals for selected variables.
The logistic regression model has a p-dimensional regression parameter θ ∈ R^p, and a binary response y_i ∈ {0, 1} with P(y_i = 1) = exp(x_i^T θ)/(1 + exp(x_i^T θ)), where x_i is the p-vector of covariates for observation i. In our simulation study, each vector x_i is simulated from a multivariate Normal with mean zero and p × p covariance matrix Σ. We use a prior that is independent for each θ_j, with θ_j ∼ (10/p) N(0, 10) + (1 − 10/p) δ_0, which corresponds to a prior that favours models with 10 selected variables. Data were generated using this model and the following choices for θ and covariance matrix Σ:

1. A pair of correlated covariates, one of which is in the model: θ = (1, 0, ..., 0)^T with Σ_{1,2} = Σ_{2,1} = 0.9, Σ_{i,i} = 1 and Σ_{i,j} = 0 otherwise.
2. Structured correlation between all covariates, with six active covariates.

3. No correlation between covariates, with six active covariates.

These simulation scenarios are analogous to others previously considered in the literature for linear regression. Scenarios 1 and 3 are similar to those considered by Wang et al. (2011) and Zanella and Roberts (2019), while Scenario 2 is similar to one considered by Yang et al. (2016). We present results for both ZigZag and the Bouncy Particle Sampler with Gaussian distributed velocities (almost identical results are obtained for the Bouncy Particle Sampler where the velocities are on the unit sphere). See Section D of the Appendix for more details.
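Scenario 1 can be generated in a few lines; the sketch below is our own illustration of the stated design, with n = 200 and p = 15 chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 15                       # sizes chosen for illustration only
theta = np.zeros(p)
theta[0] = 1.0                       # Scenario 1: theta = (1, 0, ..., 0)

# Covariance: unit variances, correlation 0.9 between covariates 1 and 2.
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.9

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
prob = 1.0 / (1.0 + np.exp(-X @ theta))      # logistic link
y = (rng.random(n) < prob).astype(int)
print(X.shape, y.mean())             # binary responses, roughly balanced
```

The correlated but inactive second covariate is what makes this scenario a useful stress test: samplers must distinguish the truly active variable from its correlated proxy.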
The PDMP methods are competitive with Gibbs sampling at low sample sizes and offer substantial efficiency gains for larger sample sizes. Smaller gains can also be seen when the dimension is increased with fixed sample size. Both BPS and ZigZag methods offer similar relative efficiencies across the experiments. The greatest efficiency gains for the reversible jump PDMP methods were seen in Scenario 1, with Scenarios 2 and 3 offering lower gains in performance. This may be because smaller models are more likely in Scenario 1, so the PDMP methods require less computational effort for gradient calculations and simulation of event times. For all experiments, the Gibbs sampler offers a computational advantage over the reversible jump methods in relative efficiency for marginal posterior probabilities of inclusion. This increased performance is unsurprising, as the model update steps in the Gibbs sampler have marginalised over the parameter values θ, yielding efficient moves through the model space. In more general settings, where it is not possible to integrate over θ, the mixing will be substantially poorer.
The use of subsampling methods for Bayesian variable selection problems is a recently emerging area (Song et al., 2020; Buchholz et al., 2019), though it remains under-studied. One of the attractions of PDMP samplers is that they can be implemented in a way where they only access a small subset of data at each iteration, whilst still targeting the true posterior. We now investigate how these ideas work in the variable selection problem by comparing the efficiency of three implementations of ZigZag with that of the Gibbs sampler, and see how this depends on the number of observations. These are ZigZag using the full data, ZigZag using subsampling with a global bound, and ZigZag with subsampling control variates (see Bierkens et al., 2019, for details of both subsampling approaches).
Standard application of control variates requires calculating the gradient at a reference point using the full likelihood. Due to the trans-dimensional nature of variable selection problems, a full gradient calculation is not well defined. For this reason we choose to make use of control variates defined for a fixed model M, where the gradient is well defined. These control variates are only used when the sampler is in this model. Extensions where multiple control variates are used for models with high probability would be straightforward.
Results are shown in Figure 2. Despite our simplistic implementation, these results indicate that ZigZag with control variates becomes increasingly efficient relative to ZigZag using the full dataset as the number of samples increases. Furthermore, we see evidence of super-efficiency: whilst the computational cost per effective sample of the Gibbs sampler is expected to be linear in the number of observations, the relative efficiency plots suggest that this is sub-linear for ZigZag with control variates.
Figure 2: Log-log plots of efficiency, relative to the Gibbs sampler, of different samplers as we vary the number of observations. Plotted are the relative efficiencies for the posterior mean conditional on model M*, where M* corresponds to the true data-generating model. The dataset was generated with a 15-dimensional regression parameter θ = (1, 1, 0, 0, ..., 0). The methods run are ZigZag applied to the full dataset (zz, black), ZigZag with subsampling using global bounds (ss, blue), ZigZag with control variates (cv, magenta) and Gibbs sampling (Gibbs, green). All methods were initialised at the location of the control variate. Methods were given the same computational budget; for details see Section D of the Appendix.

Robust regression
As mentioned in the introduction, a common approach to Bayesian variable selection is to use continuous spike-and-slab priors for each parameter, rather than trying to sample from the joint posterior of model and parameters. Such an approach is attractive as it enables standard gradient-based samplers, such as Hamiltonian Monte Carlo, to be used. We now compare such an approach, implemented with the popular Stan software (Carpenter et al., 2017), to our PDMP samplers. Our aim is both to investigate the computational efficiencies of the two approaches and to show the differences in the posteriors obtained from these different types of prior. Our comparison is based on a robust linear regression model: in particular, we model the errors in our linear regression model as a mixture of Normals with different variances. The continuous variable selection prior we consider is the regularised horseshoe (Piironen and Vehtari, 2017a,b),

θ_j | λ_j, τ, c ∼ N(0, τ² λ̃_j²),  λ̃_j² = c² λ_j² / (c² + τ² λ_j²),  λ_j ∼ C⁺(0, 1),  j = 1, ..., p,

where C⁺(0, 1) denotes the half-Cauchy distribution for the standard deviation λ_j. The regularised horseshoe is a variation of the horseshoe prior (Carvalho et al., 2010) that offers a continuous approximation of a Dirac spike and slab, where the slab is a Normal distribution with finite variance c². The hyper-parameter τ controls the global shrinkage of the variables towards zero. In Carvalho et al. (2010) it was shown that for standard linear regression the optimal choice for a fixed value of τ is τ_0 = σ p_0 / ((p − p_0) √n), where p_0 is the number of nonzero variables in the sparse model and σ is the noise standard deviation. In line with this, we compare the regularised horseshoe prior with fixed hyper-parameter τ_0 = p_0 / ((p − p_0) √n) against a Dirac spike-and-slab prior, for j = 1, ..., p.
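The regularised-horseshoe quantities are easy to compute; the sketch below (our own illustration, using the τ_0 formula above and the dimensions of the later robust regression example) evaluates τ_0 and draws a parameter vector from the prior:

```python
import numpy as np

rng = np.random.default_rng(3)

def tau0(p0, p, n, sigma=1.0):
    """Suggested global shrinkage: tau_0 = sigma * p0 / ((p - p0) * sqrt(n))."""
    return sigma * p0 / ((p - p0) * np.sqrt(n))

p, p0, n, c = 200, 4, 100, 2.0       # p, p0, n from the example; c illustrative
tau = tau0(p0, p, n)

lam = np.abs(rng.standard_cauchy(p))                 # lambda_j ~ C+(0, 1)
lam_tilde2 = c**2 * lam**2 / (c**2 + tau**2 * lam**2)
theta = rng.normal(0.0, tau * np.sqrt(lam_tilde2))   # regularised horseshoe draw

print(round(tau, 5))                 # 4 / (196 * 10) ≈ 0.00204
```

Note how λ̃_j² is capped at c², so even coordinates with huge local scales λ_j receive a slab of finite variance c² τ², which is the "regularised" part of the prior.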
MCMC for the model using a horseshoe prior was performed using Stan's implementation of NUTS (Hoffman and Gelman, 2014). We first compare the variable selection dynamics for a simple model with p = 4 variables, n = 120 observations and regression parameter θ = (0.5, 0.5, 0, 0)^T. The covariate values and residuals were generated as independent draws from a standard Normal, and the prior expected model size is set to p_0 = 1. Example output for the PDMP samplers and the Stan implementation is shown in Figure 3. The posteriors show the horseshoe prior replicating the effect of the spike-and-slab by shrinking the coefficients towards zero, but it is not able to give exact zeros.
We now compare the reversible jump PDMP methods in terms of their sampling efficiency on a higher-dimensional problem. The dataset is generated with p = 200 variables and n = 100 observations, with regression parameter θ = (2, 2, 2, 2, 0, 0, ..., 0)^T. The covariates for each observation were drawn from an AR(1) process with lag-1 correlation of 0.5. The residuals were generated from a standard Cauchy distribution.
We ran Stan with the default settings for a burn-in of 1000 iterations and then for 16, 32, 64, ..., 2048 iterations. We ran the reversible jump PDMP samplers for the same wall-clock time for both burn-in and subsequent iterations. The experiment was repeated 35 times, and the resulting boxplots of the posterior mean of θ_1 are shown in Figure 4. For the same computational budget, the reversible jump PDMP methods are able to attain better performance. However, it is also apparent from this simulation that the BPS reversible jump methods are more sensitive to local modes. It is unclear whether this is an artifact of the sampler or due to the choice of refreshment rate (0.6). This sensitivity is seen to diminish as the sampler is run for longer.
The predictive ability of the methods is compared in Figure 5. Here an additional n = 100 observations were drawn from the same model and used as a hold-out dataset to validate posterior predictive ability. The predictive ability is measured by the Monte Carlo estimate of the mean square prediction error, (1/100) Σ_{i=1}^{100} (y_i − x_i^T θ)², where θ is replaced by the samples generated by Stan or by samples obtained from a discretisation of time for the reversible jump PDMP samplers. The reversible jump PDMP samplers all give the same predictive performance for large iteration numbers, while Stan, which uses a horseshoe prior, performs slightly worse.

Discussion
We have shown how PDMP samplers can be extended so that they can sample from the joint posterior over model and parameters in variable selection problems. There are a number of open challenges that stem from this work. First, as with any MCMC algorithm, the reversible jump PDMP samplers have tuning parameters. Our simulation results were based on choosing these after empirically evaluating the performance of the samplers on one simple problem. Whilst the samplers mixed well, it is likely that better mixing could be achieved with more informed choices of tuning parameters, and theory for guiding such choices is needed.
Second, the form of our reversible jump PDMP samplers is based on particular features of the variable selection problem. In other model choice settings, different trans-dimensional moves may be needed. The theory we have developed should be adaptable to give rules for choosing the rates of such moves. Also, our trans-dimensional moves are reversible, in that they balance the probability flow from model i to model j with the flow of probability from model j to model i; it would be interesting to see if non-reversible trans-dimensional moves could be constructed.
Third, it is likely that the reversible jump PDMP samplers will still struggle in situations where the posterior is multi-modal with well-separated modes. For such cases it would be interesting to try to incorporate ideas such as tempering (Marinari and Parisi, 1992) to allow for better mixing across modes. Also, given the close links between PDMP samplers and Hamiltonian Monte Carlo, it would be interesting to see if the ideas we have presented for trans-dimensional jumps could be adapted for use with Hamiltonian Monte Carlo algorithms.

A Proofs
A.1 Proof of Theorem 1

Recall theorems 34.15 and 34.19 from Davis (1993):

Theorem 4 (34.15). Suppose that µ is a stationary distribution for (x_t). Then there exists a finite measure σ on Γ (the active boundary) such that for all t > 0:

E_µ [∫_0^t f(x_{s−}) dp*_s] = t ∫_Γ f dσ,

where p* is the counting process that counts the number of jumps from the boundary.

Theorem 5 (34.19). Suppose µ is a stationary distribution for (x_t), and let σ be the corresponding boundary measure defined above. Then:

∫_E Af dµ + ∫_Γ Cf dσ = 0,

where A is the extended generator and Cf = Qf − f.
Remark 2. The construction of PDMPs with an active boundary that we presented differs slightly from the one found in Davis (1993, p. 57). However, our process still fits the definition of Davis (1993): one simply needs to separate each X_k in two at the boundary and consider the two parts as two separate spaces.
From the expression for the extended generator given in (26.14) of Davis (1993), it follows that the generator term corresponds to the first term of Theorem 5. It then remains to show that the boundary term corresponds to the boundary term of Theorem 5. Following the notation of Davis (1993), for a fixed (i, j) in T, the relevant operator is Cf = Qf − f, where Q is the jump kernel at the boundary. Finally, we need to find an expression for the measure σ. We start from the definition in Theorem 4, where p*_s is the counting process that counts the number of times we hit Γ_{i,j}. This holds for all t, and we will derive σ by considering the value of the left-hand side as t → 0. Intuitively, the idea is that, in this limit, the impact of the events will be negligible, and we can evaluate the left-hand side by considering a process where z_0 is simulated from µ but is then purely deterministic, with dynamics given by the deterministic part of the PDMP.
As indicator functions of compact sets separate measures on $\Gamma_{i,j}$, we can restrict ourselves to these functions. Let $K$ be a compact subset of $\Gamma_{i,j}$ and $f = 1_K$ its characteristic function. We also introduce some additional notation. Let $\bar z_t$ be a deterministic process that follows the dynamics of the deterministic part of the PDMP, and let $\bar p^*_t$ be the associated process that counts the number of times $\bar z_t$ hits $\Gamma_{i,j}$. For our PDMP let $\Omega_0(t)$ be the set of paths for which no events occur by time $t$, and $\Omega_1(t)$ the set of paths with at least one event prior to time $t$. We can then split the integral in two:
$$E_\mu\left[\int_0^t 1_K(z_{s-})\,dp^*_s\right] = E_\mu\left[\int_0^t 1_K(z_{s-})1_{\Omega_0(t)}\,dp^*_s\right] + E_\mu\left[\int_0^t 1_K(z_{s-})1_{\Omega_1(t)}\,dp^*_s\right],$$
with the first term being the contribution from paths with no event and the second the contribution of paths with at least one event.
We now introduce two sets. Let $K_t = \{z \in E \mid z_s \in K \text{ with } z_0 = z \text{ and } s \in [0, t]\}$, and $\bar K_t = \{z \in E_i \mid \bar z_s \in K \text{ with } \bar z_0 = z \text{ and } s \in [0, t]\}$. The first is the set of all possible starting points of our PDMP from which it is possible to hit $K$ by time $t$; the latter is the same set but for the purely deterministic process (equivalently, all PDMP paths with no events by time $t$). As $K$ is compact, so is $\bar K_t$. Furthermore, as the velocity space is bounded and $K$ is compact, so is $K_t$. This, together with our assumption on the event rate, means that we can upper bound the event rate for both $z \in K_t$ and $z \in \bar K_t$; denote such an upper bound by $\lambda^+$. Now $E_\mu\left[\int_0^t 1_K(z_{s-})1_{\Omega_1(t)}\,dp^*_s\right]$ can be bounded, since for the integral to be non-zero we need $z_0 \in K_t$, and the number of times the PDMP hits $\Gamma_{i,j}$ is at most twice the number of events. Thus an upper bound is $2\lambda^+ t \Pr_\mu(z_0 \in K_t)$. As the volume of $K_t$ tends to 0 as $t \to 0$, this upper bound is $o(t)$.
To bound the contribution from the other integral, notice that we can bound it from above by the corresponding quantity for the PDMP process without events, and from below by using our bound on the event rate. Thus, for any compact subset $K$ of $\Gamma_{i,j}$, we have
$$\sigma(K) = \lim_{t \to 0} \frac{1}{t}\, E_\mu\left[\int_0^t 1_K(\bar z_{s-})\,d\bar p^*_s\right].$$
The final step is to obtain $\lim_{t \to 0} E_\mu\left[\int_0^t 1_K(\bar z_{s-})\,d\bar p^*_s\right]$, which comes from the following result for solutions of ordinary differential equations:

Theorem 6. Let $E$ be a subset of $\mathbb{R}^d$, $\mu$ a measure with a density with respect to the Lebesgue measure on $E$, and let $X$ be the vector field associated with the ordinary differential equation $\dot z = X(z)$. Let $H$ be a hyperplane of $E$ with normal $n_H$, and let $t^*(z)$ be the first hitting time of $H$ for a trajectory starting from $z$. Assuming there exists a measure $\sigma$ on $H$ satisfying the above limit for all $f : H \to \mathbb{R}$, then $\sigma$ has a density with respect to the Lebesgue measure of $H$. For completeness, this result is proven in Appendix A.6. The intuition behind the $|\langle X(z), n_H \rangle|$ factor is that the 'area' of space that hits $z$ within an infinitesimal time interval $dt$ is $|\langle X(z), n_H \rangle|\,dt$.
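To make the conclusion of Theorem 6 concrete, one consistent reading of the density formula (our rendering, matching the $|\langle X(z), n_H \rangle|$ intuition above; $p$ denotes the density of $\mu$) is:

```latex
\[
  \sigma(\mathrm{d}z) \;=\; \bigl|\langle X(z),\, n_H \rangle\bigr|\; p(z)\,\mathrm{d}z_H,
  \qquad z \in H,
\]
```

where $\mathrm{d}z_H$ denotes Lebesgue measure on the hyperplane $H$.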
Applying this theorem to the deterministic part of our PDMP gives the stated expression, where $n_{i,j}$ is the normal to the boundary $\Gamma_{i,j}$, which concludes the proof.

A.2 Proof of Theorem 2
Proof. The core of the proof is rewriting (2) of Theorem 1 in a more suitable way. First, notice that one term can be rewritten by definition of the pushforward measure $\nu_{i,j}$. Second, since $Q_{i,j}$ integrates to 1, the corresponding integral simplifies. Third, we use the definition of $G_{i,j}$. Finally, using the change of variables from $z' = (x', v')$ to $(\alpha, z)$, defined by $z' = G_{i,j}(\alpha, z)$, together with the definition of $\nu_{i,j}$, (2) of Theorem 1, evaluated for $\pi = \nu$, can be restated accordingly. By construction this holds for any function $f \in F$, and Conditions 3 and 4 of the theorem imply the two resulting identities respectively, which concludes the proof.

A.3 Proof of Theorem 3
Proof. The proof is largely the same as the proof of Theorem 2. The only change is how to treat the change of variables $z' = G_{i,j}(\alpha, z)$. First, notice that $G_{i,j}$ leaves the position invariant since $g_{i,j}$ leaves the position invariant. Then, since the velocity space is discrete and $\alpha \mapsto G_{i,j}(\alpha, z)$ is a one-to-one mapping for fixed $z \in E_j$, the sum over velocities can be written in integral form.

A.4 Proof of Proposition 1

Proof. Here $A_{\text{sphere}}(|\gamma_i|)$ denotes the area of the unit sphere of $\mathbb{R}^{|\gamma_i|}$.
In this case we can define $G_{i,j}$ explicitly, where the last inequality uses the symmetry of the integral with respect to $\alpha$, and that $\langle v, n_{i,j} \rangle = 0$ by definition of velocities in $E_j$.
The determinant of $G_{i,j}$ must be computed carefully, since the velocities live on the sphere and not in the full vector space. Hence for $|\gamma_j| > 0$ the rates involve the ratio $2A_{\text{sphere}}(|\gamma_j|)/A_{\text{sphere}}(|\gamma_i|)$. For $|\gamma_j| = 0$, the BPS and ZigZag are equivalent, so we use the ZigZag rates.

A.5 Proof of Proposition 2
Proof. Here the computation is direct: for $z \in E_i$ the required identity holds, and furthermore $|J_{G_{i,j}}| = 1$, which gives the result.

A.6 Proof of Theorem 6

Proof. Let $Z(t, z)$ be the flow associated with the ODE. Fix $z_0$ in $H$ such that $a = \langle X(z_0), n_H \rangle \neq 0$. In order to show that $\sigma$ has the required form, it is enough to prove the limit for indicator functions of balls:
$$\lim_{r \to 0}\, \frac{\sigma\left(B(z_0, r) \cap H\right)}{\mathrm{Volume}_H\left(B(z_0, r) \cap H\right)} = |a|\, p(z_0),$$
where $B(z_0, r)$ is the ball of radius $r$ centred at $z_0$, $H \cap B(z_0, r)$ is the slice of this ball that lies on $H$, $\mathrm{Volume}_H$ is the Lebesgue measure of the hyperplane $H$, and $p$ is the density of $\mu$.
For $r, t > 0$, let $E_{r,t}$ be the set of starting points whose trajectory hits $B(z_0, r) \cap H$ within time $t$. We can re-write the definition of $\sigma$ in terms of $E_{r,t}$, from which we are reduced to showing the corresponding limit for $\mu(E_{r,t})$. The idea of the proof is that to calculate $\mu(E_{r,t})$ we need to find the volume of $E_{r,t}$ as $r, t \to 0$, and we will do this by bounding $E_{r,t}$ from above and below by cylinders. We can approximate these cylinders by the set $\bar E_{r,t}$ that we would obtain if $X(z) = X(z_0)$ (i.e. if the vector field were constant). The following lemma enables us to quantify the order of error of such an approximation.
We clearly have $C_{z_0,r} < \infty$ and $C_{z_0,r} \to 0$ as $r \to 0$. We set $r_0 = 2\|z - z_0\|$; applying the mean value theorem to the function $e_z(s) = e(s, z)$, whose derivative is $X(Z(s, z)) - X(z_0)$, together with the fact that $e(0, z) = 0$, implies $\|e(s, z)\| \leq s\, C_{z_0, 2\|z - z_0\|}$, which concludes the proof.
The following two lemmas then give, respectively, the cylinders that lower and upper bound $E_{r,t}$.
Lemma 2. For $z \in B(z_0, r) \cap H$ and $s \in [-t, 0]$, there exists a constant $C > 0$ and a function $o(r) \geq 0$ with $\lim_{r \to 0} o(r)/r = 0$ such that the stated inclusion holds. The addition on the right-hand side of this statement should be interpreted as addition of sets in a product space. That is, the set on the right-hand side is the cylinder of points that can be written as $z + s n_H$, where $z \in B(z_0, r + tC) \cap H$ and $s \in [-at - o(r)t, 0]$.
Proof. This follows from Lemma 1.

B Additional simulation results

We compare the methods based on the statistical efficiency of the estimates of the marginal means and probabilities of inclusion. The PDMP methods appear to perform best at estimating the marginal means when they are able to spend a reasonable amount of time exploring both the model space and the parameter space. This can be seen in Figure 6, where optimal statistical efficiency for the marginal means occurs when $p_{i,j}$ is in the range 0.2 to 0.7 across all samplers. Figure 6 also shows that higher values of the reversible jump parameter $p_{i,j}$ give better estimates of the probabilities of inclusion. Having a high refreshment rate appears to impact the BPS with Normal velocity distribution less than the BPS with velocities distributed uniformly on the hyper-sphere. This is seen as the magenta curve (refreshment = 10) is lower for BPS N than for BPS S in Figure 6 across all reversible jump parameters. All BPS methods appear to out-perform Zig-Zag in terms of marginal mean estimation when the refreshment rate is low.

C General implementation details
Inference in Bayesian model selection relies on expectations with respect to a posterior target distribution $\pi(\theta, \gamma)$. The parameters are $\theta$, while $\gamma$ is a vector which indexes the model, with elements $\gamma_j = 1$ if the $j$th variable is included and $\gamma_j = 0$ otherwise. The posterior has the form $\pi(\theta, \gamma) \propto L(y_{1:n}|\theta, \gamma)\pi_0(\theta|\gamma)\pi_0(\gamma)$, where $L(y_{1:n}|\theta, \gamma)$ defines a likelihood function for observations $y_{1:n}$, and $\pi_0(\theta|\gamma)$ and $\pi_0(\gamma)$ denote the prior distributions for $\theta$ and $\gamma$. We abuse notation by writing $\theta_\gamma$ for the subvector of $\theta$ containing only the elements where $\gamma_j = 1$. Moreover, we write $\pi(\theta_\gamma)$ for $\pi(\theta \mid \gamma)$, where $\pi(\theta \mid \gamma) = 0$ whenever $|\theta_j| > 0$ with corresponding $\gamma_j = 0$. When simulating from the reversible jump PDMP sampler there are two types of events: standard events for the PDMP sampler within a model $\gamma$, and model jump events. The standard PDMP events are taken with respect to $\pi(\theta|\gamma)$, so the rates to sample are the usual Bouncy Particle Sampler or Zig-Zag rates on the conditional distribution. In practice, to simulate these events we first bound the rates by a simple function, which for the examples we consider will be linear in time. We then simulate events from a Poisson process with this linear-in-time rate, which can be done exactly, and use thinning to generate the actual events in the PDMP. Derivations of the linear-in-time bounds on the rates that we use are now given, before we give the rates for jumps between models.
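As a concrete illustration of this simulate-from-a-linear-bound-then-thin scheme, here is a minimal Python sketch. The function names and the convention of restarting the linear bound at the current time are our own assumptions, not code from the paper.

```python
import math
import random

def first_event_time(a, b, rng=random):
    """Exact first-event time of a Poisson process with rate a + b*t
    (a, b >= 0), found by inverting the integrated rate
    Lambda(t) = a*t + b*t**2/2 at an Exp(1) target."""
    target = -math.log(rng.random())
    if b == 0.0:
        return math.inf if a == 0.0 else target / a
    return (-a + math.sqrt(a * a + 2.0 * b * target)) / b

def next_pdmp_event(rate, a, b, rng=random, max_iter=10_000):
    """Thinning: propose candidate times from the linear upper bound
    a + b*t and accept candidate t with probability rate(t) / (a + b*t).
    `rate` is the true event rate along the current trajectory and must
    satisfy rate(t) <= a + b*t for all t >= 0."""
    t = 0.0
    for _ in range(max_iter):
        t += first_event_time(a + b * t, b, rng)  # bound restarted at time t
        if rng.random() * (a + b * t) <= rate(t):
            return t
    return math.inf
```

For a constant true rate with bound $a = \lambda$, $b = 0$, every candidate is accepted and the event times are exactly exponential.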

C.1 Rates for logistic and robust regression
Here we give details on simulating the event rates for the logistic regression example. Taking a simple independent Gaussian prior, the reintroduction rate simplifies, where $\gamma$ and $\gamma'$ are defined as above. The standard PDMP rates are used for $\pi(\theta|\gamma) = \pi(\theta_\gamma)$, so to simplify notation we will assume a fixed dimension and write $\theta$, dropping the indexing by $\gamma$. The log posterior for both logistic and robust regression can be written in terms of functions $g(e_i)$ of linear combinations of the data. For logistic regression $e_i = -x_i^T\theta$ and $g(e_i) = -\log\left(\frac{\exp(y_i e_i)}{1+\exp(e_i)}\right)$, and for robust regression $e_i = y_i - x_i^T\theta$ and $g(e_i) = -\log\left(\exp(-\tfrac{1}{2}e_i^2) + \tfrac{1}{10}\exp(-\tfrac{1}{200}e_i^2)\right)$. We consider bounding the event rates for the Bouncy Particle Sampler and ZigZag below.
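The $1/4$ curvature bound for logistic regression used in the rate bounds can be checked numerically. This sketch uses one common parametrization of the logistic term, $g(e) = \log(1 + e^e) - y\,e$ (a sign convention we adopt for illustration; the curvature is independent of $y$, so the bound is unaffected):

```python
import math

def logistic_g2(e):
    """Second derivative of g(e) = log(1 + exp(e)) - y*e for logistic
    regression; equals s*(1 - s) with s = sigmoid(e), independent of y."""
    s = 1.0 / (1.0 + math.exp(-e))
    return s * (1.0 - s)

# The curvature is maximised at e = 0, where it equals exactly 1/4.
grid = [i * 0.01 - 5.0 for i in range(1001)]
max_curvature = max(logistic_g2(e) for e in grid)
```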
Bouncy Particle Sampler: Let $f(\theta + tv) = -\nabla_\theta \log \pi(\theta + tv)$; the event rate depends on the quantity $\langle v, f(\theta + tv) \rangle$, where $g'(e_i(t))$ is the derivative of $g$ evaluated at $e_i(t) = -x_i^T(\theta + tv)$ for logistic regression and $e_i(t) = y_i - x_i^T(\theta + tv)$ for robust regression. If the derivative in $t$ of this quantity can be bounded by a constant, we can simulate using linear rates.
For logistic regression $g''(e_i) \leq \tfrac{1}{4}$ and for robust regression $g''(e_i) < 1$. The Bouncy Particle Sampler rate is therefore bounded by a linear-in-time rate, with a constant $c$ chosen according to the application. Inversion methods for thinning a Poisson process can be used to simulate the events (Bierkens et al., 2018).
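Putting the pieces together for logistic regression, the following sketch builds the BPS event rate along a trajectory together with a dominating linear bound $a + bt$ obtained from the curvature bound $g'' \leq 1/4$. The $N(0, \sigma^2 I)$ prior, the function names, and the omission of any extra slack constant are our assumptions for illustration:

```python
import math

def dot(u, w):
    return sum(p * q for p, q in zip(u, w))

def bps_rate_and_bound(theta, v, X, y, sigma2=10.0):
    """BPS event rate t -> max(0, <v, grad U(theta + t*v)>) for Bayesian
    logistic regression with an N(0, sigma2*I) prior, plus a dominating
    linear bound a + b*t.  The slope b uses the curvature bound
    g'' <= 1/4:  v'Hv <= 0.25 * sum_i (x_i'v)^2 + ||v||^2 / sigma2."""
    def grad_U(th):
        g = [tj / sigma2 for tj in th]                 # prior gradient
        for xi, yi in zip(X, y):
            s = 1.0 / (1.0 + math.exp(-dot(xi, th)))   # sigmoid(x_i' theta)
            for j in range(len(g)):
                g[j] += (s - yi) * xi[j]
        return g
    def rate(t):
        th = [p + t * q for p, q in zip(theta, v)]
        return max(0.0, dot(v, grad_U(th)))
    a = rate(0.0)
    b = 0.25 * sum(dot(xi, v) ** 2 for xi in X) + dot(v, v) / sigma2
    return rate, a, b
```

The returned pair $(a, b)$ can be fed directly into a linear-rate thinning loop.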
ZigZag: Let $f_i(\theta + tv) = -\frac{d}{d\theta_i} \log \pi(\theta + tv)$; the event rate for the $i$th component depends on the quantity $v_i f_i(\theta + tv)$, where $g'(e_i(t))$ is defined as in the BPS rate. Bounding the derivative in $t$ of this quantity by a constant, as before, the ZigZag rate is bounded by a linear rate of the form $\max(0, \cdot)$, with a constant $c$ chosen according to the application.

C.2 Rates of jumps between models
Model jump events occur when a parameter $\theta_i$ hits the hyper-plane $\theta_i = 0$; with probability $p_{\gamma \to \gamma'}$ we jump to the model $\gamma'$ where $\gamma'_i = 0$. The other type of model jump event occurs when a variable is reintroduced. For each of the deactivated variables ($\gamma_i = 0$), we simulate a time at which to reintroduce the variable (switching to $\gamma'$ where $\gamma'_{-i} = \gamma_{-i}$ and $\gamma'_i = 1$). The rate to reintroduce the variable in the Bayesian inference problem is
$$\beta_{\gamma \to \gamma'} = p_{\gamma' \to \gamma} \frac{L(y_{1:n}|\theta_{\gamma'})\pi_0(\theta_{\gamma'})\pi_0(\gamma')}{L(y_{1:n}|\theta_{\gamma})\pi_0(\theta_{\gamma})\pi_0(\gamma)},$$
where computational savings are often possible since the reintroduced variable is zero, $\theta_i = 0$, and it is often the case that $L(y_{1:n}|\theta_{\gamma'}) = L(y_{1:n}|\theta_{\gamma})$. In these cases the rate to reintroduce a variable will only depend on the choice of prior. Both examples we consider use a Gaussian spike-and-slab prior with a fixed prior inclusion probability $w$. The rate to reintroduce the $i$th variable, jumping from model $\gamma$ to $\gamma'$ where $\gamma'_{-i} = \gamma_{-i}$ with $\gamma'_i = 1$ and $\gamma_i = 0$, is given by $\beta_{\gamma \to \gamma'}$, and the ratio simplifies.
In our examples the prior is independent across components, and this ratio simplifies to a constant. As the prior mean is 0 and, denoting the prior variance of $\theta_i$ for any active covariate $i$ by $\sigma^2$, the ratio of priors at $\theta_i = 0$ is $\frac{w}{1-w}\frac{1}{\sqrt{2\pi\sigma^2}}$.
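Under this independent Gaussian spike-and-slab prior the reintroduction rate is a constant that can be precomputed once. A minimal sketch; the $w/(1-w)$ prior-odds factor and the function name are our assumptions about the prior's exact form:

```python
import math

def reintroduction_rate(p_jump, w, sigma2):
    """Constant rate to reintroduce a variable when the likelihood is
    unchanged at theta_i = 0: the posterior ratio reduces to the prior
    ratio (w / (1 - w)) * N(0; 0, sigma2), scaled by the model-jump
    probability p_jump."""
    slab_at_zero = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return p_jump * (w / (1.0 - w)) * slab_at_zero
```

Note that a larger slab variance `sigma2` lowers the slab density at zero, and hence the reintroduction rate.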

C.3 Pólya-Gamma Gibbs sampling for logistic regression
The Pólya-Gamma Gibbs sampling approach is an auxiliary variable approach for Bayesian logistic regression. A Pólya-Gamma random variable $\omega \sim PG(b, 0)$, $b > 0$, with probability density $p(\omega)$, has the property (Polson et al., 2013) that for any $\psi \in \mathbb{R}$ and $a \in \mathbb{R}$,
$$\frac{(e^{\psi})^a}{(1+e^{\psi})^b} = 2^{-b} e^{\kappa\psi} \int_0^\infty e^{-\omega\psi^2/2}\, p(\omega)\,d\omega, \qquad \kappa = a - b/2.$$
Thus the implied conditional distribution for $\psi$, given the auxiliary variable $\omega$, is Gaussian. The advantage of this approach is that when updating the model $\gamma$ we can integrate over the parameters $\theta$, yielding much more efficient moves. The updates for the collapsed Gibbs sampling procedure are: (1) sample $\gamma \sim \gamma \mid \omega$; (2) sample $\theta \sim \theta \mid \omega, \gamma$; (3) sample $\omega \sim \omega \mid \theta, \gamma$.
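A minimal sketch of steps (2)-(3) for a single coefficient. The truncated-sum draw of $PG(1, z)$ below is a crude stand-in for the exact alternating-series sampler of Polson et al. (2013), and the one-dimensional conditional update is our simplification of the general multivariate Gaussian draw:

```python
import math
import random

def sample_pg(z, rng=random, K=200):
    """Approximate draw from PG(1, z) via a truncation of the infinite-sum
    representation: (1 / (2*pi^2)) * sum_k g_k / ((k - 1/2)^2 + z^2/(4*pi^2))
    with g_k ~ Gamma(1, 1) = Exp(1).  A sketch, not a production sampler."""
    total = 0.0
    for k in range(1, K + 1):
        g = rng.expovariate(1.0)
        total += g / ((k - 0.5) ** 2 + (z / (2.0 * math.pi)) ** 2)
    return total / (2.0 * math.pi ** 2)

def gibbs_theta_update(x, y, omega, sigma2, rng=random):
    """Conditional draw of a single coefficient theta given PG variables:
    theta | omega, y ~ N(m, v) with v = 1 / (sum_i omega_i x_i^2 + 1/sigma2)
    and m = v * sum_i (y_i - 1/2) x_i  (kappa_i = y_i - 1/2)."""
    v = 1.0 / (sum(w * xi * xi for w, xi in zip(omega, x)) + 1.0 / sigma2)
    m = v * sum((yi - 0.5) * xi for xi, yi in zip(x, y))
    return rng.gauss(m, math.sqrt(v))
```

As a sanity check, $E[PG(1, 0)] = 1/4$, which the truncated sampler reproduces to within Monte Carlo error.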
D Computation of relative efficiencies in Section 5.1

In order to compute the relative efficiency we need an estimate of the statistical efficiency (11). We estimate this quantity using a reference estimate $\hat q$ from an independent 6-hour run of the Gibbs sampling method for each combination of $n$, $p$ and Scenario in Tables 1-3 and the results of Figure 2. For the results in Tables 1-3 the quantities of interest are the posterior marginal inclusion probabilities $\pi(|\theta_j| > 0)$ (PPI) and the marginal posterior means $E[\theta_j]$ (Mean). These two quantities allow us to see how efficient the sampler is at exploring both the parameter and model spaces. For the simulations in the subsampling comparison (Figure 2) the quantity of interest was the posterior mean conditioned on being in the model $\gamma = (1, 1, 0, 0, \ldots, 0)$. We estimate the mean square error of these quantities by running 100 independent runs of each algorithm and comparing to the corresponding long Gibbs run. Methods used in Tables 1-3 were initialised at the maximum a posteriori estimate for a model with active variables chosen from an initial LASSO fit. For the subsampling comparison, methods were initialised at the location of the control variate (the maximum a posteriori estimate using the true nonzero variables).
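The mean-square-error comparison described above can be sketched as follows; the MSE-ratio convention and names are ours, and the paper's RE additionally accounts for compute time, which we omit here:

```python
def relative_efficiency(estimates_a, estimates_b, reference):
    """Statistical relative efficiency of method A over method B for one
    scalar quantity: the ratio of mean squared errors, where each MSE is
    taken against a reference value from a long 'gold standard' run."""
    def mse(estimates):
        return sum((e - reference) ** 2 for e in estimates) / len(estimates)
    return mse(estimates_b) / mse(estimates_a)
```

Values above 1 indicate that method A achieves a lower MSE than method B over the same number of independent runs.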
For each algorithm in the simulations of Tables 1-3 we use a computational budget of $10^6$ iterations with a maximum run time of 2 minutes. For the simulations in the subsampling comparison we used a computational budget of $10^6$ iterations with a maximum run time of 6 minutes. Algorithms were then compared on the basis of relative computational efficiency (RE) or relative efficiency per iteration (RSE). An iteration of the Gibbs sampler is a full update of all parameters (i.e. one run of all steps in Section C.3), whereas an iteration of the PDMP methods is one simulated event time.

Figure 1: Sample paths of PDMPs implementing the variable selection priors in 1 dimension. The left and centre plots show the trajectories for a continuous spike-and-slab prior $0.5N(0, \tau^2) + 0.5N(0, \tau^2 c^2)$ where $\tau^2 = 16$. As $c$ decreases, the spike component in the mixture approaches a Dirac mass. The plot on the right is the limiting process, where we set the velocity to zero, allowing the variable to stay fixed at zero.

Figure 3: Dynamics of the samplers on a robust regression example with spike-and-slab or horseshoe prior. The top row shows the posterior for $\theta_1$ and $\theta_2$; the bottom row shows the estimates for $\theta_2$ and $\theta_3$. The spike-and-slab distributions are sampled using the reversible jump PDMP samplers with reversible jump parameter 0.6 and refreshment for the BPS methods set to 0.5. All methods are shown with $10^3$ samples (red) and the PDMP dynamics are shown in black. Sampling with the horseshoe prior was implemented in Stan using NUTS. Both Stan and the PDMP methods were run for the same computing time. To aid visualisation only the first 30% of the PDMP trajectories are shown.

Figure 4: Sampling efficiency of reversible jump PDMP vs Stan for the robust regression example. The top figure shows boxplots of the posterior mean of $\theta_1$ for increasing computational budget, with outliers from the sampler removed for visualisation purposes. These removed outliers correspond to times when the sampler became stuck in a local mode where $\theta_1 = 0$. The subplot shows the full results including outliers. Both the Stan and reversible jump PDMP methods converge to similar solutions despite having different priors. The bottom figure shows the number of times the samplers did not find the global mode.

Figure 5: Predictive ability of reversible jump PDMP vs Stan for the robust regression example. Predictive ability is measured by Monte Carlo estimates of the mean square predictive performance.

Figure 6: Plots of the Monte Carlo variance for different samplers with varying choices of tuning parameters. The Monte Carlo variance is defined relative to the estimates of the posterior means and probabilities of inclusion.

Table 1: Scenario 1 (pair of correlated variables): relative efficiencies (RE) for each method, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI).

Table 3: Scenario 3 (no correlation): relative efficiencies (RE) for each method, against a Reversible Jump algorithm, for the marginal posterior means (Mean) and marginal posterior probabilities of inclusion (PI).