Adaptive MCMC for multiple changepoint analysis with applications to large datasets

: We consider the problem of Bayesian inference for changepoints where the number and position of the changepoints are both unknown. In particular, we consider product partition models where it is possible to integrate out model parameters for the regime between each changepoint, leaving a posterior distribution over a latent vector indicating the presence or not of a changepoint at each observation. The same problem setting has been considered by Fearnhead (2006) where one can use ﬁltering recursions to make exact inference. However, the complexity of this ﬁltering recursions algorithm is quadratic in the number of observations. Our approach relies on an adaptive Markov Chain Monte Carlo (MCMC) method for ﬁnite discrete state spaces. We develop an adaptive algorithm which can learn from the past states of the Markov chain in order to build proposal distributions which can quickly discover where changepoint are likely to be located. We prove that our algorithm leaves the posterior distribution ergodic. Crucially, we demonstrate that our adaptive MCMC algorithm is viable for large datasets for which the ﬁltering recursions approach is not. Moreover, we show that inference is possible in a reasonable time thus making Bayesian changepoint detection computationally eﬃcient.


Introduction
Changepoint problems arise in many practical instances in statistics, for example, signal processing, financial economics, process monitoring control and DNA sequence analysis.Here we consider chronologically ordered data over a period of time where it is suspected that there may have been some change(s) in the underlying generating process.For changepoints in parametric models, a parameter value (e.g.Gaussian mean or Gaussian precision) applicable to a certain time period may not extend well to another time period.Some examples include the rate of occurrences of coal mining disasters during the 18th and 19th century (Raftery and Akman, 1986), gene expression sequences (Hocking et al., 2014) and financial time series (Chen and Gupta, 2011).In this paper it is shown that analysis of multiple changepoint problems is feasible for larger datasets in a Bayesian setting using adaptive MCMC.
conditional on a fixed number of changepoints, Stephens (1994) presents an MCMC method for this problem.When the number of changepoints is unknown, inference is more challenging.This is the problem which we address in this paper.A common approach for state-space dimension traversing is the reversible jump algorithm of Green (1995) which performs trans-dimensional MCMC over a set of models, each incorporating a different number of changepoints.A drawback of this algorithm is that it can be difficult to design proposals so that the chain mixes well within and well between all available models.An alternative approach due to Chib (1998) compares models with different numbers of changepoints using approximate Bayes Factors from the MCMC output in a post-processing step.
The latter method requires MCMC model output for each number of changepoints under consideration.Fearnhead (2006) developed a clever forward-backward algorithm, filtering recursions, which allows one to sample exactly from the posterior distribution of changepoints.The filtering recursions share some similarity to product partition models (Barry and Hartigan, 1992).The overwhelming advantage of this method is that once the filtering recursions have been calculated, it allows one to draw to be sampled from the posterior using Carpenter's algorithm (Carpenter et al., 1999) that exploits the exponentially distributed spacing of order statistics in a uniform distribution.
However, a drawback of filtering recursions is that the algorithm requires a precomputation step to compute the recursions which has a time complexity that is quadratic in the number of observations and thus restricts the amount of data that can be used to perform efficient inference in a reasonable time.Fearnhead (2006) offers an solution to this problem that lowers the precision of the recursions in order to make their calculation time approximately linear in the number of observations and the price to pay is it results in an approximate algorithm thereby.
Adaptive Markov Chain Monte Carlo Methods (AMCMC) have recently emerged in an attempt to improve the efficiency of MCMC algorithms.Typically adaptive MCMC uses on-the-fly refinement of the proposal distribution, taking information from the past history of the MCMC chain to yield a better mixing algorithm.The adaptive Metropolis algorithm of Haario et al. (2001) was one of the earliest adaptive MCMC algorithms using a random walk Metropolis algorithm with an adapted covariance matrix.It is limited to continuous state spaces and to target distributions where a Gaussian proposal is suitable.
Adaptive MCMC methods on discrete state spaces have not yet been widely studied, yet these are very well suited to this methodology.This is because the design of adaptable proposals on discrete state spaces has the advantage that discrete state spaces carry the property of smallness, outlined in Meyn and Tweedie (2012), so that simultaneous uniform ergodicity of the proposal kernels is guaranteed, provided that the state space is irreducible and the transition kernel is aperiodic.The second condition necessary for ergodicity of adaptive MCMC, diminishing adaptation, can be satisfied in many ways on a discrete state space leading to widely applicable methods in problems such as variable selection and Bayesian optimisation (Mahendran et al., 2012).Griffin et al. (2014) presents an adaptive MCMC algorithm on a discrete state space to carry out variable selection in a model choice setting.
In recent years, the emergence of big data across a vast range of models in statistics and machine learning has lead to the need for methods that can scale well to large datasets.We highlight how our adaptive changepoint approach scales well with an increasing number of observations and an increasing number of changepoints.The size of large datasets can present challenges for non-adaptive MCMC due to the presence of many local modes in the posterior distribution.We show empirically how are algorithm learns to move away from local modes which hinder MCMC.
The remainder of the paper is organised as follows.Section 2 describes multiple changepoint models in a Bayesian framework, Section 3 describes our adaptive changepoint sampler and introduces some advanced adaptation techniques which improve efficiency.We present a brief review in Section 4 of filtering recursions (Fearnhead, 2006).Section 5 provide a proof of our algorithm and results for three datasets are presented in Section 6 along with comparisons to filtering recursions.
All methods in this paper have been implemented in C using the Intel C compiler running on an Intel i7 3.40GhZ equipped machine with 16GB of RAM.Code is available on request from the authors.

Multiple Changepoint models
Consider observed data y = (y 1 , y 2 , . . ., y n ), where observation y i is observed before observation y j , for i < j.We model y such that each observation y i arises independently from a likelihood model depending on a parameter θ i ∈ Θ whose value may or may not change from one observation to the next.The points at which θ i does change are called changepoints.
Consider the possibility of an unknown k < n changepoints in y occurring at positions τ = {τ 1 , τ 2 , . . ., τ k }.These changepoints partition y into k + 1 contiguous non-overlapping segments (1) This partitioning of y can be represented by a fixed length latent changepoint indicator vector z = {z 1 , z 2 , . . ., z n−1 } with z t = 1 for each t ∈ τ and z t = 0 for each t / ∈ τ , with the number of changepoints satisfying k = n−1 i=1 z i .Within segment j, the likelihood has a constant parameter θ j , 1 ≤ j ≤ k + 1.The full likelihood across all segments can be expressed as a product of k + 1 segment likelihoods where τ 0 = 0, τ k+1 = n and where f (y i |θ j ) denotes the likelihood of observation y i in a segment with parameter θ j .In a Bayesian formulation the joint posterior distribution for the latent changepoint indicator vector z and segment parameters θ = {θ 1 , . . ., θ k+1 } can be written as a product of the full segment likelihood (2) and the priors for z and θ, (3) The dependence of θ on z is only through the prior multiplicity of θ (k + 1) which sets the dimension of the prior term π(θ|z).This shares some similarity to the hierarchical changepoint model used in Green (1995) except that it does not condition on the number of changepoints and so this varies over the support of z.
The prior for z specifies how the changepoint positions should be distributed prior to the data being observed.A convenient form that captures the gap lengths between changepoints is, where g 0 (•) is the distribution of the distance to the first changepoint, g(•) is the gap distribution for the distance between successive changepoints and G(•) is the cumulative distribution function for g(•).
The choice for g can be a negative binomial or its special case, a geometric distribution A more complex prior that minimises the a priori clustering of changepoints (Green, 1995), is specified by the distribution of even order statistics of a draw of size 2k + 1 from (1, . . ., n − 1) without replacement.This prior prevents changepoints occurring at adjacent observations which minimises outliers (degenerate changepoints) being classified as true changepoints.
The priors for θ j can be chosen to be conjugate to the likelihood, however this is not a requirement.
The next section details the collapsing of the joint posterior (3) when the prior is conjugate but if it is possible to collapse the joint posterior using another method (e.g.quadrature) this is also feasible for use in our algorithm.

Collapsing multiple changepoints models
We assume that it is possible to integrate (collapse) out θ = {θ 1 , θ 2 , . . ., θ k+1 } parameters from the posterior (3) to leave a discrete state space of changepoint positions.This is also the approach taken by Fearnhead (2006).With an appropriate conjugate prior for θ, the resulting posterior for z is where P(τ j−1 + 1, τ j ) = θ j τ j i=τ j−1 +1 f (y i |θ j )π(θ j ) dθ j denotes the evidence for segment (y τ j−1 +1 , y τ j ).The evidence (marginal likelihood) is the probability of the data observed in that segment after the dependence on the parameter θ j has been integrated out with respect to its prior.The dependence of θ on z has been removed the position of changepoints and the within segment parameter are assumed independent.

A simple example of Collapsing -Poisson Gamma
Consider the case where the data in segment j can be modelled by a Poisson distribution with parameter θ j > 0. Placing a Gamma(α, β) prior on each θ j and integrating out θ j for j ∈ {1, . . ., k + 1}, the marginal likelihood for segment (y a , y b ) is, where Precomputation of F 1:t and S 1:t for 1 ≤ t ≤ n and using the following recursions negates the need to store each individual y i for computation of the marginal likelihood (5) which is required to be computed many times in filtering recursions and in our algorithm.Similar precomputations are available for other likelihood models, see Appendix A for the Gaussian distribution mean and precision.

Adaptive MCMC changepoint sampler
We now introduce our adaptive MCMC changepoint sampler to sample from the posterior distribution of changepoints (4).Wyse and Friel (2010) developed an MCMC scheme based on adding, deleting and position adjusting changepoints using samplers similar to those used by Lavielle and Lebarbier (2001).The algorithm of Wyse and Friel (2010) turns out to be a special case of our adaptive algorithm when no adaptation occurs and as we shall see, the adaptive MCMC algorithm we develop offers an improvement in efficiency, by comparison.
Sampling over z is a challenging problem as the size of the space scales exponentially with n, leaving brute force enumeration of all z intractable.However, for datasets with few changepoints (k n) the realised z vectors will be quite sparse.The design of our algorithm motivates searching element-wise through z identifying which elements (positions) are likely changepoints and those which are not.
Positions which are deemed unlikely to be changepoints will tend not to be proposed as changepoint locations and conversely locations which are identified as being locations of changepoints will tended to be proposed more frequently.In this way, our algorithm will facilitate proposed moves to centre around areas of high changepoint activity and move away from areas of low changepoint activity.As we will shortly see, this adaptive algorithm where proposed changepoint locations change over time will by design preserve the ergodicity of the adaptive Markov chain.
We now describe the adaptive algorithm in detail and defer a proof of ergodicity to Section 5.

Detailed description
At iteration t denote the current state of changepoint locations as z (t) .Our algorithm consists of three proposal moves to update the vector z (t) .The three proposal moves involve adding a new changepoint to z (t) (add move), deleting a changepoint from z (t) (delete move) and moving an existing changepoint within z (t) (adjust move).At each iteration t, one of either the add move or the delete move is selected with probability p and 1 − p, respectively.The adjust move is performed after either an add or delete move and in our implementation of the adaptive MCMC algorithm is an optional move type.
The space of all realisable z vectors is large, having 2 n−1 elements.It is important therefore to add changepoints in locations of high posterior changepoint probability and delete changepoints in areas of low posterior changepoint probability.It turns out that adaptively learning these areas on-the-fly provides a route to a scalable inferential framework for large datasets, as we now illustrate.
We associate with z (t) two iteration dependent selection weight vectors n−1 }.We remark that these weights are correspond to how often the algorithm should pick a particular point.This is different to the approach of Griffin et al. (2014) where the vectors are used as inclusion probabilities for variable selection.If a changepoint is proposed to be added, a position i (having z (t) i = 0) will be selected as the add position with probability a If a changepoint is proposed to be deleted, some position i (having z (t) i = 1) will be selected as the deletion position with probability d If the relevant add or delete move is accepted then the selected element i of z (t+1) will be toggled, otherwise z (t+1) does not change from z (t) .
The probability of accepting or rejecting the moves described above will depend on the relative change in the marginal likelihood of the segment added or deleted around position i.Let a be the changepoint immediately before i and b the changepoint immediately after i.The addition of a changepoint at position i would cause the segment that contains position i to be split into two new segments (y a+1 , y i ) and (y i+1 , y b ).The deletion of a changepoint at position i would cause the two segments created by the changepoint at i to merge into one segment (y a+1 , y b ).All other segments remain the same.The marginal likelihood ratios are thus Add Move → P(a + 1, i) The two moves are summarised clearly in Figure 1.The adjust move simply selects uniformly some z i = 1 and propose to move it locally somewhere between the changepoint before it and the changepoint after it.If there are no changepoints this move cannot be and is not attempted.This is the basis of our changepoint sampler.We are now left to describe the adaptation scheme used to update the a (t) and d (t) vectors during the algorithm, using the past history of the add and selected moves.This is a crucial part of the algorithm as these parameters decide where to place changepoints and remove changepoints in an efficient manner.This is described in the following section. (t) and d (t)   The MCMC algorithm of Wyse and Friel (2010) selects positions i for addition and deletion uniformly at random from all the valid n − 1 positions.This is equivalent to having constant vectors a (t) and

Adaptation of the selection weights a
where where .

Figure 1:
Adaptive MCMC changepoint sampler moves, the add move is performed with probability p and the delete move with probability 1 − p. d (t) which do not vary with iteration t.The adaptive method we use, proposes to update a (t) and d (t) using information from previously accepted add and delete moves.The scheme for adaptation is given in Figure 2. The strategy is to target the acceptance rate of the add and delete moves to an overall target acceptance rate α target by updating the a (t) and d (t) at each iteration.The updates are performed on the log scale to ensure that the weights remain positive.

At iteration t:
1.If an add move at point i has been accepted then update only the a (t) i parameter as follows log(a 2. If a delete move at point i has been accepted then update only the d i parameter as follows log(d t/n -Monte Carlo time, iterations (t) per number of datapoints (n) Figure 2: Adaptation scheme to update the vectors a (t) and d (t) .
This adaptation scheme is different from Griffin et al. (2014) in that there is no restriction on 0 < a i < 1 or 0 < d i < 1 as these are unnormalised selection weights and not probabilities.The parameter h controls the initial intensity of the adaptation, we find values << 1 work well.

A note on Non Uniform Sampling for selection weights
The a (t) and d (t) weights, once normalised appropriately using a (t) + and d (t) + (see Figure 1), must be sampled from to propose elements of z (t) for toggling.Discrete random variate generation for non-uniform probability vectors presents an extra level of complexity.In the case of Wyse and Friel (2010) with no adaptation, selection of elements for toggling is O(1) and is extremely efficient.To take advantage of the adaptive proposals the algorithm requires an efficient non-uniform sampler.
A naïve implementation of non-uniform sampling from the a (t) and d (t) vectors involves building a cumulative distribution of the values, O(n) time, and then sampling from this by binary lookup, O(log 2 n) time.This is significantly slower and may even detriment the use of the adaptive algorithm in the first instance.A method due to Walker (1974) overcomes this problem by precomputing lookup tables called alias tables in O(n) time and then sampling in O(1) time.A numerically stable implementation of Walker's method that overcomes numerical errors is due to Vose (1999).A discussion of the alias method is given in Appendix B.
Using alias tables we can get quite close to uniform sampling efficiency.Note that Matias et al. (1993) allows updating alias tables in less than O(n) time, however this imposes restrictions on the magnitude of the change in weights at each adaptation step.

Advanced Adaptation Techniques
In this section some advanced techniques are presented to improve the efficiency of the adaptive method.It is possible to implement thresholding of the a (t) values so that only some of the values use alias tables.Dual adaptation is used by Griffin et al. (2014) to simultaneously update a (t) and d (t)   after an accepted move.We modify this to our adaptation scheme.These advanced techniques allow the algorithm to be computationally efficient while performing the adaptive updates.Many issues with adaptive MCMC can arise due to adapting too quickly.These issues are discussed in Łatuszyński and Rosenthal (2014).

Advanced Adaptation 1: Thresholding of non-changepoints
Many of the a i values won't significantly change in magnitude over the course of the algorithm.This is due to the update of the a i values only being performed on acceptance of a changepoint and for points far away from changepoints the a i will rarely change.Computational time is still spent embedding these small a i in the rebuilding of alias tables each time any a i changes.This problem isn't as pronounced for the d i values as we assume that there are many more non-changepoints than changepoints in a dataset.
To take advantage of the low number of changepoints, we propose to split the points that are not changepoints into two groups, one with high posterior probability of being added, G active , and the other with a low posterior probability of being added, G inactive .The membership of each group is mutually exclusive and is determined by a threshold parameter a cutoff .All points begin in G inactive and as the a i values are adapted, points with a i > a cutoff move to G active .The other points remain in G inactive and are assumed to have a flat weight of a inactive < a cutoff which means they can be sampled without the use of alias tables (equivalent to uniform sampling within G inactive ).Each element of G inactive will retain it's true underlying a i value but this will only be used for sampling if and when it moves into G active .The thresholding will modify the algorithm slightly and the modifications to the acceptance probabilities are show in Figure 3.

Advanced Adaptation 2: Dual adaptation
As can be seen in the description of the moves, knowledge of α add allows one to also calculate α del quite easily.Griffin et al. (2014) uses this idea to perform a double or dual adaptation of both a (t)   and d (t) at each acceptance in the algorithm rather than updating only one of these vectors.The dual adaptation approach is applied without thresholding to the updates in Figure 2 and is described in Appendix D.
of size N from the posterior distribution of positions as follows: 1. Initialise all N samples to have a changepoint at t = 0, i.e. τ 0 = 0 2. For t = 0, . . ., n − 2, (a) Find n t , the number of samples for which the last changepoint was at time t.
(7) (c) Sample n t times, using Carpenter's algorithm (see Appendix C for details), from P(τ j |τ j−1 ) and update the n t samples using a random permutation of the n t samples.
The filtering recursions approach has the advantage that the design of the method allows one to draw independently from the posterior distribution.Moreover Carpenter's algorithm for sampling the changepoints is fast.This method however has some drawbacks which arise as the dataset increases in size.Firstly, the calculation of the Q(t) values is O(n 2 ) as the recursion for each possible ordered pair of points (i < j) must be computed before perfect simulation can begin.This calculation time can be reduced by truncating the Q(t) sums once they fail to grow by a certain amount, Fearnhead (2006) suggests 10 × 10 −10 and we compare various truncation levels in the results section.The price to pay for this reduced run time is that the truncation introduces an approximation to the recursion algorithm.Secondly, hyperparameters must remain fixed throughout the algorithm as a change in hyperparameters or indeed the inclusion of a hyperprior would require complete recalculation of Q(t).Thirdly, for larger datasets (≈ 260,000 observations for the largest example considered in this paper) the transition probabilities in (7) have the potential to become numerically unstable, as we outline in Section 6.3.1.We suggest using the exact algorithm, where possible.However for larger (> 100,000 observations) datasets we advocate the use of our adaptive changepoint sampler as it is much more stable, by comparison.

Proof of ergodicity for the Adaptive MCMC algorithm
There are two parts to proving ergodicity for an adaptive MCMC algorithm on a discrete state space X .The first establishes the notion of simultaneous uniform ergodicity and the second establishes diminishing adaptation.An adaptive MCMC algorithm which satisfies both of these conditions is ergodic by Theorem 1 of Rosenthal and Roberts (2007).

Simultaneous Uniform Ergodicity
We first recap the definition of uniform ergodicity for a Markov chain, the equivalent Doeblin's condition and simultaneous uniform ergodicity for transition kernels on a state space X .
Definition 5.1.(Uniform ergodicity) A Markov chain on a state space X with a transition kernel where • TV is the total variation norm.
An equivalent definition by Theorem 16.0.2(Meyn and Tweedie, 2012) states that there exists some r > 1 and R < ∞ such that ∀x ∈ X This implies that the convergence takes place at a geometric rate independent of the starting point x 0 ∈ X of the algorithm.
Uniform ergodicity is generally difficult to prove directly using Definition 5.1.Instead uniform ergodicity can be often more easily checked by equivalence to Doeblin's condition on X .This equivalence is shown in Theorem 16.0.2 of Meyn and Tweedie (2012) and is repeated here.
Theorem 5.1.(Doeblin's Condition) Suppose that Doeblin's Condition holds (as defined in Meyn and Tweedie (2012, p 396)) so that there exists a probability measure φ on the measurable space (X , σ{X }) with the property that for some m, some constant measure ρ < 1, some β > 0 and for a set then the chain under transition kernel P m (x, •) is uniformly ergodic.
Proof.See Theorem 16.2.3 of Meyn and Tweedie (2012) and relevant lemmas.
Finally Rosenthal and Roberts (2007) define the notion of simultaneous uniform ergodicity for a collection of transition kernels indexed by γ ∈ Γ.This definition is repeated here.
Definition 5.2.(Simultaneous Uniform Ergodicity) A collection of transition kernels indexed by where • TV is the total variation norm.
Remarks.The uniform ergodicity parameters R γ and r γ for each kernel may depend on γ but not on the states x ∈ X as otherwise uniform ergodicity would not hold.
Verifying multiple Doeblin's Conditions is equivalent to verifying uniform ergodicity for all kernels P n γ (x, •).This in turn guarantees simultaneous uniform ergodicity.We will now prove simultaneous uniform ergodicity for the adaptive changepoint sampler.
Theorem 5.2.(Simultaneous uniform ergodicity of the adaptive changepoint sampler) Let Γ (t) = (a (t) , d (t) ) be the set of adaptive weights at iteration t and let z (t) be the current state of the chain.
Then for all t the transition kernel using the weights Γ (t) , P Γ (t) (z (t) , •) is uniformly ergodic.
Proof.As we are working over a discrete state space, Z, the lower bound of the transition kernel P Γ (t) (z (t) , •) can be used as the value of β in order to satisfy Doeblin's Condition.Denote the overall minimum value of any element of a (t) or d (t) by > 0. The existence of this minimum follows from the adaptation scheme in Figure 2 where it is not possible for any a (t) or d (t) to reach 0 when started from a positive value.Take the measure φ(z) in Doeblin's Condition to be the posterior distribution of z, π(z|y).The value φ(z) is always positive as the prior for z allows for all 2 n−1 values of z to occur with non-zero probability.
The 1-step transition moves of our algorithm are the add and delete moves, i.e. m = 1 in Doeblin's Condition.An m-step kernel can be achieved by iteration of the 1-step kernel m times, Doeblin's condition only requires the existence of some m and the aperiodicity of the chain.Under add or delete moves the 1-step transition kernel at iteration t, P m=1 Γ (t) (z (t) , z ), can be separated into its proposal and acceptance parts where δ z (t) (z ) is 1 if and only if z (t) is identical to z in every element (i.e. the Hamming distance between current and proposal z is 0).By design of our algorithm, δ z (t) (z ) = 0 as a different vector is always proposed by the add and delete moves and we only need consider the first term of ( 8) The proposal kernel q Γ (t) (z (t) , z ) ≥ ω (t) > 0, where ω (t) normalises the a (t) or d (t) weights depending on which of the add or delete is taking place, therefore This inequality holds since π(z (t) |y)q This verifies that Doeblin's condition holds for the 1-step proposal kernel under any fixed set of adaptive weights Γ (t) which is sufficient to prove simultaneous uniform ergodicity.

Diminishing Adaptation
The second part of the proof is to verify diminishing adaptation for P Γ (t) (z, •), ∀t.Recall the definition of diminishing adaptation (Rosenthal and Roberts, 2007) Definition 5.3.(Diminishing adaptation) A series of transition kernels indexed by time t, P Γ (t) (z, •), are said to obey diminishing adaptation if For this section of the proof, the two other definitions needed are the concept of Lipschitz and bi-Lipschitz continuity of a real-valued function.
By the Mean Value Theorem this is equivalent to the function f having a bounded first derivative.
Definition 5.5.(bi-Lipschitz continuity) A function f is bi-Lipschitz if f and its inverse f −1 are both Lipschitz and thus one has where K > 0 is the Lipschitz constant of f and the inverse constant of f −1 .
Proof.For a 1-step move from z = z (t) to z define ∆ (t) to be the difference in the transition kernels between iteration t and t + 1. Again we only need consider the first term of (8) in this difference.
It is easy to see that ∆ (t) is the difference of two Metropolis acceptance probabilities scaled by the proposal q(•) and the difference can be one of 4 possible values depending on which quantity is the minimum in each of the minimum operators.For ease of notation, relabel the inner terms of the minimum operator as A, B, C, D so ∆ (t) as min{A, B} − min{C, D}.Considering these cases the 4 possible values for ∆ (t) are (11) For the first two cases of ( 11) we only need to show that which amounts to showing that |a i | → 0 and this will be shown below in equation.For the third case of ( 11) we can bound ∆ (t) from above as follows because the first term is bounded in the M-H Ratio (A > B) so this is equivalent to the first case of (11).Finally for the final case of ( 11) we can bound the first term again using the M-H Ratio and factor out the likelihood factor as follows For all cases in (11), bounding |q Γ (t+1) (z, z ) − q Γ (t) (z, z )| is enough to establish diminishing adaptation.It therefore only necessary to prove that |a Without loss of generality consider only the a (t) i parameter.The general form of the update scheme for a and as t → ∞, with h < ∞ and α target < 1 log(a To prove (13) implies |a i | → 0, we must prove that the log function is bi-Lipschitz.Since log(x) and its inverse, exp (x) have a bounded first derivative, provided 0 < x < ∞ which will be satisfied by the existence of > 0, log is bi-Lipschitz (Definition (5.5)).Therefore and so diminishing adaptation for P Γ (t) is established.N (µ i , σ 2 ), where µ i is the mean parameter for the ith segment and σ is fixed to 2,500.Independent N (115,000, τ 2 σ 2 ) priors are placed on each µ i with τ 2 set fixed to 16.Using the methods of Section 2.1 the segment marginal likelihood can be shown (Appendix A) to be The quantities s 1 and s 2 are the sum and the sum of squares of the data {y a , . . .y b }, respectively.

Well Log Drilling -Results & Algorithm Comparison
The results for the Well Log Drilling data run across adaptive MCMC, non-adaptive MCMC and filtering recursions are shown in Figure 5.The filtering recursions was run at 3 different levels of precision (full precision, 1e-6, 1e-4) to give 5 sets of results.The adaptive MCMC changepoint sampler was run for 5 seconds (16,000,000 iterations) with adaptive parameter h = 0.00119 and a target acceptance rate of 15%.The non-adaptive MCMC sampler was run for 5 seconds (2,000,000 iterations).
The results in the right panel of Figure 5 illustrate that the modal number of changepoints is estimated as 51 for all algorithms.All algorithms capture the same posterior distribution of the number of changepoints and changepoint positions.Additionally, the left panel of Figure 5 displays the posterior position of changepoints from the adaptive MCMC run which (although not presented here) was very similar to the non-adaptive MCMC and filtering recursion algorithms.The acceptance rates for the adaptive and non-adaptive MCMC algorithms were 15.31% and 15.10%, respectively and both the adaptive and non-adaptive MCMC algorithms were started from the same changepoint configuration, 40 changepoints randomly distributed throughout the data.
To compare the results of the adaptive MCMC changepoint sampler against filtering recursions, we compare the divergence of the adaptive and non-adaptive MCMC changepoint samplers to the output of filtering recursions run at full precision for 100 million chains from Carpenter's algorithm.

Posterior Position of Changepoints
Observation Index For the 2 lower levels of precision (1e-6, 1e-4) in Figure 6, the filtering recursions algorithm fails to target the correct posterior once the precision level of the recursions equals 1 × 10 −4 .The adaptive MCMC changepoint sampler appears to converge marginally quicker to the target distribution, in the sense of reaching a low divergence, than the non-adaptive and filtering recursions algorithms.
However we would overall recommend the use of filtering recursions for datasets of this size and smaller.The adaptive algorithm is marginally faster than the non-adaptive version and with a higher acceptance rate (15.31%).This outlines that the Adaptive MCMC is competitive not only to the filtering recursions but also to the non-adaptive algorithm.

Dataset 2 -Gaussian precision changepoint -Channel Noise Data
Variations in a signal can be detected by considering the change in variance around a fixed mean.
For example a web server may exhibit rapid variations in traffic across a period of time or a failing component of a machine may lose its precision as it fails.If we assume a constant mean for each of the segments and allow there to be a change in precision λ = 1 σ 2 at a changepoint we can model a process such as shown in Figure 7.
The likelihood for each observation y i is N (µ, λ −1 ) for a fixed µ.Assuming a prior on the precision, λ ∼ Gamma(α 0 , β 0 ), allows one to integrate over 0 < λ < ∞, leaving a marginal likelihood For this data the hyperparameters were set to α 0 = 12.0, β 0 = 4.8 and µ = 0.The parameter µ can be set to 0 prior to analysis provided the data is shifted using its known mean.A geometric gap prior was placed on z with p = 0.0006.The adaptive algorithm is the most competitive of the MCMC algorithms.

Results
The results for the channel noise data are shown in Figure 8 for filtering recursions, the adaptive MCMC changepoint sampler and the non-adaptive MCMC changepoint sampler.All 3 algorithms give a modal 18 number of changepoints in the data and each algorithm captures the full posterior distribution for both the positions and number of changepoints.The adaptive MCMC and non-adaptive MCMC algorithms were each run for 300 seconds, 60,000,000 iterations and 67,000,000 iterations respectively.The filtering recursions were run for the necessary precomputation time of 572 seconds and then a further 8,500 seconds using Carpenter's algorithm (100,000,000 chains).Extra runs of the filtering recursions were run at precision levels (1e-12, 1e-10 and 1e-8) for comparison.
For this example, it is possible to compare the convergence properties of the adaptive MCMC, non-adaptive MCMC and filtering recursions.Due to the extended precomputation time required for the filtering recursions for datasets of this size, we can only compare the two algorithms after the precomputation has completed.We compare the algorithms by examining the time to converge to a certain level of Divergence, D δ .In Figure 9 we mark an approximate convergence point at 53 seconds for the adaptive changepoint sampler with a divergence of 1.43 × 10 −6 nats.The filtering recursions is then run at full precision until it reaches this level of divergence or below, which takes 1,738 seconds.

Gaussian mean changepoint -Large Data example
Genome variation profiling arises in the analysis of neuroblastoma (cancer) samples.The data comes in the form of DNA single nucleotide polymorphism (SNP) arrays.We analyse the log 2 ratioAB series of sample GSM333824 UTP-N-12NMapping250KNsp from the SegAnnDB repository (Hocking et al., 2014).The data consisting of 262,230 observations is displayed in Figure 10.Further details of the data and collection can be found in Chen et al. (2008).
This is a large dataset and we analyse the change in the mean only.We assume a Geometric prior for the changepoint positions with parameter p = 5.72 × 10 −5 .The likelihood is take as N (µ j , 0.13) and the prior for µ j is N (0, 116.0).Hocking et al. (2014) analysed these data sequences using the PrunedDP algorithm of Rigaill (2015).Their algorithm works by fixing the number of changepoints 1 < k < k max in such a way to minimise the least squared error of all possible segmentations using k + 1 segments.The value of k max is chosen as low as 20.We find for our analysis we find between 219 and 260 changepoints using the adaptive changepoint sampler.

Results for the Adaptive Algorithm
The adaptive algorithm was run for 40,000,000,000 iterations with 20,000,000,000 iterations removed by burn-in.The long length of burn-in was necessary by the analysis in Figure 13.The adaptive parameter h was set to 0.001, while a target acceptance rate of 1.0% was chosen to tune the adaptive scheme.The algorithm took 20,665 seconds on an Intel i7 3.40GHz and the achieved acceptance rate was 1.048%.Over 210 changepoints are detected by the adaptive changepoint sampler.This includes the changepoints detected by the SegAnnDB repository and more changepoints at other locations.The PrunedDP algorithm of Rigaill (2015) requires the user to choose an optimal k in a post processing step.This contrasts with our adaptive algorithm which presents the user with the posterior distribution of the number of changepoints k, and so allowing the user to examine the a posteriori probability of various k and examine the relevant strength of different segmentations.In the first octile, high changepoint activity is observed.In the first half of the seventh octile, many outliers are detected by the adaptive algorithm.

Algorithm Comparison -Adaptive and Non-Adaptive MCMC
It is not possible to run the filtering recursions for this data due to issues discussed in Section 6.3.1.
In particular, and in contrast with the two previous examples, it is not possible to assess how well each algorithm converges to the target posterior distribution.However we can still provide a good indication of the convergence of each of the adaptive and non-adaptive MCMC algorithms by exploring the trajectory of the state of each chain towards the maximum a posteriori of the target distribution.
We performed a 6 hour run of both algorithms for over 80,000,000,000 iterations and monitored the maximum a posteriori (MAP) estimate achieved.The results are shown in Figure 13.The Adaptive  algorithm reaches a higher area of the posterior after about 2,000 seconds and continues to find higher areas of posterior mass.In constrast the non-adaptive version is much slower to climb to this area and after 6 hours still had not reached the MAP estimate of the Adaptive algorithm.This gives some indication that the adaptive MCMC algorithm is better able to reach the high-posterior density regions than the non-adaptive MCMC algorithm.We therefore conclude for datasets of this size, that the adaptive algorithm is many times more competitive than the non-adaptive algorithm and due to filtering recursion being unavailable is an ideal algorithm for big data changepoint problems.

Conclusion & Discussion
This paper introduces an adaptive changepoint sampling algorithm for multiple changepoint problems.We have described how our algorithm is be designed to learn on-the-fly where changepoints are likely to be located in a dataset.We prove that the adaptive MCMC scheme which we develop leaves the posterior distribution ergodic.Moreover the adaptive MCMC algorithm scales to large datasets in contrast to the filtering recursions of Fearnhead (2006) which is unreliable and prone to numerical instability in this case.Three datasets increasing in size from 4,000 observations to over 260,000 observations have been illustrated in this paper.The latter and largest dataset is unable to be analysed using filtering recursions and we show that our algorithm works well here to detect the number and location of changepoints in a reasonable computational time.We recommend using the filtering recursions for smaller datasets (e.g. up to size 100,000 observations) and where computational time is not an issue.However for datasets with more than 100,000 observations we advocate using the adaptive changepoint sampler.
Further work will involve extending this adaptive MCMC approach to other posterior distributions on discrete state spaces.For example, the likelihood of the data in this paper assumes independent observations within a segment between two changepoints.This could be replaced with a dependence within segment likelihood as in the work of Wyse et al. (2011) where the marginal segment likelihood is replaced with integrated nested Laplace approximations.
The diminishing adaptation condition we have proved in this paper is just one method of automatically tuning adaptive proposals.Our adaptation condition takes the form of a stochastic approximation algorithm but more involved adaptation schemes may be designed using the theory developed in this paper and this is a focus of future work.To conclude, we feel that there is much wider scope for the implementation of adaptive MCMC in practice and we hope that this article will encourage more work in this direction.

A. Normal Marginal Likelihood Calculation
The marginal likelihood for a changepoint in the mean parameter for normally distributed data with known variance (σ 2 ) and with a N (µ, τ 2 σ 2 ) prior on µ can be expressed with k = b − a + 1 as the integral of the product of two normal densities The choice of w is recommended as 0.5 by Griffin et al. (2014).

Figure 3 :
Figure 3: Adjusted moves for use with thresholding of a i values.Note that |G inactive | denotes the cardinality of the inactive set.

Figure 5 :
Figure 5: Well log data -The left panel presents the estimated posterior probability of a changepoint at each observation based on the adaptive MCMC algorithm.The right panel presents the estimated posterior probability of the number of changepoints for each of the adaptive MCMC, non-adaptive MCMC and filtering recursions algorithms.This illustrates that each algorithm converges to the same stationary distribution.

Figure 6 :
Figure 6: Well log data -Divergence, D δ , between a precise estimate of the posterior distribution of the number of change points based on a long run of the filtering recursion algorithm to the adaptive and non-adaptive MCMC algorithm.This plot shows the convergence to the ground truth.All chains converge to the ground truth except for the low precision recursions.The adaptive algorithm is the most competitive of the MCMC algorithms.

Figure 9 :
Figure9: Channel Noise data -Divergence, D δ , between a precise estimate of the posterior distribution of the number of change points based on a long run of the filtering recursion algorithm to the adaptive and non-adaptive MCMC algorithm.The adaptive MCMC algorithm outperforms non-adaptive MCMC and filtering recursions for all precision levels.We mark a convergence point for the adaptive MCMC of 53 seconds with divergence 1.43×10 −6 nats in the posterior distribution of the number of changepoints.The non-adaptive MCMC algorithm takes 150 seconds to reach this level of divergence.Note that (although not shown on the plot) the full-precision filtering takes 1,738 seconds to reach this same level of divergence

Figure 11 :
Figure 11: Genome variation dataset -The adaptive algorithm captures a bimodal posterior distribution for the number of changepoints (right panel).The posterior distribution of the changepoints positions is presented in the left panel.

Figure 12 :
Figure 12: Genome variation dataset -More detailed examination of the location of changepoints.In the first octile, high changepoint activity is observed.In the first half of the seventh octile, many outliers are detected by the adaptive algorithm.

Figure 13 :
Figure13: Genome variation dataset -Comparison between the trajectory of the log unnnormalised posterior for the adaptive and non-adaptive algorithms.It is clear that the adaptive algorithm climbs to an area of high posterior probability many times faster than the non-adaptive algorithm.
The data consists of 3979 observations after outliers have been removed.Visually there are many changepoints in the data, prior to analysis.

Table 1 :
A simulated dataset where the variance is assumed to change over time around a fixed mean.The results of this analysis are show in Figure9and Table1, with Table1showing the relative speed of each algorithm.There is a vast in improvement using the adaptive MCMC algorithm.Channel Noise data -Relative speed and acceptance rates of each (MCMC) algorithm are shown.