Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method

We propose a new scheme for selecting pool states for the embedded Hidden Markov Model (HMM) Markov Chain Monte Carlo (MCMC) method. This new scheme allows the embedded HMM method to be used for efficient sampling in state space models where the state can be high-dimensional. Previously, embedded HMM methods were only applied to models with a one-dimensional state space. We demonstrate that using our proposed pool state selection scheme, an embedded HMM sampler can have similar performance to a well-tuned sampler that uses a combination of Particle Gibbs with Backward Sampling (PGBS) and Metropolis updates. The scaling to higher dimensions is made possible by selecting pool states locally near the current value of the state sequence. The proposed pool state selection scheme also allows each iteration of the embedded HMM sampler to take time linear in the number of the pool states, as opposed to quadratic as in the original embedded HMM sampler. We also consider a model with a multimodal posterior, and show how a technique we term"mirroring"can be used to efficiently move between the modes.


Introduction
Consider a non-linear, non-Gaussian state space model for an observed sequence y = (y 1 , . . ., y n ).This model, with parameters θ, assumes that the Y i are drawn from an observation density p(y i |x i , θ), where X i is an unobserved Markov process with initial density p(x 1 |θ) and transition density p(x i |x i−1 , θ).Here, the x i might be either continuous or discrete.We may be interested in inferring both the realized values of the Markov process x = (x 1 , . . ., x n ) and the model parameters θ.In a Bayesian approach to this problem, this can be done by drawing a sample of values for x and θ using a Markov chain that alternately samples from the conditional posterior distributions p(x|θ, y) and p(θ|x, y).In this paper, we will only consider inference for x by sampling from choose pool states at each time i by constructing a "pseudo-posterior" for each latent variable by taking the product of a "pseudo-prior" and the observation density, the latter treated as a "pseudo-likelihood" for the latent variable.In Shestopaloff and Neal (2014), we choose pool states at each time i by sampling from the marginal prior density of the latent process.
Ways of choosing pool states that work well in one dimension begin to exhibit problems when applied to models with higher-dimensional state spaces.This is true even for dimensions as small as three.Since these schemes are global, designed to produce sets of pool states without reference to the current point, as the dimension of the latent space grows, a higher proportion of the sequences in the ensemble ends up having low posterior density.Ensuring that performance doesn't degrade in higher dimensions thus requires a significant increase in the number of pool states.As a result, computation time may grow so large that any advantage that comes from using embedded HMMs is eliminated.One advantage of the embedded HMM method over PGBS is that the embedded HMM construction allows placing pool states locally near the current value of x i , potentially allowing the method to scale better with the dimensionality of the state space.Switching to such a local scheme fixes the problem to some extent.However, local pool state schemes come with their own problems, such as making it difficult to handle models with multiple posterior modes that are well-separated -the pool states might end up being placed near only some of the modes.
In this paper, we propose an embedded HMM sampler suitable for models where the state space is high dimensional.This sampler uses a sequential approximation to the density p(x i |y 1 , . . .y i ) or to the density p(x i |y i+1 , . . ., y n ) as the pool state density.We show that by using this pool state density, together with an efficient MCMC scheme for sampling from it, we can reduce the cost per iteration of the embedded HMM sampler to be proportional to nL, as with PGBS.At the same time, we retain the ability to generate pool states locally, allowing better scaling for high-dimensional state spaces.Our proposed scheme can thus be thought of as combining the best features of the PGBS and the embedded HMM methods, while overcoming the deficiencies of both.We use two sample state space models as examples.Both have Gaussian latent processes and Poisson observations, with one model having a unimodal posterior and the second a multimodal one.For the multimodal example, we introduce a "mirroring" technique that allows efficient movement between the different posterior modes.For these models, we show how our proposed embedded HMM method compares to a simple Metropolis sampler, a PGBS sampler, as well as a sampler that combines PGBS and simple Metropolis updates.Further details on ensemble methods are available in the PhD thesis of Shestopaloff (2016).

Embedded HMM MCMC
We review the embedded HMM method (Neal, 2003;Neal, Beal and Roweis, 2004) here.We take the model parameters, θ, to be fixed, so we do not write them explicitly.Let p(x) be the density from which the state at time 1 is drawn, let p(x i |x i−1 ) be the transition density between states at times i and i − 1, and let p(y i |x i ) be the density of the observation y i given x i .
Suppose our current sequence is x = (x 1 , . . ., x n ).The embedded HMM sampler updates x to x as follows.First, at each time i = 1, . . ., n, we generate a set of L pool states, denoted by i }.The pool states are sampled independently across the different times i.We choose l i ∈ {1, . . ., L} uniformly at random and set x n using a Markov chain that leaves a pool density κ i invariant, as follows.Let R i (x |x) be the transitions of this Markov chain with Ri (x|x ) the transitions for this Markov chain reversed (i.e.Ri (x|x ) = R i (x |x)κ i (x)/κ i (x)), so that for all x and x .Then, starting at j = l i − 1, use reverse transitions Ri (x i and starting at j = l i + 1 use forward transitions R i (x At each i = 1, . . ., n, we then compute the forward probabilities α i (x), with x taking values in P i .At time i = 1, we have and at times i = 2, . . ., n, we have Finally, we sample a new state sequence x using a stochastic backwards pass.This is done by selecting x n amongst the set, P n , of pool states at time n, with probabilities proportional to α n (x), and then going backwards, sampling x i−1 from the set P i−1 , with probabilities proportional to α i−1 (x)p(x i |x).Note that only the relative values of the α i (x) will be required, so the α i may be computed up to some constant factor.Alternatively, given a set of pool states, embedded HMM updates can be done by first computing the backward probabilities.We will see later on that the backward probability formulation of the embedded HMM method allows us to introduce a variation of our proposed pool state selection scheme.Setting β n (x) = 1 for all x ∈ P n , we compute for i < n A new state sequence is then sampled using a stochastic forward pass, setting x 1 to one of the x in the pool P 1 with probabilities proportional to β 1 (x)p(x)p(y 1 |x) and then choosing subsequent states x i from the pools P i with probabilities proportional to β i (x)p(x|x i−1 )p(y i |x).
Computing the α i or β i at each time i > 1 takes time proportional to L 2 , since for each of the L pool states it takes time proportional to L to compute the sums in (3) or (4).Hence each iteration of the embedded HMM sampler takes time proportional to L 2 n.

Particle Gibbs with Backward Sampling MCMC
We review the Particle Gibbs with Backward Sampling (PGBS) sampler here.For full details, see the articles by Andrieu, Doucet and Holenstein (2010) and Lindsten and Schon (2012).
Let q 1 (x|y 1 ) be the importance density from which we sample particles at time 1, and let q i (x|y i , x i−1 ) be the importance density for sampling particles at times i > 1.These may depend on the current value of the parameters, θ, which we suppressed in this notation.Suppose we start with a current sequence x.We set the first particle x 1 to the current state x 1 .We then sample 1 from q 1 and compute and normalize the weights of the particles: for l = 1, . . ., L.
For i > 1, we proceed sequentially.We first set x [1] i = x i .We then sample a set of L − 1 ancestor indices for particles at time i, defined by A

[l]
i−1 .The ancestor index for the first state, A i−1 ) and compute and normalize the weights at time i w A new sequence taking values in the set of particles at each time is then selected using a backwards sampling pass.This is done by first selecting x n from the set of particles at time n with probabilities W [l] n and then selecting the rest of the sequence going backward in time to time 1, setting A common choice for q is the model's transition density, which is what is compared to in this paper.
Note that each iteration of the PGBS sampler takes time proportional to Ln, since it takes time proportional to L to create the set of particles at each time i, and to do one step of backward sampling.

An embedded HMM sampler for high dimensions
We propose two new ways, denoted f and b, of generating pool states for the embedded HMM sampler.Unlike previously-used pool state selection schemes, where pool states are selected independently at each time, our new schemes select pool states sequentially, with pool states at time i selected conditional on pool states at time i − 1, or alternatively at time i + 1.

Pool state distributions
The first way to generate pool states is to use a forward pool state selection scheme, with a sequential approximation to p(x i |y 1 , . . ., y i ) as the pool state density.In particular, at time 1, we set the pool state distribution of our proposed embedded HMM sampler to As a result of equation ( 2), α 1 (x) is constant.At time i > 1, we set the pool state distribution to which makes α i (x) constant for i > 1 as well (see equation ( 3)).
We then draw a sequence composed of these pool states with the forward probability implementation of the embedded HMM method, with the α i (x)'s all set to 1.
The second way is to instead use a backward pool state selection scheme, with a sequential approximation of p(x i |y i+1 , . . ., y n ) as the pool state density.We begin by creating the pool P n , consisting of the current state x n and the remaining L − 1 pool states sampled from p n (x), the marginal density at time n, which is the same as p(x) if the latent process is stationary.The backward probabilities β n (x), for x in P n , are then set to 1.At time i < n we set the pool state densities to so that β i (x) is constant for all i = 1, . . ., n (see equation 4).
We then draw a sequence composed of these pool states as in the backward probability implementation of the embedded HMM method, with the β i (x)'s all set to 1.
If the latent process is Gaussian, and the latent state at time 1 is sampled from the stationary distribution of the latent process, it is possible to update the latent variables by applying the forward scheme to the reversed sequence (y n , . . ., y 1 ) by making use of time reversibility, since X n is also sampled from the stationary distribution, and the latent process evolves backward in time according to the same transition density as it would going forward.We then use the forward pool state selection scheme along with a stochastic backward pass to sample a sequence (x n , . . ., x 1 ), starting with x 1 and going to x n .
It can sometimes be advantageous to alternate between using forward and backward (or, alternatively, forward applied to the reversed sequence) embedded HMM updates, since this can improve sampling of certain x i .The sequential pool state selection schemes use only part of the observed sequence in generating the pool states.By alternating update directions, the pool states can depend on different parts of the observed data, potentially allowing us to better cover the region where x i has high posterior density.For example, at time 1, the pool state density may disperse the pool states too widely, leading to poor sampling for x 1 , but sampling x 1 using a backwards scheme can be much better, since we are now using all of the data in the sequence when sampling pool states at time 1.

Sampling pool states
To sample from κ f i or κ b i , we can use any Markov transitions R i that leave this distribution invariant.The validity of the method does not depend on the Markov transitions for sampling from κ f i or κ b i reaching equilibrium or even on them being ergodic.Directly using these pool state densities in an MCMC routine leads to a computational cost per iteration that is proportional to L 2 n, like in the original embedded HMM method, since at times i > 1 we need at least L updates to produce L pool states, and the cost of computing an acceptance probability is proportional to L.
However, it is possible to reduce the cost per iteration of the embedded HMM method to be proportional to nL when we use κ f i or κ b i as the pool state densities.To do this, we start by thinking of the pool state densities at each time i > 1 as marginal densities summing over the variable = 1, . . ., L that indexes a pool state at the previous time.Specifically, κ f i can be viewed as a marginal of the density while κ b i is a marginal of the density Both of these densities are defined given a pool P i−1 at time i − 1 or pool P i+1 at time i + 1.This technique is reminiscent of the auxiliary particle filter of Pitt and Shephard (1999).We then use Markov transitions, R i , to sample a set of values of x and , with probabilities proportional to λ i for the forward scheme, or probabilities proportional to γ i for the backward scheme.
The chain is started at x set to the current x i , and the initial value of is chosen randomly with probabilities proportional to p(x i |x i+1 |x i ) for the backward scheme.This stochastic initialization of is needed to make the algorithm valid when we use λ i or γ i to generate the pool states.
Sampling values of x and from λ i or γ i can be done by updating each of x and separately, alternately sampling values of x conditional on , and values of conditional on x, or by updating x and jointly, or by a combination of these.
Updating x given can be done with any appropriate sampler, such as Metropolis, or for a Gaussian latent process we can use autoregressive updates, which we describe below.To update given x, we can also use Metropolis updates, proposing = +k, with k drawn from some proposal distribution on {−K, . . ., −1, 1, . . ., K}.Alternatively, we can simply propose uniformly at random from {1, . . ., L}.
To jointly update x and , we propose two novel updates, a "shift" update and a "flip" update.Since these are also Metropolis updates, using them together with Metropolis or autoregressive updates, for each of x and separately, allows embedded HMM updates to be performed in time proportional to nL.

Autoregressive updates
For sampling pool states in our embedded HMM MCMC schemes, as well as for comparison MCMC schemes, we will make use of Neal's (1998) "autoregressive" Metropolis-Hastings update, which we review here.This update is designed to draw samples from a distribution of the form p(w)p(y|w) where p(w) is multivariate Gaussian with mean µ and covariance Σ and p(y|w) is typically a density for some observed data.This autoregressive update proceeds as follows.Let L be the lower triangular Cholesky decomposition of Σ, so Σ = LL T , and n be a vector of i.i.d.normal random variables with zero mean and identity covariance.Let ∈ [−1, 1] be a tuning parameter that determines the scale of the proposal.Starting at w, we propose Because these autoregressive proposals are reversible with respect to p(w), the proposal density and p(w) cancel in the Metropolis-Hastings acceptance ratio.This update is therefore accepted with probability min 1, p(y|w ) p(y|w) Note that for this update, the same value of is used for scaling along every dimension.It would be of independent interest to develop a version of this update where can be different for each dimension of w.

Shift updates
We can simultaneously update and x at time i > 1 by proposing to update (x, ) to (x , ) where is proposed in any valid way while x is chosen in a way such that x and x

[ ]
i−1 are linked in the same way as x and x [ ] i−1 .The shift update makes it easier to generate a set of pool states at time i with different predecessor states at time i − 1, helping to ensure that the pool states are well-dispersed.This update is accepted with the usual Metropolis probability.
For a concrete example we use later, suppose that the latent process is an autoregressive Gaussian process of order 1, with the model being that X i |x i−1 ∼ N (Φx i−1 , Σ).In this case, given , we propose as a result of the transition densities in the acceptance ratio cancelling out, since To be useful, shift updates normally need to be combined with other updates for generating pool states.When combining shift updates with other updates, tuning of acceptance rates for both updates needs to be done carefully in order to ensure that the shift updates actually improve sampling performance.In particular, if the pool states at time i − 1 are spread out too widely, then the shift updates may have a low acceptance rate and not be very useful.Therefore, jointly optimizing proposals for x and for x and may lead to a relatively high acceptance rate on updates of x, in order to ensure that the acceptance rate for the shift updates isn't low.

Flip updates
Generating pool states locally can be helpful when applying embedded HMMs to models with high-dimensional state spaces but it also makes sampling difficult if the posterior is multimodal.Consider the case when the observation probability depends on |x i | instead of x i , so that many modes with different signs for some x i exist.We propose to handle this problem by adding an additional flip update that creates a "mirror" set of pool states, in which −x i will be in the pool if x i is.By having a mirror set of pool states, we are able to flip large segments of the sequence in a single update, allowing efficient exploration of different posterior modes.
To generate a mirror set of pool states, we must correctly use the flip updates when sampling the pool states.Since we want each pool state to have a negative counterpart, we choose the number of pool states L to be even.The chain used to sample pool states then alternates two types of updates, a usual update to generate a pool state and a flip update to generate its negated version.The usual update can be a combination of any updates, such as those we consider above.So that each state will have a flipped version, we start with a flip transition between x [1] and x [2] , a usual transition between x [2] and x [3] , and so on up to a flip transition between x [L−1] to x [L] .
At time 1, we start with the current state x 1 and randomly assign it to some index l 1 in the chain used to generate pool states.Then, starting at x 1 we generate pool states by reversing the Markov chain transitions back to 1 and going forward up to L. Each flip update is then a Metropolis update proposing to generate a pool state −x 1 given that the chain is at some pool state x 1 .Note that if the observation probability depends on x 1 only through |x 1 | and p(x) is symmetric around zero then this update is always accepted.
At time i > 1, a flip update proposes to update a pool state (x, ) to (−x, ) such that x Here, since the pool states at each time are generated by alternating flip and usual updates, starting with a flip update to x [1] i , the proposal to move from to can be viewed as follows.Suppose that instead of labelling our pool states from 1 to L we instead label them 0 to L − 1.The pool states at times 0 and 1, then 2 and 3, and so on will then be flipped pairs, and the proposal to change to can be seen as proposing to flip the lower order bit in a binary representation of .For example, a proposal to move from = 3 to = 2 can be seen as proposing to change from 11 to 10 (in binary).Such a proposal will always be accepted assuming a transition density for which p(x i |x i−1 ) = p(-x i |-x i−1 ) and an observation probability which depends on x i only via |x i |.

Relation to PGBS
The forward pool state selection scheme can be used to construct a sampler with properties similar to PGBS.This is done by using independence Metropolis to sample values of x and from λ i .
At time 1, we propose our pool states from p(x).At times i > 2, we propose by selecting it uniformly at random from {1, . . ., L} and we propose x by sampling from p(x|x . The proposals at all times i are accepted with probability min 1, p(y This sampler has computational cost proportional to Ln per iteration, like PGBS.It is analogous to a PGBS sampler with importance densities and with the key difference between these two samplers being that PGBS uses importance weights p(y i |x i ) on each particle, instead of an independence Metropolis accept-reject step.

Proof of correctness
We modify the original proof of Neal (2003), which assumes that the sets of pool states P 1 , . . ., P n are selected independently at each time, to show the validity of our new sequential pool state selection scheme.Another change in the proof is to account for generating the pool states by sampling them from λ i or γ i instead of κ f i or κ b i .This proof shows that the probability of starting at x and moving to x with given sets of pool states P i (consisting of values of x at each time i), pool indices l i of x i , and pool indices l i of x i is the same as the probability of starting at x and moving to x with the same set of pool states P i , pool indices l i of x i , and pool indices l i of x i .This in turn implies, by summing/integrating over P i and l i , that the embedded HMM method with the sequential pool state scheme satisfies detailed balance with respect to p(x|y), and hence leaves p(x|y) invariant.
Suppose we use the sequential forward scheme.The probability of starting at x and moving to x decomposes into the product of the probability of starting at x, which is p(x|y), the probability of choosing a set of pool state indices l i , which is 1 L n , the probability of selecting the initial values of i for the stochastic initialization step, the probability of selecting the sets of pool states P i , P (P 1 , . . ., P n ), and finally the probability of choosing x .
The probability of selecting given initial values for the links to previous states 2 , . . ., n is The probability of choosing a given set of pool states is At time 1, we use a Markov chain with invariant density κ 1 to select pool states in P 1 .Therefore For times i > 1 we use a Markov chain with invariant density λ i to sample a set of pool states, given P i−1 .The chain is started at x Finally, we choose a new sequence x amongst the collection of sequences consisting of the pool states with a backward pass.This is done by first choosing a pool state x n uniformly at random from P n .We then select the remaining states x Thus, the probability of starting at x and going to x , with given P 1 , . . ., P n , l 1 , . . ., l n and l 1 , . . ., l n is Here, we have x and Therefore ( 28) can be simplified to The last factor in the product only depends on the selected set of pool states.By exchanging x and x we see that the probability of starting at x and then going to x, with given sets of pool states P i , pool indices l i of x i and pool indices l i of x i is the same.

Test models
To demonstrate the performance of our new pool state scheme, we use two different state space models.The latent process for both models is a vector autoregressive process, with where X i = (X i,1 , . . ., X i,P ) and Note that Σ init is the covariance of the stationary distribution for this process.
For model 2, we increase the dimensionality of the latent space to 15 and the sequence length to 500.We set ρ = 0.7 and φ j = 0.9, σ j = 0.8 for j = 1, . . ., P , with P = 15.
We generated one random sequence from each model to test our samplers on.These observations from model 1 and model 2 are shown in Figure 1.Note that we are testing only sampling of the latent variables, with the parameters set to their true values.The code for all experiments in this paper is available at http://arxiv.org/abs/1602.06030.

Single-state Metropolis Sampler
A simple scheme for sampling the latent state sequence is to use Metropolis-Hastings updates that sample each x i in sequence, conditional on x −i = (x 1 , . . ., x i−1 , x i+1 , . . ., x i ) and the data, starting at time 1 and going to time n.We sample all dimensions of x i at once using autoregressive updates (see section 5.4.3).
The conditional densities of the X i are The densities p(x 1 |x 2 ), p(x i |x i−1 , x i+1 ), and p(x n |x n−1 ) are all Gaussian.The means and covariances for these densities can be derived by viewing p(x 1 ) or p(x i |x i−1 ) as a Gaussian prior for x i and p(x i+1 |x i ) as a Gaussian likelihood for x i .In particular, we have where To speed up the Metropolis updates, we precompute and store the matrices [ as well as the Cholesky decompositions of the posterior covariances.
In both of our test models, the posterior standard deviation of the latent variables x i,j varies depending on the value of the observed y i,j .To address this, we alternately use a larger or a smaller proposal scaling, , in the autoregressive update when performing an iteration of the Metropolis sampler.

Particle Gibbs with Backward Sampling with Metropolis
We implement the PGBS method as described in Section 5.3, using the initial density p(x) and the transition densities p(x i |x i−1 ) as importance densities to generate particles.We combine PGBS updates with single-state Metropolis updates from Section 5.6.2.This way, we combine the strengths of the two samplers in targeting different parts of the posterior distribution.In particular, we expect the Metropolis updates to do better for the x i with highly informative y i , and the PGBS updates to do better for the x i where y i is not as informative.

Tuning the Baseline Samplers
For model 1, we compared the embedded HMM sampler to the simple single-state Metropolis sampler, to the PGBS sampler, and to the combination of PGBS with Metropolis.For model 2, we compared the embedded HMM sampler to the PGBS with Metropolis sampler.For both models and all samplers, we ran the sampler five times using five different random number generator seeds.We implemented the samplers in MATLAB on a Linux system with a 2.60 GHz Intel i7-3720QM CPU.

Model 1
For the single-state Metropolis sampler, we initialized all x i,j to 0. Every iteration alternately used a scaling factor, , of either 0.2 or 0.8, which resulted in an average acceptance rate of between 30% and 90% for the different x i over the sampler run.We ran the sampler for 1000000 iterations, and prior to analysis, the resulting sample was thinned by a factor of 10, to 100000.The thinning was done due to the difficulty of working with all samples at once, and after thinning the samples still possessed autocorrelation times significantly greater than 1.Each of the 100000 samples took about 0.17 seconds to draw.
For the PGBS sampler and the sampler combining PGBS and Metropolis updates, we also initialized all x i,j to 0. We used 250 particles for the PGBS updates.For the Metropolis updates, we alternated between scaling factors of 0.2 and 0.8, which also gave acceptance rates between 30% and 90%.For the standalone PGBS sampler, we performed a total of 70000 iterations.Each iteration produced two samples for a total of 140000 samples and consisted of a PGBS update using the forward sequence and a PGBS update using the reversed sequence.Each sample took about 0.12 seconds to draw.For the PGBS with Metropolis sampler, we performed a total of 30000 iterations of the sampler.Each iteration was used to produce four samples, for a total of 120000 samples, and consisted of a PGBS update using the forward sequence, ten Metropolis updates (of which only the value after the tenth update was retained), a PGBS update using the reversed sequence, and another ten Metropolis updates, again only keeping the value after the tenth update.The average time to draw each of the 120000 samples was about 0.14 seconds.

Model 2
For model 2, we were unable to make the single-state Metropolis sampler converge to anything resembling the actual posterior in a reasonable amount of time.In particular, we found that for x i,j sufficiently far from 0, the Metropolis sampler tended to be stuck in a single mode, never visiting values with the opposite sign.
For the PGBS with Metropolis sampler, we set the initial values of x i,j to 1.We set the number of particles for PGBS to 80000, which was nearly the maximum possible for the memory capacity of the computer we used.For the Metropolis sampler, we alternated between scaling factors of 0.3 and 1, which resulted in acceptance rates ranging between 29% and 72%.We performed a total of 250 iterations of the sampler.As for model 1, each iteration produced four samples, for a total of 1000 samples, and consisted of a PGBS update with the forward sequence, fifty Metropolis updates (of which we only keep the value after the last one), a PGBS update using the reversed sequence, and another fifty Metropolis updates (again only keeping the last value).It took about 26 seconds to draw each sample.

Embedded HMM sampling
For both model 1 and model 2, we implemented the proposed embedded HMM method using the forward pool state selection scheme, alternating between updates that use the original and the reversed sequence.As for the baseline samplers, we ran the embedded HMM samplers five times for both models, using five different random number generator seeds.We generate pool states at time 1 using autoregressive updates to sample from κ f 1 .At times i ≥ 2, we sample each pool state from λ i (x, l) by combining an autoregressive and shift update.The autoregressive update proposes to only change x, keeping the current l fixed.The shift update samples both x and l, with a new l proposed from a Uniform{1, . . ., L} distribution.For model 2, we also add a flip update to generate a negated version of each pool state.
Note that the chain used to produce the pool states now uses a sequence of updates.Therefore, if our forward transition does an autoregressive update and then a shift update, the reverse transitions must first do a shift update and then an autoregressive update.
As for the single-state Metropolis updates, it is beneficial to use a different proposal scaling, , when generating each pool state at each time i.This allows generation of sets of pool states which are more concentrated when y i is informative and more dispersed when y i holds little information.

Model 1
For model 1, we initialized all x i,j to 0. We used 50 pool states for the embedded HMM updates.For each Metropolis update to sample a pool state, we used a different scaling , chosen at random from a Uniform(0.1,0.4) distribution.The acceptance rates ranged between 55% and 95% for the Metropolis updates and between 20% and 70% for the shift updates.We performed a total of 9000 iterations of the sampler, with each iteration consisting of an embedded HMM update using the forward sequence and an embedded HMM update using the reversed sequence, for a total of 18000 samples.Each sample took about 0.81 seconds to draw.

Model 2
For model 2, we initialized the x i,j to 1.We used a total of 80 pool states for the embedded HMM sampler (i.e.40 positive-negative pairs due to flip updates).Each Metropolis update used to sample a pool state used a scaling, , randomly drawn from the Uniform(0.05,0.2) distribution.The acceptance rates ranged between 75% and 90% for the Metropolis updates and between 20% and 40% for the shift updates.We performed a total of 9000 iterations of the sampler, producing two samples per iteration with an embedded HMM update using the forward sequence and an embedded HMM update using the reversed sequence.Each of the 18000 samples took about 1.4 seconds to draw.

Comparisons
As a way of comparing the performance of the two methods, we use an estimate of autocorrelation time1 for each of the latent variables x i,j .Autocorrelation time is a measure of how many draws need to be made using the sampling chain to produce the equivalent of one independent sample.The autocorrelation time is defined as τ = 1 + 2 ∞ i=1 ρ k , where ρ k is the autocorrelation at lag k.It is commonly estimated as where ρk are estimates of lag-k autocorrelations and the cutoff point K is chosen so that ρk negligibly different from 0 for k > Here where γk is an estimate of the lag-k autocovariance γk When estimating autocorrelation time, we remove the first 10% of the sample as burn-in.Then, to estimate γk , we first estimate autocovariances for each of the five runs, taking x to be the overall mean over the five runs.We then average these five autocovariance estimates to produce γk .To speed up autocovariance computations, we use the Fast Fourier Transform.The autocorrelation estimates are then adjusted for computation time, by multiplying the estimated autocorrelation time by the time it takes to draw a sample, to ensure that the samplers are compared fairly.
The computation time-adjusted autocorrelation estimates for Model 1, for all the latent variables, plotted over time, are presented in Figure 2. We found that the combination of single-state Metropolis and PGBS works best for the unimodal model.The other samplers work reasonably well too.We note that the spike in autocorrelation time for the PGBS and to a lesser extent for the PGBS with Metropolis sampler occurs at the point where the data is very informative.This in turn makes the use of the diffuse transition distribution the particles are drawn from inefficient and much of the sampling in that region is due to the Metropolis updates.Here, we also note that the computation time adjustment is sensitive to the particularities of the implementation, in this case done in MATLAB, where performance depends a lot on how well vectorization We now look at how the samplers perform on the more challenging Model 2. We first did a preliminary check of whether the samplers do indeed explore the different modes of the distribution by looking at variables far apart in the sequence, where we expect to see four modes (with all possible combinations of signs).This is indeed the case for both the PGBS with Metropolis and Embedded HMM samplers.Next, we look at how efficiently the latent variables are sampled.Of particular interest are the latent variables with well-separated modes, since sampling performance for such variables is illustrative of how well the samplers explore the different posterior modes.Consider the variable x 1,300 , which has true value −1.99. Figure 3 shows how the different samplers explore the two modes for this variable, with equal computation times used to produced the samples for the trace plots.We can see that the embedded HMM sampler with flip updates performs significantly better for sampling a variable with well-separated modes.Experiments showed that the performance of the embedded HMM sampler on model 2 without flip updates is  We can also look at the product of the two variables x 3,208 x 4,208 , with true value −4.45.The trace plot is given in Figure 4.In this case, we can see that the PGBS with Metropolis sampler performs better.Since the flip updates change the signs of all dimensions of x i at once, we do not expect them to be as useful for improving sampling of this function of state.The vastly greater number of particles used by PGBS, 80000, versus 80 for the embedded HMM method, works to the advantage of PGBS, and explains the performance difference.
Looking at these results, we might expect that we can get a good sampler for both x 1,300 and x 3,208 x 4,208 by alternating embedded HMM and PGBS with Metropolis updates.This is indeed the case, which can be seen in Figure 5.For producing these plots, we used an embedded HMM sampler with the same settings as in the experiment for Model 2 and a PGBS with Metropolis sampler with 10000 particles and Metropolis updates using the same settings as in the experiment This example of Model 2 demonstrates another advantage of the embedded HMM viewpoint, which is that it allows us to design updates for sampling pool states to handle certain properties of the density.This is arguably easier than designing importance densities in high dimensions.

Conclusion
We have demonstrated that it is possible to use embedded HMM's to efficiently sample state sequences in models with higher dimensional state spaces.We have also shown how embedded HMMs can improve sampling efficiency in an example model with a multimodal posterior, by introducing a new pool state selection scheme.There are several directions in which this research can be further developed.
The most obvious extension is to treat the model parameters as unknown and add a step to sample parameters given a value of the latent state sequence.In the unknown parameter context, it would also be interesting to see how the proposed sequential pool state selection schemes can be used together with ensemble MCMC updates of Shestopaloff and Neal (2013).For example, one approach is to have the pool state distribution depend on the average of the current and proposed parameter values in an ensemble Metropolis update, as in Shestopaloff and Neal (2014).
One might also wonder whether it is possible to use the entire current state of x in constructing the pool state density at a given time.It is not obvious how (or if it is possible) to overcome this limitation.For example, for the forward scheme, using the current value of the state sequence at some time k > i to construct pool states at time i means that the pool states at time k will end up depending on the current value of x k , which would lead to an invalid sampler.
At each time i < n, the pool state generation procedure does not depend on the data after time i, which may cause some difficulties in scaling this method further.On one hand, this allows for greater dispersion in the pool states than if we were to impose a constraint from the other direction as with the single-state Metropolis method, potentially allowing us to make larger moves.On the other hand, the removal of this constraint also means that the pool states can become too dispersed.In higher dimensions, one way in which this can be controlled is by using a Markov chain that samples pool states close to the current x i -that is, a Markov chain that is deliberately slowed down in order not to overdisperse the pool states, which could lead to a collection of sequences with low posterior density.

Figure 2 :
Figure 2: Estimated autocorrelation times for each latent variable for Model 1, adjusted for computation time

Figure 5 :
Figure 5: Combination of embedded HMM and PGBS with Metropolis samplers