Structured filtering

A major challenge facing existing sequential Monte Carlo methods for parameter estimation in physics stems from the inability of existing approaches to robustly deal with experiments that have different mechanisms that yield the results with equivalent probability. We address this problem here by proposing a form of particle filtering that clusters the particles that comprise the sequential Monte Carlo approximation to the posterior before applying a resampler. Through a new graphical approach to thinking about such models, we are able to devise an artificial-intelligence based strategy that automatically learns the shape and number of the clusters in the support of the posterior. We demonstrate the power of our approach by applying it to randomized gap estimation and a form of low circuit-depth phase estimation where existing methods from the physics literature either exhibit much worse performance or even fail completely.


I. INTRODUCTION
Across a range of physical sciences, much of the work of the experimentalist centers around learning from observed data.In metrology, for instance, experimental data is used to infer parameters of interest such as magnetic fields, temperature, or other physical quantities [1].The process by which these parameters are learned from data is thus of critical importance to tasks as diverse as metrology and quantum information processing [2][3][4][5][6].
The importance of learning physical parameters has motivated developing and making practical advances in statistically-principled approaches to parameter estimation.Bayesian methods in general and Bayesian inference in particular have proven to provide a compelling framework for drawing inferences from experimental observations in a rigorous, robust and practical manner [7].Numerical algorithms such as particle filtering then provide a general and practical framework for approximate Bayesian inference, as well as for statistical computation more generally [8].
In practice, existing approaches suffer from a great deficiency: they implicitly assume that the probability distribution that describes the current state of knowledge about the parameters has a particular structure to it.These assumptions are often reasonable, such as the assumption that the probability distribution is Gaussian or, weaker still, unimodal.While these reasonable sounding assumptions often work well, they can fail in catastrophic ways when evidence is provided that favors a multitude of equivalent (or near-equivalent) hypotheses.We refer to such learning problems as degenerate in analogy to quantum mechanics.The challenges posed by degeneracy can sometimes be circumvented through the use of cunningly designed experiments, but this places additional experimental demands, and is not always possible.As a result, there are broad classes of parameter estimation problems for which we have no robust automated methods for parameter estimation.
In this paper, we allowing the inference algorithm to learn the structure of its posterior distribution over the parameters of interest.We show that doing so also allows for relaxing these experimental constraints.Specifically, we demonstrate that this can allow such algorithms to learn when existing methods fail.Our algorithm augments traditional particle filtering methods with a dynamically generated tree describing the structure of a Bayesian posterior as a hierarchal clustering of particle filters.We rely on statistically principled approaches to model selection to remove structural elements that are inconsistent with the observed data, ensuring that the trees generated by our algorithm usefully represent its state of knowledge.
The advantages offered by our structured filtering algorithm are especially relevant in the case of degeneracy, which presents a significant challenge to existing work.Conventional methods are either inefficient or subtly bias the inference towards unimodal distributions.If the model for a learning problem contains two sets of parameters that are nearly equally likely for the data observed, these efficient inference algorithms usually fail in catastrophic ways.Methods for dealing with this, such as annealing [9], qualitatively fail to give us a solution because they cause the solution to focus on just one of the families of degenerate models.Commonly used solutions such as these are therefore, at best, imperfect solutions.Furthermore, these limitations prevent the application of these methods to cases where dynamical systems are probed using uninformative experiments, which often yield outcomes that are consistent with several hypothetical dynamical models; that in turn places several experimental constraints that are relaxed by our algorithm.
We begin in Section II by reviewing Bayesian inference as a framework for learning parameters from experimental observations, then introduce our structured filtering algorithm for Bayesian inference in Section III.In Section IV and Section V, we present numerical evidence for the efficiency and correctness of our approach before concluding in Section VI.

II. BAYESIAN INFERENCE
Bayesian inference has become a fundamental tool for modeling quantum systems.The basic object in Bayesian inference is the prior distribution.The prior distribution is a probability distribution over a set of hypotheses that could describe the system.In particular, let us assume that we have a model for a data set that is parameterized by x then the prior distribution is P(x).The prior distribution represents any beliefs that the experimenter may hold about the true model before processing any subsequent data.While the prior distribution is subjective, the Bernstein-von Mises theorem [10] states that under most circumstances a poorly chosen prior will simply slow down the inference.This subjectivity has made the use of prior distributions a source of contention between frequentists and Bayesian statisticians.If one dislikes the Bayesian interpretation of the prior it is possible to eschew the discussion of the prior (almost) completely by making the initial prior uniform and avoiding adaptive strategies.Moreover, by performing Bayesian inference on an artificial likelihood function, one can derive a useful approximation of maximum-likelihood estimation [7], such that the utility of Bayesian inference does not hinge on adopting a particular philosophical view.That said, we will take a Bayesian interpretation of probability in the following as a matter of convenience.
The likelihood function is the second component of Bayesian inference.Its purpose is to compute the probability of observing a vector of experimental outcomes D given that a hypothesis x is true.It is denoted P(D|x; t).For example, in quantum mechanics consider x = [ω] to be a Rabi frequency for the Hamiltonian H(x) = ωσ x /2.Then given a state e −iH(x)t |0 and that D = [0] is observed, the likelihood function for this experiment is given by The likelihood function is then used to update the user's prior beliefs, conditioned on the observed data, using Bayes' theorem: P(x|D; t) = P(D|x; t)P(x) P(D|x; t)P(x)dx . ( The probability density P(x|D; t) is known as the posterior distribution, and can be interpreted as the probability distribution that an experimenter should hold after being confronted with data D given their preconceptions, which are represented through the prior P(x).
Except in a few special cases, such as when conjugate priors are used, exact Bayesian inference is intractable.This is often addressed by using Sequential Monte Carlo (SMC) [8] approximations, also known as particle filters.SMC works by approximating the probability density using a convex combination of Dirac-delta functions.In particular, the probability density at each step is approximated by This notably reduces the integral in Bayes' theorem to a discrete sum over N part discrete hypotheses, each of which is called a particle.The w j are positive real numbers called weights, and satisfy ∑ j w j = 1.Thus the SMC approximation can be interpreted as a discrete probability distribution.
A drawback of SMC is that as the algorithm proceeds, the vast majority of the weights tend to zero.This is because, with high probability, none of the initial particles correspond to the true hypothesis.Thus, in the case of Bayesian inference, the particle filter approximation above will require a number of particles that is exponential in the length of x to estimate the true hypothesis within fixed error.
The exponential scaling of naïve particle filter approximations can be addressed by moving the particles and changing their weights during the inference process.This allows us in many cases to achieve an arbitrarily accurate approximation of the true model with a small (in some cases constant) number of particles.Conventionally, this is done by resampling particles to draw a new set of particles representing the same posterior distribution.Resampling algorithms exploit a duality in the SMC approximation: any probability density can be described either by choosing the weights of particles appropriately or by moving them such that their concentrations represent the probability density.Resampling takes the latter tack.It resamples the particles from a new distribution that maintains certain  structural features of the posterior and assigns all the new particles to have the same weight.This in effect reconcentrates the particles in regions where higher probability regions while removing them from places where the probability is low.One such resampler that has gained great popularity for physics applications is the Liu-West resampler [11].The Liu-West resampler draws N part particles from the probability distribution P(x).The mean µ and covariance matrix Σ are then computed for the SMC approximation.The resampler then, for parameter a ∈ [0, 1], shifts the particle slightly towards the mean of the SMC distribution, letting µ j ← ax j + (1 − a)µ.Finally it draws a new particle from the distribution N (µ j , [1 − a 2 ]Σ) and assigns a weight of 1/N part to the particle.
In cases where a = 1, Liu-West resampling is equivalent to the bootstrap filter [12], in which each new particle is drawn from the original SMC distribution with replacement.If a = 0, the resampler simply draws particles from a Gaussian that matches the posterior distributions mean and covariance, as is useful for rejection filtering [13].Typically a = 0.98 works well in practice [11]; although the optimal value can depend on the likelihood function.Furthermore, for any a ∈ [0, 1], the distribution of accepted samples has the same mean and covariance as the initial SMC approximation.Thus the Liu-West resampler is specifically designed to preserve the first two moments, keeping much of the structure-preserving features of the bootstrap filter, while allowing a model to be estimated with exponentially fewer particles than the bootstrap filter requires.
The Liu-West resampling algorithm works in a wide range of applications [5,[14][15][16], providing a practical means of implementing Bayesian inference in experimental practice.There is a major drawback to this approach though, in that the benefits of the Liu-West algorithm do not extend to multi-modal distributions.By choosing to perturb the resampled particles towards the mean of the SMC distribution, we implicitly bias the distribution towards a unimodal posterior.In cases where the distribution is multi-modal this strategy can fail horribly because very little probability density is actually located near the mean.We see this in Figure 1, where we consider learning the Rabi frequency given by (1), using Liu-West resampling for ω ∈ [−1, 1].In this case, the likelihood function assigns equal values for both positive and negative frequencies.We refer to likelihood functions that have such symmetries as degenerate.This means that regardless of the true frequency, exact Bayesian inference on a uniform prior will always yield a bimodal distribution with zero mean.Since the Liu-West resampler always moves particles towards the mean, we expect Liu-West resampling to fail [17] and indeed notice such a failure in the numerical experiments given in Figure 1.
In some ways this is a trivial problem, since the experiment cannot possibly distinguish between positive and negative frequency.Thus, if the user is aware of this degeneracy, they can choose ω ∈ [0, 1] and learn the sign in  subsequent experiments.We see that this approach is successful in Figure 2.However, not all degeneracies are so obvious or so easily countered.For example, the degeneracies that appear in randomized gap estimation (RGE) [18] are significantly harder to incorporate than in the Rabi example.Rather than placing the onus of data processing on the user, it is highly desirable to have a method for automatically addressing such problems as they appear.Similar challenges emerge when learning with nearly degenerate likelihood functions.Current practices for mitigating this include using significantly more particles and less frequent resampling steps, avoiding transient failures of the Liu-West algorithm [19].In such cases, Liu-West resampling will ultimately be successful once enough data has been accumulated to break the degeneracy.However, until the degeneracy can be resolved the algorithm needs to keep track of all the potential hypotheses that could explain the distribution, mandating a much larger number of particles.Liu-West resampling will often fail long before it is able to break the degeneracy.Such problems are often addressed by using an approach known as annealing (which is a distinct concept from simulated annealing) with multiple restarts [9], however such approaches do not give a good estimate of the posterior distribution and require substantial fine tuning.Instead, it would be very useful to have a resampler that infers and preserves the actual structure of the posterior distribution without requiring domain knowledge.In the next Section, we detail our algorithm for finding appropriate mixture models and applying resampling on the resultant structure in a way that preserves even highly multi-modal posteriors.

III. STRUCTURED FILTERING
How does one discover the structures that need to be preserved in the posterior distribution to allow resampling to succeed for degenerate distributions?One simple approach is to cluster the posterior using an algorithm like k-means clustering [7]; however, unless we have domain knowledge it may not be clear a priori how many clusters to use.Here we take a somewhat bold step and use an AI-based approach that allows a computer to entertain multiple potential clusterings and reason about which clustering is best.By allowing the algorithm to decide upon the structures that are most relevant and are most parsimonious with future data, the problem of dealing with spurious near-degeneracies can be avoided.
We discuss the components of this AI-based solution below.First, we discuss clustering.Second, we discuss our method for selecting a model for the posterior.Third we discuss how clustering and selection can be combined together in a single graphical framework.Finally we discuss the details of how our structured filtering algorithm combines these features to learn a model for the posterior distribution.,000 weighted particles.The positions of each particle were drawn uniformly at random, while the weights were chosen to represent a mixture of four Gaussian distributions.The magnitude of each weight is visualized by the size of the points in the figure, with very low-weight particles being ommited for visual clarity.The centroids found by each algorithm are indicated by stars.The unweighted clustering was performed using SciKit-Learn [21], while the weighted clustering was performed using Algorithm 2.

A. Weighted and Unweighted Clustering
Clustering seeks to solve a problem that is often second nature to humans: clustering data points into groups of related examples.While humans excel at this in low dimensions, getting computers to effectively cluster data with the same robustness that humans exhibit can be comparably challenging.Perhaps the most popular algorithm for clustering data is the k-means clustering algorithm.The algorithm seeks to divide a set of N points into k clusters that minimizes the intra-cluster variances of the clusters.Specifically we define S p to be the p th cluster, and then seek cluster centroids {y p : p = 1, . . ., k} to satisfy An exact solution to ( 4) is unlikely to be found in general.In fact, finding an optimal cluster assignment is known to be NP-complete, which means that the existence of any efficient algorithm for finding the optimal clusters would imply P = NP.Since it is widely conjectured that P = NP, it is fair to assume that k-means clustering is hard in general.Despite the apparent difficulty of such clustering problems, it can be shown that small perturbations about the hard instances can render them efficiently solvable [22].Thus the average complexity of clustering is polynomial, which is why at least approximate clustering is not a computationally challenging task.
The k-means clustering algorithm is simple.Given k clusters of vectors compute the centroids of each cluster and assign {y} to these values.For each vector (data point) in the set find the cluster centroid that is closest to the example with respect to the Euclidean norm and take each S p to be the corresponding cluster.This procedure is then repeated until convergence is reached.
The standard k-means clustering algorithm cannot be directly applied to weighted data, such as particles in a SMC approximation, as seen in Figure 3.The reason for this is that the objective function in (4) applies a penalty solely on the distance between the vectors and the centroid.This is to say that it considers all vectors in the data set equally important, irrespective of their weights.
Fortunately we can address this by using a weighted k-means algorithm.This algorithm instead seeks to minimize The weighted k-means algorithm proceeds exactly as the unweighted version except the cluster centroids are computed as the expectation values of the vectors in each cluster (after renormalizing the weights into a probability distribution) rather than simply by summing the results.We note in Figure 3 that this modification allows multimodal weighted data to be appropriately divided into clusters.A formal algorithm for this procedure is given as Algorithm 2 in Appendix B.
We utilize the weighted k-means algorithm with the k-means++ heuristic for initial centroid selection [23] to divide our cluster into a set of different clusters.In practice, practitioners will often choose which value of k to use in modeling distributions by plotting the objective function (4) and choosing an inflection point, as such inflection points are suggestive of overfitting.By contrast, model selection provides a more formal approach to reasoning about overfitting [24].In particular, below we discuss the Bayes factor [25], which allows us to algorithmically decide on the correct cluster number k for application to posterior distributions.

B. Model selection for the posterior
Ultimately, the task of selecting the number of clusters to use to represent a SMC approximation to a posterior is one of model selection.For example, we could have one model that uses k = 2 another that uses k = 4 and we wish to know which model (i.e.clustering) does a better job of representing the data.In this case there's a straight forward approach: consider both clusterings for the posterior and only make a decision between the two when sufficient evidence mounts for the superiority of one of the two models.
We use Bayes factors to compare the validity of two models.The Bayes factor can be thought of as a generalization of the likelihood ratio test, and is defined for models M 1 and M 2 and data record D as Here K > 1 implies model 1 is superior to model 2 and vice versa.It reduces to the likelihood ratio test when the prior over models is uniform (P(M 1 ) = P(M 2 )), and when priors within each model are chosen such that and similarly for the second model.However, Bayes factors have a very nice feature absent from the likelihood ratio test.The values of the integrals in ( 6) depend on the volumes of the parameter spaces of the models.This penalizes models with more parameters, and gives us a simple alternative to model selection that does not involve maximization as in the Bayesian information criteria [26].This means that Bayes factors can easily be computed using SMC approximations from the particle weights and the likelihood function [27].
In practice, we also use Bayes factors to choose one option among several possible competing options.In such cases, we compare the values of K for all possible pairs of models that we want to compare and then select a model once max k (min j K(M j , M k )) ≥ K champ where K champ is a user specified threshold.Typically a value of K champ ≥ 100 is considered strong evidence in favor of one of the models.In practice, we usually take K champ ≥ 2000 to be our threshold.We do this because excluding the correct number of clusters can often have a worse impact on the algorithm performance than entertaining a hypothesis than is likely false and because we perform many such tests in structured filtering.

C. Graphical Models for Posteriors
In order to convert the problem of automatically choosing the optimal number of clusters in the data into one that a computer can easily solve, we introduce a graphical model for describing the reasoning process that is employed to decide between different clusterings of the posterior distribution.The graphs we consider are directed rooted trees with edges pointed away from the root node, but more general directed acyclic graphs could also be considered.We provide examples of our graphical notation in Figure 4.
The vertices in these graphs are given one of three distinct labels, in addition to their index.The first such label denotes that the vertex is a filter node, which serves as a container for a set of SMC particles that we assume are approximately unimodal.These nodes are denoted by squares.It is important to note that the filter nodes do not necessarily need to have the same likelihood functions, allowing for the inclusion of more general model selection problems.Furthermore, we do not need to use Liu-West resampling inside each of the filter nodes.Other filters such as the bootstrap filter, assumed density filtering [28] or rejection filtering [13] can be used in its place.
The second type is a mixture node.The mixture node defines a distribution that is a weighted mixture of the subtrees that descend from it, but contains no particles itself.For example, a 2 cluster approximation to the posterior would be described by a mixture node with two filter nodes as leaves.We denote mixture nodes as triangles.The ability to mix multiple nodes has several interesting properties.First, in principle we can emulate a filter node consisting of many particles with a mixture of single-particle filter nodes.Second, we can have mixtures of mixtures.This allows us to generate very rich clusterings even if the maximum degree for the graph is 3 (that is, if we restrict each node to have at most 2 children and one parent).
The final label that a vertex can be assigned corresponds to a decision node.A decision node is a mixture node but is put in place explicitly to test between the hypotheses described by the subtrees that descend from it.Such nodes are denoted with a circle.By convention, the root node is always a decision node .The edges in our graphical model describe the relationships between the different types of nodes.By definition a filter node has no children, and is hence always a leaf node; however, mixture nodes and decision nodes must have children.The edges between any two nodes are used to assign weights.In a mixture node these weights are used to set the weights properly for the particles that reside in the filter nodes within the subtrees that descend from them.In particular, the actual weights are the products of the weights within the filter nodes and all outgoing edges of mixture nodes that connect it to the root.In a decision node these weights serve to represent the algorithm's confidence that one of the competing hypotheses, described by the descendant subtrees, is correct.
The inference process will often uncover structure in the posterior that was not apparent in earlier steps of the learning process.This necessitates adapting our graphs in order to match the changing structure.We do this using three simple rules, which we demonstrate in Figure 5 and Figure 6.The champion rule and floor pruning rules are designed to simplify the graph when certain structures are not needed to represent the data.The champion rule states that when the weight of one edge overwhelms the sum total of the other weights sufficiently then all other hypothetical subtrees are disregarded except for the "champion."The floor prune rule immediately eliminates a subtree if the edge weight is smaller than a threshold.This rule is useful because it allows the algorithm to free up memory and processing to address more fruitful clusterings when one has been effectively excluded by the data.
The splitting move, on the other hand, increases the complexity of the graph.It takes a particle filter node and replaces that node by a subtree, as illustrated in Figure 6.Specifically, we replace the node with a decision node that has the original filter node as a child and also a mixture node that has at least two filter nodes as children.Although splitting moves could be performed at any time, we only deploy them during a resampling step to optimize the performance of the algorithm.These moves are implemented in our algorithm by taking the particle filter in the original node and using weighted k-means clustering, using weighted k-means++ to divide it into 2 or more clusters.This guarantees that the new mixture of particle filters is equivalent to the initial particle filter before resampling is applied.After splitting, each particle filter leaf node is then resampled independently.Thus, in the branch of the decision node corresponding to the correct number of clusters, the resampling takes place only locally within each cluster and preserves the multi-modal structure of the posterior.
A complication comes in with the number of particles assigned to each cluster.By default we divide the particles into two sub-clusters, assigning each original particle to one of the new clusters using the labels returned by the kmeans algorithm.Thus, we retain the total number of particles through a splitting move.However, this presents the danger that a cluster with a small number of particles will become numerically unstable, even if the weight assigned to that cluster is large.To remedy this, we also allow the number of particles used to vary dynamically by setting a minimum number of particles in each cluster.For example, imagine we wish to split a cluster with 1800 particles into 3 clusters and we set the minimum number of particles per cluster to be 1000.Then, at least one cluster would ordinarily be assigned less than the minimum number of particles that we have decided upon.Our algorithm will draw additional particles with correspondingly smaller weights for such clusters, preserving numerical stability while introducing no new approximations.This results in 3 clusters with at least 1000 particles each.Since we choose to apply restructuring moves (such as splitting) only when a resample would be triggered by the Liu-West resampler, generating additional particles is easy to do by following the same perscription used in the resampler.
Applied directly, the splitting move would generate exponentially-large trees for even simple models.We limit this by imposing a maximum depth d max at which the splitting move may be applied.If a particle filter node with depth d ≥ d max must be resampled, then we apply traditional Liu-West resampling at that node instead.In this way, we can control the maximum size of the structure that our algorithm is allowed to explore.
Though these three moves are sufficient to correctly implement structured filtering, we also consider two other pruning moves which reduce graph complexity without additional approximation.These moves allow the algorithm to reduce the depth of the tree dynamically as the floor and champion rules discussed above eliminate uninformative branches of the tree.In particular, the only-child and single-child pruning rules shown in Figure 7 replace the current tree with a simpler tree describing the same structure by eliminating redundant intermediate nodes.Under the onlychild pruning rule, a node is eliminated if it is the only child of its parent and has one or more children, such that those children can be attached directly to its parent.Similarly, the single-child rule removes any node with exactly one child, and places that child directly onto its grandparent.Trees matching the preconditions for only-child and single-child pruning are generated by applying the champion and floor pruning rules, as each of those eliminate leaf and intermediate notes that do not contribute substantially to the final estimate.By removing these intermediate nodes, the depth of nodes relevant to the final estimate can be decreased, allowing for the splitting move to be applied again.
Importantly, the floor, champion, and splitting moves are each parameterized, offering quality parameters for the particle filter approximation represented by the output of these moves.To allow these pruning and splitting parameters to dynamically depend on the tree structure, we encode them as properties of each node, collectively called a context.In this way, our algorithm can be customized with various starting trees representing prior knowledge about initial clustering, model selection problems of interest, or other structure of interest.Structured filtering recognizes the following context parameters, each of which may be specified at a given node, or inherited from the a node's parent:

Prune (boolean):
If this context parameter is set to false, then no pruning is applied at this node; in particular, no floor, champion, only-child or single-child rules are applied.For brevity, we omit this context parameter in the algorithm below.

Mixture floor (real):
This context parameter sets the minimum edge weight from a mixture node to one of its children that will be preserved by floor pruning.

Decision floor (real):
This context parameter sets the minimum edge weight from a decision node to one of its children that will be preserved by floor pruning.

Decision champion K champ (real):
This context parameter sets the ratio by which the edge weight from a decision node to one of its children must exceed the sum of all other outgoing edge weights from the same decision node before that child will be considered a champion node.

Decision region estimation champions (integer):
This context parameter controls the number of children of each decision node that will be kept when reporting region estimates, as described in Section III D. Figure 6.An example of a splitting move, in which a single particle filter node is replaced by a model selection over different clusterings, each of which is represented by an appropriate mixture model.In this example, the structured filter considers n clusters ∈ {1, 2}; since the n clusters = 1 mixture model node would be redundant, it is immediately eliminated, promoting its only particle filter child to be a child of the root model selection node.Similarly, the new decision node over the number of clusterings is redundant with the root decision node, and is also eliminated immediately.

D. Structured Region Estimation
As described by Ferrie [29], clustering can also be used to report region estimates of higher posterior density than unclustered methods.We use and generalize this observation by exploting the tree structure generated by splitting and pruning moves to form powerful credible region estimators.In particular, each particle filter (leaf) node already yields conventional region estimators such as covariance ellipsoids, convex hulls, and minimum volume enclosing ellipsoids [5], such that we can complete our region estimation procedure by specifying region estimators for each decision and mixture node, recursively.
Following this strategy, at each mixture node, our estimation procedure assigns a region estimate that is the union of the region estimates for each of its children.At each decision node, our procedure reports the union of the first n of its children's region estimates, with its children arranged in descending order by their weights, and with n obtained from the corresponding context parameter, as described in Section III C. The final region estimate X α can thus be interpreted as guaranteeing that the probability the model vector x is within X α , conditioned on the model with the highest posterior probability, is at least α.Importantly, the credibility parameter α is only used at the leaf nodes as a parameter to the local region estimation procedure.

E. Structured Filtering Algorithm
Our algorithm depends on two global parameters as well as the context parameters described above.d max (integer): This global parameter sets the maximum depth that a node is allowed to be from the root in the structure graph.
{n cluster,i } (set of integers): This global parameter sets degrees of each vertex in the structure graph that will be considered by splitting moves.This imposes a limit on the maximum number of clusters for the data of (max i n cluster,i With this description in place, we now present our algorithm in full as Algorithm 1. Important subroutines are listed separately in Appendix B. The splitting move used in Algorithm 1 is demonstrated graphically in Figure 6. The numerical examples shown in Section IV, Section V and Appendix A were obtained using an implementation in Python 2.7 (Anaconda distribution), with the NumPy [30], SciPy [31], and QInfer [32] libraries.It is therefore easier to learn the eigenvalues, which are unconstrained, than it is to impose the appropriate constraints on the gaps.Importantly, the RGE likelihood function is highly degenerate, as (7) only depends on {∆ i,j } and not on the eigenvalues of interest.These degeneracies can be included analytically, so that we can verify the estimates obtained by structured filtering.It is worth noting that in cases where the eigenvalues are randomly distributed then with high probability there will only exist two degenerate orderings and all other orderings can in principle be resolved by solving the turnpike problem.Despite this, many approximate degeneracies are likely to occur in such settings and these can be just as hazardous to learning as exact degeneracies.As such, there is still a major need for structured filtering even in unstructured gap estimation problems.
As a final point, in practice it is likely that experimental data will be provided for RGE in batches if processing time for each update is long relative to data acquisition time.We deal with this by assuming the data is acquired using n meas measurements and the resulting distribution of 0 or 1 measurements is given by a binomial distribution with p = Pr(0|H; t).In practice, we take 3 measurements in our numerical experiments.
There are many ways that we could pick the experimental parameter t.The best method is to choose θ to minimize the Bayes risk using a numerically optimized strategy.This approach, while highly successful, is computationally expensive [5].Another approach is to use a heuristic known as the particle heuristic (PGH) [16].The PGH, which is appropriate for periodic likelihoods like (7), guesses t ∝ 1/σ where σ is an estimate of the uncertainty in the posterior distribution.Here we approximate this by drawing unique x 1 and x 2 from the prior distribution and take The PGH is known to lead to asymptotically optimal scaling for phase estimation and slight variants of this have been shown to come close to saturating the Bayesian Cramér-Rao bound under the assumption of a Gaussian prior; however, less is known about its performance in multi-modal settings.While this sampling procedure is straight forward for unstructured filtering, it is slightly more complicated in structured resampling.This is because the structured filter is simultaneously entertaining a number of different hypotheses about the clusters that compose the posterior distribution.We avoid this problem by computing the Bayes factors for each such hypothesis supported by the structured filter and only apply the PGH on the hypothesis that has the smallest Bayes factor.That is, we purposely choose the experimental time based on the least probable explanation for the data.We make this choice because it is likely to be conservative and also because it demonstrates the algorithms ability to learn despite being provided with inferior data.The samples are then drawn from the remaining filter nodes according to their weights as per the PGH.
We first benchmark the performance of structured filtering as well as unstructured filtering for estimating the gaps in a three-level system (qutrit) where we have assumed without loss of generality that the lowest energy eigenvalue is λ 0 = 0. We then define the eigenvalue gaps to be ∆ i,j = λ j − λ i .For convenience, let us assume that λ 2 ≥ λ 1 .The three gaps in the problem are then ∆ 0,1 , ∆ 1,2 , ∆ 2,0 .The gaps that are learned can specify, up to an additive constant, the largest energy eigenvalue i.e. max{λ 1 , λ 2 } = ∆ 0,2 = ∆ 0,1 + ∆ 1,2 .They cannot, however, uniquely give the first Posterior distributions for using structured filtering to learn potential eigenvalues for randomized gap estimation with three eigenvalues where one is without loss of generality taken to be 0. The true model has unknown eigenvalues (0.75, 0.15) and the symmetries of the problem imply the final posterior should ideally be concentrated at (0.75, 0.15), (0.15, 0.75), (0.6, 0.75), (0.75, 0.6).The floor threshold was set to 0.1 and champion threshold was set to 20.For this data and the graph was set to have maximum degree 3 and maximum depth 4. Liu-West resampling with a = 0.98 was used to cluster each cluster in the posterior and a minimum of 1000 particles was assigned to each cluster.A video showing the operation of our algorithm is provided in the supplementary material, or online at goo.gl/4NKYaX.excited state.This is because both λ 1 = ∆ 0,1 and λ 1 = ∆ 1,2 are consistent with the data.There is also an obvious symmetry in this problem wherein λ 1 ↔ λ 2 leaves the likelihood invariant.In order to address these degeneracies in our assessment of the algorithm, we define a canonical ordering of the eigenvalues and assignment of the gaps that in effect removes the four-fold degeneracy.We then compute the canonical quadratic loss as which we use as our figure of merit for the inference.Here each w i is a particle weight taken from the clustered posterior distribution.The multi-modal nature of the posterior distribution can be seen in Figure 8, wherein we consider a structured posterior that consists of a choice between 1 and 2 clusters at each splitting, with d max = 4, such that our algorithm includes the optimal number of clusters 4 as a possibility.The figure provides the posterior distribution of every particle in the filter wherein the size of the particle denotes the weight assigned to it in the filter node and all preceding mixture and decision nodes in the structure graph.See supplementary information for an animation of the complete inference process.We observe that after only 100 updates, the structuring algorithm has clearly identified the structure of the ideal posterior, although it has not yet conclusively learned the number of clusters.After 150 updates we observe peaks in the probability density begin to congeal around the true eigenvalues (in particular near (0.75, 0.15)).
At 250 updates we see that the structured filtering algorithm correctly identifies all 4 of the equivalent degenerate solution for the eigenvalues.The algorithm further recognizes that the 4 cluster model for the data is by far the best model for the posterior distribution.This is signnificant because we make no apriori assumption that the ideal posterior is composed of 4 clusters.Furthermore, apart from giving an accurate point estimate of the eigenvalues, we obtain a posterior distribution from which uncertainties in these four estimations can be gleaned.This information is of great value in experiments because principled estimates of uncertainty are often hard to find.By contrast, our algorithm outputs them by default.
In addition to outputting the posterior distribution, structured filtering can output α-credible regions and also structural information about the posterior distribution.Figure 9a provides an α = 0.95 credible estimate of a region where the true eigenvalues can be found.Notably, because this region estimate does not output a single pair of eigenvalues its interpretation remains consistent regardless whether structured or unstructured filtering is used.The region estimate also notably overlaps with the four modes of the ideal posterior distribution in this example.
Figure 9b gives the structure graph for the posterior after 300 updates.The three mixture nodes that directly descent from the root node reveal that the algorithm has completely excluded 1, 2 and 3 cluster models as viable explanations for the data.This structure again was not imposed upon the posterior, nor was it directly included via our splitting rule.Instead this structure emerged through the course of the 300 updates by following the rules set out by structured filtering for both growing as well as pruning the structure tree.This shows that structured filtering is not only able to learn patterns present in a complex posterior without substantial coaxing by the user, but also convey these inferred structures in a concise human-parsable format.Additionally, by tracing through the structure graph from the root node, we can see that a four cluster (or approximately four cluster) model is preferred in each branch considered.In particular, each of the four children of the mixture nodes closest to the root are decision nodes that prefer single-cluster explanations of the posterior.This illustrates that structured filtering is capable of more  than just inferring the structure of the posterior distribution: it is also capable of reporting it in a human readable format.
Figure 10 shows that this degeneracy in the likelihood function causes LW to fail in both the mean and median L can after roughly 200 experiments where it saturates at a value on the order of 10 −3 .This is to be expected because the unimodality assumption implicitly made in Liu-West resampling will generically fail here just as it did in Figure 1.
Structured filtering, on the other hand, can learn the correct eigenvalues within numerical precision with high probability after only 1000 experiments, and further we see clear evidence of exponential scaling of the mean canonical loss.This implies that even the worst case behavior of the algorithm is not pathological.We find from linear regression that after 400 experiments mean(L can ) ≈ e −0.0083n−6.7 .The data for the median does not demonstrate a single clear asymptotic scaling.The appearance of two different time scales for the problem likely correspond to the timescales needed to distinguish the two degenerate possibilities for the eigenvalues (i.e.solve the turnpike problem on the gaps) and the timescale needed to refine this knowledge once a degenerate set of solutions to the turnpike problem is found.
This clearly shows that in cases where we have a multi-modal posterior that structured filtering can succeed where the gold-standard particle filtering methods used in quantum experiments fail.It succeeds here because structured filtering is capable of recognizing the multi-modal nature of the posterior and adjust the particle filter accordingly.LW cannot succeed because it always assumes a unimodal posterior and here the ideal posterior has 4 modes.For this reason, we expect structured filtering to be a broadly applicable method capable of dealing both with the degeneracies that appear in randomized gap estimation and also approximate degeneracies that appear in other applications.

V. APPLICATION TO COLLAPSE-FREE PHASE ESTIMATION
Phase estimation has become a ubiquitous algorithm in quantum computing because of its ability to learn eigenvalues of a unitary using quadratically fewer queries to that unitary.Of the many variants of phase estimation, iterative phase estimation has gained in popularity owing to the fact that it does not require a large qubit register to store the estimated eigenvalues [33].It works by replacing the quantum interference step used in traditional phase estimation with an adaptive inference algorithm to learn the most likely phase given a set of measurements.This approach requires a number of queries to the blackbox that is optimal to within a small multiplicative factor and is thus also the preferred technique if speed is also an issue.
Both iterative and traditional phase estimation require long sequences of gates in order to learn the eigenstate, which is perhaps the biggest reason why most simulation experiments eschew this approach at present [34].These long sequences arise because long experimental times are needed to unambiguously collapse the state onto an eigenvalue with a small number of experiments according to energy/time uncertainty.One approach to address is this issue is to shift the burden of collapsing the state from the quantum computer to the classical inference algorithm by applying phase estimation on a fresh copy of the initial state with each experiment.We call this approach collapse-free phase estimation as it does not collapse the quantum state onto an eigenstate of the Hamiltonian.
A challenge facing collapse-free phase estimation is that signal is received from multiple eigenvalues when the input state is a superposition of different eigenstates.Thus any estimate of a single eigenvalue for the distribution will have to disambiguate data that comes from distinct eigenvalues.One approach to dealing with this issue is to mimic wave function collapse by implicitly introducing biases against data that is more likely to come from other eigenvalues.This approach is taken by Wiebe and Granade [35] and by Santagati et al. [36], wherein a Bayesian form of phase estimation is used to introduce such biases through applying a unimodal approximation to the posterior.
While unimodal approaches do not utterly fail here, they are not expected to perform as well because of the multimodal nature of the ideal posterior if multiple eigenvalues are present.To examine this further, let us first define where E j is the j th eigenvalue of a Hamiltonian H. Given such an initial state, the likelihood of measuring "zero" in the two outcome phase estimation experiment for e −iHt given experimental parameters t and θ is In collapse-free phase estimation, we do not know how many eigenvalues there are in the support of |ψ .We also do not want to track them explicitly because there are exponentially many in the worst case.Instead, we use the following likelihood function to model the experiment: where h ∈ [0, 1] is a hedging parameter that is used to combat overconfidence in an update.This model in essence assumes that all the data arises from a single eigenvalue, and aims to estimate the most likely single E given data that arises from (11).We similarly use (8) to choose the experimental times and take θ = x 2 in the PGH.Since this heuristic is known to provide asymptotically optimal scaling in cases where only one eigenvalue is present [16], we anticipate that it will provide similar advantages here as well.
We consider two eigenvalues with uniformly distributed values between 0 and 1 and attempt to learn one of the two eigenvalues.If the two eigenvalues are E 1 and E 2 then the loss function we consider is: where x i is the position of particle i each of which corresponds to an eigenvalue of H.We then consider the mean and the median over 1000 randomly pairs of eigenvalues, where we pick the worst case scenario of a 1 = a 2 = 1/ √ 2. We examine the performance of structured filtering and Liu-West resampling for this problem in Figure 11.Specifically we see that while both methods are capable of rapidly learning one of the eigenvalues, Liu-West resampling performs substantially worse.This is not simply because structured filtering used more particles; Liu-West resampling does not give significantly better results when we allow it more than 12000 particles.These differences are, unsurprisingly, most striking in the mean.With Liu-West resampling linear regression shows that the mean minimum quadratic loss decays roughly as L min ≈ [8.1 × 10 −3 ]e −0.0095n whereas the quadratic loss decays for structured filtering as L min ≈ [6.7 × 10 −3 ]e −0.0164n , where n is the number of experiments.This shows that the two asymptotic scalings differ polynomially, specifcially by a power of rougly 3/2.This shows that structured filtering can improve the performance of collapse-free phase estimation compared to Liu-West resampling, which is arguably the most powerful efficient inference method previously used for such problems.

VI. CONCLUSION
Existing methods for parameter estimation in quantum systems are implicitly biased via structural assumptions about the posterior distribution.Here we show that even the subtle biases introduced through filtering with Liu-West resampling can catastrophically fail to estimate parameters for problems where the experiments chosen are incapable of distinguishing between two equivalent hypotheses.We address this by introducing a method that can adaptively reason about the structure of the posterior distribution and break it up into clusters that individually are ammenable to Liu-West filtering or other methods that are appropriate for unimodal distributions.Specifically we represent the algorithm's beliefs about the structure of the posterior as a weighted graph that describes the different possible clusterings for the posterior distribution, and allow it to invent new hypotheses or eliminate previous ones via a discrete set of graph manipulations.By using this approach we grant structured filtering the ability to introspect on its own beliefs about the true model parameters.This introspection is key to the success of our algorithm.
We note that by allowing the inference algorithm to adapt its assumptions about the structure of its current state of knowledge we allow Bayesian inference to succeed in problems, such as randomized gap estimation, where existing leading methods fail.It also manages to outperform its unstructured counterpart in learning eigenphases of unitary operations in settings where the data comes from multiple eigenvalues.This illustrates the power and broad applicability of the technique.
Looking forward, there are many ways that these methods can be built upon.Firstly, while our approach does an excellent job of approximating the posterior distribution it does a less perfect job of approximating the regions of importance in the posterior distribution.In particular, our rules for restructuring the graph assume that regions of hypothesis space that have low probability have low importance; however, since Bayes' rule is additive in logarithmic space, wild fluctuations can occur in the posterior probability.Such fluctuations can be common in cases like collapse-free phase estimation because the approximate likelihood does not match the true likelihood.This means that it may be important for the structured filters to learn how to distribute particles to accommodate data that is given low probability by the assumed likelihood function.
Furthermore, structured filtering can be viewed as an application of a family of techniques known as probabilistic programming.Further work will be needed to see if subsequent insights from probabilistic programming may yield even more powerful representations for the posterior distribution.
Finally, while we have provided a proof of principle that structured filtering can solve many of the problems plaguing SMC approximations in physics, it remains to apply them experimentally.It is our hope that these methods may prove to be of great use estimating Hamiltonian parameters that have subtle influence on experimental likelihoods, such as those that appear in second order corrections to spectral line splittings.Structured resampling, and approaches like it, may finally have enough power and robustness to tackle such problems in an automated fashion relieving the experimentalist of much of the burden of designing clever experiments to learn hard to measure quantities.

Figure 2 .
Figure 2. Success of Liu-West resampling for capturing the unimodal posterior for the Rabi likelihood (1).The LW-approximated and exact posteriors are shown for 40 single-shot measurements at times t k = (9/8) k , with a uniform prior on ω ∈ [0, 1].

Figure 3 .
Figure3.Comparison of unweighted (left) and weighted (right) k-means clustering on 5,000 weighted particles.The positions of each particle were drawn uniformly at random, while the weights were chosen to represent a mixture of four Gaussian distributions.The magnitude of each weight is visualized by the size of the points in the figure, with very low-weight particles being ommited for visual clarity.The centroids found by each algorithm are indicated by stars.The unweighted clustering was performed using SciKit-Learn[21], while the weighted clustering was performed using Algorithm 2.

Figure 4 .
Figure 4. Examples of graphical models for (a) conventional particle filtering, (b) model selection between particle filters using distinct likelihood functions, and (c) model selection between two different clusterings of the posterior into mixture models.In each example, particle filtering is indicated by square nodes, model selection is indicated by circle nodes, and mixture models are indicated by triangle nodes.The shades of each edge correspond to their weights with darker edges corresponding to higher weight.

Figure 5 .
Figure 5. Examples of the champion and floor pruning rules applied to a model selection node with three initial child particle filter nodes.All such replacement rules work identically if the filter node is replaced with a subtree.

Figure 7 .
Figure 7. Examples of the only-child and single-child pruning moves, used to eliminate redundant intermediate nodes and reduce the depth of the structured filtering tree.Each of the pruning rules shown above exactly preserve the structure with a simpler tree.
Figure 8.Posterior distributions for using structured filtering to learn potential eigenvalues for randomized gap estimation with three eigenvalues where one is without loss of generality taken to be 0. The true model has unknown eigenvalues (0.75, 0.15) and the symmetries of the problem imply the final posterior should ideally be concentrated at (0.75, 0.15), (0.15, 0.75), (0.6, 0.75), (0.75, 0.6).The floor threshold was set to 0.1 and champion threshold was set to 20.For this data and the graph was set to have maximum degree 3 and maximum depth 4. Liu-West resampling with a = 0.98 was used to cluster each cluster in the posterior and a minimum of 1000 particles was assigned to each cluster.A video showing the operation of our algorithm is provided in the supplementary material, or online at goo.gl/4NKYaX.
Figure 9. (a) Region estimate and (b) structural tree for the example shown in Figure 8.Note the three mixture nodes that directly descend from the root in subfigure (b) indicate that the posterior consists of at least 4 clusters, as each of the four children of these mixture nodes are decision nodes representing at least one cluster.

Figure 11 .
Figure 11.Comparison of Liu-West resampling and structured filtering for collapse-free phase estimation with two true eigenvalues.
Comparison of Liu-West resampling and structured filtering for randomized gap estimation (RGE).Means and medians are computed using 1000 random RGE instances, with λ 1 and λ 2 sampled from the initial prior which is uniform on [0, 1] 2 .