Quantum Model Averaging

Standard tomographic analyses ignore model uncertainty. It is assumed that a given model generated the data and the task is to estimate the quantum state, or a subset of parameters within that model. Here we apply a model averaging technique to mitigate the risk of overconfident estimates of model parameters in two examples: (1) selecting the rank of the state in tomography and (2) selecting the model for the fidelity decay curve in randomized benchmarking.


I. INTRODUCTION
Parameter estimation is an integral part of physics. Accurate estimates of physical parameters in quantum mechanical models allow for precision quantum control [1,2], which enables practical goals such as quantum computation [3] and quantum metrology [4], which in turn can provide probes of fundamental physics such as gravitational wave detection [5].
At the other end of the parameter spectrum, the quantum state of a physical system is our most complete description of it. Thus estimation of quantum states [6] might be seen as the most ambitious form of estimation. There are many approaches to the general problem [25][26][27][28], some of which specialize for computational efficiency [29][30][31][32] and others which go beyond point estimates to the estimation of regions [33][34][35][36][37][38].
All of the above mentioned results assume a model. That is, it was taken as given that a particular parametric distribution generated the data. But what if this assumption is not correct? This question has garnered recent interest, and classical approaches to model selection have been used in a variety of experimental [43,44] and theoretical [39][40][41][42] works. Here we supplement these results with a new approach to parameter estimation: model averaging. The technique shares many similarities with model selection (in fact, model selection is a crucial component of model averaging) but goes beyond it. Although model selection adds a layer of protection against overconfident estimation, selecting a single model can itself be a red herring, for the most probable model might be only slightly more probable than the others.
The approach considered here combines Bayesian parameter estimation with Bayesian model selection, such that the final estimate of the parameters is the best value of the parameters within each model, averaged over the probability assigned to each model. We will show that such an approach can reduce the error incurred by first selecting a model (which has some probability of being incorrect) and then selecting parameters within that model. In fact, the numerical experiments presented here show that model average estimation always does better than the estimates from incorrect models and in some scenarios can perform better than even estimates from the correct model. This is due to the additional hedging afforded by considering multiple models, all of which carry some information.
The paper is organized as follows. In Sec. II we outline the problem and review the model selection techniques used so far. In Sec. III, we present the full Bayesian approach to model selection and define the model average estimate of parameters. Sec. IV presents two distinct examples where the model average estimate provides an advantage for parameter estimation. We conclude with a discussion in Sec. V.

II. GENERAL PROBLEM AND COMMON METHODS
In Sec. II A we overview the problem. In Secs. II B and II C we review the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are by far the most commonly used approaches to model selection (and the only ones used thus far for quantum estimation). These sections are included for reference and completeness.

A. Problem setup
Let us begin with the base problem of parameter estimation. A physical model prescribes the probabilities for the outcomes of experiments: Pr(D|x; C, M). Here D is some hypothetical or observed data, x is a set of real numbers in R^d, where d is the dimension of the model, C is the experimental context and M is our model. For example, we could take the model M to be that of a qubit, which is parameterized by a Bloch vector x = (x, y, z). The experimental context could be a measurement basis, say that of σ_x. Then, the quantum mechanical model prescribes Pr(±|x; C, M) = (1 ± x)/2. In quantum mechanics this is called the Born rule and in statistics, the likelihood function.
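As a minimal sketch, the single-qubit Born-rule likelihood above can be written directly (the function name is ours, for illustration only):

```python
def qubit_likelihood(bloch, outcome):
    """Born-rule likelihood Pr(outcome | x; C, M) for a sigma_x measurement.

    bloch: Bloch vector (x, y, z); outcome: +1 or -1.
    Only the x component enters for a sigma_x measurement.
    """
    x, _, _ = bloch
    return (1 + outcome * x) / 2

# A state with Bloch vector (0.6, 0, 0) gives Pr(+1) = 0.8.
p_plus = qubit_likelihood((0.6, 0.0, 0.0), +1)
```

The two outcome probabilities sum to one by construction, as a likelihood function must.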
Going from parameters to the probability of data is a deductive process-the model gives us numerical values of Pr(D|x; C, M ). The experiment, on the other hand, gives us a particular data set D-but what we really want is x. This is an example of an inverse problem. What the Bayesian solution provides is Pr(x|D; C, M ), the distribution of x given the data. Although we lack certainty about x, we can accurately (read: quantitatively, mathematically) describe our state of knowledge of x given we have seen the data D.
The formal path forward is through Bayes' rule. From a chronological point of view relative to D, we begin with the prior Pr(x|C, M), which encodes the information we have about x prior to learning about D. We weight the prior by the likelihood function and normalize by the marginal likelihood:

Pr(x|D; C, M) = Pr(D|x; C, M) Pr(x|C, M) / Pr(D|M; C),    (1)

where the marginal likelihood is the prior expectation of the likelihood,

Pr(D|M; C) = E_{Pr(x|C,M)}[Pr(D|x; C, M)].    (2)

The distribution produced as a result of Bayes' rule, Pr(x|D; C, M), is called the posterior and represents our knowledge of x after the data has been observed. At this point we could call the problem solved. This, however, assumes the model is correct. If the model is suspect, then we have the meta-problem of determining the best model.
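On a discretized parameter space, one step of Bayes' rule is a reweight-and-normalize operation, and the normalizer is exactly the marginal likelihood. A sketch (the grid, prior, and single-outcome data are illustrative assumptions):

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """One step of Bayes' rule on a discretized parameter grid.

    prior: array of prior probabilities Pr(x).
    likelihoods: array of Pr(D | x) at the same grid points.
    Returns (posterior, marginal_likelihood).
    """
    unnormalized = prior * likelihoods
    marginal = unnormalized.sum()  # Pr(D | M), the normalizer
    return unnormalized / marginal, marginal

# Uniform prior over the sigma_x Bloch component x in [-1, 1];
# observe a single "+1" outcome with likelihood (1 + x) / 2.
grid = np.linspace(-1, 1, 201)
prior = np.full(grid.size, 1 / grid.size)
posterior, ml = bayes_update(prior, (1 + grid) / 2)
# Posterior mass shifts toward x = +1.
```

The posterior mean moving above zero after a "+1" outcome is the discrete analogue of the update described in the text.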

B. Akaike Information Criterion (AIC)
The most commonly used model selection technique is the Akaike information criterion (AIC), which arises as follows. First, we suppose there is some true model, M = T, giving a distribution Pr(D|T; C). We quantify the discrepancy between this model and our candidate model via the Kullback-Leibler divergence,

KL(T‖M) = Σ_D Pr(D|T; C) log Pr(D|T; C) − Σ_D Pr(D|T; C) log Pr(D|x; C, M),    (3)

for some appropriate choice of parameters x. The "best" model, then, is the one that minimizes KL(T‖M), which is equivalent to maximizing the second sum in (3), since the first term depends only on the true model. However, we do not know the true model and we do not know the best set of parameters within our candidate model. The latter problem is naturally addressed by collecting data D and producing an estimate of the parameters x̂(D), then averaging over possible data sets, such that the quantity of interest becomes

E_D[ Σ_{D′} Pr(D′|T; C) log Pr(D′|x̂(D); C, M) ].    (4)

Akaike showed that, independent of the true model (and under some regularity conditions), an unbiased estimator of this quantity is

AIC(M) = log Pr(D|x̂(D); C, M) − d,    (5)

where d is the dimension of the model (the number of free parameters). The preferred model is the one with the largest value of AIC(M). The simple linear penalization with dimension makes it clear how models with more parameters are penalized.
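A toy comparison makes the penalty concrete. In the "larger is better" convention of Eq. (5) (rather than the equivalent 2d − 2 log L form, which is minimized), the extra parameter must buy at least one unit of log-likelihood. The coin-flip example below is ours, not from the text:

```python
import numpy as np

def aic(log_likelihood_at_mle, d):
    """AIC in the 'larger is better' convention of Eq. (5):
    log Pr(D | x_hat; C, M) minus the number of free parameters d."""
    return log_likelihood_at_mle - d

# Toy comparison: 100 coin flips with 55 heads.
n, k = 100, 55
# Model A: fair coin, no free parameters.
ll_fair = n * np.log(0.5)
# Model B: bias is a free parameter, with MLE k / n.
phat = k / n
ll_bias = k * np.log(phat) + (n - k) * np.log(1 - phat)
# The biased model fits better, but not by enough to pay for its parameter.
better = "bias" if aic(ll_bias, 1) > aic(ll_fair, 0) else "fair"
```

Here the maximized likelihood of the richer model is higher, yet the AIC still prefers the simpler model: the fit improvement is smaller than the dimension penalty.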

C. Bayesian Information Criterion (BIC)
The Bayesian approach is more general. Being so, it is less obvious how it might penalize complex models. Here we show how an asymptotic approximation leads to a form similar to the AIC. First, we write the marginal likelihood (2) as the integral expectation

Pr(D|M; C) = ∫ dx Pr(x|M; C) Pr(D|x; C, M).    (6)
We will approximate this integral using Laplace's method. To this end, consider the Taylor expansion of the log of the likelihood function about its peak x̂ (note that for brevity, we have dropped the context C and model M from the conditionals):

log Pr(D|x) ≈ log Pr(D|x̂) + (1/2)(x − x̂)ᵀ H (x − x̂),

where H is the Hessian of the log-likelihood at x̂. If we assume the number of measurements N → ∞, the law of large numbers gives −H ≈ N Ī, where I_j is the Fisher information of the j'th measurement and Ī = (1/N) Σ_j I_j is the arithmetic average of these values. Then, the integral (6) becomes

Pr(D|M; C) ≈ Pr(D|x̂) Pr(x̂|M; C) (2π/N)^{d/2} det(Ī)^{−1/2}.

Now we take the logarithm to obtain

log Pr(D|M; C) ≈ log Pr(D|x̂) − (d/2) log N + (d/2) log 2π + log Pr(x̂|M; C) − (1/2) log det(Ī).

If we ignore the terms not changing with N, we have a new quantity,

BIC(M) = log Pr(D|x̂(D); C, M) − (d/2) log N,

which is the well-known Bayesian information criterion or BIC. Notice the striking similarity to the AIC (5). Being nearly equivalent, the BIC is often considered in addition to the AIC. Next, we will consider the full solution, which will allow us to obtain more accurate estimates of parameters averaged over the competing models. Recent proposals have used the AIC/BIC on both simulated [39][40][41][42] and experimental data [43,44]. In [42], however, the authors caution against its unattended use. The argument against the AIC for quantum states, for example, is simple. The AIC is derived from a metric which measures the closeness of models in their predictive probability, a certainly well-motivated measure. However, such a measure is only useful if all future measurements will be the same as those used to perform the data analysis. That is, one can measure copies of a quantum system in some fixed set of bases, estimate the state, then use that estimate to predict the outcome of a measurement in a new basis. Thus, in quantum theory, one ought to consider a measure on the predictive distribution as maximized over all possible future measurements. As suggested in [42], such a measure might well be the quantum relative entropy, for example. Here we avoid these problems by considering the full Bayesian solution, which we describe next.
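The BIC penalty (d/2) log N grows with the amount of data, so it penalizes extra parameters harder than the AIC's constant d. Reusing the same toy coin-flip comparison as above (our example, same "larger is better" convention):

```python
import numpy as np

def bic(log_likelihood_at_mle, d, n):
    """BIC in the same 'larger is better' convention:
    log Pr(D | x_hat; C, M) - (d / 2) * log N."""
    return log_likelihood_at_mle - 0.5 * d * np.log(n)

# 100 coin flips with 55 heads: fair coin (d = 0) vs free bias (d = 1).
n, k = 100, 55
ll_fair = n * np.log(0.5)
phat = k / n
ll_bias = k * np.log(phat) + (n - k) * np.log(1 - phat)
# Positive gap means the BIC, like the AIC here, prefers the simpler model.
penalty_gap = bic(ll_fair, 0, n) - bic(ll_bias, 1, n)
```

At N = 100 the per-parameter BIC penalty is (1/2) log 100 ≈ 2.3, already larger than the AIC's penalty of 1.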
At the level of models, Bayes' rule assigns each model M_k the posterior probability

Pr(M_k|D; C) = Pr(D|M_k; C) Pr(M_k|C) / Pr(D|C).    (16)

Often, practitioners of Bayesian methods go one step further and compare two models, say M_1 and M_2, by taking the ratio of these posteriors, noticing that the normalization factor cancels:

Pr(M_1|D; C) / Pr(M_2|D; C) = [Pr(D|M_1; C) / Pr(D|M_2; C)] × [Pr(M_1|C) / Pr(M_2|C)].

This quantity is called the posterior odds ratio and was first considered by Jeffreys [47]. Clearly, if the posterior odds ratio is larger than 1, we favor M_1. The last fraction is called the prior odds ratio, and the unbiased choice favoring neither model is to set this term equal to 1. This leaves us with

Pr(D|M_1; C) / Pr(D|M_2; C),

which is called the Bayes factor [48]. Each quantity in the ratio is the marginal likelihood (2) of its respective model. For a discrete number of hypothetical models {M_k}, and assuming one model must be chosen, the optimal strategy is to compute the marginal likelihood of each model and select the one with the highest value. Model selection of this type has been used in quantum mechanical problems of Hamiltonian finding [49] and estimating error channels from syndrome measurements [50].
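Turning marginal likelihoods into model posteriors is a one-line normalization. A sketch with even prior odds (the numerical marginal likelihoods are made up for illustration):

```python
def model_posteriors(marginal_likelihoods, priors=None):
    """Posterior probabilities Pr(M_k | D; C) from marginal likelihoods
    Pr(D | M_k; C), assuming equal prior odds unless priors are given."""
    if priors is None:
        priors = [1 / len(marginal_likelihoods)] * len(marginal_likelihoods)
    unnormalized = [m * p for m, p in zip(marginal_likelihoods, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# A Bayes factor of 3 in favor of M1, with even prior odds:
post = model_posteriors([0.06, 0.02])
# post = [0.75, 0.25]
```

With even prior odds the posterior odds ratio equals the Bayes factor, so a factor of 3 maps to probabilities 3/4 and 1/4.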
Here we will investigate the idea of using a meta-model, which is an average over those in {M_k}. Assume that we are interested in some subset of parameters y common to all models. Given we have taken data D and computed each model posterior Pr(M_k|D; C), we define the model average estimate (MAE) as

ŷ = Σ_k Pr(M_k|D; C) E[y|D; C, M_k].    (19)

In words, this is the average (over models) of the average (over parameters within models). Variants of this approach are referred to as Bayesian model averaging [45]. Before moving to our examples, a few comments are in order. First, it is not necessary that the models are "nested" in the sense that we can order them into supersets. The only requirement is that the parameters of interest are included in each model. The other parameters in each model we might call nuisance parameters. Part of the appeal of the Bayesian approach is that these parameters are automatically dealt with, and we can focus on those parameters which are of immediate interest. Of course, the nuisance parameters can be inferred as well.
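Eq. (19) is just a probability-weighted sum of within-model posterior means. A minimal sketch (the model probabilities and means are illustrative numbers, not from the paper's experiments):

```python
def model_average_estimate(posterior_model_probs, within_model_means):
    """MAE of Eq. (19): the average (over models) of the mean (over
    parameters within each model) of the common parameter y."""
    return sum(p * y for p, y in zip(posterior_model_probs, within_model_means))

# Two models assign different posterior means to a shared parameter y;
# the MAE hedges between them using the model probabilities.
y_hat = model_average_estimate([0.75, 0.25], [0.90, 0.80])
# y_hat = 0.875
```

When one model's probability approaches 1, the MAE reduces to ordinary estimation within that model; when the models are nearly equiprobable, it hedges between them.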
Let us wax philosophical for a moment. What is being proposed is to select a meta-model, an average over many different physical models. This may seem awkward for physical theories-after all, there is only one true model, right? Not in our view. Models are human constructs, those Platonic ideals which describe another world, a world we were clever enough to find through our mastery of mathematics and abstraction. Here, we have dropped the idea that the point is to find the capital-T-Truth. Rather, we measure our understanding of Nature through our ability to predict and control its behavior. By averaging physical models, we can show that this idea has merit.

B. Sequential Monte Carlo
In practice, the Bayesian update rule and the expectations required in the equations above are analytically and computationally intractable, since they involve complicated integrals over multidimensional parameter spaces which may include solutions to equations of motion which are themselves intractable. To perform the calculations we turn to Monte Carlo techniques. Our numerical algorithm fits within the subclass of Monte Carlo methods called sequential Monte Carlo (SMC) or particle filtering [51].
The SMC procedure prescribes that we approximate the probability distribution by a weighted sum of Dirac delta functions,

Pr(x) ≈ Σ_{j=1}^n w_j δ(x − x_j),

where the weights at each step are iteratively calculated from the previous step via

w_j ↦ w_j Pr(d|x_j),    (21)

followed by a normalization step. The elements of the set {x_j}_{j=1}^n are called particles. Here, n is the number of particles and controls the accuracy of the approximation. Like all Monte Carlo algorithms, the SMC algorithm approximates expectation values, such that

E[f(x)] ≈ Σ_{j=1}^n w_j f(x_j).

In other words, sequential Monte Carlo allows us to efficiently compute multidimensional integrals with respect to the measure defined by the probability distribution. The resultant posterior probability provides a full specification of our knowledge. However, in most applications, it is sufficient, and certainly more efficient, to summarize this distribution. In our context, the optimal single parameter vector to report is the mean of the posterior distribution,

x̂ = E[x|D; C, M] ≈ Σ_{j=1}^n w_j x_j.

The SMC approximation can also provide efficient calculation and description of regions [38,52]. For our purpose, we also require the SMC approximation to give an accurate and efficient estimate of the marginal likelihood Eq. (6), which we need to calculate Bayes' rule at the level of models Eq. (16). Via the SMC approximation, the integral expectation in the definition of the marginal likelihood, Eq. (6), is

Pr(D|M; C) ≈ Σ_{j=1}^n w_j Pr(D|x_j; C, M).

It is not immediately obvious, but is easy to see in hindsight, that this is exactly the normalization that must already be computed in the SMC algorithm after the weight update, Eq. (21), is applied. By storing this value, we can apply Bayes' rule at the meta-level of models, Eq. (16). An iterative numerical algorithm such as SMC requires care to ensure stability. Conditions for stability of the algorithm and the specifications of an implementation have been detailed elsewhere [52,53].
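The weight update (21) and the stored normalizer can be sketched in a few lines. This is a bare-bones illustration (no resampling, which a stable implementation such as [52,53] requires), with the particle cloud, Born-rule likelihood, and data chosen by us for the single-qubit example:

```python
import numpy as np

def smc_update(particles, weights, likelihood, datum):
    """One SMC weight update, Eq. (21): reweight by the likelihood of
    the new datum; the normalizer estimates the marginal likelihood."""
    new_weights = weights * likelihood(datum, particles)
    normalizer = new_weights.sum()  # SMC estimate of Pr(datum | M; C)
    return new_weights / normalizer, normalizer

# Particles over the sigma_x Bloch component of a qubit, uniform prior.
rng = np.random.default_rng(0)
particles = rng.uniform(-1, 1, size=2000)
weights = np.full(particles.size, 1 / particles.size)

def likelihood(d, x):  # Born rule, outcome d = +1 or -1
    return (1 + d * x) / 2

# Observe ten +1 outcomes; the running product of normalizers
# accumulates the marginal likelihood of the whole record.
log_ml = 0.0
for _ in range(10):
    weights, norm = smc_update(particles, weights, likelihood, +1)
    log_ml += np.log(norm)
x_mean = (weights * particles).sum()  # posterior-mean estimate of x
```

Storing `norm` at each step is exactly the bookkeeping described above: the product of the normalizers over the data record is the SMC estimate of Pr(D|M; C) used in Eq. (16).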
The SMC algorithm has now been used in many quantum mechanical parameter estimation problems [38,49,[52][53][54][55] and a software implementation (the one used here) is available as a Python package [56,57].

A. Rank selection
We consider first an example similar to Guta, Kypraios and Dryden [44]: n qubits subjected to random Pauli measurements, where the models under consideration are those of differing rank. Each model will be denoted M_r, where r is the rank of the unknown quantum state, so that the dimension of model M_r is d = 2^{n+1} r. We generate unknown quantum states with fixed rank r as follows [58]. Begin with a matrix X ∈ C^{2^n × r} where each component x_ij is chosen independently according to Re(x_ij) ∼ N(0, 1) and Im(x_ij) ∼ N(0, 1), standard normal distributions. Then define the rank r density operator

ρ = X X† / Tr(X X†).    (25)

If r = 2^n, this construction is equivalent to a Hilbert-Schmidt random density matrix. The model parameters will be the vectorization of the matrix X: x = vec(X). We label the single qubit Pauli operators {σ_0, σ_1, σ_2, σ_3} and the multi-qubit Paulis by

σ_k = σ_{k_1} ⊗ σ_{k_2} ⊗ · · · ⊗ σ_{k_n},

where k = k_1 + 4 k_2 + 4^2 k_3 + · · · + 4^{n−1} k_n. Since each Pauli squares to the identity, σ_k^2 = 1, each individual measurement has 2 possible outcomes, which we label d ∈ {+1, −1} for the +1 and −1 eigenvalues. Then the likelihood function of a single measurement can be related to the expectation value via Pr(±1|σ_k) = (1 ± ⟨σ_k⟩)/2, where ⟨σ_k⟩ = Tr(ρ σ_k). Using the properties of the vec operation (in the column-stacking convention), we can write this as an explicit function of x:

⟨σ_k⟩ = x† (1_r ⊗ σ_k) x / (x† x).

We label the parameter vector within the rank r model M_r by x_r and the associated density matrix given by Eq. (25) by ρ_r. Within each model, there is not a one-to-one correspondence between x_r and ρ_r: different vectors will yield the same density matrix. Since we will be interested in obtaining accurate estimates of density matrices as quantified by some norm on the space of density matrices, we will average the models over their ρ_r's rather than their x_r's.
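The state construction and measurement likelihood above can be sketched directly, under the stated Gaussian ensemble (function names are ours; a real implementation would also work with the vectorized parameterization):

```python
import numpy as np

def random_rank_r_state(n_qubits, r, rng):
    """Draw a random rank-r density matrix, Eq. (25): X has i.i.d.
    standard-normal real and imaginary parts; rho = X X^dag / Tr(X X^dag)."""
    dim = 2 ** n_qubits
    X = rng.standard_normal((dim, r)) + 1j * rng.standard_normal((dim, r))
    M = X @ X.conj().T
    return M / np.trace(M).real

def pauli_likelihood(rho, pauli, outcome):
    """Born rule for a +/-1 Pauli outcome: Pr(d) = (1 + d Tr(rho sigma)) / 2."""
    expectation = np.trace(rho @ pauli).real
    return (1 + outcome * expectation) / 2

rng = np.random.default_rng(1)
rho = random_rank_r_state(2, 2, rng)  # a rank-2 two-qubit state

# Example Pauli: sigma_z on the first qubit, identity on the second.
sz = np.diag([1.0, -1.0])
pauli = np.kron(sz, np.eye(2))
p_plus = pauli_likelihood(rho, pauli, +1)
```

Each sampled `rho` is Hermitian, unit trace, positive semidefinite, and has exactly r nonzero eigenvalues, as the construction requires.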
Within each model, then, we have the mean density matrix (recall D is data, C is the context, that is, which Paulis were measured)

ρ̂_r = E[ρ_r|D; C, M_r].

Explicitly, the MAE in Eq. (19) is

ρ̂ = Σ_r Pr(M_r|D; C) ρ̂_r.

We assume there is a true density matrix ρ_t, and we judge each estimate ρ̂ of the true state by its spectral distance to ρ_t:

‖ρ̂ − ρ_t‖ = σ_max(ρ̂ − ρ_t),

where σ_max denotes the largest singular value. This is the norm induced by the usual Euclidean norm on vectors.
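The spectral distance is a one-liner with NumPy, whose matrix 2-norm is exactly the largest singular value (`spectral_distance` is our name for it):

```python
import numpy as np

def spectral_distance(a, b):
    """Largest singular value of a - b: the operator norm induced by
    the Euclidean vector norm."""
    return np.linalg.norm(a - b, ord=2)

# A pure state vs the maximally mixed state of one qubit:
# the difference diag(0.5, -0.5) has largest singular value 0.5.
d = spectral_distance(np.diag([1.0, 0.0]), np.diag([0.5, 0.5]))
```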
The data for 2 qubits are presented in Fig. 1. The important take-away is that the model average estimate does as well as, or slightly better than, the true model in every case. Also, we see that it is quite easy to identify rank 1 models (pure states) as well as rank 4 models (full rank states), while it seems difficult to identify states of non-extreme rank. Notice that it is extremely difficult to distinguish the rank 3 model from the full rank model when the former defines the underlying truth. However, both models perform well with respect to the error in the estimated parameters, and the model average estimate does best, on average.
To further illustrate the difficulty in differentiating high rank states, the probabilities assigned to 3 qubit models are shown in Fig. 2. Again, we see that pure states and full rank states are correctly identified, yet it is difficult to correctly distinguish between rank 7 and rank 8 states, in the same way as it was difficult to distinguish rank 3 and rank 4 states for 2 qubits. We conclude that for the models considered here, it is easiest to correctly identify low rank and full rank states, while it is difficult to correctly identify states of high but not full rank.
Notice also that models which are far away, in the sense that the ranks differ by relatively large amounts, are quickly ruled out. In these cases, since the SMC algorithm can be run online (in parallel with the experiment), simulating such models can be stopped to mitigate the computational difficulty in simultaneously simulating many quantum models. Still, tomography is at one extreme in the spectrum of methods for estimating quantum mechanical parameters: it is the most complete description of the physical system. At the other extreme is summarizing information from experiments into a single number, such as fidelity.

B. Randomized benchmarking
In the second example, we consider the experimental protocol of randomized benchmarking [59], which has been demonstrated in a variety of experimental settings [61][62][63][64] to efficiently characterize noise and quantum channels. The protocol consists of stringing together potentially long sequences of gates which are then undone to determine if the initial state has survived. In [59] the approach was shown to give, in expectation, an exponential decay A p^m + B, where A and B encode the errors in preparation and measurement, p is the bare survival probability and m is the length of the sequence. In the models we consider, p can be related to the average fidelity over a group of gates, but more specialized protocols exist [24,65].

FIG. 2: The performance of Bayesian model selection and the model average estimate for 3 qubits. Each box represents the data for its labeled rank as the "true" model. The lines represent the median of the data and, where present, the shaded areas are the interquartile ranges. Each "measurement" on the horizontal axis corresponds to 100 experiments of a randomly chosen Pauli measurement. In the SMC algorithm, 10^5 particles were used in each model. For each true rank, 10 simulated states were generated and measured.
Typically, p is the only parameter of interest, since it is directly related to the average fidelity of the device, which is then compared to some threshold. The other parameters are often considered nuisance parameters. In [66] it was shown that the decay can be interpreted as a probabilistic model where the binary outcome of each measurement sequence of length m has probability of survival (labeled 0)

Pr(0|m; M_0) = A_0 p^m + B_0,

where the subscripts refer to this as the zeroth order model. In [59], a hierarchy of models was introduced because the zeroth order model assumes the errors in the gates within the sequence are independent. By dropping this assumption, a richer set of noise models can be studied [60]. Here we will study the zeroth order model and the first order model,

Pr(0|m; M_1) = A_1 p^m + B_1 + C_1 (m − 1)(q_1 − p^2) p^{m−2},

where A_1 and B_1 again encode the preparation and measurement errors, C_1 encodes the error on the final gate in the sequence and q_1 − p^2 is a measure of the gate dependence in the errors.
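The two decay curves can be written directly; the sketch below follows the forms quoted above (function names are ours, and the parameter values are illustrative, matching the prior means used later):

```python
def survival_zeroth(m, p, A0, B0):
    """Zeroth-order randomized-benchmarking decay: A0 * p**m + B0."""
    return A0 * p ** m + B0

def survival_first(m, p, A1, B1, C1, q):
    """First-order model with an extra gate-dependence term;
    q - p**2 measures the gate dependence of the errors."""
    return A1 * p ** m + B1 + C1 * (m - 1) * (q - p ** 2) * p ** (m - 2)

# With q = p**2 the first-order model collapses onto the zeroth-order one.
m, p = 20, 0.95
f0 = survival_zeroth(m, p, 0.3, 0.5)
f1 = survival_first(m, p, 0.3, 0.5, 0.03, p ** 2)
```

This collapse at q_1 = p^2 (or C_1 = 0) is exactly why the two models are hard to distinguish when the first-order prior sits near that surface, as discussed next.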
For the zeroth order model, we take the prior to be a normal distribution with mean vector (p, A_0, B_0) = (0.95, 0.3, 0.5) and equal diagonal covariances given by a deviation of σ = 0.01. Note that the first order model is equal to the zeroth order model when either C_1 = 0 or q_1 = p^2. In order not to make the models so different that it would be trivial to distinguish them, we look at two priors for the first order model which are close to the zeroth order model. The first is a normal distribution with mean vector (p, A_1, B_1, C_1, q_1) = (0.95, 0.3, 0.5, 0.03, 0.95) =: μ_I and equal diagonal covariances given by a deviation of σ = 0.01, and the second is slightly closer, with the same covariance matrix but mean vector (p, A_1, B_1, C_1, q_1) = (0.95, 0.3, 0.5, 0.02, 0.92) =: μ_II. Note that the difference between these two distributions in the relative entropy divergence is only 0.050, and so we might expect them to behave similarly. Since both models are close to the zeroth order model, we expect it to be difficult to distinguish them.
In Fig. 3 we simulate the models, noting again that, via the priors, they are very close. This intuition is quantified by the fact that the models are hard to distinguish, regardless of which is true. On the left in Fig. 3, we see that the parameters in the first order model are so close to those in the zeroth order model that it is irrelevant which is chosen, for the purpose of estimating p. However, what is "close" can be deceiving, as we see on the right of Fig. 3. Recall that the relative entropy from μ_I to μ_II is only 0.050 (which explains why they are so difficult to distinguish). In this case, the accuracy of the estimates of the parameter p depends crucially on which model is actually correct. In such cases, the model average estimate can be seen as providing a more conservative estimate of the average gate fidelity by hedging what is at best a 50/50 guess on which model is correct.

V. CONCLUSION AND DISCUSSION
We have introduced a Bayesian model averaging approach to estimating parameters in quantum mechanical models describing data. In the examples considered, the model average estimate performs as well as the unknown true model in most cases. In situations where models are difficult to distinguish, the model average estimate can slightly outperform the true model.
For the quantum state estimation example (Sec. IV A) we considered models of differing rank of the density matrix. Ranks which differ by large amounts from the true rank are rapidly ruled out; that is, the probability assigned to them quickly approaches zero. Thus, they contribute nothing to the model average estimate. On the other hand, ranks which are close to the true rank, especially when the true rank is high, are not so easily distinguished. This means that first selecting a rank and then performing estimation within that model can lead to overconfident estimates of the state. By averaging, we allow those estimates to contribute only with the relative probability that we deem them to be true. The mechanism for why higher rank states are hard to distinguish is as yet unclear. Although the SMC algorithm and the implementation used here have been extensively studied for a wide range of problems, it is possible that state tomography is an outlier. That is, some major modification to the modeling or the algorithm itself may be required to obtain the best performance. This resolution would be less interesting than a physical explanation, such as that Pauli measurement tomography (which we have used in the example here) is not the optimal scheme to distinguish rank. These questions are left for future work.
In the example of randomized benchmarking (Sec. IV B), we explored a situation where the presence of higher order perturbations was difficult to detect. The Bayesian model selection approach accurately reflects this by assigning roughly 50/50 probabilities to the models. Surprisingly, the "closeness" of the models as measured by our ability to distinguish them does not translate into our ability to accurately estimate parameters common to each. In some cases we can do no more than guess which is correct, yet guessing the wrong model may have disastrous consequences for our ability to accurately infer the parameters of interest. Again, the model average estimate mitigates the risk of improperly "guessing" which model is correct.
We have noted in the introduction the numerous approaches to estimation within quantum theory. This should urge one to ask, are all of these distinct approaches necessary-is there not some unified approach? Yes! The Bayesian framework outlined here is remarkably powerful in its generality. We note that Bayesian ideas have already been put to good use in quantum information theoretic and foundational problems [69][70][71][72][73][74][75][76][77] as well as for tomographic and parameter estimation problems [26,[78][79][80][81][82][83][84][85] and experimental design [86,87]. Importantly, the Bayesian algorithm can provide solutions to these problems online, while the experiment is running, with the same software tools [56].