A practical and efficient approach for Bayesian quantum state estimation

Bayesian inference is a powerful paradigm for quantum state tomography, treating uncertainty in meaningful and informative ways. Yet the numerical challenges associated with sampling from complex probability distributions hamper Bayesian tomography in practical settings. In this article, we introduce an improved, self-contained approach for Bayesian quantum state estimation. Leveraging advances in machine learning and statistics, our formulation relies on highly efficient preconditioned Crank–Nicolson sampling and a pseudo-likelihood. We theoretically analyze the computational cost, and provide explicit examples of inference for both actual and simulated datasets, illustrating improved performance with respect to existing approaches.


I. INTRODUCTION
Quantum state tomography (QST) is of fundamental importance in quantum information processing, where realization of computational advantages rests critically on the quality of the underlying quantum resources. In general, QST seeks to estimate the density matrix ρ describing a given state, utilizing the results of measurements on repeated state preparations [1]. As an encapsulation of the quantum state's properties, the density matrix facilitates quantitative predictions of quantum information protocols, clarifies the effects and sources of noise, and provides the foundation for analyzing entire circuits via quantum process tomography [2,3].
Yet QST is notoriously challenging for all but the smallest quantum systems. The Hilbert space of a collection of qubits grows exponentially with the number of particles, as does the number of independent quantities needed to fully characterize ρ. Indeed, such exponential scaling is the source of the unique computational power inherent in quantum information, and accordingly QST cannot be used for characterizing large-scale QIP systems of the future, at least in their entirety. However, there remains demand for efficient and informative QST techniques that make the most of available resources and push limits on system size. In this vein, Bayesian methods offer exciting promise. Built upon Bayes' rule for updating a prior probability distribution according to new information (measurements in the context of quantum tomography), Bayesian QST returns a complete probability distribution on ρ, quantifying uncertainty in a natural way, utilizing all available information optimally (in terms of minimizing an operational divergence), and avoiding unjustifiably optimistic estimates of low rank [4]. While Bayesian sampling approaches have been applied in several quantum optical experiments [5][6][7], the numerical challenge of drawing from high-dimensional probability distributions impedes widespread use in the physics community.
In this work, we propose, analyze, and demonstrate a full Bayesian tomography method that is straightforward to implement and numerically efficient. Our stand-alone approach leverages recent developments in multiple fields, including density matrix parameterization [8], PAC-Bayesian machine learning [9], and Markov chain Monte Carlo (MCMC) algorithms [10]. After introducing the algorithm in detail, we test it on experimental two-qubit data, obtaining a ∼3.5× speedup in our custom Metropolis-Hastings method over slice sampling. Additionally, with the aid of simulated data of much higher-dimensional two-qudit measurements, we observe a computational scaling advantage utilizing a pseudo-likelihood in place of a full multinomial likelihood. Overall, our method represents an improvement over previous Bayesian QST approaches and should provide a valuable tool for comprehensive, yet numerically efficient, state estimation.

II. BACKGROUND
In formulating the general problem, consider a system of n qudits (d-level quantum information carriers). The Hilbert space dimensionality is then D = d^n, and the D × D density matrix ρ describing a state requires D^2 − 1 real numbers for specification. In order to designate a physically realizable state, ρ must be (i) normalized [Tr ρ = 1], (ii) Hermitian [ρ† = ρ], and (iii) positive semidefinite [⟨ψ|ρ|ψ⟩ ≥ 0 for all unit-norm D-dimensional states |ψ⟩]. Historically, three major approaches have been adopted to estimate ρ from measurements.
Linear inversion.-The first method considered in quantum information processing, linear inversion tomography relies on the fact that measurement outcome probabilities are linear functions of the individual elements comprising ρ [1]. Thus, if a sufficient number of measurements have been performed to access all D^2 − 1 parameters of ρ-and the outcome frequencies are equated with these probabilities directly-one can enlist, e.g., least-squares (LS) inversion to obtain an estimate ρ_LS. While straightforward, LS tends to return nonphysical states: normalization and hermiticity can easily be enforced, but positive semi-definiteness cannot be.
Maximum likelihood.-Maximum likelihood estimation (MLE) finds the density matrix which is most likely to have produced the observed data D:

ρ_MLE = argmax_ρ L_D(ρ),   (1)

where L_D(ρ) ∝ P(D|ρ), the probability of receiving the particular set of outcomes given state ρ, as defined by some model [11,12]. Through appropriate parameterization of ρ, this method guarantees a result satisfying all physicality constraints. This advantage has made MLE the dominant approach to QST in recent years. However, as seen in Eq. (1), ρ_MLE is a point estimate and so does not quantify the level of uncertainty in the result. In practice, error bars have been obtained by modifying the observations according to, e.g., a Poissonian noise model and computing many MLE estimates [13], a procedure which amounts to simulating further experiments and averaging the MLE results obtained from these. While likely to give reasonable estimates, this approach is somewhat ad hoc and conceptually undesirable, as it involves feeding in additional data beyond that obtained experimentally.
Bayesian.-The third and least explored approach, Bayesian QST [4,5,14] accounts for experimental uncertainty explicitly through Bayes' theorem. Suppose ρ(x) is parameterized by some vector x, such that any value within x's support returns a physical matrix. Bayes' theorem states that the posterior probability distribution of x, given results D of some experiment, follows via

π(x) = (1/Z) L_D(x) π_0(x),   (2)

where L_D(x) is the likelihood (as in MLE), π_0(x) is the prior distribution (any beliefs about ρ before the experiment), and Z is a normalizing constant such that ∫ dx π(x) = 1. With access to π(x), the expectation value of any function φ of ρ can be obtained as

⟨φ⟩ = ∫ dx π(x) φ(ρ(x)),   (3)

which can be used to compute, e.g., the mean and standard deviation of any quantity of interest. Nevertheless, evaluating integrals of the form in Eq. (3) is numerically challenging due to their generally complicated features and high dimensionality, even for moderate-size systems (e.g., two qubits). Accordingly, MCMC methods have been invoked in the literature, such as Metropolis-Hastings [4,8], sequential Monte Carlo (SMC) [14], and slice sampling [5]. These approaches are designed, in most cases, to obtain R samples {x^(1), x^(2), ..., x^(R)}, so that Eq. (3) can then be approximated as

⟨φ⟩ ≈ (1/R) Σ_{r=1}^{R} φ(ρ(x^(r))).   (4)

Slice sampling in particular is an effective and quite general MCMC method, requiring no proposal distributions and largely insensitive to initial step settings [15,16]. However, the computing time required for convergence can easily make these methods intractable for systems of interest. A major motivation for the current work rests in the realization that a more tailored sampling method-focused on the specific density matrix parameterization and robust to increases in system dimensionality-can attain significant computational speedups. Finally, before describing in detail the procedure introduced here, we note an alternative view of Bayesian tomography associated with adaptive QST.
In this application, Bayes' theorem is invoked in real time, with the results from previous measurements used to guide subsequent measurement choices and reduce the total number of bases required for reconstruction [17][18][19][20][21]. While beyond the scope of the present work, where we concentrate on Bayesian state reconstruction post-experiment, we could certainly envision incorporating aspects of our approach into adaptive QST as well.
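As a concrete sketch of the sample-average approximation in Eq. (4), the snippet below computes the posterior mean and standard deviation of the fidelity from a set of density-matrix samples. The "samples" here are hypothetical stand-ins (random mixtures of a Bell state with the maximally mixed state), not actual MCMC output:

```python
import numpy as np

rng = np.random.default_rng(7)

def fidelity(rho, psi):
    # F = <psi|rho|psi> for a pure target state |psi>
    return np.real(psi.conj() @ rho @ psi)

# Hypothetical stand-ins for R posterior samples {rho(x^(1)), ..., rho(x^(R))}
psi = np.array([1, 0, 0, 1]) / np.sqrt(2)   # two-qubit Bell state
R = 1000
samples = []
for _ in range(R):
    eps = 0.1 * rng.random()                # random mixing weight
    samples.append((1 - eps) * np.outer(psi, psi.conj()) + eps * np.eye(4) / 4)

# Eq. (4): the posterior expectation of any function phi becomes a sample
# average; here phi is the fidelity, yielding its mean and standard deviation.
F = np.array([fidelity(r, psi) for r in samples])
F_mean, F_std = F.mean(), F.std()
```

In an actual run, `samples` would be the density matrices ρ(x^(r)) built from the retained MCMC draws.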

III. PROPOSED METHOD

A. Steps
We now outline our proposed Bayesian QST workflow, summarized visually in Fig. 1. (The steps will be explained in detail in Sec. III B.)

(a) Perform a set of measurements on the unknown state ρ, amounting to a total of N individual outcomes (over all measurement settings).
(b) Compute the least-squares estimate ρ_LS. If the number of measurements is tomographically incomplete, ρ_LS lives in a subspace spanned by only those directions which were observed, which we can express through the function P_M(·), i.e., P_M(ρ_LS) = ρ_LS.

(c) Parameterize the D × D density matrix by D nonnegative real numbers, y_k, and D complex column vectors of length D, z_k. The density matrix for parameter set x = {y_1, ..., y_D, z_1, ..., z_D} is then

ρ(x) = Σ_{k=1}^{D} (y_k / Σ_{l=1}^{D} y_l) (z_k z_k†) / (z_k† z_k).   (5)

This satisfies all physicality conditions.
(d) Take the prior distribution for x as

π_0(x) ∝ Π_{k=1}^{D} y_k^{α−1} e^{−y_k} e^{−z_k† z_k},   (6)

which amounts to treating the weights as Gamma-distributed [y_k ∼ Γ(α, 1)] and the vectors as standard-normal complex Gaussians.

(e) Define the pseudo-likelihood as

L_D(x) = exp( −(N/2) ‖ρ_LS − P_M(ρ(x))‖_F² ),   (7)

with ‖A‖_F ≡ [Tr(A†A)]^{1/2} denoting the Frobenius norm and P_M(·) the projection introduced in Step (b).
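The steps above can be sketched numerically. The snippet below draws one parameter set x from the prior of Step (d), builds ρ(x) per Step (c), and evaluates the pseudo-log-likelihood of Step (e); the least-squares estimate (here simply the maximally mixed state) is a hypothetical placeholder, and P_M is taken as the identity, i.e., a tomographically complete measurement set:

```python
import numpy as np

rng = np.random.default_rng(0)
D, alpha, N = 4, 1.0, 10_000  # dimension, prior parameter, total counts

# Step (d): draw x = {y_k, z_k} from the prior.
y = rng.gamma(alpha, 1.0, size=D)                           # y_k ~ Gamma(alpha, 1)
z = rng.normal(size=(D, D)) + 1j * rng.normal(size=(D, D))  # rows are z_k

# Step (c): rho(x) = sum_k (y_k / sum_l y_l) z_k z_k^dag / (z_k^dag z_k)
def build_rho(y, z):
    w = y / y.sum()
    rho = np.zeros((D, D), dtype=complex)
    for k in range(D):
        rho += w[k] * np.outer(z[k], z[k].conj()) / (z[k].conj() @ z[k]).real
    return rho

rho = build_rho(y, z)
assert np.isclose(np.trace(rho).real, 1.0)   # normalized
assert np.allclose(rho, rho.conj().T)        # Hermitian

# Step (e): pseudo-log-likelihood against the (placeholder) LS estimate,
# with the constant N/2 corresponding to sigma^2 = 1/N.
rho_LS = np.eye(D) / D
log_L = -(N / 2) * np.linalg.norm(rho - rho_LS, "fro") ** 2
```

By construction every draw is a valid density matrix, so no physicality constraints need to be checked during sampling.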

B. Further Details on Specific Steps
Parameterization.-We have opted for the parameterization and prior employed by Mai and Alquier [8], which expresses the density matrix as a superposition of normalized (though non-orthogonal) projectors. Incidentally, this represents an over-parameterization, in that it relies on a total of 2D^2 + D real numbers, rather than the minimum of D^2 − 1 required for a D × D density matrix. We have found this parameterization significantly more efficient to sample from and evaluate than the Cholesky approach of Refs. [5,22]. For example, computing the determinant in the integration measure of Ref. [5]-needed to preserve Haar invariance [23]-requires O(D^6) operations for a given draw. On the other hand, the current over-parameterization utilizes a simple Cartesian differential,

dx = Π_{k=1}^{D} dy_k Π_{m=1}^{D} d(Re z_{k,m}) d(Im z_{k,m}),   (8)

where z_{k,m} denotes the m-th component of the complex vector z_k. Constructing ρ given x requires only O(D^3) operations [Eq. (5)], offsetting the small overhead incurred from the additional parameters. The prior distribution [Eq. (6)], also from Ref. [8], is specified by one user-adjustable value, α, which can be used to favor low- or high-rank ρ, i.e., pure or mixed states, respectively. The collection of D normalized random weights y_k / Σ_l y_l with y_k ∼ Γ(α, 1) follows a Dirichlet distribution Dir(α), which guarantees both normalization and nonnegativity; α = 1 represents a fully uniform prior, with equal weight given to all physically realizable states, while α < 1 favors sparse Dirichlet draws [24] and hence purer states. (It is important to note that Haar invariance of the prior π_0(x) obtains for any choice of α, due to rotational symmetry of the normal distribution.) Finally, the complex Gaussian vectors, when normalized to z_k/|z_k|, correspond to uniform draws from the complex unit hypersphere.
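Both distributional facts above admit a quick numerical check: normalized Γ(α = 1, 1) weights are uniform on the simplex (each with mean 1/D), and normalized complex Gaussian vectors have components whose mean squared modulus is 1/D, one symptom of uniformity on the unit hypersphere. The sample sizes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 3, 100_000

# Normalized Gamma(alpha = 1) weights follow Dir(1): uniform on the simplex,
# so each weight has mean 1/D.
y = rng.gamma(1.0, 1.0, size=(n, D))
w = y / y.sum(axis=1, keepdims=True)
print(w.mean(axis=0))                  # ~ [1/3, 1/3, 1/3]

# Normalized complex Gaussian vectors z/|z| are uniform on the unit
# hypersphere; in particular E[|u_m|^2] = 1/D for every component m.
z = rng.normal(size=(n, D)) + 1j * rng.normal(size=(n, D))
u = z / np.linalg.norm(z, axis=1, keepdims=True)
print((np.abs(u) ** 2).mean(axis=0))   # ~ [1/3, 1/3, 1/3]
```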
Pseudo-likelihood.-The particular L_D(x) chosen in Eq. (7) is "pseudo" in that it does not proceed from an explicit experimental model, but rather merely assigns a loss function between a proposed ρ(x) and experimental data, in this case the least-squares estimate ρ_LS. Growing in popularity in the context of "probably approximately correct" (PAC) Bayesian machine learning [9], pseudo-likelihoods are useful when a first-principles model is either unknown or too complex to compute efficiently. The downside is the need to separately specify the weight between evidence and prior, as controlled by the constant appearing in the pseudo-likelihood expression exp(const × loss). The larger its value, the more sharply peaked around ρ_LS the posterior distribution becomes. As one of the strengths of Bayesian QST lies in its quantification of estimator uncertainty, it is essential that this scale factor reflect confidence levels commensurate with the amount of data gathered. For a quadratic loss function as in Fig. 1(e), one can associate this constant with 1/(2σ^2), with σ^2 the variance. If we take N as the total number of events utilized in the LS estimate, it is reasonable to assume σ^2 ∝ 1/N, although the specific proportionality factor is unclear. Reference [8] conjectures σ^2 = 2/N as optimal, but in the absence of more rigorous motivation, we select σ^2 = 1/N, which in the examples below leads to uncertainties comparable to those of a full likelihood, albeit slightly larger. In general, more thorough methods for selecting the variance represent an important direction for future research.
In Steps (b) and (e), we propose treating cases of incomplete measurements by projecting onto only those elements of ρ which are accessed in the experiment, expressed formally through the function P_M(·). Consider the decomposition of a D × D density matrix in terms of the D^2 − 1 traceless, Hermitian generators λ_k of SU(D):

ρ = (1/D) I_D + (1/2) Σ_{k=1}^{D^2−1} c_k λ_k.   (9)

In light of orthogonality [Tr λ_k λ_l = 2δ_kl], we have c_k = Tr ρλ_k. Thus, incomplete measurements reflect that only a subset of the D^2 − 1 observables can be estimated through linear inversion. Suppose that K_M denotes this subset of indices; then we define P_M(·) as

P_M(ρ) = (1/D) I_D + (1/2) Σ_{k∈K_M} c_k λ_k.   (10)

For example, in the data utilized in Sec. IV, measurements were sensitive to eight of the fifteen coefficients required to specify a two-qubit state (|K_M| = 8). By contrast, a tomographically complete experiment corresponds to P_M(ρ) = ρ. We emphasize that this particular projector definition is merely a convenient choice, and it differs from the prob-estimator in Ref. [8].
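For the two-qubit case, P_M(·) can be sketched explicitly. The measured subset chosen below (all generators without a Pauli-Y factor, giving |K_M| = 8) mirrors the experiment of Sec. IV but is otherwise illustrative; the generators are scaled so that Tr λ_k λ_l = 2δ_kl:

```python
import numpy as np
from itertools import product

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
paulis = [I2, X, Y, Z]

# The 15 traceless two-qubit generators, normalized to Tr(l_k l_l) = 2 delta_kl.
labels, gens = [], []
for (a, A), (b, B) in product(list(enumerate(paulis)), repeat=2):
    if (a, b) != (0, 0):
        labels.append((a, b))
        gens.append(np.kron(A, B) / np.sqrt(2))

# Hypothetical measured subset: X/Z settings only (no Pauli-Y), as in Sec. IV.
K_M = [i for i, (a, b) in enumerate(labels) if a != 2 and b != 2]

def P_M(rho):
    """Keep the identity part plus only the measured generator components,
    per Eq. (10): P_M(rho) = I/D + (1/2) sum_{k in K_M} c_k lambda_k."""
    out = np.eye(4, dtype=complex) / 4
    for k in K_M:
        out += 0.5 * np.trace(rho @ gens[k]) * gens[k]
    return out
```

One can verify that P_M acts as the identity on any state with no Pauli-Y components (e.g., diagonal states) and is idempotent, as befits a projector.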
Additionally, even though we have presented a pseudo-likelihood formulation in defining the proposed method, the basic features can be readily applied to a full (model-infused) likelihood as well. Suppose an experiment consists of Q positive operator-valued measures (POVMs) Λ^(q), each with S_q total outcomes, associated with operators in the set Λ^(q) = {Λ_1^(q), ..., Λ_{S_q}^(q)}, where the outcome associated with Λ_s^(q) is observed N_s^(q) times. The full (multinomial) likelihood is then

L_D(x) ∝ Π_{q=1}^{Q} Π_{s=1}^{S_q} [Tr ρ(x) Λ_s^(q)]^{N_s^(q)}.   (11)

Because the total number of outcomes itself grows with dimension for informationally complete measurement sets, evaluating Eq. (11) is asymptotically more expensive than the single Frobenius-norm evaluation in Eq. (7); the pseudo-likelihood in our formulation is thus expected to impart a computational speedup for complete measurements, but not necessarily for the incomplete case, a situation which is consistent with the results of Secs. IV and V.
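A minimal implementation of the multinomial log-likelihood of Eq. (11), up to the ρ-independent combinatorial constant; the single-qubit POVM and count vector below are toy stand-ins for actual experimental settings:

```python
import numpy as np

def log_likelihood(rho, povms, counts):
    """log L_D(rho) = sum_q sum_s N_s^(q) log Tr(rho Lambda_s^(q)),
    up to a rho-independent constant (the multinomial coefficients)."""
    logL = 0.0
    for Lam, N in zip(povms, counts):
        p = np.array([np.real(np.trace(rho @ L)) for L in Lam])
        logL += np.dot(N, np.log(p))
    return logL

# Toy example: one POVM = the computational-basis projectors on one qubit.
rho = np.diag([0.7, 0.3]).astype(complex)
povm = [np.diag([1.0, 0.0]).astype(complex),
        np.diag([0.0, 1.0]).astype(complex)]
counts = np.array([70, 30])
print(log_likelihood(rho, [povm], [counts]))
```

Each additional measurement setting simply contributes another (POVM, counts) pair to the sums.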
Sampling algorithm.-The most challenging feature of Bayesian methods is sampling the posterior distribution, for which convergence generically slows as dimension increases. In 2013, however, Cotter et al. [10] introduced a transformative approach to MCMC sampling which eliminates this "curse of dimensionality" for Gaussian priors, under appropriate assumptions on the likelihood. Titled "preconditioned Crank-Nicolson" (pCN for short), pCN modifies standard random-walk Metropolis sampling by scaling the previous iteration's position before adding a random shift and generating the proposal x′. In Algorithm 1, pCN appears specifically in the factor (1 − β_z^2)^{1/2} z^(j) in Step 2. This small modification simplifies the acceptance probability A(x′, x^(j)) significantly with respect to a standard random-walk proposal, by removing terms of the form |z^(j)|^2 − |z′|^2 from the exponent. The difference in these terms can be large, which necessitates a smaller stepsize to maintain a given acceptance rate. Alternatively, independence sampling from the prior (β_z = 1) also removes these terms, but the acceptance probability in that case is determined by the ratio of the likelihood at two independent prior samples, which one can expect to be highly variable, and hence frequently small, if the likelihood varies substantially over the support of the prior, i.e., if the posterior differs significantly from the prior. Therefore, unlike both standard random-walk Metropolis and independence sampling, the proposal here preserves random-walk behavior while providing a simplified acceptance probability.
The specific expression for A(x′, x^(j)) follows from the standard form for Metropolis-Hastings [16]. Letting p(x′|x^(j)) denote the proposal density, we have

A(x′, x^(j)) = min{ 1, [π(x′) p(x^(j)|x′)] / [π(x^(j)) p(x′|x^(j))] }.   (12)

Making use of the densities for the proposal distributions, as well as Eqs. (2), (6), and (7), returns the formula in Step 3. For efficient convergence in the sampling algorithm, we monitor the acceptance rate and increase or decrease the step sizes β_y and β_z in tandem to maintain an acceptance fraction between 0.1 and 0.3. This range is chosen to enclose 0.234, the optimal acceptance probability, under various assumptions, for random-walk Metropolis-Hastings [26]. Additionally, we note that the adaptation diminishes as the chain evolves, so it does not preclude ergodicity [27].

IV. EXAMPLE WITH EXPERIMENTAL DATA
To explore the effectiveness of the new method, we perform QST on results from the frequency-bin quantum optics experiment of Ref. [28], specifically the measurements in Fig. 4 thereof. We selected this experiment for comparison because: (i) its basis set is tomographically incomplete, thus enabling use of the projector formulation in Fig. 1(b); and (ii) the results were already analyzed with the method of Ref. [5], providing an initial reference point. In the following, we perform numerical benchmarking of method performance for a variety of configurations. All tests were completed in 64-bit MATLAB utilizing a single thread on a 2.5 GHz machine with 128 GB of RAM.
For the first test, we focus on the speed of the pCN sampling method, using slice sampling with MATLAB's built-in algorithm for comparison. In this example, we take α = 1 for a uniform prior and invoke the full likelihood of Eq. (11). We increase the number of points in the Markov chain and monitor the mean F̄ and standard deviation ∆F of the fidelity F(x) = ⟨Ψ|ρ(x)|Ψ⟩, where |Ψ⟩ = (1/√2)(|01⟩ + |10⟩) is the ideal entangled state; R = 2^10 samples are kept from a total of RT, with T the thinning parameter used to reduce serial correlation in the chain. For each value of T, we run 100 independent samplers (i.e., with random initial points), returning 100 separate estimates of F̄ and ∆F. The results for slice sampling appear in Fig. 2(a), plotted against the total time logged by each sampler. Thinning increases by factors of two from 2^0 to 2^7 on this plot, and we use a box-plot format to summarize the statistics at each T: the center mark denotes the median, upper and lower lines enclose the 25th-75th percentiles, and the whiskers extend to the smaller of the farthest point or 1.5× the length of the box. Fidelity converges to F̄ = 0.93 ± 0.01, slightly higher than the mean of 0.92 found in Ref. [28]. This difference is unsurprising, though, since here we do not consider singles counts (i.e., events where only one of the two photons is detected), so our likelihood model differs. Figure 2(b) furnishes results for the identical test performed with our custom pCN algorithm, for thinning from 2^0 to 2^9. The total times are lower by approximately 18× for the same value of T. Examining the codes in detail indicates this difference is caused by the fact that the slice sampler evaluates π(x) many more times than pCN. Nevertheless, Fig. 2 reveals pCN's need for larger thinning values to reach the same level of convergence as slice sampling, so it is not clear a priori what, if any, quantitative advantage is obtained.
Accordingly, we next plot ∆F for both slice and pCN on the same logarithmic scale in Fig. 3. Initially, both approaches obtain a reduction in ∆F with log-log slope of −1/2 [i.e., ∆F ∝ (time)^{−1/2}], until converging to final values. Linear least-squares fits to the first five and seven points of the slice and pCN curves, respectively, give a ∼3.5× temporal speedup for pCN over slice at the same convergence level. Such an improvement-even for this comparatively small system of two qubits-is significant for practical QST, where computational time represents a precious commodity.
In the second test, we shift focus away from the sampling procedure and concentrate on the likelihood, comparing the full [Eq. (11)] and pseudo [Eq. (7)] versions directly. Since the experiment in question measured in combinations of the Pauli-X and Z bases for the two qubits, but not Pauli-Y, the experimental LS estimate consists of only eight of the fifteen total Pauli basis components, thus requiring the projector formalism in Fig. 1(b). From a computational perspective, this projection can be efficiently implemented as a linear transformation, by writing the density matrix elements as a length-D^2 column vector ρ_vec and finding the matrix V such that [P_M(ρ)]_vec = V ρ_vec, which can be precomputed according to the relationship between Pauli and computational basis representations [29]. Likewise, the probabilities appearing in the full likelihood [Eq. (11)] can be vectorized so that log L_D = N^T log(W ρ_vec), where N denotes the vector of counts and W is the linear transformation mapping matrix elements to probabilities. Matrix-vector multiplication reduces function evaluation time and is essential in providing a fair comparison between the likelihood approaches. Figure 4(a) plots F̄ and ∆F for 100 samplers utilizing the pCN algorithm, as the number of points increases, for both the full (left) and pseudo (right) likelihood models; both consider α = 1 in the prior. [The full likelihood results are the same as in Fig. 2(b), reproduced here for comparison.] Both likelihoods converge to similar values, though the pseudo case returns a slightly lower mean and higher uncertainty. The general congruity between the two cases offers evidence in favor of our choice for the variance (σ^2 = 1/N) and the projector-based approach to limited measurements. Even the slightly lower fidelity for the pseudo-likelihood is a positive feature, in that it does not overestimate the state's fidelity beyond that predicted by a complete model.
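The vectorized likelihood evaluation can be sketched for a hypothetical single-qubit experiment; W is precomputed once from the POVM elements, so each evaluation of log L_D reduces to a single matrix-vector product. With row-major flattening, Tr(ρΛ) = vec(Λ^T)·vec(ρ), so each row of W is the flattened transpose of one POVM element:

```python
import numpy as np

# Hypothetical single-qubit POVM elements: Z-basis and X-basis projectors.
P0 = np.diag([1.0, 0.0]).astype(complex)
P1 = np.diag([0.0, 1.0]).astype(complex)
plus = np.full((2, 2), 0.5, dtype=complex)                 # |+><+|
minus = np.array([[0.5, -0.5], [-0.5, 0.5]], dtype=complex)  # |-><-|
povm_elems = [P0, P1, plus, minus]

# Precompute W: row s maps rho_vec to Tr(rho Lambda_s) = vec(Lambda_s^T) . rho_vec
W = np.array([L.T.flatten() for L in povm_elems])

rho = np.array([[0.6, 0.2], [0.2, 0.4]], dtype=complex)
probs = np.real(W @ rho.flatten())        # all outcome probabilities at once
counts = np.array([60, 40, 70, 30])       # hypothetical count vector N
logL = counts @ np.log(probs)             # log L_D = N^T log(W rho_vec)
```

The same pattern extends to the two-qubit case, with W of size (number of outcomes) × D^2.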
One of the advantages of the prior formulation is its ability to impart sparsity to the density matrix parameterization, through α, and thus favor pure states. In Fig. 4(b), we consider α = 1/4 and repeat the convergence tests with all other settings the same. In both the full and pseudo cases, the fidelity is slightly higher compared to α = 1, which makes sense in light of the extra weight given to pure states. Interestingly, the full-likelihood case shows additional outliers at this α value (note the much wider y-axis scale). Evidently, the sparser prior increases the tendency for trapping of the Markov chain around local maxima. By contrast, the pseudo-likelihood results remain much more consistent throughout the convergence plot. While it would be unwise to infer too much from acknowledged outliers, the pseudo-likelihood approach nonetheless appears slightly more robust to fluctuations in the sampling algorithm, a valuable feature for Bayesian QST.
Yet the pseudo-likelihood does not lead to any observable speedup in sampler time. In this particular example (D = 4 and 16 total measurement outcomes), such a situation obtains, first, because computational cost is dominated not by calculating L D (x) but rather constructing ρ(x) [Eq. (5)] and, second, because of the additional cost of computing the projection P M (ρ) in this limited measurement case. As dimension D increases, though, the pseudo-likelihood's improved efficiency should ultimately surface, a question we address with simulated data in the next section.

V. EXAMPLE WITH SIMULATED DATA
In order to explore dimensionalities beyond that of the experimental data available to us, we next generate simulated tomographic data for entangled two-qudit states with the ground-truth density matrix

ρ_g = λ|Ψ⟩⟨Ψ| + [(1 − λ)/D] I_D,   (13)

where |Ψ⟩ = (1/√d) Σ_{k=1}^{d} |k⟩_A |k⟩_B is a high-dimensional Bell state, D = d^2, and I_D is the D × D identity matrix; λ ∈ (0, 1) controls the fidelity with respect to the ideal |Ψ⟩. Count data are obtained by cycling through all pairwise combinations of (d + 1) mutually unbiased bases (MUBs) [25], computing the d^2 outcome probabilities associated with the state in Eq. (13), and drawing from a multinomial distribution to emulate an experimental coincidence distribution. This procedure amounts to Q = (d + 1)^2 = D + 2√D + 1 total measurement settings, each with S_q = D outcomes, which are then either used to compute ρ_LS for pseudo-likelihood-based QST or inserted directly as exponents in the full likelihood.
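This data-generation procedure can be sketched for the smallest case, d = 2, where the three MUBs are simply the Pauli eigenbases; the seed and per-setting shot count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 2
D = d * d
lam = 0.95

# Ground truth of Eq. (13): a noisy two-qudit Bell state.
psi = np.zeros(D, dtype=complex)
for k in range(d):
    psi[k * d + k] = 1 / np.sqrt(d)   # |Psi> = (1/sqrt(d)) sum_k |k>_A |k>_B
rho_g = lam * np.outer(psi, psi.conj()) + (1 - lam) * np.eye(D) / D

# For d = 2, the d + 1 = 3 MUBs are the eigenbases of Pauli Z, X, and Y
# (columns of each matrix are the basis states).
mubs = [np.eye(2, dtype=complex),
        np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2),
        np.array([[1, 1], [1j, -1j]], dtype=complex) / np.sqrt(2)]

# Cycle through all (d + 1)^2 basis pairs, compute the D outcome
# probabilities, and draw counts from a multinomial distribution.
n_shots = 100 * D
counts = {}
for a in range(d + 1):
    for b in range(d + 1):
        U = np.kron(mubs[a], mubs[b])                    # joint measurement basis
        p = np.real(np.diag(U.conj().T @ rho_g @ U))     # outcome probabilities
        counts[(a, b)] = rng.multinomial(n_shots, p / p.sum())
```

For prime d > 2 the MUB matrices would instead be generated from Weyl operators, as noted below, but the surrounding loop is unchanged.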
Explicitly, we consider qudit dimensions d ∈ {2, 3, 5, 7} (Hilbert space dimensions D ∈ {4, 9, 25, 49}). Prime d are chosen for convenience, since a complete set of (d + 1) MUBs can be generated easily in these cases utilizing Weyl operators [30]. We then set λ = 0.95 for all tests and acquire 100D coincidences for each pair of bases in the simulated experiments. Running the pCN algorithm on these observations and recording the time per sample for 2^14 points (following a burn-in period of 2^10 points), we find the trends in Fig. 5. While comparable at low D, the evaluation times for the full and pseudo-likelihood approaches become increasingly disparate as D grows, reaching ∼10× for D = 49. Due to limits on the size of the datasets we could generate, we were not able to reach the O(D^2) asymptotic scaling improvement of Sec. III. Nonetheless, these time tests confirm that the pseudo-likelihood offers computational speedups under appropriate conditions.

VI. CONCLUSION
Continued research should enable even further improvements to the sampling algorithm. While here we have applied the pCN approach to parameters with Gaussian prior distributions, pCN can be extended to non-Gaussian priors as well [31], provided one can select a proposal distribution which preserves reversibility with respect to the prior. It is worth noting also that there exist additional enhancements, for example utilizing derivative information, which are beyond the scope of the present work [32][33][34]. The method can also be embedded within SMC samplers [35,36], and as D → ∞ one can leverage finite approximations to further improve complexity [36]. These directions are under investigation and will be reported in future work.
As it stands, the key features of the present method should find application in a myriad of quantum inference problems. With only minor modifications, models previously tackled by slice sampling-such as photon loss [5] or linear-optic transformations with dark counts [6]-can be transformed into a pCN formalism for algorithmic speedup. On the other hand, the impact of the pseudo-likelihood in improving practical QST is less clear in our view. The pseudo-likelihood certainly appears more robust to initial conditions of the sampler [Fig. 4(b)], with computational improvements for sufficiently large Hilbert spaces and complete measurements [Fig. 5]. Yet at the dimensionalities where the pseudo-likelihood provides order-of-magnitude speedups (e.g., D ∼ 50 in the example of Fig. 5), it is possible that the experimental challenge of acquiring the needed QST data will so outweigh the computational challenge of evaluating the full likelihood as to render the pseudo-likelihood superfluous. That being said, the pseudo-likelihood's general, model-independent form could provide advantages which may not be evident in the specific QST problem of interest here, so that the full potential of the pseudo-likelihood remains an unanswered question.
In conclusion, we have introduced a Bayesian inference method for efficient quantum state tomography. Compatible with any number of observations, our approach enjoys all the standard advantages of Bayesian QST but with significantly improved computational efficiency, through a combination of well-chosen parameterization, likelihood, and MCMC sampling algorithm. Our numerical investigations on both real and simulated data confirm the promise of our approach, particularly the power of advanced statistical techniques such as pCN in practical quantum tomography.