Optimal, reliable estimation of quantum states

Accurately inferring the state of a quantum device from the results of measurements is a crucial task in building quantum information processing hardware. The predominant state estimation procedure, maximum likelihood estimation (MLE), generally reports an estimate with zero eigenvalues. These cannot be justified. Furthermore, the MLE estimate is incompatible with error bars, so conclusions drawn from it are suspect. I propose an alternative procedure, Bayesian mean estimation (BME). BME never yields zero eigenvalues, its eigenvalues provide a bound on their own uncertainties, and under certain circumstances it is provably the most accurate procedure possible. I show how to implement BME numerically, and how to obtain natural error bars that are compatible with the estimate. Finally, I briefly discuss the differences between Bayesian and frequentist estimation techniques.


it is impossible to bracket a zero probability with consistent error bars (footnote 2). The MLE estimate is at best sub-optimal, and at worst dangerously unreliable (implying, for instance, that certain errors can be ruled out).
Bayesian mean estimation (BME) is an alternative procedure that avoids these pitfalls. Unlike MLE, which seeks a unique maximally plausible state, BME considers other states that are only slightly less plausible. The simple underlying principle is that the best estimate is an average over all states ρ consistent with the data, weighted by their likelihood. The BME estimate is always full-rank, and comes equipped with a natural set of error bars. Moreover, each eigenvalue λ of ρ_BME yields an upper bound on its own uncertainty (Δλ² ≤ λ). Best of all, BME is provably the most accurate scheme possible [5], under certain reasonable assumptions.
The body of this paper is divided into three sections. Section 1 explains the problems with MLE. Section 2 presents and analyzes the BME algorithm, along with one possible implementation. Section 3 discusses some unsolved problems.

The state of the art
The oldest and simplest estimation procedure is 'quantum state tomography'. In tomography, the estimator repeatedly measures several observables, records the frequencies of the outcomes and identifies the outcomes' frequencies with their probabilities. Inverting Born's rule yields a unique density matrix ρ_tomo that predicts these probabilities. The most important problem with tomography is that ρ_tomo often has negative eigenvalues, which means that it cannot represent a physical state.
In 1996, Hradil proposed MLE as a more flexible and sophisticated approach [6]. An estimate ρ̂ is a theory about the unknown state. Statisticians define the likelihood of a theory, L(ρ), as the probability that the theory (ρ) would have predicted for the observed data (M) before the experiment took place: L(ρ) = p(M|ρ). Thus, ρ_MLE is simply the ρ that maximizes L(·), i.e. the most 'likely' state. L(ρ) is not a probability distribution over ρ, so ρ_MLE is not in any well-defined sense the most probable state. Actually finding ρ_MLE requires numerics, but several algorithms exist [7]. MLE was successfully applied in 2001 to a quantum optics experiment [8] and has been used extensively since then. MLE has some critical flaws. The most visible is that ρ_MLE can be rank-deficient. If |ψ⟩ is the eigenstate corresponding to a zero eigenvalue, then ⟨ψ|ρ̂|ψ⟩ = 0. Such an estimate, although perhaps not unphysical, is implausible; no experimentalist would believe it. It predicts exactly zero probability for every measurement outcome |ψ⟩⟨ψ| such that ⟨ψ|ρ̂|ψ⟩ = 0. This implies absolute certainty that |ψ⟩⟨ψ| will not be observed, which cannot be justified by a finite amount of data. If N observations with d possible outcomes are available, then the lowest defensible probability estimate for any event is roughly 1/N (footnote 3). This is a practical concern. One of the seminal papers on MLE, James et al [8], estimated the polarization state of two entangled photons, produced by parametric downconversion. The estimated 4 × 4 density matrix has two eigenvalues that are exactly zero. More recently, MLE was used to estimate the entangled state of eight ionic qubits in a trap [9]. Of 256 eigenvalues, more than 200 are less than 1/N (about 10^6 measurements were made), and at least 80 are zero (to within machine precision).

Footnote 2: The savvy reader may be wondering about the error-estimation procedure proposed for MLE in [6] and used (e.g.) in [27]. These error bars are the variance of many MLE estimates, over many datasets obtained by simulating measurements on the original ρ_MLE. They are simply not compatible with the (rank-deficient) estimate. Are they good error bars, i.e. accurate, and compatible with some better estimate? The short answer is no, because this procedure computes the Fisher information matrix at ρ_MLE, not at the true state. For example, suppose we flip a coin once (getting, without loss of generality, 'heads'), and use MLE to estimate p̂_heads = 1. All the simulated measurements then yield 'heads', leading to the absurd conclusion that Δp = 0.
Zero eigenvalues are just the most extreme illustration of a more general problem: ρ_MLE implies predictions that cannot be justified by the data. After 100 observations, p̂ = 10⁻⁸ is no more credible than p̂ = 0. To put it another way, taking ρ_MLE seriously might be exceedingly embarrassing in light of further data. Viewed this way, zero eigenvalues are just a symptom of the larger problem. However, they motivate some useful questions that lend structure to this analysis:
1. Why are zero eigenvalues a problem?
2. Why does MLE produce zero eigenvalues?
3. What is the underlying problem with MLE?

Why are zero eigenvalues a problem?
A quantum state is nothing more or less than a prediction of the future. Like a classical probability distribution, it predicts probabilities for all measurements that could be performed. A state estimate is the estimator's best prediction of what future experimentalists will find when they observe a copy of the estimated system. We should therefore evaluate an estimate on how well it predicts the future.
Quantitative evaluation of an estimate is a matter for debate. Statisticians disagree about how to interpret even a simple statement about a coin flip: 'The probability of observing 'tails' is p tails = 3/4'. However, it is indisputable that ' p tails = 0' implies that 'tails' will never be observed, and that ' p tails = 1' implies that nothing but 'tails' will ever be observed. Such a statement has the force and status of a mathematical theorem, just like 'There is no largest prime number'.
An estimator should hardly claim ' p tails = 0' just because he/she has never observed 'tails'. He/she has a finite number of observations to work from, and p tails might simply be very small. For example, if the data comprise a single flip, then at least one of the possible outcomes will never have been observed, but this does not justify asserting that it will never occur. Even if a dozen trials all yield heads, ' p tails = 0' is unjustified. No matter how many data points the estimator has, we can always imagine a much larger dataset in the future, which might (embarrassingly) debunk the prediction ' p = 0'. Thus, data can never justify reportingp = 0. Only prior knowledge, such as an impossibility theorem, can do so.
One might object that an estimate carries with it an implied uncertainty. For instance, p̂ = 0.5 is clearly a decent estimate of p = 0.51; why is p̂ = 0 not an equally good estimate of p = 0.01? The reason is that zero probabilities are not compatible with any error bars (footnote 4). The estimate p̂ = 0.5 could mean p̂ = 0.5 ± 0.01, meaning 'p is probably between 0.49 and 0.51'. To report p̂ = 0 ± 0.01, however, is nonsensical. This would mean 'p is probably between −0.01 and 0.01', but because p must be non-negative, an unconditionally better description is 'p is probably between 0 and 0.01', or p̂ = 0.005 ± 0.005. This is not the only way of representing 'p is probably between 0 and 0.01'. If the estimator's confidence is skewed toward one side of the interval, then the best p̂ might not be at its center. However, it should necessarily be within the interval, not on its boundary. An estimate on the boundary cannot be optimal, because moving the estimate inside the boundary by some tiny ε improves it (even if the optimal ε is unknown). Since p̂ = 0 is on the boundary of any interval, p̂ = 0 is only optimal when the confidence interval has zero width. Taken seriously, a zero probability thus implies both: (i) absolute certainty about the outcomes of future measurements, and (ii) absolute certainty about the probability itself.
This has practical consequences. If we accept that zero probabilities are implausible, then each zero eigenvalue in ρ̂ should be replaced by a small, but finite, ε. This poses two substantial problems. Firstly, what is ε? It clearly declines with N, but whether it should scale as 1/N or 1/√N is unclear. Moreover, when statistics from many distinct observables are collated, it is not clear what N is. Secondly, how does 'fixing' ρ̂'s small eigenvalues affect its large eigenvalues? Since Tr ρ̂ = 1 is fixed, increasing many small eigenvalues will require decreasing the largest ones. These large eigenvalues are critical to most of the quantities of interest: entanglement, gate fidelity, etc. The only way to resolve this messy situation is to avoid zero eigenvalues in the first place.

Why does MLE produce zero eigenvalues?
The zero eigenvalues in ρ_MLE are connected to the negativity of tomographic estimates. What I will show in this section is that, for a given dataset, if ρ_tomo is not positive, then ρ_MLE is rank-deficient. On the other hand, if the tomographic estimate is positive, then ρ_MLE = ρ_tomo. MLE is thus a sort of 'corrected tomography' (footnote 5).
The valid state-set, comprising all positive density matrices, is a convex subset of Hilbert-Schmidt space, the (d² − 1)-dimensional vector space of Hermitian, trace-1 matrices. Its boundary comprises the rank-deficient states. Whenever ρ_tomo lies outside this boundary, MLE squashes it down onto the boundary, producing a rank-deficient estimate.

Footnote 4: Here is an alternative explanation of why an error of 0.01 is acceptable for p̂ = 0.5, but not for p̂ = 0. The canonical application of probabilities is in making bets; if event E has probability p, then a canny bettor is justified in accepting odds of (1/p − 1) : 1 or better against its occurrence. When p = 0.51, a bettor who accepts 1 : 1 odds will slowly lose his money, but when p = 0.01, the bettor who believes p̂ = 0 and accepts ∞ : 1 odds is truly courting disaster! The operational penalty for believing an (even slightly) erroneous estimate of p̂ = 0 is, indeed, severe. Of course, this argument is moot if the estimator's purpose in reporting a probability is not to predict the future in any sense (in which case, generalized gambling is not a useful paradigm), but this seems to obviate the very meaning of 'probability'.

Footnote 5: This discussion is rigorously correct only when the estimator measures a complete, rather than overcomplete, set of observables. For an overcomplete set, it is not entirely obvious what 'tomography' means. The most common answer would be a least-squares fit, in which case ρ_tomo ≥ 0 does not imply ρ_tomo = ρ_MLE. The general conclusions still hold, however, and rigor can be retained by defining 'tomography' to mean unconstrained MLE, for overcomplete data.

How tomography works.
Quantum state tomography is based on inverting Born's rule: if a positive operator valued measure (POVM) measurement P = {E_1 . . . E_N} is performed on a system in state ρ, then the probability of observing E_i is p_i = Tr(E_i ρ). The probabilities for d² linearly independent outcomes single out a unique ρ_tomo consistent with those probabilities. Several projective measurements (at least d + 1) can, in aggregate, form a quorum, i.e. provide sufficient information to identify ρ_tomo.
Note, however, that no measurement can reveal the probability of an event. Repeated measurements yield frequencies, from which the tomographic estimator infers probabilities. The measurement is repeated N times, and if outcome E_i appears n_i times, we estimate p̂_i = n_i/N. If the measurements form a quorum, then the equations can be solved to yield a unique ρ_tomo. Tomography thus seeks a density matrix whose predictions agree exactly with the observed frequencies. Unfortunately, this matrix is not always a state. Suppose that an experimentalist, estimating the state of a single qubit, measures σ_x, σ_y and σ_z, but only one time each! Having observed the +1 result in each case, he/she seeks a ρ_tomo satisfying ⟨σ_x⟩ = ⟨σ_y⟩ = ⟨σ_z⟩ = 1. Such a matrix exists, but it has a negative eigenvalue λ = (1 − √3)/2 ≈ −0.366. Moreover, this 'state' implies that all three spin measurements would be perfectly predictable, which is impossible.
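The following is a minimal numerical version of this example (my own sketch, not code from the paper): inverting Born's rule with ⟨σ_x⟩ = ⟨σ_y⟩ = ⟨σ_z⟩ = 1 yields a matrix with a negative eigenvalue, so it cannot be a physical state.

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Born-rule inversion for a qubit: rho = (I + <sx> sx + <sy> sy + <sz> sz) / 2
ex, ey, ez = 1.0, 1.0, 1.0          # 'frequencies' after a single +1 outcome in each basis
rho_tomo = (I2 + ex * sx + ey * sy + ez * sz) / 2

# approx [-0.366, 1.366], i.e. (1 - sqrt(3))/2 and (1 + sqrt(3))/2
print(np.linalg.eigvalsh(rho_tomo))
```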
Estimating the state from a single measurement of each basis is a rather extreme example. However, it illustrates a point. Tomography, in a single-minded quest to match Born's rule to observed frequencies, pays no attention to positivity. As the number of measurements (N) increases, the possible tomographic estimates form an N × N × N grid. They fill a 'Bloch cube', defined by ⟨σ_{x,y,z}⟩ ∈ [−1, 1], which circumscribes the Bloch sphere and contains a lot of negative states (see figure 1). If the true state is sufficiently pure, then the probability of obtaining a negative estimate can remain as high as 50% for arbitrarily large N, since the true state is very close to the boundary between physical and unphysical states.
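A small simulation of my own (the true state and shot counts are arbitrary choices) shows how stubborn this is: for a pure true state, the linear-inversion estimate falls outside the Bloch sphere in essentially every run, no matter how large N is; for states slightly inside the boundary the fraction is smaller, but it need not vanish as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

r_true = np.array([0.0, 0.0, 1.0])       # Bloch vector of the true (pure) state
N, trials, negative = 1000, 2000, 0      # N shots per Pauli basis
for _ in range(trials):
    # sample +/-1 outcomes for each Pauli; P(+1) = (1 + r)/2
    est = [2 * rng.binomial(N, (1 + r) / 2) / N - 1 for r in r_true]
    rho = (I2 + est[0] * sx + est[1] * sy + est[2] * sz) / 2
    if np.linalg.eigvalsh(rho)[0] < 0:
        negative += 1
print(negative / trials)                 # close to 1 for this pure true state
```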
In larger Hilbert spaces, the problem gets worse for two reasons. Firstly, the state-set's dimensionality (and therefore the number of independent parameters in ρ) grows as d² − 1. In order to keep the RMS error (ε² = Tr[(ρ_tomo − ρ)²]) fixed, N must grow in proportion to d. Secondly, ρ_tomo has more eigenvalues, so the probability of at least one negative eigenvalue grows with d (for fixed ε²). Together, these scalings ensure that tomographic estimates of large systems are rarely non-negative.
The problems with tomography are well known; negative eigenvalues were precisely the embarrassing feature that motivated MLE. As we shall see, however, MLE's implausible zero eigenvalues are closely related to tomography's negative ones.

Figure 1. A cross section of the 'Bloch cube', which contains all possible tomographic estimates, and circumscribes the Bloch sphere containing all positive estimates. The points shown are possible tomographic estimates for N = 11 measurements each of σ_x and σ_z, with ⟨σ_y⟩ set to zero for the sake of simplicity. Of the 144 ρ_tomo shown, 54 are non-positive (it should be kept in mind that the σ_y dimension is ignored). Depending on the state, some ρ_tomo will of course be more likely than others; this figure merely illustrates the array of possible non-positive estimates.
How MLE works.
The data M = {M_1 . . . M_N} is the list of N observed measurement outcomes, and the likelihood L(ρ) = p(M|ρ) is the probability that ρ assigns to it. M can be compactly represented as a list of frequencies. Define a set P = {E_1 . . . E_m} containing all possible outcomes, and let n_i be the number of times that E_i appears in M. Then M ∼ {n_1 . . . n_m}. As N increases, the frequency representation of M remains short.
Finding ρ_MLE is feasible because L(ρ) has two convenient properties. Firstly, it is non-negative, so we can maximize log L(ρ) instead. Secondly, log L(ρ) is concave. The proof is quite simple: we observe that log L(ρ) = Σ_i log Tr[M_i ρ]; that Tr[M_i ρ] is a non-negative, linear function of ρ; that the logarithm of a linear function is concave; and that the sum of concave functions is concave. Among other things, this means that L(ρ) has a unique local maximum.

The relationship between tomography and MLE.
The likelihood function has another elegant property: if there is a state ρ_tomo such that the probability predicted for every outcome is equal to its observed frequency, then ρ_tomo is the maximum of L(ρ). To prove this, let us write log L(ρ) in terms of (a) the observed frequencies (f_j = n_j/N) and (b) the predicted probabilities (p_j = Tr[E_j ρ]) for all E_j:

log L(ρ) = Σ_j n_j log p_j = N Σ_j f_j log p_j = −N [H(f) + D(f‖p)],

where H(f) = −Σ_j f_j log f_j is the entropy of the observed frequencies and D(f‖p) = Σ_j f_j log(f_j/p_j) is the relative entropy, which is always non-negative, and uniquely zero when p = f. Thus, log L(ρ) is uniquely maximized when p = f. So, if ρ_tomo is a valid state, then ρ_MLE = ρ_tomo. What if ρ_tomo exists, but is not a valid state? It must still be Hermitian and have unit trace. Furthermore, it predicts non-negative probability for each M_i observed, so Tr[E_i ρ_tomo] ≥ 0 for all i. The hyperplanes Tr[E_i ρ] = 0 define a polytope in Hilbert-Schmidt space (a simple example is the 'Bloch cube' in figure 1), which contains ρ_tomo.
If we extend the domain of L(ρ) to this polytope and its interior, then its maximum must coincide withρ tomo , sinceρ tomo predicts the correct frequencies. Tomography, in other words, is essentially unconstrained MLE.
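Below is a quick numerical check of the entropy identity used above (my own snippet; the counts and probabilities are hypothetical numbers chosen only for illustration).

```python
import numpy as np

counts = np.array([14.0, 2.0, 9.0, 7.0])   # hypothetical counts n_j for outcomes E_j
p = np.array([0.40, 0.10, 0.30, 0.20])     # hypothetical predicted probabilities Tr[E_j rho]
N, f = counts.sum(), counts / counts.sum()

log_L = np.sum(counts * np.log(p))
H_f = -np.sum(f * np.log(f))               # entropy of the observed frequencies
D_fp = np.sum(f * np.log(f / p))           # relative entropy: >= 0, zero only when p = f
print(log_L, -N * (H_f + D_fp))            # the two numbers agree
```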
Because L(ρ) has a unique local maximum atρ tomo , its maximum over a closed region that does not containρ tomo must lie on the boundary of that region (see figure 2). The set of non-negative density matrices is precisely such a closed region, so wheneverρ tomo is not a valid state,ρ MLE must lie on the boundary of the state-set. That is, it will be rank-deficient.
MLE and tomography are thus variants of the same procedure, distinguished only by the positivity constraint (footnote 6). MLE is a sort of minimal fix for tomography, returning the non-negative state that is in some sense 'closest' to ρ_tomo. Actually computing the number of zero eigenvalues in ρ_MLE seems difficult, but numerical exploration of 1-, 2-, 3- and 4-qubit problems suggests that ρ_MLE usually has at least as many zero eigenvalues as ρ_tomo has negative ones. In conjunction with the observation that large-system tomography tends to yield many negative eigenvalues, this explains the many zero eigenvalues in experimental applications of MLE.

Figure 2. An example of a likelihood function (for a single qubit) whose unconstrained maximum lies outside the state-set and whose constrained maximum therefore lies on its boundary. The domain shown here is a cross section of the Bloch sphere, with ⟨σ_y⟩ = 0. This particular likelihood function is obtained from 16 measurements each of σ_x and σ_z, comprising 14 |1⟩ and |+⟩ results and 2 |0⟩ and |−⟩ results. The unconstrained maximum of L(ρ) is the linear-inversion point, with |⟨σ_x⟩| = |⟨σ_z⟩| = 0.75, which lies outside the Bloch sphere.

What is the underlying flaw?
Tomography and MLE maximize L(ρ) over different domains. They display the same pathology, implying unjustifiable (zero or negative) probabilities. The underlying problem is simple: maximum likelihood methods are frequentist; they interpret observed frequencies as probabilities. By maximizing L(ρ), they seek to fit the observed frequencies as precisely as possible. If there exists a ρ̂ that fits the data exactly, then it is always chosen as the estimate. The point of state estimation, however, is not solely to explain the data. Rather, it is to find a state that will predict the future. It should concisely describe what the estimator knows about the system being estimated. Mindless data fitting accomplishes only retrodiction of the past. The best description of the past (i.e. the data) probably does not describe the estimator's knowledge, especially his/her uncertainty.
For example, consider estimating the bias of a coin after flipping it just once. The best fit to the data is to assume that the coin always comes up the same way. This clearly does not describe the estimator's knowledge-an honest description would perhaps include the word 'scant'. Ironically, it is the high entropy of the estimator's knowledge that causes a spuriously low-entropy estimate.
The problem with MLE is that it matches probabilities to observed frequencies, consistent with frequentist statistics. This is actually unfair to frequentism, which begins by defining probability as the infinite-sample-size limit of frequency. A true frequentist avoids making statements about probabilities in the absence of an infinite ensemble, so applying a frequentist method to relatively small amounts of data is inherently disaster-prone. Nonetheless, this is precisely what is happening in MLE. For further discussion, see section 2.4.

BME
Bayesian methods provide a different perspective on statistics (see, e.g., [10] as a general reference). Bayesian mean estimation (BME) avoids the pitfalls of MLE. Here are three basic tenets, each of which independently motivates BME:
1. Consider all the possibilities. MLE identifies the best fit to observed data, but many nearby states are almost equally likely. An honest estimate should incorporate these alternatives.
2. Demand error bars. The estimate should be compatible with error bars, e.g. ρ̂ ± Δρ. This implies a region containing most of the plausible states, of size Δρ, with ρ̂ somewhere around the center. If ρ̂ is rank-deficient, no such region exists. Thus, ρ̂ should lie far enough from the state-set's boundary to be compatible with well-motivated error bars.
3. Optimize accuracy. Obviously, the estimate ρ̂ should be close to the true ρ. How do we evaluate this? Quantum strictly proper scoring rules [5] yield a class of metrics designed to measure this closeness, called 'operational divergences'. BME uniquely minimizes the expected value of every operational divergence.
Each of these motivations illustrates one of BME's major advantages. The estimate predicts reliable probabilities for all measurement outcomes, it comes with a free set of error bars, and it is (on average) the most accurate estimate that can be made from the data. Bayesian approaches have been previously discussed in various contexts. Helstrom [11] applied Bayesian methodology extensively to estimation. He considered a variety of utility functions, especially the rather pathological δ-function utility that motivates MLE, without paying particular attention to the posterior mean. Jones [12] applied Bayesian inference with Haar measure, focusing on information-theoretic bounds. Derka et al [13] examined Bayesian estimation in some detail, primarily in its connections to tomography and maximum entropy. Schack et al [14] formalized a deep connection to exchangeable (deFinetti) states. More recently, Neri [15] considered Bayesian estimation of phase difference in coherent light. Tanaka and Komaki [16] proved the optimality of Bayesian estimation with respect to relative entropy.
My goal in this section is to propose BME as a practical procedure for state estimation, and to describe its operational advantages. I begin by concisely presenting the BME algorithm, then discuss in section 2.2 how it can be implemented. Section 2.3 analyzes the properties ofρ BME , focusing on the three advantages asserted above. Finally, section 2.4 contrasts the Bayesian and frequentist approaches.

The BME algorithm
BME is conceptually simple.
1. Use the data to generate a likelihood function, L(ρ) = p(M|ρ). L is not a probability distribution; it quantifies the relative plausibility of different state assignments.
2. Choose a prior distribution over states, π_0(ρ)dρ. It represents the estimator's ignorance, and should generally be chosen to be as 'uniform', or uninformative, as possible.
3. Multiply the prior by the likelihood and normalize to obtain a posterior distribution, π_f(ρ)dρ ∝ L(ρ)π_0(ρ)dρ, which represents the estimator's knowledge. The proportionality constant is set by normalization.
4. Report the mean of this posterior, ρ_BME = ∫ ρ π_f(ρ) dρ. This is the best concise description of the estimator's knowledge.

Implementation
In practice, BME comes down to computing an integral. The best way of doing this remains uncertain, as does the existence of an exact solution. The numerical algorithm presented below has been demonstrated to work well in a small variety of cases. However, it could be improved in many ways, and has some glaring deficiencies. This algorithm should thus be taken as a proof of principle (i.e. it is possible to do Bayesian estimation) rather than an optimal approach. An important observation for any integration procedure is that the likelihood is easy to compute. L(ρ) is the probability of observing a sequence of outcomes M = {M_1 . . . M_N}, given ρ. This is the product of the probabilities for the individual M_i, each of which is given by Born's rule: L(ρ) = Π_{i=1}^{N} Tr[M_i ρ]. When M is represented using frequencies (E_i was observed n_i times, for i ∈ [1 . . . m]), this can be evaluated in O(m) time: L(ρ) = Π_{i=1}^{m} Tr[E_i ρ]^{n_i}.
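A minimal sketch of this evaluation (my own; the function name, interface and the example numbers are assumptions, not the paper's):

```python
import numpy as np

def log_likelihood(rho, effects, counts):
    """effects: the m possible POVM outcomes E_i; counts: how often each was observed."""
    probs = np.array([np.trace(E @ rho).real for E in effects])
    return float(np.sum(np.array(counts) * np.log(probs)))

# Example: counts of sigma_z outcomes on a qubit, represented as projectors.
P0 = np.array([[1, 0], [0, 0]], dtype=complex)
P1 = np.array([[0, 0], [0, 1]], dtype=complex)
rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
print(log_likelihood(rho, [P0, P1], [8, 2]))
```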

The Metropolis-Hastings algorithm.
In the absence of an analytic solution to the integral, we fall back to numerical Monte Carlo methods. Because L(ρ) is usually a sharply peaked function over a high-dimensional space, brute-force random sampling will converge extremely slowly. Metropolis algorithms [17] were conceived for precisely such situations. A variant known as Metropolis-Hastings [18,19] is commonly used for Bayesian estimation, and can be adapted straightforwardly to quantum states. The Metropolis-Hastings algorithm computes the average value of a function (in this case, ρ) over an integration measure (in this case, L(ρ)π 0 (ρ)dρ ). It leverages the fact that L(ρ)π 0 (ρ)dρ is typically dominated by a small region of high likelihood. Whereas basic Monte Carlo methods jump randomly around the integration measure, Metropolis-Hastings makes local, biased jumps. This samples the most relevant parts of the sample space preferentially. After each jump, the current value of ρ is added to a running tally. This tally, divided by the total number of jumps, becomes the final average.
To implement Metropolis-Hastings, we begin with a rule J for jumping from any ρ to a nearby ρ′ = J(ρ). The precise form of the rule is unimportant; it is usually stochastic, although a deterministic rule (traversing a quasi-random set) is conceivable. What is important is that J should generate the underlying measure dρ: for any ρ_0, the set {J^n(ρ_0) : n ∈ [0 . . . N]} should sample uniformly from dρ as N → ∞. For example, we can sample from Lebesgue measure over the interval [0, 1] using the rule J(x) = (x + y) mod 1, where y is selected from a Gaussian distribution with zero mean and fixed variance.
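As a toy sketch of my own, the one-dimensional example reads as follows: iterating the bare jump rule J(x) = (x + y) mod 1 (no accept/reject step yet) samples uniformly from [0, 1].

```python
import numpy as np

rng = np.random.default_rng(5)
x, xs = 0.5, []
for _ in range(100000):
    x = (x + rng.normal(scale=0.1)) % 1.0    # J(x) = (x + y) mod 1
    xs.append(x)
print(np.histogram(xs, bins=10, range=(0, 1))[0])   # roughly equal counts in each bin
```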

Such a rule, unmodified, would compute the average of ρ over the flat measure dρ. To average instead over L(ρ)π_0(ρ)dρ, we modify the rule as follows. After choosing ρ′, but before jumping to it, we compute the ratio r = L(ρ′)π_0(ρ′)/[L(ρ)π_0(ρ)] (when the jump rule already generates the prior measure π_0(ρ)dρ, as in the implementation below, this is just the likelihood ratio L(ρ′)/L(ρ)). If r > 1 (ρ′ is more likely than ρ), then we jump as before. If not, we jump to ρ′ with probability r, and stay at ρ (adding it, once again, to the running total) with probability 1 − r. This biasing ensures that the algorithm spends more time at more likely spots, and tends to lurch uphill into regions of high probability. Unlike a gradient algorithm (as might be used for MLE), it does not actively seek the point of highest probability; jumping to a region of lower probability is both possible and necessary. Detailed discussion and explanation of why this works can be found in [19].

Applying Metropolis-Hastings to quantum states.
The heart of the algorithm is the rule J. It determines dρ, and its form is critical to the algorithm's performance. Different underlying measures will require different rules. Measures with some claim to 'uniformity' are usually invariant under a symmetry group. The natural group for quantum states on a d-dimensional Hilbert space is SU(d), and the measure that this group induces over pure states is called Haar measure. A sensible prior should extend over all possible states, so we need measures extending to mixed states. However, there is no uniquely suitable measure over mixed states, because their spectral degrees of freedom (eigenvalues) have no obvious symmetry. One appealing class of measures, proposed by Życzkowski and Sommers [20], is the set of induced measures, denoted here by d_kρ. They are obtained by beginning with Haar measure on a d × k dimensional system, then tracing out the ancillary factor. Thus, d_1ρ is simply the Haar measure on pure states, while d_dρ is the Hilbert-Schmidt measure (Lebesgue measure on the vector space of Hermitian d × d matrices).
These induced measures are easy to implement. Instead of keeping track of ρ itself, we generate and track a pure state |ψ_{d×k}⟩ in d × k dimensions. At each step, ρ is obtained by tracing out part of |ψ_{d×k}⟩⟨ψ_{d×k}|. The ancillary degree of freedom acts as a sort of hidden variable, internal to the algorithm. We need only a rule J to implement Haar measure over the larger Hilbert space.
This could be done in many ways-for instance, at each step, we could generate a random unitary from Haar measure. This has two huge drawbacks. Firstly, the jumps are non-local, which negates the key advantage of the Metropolis-Hastings algorithm. Secondly, generating and applying a random unitary is computationally expensive. Instead, we need a relatively small set of efficiently constructable unitaries that generate the entire group.
Here is an efficient, local random-walk rule that generates Haar measure on a d-dimensional Hilbert space:
1. Choose a pair of basis states |i⟩ and |j⟩ at random, along with a Hermitian generator H_ij supported on their two-dimensional span.
2. Choose a distance, δ, from a distribution (e.g. Gaussian) with ⟨δ⟩ = 0 and ⟨δ²⟩ = ε². We will discuss the choice of ε below.
3. Let J(|ψ⟩) = e^{iδH_ij}|ψ⟩. Since this unitary acts nontrivially only on the |i⟩, |j⟩ subspace, it can be applied very easily and quickly.
Each step's distance is chosen randomly to ensure uniform sampling; with a fixed step size, it is just barely conceivable that this algorithm might trace out a discrete lattice of states. The average step size ε is important: if ε is too large, the algorithm will not find small regions of high probability efficiently; if ε is too small, it will explore the space very slowly. The optimal ε will depend on L(ρ), and there is no way to identify it a priori.
The algorithm must therefore vary ε dynamically, with feedback. If ε is very small, then almost every jump will be accepted, whereas if ε is large, very few will be accepted. A good heuristic is that the acceptance ratio should be around 60% (other values have also been suggested [19]). The algorithm should track the acceptance ratio over the last ∼1000 jumps, gradually changing ε as appropriate to maintain it around 60%.
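The following proof-of-principle sketch is my own implementation of the procedure just described, for a single qubit; the dataset, the choice of generator, the fixed step size ε and the chain length are all assumptions for illustration (in practice ε would be adapted to hold the acceptance ratio near the target above). The Hilbert-Schmidt prior is realized by tracking a purification in d·k = 4 dimensions, and rejected jumps re-count the current state.

```python
import numpy as np

rng = np.random.default_rng(1)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Data: counts of +1/-1 outcomes for sigma_x, sigma_y, sigma_z (hypothetical numbers).
effects, counts = [], []
for pauli, (n_plus, n_minus) in [(sx, (9, 3)), (sy, (7, 5)), (sz, (11, 1))]:
    effects += [(I2 + pauli) / 2, (I2 - pauli) / 2]
    counts += [n_plus, n_minus]
counts = np.array(counts, dtype=float)

def partial_trace_ancilla(psi):
    """psi lives in C^2 (system) x C^2 (ancilla); return the system's 2x2 state."""
    m = psi.reshape(2, 2)
    return m @ m.conj().T

def log_L(rho):
    probs = np.array([np.trace(E @ rho).real for E in effects])
    return np.sum(counts * np.log(np.clip(probs, 1e-300, None)))

def jump(psi, eps):
    """Rotate a randomly chosen two-dimensional subspace of the purification."""
    i, j = rng.choice(4, size=2, replace=False)
    a, b, c = rng.normal(size=3)
    h = a * sx + b * sy + c * sz                     # random 2x2 Hermitian generator
    w, V = np.linalg.eigh(h)
    delta = rng.normal(scale=eps)
    U2 = V @ np.diag(np.exp(1j * delta * w)) @ V.conj().T
    U = np.eye(4, dtype=complex)
    U[np.ix_([i, j], [i, j])] = U2
    return U @ psi

# Start from a Haar-random purification.
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
rho = partial_trace_ancilla(psi)
ll = log_L(rho)

eps, n_steps = 0.1, 20000                  # untuned guesses
tally = np.zeros((2, 2), dtype=complex)
for _ in range(n_steps):
    psi_new = jump(psi, eps)
    rho_new = partial_trace_ancilla(psi_new)
    ll_new = log_L(rho_new)
    if np.log(rng.uniform()) < ll_new - ll:  # accept with probability min(1, L'/L)
        psi, rho, ll = psi_new, rho_new, ll_new
    tally += rho                             # rejected steps re-count the current state

rho_BME = tally / n_steps
print(np.round(rho_BME, 3))
```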
Dynamically adjusting the step size like this can, in theory, break the convergence properties of the algorithm. This occurs if the distribution over states is multimodal; the step size is reduced in order to explore one narrow peak in detail, and a far-off peak becomes inaccessible. Fortunately, the likelihood function itself is guaranteed to be log-concave and therefore unimodal. For well-behaved priors with convex support (e.g. the Hilbert-Schmidt prior, d_dρ), this means that L(ρ)π_0(ρ) can safely be sampled this way.
Other priors-in particular, the Haar prior, which is interesting as a limiting case-do not have convex support. These priors will yield multimodal posterior distributions. How to effectively and reliably sample from such distributions is an open problem. Repeating the sampling many times, with randomly distributed starting points, is not reliable. It fails badly if two similar peaks in the distribution have unequally sized regions of convergence; the peak with the larger convergence region will be relatively oversampled.

[Good] properties of the BME estimate
Why should an experimentalist use BME? After all, BME (via Monte Carlo) is more computationally intensive than MLE. The answer, of course, is that BME provides a better estimate than MLE. Specifically: (i) ρ_BME's eigenvalues are never unjustifiably small (or zero); (ii) the procedure can easily be made to yield well-motivated error bars that are compatible with ρ_BME; (iii) BME is, in a particular sense, the most accurate possible estimate, not just asymptotically, but for finite N.

The estimate is plausible.
The first objection to MLE is thatρ MLE is implausible; it can (and often does) have zero eigenvalues, which imply an unjustified certainty. Any alternative procedure should yield a strictly positive estimate. BME yields just such an estimate, subject to a very weak restriction on the prior.
Consider a simple and illustrative example in classical estimation. We estimate the bias b of a coin, which comes up 'heads' with probability b, and 'tails' with probability 1 − b.
Flipping the coin N times yields a measurement record consisting of n 'heads' and N − n 'tails'. The likelihood function is L(b) = b^n (1 − b)^{N−n}, and so the MLE estimate is b̂_MLE = n/N.
If n = 0 or n = N, then b̂_MLE will assign zero probability to observing 'heads' or 'tails', respectively. If we adopt a Bayesian approach, then we must choose a prior, e.g. the uniform prior with respect to Lebesgue measure, π_0(b)db = db. The mean of the posterior is an integral of the likelihood, and we obtain

b̂_BME = ∫ b · b^n (1 − b)^{N−n} db / ∫ b^n (1 − b)^{N−n} db = (n + 1)/(N + 2).

Since 0 ≤ n ≤ N, the Bayesian estimator never assigns zero probability to anything. The lowest possible probability assignment for either 'heads' or 'tails' is p_min = 1/(N + 2). With no data at all, the Bayesian assigns p = 1/2 to both outcomes; after a single flip, he/she assigns p = 2/3 to the outcome that was observed and p = 1/3 to the other. This is the property that we want in a quantum estimation procedure. The probabilities assigned to unobserved events are not only nonzero, but also sensible: after N trials, it is reasonable to assume that the probability of an as-yet-unobserved outcome is at most 1/N, and to assign p_min ≈ 1/N.
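A quick numerical check of this example (my own snippet, with an arbitrary dataset of 12 flips and no 'heads'):

```python
import numpy as np

n, N = 0, 12                              # a dozen flips, none of them 'heads'
b = np.linspace(0.0, 1.0, 200001)
likelihood = b**n * (1 - b)**(N - n)      # L(b) = b^n (1 - b)^(N - n)
weights = likelihood / likelihood.sum()   # uniform prior: posterior weights ~ likelihood

b_mle = n / N                             # 0.0: 'heads' declared impossible
b_bme = (b * weights).sum()               # ~ (n + 1)/(N + 2) = 1/14
print(b_mle, b_bme)
```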
However, this property depends on the prior. Consider the prior π_0(b) = ½[δ(b) + δ(1 − b)]. After one observation of 'tails', our Bayesian estimate would be b̂_BME = 0, which is implausible. The problem is that a finite number of observations (one) ruled out every b in the support of π_0 that ascribed nonzero probability to 'heads'. The situation gets even worse if the next observation is 'heads'. The data now rule out every hypothesis, the posterior π_f(b)db vanishes entirely, and the Bayesian procedure simply fails. This stems from a contradiction. A prior over states implies a probability distribution over observations as well. π_0 assigned exactly zero probability to M = {'heads', 'tails'}, which was then observed, causing a contradiction.
The following statements about a prior π 0 are logically equivalent: (a) π 0 assigns zero probability to some (finite-length) measurement record. (b) Bayesian estimation using π 0 will, for some measurement record, yield an estimate with zero probability. (c) There exists a measurement record that will annihilate π 0 , so that Bayesian estimation fails completely.
Let us define a fragile prior as one for which these statements hold (and which can therefore yield a rank-deficient estimate). An estimator should choose a robust (i.e. not fragile) prior, which in turn guarantees a full-rank estimate.
In classical probability estimation, avoiding fragility is simple: a prior is robust if and only if it has support in the interior of the probability simplex. States in the interior do not predict zero probability for any observation. They can never be ruled out, so a prior supported on one can never be annihilated by the data. Conversely, every prior supported only on the boundary will be annihilated by a measurement record that includes every possible outcome.
Intriguingly, this condition does not extend to the quantum problem. Support in the interior (i.e. on the full-rank states) is sufficient, but not necessary, for robustness. Consider estimation of a single qubit using the Haar prior, which is restricted to (and uniform over) the pure states. Each observation rules out, at most, a single pure state; if |0⟩⟨0| is observed, then the true state cannot be |1⟩⟨1|. There are uncountably many distinct candidate pure states, which means that no (finite-length) measurement record can annihilate the prior. The Haar prior is robust.
As a general rule, just about every prior that a halfway-sane estimator would pick is, in fact, robust. Not only the Haar prior (which implies absolute certainty that ρ is pure), but much more extreme priors, such as an equatorial distribution on the Bloch sphere-or, for that matter, any continuous curve on the Bloch sphere's surface-are robust. The appendix demonstrates a necessary and sufficient condition.

The estimate comes with natural error bars.
Another objection to the MLE procedure is thatρ MLE is not, in general, compatible with any error bars. This is an obvious consequence of zero eigenvalues; error bars imply a region of plausibility surrounding the point estimate. When the estimate lies on the state-set's boundary, no such region can exist-in order thatρ MLE be in its interior, the region would have to contain negative matrices.
The BME estimate is always full-rank, which is encouraging. This in itself does not guarantee compatibility with sensible error bars. The estimate ρ̂ = (99/100) ρ_MLE + (1/100) 1l/d is full-rank, but the estimator's honest uncertainty about ρ might well be greater than ±1%. Happily, the BME estimation procedure can easily be adapted to yield natural error bars, which are compatible with the point estimate.
First, let us consider what form these error bars should take. Intuitively, the qualified estimate should look like ρ̂ ± Δρ, but what, precisely, is 'Δρ'? As ρ̂ is a d × d matrix, we might suppose that Δρ is also a d × d matrix, so that ρ_ij = ρ̂_ij ± Δρ_ij. This fails to account for covariance between distinct elements of ρ. For example, the diagonal elements of ρ̂ must vary together to maintain Tr(ρ) = 1.
The correct way to think about the estimator's uncertainty begins by representing the estimate, ρ_BME, as a (d² − 1)-dimensional vector in Hilbert-Schmidt space. For a single qubit, this vector is (Tr(σ_x ρ), Tr(σ_y ρ), Tr(σ_z ρ)). The estimator's uncertainty is represented as a symmetric covariance matrix Δρ on the same space. The elements of Δρ involve two different expectation values: one with respect to the state, denoted ⟨X⟩_ρ ≡ Tr(Xρ); and one with respect to the posterior probability, denoted ⟨f⟩ ≡ ∫ f(ρ)π_f(ρ)dρ. Using this notation,

Δρ_xy = ⟨⟨σ_x⟩_ρ ⟨σ_y⟩_ρ⟩ − ⟨⟨σ_x⟩_ρ⟩⟨⟨σ_y⟩_ρ⟩,

with the other elements given by the obvious generalization.
Represented as a covariance matrix, Δρ quantifies the second cumulant of the estimator's probability distribution π_f(ρ)dρ. It defines an ellipsoid in Hilbert-Schmidt space, which is a credible interval (the Bayesian version of a confidence interval). The eigenvectors of Δρ are operators that define the principal axes of this ellipsoid, and the corresponding eigenvalues are their lengths.
As a matrix that acts on density matrices, Δρ is a superoperator. It is symmetric and non-negative, but not completely positive or trace-preserving, so it cannot be interpreted as a quantum process. However, the superoperator interpretation gives a formula for the estimator's uncertainty about the expectation value of a particular operator X. Defining Δρ[X] to be the superoperator's action on X,

⟨ΔX²⟩ = Tr(X Δρ[X])

quantifies the estimator's expected error in ⟨X⟩. Alternatively, Δρ can be represented as an unnormalized symmetric bipartite state, and in this representation the estimator's expected error in ⟨X⟩ is

⟨ΔX²⟩ = Tr[Δρ (X ⊗ X)].

This Δρ is a consistent representation of the estimator's uncertainty; for any X, it yields the same ⟨ΔX²⟩ that an independent estimate of ⟨X⟩ would. To see this, let X be an arbitrary observable with eigenvalues between x_min and x_max. The variance computed via BME is

⟨ΔX²⟩ = ∫ ⟨X⟩²_ρ π_f(ρ)dρ − [∫ ⟨X⟩_ρ π_f(ρ)dρ]².

Because ⟨X⟩ parameterizes exactly one of the dimensions of Hilbert-Schmidt space, we can compute a marginal distribution over ⟨X⟩ by integrating π_f(ρ)dρ over its other d² − 2 dimensions, which we denote by σ. Then dρ = d⟨X⟩ dσ, and

π_f(⟨X⟩) = ∫ π_f(ρ) dσ,

in terms of which

⟨ΔX²⟩ = ∫ ⟨X⟩² π_f(⟨X⟩) d⟨X⟩ − [∫ ⟨X⟩ π_f(⟨X⟩) d⟨X⟩]²,

which is the familiar formula for the variance of the univariate distribution π_f(⟨X⟩).
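In sampling-based implementations, Δρ is simply the covariance of the Bloch vector over the posterior samples. The following self-contained illustration is my own; for brevity it draws stand-in samples directly from the Hilbert-Schmidt measure, whereas in practice the samples would be the states visited by the Metropolis-Hastings chain.

```python
import numpy as np

rng = np.random.default_rng(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def hs_random_state():
    """A Hilbert-Schmidt-random qubit state (Ginibre construction)."""
    G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

samples = [hs_random_state() for _ in range(20000)]
vecs = np.array([[np.trace(P @ r).real for P in (sx, sy, sz)] for r in samples])

r_mean = vecs.mean(axis=0)                        # Bloch vector of the mean state
Delta_rho = np.cov(vecs, rowvar=False, ddof=0)    # covariance matrix on Hilbert-Schmidt space

# Uncertainty in <X> for a particular observable, e.g. X = sigma_z:
x = np.array([0.0, 0.0, 1.0])                     # sigma_z, as a Hilbert-Schmidt vector
print(x @ Delta_rho @ x, vecs[:, 2].var())        # identical: <Delta X^2>
```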

BME optimizes accuracy.
Above all else, an estimation procedure should yield an accurate estimate, one as close to the 'true' state as possible. While the concept of a 'true' state is problematic in actual experiments, it makes perfect sense in the context of a simple game. An impartial judge selects a state ρ, and provides N copies of it to the estimator, who measures them and reports an estimate ρ̂. The best procedure is the one that consistently makes ρ̂ as close as possible to the unknown (to the estimator) ρ. Which procedure is 'best' depends on the situation. For instance, if the judge always picks a particular ρ_0, and the estimator knows this, then the obvious best procedure is 'Report ρ̂ = ρ_0 no matter what!' This is a trivial case. This section considers a situation where the judge selects ρ at random from a distribution π_0(ρ)dρ. For a frequentist, this scenario is well defined, but not completely general (but see comments at the end of the section). For a Bayesian, this description is so general as to be tautological; any uncertainty about the preparation can (and should) be represented this way.
I will show that BME with the prior π 0 (ρ)dρ is unconditionally the most accurate scheme possible. It minimizes the expected error betweenρ and ρ. This optimality holds for every finite N , not just asymptotically. It depends of course on the measure of 'error' between ρ and ρ adopted. The error measures optimized by BME, operational divergences, are arguably the best-motivated such measures. The argument presented here is brief; for more details, see [5].
Operational divergences, denoted Δ(ρ : ρ̂), measure how well the density matrix ρ̂ describes (or estimates) the quantum state ρ. A certain subtlety should be noted here: whereas ρ represents the state of a quantum system, ρ̂ is a classical description of a state, e.g. a density matrix written down on paper. Two natural requirements constrain operational divergences. Firstly, Δ must represent the outcome of some physically implementable process. Secondly, the best description of ρ had better be ρ itself.
To satisfy operationality, we imagine trying to motivate the estimator to do a good job. A third-party verifier, equipped with the estimate ρ̂, will perform a measurement on ρ. This measurement, P(ρ̂) = {E_1 . . . E_m}, is an arbitrary POVM that may depend on ρ̂. Depending on the outcome (i), the verifier pays the estimator an amount r_i(ρ̂).
The estimator's reward is represented by the operator R(ρ̂) = Σ_i r_i(ρ̂) E_i, and his/her expected reward (which he/she hopes to maximize) is r(ρ : ρ̂) = Tr[ρ R(ρ̂)].
The amount that he/she loses by inaccurately describing the state, Δ(ρ : ρ̂) = r(ρ : ρ) − r(ρ : ρ̂), is an operational divergence. Note that (i) it is operationally significant; and (ii) the best description of ρ is ρ itself.
Of course, not every reward scheme is strictly proper, i.e. satisfies the condition that ρ be its own best estimate: r(ρ : ρ) > r(ρ : ρ̂) for all ρ̂ ≠ ρ. This condition is a constraint on R(ρ̂). If we define G(ρ) ≡ r(ρ : ρ) as the expected reward for a perfect estimate, then a bit of algebra shows that the condition holds if and only if (i) r(ρ : ρ̂), as a function of ρ, is tangent to G(ρ) at ρ = ρ̂; and (ii) G(·) is strictly convex. Thus, for every strictly convex function G(·) on density operators, there is a unique operational divergence (footnote 7):

Δ(ρ : ρ̂) = G(ρ) − G(ρ̂) − Tr[(ρ − ρ̂)∇G(ρ̂)],

where ∇G(·) is the gradient of G(·).
Operational divergences include widely used measures such as the squared Hilbert-Schmidt or L² distance, associated with G(ρ) = Tr(ρ²); and the relative entropy or Kullback-Leibler divergence, associated with G(ρ) = −H(ρ) = Tr(ρ log ρ). Now that we have determined how to measure accuracy, let us try to optimize it. This is an easy task for an omniscient estimator, because the best estimate of ρ is ρ itself. If the estimator actually knows ρ, then her best plan is to report ρ̂ = ρ. The interesting case is an uncertain estimator. She must consider all the possible ρ, in order to choose the best ρ̂. A risk-neutral estimator seeks to maximize his/her expected reward, averaged over all possible ρ.
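As a quick check of my own that the construction above reproduces a familiar quantity: for G(ρ) = Tr(ρ²), the divergence Δ(ρ : σ) = G(ρ) − G(σ) − Tr[(ρ − σ)∇G(σ)] (with σ playing the role of the estimate ρ̂) is exactly the squared Hilbert-Schmidt distance.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_state(d=2):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

rho, sigma = random_state(), random_state()
G_of = lambda r: np.trace(r @ r).real
grad_G = lambda r: 2 * r                      # gradient of Tr(rho^2)

delta = G_of(rho) - G_of(sigma) - np.trace((rho - sigma) @ grad_G(sigma)).real
print(delta, np.trace((rho - sigma) @ (rho - sigma)).real)   # identical, and >= 0
```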
Consider any estimation procedure, i.e. a map from measurement records M to estimates ρ̂(M). Which procedure should the estimator choose? Suppose that the unknown state ρ to be estimated will be drawn from an ensemble described by π_0(ρ)dρ. The expected reward yielded by the procedure ρ̂(M) is an average over (i) the possible ρ and (ii) the ensuing M:

⟨r⟩ = ∫ Σ_M p(M|ρ) r(ρ : ρ̂(M)) π_0(ρ)dρ.

Inserting r(ρ : ρ̂) = Tr[ρ R(ρ̂)],

⟨r⟩ = ∫ Σ_M p(M|ρ) Tr[ρ R(ρ̂(M))] π_0(ρ)dρ.

The trace, sum and integral are all linear, so we can rearrange them as

⟨r⟩ = Σ_M Tr[ (∫ ρ p(M|ρ) π_0(ρ)dρ) R(ρ̂(M)) ].

We now observe that ∫ p(M|ρ)π_0(ρ)dρ = p(M), the unconditional probability of observing M. Furthermore, ∫ ρ p(M|ρ)π_0(ρ)dρ = p(M) ρ_BME(M), where ρ_BME(M) is the BME estimate given the prior π_0 and the data M. Using these identities, the estimator's expected reward is

⟨r⟩ = Σ_M p(M) Tr[ρ_BME(M) R(ρ̂(M))] = Σ_M p(M) r(ρ_BME(M) : ρ̂(M)),

and each term in the sum can be independently maximized. For each M, the optimal ρ̂(M) is ρ_BME(M), which means that BME is unconditionally the optimal estimation procedure. This result is remarkable because it makes no appeal to asymptotics; the optimality holds for 100, 10, or even just one observation. Of course, when the estimator has insufficient data, the resulting estimate will not be very accurate, but neither will any other estimate. Crucially, his/her uncertainty will be reflected in a highly mixed estimate, with large error bars. Unlike MLE, BME fails gracefully, making the best use of the available data without over-reaching.
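A small illustration of my own (with a stand-in 'posterior' ensemble and the squared Hilbert-Schmidt divergence): the mean of the ensemble beats any other single estimate on average, which is the content of the argument above for this particular divergence.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_state(d=2):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

ket0 = np.array([[1, 0], [0, 0]], dtype=complex)            # |0><0|
posterior_samples = [0.7 * ket0 + 0.3 * random_state() for _ in range(5000)]
rho_mean = sum(posterior_samples) / len(posterior_samples)  # the BME-style estimate

def expected_hs_error(candidate):
    return np.mean([np.trace((r - candidate) @ (r - candidate)).real
                    for r in posterior_samples])

print(expected_hs_error(rho_mean))              # smallest
print(expected_hs_error(posterior_samples[0]))  # any single plausible state does worse
print(expected_hs_error(ket0))                  # so does a rank-deficient boundary estimate
```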
There are a few caveats that should be kept in mind. Firstly, BME need not optimize measures that are not operational divergences-e.g. trace distance or fidelity. Measuring estimation performance using these measures is generally unwise, but they (especially fidelity) are commonly misused. Secondly, optimality requires the estimator's prior to coincide with the ensemble from which the unknown states were selected. A sufficiently 'wrong' prior will lead to horrendous results. The BME estimate is still the most honest and accurate representation of the estimator's knowledge, but that knowledge may have been predicated on dangerous prior assumptions.
This appears, at first glance, like an unbridgeable gap between Bayesians and frequentists. Happily, it is not, and there is a clear road forward. The problem of estimating classical probability distributions (in identical circumstances to our problem), has been largely solved using the minimax framework [21][22][23]. The basic idea is to consider the set of all possible estimation protocols M →ρ(M), and for each protocol to identify the state ρ for which it works the worst. Then the protocols are ranked by their worst-case performance, and the one with the best worst-case performance is chosen. This 'minimax solution' is guaranteed to work pretty well for every state, so its optimality can be demonstrated even to a die-hard frequentist. However, the minimax solution is (provably) always a BME protocol, with a particular 'noninformative' prior. Hence, in probability estimation, open-minded Bayesians and frequentists can agree on a single unconditionally reliable protocol. This brief summary does not do justice to the subtle and surprising details of minimax analysis; suffice it to say that similar results for quantum estimation would be very useful, but deriving them is not a trivial problem.

Bayesian and frequentist approaches
Having examined both frequentist and Bayesian approaches, I have focused on the concrete details-(How does MLE fail? Why does BME do better? How is BME done?)-because estimation is an operational task. Certain readers may, however, ask 'what is wrong with frequentism, anyway?' Others may be wondering what really distinguishes Bayesian and frequentist methods, since L(ρ) is crucial to both. I attempt to address these questions below.

Why frequentism fails.
The frequentist approach has dominated statistics for most of the 20th century, so its failure in quantum state estimation requires some explanation. To see why frequentism fails, we might first ask why it should succeed.
MLE attempts to fit the observed data, and so the MLE estimate is the best 'predictor' of the past. Since the goal of a state estimate is to predict the future, frequentist estimation techniques can be justified by the following axiom: the future will look [statistically] identical to the past. If this axiom is true, thenρ MLE is the best possible estimate. The law of large numbers implies its validity as N → ∞, and the central limit theorem quantifies this convergence.
For classical systems, it is always possible that the frequentist axiom will hold. If the coin comes up 'heads' the first time, it is entirely possible that it will always come up heads. Moreover, the rules are not going to change-the possible outcomes in the past were 'heads' and 'tails', and they will remain the only possible outcomes in the future.
This does not hold for quantum systems. The past, represented by the estimator's data, comprises a finite set of observations extracted from a finite variety of measurements. For instance, the estimator might have measured σ x , σ y and σ z on a qubit. Future experimenters, however, might choose to measure any observable-and there are infinitely many. A quantum state, by definition, predicts the probabilities for every possible measurement. The frequentist axiom cannot possibly hold; any future observer could violate it at will, simply by making a novel measurement.
Frequentist methods for classical probabilities yield zero probabilities only when (a) event 'i' has never been observed, (b) in every trial, something in the complement of event 'i' was observed.
That is, event 'i' could have happened, but it did not. When MLE is used on quantum systems, the |φ_i⟩⟨φ_i| that ends up getting assigned zero probability is almost never something that could have been observed. The Achilles' heel of frequentist quantum estimation is that it happily assigns zero probability to events that were never observed not to happen. To avoid this problem, we need a method that does not begin by assuming 'the future will look like the past', because for a quantum system, that cannot be true.

How the Bayesian and frequentist approaches differ. L(ρ) is the key ingredient in
Bayesian methods, just as in frequentist ones. It represents everything relevant about the data. In frequentist methods, L(ρ) is the sole ingredient, and so the only natural thing to do is to find thê ρ that maximizes it. Bayesian methods, in contrast, transform the likelihood into a probability distribution, by multiplying it by a prior distribution π 0 (ρ)dρ.
A common misconception is that this transformation is trivial when π 0 (ρ)dρ is 'flat' (e.g. coincides with a Lebesgue or Haar measure). On the contrary, it transforms a function into a distribution (or measure), which is an entirely different mathematical object. Functions, like L(ρ), have values. Distributions have integrals-they assign values not to points, but to regions.
For example, if L(x) is defined for real-valued x, then L(0) and L(1) are well defined, but ∫₀¹ L(x) is purely meaningless. To integrate, we must multiply by dx (a measure), obtaining a distribution L(x)dx. This can be integrated over the interval [0, 1], but evaluating L(x)dx at x = 0 is ill defined (and infinitesimal in any case). This difference between functions and distributions enforces a difference in approach between frequentist and Bayesian methods. Frequentists, abjuring priors, can only work with the function L(ρ). The corresponding estimate, ρ̂_0, will be distinguished by the value of L(ρ̂_0). The Bayesian approach begins and ends with a distribution that has no values. Everything must be phrased in terms of measurable subsets (e.g. intervals), and integration over them.
Estimation algorithms transform data (observed in the past) into an estimate (which predicts the future). In order to select the best estimate, we must logically connect the past and the future. Frequentist methods implicitly use the frequentist axiom, while Bayesian approaches take a weighted average over all possible theories. This averaging is particularly apropos for quantum state estimation, because density matrices have a natural convex structure. Suppose a physicist knows that a qubit's state is |0⟩ with probability 1/3 and |1⟩ with probability 2/3. He will describe it by the average state, ρ = (1/3)|0⟩⟨0| + (2/3)|1⟩⟨1|, not the most likely state, ρ = |1⟩⟨1|. Viewed this way, the prior replaces the frequentist axiom as a connection between the past and the future. This can be an advantage, for a Bayesian is capable of gracefully acknowledging that the data are not descriptive of the true state, that they are unlikely, or simply insufficient. However, the price paid for this flexibility is the need to choose a prior, often without any good justification.

Where do we go from here?
The Bayesian approach to state estimation has undeniable advantages. It is accurate, it honestly represents the estimator's knowledge, and it conforms to quantum states' role as predictors. Purely frequentist approaches-e.g. maximum likelihood as it is currently used-cannot match these qualities.
Nonetheless, BME comes with an array of concomitant challenges. These range from the purely practical (integration is hard) to the fundamental (How do we choose a prior?). While some are specific to Bayesian methods, others cast doubt on the scalability of any state estimation procedure.

The Prior's Tale
Of all the problems and caveats raised by BME, none is more pressing or obvious than 'How do we choose a prior?'. BME's optimality depends on the estimator's prior matching the 'true' distribution of unknown states. This is fine in the rather artificial context of a state-estimating game that might be played many times, but physics experiments are not drawn from an ensemble. Each experiment is, as a rule, unique.
The prior is therefore a necessary fiction. As a convenient way of representing the estimator's ignorance (either genuine, or assumed for the sake of scientific impartiality), it ought to be as uninformative as possible. Unitary invariance is a good first guideline (see section 3.3 below, however). Over the spectrum of ρ, however, no uniquely suitable measure exists. Identifying particularly useful and non-informative priors remains an open and urgent question.
A related open question is 'What is the penalty for choosing the wrong prior?'. If accuracy is measured by an operational divergence, then BME must outperform MLE and all other methods-if the estimator's prior matches the distribution of unknown states. Its robustness to a poor prior is unknown. The optimality proof given previously is elegant in its simplicity, but precisely because of that elegance, it provides few clues to this problem.

Practical matters
Every calculus student learns that integration is harder than differentiation. Numerical integration is an active and challenging field of numerical analysis, whereas differentiation involves little more than function evaluation. BME consists almost entirely of integration, whereas MLE is a maximization problem. Unsurprisingly, the implementation of BME described above is roughly an order of magnitude slower than MLE. Experimentalists, already frustrated by MLE analyses that run for a week or more [28], may be nonplussed.
This state of affairs may owe a great deal to the fact that MLE algorithms, unlike BME, have been developed and used for 5-10 years. Substantial speedups are likely in the future, precisely because numerical integration remains something of an art. The Metropolis-Hastings algorithm already provides a tremendous advantage over naïve Monte Carlo, so a few more orders of magnitude may be feasible.
One reason for optimism is that the BME integral appears, in principle, to have a rather simple analytic form. The likelihood function is a polynomial, the product of many linear functions of the form Tr(ρM_i). For certain priors (e.g. Hilbert-Schmidt), the resulting posterior looks a lot like a beta distribution of the form β(x) = x^n(1 − x)^{N−n}. This appears in classical estimation and is easy to integrate. What makes the quantum case hard is the boundary conditions. Unlike the classical probability simplex, the quantum state-set has curved edges that are awkward to integrate over. However, analytic solutions can be obtained for small N, and a general solution might be possible.

Scalability
Quantum devices exist that provide coherent control over 8-12 qubits [9,24]. Twenty or 30 qubits will probably be controlled within the next five years (if only for a short time, and with limited fidelity). The Hilbert space of a 30-qubit quantum register is enormous-to merely store one density matrix for such a device would require just under 1 million terabytes of memory. State estimation, as we know it, is impossible in this context. Nonetheless, characterizing quantum hardware will remain important. A quantum computer will not need state estimation; its results will appear as a computational basis state, determined by a projective measurement. Development and testing of components, however, will depend crucially on state estimation. It is not sufficient to know whether or not the desired state is produced; the designer will want to know the nature of the errors, so as to correct them. Eventually, these errors need to be reduced below a fault-tolerance threshold that is probably less than 10 −3 .
As the states that are estimated grow larger and the uses to which they are put become more demanding, utterly new techniques will be needed. Unbiased estimation-i.e. guessing the system's state without any pre-existing assumptions-becomes exceedingly data intensive for large Hilbert spaces. Making use of the estimator's prior knowledge will be essential. The Bayesian approach presented here provides a natural framework for doing so. However, a framework for reliably representing that prior knowledge (without falling prey to self-fulfilling prophecies) will be necessary. This reason alone would justify further study of Bayesian state estimation.
Another approach to the same problem is to focus on certain properties of the state. This underlies Gross et al's compressed tomography [25] and Aaronson's PAC learning approach [26]. Compressed tomography is targeted at low-rank states: if the data were in fact generated by a nearly pure state, then techniques from compressed sensing (see references cited in [25]) can be used to reconstruct that state from data that would otherwise be incomplete. There is a price to be paid, in that higher precision is required, and so compressed tomography appears to require the same total number of measurement 'clicks'.
Aaronson's unique approach, in [26], is based on Valiant's notion of PAC (probably approximately correct) learning. He shows that any fixed set of measurements on an N -qubit system can be PAC learned using only poly(N ) copies. This seems impossible on the face of it, since conventional state estimation requires O(2 N ) different measurement settings. What enables Aaronson's intriguing result is a crucial assumption about the estimator's goals: future measurements will be drawn from an ensemble that is known in advance. PAC learning requires accurately estimating the probabilities for most future measurements-but it is okay to be wildly wrong about a tiny (exponentially small in N ) fraction of them. PAC learning is possible because one of the following must hold: 1. There are at most poly(N ) measurements in the set. 2. Most of them are highly correlated with each other, so there are at most poly(N ) linearly independent measurements in the set. 3. For every possible ρ, most of the measurements in the set have nearly uniform probability distributions.
If the measurements to be learned are tomographically complete (e.g. a full set of mutually unbiased bases), then the third case holds. The estimator can just write downρ = 1l d -without making any measurements at all.
Compressed tomography and PAC learning are not alternatives to MLE, BME or linear inversion. They are not procedures for state estimation. They are new definitions of the problem! Both are potentially amenable to frequentist, Bayesian or ad hoc statistical methods. Adapting the Bayesian principle and methods discussed here to them is a challenge (for instance, compressed tomography explicitly seeks a low-rankρ, while BME explicitly avoids rankdeficient estimates), but a worthy one. For one thing, these approaches lie squarely in the regime of incomplete and undercomplete data-precisely where Bayesian methods excel. These estimates will have large and unpredictable error bars. Characterizing the estimator's uncertainty will be important. These problems demand a reasoned and careful approach to statistical inference-which is exactly what Bayesian inference can provide.