Optimal error regions for quantum state estimation

Rather than point estimators, states of a quantum system that represent one's best guess for the given data, we consider optimal regions of estimators. As the natural counterpart of the popular maximum-likelihood point estimator, we introduce the maximum-likelihood region---the region of largest likelihood among all regions of the same size. Here, the size of a region is its prior probability. Another concept is the smallest credible region---the smallest region with pre-chosen posterior probability. For both optimization problems, the optimal region has constant likelihood on its boundary. We discuss criteria for assigning prior probabilities to regions, and illustrate the concepts and methods with several examples.


Introduction
Quantum state estimation (see e.g. [1]) is central to many, if not all, tasks that process quantum information. The characterization of a source of quantum carriers, the verification of the properties of a quantum channel and the monitoring of a transmission line used for quantum key distribution-all three require reliable quantum state estimation, to name just the most familiar examples.
In the typical situation that we are considering, several independently and identically prepared quantum-information carriers are measured one-by-one by an apparatus that realizes a probability-operator measurement (POM), suitably designed to extract the wanted information. The POM has a number of outcomes, with detectors that register individual information carriers (photons in the majority of current experiments), and the data consist of the observed sequence of detection events ('clicks') 7 .
The quantum state to be estimated is described by a statistical operator, the state, and the data can be used to determine an estimator for the state-another state that, so one hopes, approximates the actual state well. There are various strategies for finding such an estimator. Thanks to the efficient methods that Hradil,Řeháček, and their collaborators developed for calculating maximum-likelihood estimators (MLEs, reviewed in [2]; see also [3]), MLEs have become the estimators of choice. For the given data, the MLE is the state for which the data are more likely than for any other state.
Whether one prefers the MLE or a point estimator found by another method, the data have statistical noise and, therefore, one needs to supplement the point estimator with error bars of some sort-error regions, more generally, for higher-dimensional problems. Many recipes, which are often ad hoc in nature, have been proposed for attaching a vicinity of states to an estimator. These usually rely on having a lot of data [4,5], involve data resampling [6] or consider all data that one might have observed [7,8]. By contrast, we systematically construct error regions from the data actually observed.
For this purpose, we propose maximum-likelihood regions (MLRs) and smallest credible regions (SCRs). These are regions in the space of quantum states (more precisely: in the reconstruction space; see section 2). The MLR is that region of pre-chosen size for which the given data are more likely than for any other region of the same size. The SCR is the smallest region with pre-chosen credibility-the credibility of a region being the probability of finding the actual state in the region, conditioned on the data (see e.g. [9]). Whether one chooses the MLR or the SCR as the optimal error region depends on the situation at hand.
Central to both concepts is the notion of the size of a region. In fact, some notion of size must underlie any useful definition of error regions, since one usually aims at reporting an error region that is not unnecessarily large-a judgement that can only be made with a suitable concept of size. We agree with Evans et al [10] that, in the context of state estimation, it is most natural to measure the size of a region by its prior-before any data are at hand-probability of finding the actual state in the region: regions with the same prior probability are considered as having the same size. Hence, the size of a region expresses the relative importance of that region of states.
The identification 'size ≡ prior probability' is technically possible because both quantities simply add when disjointed regions are combined into a single region. While for some tasks one prefers not to assign a prior 8 , since state estimation expresses our best attempt at guessing the state, any prior information we possess should be taken into account in the estimation process alongside the data. Much guidance on choosing priors can be found in the standard statistics literature; in appendix A, we provide a summary that focuses on points relevant in quantum contexts. Ultimately, the choice of prior is up to the user, but it should be consistent: the estimation results should be dominated by the data, not the prior, if many copies of the state are measured.
As we show below, the problems of finding the MLR and the SCR are duals of each other. In both cases, the optimal regions contain all states for which the likelihood of the data exceeds a threshold value. This provides a concise way of communicating one's uncertainty of the estimate. That the optimal error regions possess such a simple description is surprising, since our construction imposes no restriction on the shape of the regions to be considered. The shape of the optimal regions are uniquely determined by the likelihood function, in sharp contrast to the arbitrariness in the shape of a confidence region (see appendix B), a concept that is the subject of recent discussion [7,8]. Yet the two are not unrelated: our SCRs provide natural starting points for the construction of the confidence regions considered in [7]. While the chosen MLR or SCR depends on the prior, the set of candidate regions is priorindependent: it depends only on the likelihood function for the given data. Also reassuring is the fact that every MLR or SCR is a small vicinity of the MLE, in the respective limits of small size or small credibility. This is reminiscent of standard ellipsoidal error regions constructed around the MLE, which are, however, applicable only in the limit of a large amount of data when the central limit theorem can be invoked and the uncertainty can be characterized by the Fisher information (see, for instance, [4]).
Here is a brief outline of the paper. We set the stage in section 2 where we introduce the reconstruction space, discuss the size of a region, and define the various joint and conditional probabilities. Equipped with these tools, we then formulate in section 3 the optimization problems that identify the MLRs and SCRs and find their solutions in section 4. We illustrate the matter by simulated single-qubit and two-qubit experiments in sections 5 and 6, and close with an outlook. Two appendices supply additional material: a guide for choosing priors, and a comparison with confidence regions.

Reconstruction space
The outcomes k , k = 1, 2, . . . , K , of the POM, with which the data are acquired, are positive ( k 0) Hilbert-space operators with K k=1 k = 1. If the state ρ describes the system, the probability p k that the kth detector clicks is p k = tr{ k ρ} = k (Born rule). (1) The positivity of ρ and its normalization to unit trace ensure p k 0 for all k and K k=1 p k = 1. Probabilities p = ( p 1 , p 2 , . . . , p K ) for which there is a state ρ such that (1) holds are permissible probabilities. They make up the probability space. The probability space for a K -outcome POM is usually smaller than that of a tossed K -sided die because not all positive p k with unit sum are permissible. The quantum nature of the state estimation problem enters only in these additional restrictions on p: quantum state estimation is standard statistical state estimation with constraints of quantum-mechanical origin. The rich methods of statistical inference immediately apply, modified where necessary to account for the restricted probability space.
Whereas p is uniquely determined by ρ through (1), the converse is true only if the POM is informationally complete. In any case, there is always a reconstruction space R 0 , a set of ρ that contains exactly one ρ for each permissible p, consistent with the Born rule. If there is more than one reconstruction space, it does not matter which one we choose. As an example, consider a harmonic oscillator with its infinite-dimensional state space. If the POM has two outcomes, with p 1 equal to the probability of finding the oscillator in its ground state and p 2 = 1 − p 1 , the reconstruction space is the set of convex combinations of the projector to the ground state and another state with no ground-state component. In this situation, there are very many reconstruction spaces to choose from because any other state serves the purpose, and all one can infer from the data is an estimate of the ground-state probability.
Since the probability space is unique, while there can be many different reconstruction spaces, it is often more convenient to work in the probability space. In particular, the probability space has the desirable property that it is always convex; it is, however, not always possible to find a convex reconstruction space. The primary objective of state estimation is then to find an estimator, or a region of estimators, for the probabilities p. The conversion of p into a state ρ can be performed later, if at all. At this stage, if the POM is not informationally complete, one must invoke additional criteria-beyond what the data tells us-for a unique mapping p → ρ. For example, one could follow Jaynes's guidance [11,12] and maximize the entropy [13] (see also [14]). However, following the tradition in this topic, we will formally work in a reconstruction space R 0 , although all actual calculations are performed in the probability space. Estimators are states in R 0 , and regions are sets of states there.

Size and prior content of a region
The reconstruction space is an abstract construct that is often not endowed with a self-suggesting unique metric. Instead, a region's prior probability-the quantity that matters most in the present context of statistical inference-offers a natural notion of size. This relieves us of the need to invoke additional, possibly artificial, criteria for the assignment of size, for instance, one that has more to do with a simple parameterization of the state space than the relative importance of different regions in terms of our prior expectations.
We denote by (dρ) the size (≡ prior probability) of the infinitesimal vicinity of state ρ in R 0 . The size S R of a region R is then A pertinent remark: we exclude pathological cases of improper priors where the prior density (dρ) cannot be normalized; should an improper prior be useful in a particular context, it should come about as the limit of a well-defined sequence of proper priors. By construction, S R does not depend on the parameterization used for the numerical representation of (dρ). The primary parameterization is in terms of the probabilities, where the prior density w( p) is positive for all permissible probabilities and vanishes for nonpermissible probabilities. To enforce positivity and normalization of p, w( p) always contains as a factor, where η( ) denotes Heaviside's unit step function and δ( ) is Dirac's delta function.
If there are no other constraints, we have the probability space of a K -sided die. For genuine quantum measurements, however, there are additional constraints, some accounted for by more delta-function factors, others by step functions. The delta-function constraints reduce the dimension of the reconstruction space from K − 1 to the number of independent probabilities. Accordingly, there is a factor of constraint w cstr ( p) (containing w 0 ( p)) that specifies the probability space and appears in all possible priors. In particular, there are two specific priors we will employ as examples below: the primitive prior and the Jeffreys prior [15] which is a popular choice of an unprejudiced prior [16]. For the harmonic-oscillator example in section 2.1, which has the same probability space as a tossed coin, the factor w 0 ( p) selects the line segment with 0 p 1 = 1 − p 2 1 in the p 1 p 2 plane. If we choose the primitive prior (dρ) = (d p) w 0 ( p), the subsegment with a p 1 b has size b − a. For the Jeffreys prior the same subsegment has size 2 π [sin −1 ( . In this example, and also in those we use for illustration in section 5 below, it is easy to state quite explicitly the restrictions on the set of permissible probabilities that follow from the Born rule. In other situations, including the examples of section 6, this is more difficult. In yet more complicated situations it could be impossible. It is, however, possible to check numerically if a certainp = (p 1 , . . . ,p K ) is permissible. For example, one calculates a MLE (which can be done efficiently) for relative frequencies n k /N =p k (see below), and if the resulting probabilities p are such that p =p, thenp is permissible; otherwise it is not. For practical reasons, it may be necessary to truncate the full state space-which can be, and often is, infinite-dimensional-to a test space of manageable size. With such a truncation one accepts that not all permissible probabilities are investigated. Therefore, a criterion for judging if the test space is large enough is to verify that the estimated probabilities do not change significantly when the space is enlarged. Examples for the artifacts that result from test spaces that are too small can be found in [17].

Point likelihood, region likelihood and credibility
The data D consist of a sequence of detector clicks, with n k clicks in total of the kth detector after measuring N = n 1 + n 2 + · · · + n K copies of the state 9 . The probability of obtaining D, if ρ is the state, is the point likelihood L(D|ρ) = p n 1 1 p n 2 2 · · · p n K K , which attains its maximum value when ρ is the MLE ρ ml ∈ R 0 ; the MLE is fully determined by the relative frequencies n k /N . The joint probability of finding the state ρ in the region R and obtaining the data D is For R = R 0 , prob(D ∧ R 0 ) = L(D) is the prior likelihood. Since one of the click sequences is surely observed, we have where the sum is taken over all possible data for N clicks. We factor the joint probability in two ways and so identify the region likelihood L(D|R) and the credibility C R (D). These are conditional probabilities: L(D|R) is the probability of obtaining the data D if the actual state is in the region R; C R (D) is the probability that the actual state is in R if D were obtained-the posterior probability of R.

Optimal error regions
For the given data D, we desire a region with the largest likelihood-the MLR. For this purpose, we maximize the region likelihood L(D|R) under the constraint that only regions with a prechosen size s participate in the competition, with 0 < s < 1, an unconstrained maximization of L(D|R) is not meaningful as it gives the limiting region comprising nothing but the point ρ ml . The resulting MLR R ml depends on D and s. Since the size is fixed, we can equivalently maximize the joint probability prob(D ∧ R) under the size constraint.
The answer to this maximization problem is given in [10, corollary 4], which we translate into our present context as follows: The MLRs of various sizes s consist of all states ρ for which the point likelihood exceeds a threshold value, with higher (13) thresholds for smaller sizes. This is justified by a proof of considerable mathematical sophistication in [10]. We offer an alternative argument that is more accessible to the working physicist. Consider infinitesimal variations of a region R by deforming its boundary. The maximum property of the MLR and its Both hold only if L(D|ρ) is constant on the boundary ∂ R ml of R ml , for an R ml entirely contained inside R 0 (so that − → δ (ρ) can have any direction): ∂ R ml is an iso-likelihood surface (ILS). Concavity of the log-likelihood further requires R ml to be the interior of this ILS. ∂ R ml can also contain part of the boundary of R 0 (see figure 1(b)), in which case only the part of ∂ R ml inside R 0 is an ILS. In either case, R ml is a bounded-likelihood region (BLR), comprising all states in R 0 with point likelihood exceeding a certain threshold. BLRs have appeared previously in standard statistical analysis (see [18] and references therein).
We specify the threshold value as a fraction λ of the maximum value L(D| ρ ml ) of the point likelihood; see figure 2. The BLR R λ has the characteristic function and is the size of R λ . We have R λ = R 0 and s λ = s 0 = 1 for λ λ 0 with λ 0 0 given by min ρ L(D|ρ) = λ 0 L(D| ρ ml ). As λ increases from λ 0 to 1, s λ decreases monotonically from 1 to 0. The size s specified in (12) is obtained for an intermediate λ value and the corresponding BLR is the looked-for MLR. Note that the MLE is contained in all MLRs: as s → 0, the MLR becomes an infinitesimal vicinity of the MLE, and L(D| R ml ) → L(D| ρ ml ). The MLR is the region for which the observed data are particularly likely. With a reversal of emphasis, we now look for a region that contains the actual state with high probability. Ultimately, this is the SCR R sc -the smallest region with the pre-chosen credibility value c. For the given D, the optimization problem is dual to that of (12). Here, we minimize the size for given joint probability; there we maximize the joint probability for a given size. It follows that the BLRs are not only the MLRs but they are also the SCRs: each MLR is a SCR, each SCR is an MLR. The BLR R λ has credibility which, like s λ , decreases monotonically from 1 to 0 as λ increases from λ 0 to 1. The credibility c specified in (17) is obtained for an intermediate value, and the corresponding BLR is the looked-for SCR.
That the general definitions of the MLR and the SCR, which allow for regions of arbitrary shapes, permit such a simple characterization in terms of BLRs is remarkable. BLRs are reminiscent of standard ellipsoidal error regions constructed by analyzing the neighborhood of the peak of the likelihood function-a procedure justified only for large enough N for the central limit theorem to apply (see, for instance, [4]); yet, our result employs no such assumption. Also surprising is that, while λ depends on the choice of prior, the set of regions that enter the competition is independent of that choice; the prior enters only in the size, region likelihood, and credibility of the MLR/SCR.
Once the data are obtained, there is the MLR and the SCR for these data, and other MLRs or SCRs associated with unobserved data play no role. This is in sharp contrast to confidence regions, whose construction requires consideration of all data that could have been obtained, since the confidence level is a property of the entire set of confidence regions, one for each possible data (see appendix B). Nevertheless, they are not unrelated. Christandl and Renner [7] showed that high-credibility regions offer starting points for constructing confidence regions-a set of SCRs with high credibility immediately suggests itself-and Blume-Kohout [8] argued that BLRs can be good confidence regions.

Reporting error regions
For a BLR, s λ and c λ are linked by Therefore, once s λ is known as a function of λ, we obtain c λ from This is, of course, consistent with the limiting values for λ λ 0 and λ = 1, and also establishes that c λ > s λ for λ 0 < λ < 1. Furthermore, (19) and (20) tell us that in the λ → 1 limit, when both s λ and c λ vanish, their ratio is finite and exceeds unity, We note that this provides the value of L(D), since the maximal value L(D| ρ ml ) of the point likelihood is computed earlier as it is needed when identifying the BLRs. Inasmuch as the value of s λ quantifies our prior belief that the actual state is in R λ , we are surprised when the data tell us that the probability for finding the state in R λ is larger.  Accordingly, the SCR is the region for which we are most surprised for the given prior belief. This matter and other aspects of Bayesian inference based on the concept of relative surprise are discussed in [10]. Relation (20) has a simple geometrical meaning in terms of areas under the graph of s λ , as explained in figure 3. This relation is of considerable practical importance because we only need to evaluate the integrals of (16), but not those of (18). Since the latter integrals require well-tailored Monte Carlo methods to handle the typically sharply peaked point likelihood, the numerical effort is substantially reduced if we only need to evaluate (16). The error regions for the observed data are then concisely communicated by reporting s λ and c λ as functions of λ. With these, the end user interested in the MLR with the size s of his liking or the SCR of his wanted credibility c can determine the required value of λ. It is then easy to check if a state is inside the specified error region. The example of section 6 illustrates the matter for an eightdimensional reconstruction space, for which the error regions will be impossible to visualize, but can still be easily specified through reporting the s λ and c λ values.

Example: incomplete single-qubit tomography
For a first illustration, we consider the simplest situation that exhibits the typical features: the quantum-information carriers have a qubit degree of freedom, which is measured by one of two standard POMs that are not informationally complete.

Probability-operator measurements and priors
For both POMs, the unit disk in the x y plane suggests itself for the reconstruction space R 0 . The first POM is the crosshair measurement that combines projective measurements of σ x and σ y into a four-outcome POM (K = 4) with probabilities The permissible probabilities are identified by where The dotted equals sign in (23) stands for 'equal up to a multiplicative constant', namely the factor that ensures the unit size of the reconstruction space. The second POM is the three-outcome trine measurement (K = 3), whose outcomes are subnormalized projectors on the eigenstates of σ x and (−σ x ± √ 3 σ y )/2 with eigenvalue +1. It has the probabilities for which summarizes the constraints that the permissible values of p 1 , p 2 , p 3 obey. Both POMs have the same primitive prior where 0 r 1 and ϕ covers any convenient range of 2π . This prior is uniform in x and y, and in r 2 and ϕ. The polar-coordinate version, that is x + iy = r e iϕ , is the more natural parameterization of the unit disk. The Jeffreys prior for the four-outcome POM is For the three-outcome POM, we have the Jeffreys prior   and for the Jeffreys prior of (28). The actual state is inside two of the four SCRs with credibility c = 0.5 and is contained in all four SCRs with credibility c = 0.9. Not unexpectedly, we get quite different regions for the two rather different sets of detector click counts. Yet, we observe that the choice of prior has little effect on the SCRs, although the total number of measured copies is too small for relying on the consistency of the priors. The same remarks apply to the SCRs for the three-outcome POM in figure 4(b); here we counted (n 1 , n 2 , n 3 ) = (15, 8, 1) and (13,7,4) detector clicks in the simulated experiments.

Computer-generated data
In section 4 we remarked that the estimator regions are properly communicated by reporting s λ and c λ as functions of λ. This is accomplished by the insets in figure 4 for two of the four simulated experiments. The dots give the values obtained by numerical integration that uses a Monte Carlo algorithm. The scatter of these numerical values confirms the expected: the computation of s λ only requires sampling the probability space in accordance with the prior and determining the fraction of the sample that is in R λ ; for the computation of c λ we need to add the values of L(D|ρ) for the sample points inside R λ ; and since L(D|ρ) is a sharply peaked function of the probabilities, the s λ values are more trustworthy than the c λ values for the same computational effort. The line fitted to the s λ values is a Padé approximant (see e.g. [19, section 5.12]) that takes the analytic forms near λ = λ 0 = 0 and λ = 1 into account. The line approximating the c λ values is then computed in accordance with (20).

Example: incomplete two-qubit tomography
For a second illustration, we consider the situations that arise in the quantum-key-distribution schemes by Bennett and Brassard (BB84 [20]) and the trine-antitrine (TAT) scheme of reference [21]. Both schemes can be implemented by having a source of entangled qubit pairs distribute one qubit each to the two communicating parties. Prior to any key generation, the two-qubit state emitted by the source needs to be characterized. It is desirable to achieve quantum state estimation with reliable error regions without sacrificing many data that are then not available for the key generation.

Probability-operator measurements and computer-generated data
In the BB84 scheme, each qubit is measured by the crosshair POM of (22); the resulting twoqubit POM has sixteen outcomes that obey eight constraints that give delta-function factors in w cstr ( p). In the TAT scheme, one qubit is measured by the trine POM of (25) and the other qubit by the antitrine POM that has the signs of x and y reversed in (25); the resulting two-qubit POM has nine outcomes subject to the single delta-function constraint of unit sum. Accordingly, the probability space is eight-dimensional for both schemes 10 , and we cannot report the SCRs by showing the optimal error regions in the reconstruction space, as was possible for the twodimensional probability space in figure 4. Therefore, we employ the strategy of section 4 and report the size s λ and the credibility c λ of the respective BLRs as functions of λ.
For the generation of the simulated data, we first add noise to the singlet state by putting it through a random Pauli channel 11 . The resulting true state has the probabilities for the twoqubit POMs given in the top row of table 1. For example, the '12' entry in the 4 × 4 table for the double-crosshair POM is the probability for outcome 1 The '11' entry of the 3 × 3 table for the TAT POM is 16/9 times that number. Note that all marginal probabilities (sums of rows and sums of columns) are equal; this is so because the 10 In actual experiments, the probability space is nine-dimensional because one must account for the no-click probability of the qubit pairs that do not give rise to coincidence clicks. Furthermore, the state estimation could also exploit the data collected for single-qubit detection without the coincidental detection of the partner qubit. Consistent with the footnote in section 2.3, we are here content with the idealized situation of perfect detection devices because our objective is to give an example for a higher-dimensional space rather than evaluating real experimental data. 11 A random Pauli channel is used as a simple model for noise in a communication protocol. The channel acts on an input state as ρ → jk r jk σ j ⊗ σ k ρ σ j ⊗ σ k , where j, k = 0, x, y, z (σ 0 denotes the single-qubit identity operator), and the r jk are sixteen randomly chosen probabilities. The 60 copies of the true state come from passing 60 copies of the singlet state through one instance of the random Pauli channel, i.e., the r jk are randomly picked once, with r 00 given a higher weight of 0.7 to simulate weak noise. Table 1. Computer-generated data for the estimation of a two-qubit state from measuring 60 identically prepared copies. The first row gives the joint probabilities of the true state. The broken second row shows the number of detector-click pairs obtained in the simulated experiment (and their expected values) together with the single-qubit marginals. The third row reports the joint probabilities of the MLEs for the data in the second row. In each row, we have a 4 × 4 table on the left for the double-crosshair POM of the BB84 scenario and a 3 × 3 table on the right for the nine-outcome POM of the TAT scheme. The rows of a 4 × 4 table for the double-crosshair POM refer to the four j of the first qubit in the pair and the columns refer to the k of the second qubit; entry ' jk' is the probability for outcome j ⊗ k . Analogously, entry ' jk' in a 3 × 3 table for the TAT scheme is the expectation value of j ⊗ k with trine outcome j and antitrine outcome k . The third row of table 1 shows the corresponding MLE probabilities. These probabilities are equal to the relative frequencies of the counts for the 9-outcome POM but are different from the relative frequencies for the 16-outcome POM. This tells us that the computer-generated data are not typical for the double-crosshair POM, whereas we have typical data for the TAT POM.

Size and credibility of the bounded-likelihood regions
As noted in section 4, the primary task of the data evaluation is the computation of the multidimensional integrals that give the size s λ of the BLRs for the whole range of 0 < λ < 1. For the data in table 1, these are integrals over eight-dimensional regions. We used a random-sampling technique for this purpose.
As a preparation, we generated a random sample of 648 785 permissible sets of probabilities, uniformly distributed in accordance with the primitive prior (see section 6.3 below). In view of the one-to-one correspondence between the permissible probabilities of the 16-outcome POM and the 9-outcome POM, the same random sample can be, and was, used for both POMs.
The actual data processing consists of two steps. In the first step, we determine the size s λ for the 160 values of λ with −log 10 λ = 0.1(0.1)16.0. This requires a simple counting of how many samples are inside the BLR R λ if the primitive prior is used. In the case of the Jeffreys prior, one adds the weights ( p 1 p 2 · ·· ) −1/2 of the samples inside the BLR. The correct normalization follows from s λ=0 = 1.
In the second step, the integrals needed in (20) are evaluated, for which a simple linear interpolation between adjacent (λ, s λ ) pairs is sufficiently accurate. Then, c λ is known as a function of λ and the λ values for which we have 99% or 95% credibility are determined.
We show s λ and c λ as functions of λ in figure 5. Table 2 reports the λ values of the 99% and 95% credibility thresholds. We observe that for the 16-outcome POM, the true state is inside the SCRs with 99% credibility for both the primitive prior and the Jeffreys prior, whereas it is inside the 95% SCR only for the primitive prior but not for the Jeffreys prior. This is more evidence that these data are untypical. By contrast, for the 9-outcome POM, the true state is inside all SCRs for both priors and both values of the credibility.
Typicality, or lack thereof, can also be noticed in figure 5. Since the Jeffreys prior gives more weight to the regions near the boundary of the probability space than the primitive prior, and less weight to regions deep inside, one expects that the values of s λ for the primitive prior are larger than those for the Jeffreys prior if the data are typical and, accordingly, the MLE is not close to the boundary. This is indeed the case for the TAT data, but not for the double-crosshair data.   figure 5, and the sizes of the respective BLRs. The true state is inside the R λ s with λ < 3.368 × 10 −3 for the 16-outcome POM (with its untypical data), and inside the BLRs with λ < 0.2486 for the 9-outcome POM. 16

Numerical effort
The two steps of data evaluation, the computation of the size s λ and then the credibility c λ , take a few seconds of central processing unit (CPU) time. The preparation of the random sample of permissible probabilities, which could be done ahead of the data taking, lasts much longer. Our sample of 648 785 probabilities took almost 100 h of CPU time on a standard desktop (Intel i7-870 CPU, using one of the four cores and 8 GB RAM). The procedure we employed was simple and reliable but not optimized for speed. There is clearly much room for improvement (see section 7). For each potential sample of probabilities we first generate nine random numbers x 1 , x 2 , . . . , x 9 uniformly and independently between 0 and 1. Then, the nine probabilities p k = log(x k )/log(x 1 x 2 · · · x 9 ) constitute a sample in the eight-dimensional simplex of the classical nine-sided die, and the samples are distributed in accordance with the primitive prior. The sample p = ( p 1 , p 2 , . . . , p 9 ) is accepted if it is a permissible set of probabilities for the TAT POM with its nine outcomes. Only 9.27% of the 7 × 10 6 candidate probabilities generated were accepted.
Whereas the generation of another sample p is fast, the test of permissibility is the part that consumes most of the CPU time. After identifying the candidate p with the relative frequencies of a measurement with the 9-outcome POM, we calculate a MLE for these frequencies. If the probabilities of the MLE are equal to p, this sample probability is accepted, otherwise it is rejected.

Outlook
For the given data and chosen size or credibility, the MLR or the SCR is a neighborhood of the MLE. In this sense, one can regard them as systematically constructed error regions for the MLE. Although there are efficient methods for computing the MLE [2,3], equally efficient algorithms for finding the MLR and the SCR need to be developed. In particular, there remains the challenge of evaluating the multi-dimensional integrals that give s λ .
Random-sampling techniques are our methods of choice. In section 6, a simple strategy required 100 h of CPU time for creating a sample of permissible probabilities. However, this can be substantially reduced by optimizing the sampling process. Parallelizing the sampling over many different computers and, later, combining into a single dataset was suggested. The chance that a candidate probability is permissible (only 9.27% in the sampling of the example) can be much increased by employing cleverer Monte Carlo methods, where one makes use of information at the current physical point to stay within the physical state space. We have just begun to explore this and will report on our progress in due course. It is also worth noting that this computational time is an overhead that is incurred only once and the sampling can be done ahead of any actual data-taking in the laboratory.
Often, only a few parameters computed from the state are of interest. It is, therefore, possible to reduce the dimensionality of the problem by discarding nuisance parameters. A variant of the methodology described here can be used to determine small regions of high credibility in the few-parameter space of interest, without first determining SCRs in the reconstruction space. We will return to this matter on another occasion.

Acknowledgments
We have benefited greatly from our discussions with David Nott and thank him in particular for bringing [10,16] to our attention. The Centre for Quantum Technologies is a Research Centre of Excellence funded by the Ministry of Education and the National Research Foundation of Singapore. Note added in proof. Since the completion of this paper, we have improved the sampling algorithm and thereby reduced the CPU time consumed to less than 10% of what it was before. This speed-up results from a faster test of permissibility that is still easy to implement. Further improvement is likely.

Appendix A. Choosing the prior
The assignment of prior probabilities to regions in the reconstruction space should be done in an unprejudiced manner while taking into account all prior information that might be available. We cannot do justice to the rich literature on this subject and are content with noting that [16] reviews various approaches to constructing unprejudiced priors. Here, we discuss some criteria that are useful when choosing a prior, illustrating with examples familiar in quantum contexts.
A general remark: the chosen prior should give some weight to (almost) all states. It should not give extremely high weight to states in some part of the state space and extremely low weight to other states. This is to say, the prior should be consistent in the sense that the credibility of a region-its posterior content-is dominated by the data rather than by the prior if a reasonably large number N of copies is measured. In the examples of figure 4, N = 24 is close to being 'reasonably large', while N = 2 in figure B.1 is clearly not. Also, N = 60 in section 6 is not large enough to ensure data dominance because the λ values in table 2 for the primitive prior and for the Jeffreys prior are quite different and, hence, correspond to quite different BLRs.
Below, we describe a few criteria for choosing priors. We begin in section A.1 with the common choice of a uniform prior; section A.2 discusses priors motivated by the utility of the estimated state; section A.3 invokes symmetry arguments to restrict considerations to priors that possess some symmetry properties; section A.4 presents form-invariant prior constructions; section A.5 deals with the situation where one has a target state in mind; and section A.6 is about priors induced by marginalization of full-state-space priors according to what the data can tell us.

A.1. Uniformity
The time-honored strategy of choosing a uniform prior on R 0 in which all states are treated equally gets us into a circular argument. Our identification of the size of a region with its prior content amounts to assigning equal probabilities to regions of equal sizes, prior to acquiring any data. But that just means that we now have to declare how we measure the size of a region without prejudice, and we are again faced with the original question about a uniform prior.
In fact, there is no unique meaning of the uniformity of a prior. In the sense that each prior tells us how to quantify the size of a region, each prior is uniform with respect to its induced size measure. To illustrate, reconsider the harmonic-oscillator example of section 2.1. For the primitive prior of (5), the parameterization where we integrate over v in the last step and so observe that the primitive prior is uniform in u, that is, the size of the region u 1 < u < u 2 is proportional to u 2 − u 1 . Likewise, the parameterization for the Jeffreys prior of (6), which is uniform in α. Other priors can be treated analogously, each of them yielding a uniform prior in an appropriate single parameter. Visualization of the uniformity for qubit priors can be found in figure A.1. Plot (b) shows uniform tiling of the unit disk by tiles of equal size. Here size is measured by the primitive prior of (27), which is uniform in x and y, and in r 2 and ϕ (the latter is used for the plot). Plots (c1) and (c2) show uniform tilings of the unit disk for the Jeffreys prior for the four-outcome POM of (28), while plots (d1) and (d2) show those for the three-outcome POM of (29). The crosshair symmetry of the four-outcome POM and the trine symmetry of the three-outcome POM are manifest in their respective uniform tilings.
The parameterizations in (A.1) and (A.3), and the tilings of figure A.1 exhibit in which explicit sense the primitive prior and the Jeffreys prior are uniform. However, the priors are what they are, irrespective of how they are parameterized. They are explicitly uniform in a particular parameterization and implicitly uniform in all others. Uniformity, it follows, cannot serve as a principle that distinguishes one prior from another.
This ubiquity of uniform priors for a continuous set of infinitesimal probabilities is in marked contrast to situations in which prior probabilities are assigned to a finite number of discrete possibilities, such as the 38 pockets of a double-zero roulette wheel. Uniform Uniform tilings of the unit disk for four different priors. The disk is in the x y plane, with the x-axis horizontal, the y-axis vertical, and the disk center at x = y = 0. Tiling (a) depicts the marginal prior of (A.14); tiling (b) is for the primitive prior of (27); tilings (c1) and (c2) illustrate the Jeffreys prior of (28) with the blue dots (•) just outside the unit circle indicating the four directions onto which the POM outcomes project; and, tilings (d1) and (d2) are for the Jeffreys prior of (29), the blue dots marking the three directions of the trine projectors. In each tiling, we identify 96 regions of equal size by dividing the disk into eight 'tree rings' of equal size and twelve 'pie slices' of equal size. In the tilings (a), (b), (c1) and (d1), the boundaries of the pie slices are (red) rays and an arc of the unit circle. In the tilings (a), (b), (c2) and (d2), the tree rings have concentric circles as their boundaries.
probabilities of 1/38 suggest themselves, are meaningful, and clearly distinguished from other priors, all of which have a bias.
Uniformity in a particularly natural parameterization of the probability space might also be meaningful. This, however, invokes a notion of 'natural' that others may not share.

A.2. Utility
In many applications, estimating the state is not a purpose in itself, but only an intermediate step on the way to determining some particular property of the physical system. The objective is to find the value of a parameter that quantifies the utility of the state.
For example, one could be interested in the fidelity of the actual state with a target state, or in an entanglement measure of a two-partite state, or in another quantity that tells us how useful are the quantum-information carriers for their intended task. In a situation of this kind, one should, if possible, use a prior that is uniform in the utility parameter of interest. In contrast to the situation of the previous section, where requiring uniformity in R 0 may be ill-advised because uniformity is a parameterization-dependent notion, here we specify uniformity for the parameter we are interested in.
To illustrate, consider a single qubit. Suppose the utility parameter is the purity ξ(ρ) = tr{ρ 2 } of the state ρ. With the Bloch-ball representation of a qubit state, ρ = 1 2 (1 + · σ ), where = tr{σ ρ} = σ is the Bloch vector and σ is the vector of Pauli matrices, the purity is A prior uniform in purity induces a prior on the state space according to where we parameterize the Bloch ball by spherical coordinates ( , θ, φ). Here, d is the prior for the angular coordinates; the prior for the radial coordinate is fixed by our choice of uniformity in ξ . Irrespective of what we choose for d , the marginal prior for is uniform in ξ . If one can quantify the utility of an estimator by a cost function, an optimal prior can be selected by a minimax strategy. For each prior in the competition one determines the maximum of the cost function over the states in the reconstruction space, and then chooses the prior for which the maximum cost is minimal. In classical statistics, such minimax strategies are common (see, for instance, [22, chapter 5]); for an example in the context of quantum state estimation, see [23].

A.3. Symmetry
Symmetry considerations are often helpful in narrowing the search for the appropriate prior. For a particularly instructive example, see section 12.4.4 in Jaynes's posthumous book [24].
Returning to the uniform-in-purity prior of (A.6), one can invoke rotational symmetry in favor of the usual solid-angle element, d = sin θdθ dφ, as the choice of angular prior. The reasoning is as follows: the purity of a qubit state does not change under unitary transformations; unitarily equivalent states have the same purity. Now, regions that are turned into each other by a unitary transformation have identical radial content whereas the angular dependences are related by a rotation. Invariance under rotations, in turn, requires that the prior is proportional to the solid angle, hence the identification of d with the differential of the solid angle. Note that the resulting prior element (dρ) is different from the usual Euclidean volume element, 2 d sin θdθ dφ, which would be natural if the Bloch ball were an object in the physical threedimensional space, but it is not.
Symmetry arguments should be used carefully and not blindly. For a fairly tossed coin, the prior should not be affected if the probabilities for heads and tails are interchanged, w( p 1 , p 2 ) = w( p 2 , p 1 ). However, for the harmonic-oscillator example of section 2.1, which has the same reconstruction space as the coin, there is poor justification for requiring this symmetry because the two probabilities-of finding the oscillator in its ground state, or not-are not on equal footing.

A.4. Invariance
When one speaks of an invariant prior, one does not mean the invariance under a change of parameterization-all priors are invariant in this respect-but rather a form-invariant construction in terms of a quantity that, preferably, has an invariant significance. We consider two particular constructions that make use of the metric induced by the response of the selected function to infinitesimal changes of its variables.
The first construction begins with a quantity F( p) that is a function of all probabilities p = ( p 1 , . . . , p K ). We include the square root of the determinant of the dyadic second derivative in the prior density as a factor, where w cstr ( p) contains all the delta-function and step-function factors of constraint as well as the normalization factor that ensures the unit size of the reconstruction space. The prior defined by (A.7) is invariant in the sense that a change of parameterization, from p to α, say, does not affect its structure because the various Jacobian determinants take care of each other. Since w cstr ( p) enforces all constraints, the p k are independent variables when F( p) and G( p, ν) are differentiated in (A.7) and (A.9), respectively. For the second construction, we use a data-dependent function G( p, ν) of the probabilities p and the frequencies ν = (ν 1 , ν 2 , . . . , ν K ) with ν j = n j /N . Here, the square root of the determinant of the expected value of the dyadic square of the p-gradient of G is a factor in the prior density where f (ν) denotes the expected value of f (ν), We have, in particular, the generating function for the expected values of products of the ν k . The prior defined by (A.9) is form-invariant in the same sense, and for the same reason, as the prior of (A.7). Form-invariant priors constructed by one of the two methods described in the text. The ' √ det' column gives the p-dependent factors only and omits all p-independent constants. The first method of (A.7) proceeds from functions of the probabilities that have extremal values when all probabilities are equal or all vanish save one. The second method of (A.9) uses functions that quantify how similar are the probabilities and the frequencies. The 'hedged prior' is named in analogy to the 'hedged likelihood' [25].

Method
Primary function √ det 1st − k p k log p k 1 √ p 1 p 2 · · · p K (Shannon entropy) (Jeffreys prior) 1st √ det' factors constructed by one of these two methods. It is worth noting that the Jeffreys prior can be obtained from the entropy of the probabilities by the first method as well as from the relative entropy between the probabilities and the frequencies by the second method. The latter is a variant of Jeffreys's original derivation [15] in terms of the Fisher information.

A.5. Conjugation
Sometimes there are reasons to expect that the actual state is close to a certain target state with probabilities t = (t 1 , t 2 , . . . , t K ). This is the situation, for example, when a source is designed to emit the quantum-information carriers in a particular state. A conjugate prior could then be a natural choice. Such priors are called 'conjugate' in standard statistics literature because the (· · · ) β factor has the same structure as the point likelihood: a product of powers of the detection probabilities. The (· · · ) β factor is maximal for p = t and the peak is narrower when β is larger. The conjugate prior can be understood as the 'mock posterior' for the primitive prior that results from pretending that β copies have been measured in the past and data obtained that are most typical for the target state. Therefore, a conjugate prior is quite a natural way of expressing the expectation that the apparatus is functioning well. The posterior content of a region will be data-dominated only if N is much larger than β.
In this context, it may be worth noting that the Bayesian mean state, computed with the conjugate prior above, is usually not the target state unless β is large. One could construct priors for which ρ bm is the target state, but the presence of the w cstr ( p) factor requires a case-by-case construction.

A.6. Marginalization
All priors used as examples-the ones in (A.2), (A.4) and (A.12), and in table A.1-have in common that they are defined in terms of the probabilities and, therefore, they refer to the particular POM with which the data are collected. While this takes the significance of the data duly into account, it does not seem to square with the point of view that prior probabilities are solely a property of the physical processes that put the quantum-information carriers into the state that is then diagnosed by the POM. When adopting this viewpoint, one begins with a prior density defined on the entire state space. In addition to the parameters that specify the reconstruction space (essentially the probabilities p), this full-space prior will depend on parameters whose values are not determined by the data. There could be very many nuisance parameters of this kind. In the harmonicoscillator example of section 2.1, the data tell us only about the ground-state population but nothing about the population in any specific excited state. For a prior assigned on the formally infinite-dimensional state space, all but the ground-state population are nuisance parameters. Upon integrating the full-space prior over the nuisance parameters, one obtains a marginal prior on the reconstruction space. As a function on the reconstruction space, the marginal prior is naturally parameterized in terms of the probabilities and so fits into the formalism we are using throughout.
We note that the invoking of 'additional criteria' for a unique mapping from p to ρ, as mentioned at the end of section 2.1, is exactly what would be required if one wishes to report estimated values of the nuisance parameters. That, however, goes beyond making statements that are solidly supported by the data and is, therefore, outside the scope of our present discussion.
The symmetric uniform-in-purity prior of sections A.2 and A.3 provides an example for marginalization if the POM only gives information about x = σ x and y = σ y but not about z = σ z . We express the full-space prior in Cartesian coordinates, integrate over z, and arrive at This marginal prior is a function on the unit disk in the x y plane, which is the natural choice of reconstruction space here. When one expresses (dρ) in polar coordinates, one sees that (dρ) is uniform in ϕ and in r 2 cosh −1 (1/r ) − √ 1 − r 2 , which increases monotonically from −1 to 0 on the way from the center of the disk at r = 0 to the unit circle where r = 1. Plot (a) in figure A.1 illustrates the matter.

Appendix B. Confidence regions
The confidence regions that were recently studied by Christandl and Renner [7], and independently by Blume-Kohout [8], are markedly different from the MLRs and the SCRs. The MLR and the SCR represent inferences drawn about the unknown state ρ from the data D that have actually been observed. By contrast, confidence regions are a set of regions, one region for each data, whether observed or not, from the measurement of N copies. The confidence regions would contain any state in, at least, a certain fraction of many N -copy measurements, if many measurements were performed. This fraction is the confidence level.
When denoting by C D the confidence region for data D, the confidence level γ of the set C of C D s for all conceivable data (for fixed N ) is where the minimum is reached in the 'worst case'. For example, in the security analysis of a protocol for quantum key distribution, one wishes a large value of γ to protect against an adversary who controls the source and prepares the quantum-information carriers in the state that is best for her. Any set C, for which γ has the desired value, serves the purpose. A smaller set C , in the sense that C D is contained in C D for all D, is preferable, but usually there is no smallest set of confidence regions. Here, 'smaller' is solely in this inclusion sense, with no reference to a quantification of the size of a region and, therefore, there is no necessity for specifying the prior probability of any region. Since the transition from set C to the smaller set C requires the shrinking of some of the C D s without enlarging even a single one, it is easily possible to have two sets of confidence regions with the same confidence level and neither set smaller than the other.
For illustration, we consider the harmonic-oscillator example of section 2.1 yet another time. Figure B.1 shows two sets of confidence regions (γ = 0.8) and the corresponding three SCRs (c = 0.8) for the primitive prior and the Jeffreys prior. Both sets of confidence regions are optimal in the sense that one cannot shrink even one of the regions without decreasing the confidence level, but neither set is smaller than the other. In the absence of additional criteria that specify a preference, both work equally well as sets of confidence regions. This generic non-uniqueness of confidence regions, and the arbitrariness associated with it, are in marked contrast to the SCRs, which are always unique.
We also observe in this example that confidence regions tend to overlap a lot, which is indeed unavoidable if a large confidence level is desired; whereas, the SCRs for different data usually do not overlap unless the data are quite similar. In figure B.1, there is no overlap of the SCRs for (n 1 , n 2 ) = (0, 2) and (2, 0).
Another important difference of considerable concern in all practical applications is the following. Once the data are obtained, there is the MLR and the SCR for these data, and it plays no role what other MLRs or SCRs are associated with different data that have not been observed. To find a confidence region for the actual data, however, one must first specify the whole set C of confidence regions because the confidence level of (B.1) is a property of the whole set.