Quantum Tomographic Reconstruction with Error Bars: a Kalman Filter Approach

We present a novel quantum tomographic reconstruction method based on Bayesian inference via the Kalman filter update equations. The method not only yields the maximum likelihood/optimal Bayesian reconstruction, but also a covariance matrix expressing the measurement uncertainties in a complete way. From this covariance matrix the error bars on any derived quantity can be easily calculated. This is a first step towards the broader goal of devising an omnibus reconstruction method that could be adapted to any tomographic setup with little effort and that treats measurement uncertainties in a statistically well-founded way. In this first part we restrict ourselves to the important subclass of tomography based on measurements with discrete outcomes (as opposed to continuous ones), and we also ignore any measurement imperfections (dark counts, less than unit detector efficiency, etc.), which will be treated in a follow-up paper. We illustrate our general theory on real tomography experiments of quantum optical information processing elements.


I. INTRODUCTION
Since the first proposal of optical quantum tomography by Vogel and Risken [1], and the first practical tomographic experiments by Smithey et al [2], quantum tomography has gone a long way, and is now being used in a variety of physical setups, not restricted to optical systems, and many improvements have been made to the original reconstruction methods [3,4]. While this is certainly a desirable evolution, it must also be said that on the negative side this resulted in a proliferation of tomography methods. While it is unavoidable that every physical system has its own peculiarities, and each particular setup calls for its own tomographic measurement method, it is not conceivable that for every type of system and for every tomography method there should also be a different tomographic reconstruction method. Furthermore, when each time the reconstruction software is written from scratch, that does not benefit reliability. After 20 years of tomographical experience it is not unreasonable to ask for a unification of these methods, taking the best out of each and devising a small set of "omnibus" reconstruction methods, that only need some small adaptations to the particular tomographic setup.
An even more important remark concerning reconstruction methods is the fact that error bars are seldom given. Measurements are worthless without error bars. When tomography is just a measurement, even though a complicated one, why then treat tomography differently? Error bars are dearly needed here as well, since the whole purpose of tomography is to come up with a description of the quantum state that is sufficient to derive further properties, and for these properties error bars would certainly be needed. AsŘehaček, Mogilevtsev and Hradil [5] stated: "The quantification of relevant statistical errors is an indispensable but often neglected part of any tomographic scheme used for quantum diagnostic purposes." Some theoretical papers mention error bars, but they are calculated from simulated data, using Monte-Carlo methods and are only meant to give an indication about how good the method performs. As measurement errors depend on the actual system and its state, this is clearly not enough. What we are after is error bars produced straight from the experimental data and the underlying statistical model.
A widely used error criterion is the state fidelity (for quantum states) or the process fidelity (for quantum processes), which compares the reconstructed state to a predefined "desired state" (e.g. [6,7,8]). While it is easy to use, it clearly gives no indication about the reconstruction alone but is the sum of reconstruction and construction errors. As such, it cannot answer the following two questions separately: "Are we seeing the correct state?", and "Are we seeing the state correctly?" Furthermore, as pointed out in [5], fidelity is just a single number and one cannot expect it to describe the complete error structure of the reconstruction.
One could argue, however, that error bars are not explicitly needed if one just subsumes all measurement noise into the estimated quantum state via statistical mixing. We disagree with this point of view, and we claim that this throws away useful information. There is a difference between, on one hand, preparing a pure state and assuming it is mixed because the measurements do not allow to conclude otherwise, and, on the other hand, not being able to prepare a pure state, obtaining a mixed state instead, and knowing that state perfectly. In both cases the outcome is the same, a mixed state, but in the former case the real state is actually pure.
To make sense of the concept of error bars in the context of state (process) estimation, we have to clearly distinguish between the roles of the state preparer and the observer measuring the state, both of which involve noise. Tomography is based on the assumption that every time the observer measures the state, it is actually the same state. Because of measurement noise, the observer cannot obtain perfect knowledge of which state has been prepared in finite time. However, nothing prevents him from doing so in principle. Measuring for an infinite time, using a sufficient set of measurements, will yield perfect knowledge. If the same state is being prepared, it is ultimately knowable. On the other hand, the preparation of the state will also involve noise. Every time the preparer attempts to prepare the nominal state, noise will affect this and some slightly other state will result. This kind of noise is impossible to overcome, not even in principle. The only way one can deal with it is by absorbing the preparation noise into the quantum state that is being prepared, through statistical mixing. If the preparation noise is ergodic, the observer will recover the average state of the quantum state ensemble.
In short: the individual states are not knowable, but their average is; and if the number of measurements is finite, the observer will not obtain this average preparation state perfectly. Then error bars, or more precisely a density function over state space, are needed to express this lack of complete knowledge.
One of the first papers calculating error bars from the measurement data is Ref. [9], for a specific reconstruction method of optical homodyning tomography (OHT) using so-called pattern functions [10,11]. Using this method, the density matrix ρ can be derived directly from the detection probabilities pr(x, θ) sampled over phase space, where x and θ are parameters representing the settings of the OHT apparatus. To reconstruct ρ from the measurement data, these probabilities are replaced by the relative frequencies f (x, θ)/N . To obtain error bars, the fluctuations on the detection frequencies are modeled by a Poisson process, by which the variance σ 2 equals the mean pr divided by the number of runs N . The first drawback of this method is that only the marginal variances of the density matrix elements are treated here, disregarding correlations between errors, and therefore overestimating them. A second drawback of this approach is that, due to pr not being known, it is approximated by the relative frequency, giving σ 2 = pr/N ≈ f /N 2 , where f is the absolute detection frequency. This approximation actually underestimates the variance, especially for low-probability events. Indeed, for events with f = 0 this approach assigns zero probability to the event, with zero variance, which corresponds to an absolute certainty. That certainty is not warranted given that only a finite number N of experiments were done. This problem is known more generally as the "zero-eigenvalue problem" and occurs in different guises in many other reconstruction methods.
The present work is a first step towards overcoming the two deficiencies described above: we propose to use Bayesian inference as the unifying principle to calculate a probability density function over state space, from the measurement data, and from that density derive the first and second order moments: the mean state and the state covariance matrix. The goal of the present paper is to outline a practical method for calculating this mean state and state covariance directly from a set of tomographic data, in a completely general and statistically well-founded way.
During the course of this work, the paper [5] appeared, in which the same goals were aimed for and a method quite similar in spirit to ours was proposed. We refer the reader to Section VI for a discussion of the main differences between our method and the one in [5].
The input required by our method is a statistical model of the quantum tomographic setup. Given a state and the measurement settings, how do the statistical properties of the measurement outcomes depend on that? For a measurement with a finite number K of outcomes, a measurement setting corresponds to a choice of operator elements A (k) , one element for each outcome. When the outcome is a continuous variable (e.g. position x), the measurement operator is parameterised by the continuous outcome value x. In either case, the laws of quantum mechanics dictate that the outcome k (or x) occurs in an experiment with a probability (or probability density) given by Born's rule p k = Tr[ρA (k) ] (or p(x)dx = Tr[ρA(x)]dx). The measurements taken in a tomographic experiment relate to this probability density, in one way or another. For any given setting a number of runs N would typically be performed, and in case of a finite outcome experiment, the frequencies f k of the various outcomes would be recorded, or the values of the measurement in case of a continuous outcome. The relation between these frequencies and the underlying probability distribution is governed by the laws of statistics.
Alternatively, in some experiments the outcome could be the combined effect of many small measurement events. For example, in tomography of atomic/molecular clouds, the fluorescence response of the cloud to an impinging probe beam could be observed, in which case the experimental outcome is a fluorescence spectrum [12,13,14]. The recorded spectrum is a random variate with expected value proportional to the relevant probability density (a marginal of the quasi-probability density describing the state) and variance depending on the signal-to-noise ratio. Another example is an optical homodyning tomography (OHT) experiment where the probe beam intensity is so high that individual photons impinging on the detectors can no longer be resolved and the measurement results are light intensities, measured as voltages.
Once the statistical model of the tomographic setup has been supplied in suitable form, a general-purpose Bayesian inference engine could in principle take it from there, converting the measurement data into a posterior probability density over quantum state space. However, the actual calculations are typically too demanding to be at all practical. The second ingredient of our proposed solution is to make the calculations involved in Bayesian inference feasible by approximating the statistical tomographic model by a so-called linear-Gaussian model (explained below). For such models, the Bayesian inference simplifies to a set of simple equations known as the Kalman filter update equations.
Kalman filtering is a technique for dynamical state estimation that allows to estimate a dynamical state from a sequence of noisy data [15]. Kalman filtering has already been applied in the context of continuous quantum measurement and quantum control [16,17,18]. For tomographic reconstruction, ap-plying Kalman filtering seems to be a novel idea.
The goals we have set out for this work are quite challenging. Rather than trying to solve all the problems involved in one go, we focus here on a particular, but important, class of tomography experiments, namely those where the measurements have (few) discrete outcomes, as opposed to measurements of continuous variables. This class still covers a wide variety of quantum systems, including single-photon optical systems (e.g. optical quantum computing) [6,19,20,21], spin systems based on ions [7], atoms [22], or electrons (spin echo tomography) [23], superconducting [8] and solid state qubits [24], and tomography of atomic and molecular states based on fluorescence spectra [12,13,14]. For the purposes of exposition, we will restrict our attention here and illustrate our reconstruction method for optical systems. Important examples of tomography experiments not falling in this class are optical homodyning and heterodyning [1,2,11], where the outcome is the continuous variable x. We leave this for future work.
The second restriction we have imposed here is that we assume that the apparatus performing the tomography is ideal. In reality any measurement exhibits imperfections; e.g. photon detectors have less than unit efficiency and may exhibit dark counts and input states for process tomography may be imperfectly prepared. Although these imperfections pose no deep fundamental problems, they would obscure the presentation which is why we have chosen to treat them in a follow-up paper [25]. In the present paper we will to cover the main principles of our proposal and illustrate them using a simple real application (based on actual data) as proof of principle.
A third (minor) restriction is that the number of experimental runs per measurement setting is not too small, so that the distributions governing the measurement outcomes can be approximated by normal distributions without too great an error.
It goes without saying that our reconstruction method is suitable both for state tomography and process tomography, because state tomography lies at the basis of process tomography. Either one sends various input states through the quantum process and measures the output states (as in Ref. [6]), or one half of a maximally entangled state is sent through the process and the global output state (including the other half) is measured (as in Ref. [20]). Both methods are formally equivalent with state tomography of the state representative of the process, under the Choi-Jamiolkowski isomorphism. The presentation of our method will therefore be mainly state based, for definiteness and simplicity.
As we bring together a number of concepts from statistics, engineering mathematics and quantum mechanics, we begin our presentation with a rather lengthy section (Sec. II) on background material, with an extensive list of notations, and brief overviews of Bayesian inference and Kalman filtering. In Sec. III we present the basic theory of our proposal, and show how the problem of tomographic reconstruction can be made to fit the mould of Kalman filtering, thereby proposing solutions to several problems that we encounter along the way. The theory is then illustrated on two real tomography applications in Secs. IV and V, based on actual experimental tomography data. Finally, in Sec. VI we highlight the main benefits of our method, its performance and the costs associated with it, and compare it to existing methods, in particular to its sister-method proposed in [5].

A. Notations
Let us start with introducing the main notations and typographical conventions that we will use in the paper, along with some fundamental notions from applied probability theory. We denote vectors and matrices by symbols in boldface, F , f , to distinguish them from scalar quantities which we denote in roman, including vector and matrix components, f i . Exceptions to this convention are quantum-mechanical quantities like states ρ, POVM elements Π and maps Φ. The vector whose entries are all 1 is denoted by 1 := (1, . . . , 1), and the identity matrix by 1 1.
We adopt the usual convention from the statistics literature to denote random variables with capital letters, F , and the values they can take with lowercase, f . For example, the probability density function (PDF) of a random variable F is denoted f F (f ). The first f is the general symbol for a PDF, the second F is the random variable, and the third f is the argument of the PDF and represents the values the random variable F can take. The mean and variance of a scalar random variable X are denoted by µ(X), σ 2 (X), and the mean and the covariance matrix of a d-dimensional random variable X by µ(X) and Σ(X).
In this paper, a number of distributions are prominent. Here we recall definitions and notations about the multinomial and normal distributions. Other distributions (chi-squared, Dirichlet and beta) will be described in subsequent sections.
When a random d-dimensional variable F is distributed according to a multinomial distribution with parameters N (where d k=1 f k = N ) and p this is denoted by F ∼ Mtn(N ; p). The PDF of this distribution is given by ( Here we denote the multinomial coefficient by The mean of the multinomial distribution is given by and the entries of its covariance matrix are When a random variable X is distributed according to a univariate normal distribution with mean µ and variance σ 2 we write this as X ∼ N (µ, σ 2 ). Similarly, for a multivariate normal distribution with mean µ and covariance matrix Σ we write X ∼ N (µ, Σ).
We will reserve the symbol φ for the PDF of a normal distribution, while using f for general PDFs. The PDF of the univariate normal distribution will thus be denoted by Similarly, we will denote the PDF of an n-dimensional multivariate normal distribution by The quadratic form appearing in the exponent is a proper distance measure between the vectors x and µ and is called the squared Mahalanobis distance, which we denote by M 2 : One of the main statistical tools used in this paper is the approximation of distributions by normal distributions, using the technique of moment matching, whereby a distribution is approximated by a normal distribution with the same first and second order moments as the original. While conceptually simple and easy to use in practice, this method is also statistically well-founded because it gives the approximation that minimises the Kullback-Leibler distance D KL (p||q) := dx p(x) log(p(x)/q(x)) between a given PDF p and the approximating normal PDF q.
On the matrix analysis side, we will follow mathematical convention (and not the physical one) of denoting Hermitian conjugates with an asterisk A * instead of a dagger, and reserve the dagger for the Moore-Penrose (MP) inverse: A † . Complex conjugation is denoted by an overline: A.
We will have the occasion to apply the following formula for the matrix inverse of a rank-k correction to a non-singular matrix: Here, A is n×n and non-singular, C is k×k and non-singular, and U and V are general n × k matrices. This formula is alternately known as the Matrix inversion lemma, or the Woodbury matrix identity [26]. While the main topic of this paper is the tomographic reconstruction of quantum states, maps and POVMs, objects that are typically represented by matrices (density matrices, Choi matrices, POVM elements), the reconstruction technique we use is based on vector representations of the state of the system. Therefore, more often than not, we will need to convert the usual matrix representation of the quantum objects to vector representation. For quantum states that means we will be employing a Hilbert space representation. The space M d (C) of d × d matrices will be considered as a Hilbert space H d equipped with the Hilbert-Schmidt inner product A, B = Tr AB * . To distinguish between the two representations, we write ρ for a density matrix, and ρ for its Hilbert space representative.
While many different bases could be used for H, by far the easiest one for the purposes of this paper is the basis of matrix units {e ij } d i,j=1 ; in quantum physical notation e ij = |i j|. Converting a density matrix ρ to its Hilbert space representative ρ amounts to the so-called Vec operation, which works by just stacking the columns of the density matrix into a single d 2 -dimensional column vector: ρ = Vec(ρ) := i,j ρ ij |i |j . The reverse operation is the Mat operation, which reshapes a d 2 -dimensional vector into a d × d matrix.
In the same vein, the Hilbert representation of a linear map L (be it completely positive (CP) or not) acting on d × d density matrices, expressed in the basis of matrix units, is a d 2 × d 2 matrix whose columns are the Hilbert space representations of the matrices L(e ij ). More specifically, if the map is a CP map and has the Kraus representation ρ → L(ρ) = K k=1 A (k) ρA (k), * , then the Hilbert space representation of L is the matrix L given by

B. The Dirichlet Distribution
The Dirichlet distribution is the higher-dimensional generalisation of the beta distribution. The importance of this distribution stems from the fact that it is the conjugate distribution of the multinomial distribution. That is, if F ∼ Mtn(N, p) is the distribution of F conditional on P = p, then Bayesian inversion yields that P conditional on F = f is Dirichlet with parameter f . Formally, the two distributions only differ by their normalisation. The multinomial is normalised by summing over all integer non-negative f summing up to N , while the Dirichlet is normalised by integrating over the simplex of non-negative p summing to 1.
The general form of the PDF of a d-dimensional Dirichlet distribution with parameters α i is ( [27], Chapter 49) where α 0 is defined as The range of P is the simplex p i ≥ 0, p i = 1. In our case α i − 1 = f i , and as i f i = N , the PDF is The mean values of the Dirichlet distribution are and the elements of its covariance matrix are For further reference, we note a few properties of the covariance matrix Σ of the Dirichlet distribution. First of all, Σ is singular; it has a zero eigenvalue pertaining to the eigenvector 1 := (1, . . . , 1). As the inverse of Σ does not exist we will need its Moore-Penrose inverse Σ † . Because Σ can be written as a diagonal matrix minus a symmetric rank-1 matrix, its MP-inverse can be calculated analytically. For non-singular differences of a matrix D and a rank 1 matrix xx * , the matrix inversion lemma (4) provides the formula Even for invertible D the difference can still be singular when x * D −1 x = 1. In that case the MP-inverse is given by Here, G is an orthogonal projector on the support of D−xx * . This gives in our case Thus G is the projector on the subspace of vectors x for which In other words, on the subspace of differences of probability vectors, Σ † reduces to the diagonal matrix

C. Bayesian Inference
Our reconstruction procedure essentially amounts to performing Bayesian inference, in conjunction with an approximation scheme for the statistical properties of the measurement process. More precisely, the measurement process is approximated by a so-called linear-Gaussian model. Within the confines of this model, the actual calculations for performing the Bayesian inference turn out to be very simple and are given by the update equations of a Kalman filter; this will be explained below. In this section we briefly describe the elements of Bayesian inference. For an in-depth treatment we refer to the excellent introductory work [28].
We describe the state of the system under investigation by a vector and denote it by x. For the time being, we ignore the fact that the system is a quantum system. As our knowledge of x is obtained from (quantum) measurements and is statistical in nature, we describe it by a random variable, X. Since in our setup measurements are independent (each measurement operates on a private copy of the quantum state under investigation), the reconstruction procedure can be decomposed as an iterative scheme in which each measurement is incorporated sequentially. We assume that in each iteration any prior knowledge about X, as well as any knowledge obtained through the outcomes of the first m − 1 measurement settings has been incorporated into the probability density function f X . In Bayesian terminology the PDF of X is called the prior PDF. The goal of the inference procedure is to come up with a posterior PDF that incorporates the knowledge obtained by the measurement outcomes in the m-th measurement setting. We use the random variable X ′ to describe the updated knowledge; its PDF f X ′ is called the posterior PDF.
We denote the "knowledge obtained through a measurement" by a vector z describing the measurement outcome. This vector z is a sample of the corresponding random variable Z. It can give us additional information about the system via the statistical relation between Z and X, which we express as the conditional PDF f Z|X . This conditional PDF is the statistical description of the measurement model linking X to Z; it will be specified further in section III B. The posterior f X ′ is then nothing but the conditional PDF f X|Z .
At the heart of any Bayesian inference procedure lies Bayes' rule, which expresses the relation between f X|Z and f Z|X : While f Z is defined as the marginal of f X,Z = f Z|X f X , it is much more convenient to interpret f Z as a normalisation constant, ensuring that f X|Z integrates to 1 over the probability space of X. Second, as the main random variable in Bayes' rule is X, while z plays the role as parameter and is given by the observation, we have to consider f Z|X as a function of x too. As a function of its second argument, f Z|X is no longer a conditional probability density but a likelihood function. We will denote this function by L X|Z . Because of the explicit normalisation in Bayes' rule, L is defined up to proportionality. Note the reversal of the arguments, which is customary in the Bayesian literature and reflects the change of focus from Z, being the measurement outcome causally related to the underlying state X, to X, our guess of what the state should be, given the measurement outcome Z = z as evidence: We will henceforth rewrite Bayes' rule as The Bayesian inference step, incorporating z as new knowledge, is then expressed as For a sequence of measurement settings (Z i ) n i=1 , this step has to be iterated n times, leading through a sequence of n + 1 PDFs that describe the state X in a way that is consistent with the additional knowledge obtained through the measurements. If we describe the prior information by the PDF of X 0 , the ith measurement by the likelihood function with parameter z i , and the updated information after i iterations by the PDF of X i , we get As one can see, this is merely a product of the n likelihood functions and the prior PDF, and can therefore be calculated in any order. This will turn out to be important later on. Despite the apparent simplicity of the Bayesian update formula, actual calculations based on it are in general very complicated because the variable X appears both as main variable of a continuous PDF (the prior) and as continuous parameter of the (discrete) likelihood functions. The resulting product is in general an extremely complicated continuous function of x.
In the context of reconstruction of tomographic data, for example, the likelihood functions are polynomials of very high degree.
The calculations do become feasible in the specific case of so-called Linear-Gaussian Models. In such models the dynamics of the system can be described by a time-discrete Markov chain with a linear evolution operator perturbed by zero-mean Gaussian noise. Similarly, the measurement also depends linearly on the system state and any perturbation must also be zero-mean Gaussian noise. In other words, all variables (X 0 , and all Z i ) are normally distributed, and the parameter x enters only in the value of the mean of f Zi|Xi−1 ; moreover, it does so in a linear way only. A typical example is a classical measurement whose output depends linearly on the state and is perturbed by additive white Gaussian noise.
For these models, and for more complicated dynamical models where the state X varies over time, the Bayesian update formula reduces to a set of simple equations called the Kalman filter equations. A Kalman filter is the optimal state estimator when the system and the measurement can be modelled by linear-Gaussian models. The Kalman filter equations consist of two sets of equation; one set (the predictor equations) predicts the evolution of the Markov chain, while the other set (the update equations) updates the state of the system based on the measurements taken. For the purposes of the present paper only the update equations will be used because the basic assumption of tomography is that the system is static. A very good account of Kalman filtering is given in Ref. [15].
The reason for the feasibility of the linear-Gaussian model is that all distributions occurring in the calculations are normal, including the intermediate products of the factors of (11). This will be explained in the next section, where we describe the Kalman update equations in detail.

D. Kalman Filtering for Static Systems
Let us return again to the Bayesian update formula (10) A linear-Gaussian model can be represented pictorially as follows: For the purposes of this paper it turns out to be beneficial to split the model into two parts: a linear mapping, represented by a matrix H, followed by the noise process, which consists simply of adding zero-mean Gaussian white noise with given covariance matrix Θ. A further model assumption is that the prior PDF f X is Gaussian, so that the posterior PDF will be Gaussian, too.
Suppose now that an observation of Z is made, giving the value z. Bayesian inference of the noise process then yields that the distribution of Y conditional on this observation will be N (z, Θ). In other words, the moments of Y are given by We are now left with performing Bayesian inference on the linear part, with the variables distributed as Here, µ and Σ are the already known first and second order moments of X (mean and covariance matrix) and µ ′ and Σ ′ are the unknown first and second order moments of X ′ . Taking into account the explicit formula (2) for the PDF of a Gaussian distribution, the logarithm of the likelihood function is given by plus some constant. We get similar expressions for the logarithm of f X ′ and the logarithm of f X . Combining everything, dropping the factors −1/2, and using the relation y = H x, the Bayesian update formula (10) for the linear mapping becomes: Here, all additive constants have been absorbed in the constant c. This constant is actually irrelevant because it is ultimately absorbed in the Bayesian normalisation factor C of (10). Both sides of the equation are therefore degree-2 polynomials in x, which confirms our earlier statement that all distributions occurring in the calculations for linear-Gaussian models are Gaussian. Eliminating x from this equation gives us the two update equations we need, one for the mean, and one for the covariance.
Remark. Although the probability space of X is a real vector space, the vector entries of x need not be real. This will give us more freedom in choosing a basis when X is a Hilbert space representation of density matrices; to get real vector entries one is forced to choose Pauli matrices (or generalisations thereof) as basis vectors. This is the reason why we have applied the Hermitian conjugate * rather than the transpose.
Combining the terms that are quadratic in x yields the equation Inverting gives the solution for Σ ′ [using the matrix inversion lemma, Eq. (4)] The factor ΣH * (HΣH * + Σ(Y )) −1 is customarily called the Kalman gain factor and is denoted by K.
Combining the terms linear in x yields the equation with the solution for µ ′ : In the last line we have used equation (18) and the easily ver- The three relevant equations are commonly known as the Kalman update equations, and they form the backbone for the method of this paper. Combined with the equations (12) for the moments of Y they read It is instructive to consider the special case where the measurement mean y is exactly the state x, i.e. H = 1 1. This gives the simplified equations After some algebra one finds that Σ ′ is given by the parallel sum of Σ and Θ: In other words, in this special case, the inverses of the covariance matrices add up.

Remarks.
1. In the expression for the Kalman gain we use a matrix inversion, and that requires invertibility of the argument. In our setting (quantum mechanics), it turns out that the argument is actually never invertible. The reason for that is a certain set of exact constraints that the state has to obey. For example, when the state is a quantum state, its trace must be equal to 1. In addition, the measurement vector also has to satisfy certain exact constraints. For example, in an N -run experiment, the number of clicks must add up to N . This eventually has an impact on Σ, H and Θ, causing HΣH * + Θ to be non-invertible. How to deal with this will be described later on, in Section III G.
2. From an analytic viewpoint it is not readily clear that the expression (20c) for Σ ′ always yields a positive semi-definite matrix. More importantly, the expression is not the best one from a numerical viewpoint; since a subtraction is made, numerical roundoff may produce a non-positive semi-definite matrix. This is more likely to happen for precise measurements, yielding small variances. For that reason, the alternative formula (17) involving addition rather than subtraction is preferable.
In that form it is also obvious that the obtained Σ ′ is always positive semi-definite.

III. KALMAN FILTER RECONSTRUCTION OF QUANTUM TOMOGRAPHIC DATA
In this Section we present the basic theory of our reconstruction method, based on the concepts of Bayesian inference and Kalman filtering, which were described in the previous Section. This Section contains the bulk of the material, and includes the mathematical underpinnings of our method. To assist the readers who are more interested in the method itself and how to apply it, we have included Sec. III A, containing a self-contained explanation of the method. Those readers are advised to read that Section only and then fast-forward to the applications and discussions Sections IV, V and VI.
We start in Sec. III B with a characterisation of the likelihood function for quantum measurements (characterised themselves by a POVM {Π (k) }) and obtain a normal approximation that allows to approximately fit a quantum measurement in the mould of a linear-Gaussian model. We find that incorporating the information obtained from the quantum measurements into the likelihood function amounts to applying the Kalman filter update equations (20) where the entries of the measurement mean z are given by formula (6), those of the measurement covariance Θ by (7), and the measurement matrix H by the matrix representing the linear mapping ρ → p = (Tr[ρΠ (k) ]) K k=1 . Apart from calculating the likelihood, Bayesian inference requires a choice of a prior, and this is partially treated in Sec. III C. In a full treatment one would have to incorporate the structure of the physical set into the prior, i.e. the prior should be zero outside of the set of physical states. This, however, causes the prior to be non-Gaussian and it therefore does not fit the requirements of Kalman filtering. For that reason, the restriction to the physical set will be done in a postprocessing step, as described in Sections III E and III F, and instead a Gaussian dummy prior is chosen. Directly after the Kalman updates, a simple correction step removes the effects of this dummy prior again [Sec. III C, Eqs. (27) and (28)]. At this point, the eigenvalues of the corrected Σ, before restricting to the physical set, already give useful information as they are variances of certain linear combinations of state components and give an indication about how many more measurements are needed to reduce the measurement error. For example, to bound the maximal error on any state component from above, the largest eigenvalue of the covariance matrix should be bounded.
Once the posterior PDF, restricted to the physical set is calculated, via its first and second order moments, one can calculate the confidence region for the reconstruction, i.e. the set within which the actual state should lie with high probability (say, 95%). The basic quantity expressing this confidence region is the Mahalanobis distance. This is explained in Sec. III D.
Next, the problem of restricting the posterior PDF to the physical set is treated. This is actually a very difficult numerical problem, especially in high dimensions, and turns out to be a challenge even for current state-of-the-art Bayesian integration methods. In Sections III E and III F we give two simple algorithms that perform the task reasonably well, if one is willing to give up exactness of the solution.
For most quantum tomography problems, the physical set is partially defined by exact constraints. For example, quantum states have trace equal to 1. In Sec. III G we show how such exact constraints are best dealt with, in order to avoid numerical problems. The Kalman update equations have to be slightly modified.
Finally, Sec. III H deals with the issue of graphically representing the calculated results in a meaningful way, based on mean values and error bars. The first moment of the posterior PDF roughly corresponds to the maximum likelihood solution, and as frequently happens with this kind of solutions, exhibits reconstruction artifacts, which are not features of the actual state, but are not ruled out by the measurements either. A number of methods have been developed to remove these artifacts, all based on maximisation of entropy or related functionals, and we discuss these in our context.

A. Overview
In this Section we present a self-contained overview of our method for the purpose of implementation. Mathematical details, as well as the underlying rationale are explained in subsequent sections, which could be skipped on first reading. For the sake of clarity we restrict ourselves to the setting of state reconstruction.
The first step in implementing the method is to gather all relevant information about the tomography process and cast it in an appropriate mathematical form. The following are needed: • dimension D of the state (or of its representation); • the various sets of POVM elements {Π (j) } j , one set per measurement setting; • the number of runs N per measurement setting, if applicable; in continuous wave (CW) experiments, there is no such N ; • the measurement data: the frequencies f j (the "number of clicks") of each of the outcomes; • any exact linear constraints on the state; by default, Tr ρ = 1 is the only such constraint; • any exact linear constraints on the measurement outcomes; for example, j f j = N ; • a statistical model of the measurement process in the form of a PDF of the frequencies f j in terms of the POVM elements, N , or any other parameter; typically, this PDF is multinomial or Poissonian; • a reference state ρ 0 satisfying the linear constraints on the state; typically, this would be the maximally mixed state, The reconstruction algorithm consists of a number of phases, and we describe each in the following. It is important to note that we represent states by vectors. The most convenient basis is the standard basis, in which ρ is represented by the D 2 -dimensional vector Vec(ρ).

Setup phase
The exact linear constraints are enforced through the use of two projectors T X and T Z . These have to be calculated first. The projector T X is a projector on a subspace S X in state space (that is, the vector space representation thereof), while T Z is a projector on a subspace S Z in measurement space (the space of measurement vectors z). In principle, the latter could differ per measurement setting, but we will assume here that it does not. The interpretation of the subspace S X is that a state ρ satisfies the linear state constraints if and only if Vec(ρ − ρ 0 ) is in the subspace S X . For example, if the only linear constraint is Tr ρ = 1, then S X is the space of vector representations of Hermitian matrices of trace zero; that is, the space of D 2 -dimensional vectors orthogonal to x 0 = Vec(ρ 0 ) = Vec(1 1 D /D) = |1 1 D /D. We will henceforth represent a state ρ by the vector Vec(ρ − ρ 0 ) in S X . The reference state ρ 0 is thus represented by the null vector.
The interpretation of S Z is similar. If, for example, the only linear constraint on the measurements is j f j = N , resulting in the constraint j z j = 1, then S Z is the space of real vectors (with dimension equal to the number of measurement outcomes) whose components sum up to zero.
The corresponding projectors can be calculated by constructing orthonormal bases (onb) supporting each of these subspaces. If {x i } is an onb for S X , we construct the matrix X 1 whose columns are the vectors x i . This matrix is a so-called partial isometry. The projector T X is then given as where the columns of Z 1 form an onb for the subspace S Z . In the actual algorithm we don't need the projectors but only these partial isometries X 1 and Z 1 .
Conversely, if the projectors are given, we can calculate X 1 and Z 1 from them via the singular value decomposition (SVD). Let T X = U SU * be the singular value decomposition (SVD) of T X , where U is unitary and S is diagonal, the diagonal elements of which are the singular values of T X . For a projector T X of rank k, the first k singular values are 1, while the others are zero. Then X 1 consists of the first k columns of U . The partial isometry Z 1 is calculated in the same way from the SVD of T Z .
For the example where the linear state constraint is Tr ρ = 1, the projector T X is easily constructed as Using an SVD, this gives X 1 . Similarly, when the measurement constraint is d j=1 z j = 1, with d the number of outcomes, T Z = 1 1 d − |1 1|/d, and Z 1 is also calculated using an SVD.
The exact constraints can be dealt with by expressing the relevant quantities in terms of the bases X 1 and Z 1 . A tilde will be used to indicate this. For example, a state ρ satisfying the constraint can be represented by the tilde vector

Initial mean value and covariance:
Let k be the rank of T X ; thus k is D 2 minus the number of independent constraints on ρ. The initial mean value is set to the k-dimensional null vectorμ 0 = 0, representing the reference state ρ 0 . The initial covariance matrix isΣ 0 = b1 1 k , where b is a scalar with a "large" chosen value.

Kalman Filter phase
The following is applied iteratively, for each measurement setting: The measurement matrix H represents the POVM elements of the current measurement setting. The j-th column of H is given by Vec(Π (j) ). From H we calculateH = Z * 1 H X 1 , to incorporate the exact state and measurement constraints.
The reference measurement vector is then z 0 = H x 0 . The measurement vectors will be represented by the tilde vectors z = Z * 1 (z − z 0 ). The original measurement vector z is derived from the actual measurements f j , along with the measurement covariance matrix Θ, as follows.
For a measurement with multinomial statistics (as for N runs of single-photon tomographic measurements) z is given by z j = (f j +1)/(N +d) (formula (6)), and Θ by the formula (7). The tilde quantities then follow usingz = Z * 1 (z − z 0 ) andΘ = Z * 1 ΘZ 1 . For measurements with Poissonian statistics, in the case that the POVM elements add up to either the identity matrix or a scalar multiple thereof, the same formulas hold, but with N given by N = j f j . When the POVM elements do not add up to a scalar matrix (a situation that better be avoided), the modified formulas (23) and (24) have to be used for z and Θ.
We now apply the Kalman filter update equations (45), written entirely in terms of tilde quantities: For the first iteration, we setμ =μ 0 andΣ =Σ 0 . The primed quantities are the updated values and have to be used asμ andΣ in the next iteration.
After the final iteration, the "untilded" quantities can be calculated, if one so wishes, using the formulas

First Interpretation
The Kalman filter reconstruction procedure yields a Gaussian distribution over the linear space containing the physical state space as a subset. In most cases, this distribution will have a non-negligible probability outside this physical set. In fact, more often than not the mean of the distribution will be non-physical. At this point, the reconstructed mean and covariance only summarise what the measurements tell us, regardless of the physical significance of the outcome. In the next phase, the physicality constraint has to be combined with the reconstruction, as a kind of prior knowledge.
However, the unphysical covariance matrix already gives us interesting diagnostic information about the tomography itself, namely about the inherent accuracy of the measurements. To that purpose one can investigate the eigenvalues of the covariance matrix Σ. If some (or all) of these are large, that means the tomography is not able to give accurate information about the state in the direction of the corresponding eigenvectors. Application 2 (Section V) gives a particularly nice example of this.

Restriction phase
Restricting the reconstruction to physical state space can be done in a number of ways, but is always more complicated than in the MaxLik case, because the covariance matrix also has to be treated. The easiest way is to first calculate the maximum likelihood physical state. This is the state ρ ML satisfying Tr ρ ML = 1 and ρ ML ≥ 0 and minimising the squared to the reconstructed unphysical mean state. In the presence of exact constraints, these calculations are best done using tilde quantities. This is a constrained minimisation problem that can be reformulated as a semi-definite problem (SDP) (see Section III E) and can be solved efficiently using SDP solvers.
The obtained minimal Mahalanobis distance M 2 ML has diagnostic value. If nothing has gone wrong, the MaxLik solution should be well within the confidence region of the reconstructed likelihood distribution. Taking a 95% confidence value, the confidence region is given by the inequality M 2 ≤ γ ν := ( ν − 1/2 + 1.5) 2 , where ν is the dimension of the subspace S X supporting the reconstructed state ρ (the number of degrees-of-freedom). If the value of M 2 ML is much larger than that, this indicates that something has gone wrong either with the tomographic measurements or with the reconstruction, in the sense that the underlying assumptions are violated, for example if certain additional noise sources haven't been accounted for. Application 1 illustrates this aspect very clearly.
If M 2 ML falls within the confidence interval related to the unphysical reconstruction, we can calculate the confidence region for the physical restriction. A good approximation for that region is given by the intersection of the ellipsoid with the physical set. A drawback of this approach is that to get error bars the intersection has to be calculated explicitly, which is a non-trivial task.
In Sec. III F we present an alternative algorithm for performing the restriction to the physical set. This algorithm does not rely on an SDP solver and, furthermore, yields an explicit confidence region, allowing to calculate error bars in a straightforward way.
Finally, it is possible to calculate a regularised solution. This is a physical state within the physical confidence region that optimises a certain regularisation functional. Possible choices are the entropy, and then we get the so-called maximum entropy (MaxEnt) solution, or a functional expressing the smoothness of a solution. An example of the latter is given in Application 2. The calculations for this again require solving a semi-definite program (see Sec. III H 4).

B. Approximation of the Measurement Process in Quantum
Tomography by a Linear-Gaussian Model

The Likelihood Function in Quantum Tomography
Any quantum measurement, be it in state tomography or process tomography, can be characterised by the application of a POVM {Π (k) } K k=1 to a certain state ρ; this state could be the state under investigation, or the output of a quantum process given a certain applied input state. From the point of view of the experimenter, the state ρ is initially unknown, even though the experimenter may have certain preconceptions about it. Because the tomographic experiments reveal information about the state of a statistical nature, the state has to be treated as a random variable. Henceforth, ρ will denote an observed quantum state corresponding to a random variable denoted by R.
Quantum mechanics predicts the probabilities of each of the K POVM outcomes on a state ρ to be p k = Tr[ρΠ (k) ].
We define the vector of probabilities as p = (p k ) K k=1 . When the state is described by a random variable R, the vector p is an observation of an underlying random variable P , with In reality, P is never observed directly. We will consider two types of optical tomography experiments in this paper.
In pulsed mode tomography experiments, N individual light pulses are sent into the system, each pulse prepared in a state ρ. The POVM measurement is repeated N times, presumably on a sequence of N independent identically prepared states ρ.
For every pulse, a detector either clicks or does not click. The results of these N runs can then be combined into a vector of frequencies f = (f k ) K k=1 of the respective outcomes. This vector is an observation of a random variable, F . As is wellknown, for fixed N and p, F has a multinomial distribution with parameters N and p: F ∼ Mtn(N ; p).
In continuous wave (CW) mode optical experiments, the incoming laser beam is turned on for a relatively long but fixed time, and the number of times the detectors click in that time span are taken as the frequencies f k . The elements F k are Poisson distributed with mean value µ k = Ap k , where A is a proportionality factor called the brightness factor. This incorporates the intensity of the laser beam, the duration of the experiment, detector losses, etc. Obviously, the sum of frequencies N = k f k is not fixed but is a Poissonian random variable as well.
Combining all this with the relation P k = Tr[RΠ (k) ] we obtain the PDF f F |R (f |ρ) of F conditional on R, or the likelihood function when considered as a function of ρ. Pictorially, we have the following (for pulsed mode experiments): The first step is a linear mapping, and the second step is the quantum noise model. In comparison, recall that Kalman filtering is based on the linear-Gaussian model: The first step is again a linear mapping, but the second step is an additive Gaussian white noise (AGWN) model. As mentioned, the basic idea explored in this paper is to approximate the quantum model by a linear-Gaussian model in order to open the door for Kalman filtering. To do that, the following incompatibilities have to be overcome: first, in linear-Gaussian models there are typically no restrictions on the state vector X, while in quantum mechanics R is confined to quantum state space (positive semi-definite and trace equal to 1). We postpone the solution to this problem until Section III E and just pretend for the time being that the variable R is unrestricted.
The second difference is of course the different noise model. While in both measurement models the first step is a linear mapping, the quantum noise model is non-additive and non-Gaussian. In spite of this apparently rather large difference, the simplicity of the Kalman filter equations is so appealing that one is enticed to try and approximate f F |R by a linear-Gaussian model anyhow. Indeed, many distributions, including the multinomial and Poisson distributions, can be approximated by a normal distribution, and according to the law of large numbers the approximation gets better when the number of observations increases. Incidentally, this is why we have imposed the requirement that the number of experimental runs per measurement setting should not be too small.
So the main problem we are faced with is to reconcile the two models in a statistically sound way, but without losing sight of the practical issues. In the next subsection we first present a deceptively simple "solution", one that comes to mind almost automatically, but which suffers from a number of serious drawbacks. An observation that is more than 200 years old will then provide a way out of this conundrum, paving the way to a more satisfactory solution.

A naïve approach
If we make the straightforward identifications Z = F , Y = P and X = R, then L R|F (ρ|f ) provides the likelihood function L X|Z required for the Bayesian update formula (10).
There are two problems with this, however, preventing a direct mapping to a linear-Gaussian model: 1. P enters in the moments of F of all order, and not just the mean.
2. P enters in these moments in a highly non-linear way.
A first naïve approach could be to simply approximate the multinomial distribution by a multivariate normal with mean N p k = N Tr[ρΠ (k) ] (which is linear in ρ, as required) and with covariance matrix the one obtained by taking the covariance matrix of the multinomial distribution and replacing every occurrence of p k by f k /N (which is independent of ρ, as required). Although this superficially seems to solve the above problems, a serious drawback of this approach is that the assignment of the covariance matrix is very ad-hoc; for example, p k is replaced by its estimator f k /N in the covariance but not in the mean. Even more importantly, this approach is statistically ill-founded and, in fact, underestimates the actual variance of F . This is most apparent when some of the components of f are zero. Indeed, consider an N -trial 2-outcome measurement, where f = (f 0 , f 1 ) = (f 0 , N − f 0 ), and suppose f 0 = 0. In the naïve approach, the variance assigned to f F |R would be N p(1 − p), with p replaced by f 0 /N , hence giving zero. This is clearly a mistake because a variance of zero amounts to perfect knowledge, and a confidence interval of zero width. However, never having seen outcome '0' is no guarantee that '0' will not occur in the future, no matter how high the value of N may be. This has also been noted in Ref. [5].
The first documented encounter of this phenomenon appeared in a 1774 paper by Laplace, as the so-called "sunrise problem": calculate the probability of a sunrise, solely based on the information that it has risen N days before [29]. The answer is not 1. Instead, the correct value of this conditional probability is given by a formula that is now known as Laplace's rule of succession (see, e.g. Ref. [30]). In a more wider context we can consider the "visible sunrise problem" and calculate the probability p that we can see the sun rise (unhindered by clouds), given that we have done so in f of the N days before. In the modern interpretation of Laplace's rule, p is a random variable with a uniform prior, and conditional on the N observations has a posterior PDF that according to Bayes' rule is a beta-distribution with parameters f and N −f , whose mean is (f + 1)/(N + 2). While useful for predicting sunrises, beta distributions will also offer the solution to our reconstruction problem.

Bayesian Solution
Essentially, Kalman filtering can be seen as Bayesian inference, simplified to the case of linear-Gaussian models. When the noise is no longer Gaussian, as in our case, but we still want to reap the benefits from the simplicity of the Kalman filter equations, we really should be looking at the Bayesian inference equations and suitably approximate these, rather than approximate the model and apply Kalman filtering to that. In this way we can avoid the problems of the naïve approach.
More precisely, what we will do is match the two models after Bayesian inversion of their noise processes. Recall, for the linear-Gaussian model this gave with the moments of Y determined by the observation z, Eqs. (12). For the quantum measurement model we have Bayesian inversion yields the PDF of P conditional on the observation f of F . As explained in Section II B this is a Dirichlet distribution with parameter f , P ∼ Dirichlet(f ), and moments given by (6) and (7). The solution to the matching problem has now become very simple. We match the partially inverted quantum measurement model to the partially inverted linear-Gaussian model, and to do so we approximate the Dirichlet distribution by a Gaussian distribution with same first and second order moments (moment matching). The upshot of all this is is the following rule: In the Kalman filter update equations (20) replace z by formula (6), and Θ by (7).

Remarks.
1. In our context, the formula for the mean of a Dirichlet random variate (Laplace's rule of succession) could be paraphrased as "each outcome gets one click for free". In statistics this extra count is called a pseudocount [31]. In comparison, the mode (the position of the maximum of the PDF) is given by We would like to point out that in maximum likelihood reconstruction one takes the mode as the basic quantity (the relative frequencies of the outcomes), as that is the point of maximum likelihood, while in our approach we use the confidence region, which is approximately centered around the mean (the modified relative frequencies, with pseudocounts included). To counter potential objections against this approach we already mention here that, for any non-unreasonably small confidence value, the mode is well within this confidence region. This will be shown in the appendix, Section B 1.
3. In fact, even if one is not going to calculate the confidence region, Laplace's rule tells us that one should really use the modified relative frequencies, because "the mean of a posterior can be thought of as being more representative [than its mode] as it takes into account the skewness of the PDF." ( [28], p. 25).
4. In this whole discussion we have quietly disregarded the fact that in the quantum setting the probabilities P do not necessarily range over the whole probability space. The range is essentially determined by the relations p j = Tr[ρΠ (j) ]. Thus, to be completely correct, all Bayesian integrations should be carried out over this range. However, in many cases, integrating over the exact range complicates the calculations too much. On the other hand, in the quantum tomography setting, performing exact integrations does not guarantee that the final solution, in terms of the state, belongs to the physical set anyway. Therefore, we integrate over the full probability space (the d-dimensional probability simplex) and restrict the solution to the exact physical set afterwards (see Sections III E and III F). More about this is discussed in Section III B 5.
We have illustrated our approach here for the case of pulsed experiments, where the distribution of clicks is governed by the Multinomial distribution. More generally, the approach can be described as follows. Let f be the PDF of the distribution of the outcome frequencies, conditional on the probabilities p. From this, derive the conjugate PDF, i.e. the PDF of P conditional on the observed frequencies. Then approximate this conjugate PDF by a Gaussian using moment matching. This amounts to replacing z and Θ in the Kalman update equations by the first and second order moments (which are functions of the observed outcomes) of the conjugate PDF, respectively.

Poissonian counts
Consider, as a second example, CW experiments, where the outcome frequencies are governed by a Poisson distribution. More precisely, the frequencies f j of the outcomes are independent Poisson variates with parameters µ j = Ap j , where A is the brightness factor of the experimental setup.
The conjugate distribution of the Poissonian f (k|p) = e −µ µ k /k! with µ = Ap is the PDF where Q(k+1, A) is the regularised incomplete Gamma function [32]. We can apply this to find the PDF of P conditional on the outcome frequencies f j . While the latter are independently Poissonian distributed, the p j have to add up to 1 and are therefore correlated. Thus, the PDF of P equals the product j f (p j |f j ) but renormalised to 1 over the probability simplex of p. A short calculation yields the surprising result that, again, the conjugate PDF is given by the Dirichlet distribution, with N = d j=1 f j . Maybe even more surprising is the fact that the brightness factor A cancels out completely. This is rather convenient, since, in general, A is not known, or at least not with great precision. We can therefore carry over the formulas for the pulsed mode case to the CW case wholesale, with the one addition that N has to be explicitly defined as N = d j=1 f j .

Non-POVM Measurements
As is well-known, the most general measurement one can perform is a POVM measurement, described by POVMelements, positive-semidefinite operators that add up to the identity operator. In practical experiments, however, one is not bound to implement the full POVM. For example, one could just implement one element of the POVM at a time, make N measurements with it, and leave the other elements for subsequent runs. Under the assumption that the measurements are always made on identically prepared state, this makes no difference in the end result (the vector of outcome frequencies). Because of this, one can simulate measurements that cannot be performed in a single shot, namely POVM measurements where the elements do not add up to identity, as long as the elements Π (j) themselves obey the condition 0 ≤ Π (j) ≤ 1 1 (so that they form part of some POVM-proper).
However, for the purposes of tomographic reconstruction, in particular for the kind we consider here, this potentially poses a problem in the CW case. When the POVM elements add up to a multiple of identity, the unknown brightness factor A drops out of the calculations, just as in the case of proper POVMs. When the elements do not add up to a multiple of identity, this is no longer the case, and the calculations become more complicated. The brightness factor A is now a so-called nuisance parameter and has to be removed from the likelihood function by integrating it out, as shown below.
This situation has already been considered in [33,34] for Maximum Likelihood reconstruction. It was noted there that the sum of the POVM elements, the matrix Π (0) := j Π (j) , determines the field-of-view of the tomography experiment. That is, if Π (0) is supported only on a restricted subspace, the reconstructed state will also be supported on that subspace only. The tomography will be "blind" to state components outside that subspace. Moreover, the eigenvalues of Π (0) determine the sensitivity of the tomography along the corresponding eigenvectors. The larger the eigenvalue, the more accurate the tomography in that direction will be.
When the POVM elements Π (j) no longer add up to identity, the corresponding "probabilities" p j = Tr[ρΠ (j) ] do not add up to 1 either. Let us then define p 0 = d j=1 p j . Similarly, define f 0 = d j=1 f j . The maximum likelihood reconstruction method can be extended to cover this situation by renormalising p j : one just replaces p j by p j /p 0 in the original MaxLik formulas. This so-called extended maximum likelihood principle was first suggested by Fermi (see [35], p. 90). For our reconstruction method, however, we need the mean and variance of the likelihood function, and not just the mode, and these depend on Π (0) in a more complicated way.
If the POVM elements add up to a multiple of identity, Π (0) = M 1 1, then p 0 = M , a constant. Otherwise, p 0 is not a constant, but depends on the state ρ. The PDF of the corresponding measurement outcomes (CW case) is the product of Poissonians The corresponding likelihood function is proportional to this, and can be converted to a PDF by normalising over the set of allowed values of P . Now this is exactly the problem: how should the probability space P of P look like when the p j no longer add up to 1? When p 0 is a constant, p 0 = M , it is reasonable to take the set of non-negative vectors adding up to M as probability space. Then the normalisation constant becomes and all references to A indeed cancel. The end result is then a Dirichlet distribution in terms of the probability vector p/M . The mean and covariance matrix of P are thus given by the formulas for the Dirichlet moments, multiplied by M and M 2 , respectively.
The remainder of this subsection can be skipped on first reading, and can be skipped altogether if one always makes sure that the POVM elements used add up to a multiple of identity; this design rule is recommended.
When p 0 is not a constant, this magical cancellation of A no longer takes place. The standard way to deal with this in Bayesian inference is to consider A as a random variable, too, and calculate the joint distribution of A and P : Since we are not really interested in A (it is a nuisance parameter) we then take the marginal distribution of P by integrating out a. Using the integral A second problem that occurs when p 0 is not a constant is the geometrical shape of the probability space P. In principle, this shape can be derived from the relations p j = Tr[ρΠ (j) ], but this nearly always yields a complicated set and integrating over it is extremely difficult (as has been remarked before).
For example, let ρ be a qubit state ρ = x z z 1 − x , and take the POVM elements Then P is defined by p 1 + p 2 = 1 and (p 1 − 1/2) 2 + (p 3 − 1/2) 2 + (p 4 − 1/2) 2 ≤ 1/4 (essentially a Bloch sphere). The reader is invited to try and integrate the function p f1 1 p f2 2 p f3 3 p f4 4 /p f0+1 0 over this set. As before, we propose to integrate over the smallest set containing P and giving easy integrals, and restrict to the physical set in a later phase. The easiest way to do this is to first fix p 0 and include all points in the simplex P(p 0 ) := {p : p j ≥ 0, d j=1 p j = p 0 }, perform all the calculations conditional on this assumption, and then average over the range of p 0 . This range can be easily determined from the extremal eigenvalues of Π (0) := j Π (j) . Let m and M be the smallest and largest eigenvalue of Π (0) , respectively, then m ≤ p 0 ≤ M .
To average over p 0 we need a measure; to get P = m≤p0≤M P(p 0 ), with all points in the set equally weighted, this measure has to be proportional to the volume of the simplex P(p 0 ). As this volume is proportional to p d−1 0 , the measure is p d−1 For fixed p 0 , the calculations show that P /p 0 follows a Dirichlet distribution, with N = j f j . Thus the moments of P are: .
where we have defined The range of φ k is between d/(d + k) and 1, obtained when m = 0 and m = M , respectively. Hence we get .
This yields the covariance matrix elements via the relations For the extreme case m = 0, this gives As a small check, for m = M (Π (0) = M 1 1), we get the mean and covariance matrix of the Dirichlet distribution, multiplied by M and M 2 , respectively.

C. Choosing a Prior
In this Section, we tackle the problem of choosing an appropriate prior distribution, as required for starting the Kalman filter process. We also have to solve problem of restricting the solution of the Kalman filter to physical space. Both problems could have been solved in one go by choosing the prior to be a uniform distribution over state space, and setting it equal to 0 outside of it. Unfortunately, Kalman filtering requires a Gaussian prior, and we leave the solution of the restriction problem to the next section. In this section, therefore, we ignore the restriction to physical state space.
When we have no prior information about the quantum state apart from the tomography data, we have to construct a prior that reflects this total lack of knowledge. Moreover, to allow for the application of Kalman filtering, this prior has to be Gaussian. One such prior could be a Gaussian with an infinite covariance (the mean would then be irrelevant): Σ = ∞1 1. In numerical computations, this infinity of course has to be replaced by a finite, but still big number b, giving Σ = b1 1. On the other hand, to avoid numerical instability, b should be not too big.
But what does big enough and not too big mean? Fortunately, as we are dealing with quantum state estimation, we know that the state belongs to a bounded set: its eigenvalues are positive numbers summing up to 1. To illustrate this, let us restrict to diagonal d-dimensional states, i.e. distributions.
With the choice Σ = b1 1, the squared 2-norm distance between two such distributions p and q is given by ||p − q|| 2 2 /b. (Why we take the 2-norm distance will become clear in the next Section). The maximum value is therefore 2/b. To minimise the influence of the prior, this distance should be small enough, and certainly much smaller than d. Hence, we need b ≫ 2/d. In our numerical experiments, we have chosen the value b = 1.
For the mean of the prior, the best choice is to take a state "in the middle" of state space. For distributions this would be the uniform distribution (1, . . . , 1)/d, for quantum states the maximally mixed state ρ 0 = 1 1/d. More generally, one could take the state that has the largest entropy within the physical set.
An alternative solution to the problem of choosing a prior is based on the observation that the Bayesian update equation (10) is basically a multiplication and therefore all Bayesian updates commute. We can therefore start with any suitable prior and divide it out again after all Kalman filter updates have been performed. This amounts to the same thing as starting off with the infinite width prior. This division is easy when the covariance matrix of the chosen prior is a scalar matrix, Σ 0 = b1 1, with some finite choice of b.
Let µ 0 represent some fixed state and let us consider measurement parameters of a very specific form z = Hµ 0 and Θ = bHH * . In that case the Kalman update equations simplify to Using this it is easy to calculate that starting off the Kalman filtering sequence with the "infinite width" Gaussian prior (µ = 0 and Σ = ∞1 1) and applying the Kalman update step (z = Hµ 0 and Θ = bHH * ) yields as updated state a Gaussian with µ ′ = µ 0 and Σ ′ = b1 1. Starting off with this Gaussian as prior (µ = µ 0 and Σ = b1 1) is therefore equivalent to starting off with an infinitely wide prior and applying this particular Kalman update step once, anywhere during the sequence (anywhere, because of commutativity). In particular, this update can be done at the end of the sequence. Undoing the narrow prior can therefore be done after the final Kalman update by applying the inverses of equations (25) and (26). Denoting by µ and Σ the quantities obtained at the end of the Kalman filter sequence, and by µ corr and Σ corr the corrected ones, with infinite prior, we have the correction equations In practice, one could for example choose µ 0 to be a representation of the maximal entropy (maximally mixed) quantum state (1 1/d).
The problem with these correction equations is the extreme sensitivity of µ corr to even the tiniest variations in µ when Σ corr has very large components. While this is not necessarily a numerical artifact-large uncertainties on certain components of the covariance should go hand in hand with equally large uncertainties on the mean-it may cause numerical problems further down the line. For that reason we try to avoid this situation by choosing a slightly larger value for b in the correction equations than was used in the construction of the prior. To obtain a corrected covariance matrix Σ corr with an upper bound σ 2 max on the variances we can choose a value b ′ satisfying 1/b − 1/b ′ = 1/σ 2 max .

D. Calculation of the Confidence Region
The mean value and covariance matrix that we have been able to calculate using the Kalman filter method are not ends in themselves. One possible use of these is to calculate mean and variance of certain operators when applied to the state. In Ref. [5] it is shown that the mean value and variance of an operator Z depends on the mean state µ and state covariance Σ (the inverse of the Fischer information matrix F ) via the relations z = Tr µZ where z is a vector representation of the operator Z. The error bars on Z can then be derived by setting appropriate condifence levels.
In this subsection, we derive more generally an expression for the confidence regions for the complete state, corresponding to the reconstructed mean value and covariance. The confidence region is the region around the mean value obtained from the Kalman filter procedure within which the actual state can be found "with high probability". The value of this probability is called the confidence value and is typically chosen to be 95%. Stated otherwise, the probability that the actual state is outside the confidence region should be "low", e.g. 5%.
For the multivariate normal distribution, with mean µ and covariance matrix Σ, the confidence region is an ellipsoid centered around the mean. This is quite clear as the surfaces where the PDF assumes a constant value are governed by the quadratic equation (x − µ) * Σ † (x − µ) = c (note the Moore-Penrose inverse, as required when there are exact linear constraints; see Section III G). The quantity of the left-hand side is the squared Mahalanobis distance M 2 between points x and µ, as defined by (3). The surface of the confidence region is thus the set of points at a certain Mahalanobis distance from the mean. To find which value M should take for which confidence value, we have to consider the distribution of M.
It is well-known that the squared Mahalanobis distance has a chi-square distribution with ν degrees of freedom (DoF): where ν is the rank of Σ, equalling the dimension of x minus the number of independent exact constraints (zero variance components). A proof of this basic fact goes as follows: Proof. Suppose Σ has rank ν and is bounded (all eigenvalues are finite), then its MP inverse has rank ν as well and can therefore be written as Σ † = Q * Q, where Q is a ν ×d matrix. Introduce the ν-dimensional random vector U = Q(X − µ).
Then the entries of this vector are independent and distributed as U i ∼ ind N (0, 1). The sum-of-squares of these entries is then, by definition, χ 2 ν distributed [36]. Since the squared Mahalanobis distance M 2 is equal to We summarise the main properties of the chi-squared distribution [36]. The PDF as a function of x, with x ≥ 0, is given by and the cumulative distribution function (CDF) by where Γ(a, y) is the incomplete gamma function 1 . The mean of a variable X 2 ∼ χ 2 ν is ν and its variance is 2ν. The variable X itself is distributed as X ∼ χ ν . For not too small values of ν, X is approximately normal with mean ν − 1/2 and variance 1/2. The 95% confidence interval of X is therefore approximately given by 0 ≤ x ≤ ν − 1/2 + 1.16309, where 1.16309 = InvErf(0.9), the root of [1 + Erf(x)]/2 = 0.95. Even for the smallest ν that we will encounter, this approximation turns out to be very good; for ν = 3 (a single qubit state, for example) the value x ≤ 3 − 1/2 + 1.16309 yields the only slightly smaller confidence value of 94.3%.
One immediately obtains that 95% of the probability mass of a multivariate normal is contained in the ellipsoid consisting of points whose M 2 lies in the 95% χ 2 ν confidence interval. In other words, the 95% confidence region is the ellipsoid This formula lies at the heart of many statistical procedures, for example outlier detection. Now, since we are approximating the actual posterior PDF by a Gaussian, the true confidence region will be different from the one just obtained. However, we show in Appendix B 2 that the difference will not be dramatic. Even in the worst case, the actual distribution of M 2 is very close to chi-squared, but has a variance that is larger by a factor of about 30% (unless N , the number of measurements per run, is pathologically small). This means that the confidence region will be slightly larger than the value given by (29). A conservative estimate is to take

E. Restricting to Physical State Space
In this Section and the next, we treat the problem of restricting the solution of the Kalman filter to the physical region S; when the state is a quantum state, this means restricting solutions to state space, the set of positive semidefinite matrices of trace 1. We will perform this restriction as a post-processing step after the Kalman filtering calculations.
At the level of the PDF, the restriction involves setting the values of the obtained PDF equal to zero outside S, and then renormalising the PDF (as it should now integrate to 1 over S instead of the whole space). The whole art is to determine the value of the new normalisation factor. This requires the integration of the posterior PDF over S, which is a very complicated problem, especially for high dimensions; this problem is known as the Bayesian integration problem. Likewise, similar integrals are necessary to obtain the moments of the restricted PDF.
Various numerical methods have been proposed to approximate such integrals; for an overview see [37,38]. The particular problem faced here is that the intersection of the unconstrained confidence region with the physical set has extremely low volume both within the confidence region and within the physical set, in part due to the high dimensionality of the problem. This turns out to be a very challenging situation for all existing integration methods.
In the following, we present two approximative methods. The first method is very simple but rather crude and actually circumvents the Bayesian integration problem. It is not a very powerful method, because it only allows to check whether a state is in the restricted confidence region. For some situations this might be already enough. If one desires to know the shape of the restricted confidence region, e.g. via its moments, then this method does not suffice. Nevertheless, the method is extremely simple to apply.
A second method, discussed in the next Section, is more powerful and yields an approximation of the restricted confidence region in explicit form. Perhaps not surprisingly, it is based again on Kalman filtering and yields an approximative confidence region expressed by a mean value and a covariance matrix. This method, while still in its experimental stages, seems to work amazingly well in practice.
The simple method consists of keeping the first and second moments of the unphysical posterior PDF but modifying the M 2 CR value to take the renormalisation over the physical set into account. Furthermore, rather than calculate the exact value of the new M 2 CR , a conservative upper bound is taken that is easy to calculate.
The method consists of two parts, of which the first one is optional. First, the maximum likelihood (MaxLik) solution is calculated. That is, the physical state for which the unphysical posterior PDF is maximal is calculated. This corresponds to the following semi-definite program: find the minimal value of t for which a state ρ exists satisfying the mixed semi-definite/quadratic constraints For this minimal t, the state ρ in question is the MaxLik solution.
Even though very efficient semi-definite program solvers exist [39], this part can still be very time consuming when the dimension of the state is high. Nevertheless, finding the Max-Lik solution is interesting enough in its own right to warrant inclusion of this part. After all, this solution is what most reconstruction algorithms try to find. In our context, the MaxLik solution also allows to check the validity of the tomography data. Indeed, given that the MaxLik solution corresponds to the best physical "guess" of the actual state, the former should lie within the "raw" (i.e. unconstrained) confidence region allowed by the measurements. Thus, the Mahalanobis distance between the MaxLik solution ρ ML and the mean of the unconstrained posterior PDF should be below M CR . If not, this could be an indication that something is wrong, either with the data, or with the underlying assumptions (e.g. the noise model).
For the second step, we consider the Mahalanobis distance just calculated, The confidence region for the constrained PDF is the intersection of the physical set with the ellipsoid (ρ− µ) * Σ † (ρ− µ) ≤ M 2 CR,phys , where M 2 CR,phys is the confidence value for the constrained posterior PDF. Note that µ and Σ are still the moments of the unconstrained PDF. Because the constrained posterior has to be normalised over the physical set only, M CR,phys will be larger than M CR . Calculating the exact value of this M CR,phys is an extremely difficult problem, but we can prove the validity of a very simple upper bound: with M ML given by Eq. (31). The proof of this bound is given in Appendix A. In case one does not even want to calculate the MaxLik solution, and one is willing to believe this solution is in the unconstrained confidence region, M ML can be replaced by its (presumed) upper bound M CR,unphys , giving the simple result that the constrained confidence limit is at most twice the unconstrained one: In conclusion, the simple method consists of doing the following: 1. Depending on resources and taste, choose between steps 2 or 3, then proceed to step 4. Let us now move on to our second method for restricting the posterior PDF to physical state space. It is a more involved method but gives more information. To simplify the discussion, consider an example where X is a d-dimensional real variable, and the physical region consists of the positive orthant x i ≥ 0, i = 1, . . . , d. We assume that, after incorporating the measurement data, the unrestricted PDF of X is (approximately) given by its mean µ and its covariance matrix Σ. We assume that the corresponding confidence region is not completely contained in the physical region; otherwise, there would be nothing to do here.
Consider now the marginal distributions of each of the components X i of X. The marginal distribution of X i is of course normal, with mean µ i and variance Σ ii . We can then easily calculate the confidence interval of each X i for given confidence levels. There will at least be one such X i whose confidence interval will not be completely contained in the physical interval x i ≥ 0. Broadly speaking, this is the one-dimensional marginal version of our restriction problem.
The first key idea of our proposal is to consider the marginal distribution of this X i . If more than a fixed amount α of probability mass of this marginal is outside the physical interval X i ≥ 0, we truncate the marginal to that interval. That means, we set the density equal to 0 outside the physical interval, and renormalise to 1. We then approximate this truncated normal distribution by an ordinary normal distribution, with appropriate meanμ and varianceσ 2 . How we will do this is described in the next subsection.

The marginal restriction problem
The obvious idea of using moment matching to approximate the truncated normal is of no use here because we need a procedure that gives a stable result when applied twice or more. Approximating a truncated normal with moment matching does not yield a distribution with controlled tails (the tail being that part of the distribution outside the physical interval). Truncating it again and approximating that truncated normal with moment matching for a second time will typically yield yet another set of values for mean and variance. There is no guarantee at all that this process would converge. For that reason we seek an approximation procedure that controls the tail probability explicitly. We will require that the approximating normal distribution has exactly an amount α of probability mass outside the physical interval x ≥ 0, where α is small, say 5%. For an illustration of this approximation process, see Fig. 1.
As a quality measure of the approximation, we will use the Kullback-Leibler distance again D KL (p||q) with the truncated distribution as first argument (thus, with integration interval restricted to x ≥ 0) and the approximating distribution as second argument. Without the restriction on the approximating distribution this would result in the moment matched approximation.
Minimising this distance over all choices of approximating distributions amounts to maximising the following function over the parametersμ andσ: The requirement we imposed on the probability mass in the left tail of the approximating normal translates to the equalitỹ where α 0 = √ 2 InvErf((α + 1)/2). For α = 5% we find α 0 ≈ 1.64485. After some algebraic manipulations one finds the following complicated set of formulas for the optimalσ: In Fig. 2, we plot the ratioσ/σ as a function of t. This one-dimensional solution has now to be translated back to the original setting of restricting the state vector X to the physical set. As this translation is basically an inverse problem involving normal distributions only, we will again use a Kalman filter to solve it, as explained in the next subsection.

Backprojecting the marginal restriction using Kalman filtering
Our second key idea is to enforce the original unrestricted distribution to have a marginal distribution (for component X i ) given by the one-dimensional approximating normal, with parametersμ andσ. Formally, this is equivalent to performing a linear one-dimensional measurement, and we can incorporate its effects on the state X by a Kalman filter update step. The parameters of the measurement (measurement matrix H, mean z and covariance Θ) will have to be such that the effect of the measurement is the required enforcement of the marginal mentioned above.
As we only want to enforce the marginal of the X i component, we will set H equal to the row vector i|. Thus, Hx = x i . This is a one-dimensional measurement, so that its mean z and covariance Θ will be one-dimensional too. Correspondingly, the Kalman gain is a column vector. Inserting this into the Kalman filter update equations (20) gives The X i marginal of the updated distribution will then have moments µ ′ i and Σ ′ ii given by We find the required values for the measurement parameters z and Θ by solving the equations µ ′ i =μ and Σ ′ ii =σ 2 . The solution is Here,μ andσ are given by the equations (34) and (35), with µ = µ i and σ = √ Σ ii , the marginal moments of X i in the original distribution.
In the previous subsections, we have shown how a single variable X i can be restricted to its physical interval X i ≥ 0. In general, if the physical set is convex, the set is defined by a number of such inequalities, possibly an infinite number. For simplicity, we first treat the case that the physical interval is defined by a finite set of inequalities X i ≥ 0, ∀i, and treat the more general case below. This case corresponds for example to diagonal quantum states, and also to the optical POVM of Section V.
To restrict the complete state vector X to the physical set, we repeat the above-mentioned procedure for every component of X, or at least for those components for which µ i /σ i < α 0 . In general, however, because of correlations, multiple components of X will be affected by a single step of the procedure, and it could very well be that the work of previous steps is partially undone by the current step. For example, forcing X 2 to be positive could bring X 1 back into the non-physical region.
Therefore, a number of runs of the algorithm will be necessary, stopping when all marginals have their confidence intervals approximately within the physical interval. As the quantity min i µ i /σ i will converge to α 0 from below, a good stopping criterion is min i µ i /σ i > (1 − ǫ)α 0 , where ǫ is a small positive number. In practice, ǫ should not be chosen too small, so that the algorithm terminates in reasonable time; in our applications we chose ǫ = 0.003.
The order of the iterations, namely which marginal X i to treat first, does not seem to influence the end result very much. In one set of experiments we treated the marginals in fixed order, and in another we always chose the marginal with smallest µ i /σ i first. It is not clear that the latter order should converge faster because of the correlations between the X i ; in our experiments it only did marginally so. While this and other convergence issues are still under investigation, they appear not to be of major importance.
An illustration on a small example is shown in Fig. 3. The left graph shows the result of the restriction of X 1 to the physical interval X 1 , X 2 ≥ 0. One sees that the new distribution again crosses the border of the physical set, but now component X 2 is involved, although it did not before the update. A second Kalman filter update step will therefore be necessary, on X 2 . The result of that second step is shown in the right graph of Fig. 3.
In the case that the number of inequalities defining the physical set is infinite, for example for the set of quantum states, where the inequalities are Tr ρX ≥ 0, ∀X ≥ 0, the fixed order rule is obviously infeasible. The smallest-first rule, on the other hand, requires the complicated minimisation of X * ρ/ √ X * P X over all X ≥ 0. A third, and much simpler possibility is to choose X at random. However, that method exhibits slow convergence, especially in the final stages when After applying one step of the Kalman filter, with parameters given by Eqs (36)(37)(38), we obtain a normal distribution whose PDF is plotted in the left diagram in dashed lines.
In the graph on the right the result of a second Kalman filter update step is shown, now for component X2.
the number of unsatisfied constraints becomes small. A better option is to consider a combination of two rules in the following double iteration: the inner iteration consists of, given a unitary U , performing the restriction on the diagonal of U ρU * , that is with H = Vec(U * e ii U ) * , and varying i. This inner iteration can be performed either using the fixed order rule i = 1, . . . , d or the smallest-first rule. The outer iteration consists of choosing a new random unitary U each time, until a suitable stopping criterion is satisfied, e.g. until the inner iterations achieve no further reduction of µ/σ. Although the smallest µ/σ component does not necessarily occur for U diagonalising ρ, it is a good idea to choose the unitary U diagonalising the current ρ every now and then.

G. Incorporating Exact Linear Constraints
In many cases, the description of the physical state is subject to one or more exact constraints. For quantum states the trace of the density matrix is 1, for trace preserving quantum processes the partial trace of the state representative over the output Hilbert space is the maximally mixed state 1 1/d, and for POVMs the sum of all the elements must be the identity matrix. Depending on the physical system, there may be further constraints like this. These exact linear constraints can be incorporated into the reconstruction process in a number of ways.
A first approach is to incorporate exact constraints via dummy measurements with zero measurement covariance, and replace inverses by Moore-Penrose (MP) inverses. The benefit of this method is that virtually no changes to the Kalman filter implementation are needed. A serious drawback is that the state covariance matrix becomes ill-conditioned, since exact constraints correspond to zero variance components. In reality, numerical round-off causes these components to have non-zero variance, which makes it hard to discriminate between variances that are nominally zero and those that are not. This is a notorious problem for actual implementations of Kalman filters and may cause serious numerical instabilities. Later on in the calculations, the Mahalanobis distance (x−µ) * Σ −1 (x−µ) has to be calculated (see Sec. III D) and even the smallest deviation in x − µ from the exact constraints is blown up by the inverse of Σ.
A second approach is to store exact constraints in two additional matrices, along with state mean and state covariance. In general, exact constraints may have an impact on the state but also on the measurement. For example, when the state is a quantum state, we have the exact constraint on the state Tr ρ = 1, and a corresponding exact constraint on the measurement probabilities i p i = 1. This implies that the difference between any two states, e.g. the mean X and its update X ′ , should lie in a subspace, namely the one for which the trace is zero. Similarly, the difference between two measurements, e.g. the actual measurement outcome z and the expected outcome Hµ, should also lie in a subspace, namely the one for which the sum of all components is zero.
Both subspaces can be represented in calculations by two projectors, T X and T Z . The projector T X projects on the subspace in state space, and T Z on the subspace in measurement space. The Kalman filter update equations can be made more resistant to numerical inaccuracies using these projectors, ensuring that the exact constraints are obeyed in any iteration of the update process, as follows: Here, µ 0 is a reference state, e.g. the maximally mixed state 1 1/d. Note that the ordinary inverse in the formula for the Kalman gain K has been replaced by the MP inverse. Likewise, the inverse of Σ appearing in the formula for the posterior PDF corresponding to the Kalman filter solution has to be replaced by an MP inverse too.
A third approach is to parameterise the state such that the exact constraints are inherently satisfied. The obvious benefit is that the exact constraints do not have to be explicitly imposed. A second benefit is higher numerical stability, and straightforward invertibility of all matrices that have to be inverted.
We start again from the projectors T X and T Z . From these projectors we can derive two partial isometries, X 1 and Z 1 , such that the following holds: the number of columns of X 1 and Z 1 must be equal to the ranks of T X and T Z , respectively; Numerically, these partial isometries can be calculated from a singular value decomposition (SVD) of the projectors. For example, let T X = U SU * ; the partial isometry X 1 is then obtained from the unitary U by dropping those columns that correspond to the zero-valued singular values.
Roughly speaking, using these partial isometries, the matrices Θ, Σ and H can be "cut down" to their invertible parts, which we will denote by a tilde. Definẽ Since the support of Θ is exactly the support of T Z , we also have the reversed equality Θ = Z 1Θ Z * 1 . Furthermore,Θ is full rank and therefore invertible. In a similar way we definẽ

H. Graphical Representations of the Reconstruction
In the previous sections we have presented a methodology for state reconstruction from tomographic data by which a Kalman filter is used to obtain a normal approximation to the likelihood function f X|F , where X is the state and F is the measurement data. The normal approximation is defined by its two moments: the mean state vector µ, and the covariance matrix Σ. These two moments should in principle suffice as a complete statistical description of the reconstructed state (within the limits of the normal approximation).
When it comes to presenting the reconstruction, however, there are a number of problems with the use of mean and covariance matrix alone. Consider, for example, the reconstruction of an optical POVM using our method, as discussed in Section V below. The reconstruction of the diagonal elements of the first element is shown in Fig. 6. The reconstructed mean is plotted as the centerline in the figure. On top of that, we would like a depiction of the covariance matrix, because this matrix essentially describes the reconstruction uncertainties.

Depicting the covariance matrix
The first problem one is faced with is that the covariance matrix Σ, being a matrix, cannot really be depicted in a very meaningful way. Nevertheless, as the whole purpose of calculating it is to provide some kind of error bars on the reconstruction, it is desirable to have some means of representation. One can do this by plotting its diagonal elements √ Σ ii as error bars on the mean value. This is meaningful because the diagonal element Σ ii is exactly the variation of the marginal distribution of X i . Of course, such a plot has to be accompanied by the proviso that the plot can only be indicative, because the variations on the elements are in general correlated.

Avoiding reconstruction artifacts
The second, and more important problem we want to address in this Section is the appearance of reconstruction artifacts in the reconstructed mean. Closer inspection of Fig. 6 reveals the presence of a wave-like pattern in the centerline, while from theoretical considerations of the underlying POVM model there really is no reason why that pattern should be there. Such artifacts are typical for maximum-likelihood reconstruction methods and are well-known in image restoration [28]. Even though the wave pattern in the POVM reconstruction stays well within the error bars, which is already a clear counter-indication to its statistical significance, it would be better to have a reconstruction not showing such artifacts at all. Two methods for obtaining artifact-free solutions (or at least for suppressing the artifacts) are described below.

MaxEnt reconstruction
A widely used method for suppressing reconstruction artifacts is the MaxEnt method, first proposed by Skilling in the context of image reconstruction [28]. Originally, the method was formulated as choosing a special prior PDF based on the entropy S(x) of the states (provided such an entropy exists). In many cases a state can be formally identified with a probability distribution, after suitable normalisation. This is possible whenever the state consists of a set of positive numbers. For digital images, the PDF is the list of intensities of each pixel. For quantum states, it could be the list of eigenvalues of the density matrix. In those cases one can assign a meaningful entropy functional to the state space. For quantum states, the von Neumann entropy is the obvious choice.
The MaxEnt method then consists of choosing the function exp(αS(x)) (properly normalised), where α is a fixed parameter, as prior PDF. Inference then proceeds in the normal way, by calculating the posterior PDF and finding the maximum likelihood solution. The upshot of this choice of prior is that in the absence of other information, preference is given to states with higher entropy. The parameter α characterises the amount of preference. Jaynes' principle of maximum entropy [40] could be seen as a legitimisation of this approach.
In the context of quantum tomography, Hradil andŘehaček [41] advocated a combination of the maximum entropy method with the maximum likelihood (MaxLik) reconstruction method, which they called MaxEnt assisted MaxLik (MEML) tomography. This method can be seen as a special case of Skilling's MaxEnt method. In their paper, they considered the situation of incomplete measurements. This corresponds to a likelihood function whose covariance matrix has a certain number of eigenvalues that are (almost) zero, while the others are infinitely large. The MaxLik reconstruction is thus known with certainty to lie in a certain subspace, but its position within that subspace is completely unknown. In other words, there exists not a single state maximising the likelihood function, but a whole plateau of states. The proposal of [41] consists of finding the point on that plateau (i.e. in the MaxLik subspace) for which the entropy S(ρ) = − Tr ρ log ρ is maximised and take that point as the reconstruction.
From an experimental viewpoint, the situation considered in Ref. [41], of variances that are either zero or infinite, is an idealised one. In practical experiments, the number of measurements is finite, so that even the most precisely known state components have a non-negligible variance. Second, there may be practical and/or technical limitations on the kind of measurements that can be performed, so that some variances may be very large, but still finite. In Sec. V we will see a clear example of this. In that section the reconstruction of an optical POVM is described. While the elements of this POVM are diagonal in the Fock basis, its tomography is based on coherent states rather than Fock states, because the latter are extremely hard to produce. This causes large variances on the reconstructed elements without a clear-cut distinction between perfectly known and completely unknown components. When dealing with such realistic experiments, the full-blown MaxEnt method is much more preferable.
In its original formulation as a choice of prior, the MaxEnt method has a number of shortcomings. One is that there appears to be no satisfactory and rigorous way of choosing the parameter α. Secondly, the principle of maximum entropy does not necessarily apply to the entropy of the states. In quantum tomography we are dealing with a controlled system; the system is being prepared in a predefined quantum state, to the best of the preparer's abilities, and the tomography acts on a sequence of independent identically prepared systems. In thermodynamical terms, this corresponds to a system that could be as far from equilibrium as the preparer wants it to be. This has to be contrasted with Jaynes' MaxEnt principle, which has been inspired by the statistical mechanics of systems in near-equilibrium, and which is based on the argument that the probability of a macro-state should be proportional to the number of microstates consistent with it, i.e. is proportional to its thermodynamic entropy. For systems close to equilibrium, we agree that it makes sense to choose a prior distribution that assigns more weight to states with higher entropy. For controlled systems, and for those systems lacking a fundamental notion of entropy, we are more tempted to opt for a uniform distribution, as we have done in this work, and incorporate the maximum entropy idea as a regularisation, as explained below.

Regularisation
Rather than apply the MaxEnt principle, which we deem not always appropriate, one can adopt a more pragmatic approach in which the entropy functional is no longer fundamental and can be replaced by other functionals. And rather than replace the prior PDF with the chosen functional, which implicitly changes the final posterior PDF, and choose the maximum likelihood solution for that changed posterior, the regularisation method does the following (Ref. [28], Sec. 6.2): the prior PDF is unchanged, and within the confidence region of the resulting posterior PDF (unchanged as well) it finds the solution that maximises the chosen functional. When expressing the functional as a cost, or penalty function, this would be a minimisation.
Since the entropy is a concave functional, maximising it over a convex set (such as the confidence region) is a convex problem and can be efficiently solved numerically. Likewise, minimising a cost function again gives a convex problem provided the cost function is convex. Proper distance measures, for example, would therefore be good cost functions.
Which cost function to use really depends on the problem setting. In the example of the optical POVM mentioned above, theoretical considerations suggested [19] that the smoothness of the POVM elements, defined as could be appropriate. In fact, this smoothness is a commonly used regularisation functional in image reconstruction methods [28]. It is immediately clear that this Q is a convex functional, as required. The appropriateness of this cost function comes from the fact that it penalises the 'wavyness' of the centerline, as exemplified in Fig. 6. When the cost function is quadratic, like this smoothness term, the minimisation problem is a quadratically constrained quadratic programming (QCQP) problem. Such problems can be efficiently solved using semi-definite programming (SDP) solvers [42]. For the sake of definiteness, let us consider the case where the states are quantum states (ρ ≥ 0 and Tr ρ = 1). The general form of a quadratic cost function can then be written in terms of a matrix A and a vector b as (Aρ − b) * (Aρ − b). The SDP form of the QCQP problem is then: minimise the (slack) variable t over all t and ρ under the combined quadratic and semi-definite constraints This problem can be solved in a straightforward way by SDP solvers like Sedumi [39].

IV. APPLICATION 1: STATE RECONSTRUCTION OF AN ENTANGLED 2-QUBIT STATE
The methods introduced in this paper have all been tested on real sets of tomographic data. In this Section and the next we report on two such applications, one in state tomography and one in POVM tomography.
In the present Section, we consider the reconstruction of tomography data of a source of polarisation-entangled photon pairs, obtained by Langford et al [43] and compare our results to their reconstruction. The source is a BBO-crystal down-conversion source operating in CW mode, pumped by an Argon laser. Two sets of tomography data were taken, one directly on the crystal, and one on the single mode fibres (SMF) attached to the crystal. In both cases, the sequence of measurements is as given in Tab. I. This measurement basis is over-complete because not all measurements are needed to obtain a full state reconstruction. Nevertheless, it was argued that by taking an over-complete basis a more accurate reconstruction could be obtained.
A nice consequence of this choice for our reconstruction method is that the projectors of the 36 basis states add up to a multiple of the identity,   Bayesian inversion (f −→ p) we find that p/M is Dirichlet distributed with parameters f and N = 36 k=1 f k . The upshot of all this is that the Kalman update equations have to be executed exactly once, with z given by z = M µ(Dirichlet(f )) and Θ by M 2 times the covariance matrix of Dirichlet(f ). This is particularly convenient, because the issue of setting an initial prior and removing it again after the Kalman updates (see Section III C) can be resolved analytically, which allows us to choose an infinitely wide initial prior b = ∞ without getting into numerical trouble. With such a prior, the Kalman update yields the following posterior, as can be checked with a modest amount of work: Note that these formulas are stated in terms of the "tilde quantities" [see Sec. III G, Eq. (46)]. Both the state and the frequencies satisfy exact constraints, Tr ρ = 1, and 36 k=1 f k = N , and we have chosen to deal with these constraints in the numerically most stable way, by "cutting off" the kernels (zero eigenvalues) of the respective operators. In the derivation of the above formulas care has to be taken because the productHH * is not full rank.
We show the results of the tomographic reconstructions of the measurements at the crystal and at the SMF in Figs. 4 and 5, respectively. Obviously, we cannot show the confidence regions in full 16-dimensional space, and we have chosen a 2D subspace spanned by two pure state projectors. We take the two eigenvectors ψ and φ of the reconstructed mean state µ that correspond to the 2 smallest eigenvalues (one of them being negative). The parameters x 1 and x 2 are then given by the mappings ρ → x 1 = ψ|ρ|ψ and ρ → x 2 = φ|ρ|φ .
For both cases we calculate the least-squares solution and the MaxLik solution. The least-squares solution ρ N is the state ρ N = j c j |ψ (j) ψ (j) |/ j c j , where the coefficients c j are the least-squares solutions of the system The MaxLik solution is the physical state ρ ML for which the Mahalanobis distance from the reconstructed mean state is min- Section IV, showing a slice through state space illustrating the position of the two-photon state, reconstructed from the data obtained by measuring directly at the BBO crystal. What is shown is the projection of the state on the 2D subspace spanned by two well-chosen pure state projectors. The two concentric ellipses centered about the reconstruction mean µ are the projections of the 50% (inner ellipse) and 95% confidence regions (outer ellipse), respectively; these ellipses are quite close together due to the rather high dimension of the system (d = 15). The intersection of the physical set with the subspace is the triangle x1 ≥ 0, x2 ≥ 0, x1 + x2 ≤ 1, of which the lower left corner is shown. The projection of the MaxLik solution ρML is also shown. This solution is well within the confidence region, as should be. For comparison purposes we have also plotted the projection of the "naïve" least-squares reconstruction ρN .
imal. We have implemented this in Sedumi, as indicated in Sec. III E.
In [43], the MaxLik solution was calculated in a different way, through the minimisation of a penalty function Here A is the unknown brightness factor of the experiment. This MaxLik solution closely matches the MaxLik solution obtained through our KF method. To obtain a quantification of the accuracy of the MaxLik solution, Langford used a Monte Carlo calculation to estimate the mean value of Π(ρ ML ) when f k is considered as a Poissonian random variable with mean Ap k (ρ ML ). From this mean value, a fit quality parameter Q is obtained by dividing the mean value by the total number of measurements and taking the square root. Ideally, the mean value of Q should be 1. Because the duration of the tomography run was twice as long as in the case of Fig. 4, the confidence region is much smaller. The main difference, however, is that the confidence region lies deep into nonphysical space, meaning that the MaxLik solution is far outside the confidence region. This is not a deficiency of the KF reconstruction method, nor of its implementation, but is actually a feature. It is a diagnostic feature that indicates that something is wrong with the assumptions about the underlying noise model. A likely possibility is that the measurements are subject to additional fluctuations. According to [43] the most likely source is temperature-dependency of the spatial alignment of the SMFs, which caused the measurements to drift. To get a proper reconstruction this drift should be taken into account in the noise model as an additional term. much more error information, with a clear statistical interpretation.

V. APPLICATION 2: RECONSTRUCTION OF AN OPTICAL POVM
Following a proposal of Ref. [44], in Ref. [45] an experimental realisation was reported of an optical detector with photon-number resolving capabilities. The basic idea is to carve up an optical pulse into 8 portions and detect the presence of photons in each of these portions. More precisely, this setup simulates a cascade of beam splitters and eight avalanche photo-detectors (APDs), with the probability of a photon arriving at a certain APD being roughly 1 in 8. The number of detectors clicking therefore gives an indication of the photon numbers in the pulse. The detector is implemented using two Franson interferometers, an additional balanced beam splitter, two avalanche photo-detectors, and two identical circuits for performing time binning.
The behaviour of this composite detector can be described by a 9-element POVM, where each of the outcomes corresponds to the number of APD's clicking (from 0 to 8). We denote the POVM elements by {Π (k) } 8 k=0 , where the elements Π (k) are positive semi-definite and add up to the identity matrix. In principle, the elements are infinite dimensional (corresponding to photon numbers being unbounded), but we will truncate them at a certain dimension d (in our calculations we have chosen values of d of up to 170). Since this detector has no phase reference, it is insensitive to phase, which means that the POVM elements have to be diagonal in the Fock basis.
To obtain a precise characterisation of the POVM elements, a tomography experiment has been performed [19] by which a large number of pulses consisting of coherent states |α of ever increasing power (∝ |α| 2 ) were sent to the composite detector and the resulting numbers of detectors clicking were recorded. The parameter α was sweeped from 0.4 to 11, in steps of about 0.01, and for each value of α, N = 38084 measurements were taken. Per value of α the measurement record consisted of the number of pulses f k that caused k detectors to click, for k = 0, . . . , 8; obviously, 8 k=0 f k = N . Using these data, a reconstruction of the POVM elements (without error bars) was obtained and presented in [19]. Here we take the same data and perform a reconstruction based on the KF method, yielding a maximally likely solution with in addition a definite confidence region. To avoid any confusion, we stress that the object under scrutiny is a POVM and the measurement is made using prepared quantum states. In other words: the POVMs are states and the state is a POVM.
We have calculated the (unphysical) mean value µ and covariance matrix Σ using Kalman filtering, including T projectors for including the exact constraints that the POVM elements must add up to the identity matrix. Then we applied the KF method for restricting to the physical set, giving physical mean value and covariance matrix. Finally, we calculated the maximally smooth solution within the physical confidence region.
In Fig. 6 we depict the final results for each of the POVM elements, showing the physical mean value solution, the error bars, and the smoothed solution. The smoothed solution of all POVM elements together is depicted separately in Fig. 7. The results are in very good correspondence with both the reconstruction of [19] and the theoretical model of the POVM (based on independent measurements of the reflectivities of the beam splitters and the overall photon loss).
To illustrate how the mean values and error bars change after each KF iteration, we have created a movie, where each frame consists of a plot similar to the one of Fig. 6, generated after each iteration. We refer the reader to [48] for this animation, the MatLab routines used, and other related material.
In order to infer how many measurements are needed to reduce the errors, one has to look at the unconstrained confidence region. We have plotted the spectrum of the unphysical covariance matrix in Fig. 8. This graph allows to estimate the number of experimental runs N necessary to achieve a certain final precision. It is evident from the graph that only 110 of the 800 free components have standard deviation less than 0.001 (λ i = σ 2 i ≤ 10 −6 ). Since variances scale as 1/N , to double that number to 220, say, N should be increased by a factor of no less than about 100,000 (to get the σ 2 580 of 10 −1 below 10 −6 ), i.e. from 38,000 to the rather impractical 3,800,000,000. Hence, to really achieve higher precision with this kind of experiment, another setup should be considered.  ii of the Optical POVM of Sec. V (up to photon number 30 only). This is the regularised solution, i.e. the solution with maximal smoothness that is still within the confidence region obtained from the Kalman Filter method. The solution obtained is in complete agreement with the solution from [19], which was obtained in a completely different way.

A. Comparison to other Methods
The reconstruction method that matches ours most closely is the one reported in [5], which is also based on the likelihood function and also yields a covariance matrix. Hence, this method allows to calculate confidence regions in the same way as ours. The main differences are that in [5] the point of maximum likelihood is calculated first, the covariance matrix is calculated as the inverse of the Hessian (the second derivative matrix ∂ 2 /∂x i ∂x j ) of the logarithm of the likelihood function, taken in the mode of that function, and the restriction to the physical set is imposed beforehand. In contrast, our method amounts to calculating the mean of the log-likelihood function, its Hessian in that mean, and the restriction to the physical set is made afterwards.
First of all, we believe that our approach yields results that better match the exact confidence region. The likelihood function is highly skewed, whenever there are a lot of lowprobability measurement outcomes; this appears to be the rule rather than the exception. In those cases the mean is statistically more meaningful than the mode, especially when one restricts to the mode over the physical set from the outset. Second, restricting to the physical set only in a post-processing phase yields valuable diagnostic information about correctness of the assumed noise model and also about the ultimate accuracy allowed by the particular tomographic data. As illustrated in Application 1, the mode over the physical set can be really far from the mean, due to unforeseen noise/error contributions, and the mean has to be calculated in order to see that. Also, to infer how many more measurements would be needed to improve the reconstruction accuracy, one needs to look at the covariance matrix before restricting to the physical set, as illustrated in Application 2.
Other reconstruction methods also calculate the MaxLik solution and derive error measures from Monte Carlo simulations. As such, they suffer from the same drawbacks as the method of [5] in that the restriction to physical space is made from the outset. Moreover, the time required for the Monte Carlo calculations rapidly becomes prohibitive with increasing system dimensions. Even for two-qubit systems, our method is orders of magnitude faster than Monte Carlo methods. A further problem with the Monte Carlo method is the difficulty of obtaining a reliable stopping criterion.

B. Computational Resources
The memory requirements of our method are easily calculated. They are essentially governed by the dimension of the subspace S X on which the state ρ is supported. If this dimension is D then storage forμ consists of D 2 complex numbers, while for the (tilde) covariance matrix it is the square of that, D 4 . This means that for the full reconstruction of n-qubit states, 2 2n elements are needed for the state, and 2 4n for the covariance matrix.
The computation time for the Kalman filter update (executed once per measurement setting) is dominated by a fixed number of matrix multiplications (of D 2 × D 2 matrices) and one matrix inversion (of a K × K matrix, where K is the number of outcomes per measurement setting and therefore is typically much smaller than D). As the computation complexity of a matrix multiplication for two k × k matrices is O(k 3 ) (or somewhat less), we get a computational complexity of D 6 .
The optional post-processing steps of calculating the Max-Lik and/or MaxEnt solution require solving a semi-definite program. In all reported applications this turned out to be the most time-consuming step.

VII. CONCLUSION
In this work we have introduced a novel Bayesian tomographic reconstruction method based on Kalman filtering that does not just give a maximum likelihood solution but also produces error bars, in the form of a confidence region around a mean value solution. It must be stressed that the error bars are directly derived from the measurement data, unlike in Monte-Carlo methods, where they are produced from simulations.
We have shown that to properly deal with low-probability events (e.g. measurement outcomes with very few clicks) one has to consider the conjugate distribution of the noise model, in the spirit of Laplace's rule of succession. That is, if click frequencies are distributed multinomially or Poissonian, this yields a distribution of the underlying click probabilities that is Dirichlet distributed. This avoids the incorrect assignment of zero probability to an outcome that has not been observed. Furthermore, we have introduced a novel method of ensuring that the reconstruction is physical. This method is again based on Kalman filtering, and has the benefit that it is very fast and again produces appropriate error bars.
Finally, we have applied the method to two real world applications. In the first example, the state reconstruction of an entangled two-qubit state, the reconstruction process reduces to a single application of the Kalman update equations which, apart from its numerical stability, reduces the computational effort. Compared to Monte Carlo methods for calculating error bars the computational effort is reduced by several orders of magnitude. The Kalman filter method also revealed the necessity to adjust the underlying noise model by taking into account additional error sources. The second example concerned the reconstruction of an optical POVM. There the advantages of Kalman filtering also became evident in one's capability to estimate the number of experimental runs necessary to achieve a certain final precision. Both examples indicate that our KF method can be an invaluable diagnostic tool.
In future work we will consider how to deal with measurement imperfections, including drift in the tomographic and system components. We will investigate how the present method can be applied to tomography with continuous variable outcomes. A further topic of study will be the integration of the Kalman filter method within adaptive tomographic setups, as the method is very much an online method, updating the covariance matrix as it goes. Among the more technical issues, we will study the convergence properties of our proposed Kalman filter method for restricting the reconstruction to physical space.
We are confident that our reconstruction method, due to its statistically well-founded nature, can be the basis of a dependable, easily adaptable, and universal reconstruction algorithm. Acknowledgments SS thanks the UK Engineering and Physical Sciences Research Council (EPSRC) for support. We thank the following people who have kindly provided us with ample of tomographic data for developing and testing the Kalman Filter method: I. Walmsley, A.G. White, J. O'Brien and N.K. Langford. It is fair to say that without their assistance this paper would have been very theoretic and equally useless. We also thank M. Plenio, A. Feito, R. Schack, T. Osborne and T. Sharia for illuminating discussions.Last but not least, we thank Z. Hradil for sharing his thoughts on the issues considered in our work.

APPENDIX A: PROOF OF THE BOUND (32) ON THE PHYSICAL CONFIDENCE VALUE
For definitions we refer back to Sec. III D. We start with a Lemma. gives the inequality of the lemma.
We start from the unphysical reconstruction, that is the mean µ and the covariance matrix Σ. Let P be the physical set, and let ρ 0 be the maximum likelihood solution, i.e. the state that is closest to the mean µ, in the Mahalanobis distance. In what follows we will use the Hilbert space representation of states, i.e. a representation as vectors. As before, we will denote this by math boldface. The discussion becomes easier by going over to a new, "standardised" coordinate system, in which the mean µ is the origin and the covariance matrix is the identity matrix. The Mahalanobis distance is then just the Euclidean distance, and the confidence regions are spheres centered around the origin.
In quantum mechanics, the physical set P is convex. By definition, ρ 0 is on the boundary of P. Therefore, P can be decomposed into infinitesimal cones with center ρ 0 , each pointing to a different direction Ω, having cross-section dΩ, and cut to certain length R(Ω), where the latter function determines the overall shape of P.
In standardised coordinates, the unphysical posterior f is given by f (x) = C exp(−x 2 /2), with C the normalisation constant, and x = ||x||. We now want to calculate the cumulative distribution function (CDF) of the physical posterior, which is the normalised integral of f over the intersection of P with the ball of radius x, g(x)/g(∞), with  g(x, Ω) g(∞, Ω) .
The first factor of the integrand, which we will denote by w(Ω), is a PDF, in that it is a non-negative function integrating to 1 over Ω. We have thus shown the following statement: Statement C: The function g(x)/g(∞) is a weighted average of the functions g(x, Ω)/g(∞, Ω) over Ω.
Let us now fix Ω. The value of g(x, Ω) no longer changes for x beyond R(Ω). We define R x as that value of r for which ||rΩ + ρ 0 || = x. Thus, for R x ≥ R(Ω), we have g(x, Ω) = g(∞, Ω).
Consider now the case that x is small enough so that R x ≤ R(Ω). Let ρ = ||ρ 0 || (the 2-norm of the vector representation of ρ 0 ). In fact, ρ = M ML as used in the bound (32). Let θ be the angle between a normal to ρ 0 − µ and Ω. Because ρ 0 is the nearest point in P to µ, this angle is between 0 and π/2.
In this case we have Now R x satisfies the triangle inequality: x ≤ ρ + R x .
Thus if we replace R x as upper integration limit by its lower bound x − ρ, or 0 if the difference is negative, then we get a lower bound on the integral too, giving g(x, Ω) g(∞, Ω) ≥ The upshot of this step is that the right-hand side is now completely independent of Ω, which allows us to invoke Statement C and get that g(x)/g(∞) satisfies the same inequality: g(x) g(∞) ≥

Mode v Confidence Region
Here we give the promised proof that the mode of the Dirichlet distribution lies within the confidence region as defined in (29), with µ and Σ given by (6) and (7).
Proof. Let x be the mode of the Dirichlet distribution, x = f /N , and µ be its mean, µ = (f +1)/(N +d). Then x−µ = (df − N )/ (N (N + d)); as the sum of the entries of x − µ is 0, x − µ lies in the subspace on which G of (8) projects. Thus we have If r is the number of non-zero components of f (thus 1 ≤ r ≤ d) and if we put f i = N p i , fixing p i , then this expression can be expanded as d − r + O(1/N ). The term d i=1 (f i + 1) −1 is maximal for f = (N, 0, . . . , 0), giving the sum (N + 1) −1 + d − 1. In this way we get the upper bound For not too small values of N , this bound is approximately equal to d− 1, which is also the number of degrees of freedom ν in this case. As d − 1 is the mean value of the χ 2 d−1 distribution, the value d − 1 lies within any reasonable confidence interval. Therefore, the mode of the Dirichlet distribution lies within the confidence region of its normal approximation.

Wald statistic
Suppose the actual state under consideration is ρ, and a measurement is made using a d-outcome POVM, so that the probabilities of the outcomes are given by the probability vector p. In an experiment this gives rise to certain outcome frequencies f , drawn from a multinomial distribution with parameters N and p. From these frequencies f one can derive an estimationP of p, Dirichlet distributed with parameter f according to the prescription of Sec. III B. Let µ and Σ be the moments of this Dirichlet estimation.
We want to study how well the actual p fits within the confidence region obtained from this estimation. To do so, we construct the Wald statistic If the distribution involved was Gaussian, this statistic would be χ 2 d−1 distributed. In reality, the distribution only tends to a Gaussian and the Wald statistic is only asymptotically χ 2 d−1 [46].
For the 2-dimensional case, with N = 100, the resulting values for µ(Z) and σ(Z) are plotted as function of p in Fig. 9. When p is sufficiently far removed from the endpoints 0 or 1, one sees that µ and σ converge to their χ 2 values 1 (µ χ 2 d = d − 1) and √ 2 (σ χ 2 d = 2(d − 1)). More generally, good convergence occurs when the smallest p i is still larger than about 20/N , i.e. when every outcome has at least 20 clicks. Numerical studies reveal that the highest value of σ occurs roughly when the smallest p i is about 7.2/N . In turn, this highest value of σ is maximal when all p i bar one are equal to 7.2/N . This worst case value is approximately given by the empirical formula 1.285(1 + 2(d − 3/2)/N )σ χ 2 .
This gives us the following conservative approach: Take the chi-square value for σ(Z) whenever the smallest p i is larger than 20/N , and 1.285(1+2(d−3/2)/N ) times the chi-square value otherwise.