Rank-based model selection for multiple ions quantum tomography

The statistical analysis of measurement data has become a key component of many quantum engineering experiments. As standard full state tomography becomes unfeasible for large dimensional quantum systems, one needs to exploit prior information and the"sparsity"properties of the experimental state in order to reduce the dimensionality of the estimation problem. In this paper we propose model selection as a general principle for finding the simplest, or most parsimonious explanation of the data, by fitting different models and choosing the estimator with the best trade-off between likelihood fit and model complexity. We apply two well established model selection methods -- the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) -- to models consising of states of fixed rank and datasets such as are currently produced in multiple ions experiments. We test the performance of AIC and BIC on randomly chosen low rank states of 4 ions, and study the dependence of the selected rank with the number of measurement repetitions for one ion states. We then apply the methods to real data from a 4 ions experiment aimed at creating a Smolin state of rank 4. The two methods indicate that the optimal model for describing the data lies between ranks 6 and 9, and the Pearson $\chi^{2}$ test is applied to validate this conclusion. Additionally we find that the mean square error of the maximum likelihood estimator for pure states is close to that of the optimal over all possible measurements.

The importance and difficulty of quantum state tomography became evident in the landmark experiment [7] where entangled states of up to 8 ions were created and fully characterised. More recently the same group succeeded in creating entangled states of 14 ions [41] but their statistical reconstruction is beyond current computational capabilities! Therefore, there is great interest in alternative methods aimed at reducing the dimensionality of the state estimation problem without making unwarranted or unrealistic assumptions. Among these we mention the development of quantum compressed sensing methods [42,43] which extend the "classical" 1 -minimisation algorithms [44,45] to the quantum set-up, and the estimation of many-body states based on lower dimensional families of matrix product states [46]. Both methods rely on the Ansatz that the states produced in real experiments are not completely arbitrary, but have some sparsity structure that can be exploited for more efficient estimation, e.g. low rank in the first case and finite correlations in the second. In this paper we propose and investigate a state tomography method which can also take advantage of the sparsity structure of the state, by adjusting the rank of the estimator according to the measurement data. However, although it shares with compressed sensing the goal of exploiting sparsity structures, our method is closer to the standard tomography set-up in the sense that it takes as input the dataset consisting of measurement counts rather than estimates of observables expectations, and it uses maximum likelihood for determining the estimator of a given rank. The philosophy of rank-based model selection is to choose an estimator which offers a good fit to the data, but in the same time contains a minimal number of parameters (Occam razor principle). For this, we construct a sequence of models consisting of states of fixed rank, and choose the model whose maximum likelihood estimator achieves the best trade-off between fit (likelihood) and model complexity. To quantify the trade-off we use two model selection methods, the Akaike information criterion (AIC) [47] and the Bayesian information criterion (BIC) [48] which have been used extensively in model selection problems; see [49,50] for an introduction to model selection methods, and [51,52,53] for applications in quantum statistics. Although the method can be used for an arbitrary measurement set-up, we focus on the statistical model of multiple ions tomography (MIT) [7,41], which constitutes a physically relevant testing ground for tomography of large dimensional systems. We emphasise that model selection does not assume any particular model, but rather lets the data select the model which gives the most suitable description.This offers the experimentalist an "honest" but also minimal estimation framework. The states created in many experiments have a good degree of purity, and therefore one would only need to compute the maximum likelihood over spaces of low rank, rather than full rank matrices. Furthermore, the principle of model selection can be applied to other families of models such as matrix product states, which may be more suitable in specific experimental conditions. The paper is organised as follows. In section 2 we introduce the statistical model of MIT, and discuss some of the existing estimation methods. To gain more insight, in section 3 we investigate the problem of estimating pure states and in particular we find that the MIT measurement set-up is quasi-optimal in the sense that the mean norm-two distance squared is only slightly larger than that of the (asymptotically) optimal measurement. Section 4 introduces the two rank-based model selection procedures based on the AIC and BIC, and discusses the implementation of the fixed rank models by using the Cholesky decomposition. The methods are applied to three randomly chosen states of ranks 1, 2 and 3 in section 5. We find that both criteria perform very well when the strictly positive eigenvalues are significant relative to the number of measurement samples, and explain this by analysing the asymptotics of the log-likelihood ratio statistic for different models. In section 6 we investigate the dependence of the selected rank on purity and number of measurement repetitions in a one ion study. In section 7 we apply BIC and AIC to experimental data provided by Rainer Blatt's group from the University of Innsbruck. We find that the maximum log-likelihood flattens from rank 10 (see Figure 5), and the AIC and BIC predict ranks 9 and respectively 6, which capture the 4 significant eigenvalues and some of the eigenvalues of order 10 −2 (see Table  3). As additional check, we use the Pearson χ 2 statistic to test the hypothesis H 0 that state has rank at most rank 10, and find that there is no evidence to reject H 0 . Section 8 contains a summary of the paper and an outlook for future work.

Background on multiple ions tomography
In this section we review the statistical model describing the measurement data collected in multiple ions tomography (MIT) experiments [7,41,11], and comment briefly on the existing estimation methods, with an emphasis on maximum likelihood estimation.
The physical system consists of an array of trapped ions whose joint state can be manipulated by means of precisely tuned laser pulses. Since only two electronic energy levels are used for encoding the state, each ion can be describe mathematically as a two level system, so that the joint Hilbert space of k ions is C 2 k . The state of the system is described by a density matrix ρ on this space, i.e. a 2 k × 2 k complex selfadjoint matrix which is positive semidefinite and has trace one. Typically, the goal of the experiment is demonstrate the preparation of a certain target state to a sufficiently high degree of precision. To validate the result, a large number of preparation-measurement cycles are performed, and the collected measurement data are used to estimate the state produced in the preparation phase. In a nutshell, the measurement procedure consists of performing simultaneous Pauli measurements on all ions, each combination of Pauli observables being repeatedly measured n times. More precisely, each measurement is defined by a setting d which specifies which of the 3 Pauli observables σ x , σ y , σ z is measured for each ion. For instance d := (x, y, z, z) is a 4 ions measurement setting, and in general for a k-ions state there are 3 k possible settings d ∈ D k := {x, y, z} k . For each fixed setting, the measurement produces random outcomes s ∈ O k := {+1, −1} k with probability distribution where P d s are one dimensional projections onto the vectors of the orthonormal basis formed by taking tensor products of eigenvectors of the Pauli matrices σ d1 , . . . , σ d k : After repeating n times the measurement with setting d, the data can be summarised by counting the number of times that each possible outcome has occurred. The probability of a certain set of counts {N (s|d) : s ∈ O k } is given by the multinomial distribution with probabilites given by (1), so that Since any given setting d gives information only about the diagonal of the density matrix ρ with respect to the basis (2), the above procedure is repeated for all possible settings to obtain the complete 2 k · 3 k dataset consisting of counts {N (s|d) : (s, d) ∈ O k × D k } for all outcomes in each setting. As successive preparation-measurement cycles are independent of each other, the distribution over all possible datasets is the product of multinomials Let us ponder for a moment on the structure of this statistical model. If no assumption is made on the state, the parameter space is the (4 k − 1)-dimensional convex set of density matrices S k ⊂ M (C 2 k ). We will verify that the above measurement scheme is informationally complete, or equivalently that the parameter ρ is identifiable in the sense that there is a oneto-one correspondence between ρ and the probability distribution P ρ given in (4). Since {σ x , σ y , σ z , σ 0 := 1} form a basis in the space of 2 × 2 selfadjoint matrices, the tensor productsσ form an orthonormal basis of the space of 2 k × 2 k selfadjoint matrices with respect to the inner product A, B := Tr(AB). Therefore, any state can be expanded as and to estimate ρ it suffices to estimate the Fourier coeffcients ρ i . A naive unbiased estimator can be easily constructed based on the counts of any particular measurement setting d for which d j = i j whenever i j = 0. For example when k = 2, to estimate ρ (x,z) we consider the counts from the setting d = (x, z), and definê While this proves that the state can be fully estimated, the naive estimator is generally not a bona-fide density matrix, and more importantly, has large estimation errors. The latter is due to the fact thatρ i is constructed from the counts of a single setting and does not harness the information contained in the counts of the others. Indeed, since the projectors {P d s : s ∈ O k , d ∈ D k } form a (highly) overcomplete set of vectors in M (C 2 k ), any product of Pauli's σ i can be expressed in (continuously) many ways as a linear combination of projectors, each producing a linear estimator which could in principle be combined to obtain a significantly reduced MSE. However, finding the "optimal linear estimator" is problematic due to the fact that the empirical frequencies N d s /n are noisy estimates of the probabilities P(s|d), and their covariance depends on the unknown state. An interesting proposal in this direction is the Kalman filter estimator developed in [14], but to our knowledge its performance in the case of MIT has not been extensively investigated. Another proposal put forward in [54] is to combine the naive estimator with a second stage rank-penalised minimisation of the norm-two square (Hilbert-Schmidt) distance to the final estimator. Maximum likelihood (ML) is one of the most commonly used estimation methods across statistics. Its popularity is due to the intuitive interpretation, versatility, and strong theoretical underpinning. Under certain regularity conditions the ML estimator is asymptotically optimal (or efficient in statistical terminology) in the sense that its covariance achieves the Cramér-Rao bound in the limit of large samples, and has Normal (Gaussian) limiting distribution, with covariance equal to the inverse of the Fisher information matrix [55]. By discarding the constant factorial term in (3) and taking logarithm we can write the maximum likelihood estimator for MIT asρ := arg max where the maximum is computed over the set S k of k-ions states τ . Note that the ML estimator is invariant under reparametrisation, i.e. the ML estimator of a state functional f := f (ρ) is f (ρ). The ML estimator has been used extensively in quantum statistics [56], and an efficient iterative computational routine has been put forward in [57,58]. Nevertheless, ML has been criticised for several perceived drawbacks [12,13]. The first criticism is that the ML has the tendency to produce rank deficient estimators, when the true state has some small eigenvalues; this can be understood [12] by observing that the likelihood (seen as a function of the matrix elements) may attain its maximum at a point which lies outside the convex space of states S k , in which case the "constrained" MLEρ will fall on boundary of S k by the concavity of the log-likelihood function. The second, and in our opinion more serious criticism is that the asymptotic theory does not apply as such, to states which lie on the boundary. As a side remark, we note that the asymptotic theory does hold for the unconstrained ML estimator, for "generic" states which satisfy P ρ (s|d) > 0 for all s, d. This may be used to prove the existence of the asymptotic distribution of the MLE (6), but the latter is likely to be complicated and impractical for establishing confidence regions (error bars). In this paper we focus on the performance of the proposed model selection estimation method, and refer to [26,27] for two recent proposals for constructing confidence regions, and the forthcoming paper [59] for a comparative study of bootstrap and Fisher information methods.

Estimation of pure states in the MIT setting
Pure states are arguably the golden standard of most state preparation experiments [7,41]. Therefore, we will start by considering the ideal situation in which the quantum state is assumed to be pure, and we would like to estimate it using the MIT dataset described in section 2. The goal is to get more insight into the statistical power of the measurement setup and its asymptotic properties. This will prepare the ground for the next section where the purity assumption is lifted and the state is fitted to models of increasing rank, and model selection criteria are used for choosing a rank with a good trade-off between fit and model complexity. The findings are summarised at the end of the section, where we also clarify the relation to the compressed sensing set-up [42,43] which uses as input the estimated valuesρ i of the expectations of Pauli observables σ i .
The pure states ML estimatorρ can be computed as in (6), with the maximisation restricted to the space of pure (rank one) states on C 2 k . As figure of merit we consider the mean square error (MSE) M SE(ρ) := E( |ρ −ρ 2 2 ) with the norm-two distance squared defined as where ρ i are the Fourier coefficients with respect to the Pauli basis defined in (5). Note that for pure states the norm-two distance is related in a simple way to the (arguably more natural) norm-one distance ρ −ρ 1 := Tr(|ρ −ρ|) by the equality ρ −ρ 2 = ρ −ρ 1 / √ 2.
We would like to address the following questions: 1. What is the MSE of the MLE ?
2. Are we in an "asymptotic regime" ?
3. Is the multiple ions measurement "optimal" in any sense ?
The pure states form a compact manifold of dimension 2(2 k − 1) which can be identified with the complex projective space CP 2 k +1 . Therefore, when restricting the MIT statistical model to pure states, the standard asymptotic efficiency theory [55] is applicable. For simplicity, we assume that |ψ has the expansion with respect to the standard basis |ψ = c i |e i such that c 1 = 0, in which case we can parametrise the state by the real and the imaginary parts of the remaining coefficients Note that due to the geometry of the projective space, any global parametrisation must be singular unless some points are cut out as we did here. However, as we are interested in the asymptotic behaviour of the ML estimator, the global properties are unimportant and we can always choose an appropriate local parametrisation for all practical purposes. The norm two-square distance (7) can be rewritten locally as a quadratic form where G(θ) is a positive definite matrix whose explicit form can be easily computed. The maximum likelihood estimatorθ =θ n is efficient , i.e. as n → ∞ where N (0, I(θ) −1 )) is a the centered normal distribution with covariance matrix I(θ) −1 which is the inverse of the (classical) Fisher information matrix I(θ). In particular, from (8) and (9) we get lim To verify these results we simulated 100 datasets from a fixed but randomly chosen pure state of k = 4 ions, each dataset consisting of counts for n = 100 measurements per setting. Figure 1 shows the histogram of the square error ρ θ − ρθ 2 2 whose empirically estimated MSE (green line) is very close to the asymptotic prediction (blue line) computed from (10) which is equal to 3.9 · 10 −3 . More interestingly, we find that the MSE is also very close to the "quantum optimal" bound (red line) which describes the best MSE achievable with any measurement! The latter is given by the simple formula ( see [60] and references therein) This example shows that the MSE of the ML estimator agrees with the asymptotic theory when n = 100 and k = 4; we expect that the same holds for fixed n and larger k due to the favourable scaling of the number of samples n · 3 k with respect to the number of parameters 2(2 k − 1). Moreover, the multiple ions measurement set-up appears to be quasi-optimal. This implies that adaptive strategies for choosing the settings cannot offer a significant improvement, but does not exclude the possibility that a similar performance can be achieved with a fraction of the settings. To further emphasise the point that the MIT dataset is very informative, and that ML can optimally extract this information from the data, we compare the above results with the MSE of the naive estimator discussed in section 2, and with the asymptotic MSE of a dataset consisting of estimates of the Pauli products obtained by lumping together the counts of each measurement setting into a single statistic. MSE of the naive estimator. With the square error defined as in (7), we note that the MSE of each coefficientρ i for which i 1 , . . . , i k = 0, is of the order 1/(n · 2 k ) since we are essentially dealing with the problem of estimating the mean of a random variable with values {+2 −k/2 , −2 −k/2 }. Therefore these coefficients alone (not counting those for which some i j are zero) bring a contribution of the order 3 k /(n · 2 k ) which is larger than QMSE (11) by a factor (9/4) k /2. For the particular example of k = 4 and n = 100 this gives an MSE of 5 · 10 −2 which is an order of magnitude larger than that of the MLE. MSE of the coarse grained data. At this point, it is natural to ask the following question. Suppose that we are given the 3 k empirical averages of the Pauli products σ î which are obtained by computing one empirical average for each column of the original dataset. Is there a more efficient method to estimate the pure state |ψ , from the data (12) and what is its MSE ? Two important candidates are the compressed sensing and lasso algorithms [43] (with the slight difference that they would use a smaller number of settings, but proportionally more measurements per setting). Both methods aims at estimating the state by trying to match the empirical expectationsρ i with those of a selfajoint matrix, while in the same time penalising the trace norm of the matrix. Testing these methods is beyond the scope of this paper, but the asymptotic efficiency theory offers a shortcut to the answer of the above question. Applying the same methodology as before, but to the coarse grained data (12) we can predict that (asymptotically) the MSE of any estimator is bounded from below by that of the ML estimatorρ cg which in turn satisfies where the only difference with (11) is the Fisher information matrix which satisfies the inequality I cg (θ) ≤ I(θ). Figure 2 shows histograms of the asymptotic MSE (11) for the full MIT data (left panel) versus the MSE (13) of the coarse grained data (right panel). The histograms were produced with 250 randomly chosen pure states, k = 4 and n = 100. Note that the MSE of the coarse grained data is smaller than the (partial) estimated contribution of the naive estimator. However, the MSE is still an order of magnitude higher that that of the full dataset, due to the fact that a significant amount of information has been discarded in the process of retaining the Pauli products expectations. To summarise, we conclude that MIT works because the different settings "overlap" with each other in the sense that the one dimensional projections |e s d e s d | form an overcomplete set of size 2 k × 3 k which is significantly larger than the dimension of the space of matrices 4 k even for small k. Therefore, the measurement data is structured so that the counts for each setting provide a relatively small amount of information, but the dataset as a whole is very informative about the state. Reducing this dataset to a small number of expectations may be advantageous for the purpose of devising fast estimation algorithms, but underperforms from the viewpoint of statistical errors, for a given number of state re-preparations. This statement may seem to contradict the simulation results illustrated in Figure 1 of [43], where compressed sensing and lasso are found to perform better than ML on datasets of the type (12). This apparent contradiction is lifted by the following observations: 1) Our comparison is between the MSE of efficient estimators for two different types of data. Based on this we conclude that any estimator using the coarse grained data will asymptotically underperform ML based on the full counts experimental dataset.
2) The comparison in [43] is different; it regards the performance of ML versus compressed sensing and lasso for the coarse grained data. Since a completely unknown state is not identifiable for the coarse grained model, the MLE is not consistent, and arguably should not be used in this case.

Model selection for quantum tomography
In the previous sections we discussed the extreme scenarios of "full" quantum tomography and estimation of pure states. In reality, the states produced in experiments tend to have one or few significant eigenvalues and a large number of small eigenvalues of different orders of magnitude, which account for the imperfections in the preparation procedure. Therefore, neither of the two settings seems to be suitable: the former underfits the real state while the latter overfits by trying to estimate eigenvalues that may not be statistically significant. This is a well known phenomenon in statistics, that occurrs in high (or infinite) dimensional problems such as estimating the probability density of a real valued random variable, or in non-linear regression where an unknown function is "learned" from estimates of its values at certain points. In such cases the maximum likelihood estimator overfits the data and is very "noisy". A possible solution is to use a penalised maximum likelihood estimator, which maximises the difference between the log-likelihood and a penalty measuring the complexity of the estimator, e.g. the number of non-zero coefficients with respect to an appropriate basis. More generally, one can design a class of statistical models with various degrees of complexity, and decide which model and which estimator from that model is most suitable for describing the data. Our aim is to apply the model selection methodology to state tomography, the models being the families of states of a given rank. The same methods can be used in tasks such as quantum homodyne tomography [31] where the state to be estimated is that of a light pulse (one mode continuous variables system), and a model could be the set of states with a given maximum number of photons [61].
To select the rank of the state we will use two well established methods: the Akaike information criterion (AIC) [47] and the Bayesian information criterion (BIC) [48]. Both methods amount to penalising the log-likelohood function by a factor proportional to the dimension of the model, and choosing the ML estimator with the smallest value of the information criterion. In the next section we give a brief general description of AIC and BIC, after which we discuss the parametrisation of the fixed rank quantum models, and the implementation of the model selection procedure.

AIC versus BIC model selection
Occam's razor is an old scientific principle which states that when trying to explain a phenomenon, one should choose the simplest model that adequately fits the data. A very complex model will be able to fit the given data almost perfectly but it will not be able to generalise very well. On the other hand, very simple models will not be able to describe the essential features of the data. Therefore, we must make a compromise and choose a model which is as simple as possible, but no simpler. It is not surprising that many approaches have been proposed over the years for dealing with this key aspect in statistical modelling.
The general framework of model selection is the following. We are given n samples X = {X 1 , . . . , X n } from some unknown distribution P which we try to fit with a distribution from one of several possible statistical models where M r has a parameter θ r of dimension p(r). For simplicity we assume that X i ∈ {1, . . . , a} are discrete random variables, as in the case of MIT measurements. We also assume that at least one of the D models contains the true distribution, or at least gives a reasonabale approximation to it. We denote byθ r the ML estimator for the model M r , and by θr = θr (X) := n i=1 log P θr (X i ) log-likelihood function at θ r .

Akaike Information Criterion (AIC)
The AIC for model M r is [47] AIC(r) = −2 · θr + 2p(r), and the chosen model is the one with the minimum AIC. Since p(r) is larger for more complex models, the AIC formally biases against overly complicated models. Although the derivation of AIC is outside the scope of this paper, we briefly explain the idea behind the choice of penalty. Having computed the ML estimators for different models we would like to select the "best" one in the sense that the corresponding distribution Pθ r is the closest to the "truth " P with respect to the Kullback-Leibler distance (or relative entropy) However, this quantity cannot be computed since P is unknown. Since the first terms on the right side is the same for all models, it can be neglected, and one can focus on estimating the second term, which nevertheless still depends on P. If instead ofθ r we had a fixed parameter θ r , this term would be the expected value of the log-likelihood at θ r and could be estimated by θr (X)/n, by the law of large numbers. However θ r (X)/n is a biased estimator of the second term, due to the fact that the data has been already used in computingθ r . Akaike showed that under the regularity conditions required by the asymptotic normality theory, the bias is approximately p(r)/n, so that ML which is the closest to the truth is approximately give by the minimizer of the AIC.

Bayesian Information Criterion (BIC)
The BIC for model M r is defined as [48] BIC(r) = −2 · θ r + p(r) log (n) where n is the sample size. Note that the BIC differs from the AIC only in the second term which increases with n, so that BIC favors simpler models (that is models with a smaller number of parameters) compared to AIC. But despite the superficial similarity between the AIC and BIC the latter is derived in a very different way, within a Bayesian framework. For simplicity, suppose that there are two competing models, M 1 and M 2 with parameters θ 1 and θ 2 respectively. One begins by assigning prior probabilities q 1 and q 2 = 1 − q 1 to the event that the observed data have been generated from either model. One also assigns prior distributions π 1 (θ 1 ) and π 2 (θ 2 ) to the model parameters in each model. Then one can compute the marginal likelihoods which can be interpreted the probability of observing the data if model M i is correct, having integrated out our ignorance about the parameters θ 1 and θ 2 in each model. Hence, one can apply Bayes theorem to evaluate the probability of model M i being the true model given the observed data. A measure of the extent to which the data support model M 2 over M 1 is given by the posterior odds The first fraction on the right-hand side is called the Bayes factor and the second is known as the prior odds. The Bayes factor is a fundamental quantity in Bayesian theory and can be interpreted as a measure of the extent to which the data support model M 2 over M 1 when the prior odds are equal to one. The difference BIC(1) − BIC(2) can be shown to be a large sample approximation to the logarithm of the Bayes factor, so that the second model is chosen if the difference is positive.

Parametrising models with fixed rank
Here we describe the fixed rank models which will be used in model selection. Let D(d, r) be the set of rank r states of a d-dimensional quantum system, i.e. those states which have exactly r non-zero eigenvalues, and let be the set of states of rank at most r. Every state ρ has a unique spectral decomposition where λ i > 0 are its distinct eigenvalues, and P i is an eigenprojector whose dimension is equal to the multiplicity m i of λ i . The spectral information (λ 1 , P 1 , . . . , λ r , P r ) can be used to construct a parametrisation of D(d, r) and R(d, r), which has the advantage of a direct physical interpretation. However, the practical implementation of such a parametrisation for computing the maximum likelihood estimator is less straightforward due to the orthogonality constraints for the eigenvectors, and the singularities appearing on lower dimensional manifolds consisting of states with non-trivial sets of multiplicities. A variation on this would be to parametrise the state by the set of eigenvalues and an eigenbasis, in which case the singularity problem is replaced by the non-identifiability of the different basis vectors corresponding to the same eigenvalue. We will describe an alternative parametrisation which is related to the Cholesky factorisation of the state. Recall that any positive definite matrix A ∈ M (C d ) has a unique decomposition where T is an upper triangular matrix with strictly positive diagonal elements. Therefore there exists a one-to-one correspondence between full-rank states ρ and matrices T as described above, with the additional constraint such that R, I are the real and imaginary parts of the off-diagonal elements ordered from the first to the d − 1 row, and from left to right along each row. By (14) and (15), θ must satisfy the constraints D > 0 and R 2 + I 2 + D 2 < 1, and the left-top element of T is equal to The Cholesky parametrisation of the full rank matrices can be extended, albeit with some caveats, to the spaces of rank-deficient matrices D(d, r) and R(d, r). The idea is to consider a decomposition as in (14), but with T belonging to the set T + (d, r) of d × d upper triangular matrices with the bottom d − r rows equal to zero, and satisying T 11 , . . . , T rr > 0; equivalently, one can consider r × d trapezoidal matrices obtained by removing the zero lines of the triagular matrices. Since every T ∈ T + (d, r) is of rank r, this guarantees that the corresponding state ρ has the same property. However, not all states of rank r can be decomposed in this way! Indeed it is easy to verify that if ρ = T * T then the r × r top-left principal minor of ρ must be of rank k, and therefore such a parametrisation excludes states in D(d, r) which do not satisfy this property. Nevertheless, "generic" matrices of rank r do have principal minors of rank r, in the sense that those with smaller rank principal minor form a lower dimensional subset of D(d, r). If restrict our attention to the subset D(d, r) + ⊂ D(d, r) which excludes the "deficient" states, we find that the Cholesky decomposition exists and is unique, so that D(d, r) + := {ρ = T * T : T ∈ T + d,r } ⊂ D(d, r).
What can we say about the complement D(d, r) \ D(d, r) + ? In order to have a Cholesky decomposition we need to relax the condition T 11 , . . . , T rr > 0 and consider the set T (d, r) of r-lines upper triangular matrices, with non-negative elements on the diagonal. In this case, the root T not only exists but is in general not unique. Let Θ(d, r) + be the set of real parameters θ := (R, I, D) of a matrix T = T θ ∈ T (d, r) + which are defined similarly to equation (16), and let Θ(d, r) be the set of parameters associated to matrices in D(k, r). We define two sequences of quantum statistical models: the first one consisting of rank r matrices with rank r principal minor, the second one describing (albeit not always uniquely) all matrices of rank up to r. The reason why we mention the two models is that each has some appealing features and some disadvantages. For Q(d, r) the advantage is that we deal with a nested set of models The disadvantage is that the Cholesky parametrisation is not one-to-one in this case. On the other hand, Q(r, d) + offers a one-to-one parametrisation of rank r matrices in D(d, r) + , with the disadvantage that the models are not nested, but instead Q(d, r) + lies on the boundary of Q(d, r + 1) + . While these facts are relevant to a theoretical analysis, for practical purposes the distinction between the two models is less important, and in all our numerical experiments we used the models Q + (d, r).

The implementation of AIC and BIC model selection for rank-based models
We return now to the state estimation problem, and describe how AIC and BIC model selection is applied to the family of rank-based models described above for a system consisting of k ions, i.e. d = 2 k . Let be the log-likelihood of the measurement data, ignoring the constant factorial terms. The maximum likelihood estimatorsθ r andρ r for the model Q(2 k , r) + are : In order to choose between the different models we compute the AIC and the BIC for each rank and select the model with the smallest value. In our case the two criteria are given by with the dimension of the space of rank r matrices, and n · 3 k is the total number of measurements.
In practice each criterion decreases with the rank until it reaches the minimum value after which it increases, so one only needs to compute the ML estimator up to the rank where the criterion begins to increase. For low rank states, this offers a the advantage of having to compute the maximum likelihood estimator on models of dimension approximately r·d rather than d 2 − 1 as standard ML. The disadvantage is that the likelihood function is not concave as in the full rank model, and may have several local maxima.
To implement the ML estimation numerically, we used a standard maximisation routine of the statistics package R. Additionally, we developed an array of statistical analysis tools such as Fisher information, square errors, bootstrap, Pearson χ 2 statistic which will be made available online. Although the computation of the log-likelihood was optimised for faster speed, the maximisation can probably be improved significantly by using more sophisticated routines.
In the next sections we will discuss the results of several investigations on the performance of BIC and AIC model selection, using simulated and real data.

Study 1: randomly chosen low rank states
In a first simulation study we chose 3 "random" states of ranks 1, 2, and 3 of k = 4 ions, and generated 100 datasets from each state, each dataset with n = 100 measurement repetitions. We then computed the maximum likelihood estimators for the ranks between 1 and 4 and used AIC and BIC to select the optimal rank. The exact procedure used to generate "random" states is not very important, but it will be relevant that all non-zero eigenvalues of the states are significant. As illustrated in Table 1 that it agrees very well with the predictions of asymptotic theory. For illustration, we consider the state of rank r = 2 denoted ρ, and show that the distributions of BIC(3) − BIC (2) and BIC(1) − BIC(2) concentrate on the positive axis, so that BIC chooses the correct rank. Since their behaviours are determined by different mechanisms, we will study each BIC difference separately. A similar analysis can be performed for AIC. In the first case, so the problem is to show that the log-likelihood ratio statistic is typically smaller than the penalty 242.98. For "regular" nested models, the asymptotic distribution of Λ is described by Wilks' theorem as discussed in section Appendix A. However, this is not directly applicable here since the rank 2 model lies on the boundary of the rank 3 one, due to the positivity constraints. Nevertheless, Wilks' theorem can be extended to more general situations where the two hypotheses can be "linearised" locally (see [62] chapter 16), in which case the limiting distribution depends on the local geometry of the two models and the Fisher information at each point. We will not pursue this analysis here but limit ourselves to giving a stochastic upper bound to the limiting distribution which will be sufficient for our purposes. The idea is to note that where θ r is the "unconstrained" maximum likelihood estimator obtained by maximising over the spaceD(d, r) ⊃ D(d, r) consisting of matrices ρ of rank r which are not necessarily positive but must respect the property that P ρ (·|d) is a probability distribution for each d.
The unconstrained MLE is easier to analyse theoretically and can be used to explain why MLE often produces rank deficient states when the true state has high purity [12]. Now, assuming that we are in the generic situation where all probabilities for the true rank-two state ρ are non-zero, this means that locally around ρ the rank two model is a regular submodel of the extended rank 3 model, and we can apply Wilks' theorem to conclude that 2)).
From (22) we get that Λ is stochastically bounded from above by χ 2 (27) and similarly BIC(3) − BIC(2) is bounded from below by 242.98 − χ 2 (27) which agrees with the simulations results illustrated in the left panel of Figure 3. Note that as n increases, the probability of BIC choosing the rank 3 model converges to zero due to the presence of the log n factor in the penalty, while AIC is not rank consistent in the sense that it choses the higher rank with a probability which does not vanish with n, in agreement with the results illustrated in Table 1 note that its values are much larger, a illustrated in the right panel of Figure 3. It turns out that in this case the behaviour is not dominated by the complexity penalty but by the bias of the lower rank model with respect to the "correct" one, and in particular the distribution of the difference is state dependent. The key is to observe that while the rank 2 ML estimatorρ 2 converges to the true state ρ, the rank one ML estimatorρ 1 converges to the state ρ * 1 whose corresponding distribution P ρ * 1 is the closest to the true distribution P ρ with respect to the relative entropy (or Kullback-Leibler divergence) ρ * 1 := arg min τ ∈D(2 4 ,1) In conjunction with the law of large numbers we then obtain the almost sure convergence as n → ∞.
For our particular example we used one of the rank one MLE's to compute an approximate value K (P τ | P ρ ) ≈ 11.33 which gives an estimate ≈ 2 · 11.33 · 100 + log(100 · 3 4 )(p(4, 1) − p(4, 2)) in agreement with the histogram illustrated in the right panel of Figure 3.
In conclusion, for low rank states with eigenvalues which are not very close to zero, the BIC and to lesser extent AIC, identify the correct rank with high probability, the latter having a tendency to overfit the true model. On the other hand, as we will see in the next section, the BIC may underfit the true model when one or more eigenvalues are small.

Study 2: one ion simulations
We have seen that the performance of the model selection criteria depends on the spectrum of eigenvalues of the true state, and on the number of measurement repetitions. To investigate this dependence we performed a statistical experiment with three one-ion states (k=1) of different degrees of purity: a pure state, one with eigenvalues (0.95, 0.05), and the other with eigenvalues (0.72, 0.28). For each state we simulated datasets with varying number of repetitions n = 10, 50, 100, 250, 500. Table 2 shows the number of times (out of 1000 samples) BIC and AIC choose the correct rank of the state, for all possible choices of states and measurement repetitions. As expected, in the case of the the pure and the mixed states both criteria require a small number of repetitions (of the order of 50) to give the correct answer. In the case of the almost pure state, we see a clear dependence with n: for small n the difference between the log-likelihoods does not off-set the complexity penalty and both criteria choose rank one; at n = 500 the balance tips in favour of the rank 2 model, with AIC switching faster than BIC, on average.   as a function of n for each of the three states, with the pure state (rank one) estimator in black and the mixed state (rank two) estimator in red. For the pure state (left panel), the rank two estimator has a larger MSE due to the variance contribution from the third parameter, but the relative difference between the two MSE's is small for all n. In this case the rank one estimator proposed by both criteria is optimal both from the point of view of parsimony, as well as estimation error. For the mixed state (right panel), the rank one estimator has a large bias which dominates the MSE, while the rank two MSE decreases at rate 1/n, as expected. At n = 50 the relative difference in risk is significant and both criteria choose the optimal ranktwo estimator. The most interesting case is that of the almost pure state (middle panel). Here we see that the relative difference in MSE is not significant for small and medium number of repetitions (n = 10, 50, 100), but for larger n the error of the pure state estimator is dominated by its bias while the variance of the full state estimator becomes very small. This behaviour is picked up by the model selection criteria, which on average switch to the more complex model when n is in the interval between 200 and 500.
In conclusion, the study shows that both methods become more sensitive to the true rank of the state as the number of repetitions increases, and the switching point increases (on average) with the purity of the state. As for the estimation error, the switch to the higher rank model appears to happen in the region where the MSE's of the two estimators starts to diverge from each other, which shows that even if the result is suboptimal for small n, the relative difference in errors remains small. Finally, the BIC is more aggressive in selecting the lower complexity model, due to the additional log-factor in the penalty.

Study 3: model selection for 4 ions real data
In the third study, we applied the model selection methods to experimental data provided by Rainer Blatt's group from the University of Innsbruck. The aim of the experiment [11] was to create a particular 4 ions bound entangled state of rank 4 called Smolin state [63], and the measurement dataset consisted of counts for the 3 4 measurement settings, with a number n = 4800 of repetitions for each setting. We computed the maximum likelihood estimatorŝ ρ r for all ranks r between 1 and 16, and found that the corresponding log-likelihoods reach a plateau at rank 10 (see Figure 5) which indicates that the rank 10 model is already sufficiently rich to describe the measurement data. Reinforcing this conclusion, we found that the value of the maximum likelihood for rank 10 (and the subsequent ones) was slightly larger than that of that of the maximum likelihood over all states computed with Hradil's iterative method [56], probably due to the fact that the latter had not reached the true maximum after 1000 iterations. The values of the AIC and BIC for all ranks are shown on the left side of Table 3. The two criteria reach minima at ranks r = 6 and respectively r = 9, as a result of the trade-off between the increasing penalty and the log-likelihood. As expected, the BIC chooses a smaller rank due to its larger complexity penalty, but both methods capture the top 4 eigenvalues of order 10 −1 and a few of the following ones of order 10 −2 which account for experimental imprecision in creating the state. On the right side of Table 3 we listed for comparison the eigenvalues of the rank 10 and full rank estimators, showing perfect agreement in the first two decimal places. We emphasise that results should be taken as an indication that the experimental data is consistent with models whose rank could be chosen somewhere between 6 and 10, rather than answering the ill posed question "what is the rank of the state". To make a more informed decision on the final choice of model, one can additionally use different model testing procedures such as the Pearson chi-square test discussed below. As we will see, the various arguments converge towards the conclusion that the rank 6 estimator may be too conservative, while the rank 10 model already fits the data very well.   Table 3. Left: values of AIC and BIC for the ML estimators of ranks 1 to 16. The minimum values of the two criteria are attained at ranks r = 9 and respectively r = 6. Right: eigenvalues of the MLE's of rank 10 and 16 in decreasing order

Pearson χ 2 -test
As an additional tool for probing the conclusions of the model selection procedures, we recast the problem as that of testing between the hypotheses H 0 = "the dataset is generated by a state of rank at most r" H 1 = "the dataset is generated by a state of rank higher than r" A standard approach to such a problem is based on using the Pearson χ 2 -statistic. Following the general procedure described in appendix Appendix A, we consider the rank r MLEρ r with expected number of counts E(s|d) := nPρ r (s|d), and define the Pearson χ 2 -statistic where N (s|d) are the counts from the real data. Under the hypothesis H 0 , the Pearson statistic has an asymptotic χ 2 distribution with number of degrees of freedom equal to the number of free parameters of the dataset minus the number of parameters of the model df(r) := 3 4 · (2 4 − 1) − p(r, 4).
Therefore one can define the (asymptotically) level α test where the threshhold t α is chosen such that P(Y > t α ) = α for a χ 2 (df(r))-distributed random variable Y . In practice the χ 2 approximation works well for pure states (r = 1), and small rank states which have only a few small eigenvalues. However, if the state has a significant number of small eigenvalues, the distribution of T (r) may differ significantly from the asymptotic χ 2 distribution. We will not pursue a theoretical analysis here, but instead use bootstrap techniques [55] to estimate the distribution of T (r) and then perform the test with respect to the bootstrap distribution. The idea of bootstrap is to use the measurement data itself to construct a distribution which (under the hypothesis H 0 ) approximates that of T (r), and therefore can be used to define the threshhold t α instead of the χ 2 distribution.

Parametric Bootstrap
Pearson's Chi Square Statistic Density 900 1000 1100 1200 1300 0.000 0.002 0.004 0.006 0.008 Figure 6. Pearson χ 2 statistic T (blue line), the limit χ 2 distribution (red curve) and the parametric bootstrap distribution for 100 bootstrap samples. The boostrap distribution is shifted with respect to the χ 2 due to the fact that the state is close to the boundary of the states space and the asymptotic theory does not hold. Based on the value of T we conclude that the hypothesis H 0 is not rejected for any reasonable significance level.
The bootstrap distributions are constructed as follows: 1) Compute the maximum likelihood estimatorρ r and its probability distribution Pρ r (s|d); 2) Generate a large number N of independent datasets from the distribution Pρ r (s|d) of the maximum likelihood estimation ; 3) Compute the maximum likelihood estimatorsρ boot 1 , . . . ,ρ boot N for the bootstrap datasets; 4) Compute the Pearson χ 2 statistic for each bootstrap sample and its MLE as in (23); 5) Apply the χ 2 test using the empirical distribution of the boostrap χ 2 statistics.
The results of applying the χ 2 test based on the bootstrap distribution to the rank 10 model are illustrated in Figure 6. The value of the test is T = 1039 is which means that the hypothesis H 0 is accepted.

Conclusions and outlook
Statistical inference has become a key tool in interpreting the measurement data in quantum engineering experiments, which require precise, efficient and informative estimation methods. However, standard full state tomography becomes unfeasible for large dimensional quantum systems [41]. In this paper we proposed model selection as a general principle for approaching state estimation problems. As in [42,43,46] the aim is to reduce the dimensionality of the problem by taking advantage of the "sparsity" properties of quantum states in realistic experimental settings. The route to this goal is however different. The philosophy in model selection is to try to find the simplest, or most parsimonious explanation of the data, by fitting different models (often of increasing complexity) and choosing the estimator with the best trade-off between likelihood and complexity. Concretely, we looked at the problem of selecting the rank of the estimator, by using two well known methods: Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). In both cases the fit-complexity trade-off is realised by penalising the log-likelihood of the data with a measure of complexity proportional to the number of parameters of the fixed rank model. We have tested AIC and BIC in several real data and simulation studies which we summarise here. Pure states. We studied the performance of (rank one) ML for pure states and found a very good agreement with the asymptotic predictions based on Fisher information and the efficiency of the ML estimator. More interestingly, we found that the MSE is only slightly larger than the MSE of the best possible measurement predicted by quantum version of the asymptotic theory. In particular this rules out the possibility of significantly improving the MSE by means of adaptive measurement design techniques. The (asymptotic) MSE of the full counts dataset was compared to that of the "coarse grained" data obtained by estimating the means of the Pauli products corresponding to each measurement setting, as used in compressed sensing algrithms [42,43]. For 4 ions, the latter is an order of magnitude larger than the former due to the loss of information when discarding the full counts statistics. Study 1. For 4 ions states of ranks between 1 and 3 we found that both AIC and BIC identify the correct rank in 80%-90% of the cases, when the smallest non-zero eigenvalue is not too close to zero. The results are explained by using the ML asymptotic theory. Study 2. We analysed the performance of AIC and BIC as a function of the number of measurement repetitions and the purity of the state, for a toy example consisting of one ion states. With only a small number of repetitions, both methods identify the correct rank for pure and "pretty mixed" states. For an "almost pure" state, the model choice switches to rank 2 as the number of repetitions increases. The switching happens roughly at the point where the MSE of the "wrong" rank 1 estimator becomes significantly larger that that of the correct model, indicating that model selection is only slightly suboptimal in terms of the MSE.
Study 3. We applied model selection to the 4 ions experimental data provided by Rainer Blatt's group from the University of Innsbruck. The target state of the experiment was an equal mixture of 4 orthogonal pure states, and BIC and AIC selected rank 6 and respectively 9, with both estimators capturing the principle eigenvalues and (some of) the noisy components due to imperfections in the preparation procedure, of the order 10 −2 . While the BIC prediction seems too conservative, we find that a rank 10 estimator gives a very good explanation of the data from several perspectives: log-likelihood values, eigenvalues of the estimators, hypothesis testing. Overall, the numerical results indicate that model selection gives sensible answers, and can be used as an alternative to full tomography and compressed sensing. In principle the method works for any state, but is designed to take advantage of the lower complexity of small rank states. The drawback is the computational complexity of finding the MLE over states of fixed rank. Therefore it would be interesting to see whether ideas from the different methods can be combined in a fast, scalable and statistically efficient estimator. A possible future direction is apply model selection to state estimation for other types of models such as classes of matrix product states, and to system identification problems. Another topic of interest is the computation of confidence intervals (error-bars). Last but not least, there is a need for a deeper theoretical understanding of the quantum tomography statistical model. We mention two important questions: how does the state's proximity to the boundary affect the standard asymptotic theory, and how does the model behave for a large number of ions? This would hopefully lead to improved estimation algorithms and information criteria for model selection.
If θ ∈ Θ 0 , then Λ converges in law as n → ∞ to the χ 2 distribution with k degrees of freedom.
In both cases, it is essential that the parameter does not lie on the boundary, in order to be able to apply the asymptotic normality theory of the MLE. This condition is violated for states whose rank is strictly smaller than that of the fixed rank model in which they are considered. Therefore care must be taken before applying these results directly, and indeed our results show that the χ 2 asymptotics fail in some cases. A more refined asymptotic analysis taking into account the boundary effects will be pursued elsewhere.