Spectral thresholding quantum tomography for low rank states

The estimation of high dimensional quantum states is an important statistical problem arising in current quantum technology applications. A key example is the tomography of multiple ions states, employed in the validation of state preparation in ion trap experiments \cite{Haffner2005}. Since full tomography becomes unfeasible even for a small number of ions, there is a need to investigate lower dimensional statistical models which capture prior information about the state, and to devise estimation methods tailored to such models. In this paper we propose several new methods aimed at the efficient estimation of low rank states in multiple ions tomography. All methods consist in first computing the least squares estimator, followed by its truncation to an appropriately chosen smaller rank. The latter is done by setting eigenvalues below a certain"noise level"to zero, while keeping the rest unchanged, or normalising them appropriately. We show that (up to logarithmic factors in the space dimension) the mean square error of the resulting estimators scales as $r\cdot d/N$ where $r$ is the rank, $d=2^k$ is the dimension of the Hilbert space, and $N$ is the number of quantum samples. Furthermore we establish a lower bound for the asymptotic minimax risk which shows that the above scaling is optimal. The performance of the estimators is analysed in an extensive simulations study, with emphasis on the dependence on the state rank, and the number of measurement repetitions. We find that all estimators perform significantly better that the least squares, with the"physical estimator"(which is a bona fide density matrix) slightly outperforming the other estimators.


Introduction
Recent years have witnessed significant developments at the overlap between quantum theory and statistics: from new state estimation (or tomography) methods [2,3,4,5,6,7,8], design of experiments [9,10,11], quantum process and detector tomography [12,13] construction of confidence regions (error bars) [14,15,16], quantum tests [17,18] entanglement estimation [19], asymptotic theory [20,21,22,23]. The importance of quantum state tomography, and the challenges raised by the estimation of high dimensional systems were highlighted by the landmark experiment [1] where entangled states of up to 8 ions were created and fully characterised. However, as full quantum state tomography of large systems becomes unfeasible [24], there is significant interest in identifying physically relevant, lower dimensional models, and in devising efficient model selection and estimation methods in such setups [25,26,27,7,8,28]. In this paper we reconsider the multiple ions tomography (MIT) problem by proposing and analysing several new methods for estimating low rank states in a statistically efficient way. Below, we briefly review the MIT setup, after which we proceed with presenting the key ideas and results of the paper.
In MIT [1], the goal is to statistically reconstruct the joint state of k ions (modelled as two-level systems), from counts data generated by performing a large number of measurements on identically prepared systems. The unknown state ρ is a d × d density matrix (complex, positive trace-one matrix) where d = 2 k is the dimension of the Hilbert space of k ions. The experimenter can measure an arbitrary Pauli observable σ x , σ y or σ z of each ion, simultaneously on all k ions. Thus, each measurement setting is labelled by a sequence s = (s 1 , . . . , s k ) ∈ {x, y, z} k out of 3 k possible choices. The measurement produces an outcome o = (o 1 , . . . , o k ) ∈ {+1, −1} k , whose probability is equal to the corresponding diagonal element of ρ with respect to the orthonormal basis determined by the measurement setting s. The measurement procedure and statistical model can be summarised as follows. For each setting s the experimenter performs n repeated measurements and collects the counts of different outcomes N (o|s), so that the total number of quantum samples used is N := n×3 k . The resulting dataset is a 2 k × 3 k table whose columns are independent and contain all the counts in a given setting. A commonly used [1] estimation method is maximum likelihood which selects the state for which the probability of the observed data is the highest among all states. However, while this method seems to perform well in practice, and has efficient numerical implementations [29], it does not provide confidence intervals (error bars) for the estimators, and it has been criticised for its tendency to produce rank-deficient states [2].
The goal of this paper is to find alternative estimators which can be efficiently computed, and work well for low rank states. The reason for focusing on low rank states is that they form a realistic model for physical states created in the lab, where experimentalists often aim at preparing a pure (rank-one) state. While this is generally difficult, the realised states tend to have rapidly decaying eigenvalues, so that they can be well approximated by low rank states. Our strategy is to combine an easy but "noisy" estimation method -the least square estimator (LSE) -with an appropriate spectral truncation, tuned using available data only, which sets spurious eigenvalues to zero and allows to reduce the mean square error of the estimator.
The LSE ρ (ls) n is obtained by inverting the linear map A : ρ → P ρ between the state and the probability distribution of the data, where the unknown probabilities are replaced by the empirical (observed) frequencies of the measurement data. The resulting estimator is unbiased, and is "optimal" in the sense that it minimises the prediction error, i.e. the euclidian distance between the empirical frequencies and the predicted probabilities. However, one of the disadvantages of the LSE is that it does not take into account the physical properties of the state, i.e. its positivity and trace-one property. More importantly, as we explain below, the LSE has a relatively large estimation error for the class of low rank states, and performs well only on very mixed states. This is illustrated in Figure 1 where the eigenvalues of ρ (ls) n are plotted (in decreasing order) against those of the true state ρ, the latter being chosen to have rank r = 2. We see that while the non-zero eigenvalues of ρ are estimated reasonably well, the LSE is poor in estimating the zero-eigenvalues, and as consequence, it has a large estimation variance.
Our goal is to design more precise estimators, which have the LSE as a starting point, but take into account the "sparsity" properties of the unknown state. Figure 1 suggests that the non-zero eigenvalues of the LSE which are below a certain "statistical threshold", can be considered as statistical noise and may be set to zero in order to improve the estimation error. To find this noise level, we establish a concentration inequality (see Proposition 1) which shows that the operator-norm error ρ (ls) n −ρ 2 is upper bounded by a rate ν 2 which (up to logarithmic factors in d) is proportional to d/N .
The first estimator we propose, is a rank penalised one obtained by diagonalising the LSE, arranging its eigenvalues in decreasing order of their absolute values, and setting to zero all those eigenvalues whose absolute values are below the threshold ν ρ (ls) The same outcome can be obtained as solution of the following penalised estimation problem: among all selfadjoint matrices, choose the one that is close to the LSE but in the same time it has low rank, so that it minimises over τ the norm-two square discrepancy penalised by the rank In particular the estimator's rank is determined by the data. In Theorem 1 we show that if ρ is of unknown rank r ≤ d, then the mean square error (MSE) E ρ (pen) n − ρ 2 2 is upper-bounded (up to logarithmic factors) by the rate (r · d)/N . This captures the expected optimal dependence on the number of parameters for a state of rank r. Indeed, in section 5 we show that no estimator can improve the above rate for all states of rank r, cf. Theorem 3 for the asymptotic minimax lower bound.
The penalised estimator has however the drawback that it may not represent a physical state. To remedy this, and further improve its statistical accuracy, we propose a physical estimator which is the solution of the following optimisation problem. We seek the density matrix which is closest to the LSE ρ (ls) n , and whose non-zero eigenvalues are larger that the threshold 4ν. It turns out that the solution can be found via a simple iterative algorithm whereby at each step the eigevalues of ρ (ls) n below the threshold are set to zero, and the remaining eigenvalues are normalised by shifting with a common constant, while the eigenvectors are not changed throughout the process. In Theorem 2 we show that the physical estimator satisfies a similar upper bound to the penalised one.
In section 6 we present results of extensive numerical investigations of the two proposed estimators. In addition we consider the oracle "estimator", which is simply the spectral truncation of the LSE that is closest to the true state ρ, and the cross-validated estimator ρ (cv) n which aims at finding the optimal truncation rank by estimating the Frobenius error by cross-validation. In fact, we found that cross-validation can help in better tuning the constant factor of the threshold rate of the penalised and physical estimators. As expected from the theoretical results, we find that all estimators perform significantly better than the LSE on low rank states; moreover the physical estimator has slightly smaller estimation error than the others, including the oracle estimator. We also find that all methods converge to the correct rank in the limit of large number of repetitions but through different routes: the penalised estimator tends to underestimate, while the physical one tends to overestimate the rank, for small number of samples.
Having discussed the upper bounds on the estimators' MSE, we would like to know how they compare with the best possible estimation procedure. One way to characterise the latter is through the asymptotic minimax risk for the class of states of a given rank r. From asymptotic statistics theory [30] we know that for every sequence of estimators ρ n , the following lower bound for its asymptotic maximum risk over the set S d,r of state of rank r holds Tr(I(ρ) −1 G(ρ)).
On the right side, I(ρ) is the Fisher information corresponding to all measurement settings taken together, and G(ρ) is a positive matrix describing the quadratic approximation of the Frobenius distance around ρ. In Theorem 3 we show that the right side is lower bounded by 2r(d − r) which shows that (up to logarithmic factors) the upper bounds of the penalised and physical estimators have the same scaling as the asymptotic minimax risk.
Recently, a number of papers discussed related aspects of quantum tomography problems. The idea of the penalised estimator has been proposed in [8], which provided a weaker upper bound for its MSE. Reference [7] analyses model selection methods for finite rank models and maximum likelihood estimation. Reference [31] proposes a different estimator and establishes a comparable upper bound for its MSE. The class of low rank states is also employed in compressed sensing quantum tomography [25,32,28], but their statistical model is based on expectations of Pauli observables rather than measurement counts.
The paper is organised as follows. In section 2 we describe the measurement procedure and introduce the statistical model of MIT. In section 3 we define the linear (least squares) estimator and derive an upper bound on its operator norm error which improves on a previous bound of [8]. In section 4 we define the penalised and threshold estimators and derive upper bounds for their mean square errors with respect to the norm-two square (Frobenius) distance. The performance of the different methods is analysed in Section 6. An asymptotic lower bound for the minimax risk is derived in section 5, based on the Fisher information of the measurement data. The upper and lower bounds match in the scaling with the number of parameters and number of total measurements, up to a logarithmic factor. We give a detailed description of the numerical implementation of the algorithms, including the cross-validation routines used for tuning the pre-factor of the penalty and threshold constants. We illustrate the simulation results with box plots of the Frobenius errors for the least squares, oracle, cross-validation, penalisation and threshold estimator, for states of ranks 1,2,6 and 10, and for different choices of measurement repetitions n = 20, 100. Additionally, we plot the empirical distribution of the chosen rank for different estimators, showing the concentration on the true rank as the number of repetitions increases.

Multiple ions tomography
This paper deals with the problem of estimating the joint quantum state of k two-dimensional systems (qubits), as encountered in ion trap quantum tomography [1]. The two-dimensional system is determined by two energy levels of an ion, while the remaining levels can be ignored as they remain unpopulated during the experiment. The joint Hilbert space of the ions is therefore the tensor product C 2 ⊗k ∼ = C d where d = 2 k , and the state is a density matrix ρ on this space, i.e. a positive d × d matrix of trace one.
Our statistical model is derived from standard ion trap measurement procedures, and takes into account the specific statistical uncertainty due to finite number of measurement repetitions. We consider that for each individual qubit, the experimenter can measure one of the three Pauli observables σ x , σ y , σ z . A measurement set-up is then defined by a setting s = (s 1 , . . . , s k ) ∈ S k := {x, y, z} k which specifies which of the 3 Pauli observables is measured for each ion. where P s o are the one-dimensional projections and |e s o are the eigenvectors of the Pauli matrices, i.e. σ s |e s ± = ± |e s ± . The measurement procedure consists of choosing a setting s, and performing n repeated measurements in that setting, on identically prepared systems in state ρ. This provides information about the diagonal elements of ρ with respect to the chosen measurement basis, i.e. the probabilities p(o|s). In order to identify the other elements, the procedure is then repeated for all 3 k possible settings.
Before describing the statistical model of the measurement counts data, we start by discussing in more detail the relation between the unknown parameter ρ and the probabilities p(o|s). Consider the "extended" set of Pauli operators {σ x , σ y , σ z , σ I := 1} which form a basis in M (C 2 ). We construct the tensor product basis in M (C d ) with elements σ b = σ b1 ⊗ · · · ⊗ σ b k where b ∈ {x, y, z, I} k and note that the following orthogonality relations hold Tr(σ b σ c ) = dδ b,c . The state ρ can be expanded in this basis as ρ = b∈{I,x,y,z} k Equation (2.1) can then be written as The coefficients A b (o|s) can be computed explicitly as Letρ ∈ C 4 k be the representation of ρ as a the vector of coefficients ρ b , and let p be the corresponding vector of probabilities for all settings (p(o|s) : (o, s) ∈ O k × S k ), with settings, and outcomes within settings ordered in lexicographical order. The measurement is then described by the linear map A : The linear map A is injective and its inverse can be computed asρ = (A * · A) −1 · A * · p, which together with the decomposition (2.3) and Lemma 1 implies The above formula allows to reconstruct the matrix elements from the measurement probabilities. However, since the experiment only provides random counts from these probabilities, we need to construct a statistical model for the measurement data. After n repetitions of the measurement with setting s, we collect independent, identically distributed observations X i|s ∈ O k , for i from 1 to n. The data can be summarised by the set of counts is the number of times that outcome o has occurred. After repeating this for each setting s ∈ S k , we collect all the data in the counts dataset D := {N (o|s) : (o, s) ∈ O k × S k }. Since successive preparation-measurement cycles are independent of each other, the probability of a certain dataset D is given by the product of multinomials The statistical problem is to estimate the state ρ from the measurement data summarised by the counts dataset D. The most commonly used estimation method is maximum likelihood (ML). The ML estimator is defined byρ where the maximum is taken over all density matrices τ on C d , and can be computed by using standard maximisation routines, or the iterative algorithms proposed in [29,33]. However, ML becomes impractical for about k = 10 ions, and the iterative algorithm has the drawback that it cannot be adapted to models where prior information about the state is encoded in a lower dimensional parametrisation of the relevant density matrices, e.g. when the states are low rank.
In the next section we discuss an alternative method, the least square estimator, and derive an upper bound on its mean square error. After this, we will show that by "post-processing" the least squares estimator using penalisation and thresholding methods, its performance can be considerably improved when the unknown state has low rank.

The linear (least squares) estimator
Since the vectorise versionρ of the state ρ satisfies (2.5), it is the solution of the optimisatioñ giving ρ = bρ b ·σ b in (2.6). If the number of repetitions n is large compared with the dimension d, then the outcomes' empirical frequencies are good approximations of the corresponding probabilities, i.e. f (o|s) := N (o|s)/n → p(o|s) by the law of large numbers. Therefore, by replacing p by the vector of frequencies f in in the previous display, we can define the least square estimator of ρ ρ (ls) which has the explicit expression Note that in this case it comes down to replacing the unknown probability p in equation (2.6) with the empirical frequencies f (also known as the plug-in method).
In spite of this "optimality" property and its computationally efficiency, the least square estimator has the disadvantage that in general it is not a state, i.e. it is not trace-one and may have negative eigenvalues. A more serious disadvantage is that its risk -measured for instance by the mean square error E( ρ (ls) n − ρ 2 2 ) -is large compared with other estimators such as the maximum likelihood estimator. This is due to the fact that the linear estimator does not use the physical properties of the unknown parameter ρ, that is positivity and trace-one. As we will see below, the modified estimators proposed in Section 4 outperform the least square while adding only a small amount of computational complexity. Moreover, the second estimator will be a density matrix.
In the remainder of this section we provide concentration bounds on the square error of the linear estimator, which will later be used in obtaining the upper bounds of the improved estimators. The following Proposition improves the rate k(4/3) k /n obtained in [8] to k(2/3) k /n. Proposition 1. Let ρ (ls) n be the linear estimator of ρ. Then, for any ε > 0 small enough the following operator norm inequality holds with probability larger than 1 − ε under P ρ with N := n · 3 k the total number of measurements. The same bound holds when k = k(n) as long as ν(ε) → 0.
As a side remark we note that projecting the least-squares estimator onto the space of Hermitian matrices with trace 1, does not change the rate of convergence from Proposition 1. The following Proposition allows us to assume that, without loss of generality, the least-squares estimator has also trace 1.
Proposition 2. Under the notation and assumptions in Proposition 1, let Then with probability larger than 1 − ε we have ρ Proof. See Appendix 8.3.

Rank-penalised and threshold projection estimator
In this section we investigate two ways to improve the least-squares estimators. The first method is to project the least-squares estimator onto the space of finite rank Hermitian matrices of an appropriate rank. We prove upper bounds for its risk with respect to the Frobenius (norm-two) distance. Building on the knowledge about the rank-penalised estimator, we define the second estimator which is the projection of the least-squares estimator on the space of physical states whose eigenvalues are larger than a certain positive noise threshold. We give an simple and fast algorithm producing a proper density matrix from the data, which also inherits the good theoretical properties of the rank-penalized estimator.

Rank-penalised estimator
We introduce here the rank-penalised nonlinear estimator, which can be computed from the leastsquares estimator by truncation to an appropriately chosen rank.
As noted earlier, while the least-square estimator is unbiased, it has a large variance due to the fact that it does not take into account the physical constraints encoded in the unknown parameter ρ. A possible remedy is to "project" the least squares estimator onto the space of physical states, i.e. positive, trace-one matrices. This method will be discussed in the following subsection. Another improvement can be obtained by taking into account the "sparsity" properties of the unknown state. For instance, in many experimental situations the goal is to create a particular low rank, or even pure state. The fact that such states can be characterised with a smaller number of parameters than a general density matrix, has two important consequences. Firstly, they can be estimated by measuring an "informationally incomplete" set of observables, as demonstrated in [25,26]. Secondly, the prior information can be used to design estimators with reduced estimation error compared with generic methods which do not take into account the structure of the state. Roughly speaking, this is because each unknown parameter brings its own contribution to the overall error of the estimator.
However, the downside of working with a lower dimensional model is that it contains built-in assumptions which may not be satisfied by the true (unknown) physical state. Preparing a pure state is strictly speaking rarely achievable due to various experimental imperfections, so using a pure state statistical model is in fact an oversimplification and can lead to erroneous conclusions about the true state. On the other hand, one can argue that when the (small) experimental noises are taken into account, the actual state is "effectively" low rank, i.e. it has a small number of significant eigenvalues and a large number of eigenvalues which are so close to zero that they cannot be distinguished from it. Then, the interesting question is how to decide on where to make the cut-off between statistically relevant eigenvalues and pure statistical noise. This is a common problem in statistics which is closely related to that of model selection [34]. Below we describe the rank-penalised estimator addressing this problem, and show that its theoretical and practical performance is superior to the least squares estimator, and is close to what one would expect from an optimal estimator. In addition, its computation requires only the diagonalisation of the least-squares estimator.
Before presenting a simple algorithm for computing the estimator, we briefly discuss the idea behind its definition. Let be the spectral decomposition of the least squares estimator, with eigenvalues ordered such that For each given rank κ ∈ {1, . . . , d} we can project ρ (ls) n onto the space of matrices of rank κ by computing the matrix which is the closest to ρ Although the projection is not a linear operator, ρ n (κ) is easy to compute, and is obtained by truncating the spectral decomposition (4.1) to the most significant κ eigenvalues The question is now how to choose the rank κ in order to obtain a good estimator. In Figure 2 we illustrate the dependence of the norm-two square error e(κ) := ρ n (κ) − ρ 2 2 on the rank, for a particular dataset generated with a rank 6, 4-ions state. As the rank is increased starting with κ = 1 (pure states), the error decreases steeply as ρ n (κ) becomes less biased, it reaches a minimum close to the true rank, and increases slowly as added parameters increase the variance of the estimator. However, since the state ρ is unknown, the norm-two error and optimal rank for which the minimum is achieved, are unknown. To go around this, we can estimate the error e(κ) from the data by means of e.g. cross-validation, as it will be described in section 6. However, in this section we follow a different path, and we define the rank-penalised estimator [8], [35] as the minimiser over κ of the following expression: where ν is a constant which will be tuned appropriately. The first term quantifies the fit of the truncated estimator with respect to the least squares estimator, while the second term is a penalty which increases with the complexity of the model, i.e. the rank. The rank penalised estimator ρ (pen) n is thus the solution of the simple optimisation problem This means that the eigenvalues below a certain noise threshold are set to zero while those above the threshold remain unchanged. The following theorem is our first main result, and shows that the appropriate threshold is given by the upper bound on the operator norm error of the least squares estimator, as established in Proposition 1.
Theorem 1. Let θ > 0 be an arbitrary constant, let c(θ) := 1 + 2/θ, and let ε > 0 be a small parameter. Then with probability larger than 1 − ε, we have where ρ (pen) n is the penalised estimator defined in (4.2) with threshold ν(ε) 2 given by which is assumed to be o(1) with increasing n and k.
Proof. The upper bound follows directly from Proposition 1 combined with the following oracle inequality established in [8], which holds true provided that (1 + θ) ρ (ls) This event occurs with probability larger than 1 − ε.
Let us make some explanatory remarks on the above result. Firstly, the bound (4.3) applies to all states ρ, not only "small" rank ones. Recall that S d,r denotes the set of states of rank-r states on C d . In the special case when rank(ρ) = r ≤ d, the theorem implies that, with probability larger than 1 − ε, ρ (pen) If the rank r is much smaller than d, this bound is a significant improvement to the corresponding upper bound d · ν(ε) 2 for the least square estimator, which can be derived from the operator norm bound of Proposition 1. Moreover, up to a constant factor the rate ν(ε) 2 is equal to d · r log(d)/N which is essentially the ratio of the number of parameters and total number of measurements. In section 5 we will show that apart from the log factor this rate is also optimal, and cannot be improved even if the rank of the state is known, which indicates that the estimator adapts to the complexity of the true parameter. Furthermore, we stress the fact that the bound (4.3) holds true for growing dimension d = 2 k as well as the number of measurements n; the bound remains meaningful as far as d log d/N → 0.
The second observation is that our procedure selects the true rank consistently. Denote byκ the rank of the resulting estimator ρ (pen) n . Following [8] we can prove that, if there exists some κ such that λ κ (ρ) > (1 + δ) ν(ε) and λ κ+1 (ρ) < (1 − δ) ν(ε) for some δ ∈ (0, 1), then This stresses the fact that the procedure detects the eigenvalues above a threshold related to the error of the least squares estimator. If the true rank of ρ is r and if k and n are such that ν tends to 0 (which always occurs for fixed number of ions k), then λ r > ν asymptotically and the probability thatκ = r tends to 1.
We can also project ρ (pen) n on the matrices with trace 1, to get where S is the set of all density matrices on C d . The following Corollary shows that the key properties of the estimator are preserved if we additionally normalise it to trace-one after thresholding.
Corollary 1. Under the notation and assumptions of Theorem 1 if ρ is an arbitrary state in S d,r and if k and n are such that λ r > ν(ε) for some ε ∈ (0, 1), then, with probability larger than 1 − ε, Moreover, there exists an absolute constant C > 0 such that Proof. See Appendix 8.4.

Physical threshold estimator
Although the rank-penalised estimator performs well in terms of its risk, it is not necessarily positive and trace-one and therefore it may not represent a physical state. In this section we propose and analyse the following "physical estimator" where ρ (ls) n is the "'normalised least squares estimator" defined in (3.2), and S(ν) denotes the set of states at noise level ν S(ν) = {σ density matrix with eigenvalues λ j ∈ {0} ∪ (4ν, 1], j = 1, ..., d} . In particular, the space of all density matrices correspond to ν = 0 and is denoted S. The estimator ρ (phys) n is therefore the physical state which is closest to the (normalised) least square estimator, and whose non-zero eigenvalues are above the threshold 4ν.
Before analysing the performance of the estimator, we describe its numerical implementation through the following simple iterative algorithm.
Let λ 1 ≥ · · · ≥ λ d denote the eigenvalues ofρ (ls) n . Let = 0, and define λ The algorithm checks whether the smallest eigenvalue is larger than the noise level 4ν and if it is not, then sets its value to 0 and distributes the mass of the erased eigenvalue in such a way that they sum to 1. This algorithm is similar to that proposed by Smolin et al. [4], with the important difference that we do not keep all positive eigenvalues but only significantly positive eigenvalues.
Here, significant means larger than the noise threshold of the order of the operator-norm error of the least-squares estimator. Indeed, this noise can give a confidence interval for each eigenvalue.
If the total number of iterations is = d − r then the estimator ρ (phys) n has rank r. Its eigenvalues are equal to 0 for j > r, while, for j ≤ r they are given by This implies that ρ (phys) n has decreasing eigenvalues and λ (phys) r ≥ 4ν. The following Theorem shows that ρ (phys) n is rank-consistent and its MSE has the same scaling as that of penalised estimator ρ (pen) n . Theorem 2. Assume that the state ρ has rank r, i.e. belongs to S d,r . For small ε > 0, let ν = ν(ε) be defined as in Theorem 1, and assume that λ r > 8ν(ε). Then, with P ρ probability larger than 1 − ε we have r = r and ρ (phys) n − ρ 2 2 ≤ 48 · r · ν(ε) 2 . Moreover, there exists an absolute constant C > 0 such that Proof. See Appendix 8.5.

Lower bounds for rank-constrained estimation
The goal of this section is to investigate how the convergence rates of our estimators compare with that of an "optimal estimator" for the statistical model consisting of all states of rank up to r. For this we will derive a lower bound for the maximum risk of any estimator.
In this section ρ n will be an arbitrary estimator and the true state ρ is assumed to belong to the set S d,r of rank-r states. To quantify the overall performance of ρ n , we define the maximum risk In view of the previous upper bounds, we expect its asymptotic behaviour (in terms of the total number of measurements, for a large number of repetitions n) to be Taking this into account we define the (appropriately rescaled) minimax risk as which describes the behaviour of the best estimator at the hardest to estimate state. The next theorem provides an asymptotic lower bound for the minimax risk. It shows that the maximum MSE of any estimator is al least of the order of r(d − r)/N , which for low rank states scales as #parameters/#samples, which up to logarithmic factors is the same as the upper bounds derived in Theorems 1, Corollary 1, and Theorem 2. Proof. The minimax risk captures the worst asymptotic behaviour of the rescaled risk, over all states of rank r. In order to bound the risk from below, we construct a (lower dimensional) subfamily of states R d,r ⊂ S d,r such that the maximum risk for this subfamily provides the lower bound. Let The Frobenius distance is given by ). However, this parametrisation does not take into account the prior information about the rank of the true state, and moreover, our key argument involves the even smaller family R d,r of states. We will now focus on providing a local parametrisation of R d,r around ρ = U ρ 0 U * . With respect to the basis B U , a state ρ ∈ R d,r in the neighbourhood of ρ has the form where ∆ is a matrix of free (complex) parameters, and the two O( ∆ 2 ) blocks are r × r and respectively (d − r) × (d − r) matrices whose elements scale quadratically in ∆ near ∆ = 0. The intuition behind this decomposition is that a small rotation of ρ produces off-diagonal blocks of the size of the "rotation angles" while the change in the diagonal blocks are only quadratic in those angles. Since we are interested in the asymptotic behaviour of estimators, the local approach is justified, and the leading contribution to the Frobenius distance comes from the off-diagonal blocks. More precisely, if ρ 1 , ρ 2 ∈ R d,r are in the neighbourhood of ρ then whereθ r ,θ i are the real and imaginary parts of the off-diagonal elements contained in the block ∆, i.e. for i ≤ r < j. Locally, the manifold R d,r can be parametrised byθ := (θ r ,θ i ).
Since R d,r ⊂ S d,r the maximum risk for the model consisting of rank-r states is bounded from below by that of the (smaller) rotation model Let π be the "uniform" distribution over R d,r . To draw a sample from this distribution, one can choose a random unitary U from the Haar measure over unitaries, and defines ρ := U ρ 0 U * . Then the maximum risk is bounded from below by the Bayes risk By applying the van Trees inequality in [36] (see also [37]) we get that where α > 0 is a constant which does not depend on n. Here,Ĩ(ρ) is the (classical) Fisher information matrix of the the data obtained by performing one measurement for each setting, andG(ρ) is the weight matrix corresponding to the quadratic approximation of the Frobenius distance around ρ.
Both matrices are of dimensions 2r(d − r) = dim(R d,r ), and depend on the chosen parametrisation, but the trace is independent of it. Inserting (5.6) and (5.7) into (5.5), we get Since t → t −1 is an operator convex function we have and by taking the limit n → ∞ we obtain the asymptotic minimax lower bound where R minmax (r, k; n) is the minimax risk defined in equation (5.1) At this point we choose a convenient local parametrisation around an arbitrary state ρ ∈ R d,r . As discussed in the beginning of the proof we showed that for this we can use the real and imaginary partsθ = (θ r ,θ i ) of the off-diagonal block ∆, and that the corresponding weight matrix isG(ρ) = 21 2r(d−r) . The lower bound (5.8) becomes Another consequence of (5.4) is that the Fisher information matrixĨ(ρ) is equal to the corresponding block of the Fisher information matrix I of the full (d 2 -dimensional) unconstrained model with parametrisation θ defined in (5.3). We will now compute the average over states of the Fisher information with respect to the Pauli bases measurements, by showing that it is equal to the average Fisher information at ρ 0 , for the random basis measurement. As the different settings are measured independently, the Fisher informationĨ(ρ) is (and similarly for I) whereĨ(ρ|s) is the Fisher information corresponding to the von Neumann measurement with respect to the ONB defined by setting s. More generally, with B U as defined above, we denote byĨ(ρ|B U ) the Fisher information corresponding to this basis. Due to the rotation symmetry, we havẽ where µ d (dU ) is the unique Haar measure on the unitary group on C d .
The average Fisher information matrixĪ (of the full, unconstrained model) is computed in section 8.6 where we show that the block corresponding toθ parameters has averageĪ = 21 2r(d−r) such that the lower bound is R minmax (r, k) ≥ 2r(d − r).

Numerical results
In this section we present the results of a simulation study which analyses the performance of the proposed estimation methods. The penalised and physical estimators discussed in the previous sections use a theoretical penalisation and respective threshold rates proportional to ν 2 . However in practice we found that the performance of the estimators can be further improved when the rates are adjusted by multiplying with an appropriate constant c -whose choice is informed by the data -from a grid over a small interval which was chosen to be [0, 3]. The last two estimators are such versions of the theoretical ones with constant c chosen by using cross-validation methods which are explained in detail in section 6.1. We will compare the following 5 estimators described below.
2. the oracle "estimator" ρ (oracle) n defined below. This is strictly speaking not an estimator since it requires the knowledge of the state ρ itself, and can be computed only in simulation studies. However, the oracle is a useful benchmark for evaluating the performance of the other estimators.

the cross-validated projection estimator ρ (cv)
n . Here we try to find the optimal truncation rank of the least squares estimator, by using the cross-validation method. We explore the estimators' behaviour by simulating datasets from states with different ranks, and with different number of measurement repetitions per setting. The methodology is described in detail below.

Generation of random states and simulation of datasets
In order to generate a density matrix of rank r, we first create a rank r upper triangular matrix T in which i) the off-diagonal elements of the first r rows are random complex numbers, ii) the diagonal elements T 22 , . . . , T rr are real, positive random numbers, iii) all elements of the rows r + 1, . . . , d are zero.
The matrix T is completed by setting T 11 such that T 11 ≥ 0, and T 2 2 = 1. If these conditions cannot be satisfied we repeat the procedure by generating a new set of matrix elements for T . When successful, we set ρ := T * T which by construction is a density matrix of rank r. We note that it is not our purpose to generate matrices from a particular "uniform" ensemble, but merely to have a state with reasonably random eigenvectors, and whose r eigenvalues are not significantly smaller than 1/r. Following this procedure we have generated 4 states of 4 ions (d = 2 4 ) with ranks 1, 2, 6, 10. The rank 6 state for instance, has non-zero eigenvalues (0.47, 0.19, 0.12, 0.11, 0.07, 0.04).
For each state, we have then simulated a number of 100 independent datasets with a given number of repetitions chosen from the range 20, 100, 500, and 2500. In this way we can study the dependence of the MSE of each estimator on state (or rank) and number of repetitions.

Computation of estimators
We conducted the following simulation study for all the possible combinations between the states and the total number of cycles (i.e. 4 × 4 = 16 different scenarios). Below, we denote by r the rank of the "true" state ρ, from which the data has been generated. The procedure has the following steps: 1. For a given number of repetitions n, we simulate 5 independent datasets D 1 , . . . , D 5 , each with n/5 repetitions. By simply adding the number of counts for each setting and outcome, we obtain a dataset D of n repetitions. However, as we will see below, having 5 separate "smaller" datasets is important for the purpose of applying cross-validation. Note that such a procedure can be easily implemented in an experimental setting.
2. We compute the least square estimator ρ (ls) n based on the full dataset D with total number of cycles n.
3. We compute the oracle "estimator" as follows: (a) we compute the spectral decomposition (4.1) of ρ Note that the oracle estimator relies on the knowledge of the true state ρ which is not available in a real data set-up. It is nevertheless useful as a benchmark for judging the performance of other estimators in simulation studies. At the next point we define the cross-validation estimator which tries to find the "optimal" rank κ 0 by replacing the unknown state ρ with the least squares estimator computed on a separate batch of data.
4. We compute the cross validation estimator as follows.
(a) For each j ∈ {1, . . . , 5} we compute the following estimators. While holding the batch D j out, we compute the least squares estimator ρ (ls) n;−j for the dataset consisting of joining the remaining 4 batches together. Similarly to the point above, we define the rank κ truncation of this estimator by ρ n;−j (κ). We also compute the least squares estimator for the remaining batch j, denoted by ρ (ls) n;j .
(b) For each rank κ we evaluate the "empirical discrepancy" Since ρ n;−j (κ) and ρ (ls) n;j are independent, and the least squares estimator is unbiased E( ρ (ls) n;j ) = ρ, the expected value of CV(k) is where we denoted E −1 and E 1 the expectation over all batches except the first, and respectively over the first batch. Therefore, the average of CV(κ) is equal to the mean square error of the truncated estimator ρ n;−1 (κ), up to a constant C which is independent of κ.
(c) Based on the above observation we use the the cross-validation method as a proxy for the oracle estimator. Concretely, we minimize CV(κ) with respect to κ κ cv = arg min k CV(κ), and define the cross-validation estimator as the truncation to rankκ cv of the full data least squares estimator ρ (cv) n := ρ n (κ cv ).
5. We compute the cross-validated rank-penalised estimator as follows.
(a) Let c be a penalisation constant chosen from a suitable set of discrete values in the interval [0,3]. Similarly to the cross-validation procedure, we hold out batch j, and we compute the rank-penalised estimator (4.2), with penalty constant cν 2 for j = 1, . . . , 5. We denote these estimators by ρ (pen) n;−j (c). We will also need the least square estimator ρ (ls) n;j for batch j computed above. Finally we compute the cross-validated rank penalised estimator ρ (rk−cv) n which is defined as in (4.2), with constant cν 2 , on the whole dataset D.
6. We compute the physical estimator as follows.
(a) As above we choose a constant c from a grid over the interval [0, 3]. We hold out batch j, and we compute the physical threshold estimator (4.5), using the algorithm below this equation, with threshold c · 4ν. We denote the resulting estimators by ρ Finally we compute the cross-validated physical estimator ρ (phys−cv) n which is defined as in (4.5), with constantĉ · 4ν.

Simulation results
We collect here the results of the simulation study described in the previous section. As a figure of merit we focus on the mean square error E ( ρ n − ρ 2 2 of each estimator, which is estimated by averaging the square errors over the 100 independent repetitions of the procedure. We are also interested in how the different methods perform relative to each other, and whether the selected rank is consistent, i.e. it concentrates on the rank of the true state for large number of repetitions. The four panels in Figure 3 represent the boxplots of the square errors ρ n − ρ 2 2 for the different estimators, and different states, when the number of repetitions is n = 20. Similarly, Figure 4 shows the same boxplots at n = 100. As expected, in both cases the least squares performs significantly worse than the other estimators, and the discrepancy is larger for small rank states. The remaining 4 estimators have similar MSE's with the physical one performing slightly better than the rest, followed by the oracle. Note also that the estimators' variances (indicated by the size of the boxes) are larger for the least squares than the other estimators. A similar behaviour has been observed for n = 500, 2500 repetitions.  Figure 5 illustrates the dependence of the MSE of a given estimator, as a function of n, for the four different states which have been analysed. Since the MSE decreases as n −1 we have chosen to plot the "renormalised" MSE given by n · E ρ n − ρ 2 2 , which converges to a constant value for large n. As expected the limiting value increases with the rank of the state, as a proxy for the number of parameters to be estimated. The histograms in Figure 6 show the probability distributions for the chosen rank of each given estimator, as a function of the number of measurement repetitions n, for the state of rank 6. We note that in all cases the proportion of times that the chosen rank is equal to the true rank of the state increases as with the number of repetitions. However, this convergence towards a "rank-consistent" estimator is rather slow, as the proportion surpasses 80% only when n = 2500. Another observation is that the penalty and threshold estimators appear to have different behaviours: the former tends to underestimate the true rank, while the latter tends to overestimate it. As expected, the oracle estimator is more likely to choose the correct rank for large number of repetitions. Perhaps slightly more surprising, for small number of repetitions (n = 20), the oracle choose a pure state in most cases.

Conclusions and outlook
Quantum state tomography, and in particular multiple ions tomography is an important enabling component of quantum engineering experiments. Since full quantum tomography becomes unfeasible for large dimensional systems, it is useful to identify lower dimensional models with good approximation properties for physically relevant states, and to develop estimation methods tailored for such models. In particular, quantum states created in the lab are often very well approximated by low rank density matrices, which are characterised by a number of parameter which is linear rather than quadratic in the space dimension.
In this work we analysed several estimation algorithms targeted at estimating low rank states in multiple ions tomography. The procedure consists in computing the least squares estimator, which is then diagonalised, truncated to an appropriate smaller rank by setting eigenvalues below a "noise threshold" to zero, and normalised. Among the several truncation methods proposed, the best performing one is the "physical estimator"; this chooses the density matrix whose non-zero eigenvalues are above a certain threshold and is the closest to the least squares estimator. We proved concentration bounds and upper bounds for the mean square error of the penalised and physical estimators, as well as a lower bound for the asymptotic minimax rate for multiple ions tomography. The results show that the proposed methods have an optimal dependence on rank and dimension, up to a logarithmic factor in dimension. In addition, the algorithms are easy to implement numerically and their computational complexity is determined by that of the least squares estimator.
An interesting future direction is to extend the spectral thresholding methodology to a measurement setup where a smaller number of settings is measured, which is however sufficient to identify the unknown low rank state. Another direction involves the construction of confidence intervals / regions for such estimators, beyond the concentration bounds established here.
Proof. Next, we give here for the reader's convenience the proof of Lemma 1 similar to that in [8].
On the one hand, in case the sets E b and E b are equal, we have We assumed here that j 0 belongs to E b \E b and the same holds for j 0 in E b \E b .

Proof of Proposition 1
Proof of Proposition 1. Note that f (o|s) = 1 n i I(X s,i = o), where the random variables X s,i are independent for all settings s and all i from 1 to n. To estimate the risk of the linear estimator we write ρ (ls) where W s,i are independent and centered Hermitian random matrices. We will apply the following extension of the Bernstein matrix inequality [38] due to [39], see also [31], [32].
Proposition 3 (Bernstein inequality, [39]). Let Y 1 , ..., Y n be independent, centered, m×m Hermitian random matrices. Suppose that, for some constants V, W > 0 we have Y j ≤ V , for all j from 1 to n, and that j E(Y 2 j ) ≤ W . Then, for all t ≥ 0, In our setup we bound W s,i ≤ V for all s and i and We replace B(o|s) in (8.1) and use Lemma 1 to get Apply the matrix Bernstein inequality in Proposition 3 to get, for any t > 0: We choose t > 0 such that which leads to t such that Then the convenient choice of t is ν(ε) = v(ε)/2 + v 2 (ε)/2 + 3/2 · v(ε), which is equivalent to 3/2 · v(ε) when this last term tends to 0.

Proof of Proposition 2
Proof of Proposition 2. Let us denote by ( λ 1 ≥ ... ≥ λ d ) the eigenvalues of | ρ (ls) n | and by λ 1 , ..., λ d the eigenvalues of the resulting estimator ρ (ls,n) n . It is easy to see that the latter has the same eigenvectors as ρ (ls) n . Therefore This optimisation has the explicit solution Note that Therefore, ρ

Proof of Corollary 1
Proof of Corollary 1. We order the eigenvalues of ρ (pen) n in decreasing order of absolute values (| λ 1 | ≥ ... ≥ | λ d |), and denote by λ 1 , ..., λ d the eigenvalues of ρ (pen,n) n . As in the previous proof, we can see that both matrices will share the same eigenvectors and that the relation between the eigenvalues is Thus, ρ (pen) n − ρ (pen,n) and the previous inequality remains true for all 0 < e ≤ ε < 1. Let us denote by which is a decreasing function of e. This implies that e = 2d exp(− N x(e) 16rd ). Thus,
After a total ofˆ = d −r iterations the algorithm stops and we have [41]  From now on we assume that ρ (ls) n − ρ ≤ 2ν(ε) which holds with probability larger than 1 − ε, cf. Corollary 3.2. We will show that under this assumptionr = r. For this we consider the two caseŝ r > r andr < r separately.
This implies that λ (phys) r ≤ λr + 4ν = 4ν sincer > r. However, as discussed above, by the definition of the threshold estimator, λ (phys) r ≥ 4ν. Thereforer cannot be larger than r.
We consider now the MSE of the estimator. As shown above, we have | λ (phys) n U * n be the decompositions of the least squares and the physical threshold estimator, where we take into account that the latter two have the same eigenbasis. Then The first norm on the right-hand side of the previous inequality is bounded as For the second norm we use a similar triangle inequality for the operator norm The first term is smaller than ν since |λ i − λ (ls) i | ≤ ν as it follows from Proposition 1, and the Weyl inequality [40]. The second term is also bounded by ν, by Proposition 1. Therefore, since U n D U * n and U DU * are rank r matrices, the difference is at most rank 2r and U n D U * n − U DU * 2 ≤ 2ν √ 2r. By plugging this together with (8.3) into the right side of (8.2) we get with probability larger than 1 − .
Moreover, the previous inequality remains true for all 0 < e ≤ ε < 1. Let us denote by x(e) = 48rν 2 (e) = 96 which is a decreasing function of e. This implies that e = 2d exp(− N x(e) 96rd ). Thus,

The average Fisher information matrix for the full, unconstrained model, with random measurement design
In this section we present a detailed calculation of the average quantum Fisher information at a the rank r state ρ 0 defined in (5.2), for random measurements with uniform distribution over the measurement basis. We consider the full parametrisation by θ ∈ R d 2 given by equation (5.3). As explained in the proof, the Fisher information matrix for the parametrisationθ of the rotation model R d,r is a particular block of the larger Fisher matrix computed here. We will come back to this at the end of the computation.
and the partial derivatives are given by The Fisher information matrix I(ρ|B U ) has the following block structure with superscripts indicating the type of parameter considered: diagonal, real or imaginary part of off-diagonal element. The average Fisher information for a randomly chosen basis is where µ(dU ) is the Haar measure over unitaries used for choosing the random basis. Note that by symmetryĪ(ρ) only depends on the spectrum of ρ. We will not computeĪ(ρ) for an arbitrary state ρ but only at ρ = ρ 0 . The corresponding Fisher information will be denotedĪ =Ī(ρ 0 ) and is a function of d and r. Below we compute the different blocks ofĪ. For the matrix elements we will use a suggestive notation, e.g.Ī d,r ii;jk denotes the element corresponding to the diagonal parameter θ d ii = ρ ii and the real part of the off-diagonal element ρ jk , etc.
Note that for r = 1 these matrix elements are infinite. This is due to large contributions from measurements which have one basis vector close to being orthogonal to the one dimensional vector state.. This is somewhat akin to what happens in the case of a Bernoulli variable (coin toss), in the case when the probability is close to zero or one. While this phenomenon is interesting, it does not play any role in our analysis for which the diagonal matrix elements are not relevant parameters.
As before, the corresponding entry of the average Fisher information matrix is Sub-case 1: i, j, k ≤ r. In this case the integral is where we applied formula (8.6) for the integrals over the unitaries.
Sub-case 2: i, j ≤ r and k > r. In this case the integral is Sub-case 3: i ≤ r and j, k > r. In this case the integral is Sub-case 4: i > r and j, k ≤ r. In this case the integral is Sub-case 5: i > r and j, k ≤ r. In this case the integral is Sub-case 6: i, k > r and j ≤ r. In this case the integral is × Re | ψ 2 |i | 2 ψ 2 |k ν d−r (dψ 2 ) · j|ψ 1 ν r (dψ 1 ) = 0.
Sub-case 7: i, j, k > r. In this case the integral is The last integral is zero for the same reason as in sub-case 1.

C) Diagonal-Imaginary block.
Similarly to the case of diagonal-real elements, we obtain that all diagonal-imaginary matrix elements are zeroĪ di ii;jk = 0, for all i = 1, . . . , 2 k , and 1 ≤ j < k ≤ 2 k .
The same reasoning as before can be applied to transform the integral into a single integral, or a product of integrals over unitaries. If a single index is larger than r while the other 3 are smaller, or conversely a single index is smaller than r while the other are larger, then we have two integrals over unitaries which are zero since the monomial is of odd order. Therefore the following cases remain to be analysed.
Using formula (8.6) we find that the integral over unitaries is zero unless we deal with a diagonal matrix element of I, i.e. i = k and j = l. For the latter we havē Sub-case 2: i, j, k, l > r. In this case, the integral is I rr ij;kl = 4 · d · r (1 − q 2 ) 2 q 2 m d,r (dq) Re( i|ψ 2 ψ 2 |j ) · Re( k|ψ 2 ψ 2 |l )ν d−r (dψ 2 ) where we have used formula (8.11) for the integral over q. A similar calculation as above shows that all off-diagonal elements are zero and Sub-case 3: (i, j ≤ r and k, l > r) or (i, j > r and k, l ≤ r). In this case we deal with a product of integrals over unitaries of the form so all matrix elements are zero.
Again, if one index is smaller that r while the other three are larger, or otherwise, then the matrix element in zero. We analyse the remaining cases.
Sub-case 1: i, j, k, l ≤ r. In this case the integral is I ri ij;kl = −4 · d · r q 2 m d,r (dq) Re( i|ψ 1 ψ 1 |j )Im( k|ψ 1 ψ 1 |l )ν r (dψ 1 ) Using formula (8.6) we find that the integral over unitaries is zero unless i = k and j = l. However even in this case, two of the four terms are zero, and the other two cancel each other.
Sub-case 2: i, j, k, l > r. Here the integrals over the unitaries are similar as in case 1 above, with the difference that they taken with respect to µ d−r (dU ). Therefore the matrix elements are zero.
Sub-case 3: (i, j ≤ r and k, l > r) or (i, j > r, k, l ≤ r). In this case we deal with a product of integrals over unitaries of the form so all matrix elements are zero.
Sub-case 4: i, k ≤ r and j, l > r Again, the off-diagonal elements are zero and I ri ij;ij = = −4 · d · r (1 − q 2 )m d,r (dq) Re( i|ψ 1 ψ 2 |j ) · Im( i|ψ 1 ψ 2 |j )ν r (dψ 1 )ν r (dψ 2 ) In the last integral, two terms are zero and two have different signs and cancel each other. In conclusion all elements of the real-imaginary block are equal to zero.
Moreover the real and imaginary diagonal blocks are diagonal, equal to each other and have three distinct values depending on the position of the indices i < j with respect to r:  The model R d,r can be seen (locally) as the restriction of the full unconstrained model S r,d parametrised by θ, to the subset of parametersθ which are real and imaginary parts of matrix elements ρ i,j with i ≤ r < j. Therefore the corresponding average Fisher information matrix is equal to the corresponding block ofĪ, i.e.Ī = 21 2r(d−r) .