Estimation of low rank density matrices: bounds in Schatten norms and other distances

Let ${\mathcal S}_m$ be the set of all $m\times m$ density matrices (Hermitian positively semi-definite matrices of unit trace). Consider the problem of estimation of an unknown density matrix $\rho\in {\mathcal S}_m$ based on outcomes of $n$ measurements of observables $X_1,\dots, X_n\in {\mathbb H}_m$ (${\mathbb H}_m$ being the space of $m\times m$ Hermitian matrices) for a quantum system identically prepared $n$ times in state $\rho.$ Outcomes $Y_1,\dots, Y_n$ of such measurements can be described by a trace regression model in which ${\mathbb E}_{\rho}(Y_j|X_j)={\rm tr}(\rho X_j), j=1,\dots, n.$ The design variables $X_1,\dots, X_n$ are often sampled at random from the uniform distribution in an orthonormal basis $\{E_1,\dots, E_{m^2}\}$ of ${\mathbb H}_m$ (such as the Pauli basis). The goal is to estimate the unknown density matrix $\rho$ based on the data $(X_1,Y_1), \dots, (X_n,Y_n).$ Let $$ \hat Z:=\frac{m^2}{n}\sum_{j=1}^n Y_j X_j $$ and let $\check \rho$ be the projection of $\hat Z$ onto the convex set ${\mathcal S}_m$ of density matrices. It is shown that for the estimator $\check \rho$ the minimax lower bounds in classes of low rank density matrices (established earlier) are attained up to logarithmic factors for all Schatten $p$-norm distances, $p\in [1,\infty],$ and for the Bures version of the quantum Hellinger distance. Moreover, for a slightly modified version of the estimator $\check \rho$ the same property also holds for the quantum relative entropy (Kullback--Leibler) distance between density matrices.


Introduction.
Let ${\mathbb M}_m$ be the set of all $m\times m$ matrices with entries in ${\mathbb C}$. For $A\in {\mathbb M}_m$, let $A^*$ denote its conjugate transpose and let ${\rm tr}(A)$ denote the trace of $A$. The complex linear space ${\mathbb M}_m$ of dimension $m^2$ will be equipped with the Hilbert--Schmidt inner product $\langle A,B\rangle := {\rm tr}(AB^*)$, $A,B\in {\mathbb M}_m$. In what follows, the sign $\otimes$ denotes the tensor product of vectors or matrices (linear transformations). For instance, for $u,v\in {\mathbb C}^m$, $u\otimes v$ is the linear transformation from ${\mathbb C}^m$ into itself defined by $(u\otimes v)x := \langle x,v\rangle u$, $x\in {\mathbb C}^m$. Let ${\mathbb H}_m := \{A\in {\mathbb M}_m: A=A^*\}$ be the set of all Hermitian matrices. Clearly, ${\mathbb H}_m$ is a linear space of dimension $m^2$ over the field of real numbers. For $A\in {\mathbb H}_m$, the notation $A\succeq 0$ means that $A$ is positively semi-definite. A density matrix is a positively semi-definite Hermitian matrix of unit trace. The set of all $m\times m$ density matrices will be denoted by ${\mathcal S}_m := \{S\in {\mathbb H}_m: S\succeq 0,\ {\rm tr}(S)=1\}$. Density matrices are used in quantum mechanics to characterize the states of quantum systems. More generally, states are represented by self-adjoint positively semi-definite operators of unit trace acting in an infinite-dimensional Hilbert space; in this case, density matrices of a large dimension $m$ could be used to approximate the states of the system. The goal of quantum state tomography is to estimate the density matrix of a system prepared in an unknown state based on specially designed measurements. Let $X\in {\mathbb H}_m$ be a Hermitian matrix (an observable) with spectral representation $X=\sum_{j=1}^{m'}\lambda_j P_j$, where $m'\le m$, $\lambda_j\in {\mathbb R}$, $j=1,\dots,m'$, are the distinct eigenvalues of $X$ and $P_j$, $j=1,\dots,m'$, are the corresponding eigenprojections. For a system prepared in state $\rho\in {\mathcal S}_m$, the possible outcomes of a measurement of observable $X$ are the eigenvalues $\lambda_j$, $j=1,\dots,m'$, and they occur with probabilities $p_j := {\rm tr}(\rho P_j)$, $j=1,\dots,m'$.
If $Y$ is a random variable representing such an outcome, then $${\mathbb E}_{\rho} Y = \sum_{j=1}^{m'}\lambda_j p_j = {\rm tr}(\rho X) = \langle \rho, X\rangle.$$ In the simple model of quantum state tomography considered in this paper, an observable $X$ is sampled at random from some probability distribution $\Pi$ in ${\mathbb H}_m$, ${\mathbb E}_{\rho}(Y|X)=\langle\rho,X\rangle$ and $Y=\langle\rho,X\rangle+\xi$ with noise $\xi$ such that ${\mathbb E}_{\rho}(\xi|X)=0$. Given a sample $X_1,\dots,X_n$ of $n$ i.i.d. copies of $X$, $n$ measurements of observables $X_1,\dots,X_n$ are performed for a system identically prepared $n$ times in the same unknown state $\rho\in{\mathcal S}_m$, resulting in outcomes $Y_1,\dots,Y_n$. This leads to the following trace regression model: $$Y_j = \langle\rho,X_j\rangle + \xi_j,\quad j=1,\dots,n, \eqno(1.1)$$ with design variables $X_j$, response variables $Y_j$ and noise $\xi_j$ satisfying the assumptions ${\mathbb E}_{\rho}(\xi_j|X_j)=0$ and ${\mathbb E}_{\rho}(Y_j|X_j)=\langle\rho,X_j\rangle$, $j=1,\dots,n$. The goal is to estimate the target density matrix $\rho$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$, with the estimation error measured by one of the statistically meaningful distances between density matrices, such as the Schatten $p$-norm distances for $p\in[1,\infty]$ or quantum versions of the Hellinger and Kullback--Leibler distances. This version of the problem of quantum state tomography has been intensively studied in recent years. The noiseless case (quantum compressed sensing) was considered in [12] and [11]. In these papers, sharp bounds on the number $n$ of measurements needed to recover a density matrix of rank $r$ were obtained based on a subtle argument (the so-called "golfing scheme") utilizing matrix Bernstein type inequalities. These developments were related to earlier work on low rank matrix completion [7]. In the noisy case, trace regression problems have been studied by many authors (see, e.g., [15] and references therein). The main focus was on the nuclear norm penalized least squares estimator (matrix LASSO) and related methods such as the matrix Dantzig selector (see [6], [19], [25], [14]).
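As a small illustration of the measurement model above, the following sketch simulates a single measurement of an observable $X$ for a system in state $\rho$: the outcome is a random eigenvalue of $X$ drawn with the probabilities $p_j={\rm tr}(\rho P_j)$, so that ${\mathbb E}_\rho(Y|X)={\rm tr}(\rho X)$. The simulation parameters are illustrative assumptions, not part of the paper.

```python
import numpy as np

def measure(rho, X, rng):
    """Simulate one measurement of the observable X on the state rho.

    Returns a random eigenvalue lambda_j of X, drawn with probability
    p_j = tr(rho P_j), so that E[Y | X] = tr(rho X)."""
    lam, V = np.linalg.eigh(X)     # spectral decomposition X = sum_j lam_j P_j
    # outcome probabilities: p_j = <v_j, rho v_j> = tr(rho P_j)
    p = np.real(np.einsum('ij,jk,ki->i', V.conj().T, rho, V))
    p = np.clip(p, 0, None)
    p /= p.sum()                   # guard against tiny negative round-off
    return rng.choice(lam, p=p)

# toy check: a qubit in a diagonal state, observable sigma_z
rng = np.random.default_rng(0)
rho = np.array([[0.75, 0.0], [0.0, 0.25]])
sigma_z = np.diag([1.0, -1.0])
samples = np.array([measure(rho, sigma_z, rng) for _ in range(20000)])
print(abs(samples.mean() - np.trace(rho @ sigma_z)))  # small: E[Y] = tr(rho X)
```

Averaging many simulated outcomes recovers $\langle\rho,X\rangle$ up to statistical noise, which is exactly the trace regression model (1.1) with mean-zero measurement noise.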
In [21], sharp bounds for matrix LASSO and matrix Dantzig selector, in particular, for Pauli measurements in quantum state tomography were obtained. Most of the results in these papers included upper bounds on the estimation error in Hilbert-Schmidt (Frobenius) norm as well as low rank oracle inequalities ( [19], [15], [18]). In [19], an upper bound on the operator norm error of a nuclear norm penalized modified least squares estimator was also proved. This result was further developed in [22]. In [16], upper bounds and low rank oracle inequalities for von Neumann entropy penalized least squares estimators were studied (including the bounds on the error in Bures distance and quantum relative entropy distance). A rank penalized estimator of density matrix was studied in [1]. The minimax lower bounds on the Frobenius norm error for matrix completion problems in classes of matrices of rank r were obtained in [19] (the operator norm version could be found in [22]). In [23], a method of deriving lower bounds for unitary invariant matrix norms (including Schatten p-norms) was developed and, among other matrix estimation problems, such bounds were obtained for matrix completion. Minimax lower bounds on the nuclear norm error in density matrix estimation were obtained in [9], where it was also shown that these bounds are attained (up to logarithmic factors) for the matrix versions of LASSO and Dantzig selector. In our recent paper [20], we derived minimax lower bounds in classes of low rank density matrices for the whole range of Schatten p-norm distances as well as for Bures (quantum Hellinger) and quantum relative entropy distance. We also showed that these minimax bounds are attained (up to logarithmic factors) for von Neumann entropy penalized least squares estimators introduced in [16] simultaneously for Bures, relative entropy and Schatten p-norm distances for p ∈ [1,2].
The current paper can be viewed as a continuation of [20]. Our main goal is to study a minimal distance estimator $\check\rho$ of $\rho$ (initially proposed in [17]) defined as the projection of a simple unbiased estimator onto the convex set ${\mathcal S}_m$ of density matrices. We show that the minimax error rates established in [20] for the classes of low rank density matrices are attained for this estimator up to logarithmic factors in the whole range of Schatten $p$-norm distances for $p\in[1,\infty]$ as well as for the Bures and relative entropy distances. The proof of these results relies on simple properties of projections of Hermitian matrices onto the convex set ${\mathcal S}_m$ of density matrices (see theorems 7 and 8) that might be of independent interest. Throughout the paper, $\langle\cdot,\cdot\rangle$ denotes either the Hilbert--Schmidt inner product (defined above) or (with a little abuse of notation) the canonical inner product of ${\mathbb C}^m$. The corresponding norm in ${\mathbb C}^m$ is denoted by $|\cdot|$. For $A,B\ge 0$, the notation $A\lesssim B$ means that $A\le CB$ for a numerical constant $C>0$, $A\gtrsim B$ means that $B\lesssim A$, and $A\asymp B$ means that $A\lesssim B$ and $B\lesssim A$. If needed, these signs may be provided with subscripts indicating that the constant is allowed to depend on parameters: say, $A\lesssim_{\gamma} B$ means that $A\le CB$ with $C$ depending on $\gamma$.

Preliminaries.
2.1. Distances between density matrices. The Schatten $p$-norm of a matrix $A\in{\mathbb H}_m$ is defined as $$\|A\|_p := \Big(\sum_{j=1}^m |\lambda_j(A)|^p\Big)^{1/p},\ p\in[1,\infty),\qquad \|A\|_\infty := \max_{1\le j\le m}|\lambda_j(A)|,$$ where $\lambda_1(A)\ge\dots\ge\lambda_m(A)$ are the eigenvalues of $A$ arranged in non-increasing order. For $p=1$, the norm $\|A\|_1$ is called the nuclear or trace norm; for $p=2$, $\|A\|_2$ is the Hilbert--Schmidt (generated by the Hilbert--Schmidt inner product) or Frobenius norm; for $p=+\infty$, $\|A\|_\infty$ is called the operator or spectral norm. Note that, for all $A\in{\mathbb H}_m$, $[1,\infty]\ni p\mapsto\|A\|_p$ is a non-increasing function. The following interpolation inequality is well known and can be easily deduced from a similar result for $\ell_p$-norms: let $1\le p<q<r\le\infty$ and let $\mu\in[0,1]$ be such that $\frac{\mu}{p}+\frac{1-\mu}{r}=\frac1q$; then $$\|A\|_q\le\|A\|_p^{\mu}\,\|A\|_r^{1-\mu}.$$ In addition to the distances generated by the Schatten $p$-norms, the following two distances (extending well known distances between probability distributions used in classical statistics) are of importance in quantum statistics: the Bures distance and the Kullback--Leibler divergence. The Bures distance is a quantum version of the Hellinger distance and it is defined as follows: $$H^2(S_1,S_2) := 2\Big(1-{\rm tr}\big(S_1^{1/2}S_2 S_1^{1/2}\big)^{1/2}\Big),\quad S_1,S_2\in{\mathcal S}_m.$$ The quantity ${\rm tr}\big(S_1^{1/2}S_2S_1^{1/2}\big)^{1/2}$ is called the fidelity of the states $S_1, S_2$ (a quantum version of the Hellinger affinity). Note that $0\le H^2(S_1,S_2)\le 2$ and that $H(S_1,S_2)$ defines a metric in the space ${\mathcal S}_m$. The non-commutative Kullback--Leibler divergence, or relative entropy distance, is defined as $$K(S_1\|S_2) := {\rm tr}\big(S_1(\log S_1-\log S_2)\big).$$ If $S_2$ is a density matrix of rank strictly smaller than $m$, $\log S_2$ is not well defined and $K(S_1\|S_2):=+\infty$. Clearly, $K(S_1\|S_2)$ is not a metric (in particular, it is not symmetric). It is well known that $K(S_1\|S_2)$ is the supremum of classical Kullback--Leibler divergences between the distributions of outcomes of all possible measurements (represented by positive operator valued measures (POVM)) for the system prepared in states $S_1$ and $S_2$.
A similar property also holds for the Bures (Hellinger) distance and for the nuclear norm distance $\|S_1-S_2\|_1$, which is the supremum of classical total variation distances between the distributions of outcomes of all measurements (see [26], [13]). These observations easily imply the following inequalities: $$\tfrac12\|S_1-S_2\|_1^2\le K(S_1\|S_2),\qquad H^2(S_1,S_2)\le\|S_1-S_2\|_1 \eqno(2.2)$$ (see also [16]).
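The distances above are straightforward to compute numerically; a minimal sketch (the helper `_psd_sqrt` computes the square root of a positively semi-definite matrix via its spectral decomposition, an assumption of full-rank arguments is made for the relative entropy):

```python
import numpy as np

def schatten(A, p):
    """Schatten p-norm of a Hermitian matrix: ell_p norm of its eigenvalues."""
    s = np.abs(np.linalg.eigvalsh(A))
    return s.max() if np.isinf(p) else (s ** p).sum() ** (1.0 / p)

def _psd_sqrt(S):
    lam, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(lam, 0, None))) @ V.conj().T

def bures(S1, S2):
    """Bures distance: H^2 = 2 (1 - tr[(S1^{1/2} S2 S1^{1/2})^{1/2}])."""
    R = _psd_sqrt(S1)
    fidelity = np.trace(_psd_sqrt(R @ S2 @ R)).real
    return np.sqrt(max(2.0 * (1.0 - fidelity), 0.0))

def rel_entropy(S1, S2):
    """Quantum relative entropy K(S1||S2) = tr[S1(log S1 - log S2)], S1, S2 full rank."""
    def logm_h(S):
        lam, V = np.linalg.eigh(S)
        return (V * np.log(lam)) @ V.conj().T
    return np.trace(S1 @ (logm_h(S1) - logm_h(S2))).real

rho1 = np.diag([0.5, 0.5])
rho2 = np.diag([0.9, 0.1])
print(schatten(rho1 - rho2, 1))   # nuclear norm distance, ~0.8
```

On this commuting pair the inequalities in (2.2) can be checked directly: $H^2\approx 0.21\le 0.8=\|S_1-S_2\|_1$ and $\frac12\,0.8^2=0.32\le K\approx 0.51$.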

2.2.
Sampling from an orthonormal basis. Uniform sampling from an orthonormal basis is a model of design distribution in trace regression (1.1) that has been frequently used in the literature on quantum compressed sensing (see [12], [11]). Let ${\mathcal E} := \{E_1,\dots,E_{m^2}\}$ be an orthonormal basis of the space ${\mathbb H}_m$ of Hermitian matrices and let $$U := \max_{1\le j\le m^2}\|E_j\|_\infty.$$ Clearly, $U\le 1$ and $U\ge m^{-1/2}$ (since $1=\max_{1\le j\le m^2}\|E_j\|_2\le\sqrt m\,U$). In what follows, it will be assumed that $\Pi$ is the uniform distribution on the basis ${\mathcal E}$. As a result, the response variables $Y_j$, $j=1,\dots,n$, of trace regression model (1.1) can be viewed as noisy measurements of $n$ randomly picked Fourier coefficients of the target density matrix $\rho$ in the basis ${\mathcal E}$. This model includes, in particular, the so-called Pauli measurements, an important approach to quantum state tomography (see, e.g., [12], [11]).

Example: Pauli bases and Pauli measurements. The space of observables for a single qubit system is the space ${\mathbb H}_2$ of $2\times 2$ Hermitian matrices. Let $$\sigma_0 := \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},\quad \sigma_1 := \begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix},\quad \sigma_2 := \begin{pmatrix}0 & -i\\ i & 0\end{pmatrix},\quad \sigma_3 := \begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix}.$$ The matrices $\sigma_1,\sigma_2,\sigma_3$ (often denoted $\sigma_x,\sigma_y,\sigma_z$) are called Pauli matrices. The matrices $W_i = \frac{1}{\sqrt 2}\sigma_i$, $i=0,1,2,3$, form an orthonormal basis of the space ${\mathbb H}_2$ (the Pauli basis). For a system consisting of $k$ qubits, the space of observables is ${\mathbb H}_m$, where $m=2^k$. The Pauli basis of this space is defined by tensorizing the Pauli basis of ${\mathbb H}_2$: it consists of the $m^2=4^k$ tensor products $W_{i_1}\otimes\dots\otimes W_{i_k}$, $(i_1,\dots,i_k)\in\{0,1,2,3\}^k$. Denote $E_1 := W_0\otimes\dots\otimes W_0 = m^{-1/2}I_m$ and let $E_2,\dots,E_{m^2}$ be the rest of the matrices of the Pauli basis of ${\mathbb H}_m$. It is straightforward to check that $\|E_j\|_\infty = m^{-1/2}$ for all $j$, so that $U=m^{-1/2}$ in this case. The fact that the matrices of this basis have the smallest possible operator norms has been used in quantum compressed sensing (see [12], [11], [21]). The matrices $E_j$ have the spectral representations $E_j = \frac{1}{\sqrt m}\big(P_j^+ - P_j^-\big)$, where $P_j^{\pm}$ are the orthogonal projections onto the eigenspaces corresponding to the eigenvalues $\pm\frac{1}{\sqrt m}$. A measurement of $E_j$ for a $k$ qubit system prepared in state $\rho$ results in a random outcome $\tau_j$ with two possible values $\pm\frac{1}{\sqrt m}$ taken with probabilities $\langle\rho,P_j^{\pm}\rangle$. For the random variable $\tau_j$, ${\mathbb E}_\rho\tau_j = \langle\rho,E_j\rangle$. The density matrix $\rho$ admits the following representation in the Pauli basis: $$\rho = \sum_{j=1}^{m^2}\langle\rho,E_j\rangle E_j.$$ Let $\nu$ be picked at random from the set $\{1,\dots,m^2\}$ (with the uniform distribution) and let $X = E_\nu$, $Y = \tau_\nu$ (which corresponds to random sampling from the Pauli basis with a subsequent measurement of the observable $X$ resulting in the outcome $Y$). Then ${\mathbb E}_\rho(Y|X)=\langle\rho,X\rangle$ and $${\rm Var}_\rho(Y|X=E_j) = \frac1m - \langle\rho,E_j\rangle^2.$$ Since, for $\rho\in{\mathcal S}_m$, $\|\rho\|_2\le 1$, this means that, for $m>2$, with probability at least $1-\frac2m$, ${\rm Var}_\rho(Y|X)>\frac{1}{2m}$. In other words, the number of $j=1,\dots,m^2$ such that ${\rm Var}_\rho(\tau_j)>\frac{1}{2m}$ is at least $m^2-2m$, implying that, for most of the values of $j$, ${\rm Var}_\rho(\tau_j)\asymp\frac1m$. The variance can be further reduced by repeating the measurement of the observable $X$ $K$ times (for a system identically prepared in state $\rho$) and averaging the outcomes of the resulting $K$ measurements; in this case, the response variable becomes the average of $K$ i.i.d. outcomes, with the conditional variance reduced by a factor of $K$.
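The tensorized Pauli basis described above is easy to build and check numerically; the following sketch constructs it for $k$ qubits and verifies orthonormality in the Hilbert--Schmidt inner product and the identity $\|E_j\|_\infty = m^{-1/2}$:

```python
import numpy as np
from itertools import product

# single-qubit Pauli matrices and the normalized basis W_i = sigma_i / sqrt(2)
s0 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0]).astype(complex)
W = [s / np.sqrt(2) for s in (s0, sx, sy, sz)]

def pauli_basis(k):
    """Orthonormal Pauli basis of H_m, m = 2^k: all 4^k tensor products of W_i."""
    basis = []
    for idx in product(range(4), repeat=k):
        E = np.array([[1.0 + 0j]])
        for i in idx:
            E = np.kron(E, W[i])
        basis.append(E)
    return basis

k = 2
m = 2 ** k
E = pauli_basis(k)
# Gram matrix in the Hilbert-Schmidt inner product <A,B> = tr(A B*)
G = np.array([[np.trace(A @ B.conj().T) for B in E] for A in E])
print(np.allclose(G, np.eye(m ** 2)))          # orthonormality: True
print(max(np.linalg.norm(A, 2) for A in E))    # every ||E_j||_inf = m^{-1/2} = 0.5
```

Each $E_j$ has eigenvalues $\pm m^{-1/2}$ only, which is the two-valued outcome structure of the Pauli measurements used above.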

Minimax lower bounds.
In [20], the problem of density matrix estimation was studied in the case of trace regression model (1.1) with i.i.d. random design variables X 1 , . . . , X n sampled from the uniform distribution in an orthonormal basis E = {E 1 , . . . , E m 2 } in two different settings: trace regression with Gaussian noise and trace regression with a bounded response. In both cases, minimax lower bounds on the estimation error of the unknown target density matrix ρ of rank at most r were obtained for the Schatten p-norm distances (p ∈ [1, +∞]) as well as for the Bures version of quantum Hellinger distance and for the quantum Kullback-Leibler (relative entropy) distance. These results of [20] are stated below.
Denote by S r,m the set of all density matrices of rank at most r (1 ≤ r ≤ m).
Assumption 1 (Trace regression with Gaussian noise). Let $(X,Y)$ be a random couple, where $X$ is a random matrix sampled from the uniform distribution $\Pi$ in an orthonormal basis ${\mathcal E}=\{E_1,\dots,E_{m^2}\}$ of ${\mathbb H}_m$ and $Y=\langle\rho,X\rangle+\xi$, where $\xi$ is a mean zero Gaussian random variable with variance $\sigma_\xi^2$ independent of $X$. In this model, the level of the noise $\xi$ is characterized by its variance, which should be involved in the error bound (this can be viewed as a normal approximation of the noise in the case when repeated measurements are performed for each observable $X_j$ with averaging of the outcomes).
Theorem 1. Suppose Assumption 1 holds. For all $p\in[1,+\infty]$, there exist constants $c,c'>0$ such that the following bounds hold, where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in ${\mathcal S}_m$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$ satisfying the Gaussian trace regression model with noise variance $\sigma_\xi^2$.
The trace regression model with a bounded response is characterized by the size $U$ of the range of the response variable $Y$, which usually coincides with the bound on the operator norms of the basis matrices $E_j$. It includes, in particular, the Pauli measurements discussed above (for which $U=m^{-1/2}$).
Assumption 2 (Trace regression with a bounded response). Let $(X,Y)$ be a random couple, where $X$ is a random matrix sampled from the uniform distribution $\Pi$ in an orthonormal basis ${\mathcal E}=\{E_1,\dots,E_{m^2}\}$ of ${\mathbb H}_m$ and the response variable $Y$ satisfies $|Y|\le U$ a.s. for some $U>0$. Let ${\mathcal P}_{r,m}(U)$ denote the class of all distributions $P$ of $(X,Y)$ such that Assumption 2 holds for some $U>0$ and ${\mathbb E}(Y|X)=\langle\rho_P,X\rangle$ for some $\rho_P\in{\mathcal S}_{r,m}$. For a given $P\in{\mathcal P}_{r,m}(U)$, ${\mathbb P}_P$ denotes the corresponding probability measure such that $(X_1,Y_1),\dots,(X_n,Y_n)$ are i.i.d. copies of $(X,Y)$.
Theorem 2. Suppose Assumption 2 is satisfied and, for some constant $\gamma\in(0,1)$, the condition stated there holds. Then, for all $p\in[1,+\infty]$, there exist constants $c_\gamma, c'_\gamma>0$ such that the following bounds hold. As was pointed out in [20] (see Remark 12), the condition of Theorem 2 fails when one of the matrices of the basis ${\mathcal E}$ is proportional to the identity matrix. In the case of the Pauli basis, such a matrix indeed exists and it is $E_1 = W_0\otimes\dots\otimes W_0 = m^{-1/2}I_m$. Thus, Theorem 2 does not apply directly to the Pauli measurement model. However, the following result does hold (see [20], Theorem 10).

Theorem 3. Let $X_1,\dots,X_n$ be sampled at random from the Pauli basis and let $Y_1,\dots,Y_n$ be outcomes of measurements of observables $X_1,\dots,X_n$ for the system identically prepared $n$ times in state $\rho$. The corresponding probability measure will be denoted by ${\mathbb P}_\rho$. Then, for all $p\in[1,+\infty]$, there exist constants $c,c'>0$ such that the following bounds hold, where $\inf_{\hat\rho}$ denotes the infimum over all estimators $\hat\rho$ in ${\mathcal S}_m$ based on the data $(X_1,Y_1),\dots,(X_n,Y_n)$.
It was also shown in [20] that, in the case of Schatten p-norm distances for p ∈ [1, 2], Bures distance and Kullback-Leibler distance, the minimax lower bounds of theorems 1, 2 and 3 are attained up to logarithmic factors in m and n for a penalized least squares estimator with von Neumann entropy penalty introduced in [16]. In the current paper, our main goal is to show that the minimax optimal rates are attained up to logarithmic factors for a very simple minimal distance estimator (that does not require any penalization) in the whole range of Schatten p-norms, p ∈ [1, ∞], as well as for Bures and Kullback-Leibler distances.

Main results.
For the model of uniform sampling from an orthonormal basis ${\mathcal E}=\{E_1,\dots,E_{m^2}\}$, the following simple estimator of the unknown state $\rho\in{\mathcal S}_m$ is unbiased: $$\hat Z := \frac{m^2}{n}\sum_{j=1}^n Y_j X_j.$$ Indeed, $${\mathbb E}_\rho\hat Z = m^2\,{\mathbb E}_\rho(YX) = m^2\sum_{j=1}^{m^2}\frac{1}{m^2}\langle\rho,E_j\rangle E_j = \rho.$$ Clearly, $\hat Z$ is not necessarily a density matrix.
We will now define the minimal distance estimator $\check\rho$ as the projection of $\hat Z$ onto the convex set ${\mathcal S}_m$ of all density matrices. More precisely, for an arbitrary $Z\in{\mathbb H}_m$, define $$\pi_{{\mathcal S}_m}(Z) := \mathop{\rm argmin}_{S\in{\mathcal S}_m}\|Z-S\|_2^2. \eqno(3.1)$$ Clearly, $\pi_{{\mathcal S}_m}(Z)$ is the closest density matrix to $Z$ with respect to the Hilbert--Schmidt norm distance (that is, the projection of $Z$ onto ${\mathcal S}_m$; such a closest density matrix exists in view of compactness of ${\mathcal S}_m$ and it is unique in view of strict convexity of $S\mapsto\|Z-S\|_2^2$). Let $\check\rho := \pi_{{\mathcal S}_m}(\hat Z)$.
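As shown in Section 4, the projection $\pi_{{\mathcal S}_m}$ reduces to projecting the vector of eigenvalues onto the simplex while keeping the eigenvectors. A minimal numpy sketch of this reduction (the simplex projection uses the standard sort-based algorithm):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z in R^m onto the probability simplex.

    Standard sort-based algorithm, O(m log m)."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(z) + 1)
    j = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[j]) / (j + 1)
    return np.clip(z + theta, 0, None)

def project_density(Z):
    """pi_{S_m}(Z): eigen-decompose Z, project the spectrum onto the
    simplex, and keep the eigenvectors (the reduction of Section 4)."""
    lam, V = np.linalg.eigh(Z)
    return (V * project_simplex(lam)) @ V.conj().T

Z = np.array([[1.2, 0.3], [0.3, -0.4]])
rho_hat = project_density(Z)
print(np.trace(rho_hat).real)   # unit trace: 1.0
```

The output is always a valid density matrix: Hermitian, positively semi-definite and of unit trace.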

Remark 1. This definition is equivalent to $$\check\rho := \mathop{\rm argmin}_{S\in{\mathcal S}_m}\Big[m^{-2}\|S\|_2^2 - \frac2n\sum_{j=1}^n Y_j\langle S,X_j\rangle\Big], \eqno(3.2)$$ which was considered in [17] (in [19], similar estimators involving a nuclear norm penalty were studied). Note that replacing the term $m^{-2}\|S\|_2^2$ in the right hand side of (3.2) by its unbiased estimator $n^{-1}\sum_{j=1}^n\langle S,X_j\rangle^2$ yields the usual least squares estimator $$\hat\rho := \mathop{\rm argmin}_{S\in{\mathcal S}_m}\,n^{-1}\sum_{j=1}^n\big(Y_j-\langle S,X_j\rangle\big)^2. \eqno(3.3)$$ Note that we also have, for any $\varepsilon>0$, $$\check\rho = \mathop{\rm argmin}_{S\in{\mathcal S}_m}\Big[m^{-2}\|S\|_2^2 - \frac2n\sum_{j=1}^n Y_j\langle S,X_j\rangle + \varepsilon\|S\|_1\Big],$$ since, for $S\in{\mathcal S}_m$, $\|S\|_1={\rm tr}(S)=1$. Thus, $\check\rho$ coincides with the nuclear norm penalized least squares estimator (also called the matrix LASSO estimator) for any value of the regularization parameter $\varepsilon$.
We will show that upper bounds on the error rates in the Schatten $p$-norm distances for $p\in[1,\infty]$ and in the Bures distance that match the minimax lower bounds of theorems 1, 2 and 3 up to logarithmic factors hold for the estimator $\check\rho$. We will then introduce a simple modification of this estimator for which a matching upper bound also holds for the Kullback--Leibler distance.
First, we consider the case of the Gaussian trace regression model (Assumption 1). We need the additional assumption that $\sigma_\xi\ge Um^{-1/2}$ (the variance of the noise is not too small).
Similarly, in the case of trace regression with a bounded response, the following result holds.
For completeness, we also state the upper bounds in the case of Pauli measurements (which immediately follow from Theorem 5).
The proof of these results relies on the following fact that might be of independent interest and that essentially shows that π Sm (Z) is the closest density matrix to Z not only in the Hilbert-Schmidt norm distance, but also in the operator norm distance.
The proof of this theorem will be given in Section 4. Here we use it to establish the next result that is the main ingredient of the proofs of theorems 4, 5 and 6.
Proof. Let $S=\sum_{j=1}^r\lambda_j(\phi_j\otimes\phi_j)$ be the spectral decomposition of $S$ with eigenvalues $\lambda_j$ and eigenvectors $\phi_j$. Let $L := {\rm supp}(S)$ be the linear span of the vectors $\phi_1,\dots,\phi_r\in{\mathbb C}^m$. Denote by $P_L, P_{L^\perp}$ the orthogonal projection operators onto the subspace $L$ and its orthogonal complement $L^\perp$, respectively. We will need the following projection operators ${\mathcal P}_L, {\mathcal P}_L^\perp:{\mathbb H}_m\to{\mathbb H}_m$: $${\mathcal P}_L^\perp A := P_{L^\perp}AP_{L^\perp},\qquad {\mathcal P}_L A := A - P_{L^\perp}AP_{L^\perp}.$$ The following bounds are obvious: $\|{\mathcal P}_L^\perp A\|_1\le\|A\|_1$ and $\|{\mathcal P}_L A\|_1\le 2\|A\|_1$. Since $S=P_LSP_L$, we can use the pinching inequality for the unitarily invariant norm $\|\cdot\|_1$ (see [4], p. 97) to get:

It follows from the last bound that $\|{\mathcal P}_L^\perp(S'-S)\|_1\le\|{\mathcal P}_L(S'-S)\|_1$. Since $\dim(L)=r$, the matrix ${\mathcal P}_L(S'-S)$ is of rank at most $2r$. This implies that $$\|{\mathcal P}_L(S'-S)\|_1\le 2r\,\|{\mathcal P}_L(S'-S)\|_\infty\le 4r\,\|S'-S\|_\infty.$$ Therefore, $\|S'-S\|_1\le 8r\|S'-S\|_\infty$, and since also $\|S'-S\|_1\le 2$ for $S,S'\in{\mathcal S}_m$, we conclude that $$\|S'-S\|_1\le\min\big(8r\|S'-S\|_\infty,\,2\big).$$ Together with the interpolation inequality $\|S'-S\|_p\le\|S'-S\|_1^{1/p}\|S'-S\|_\infty^{1-1/p}$, this yields the claim for all $p\in[1,\infty]$.

Proof of Theorem 8. It immediately follows from Theorem 7 that, for all $S\in{\mathcal S}_m$, $\|\pi_{{\mathcal S}_m}(Z)-S\|_\infty\lesssim\|Z-S\|_\infty$. If $S\in{\mathcal S}_m$ is a density matrix of rank $r$, the last bound can be combined with the bound of Lemma 1 to get, for all $p\in[1,+\infty]$, $$\|\pi_{{\mathcal S}_m}(Z)-S\|_p\lesssim\min\big(r^{1/p}\|Z-S\|_\infty,\ \|Z-S\|_\infty^{1-1/p}\big).$$

Proof of theorems 4, 5 and 6. We use the bound of Theorem 8 with $Z=\hat Z$ and $S=\rho\in{\mathcal S}_{r,m}$, which yields $$\|\check\rho-\rho\|_p\lesssim\min\big(r^{1/p}\|\hat Z-\rho\|_\infty,\ \|\hat Z-\rho\|_\infty^{1-1/p}\big). \eqno(3.11)$$ The rest of the proof is based on a standard application of matrix Bernstein type inequalities; we give a detailed argument for completeness. Note that $\|\check\rho-\rho\|_p$ in the left-hand side of bound (3.11) is upper bounded by $2$, so, if the Bernstein bound on $\|\hat Z-\rho\|_\infty$ is larger than $1$ (or even $\asymp 1$), it can be replaced by the trivial bound equal to $1$. In the case of Theorem 5, we use the following version of Bernstein's inequality for i.i.d. bounded random matrices (see, e.g., [28]).
Lemma 2. Let $V, V_1,\dots,V_n$ be i.i.d. Hermitian random $m\times m$ matrices with ${\mathbb E}V=0$, $\|V\|_\infty\le U_V$ a.s. and $\sigma_V^2:=\|{\mathbb E}V^2\|_\infty$. Then, for all $t>0$, with probability at least $1-e^{-t}$, $$\Big\|\frac1n\sum_{j=1}^n V_j\Big\|_\infty\lesssim\sigma_V\sqrt{\frac{t+\log(2m)}{n}}\ \bigvee\ U_V\,\frac{t+\log(2m)}{n}.$$

For $V=YX-{\mathbb E}(YX)$, we get, under Assumption 2, that $U_V\le 2U$. It is also well known that, under the same assumption, $\sigma_V^2\le U^2/m$: indeed, if $\{e_i: i=1,\dots,m\}$ is an orthonormal basis of ${\mathbb C}^m$, then $\sum_{j=1}^{m^2}E_j^2 = mI_m$, which implies $\|{\mathbb E}X^2\|_\infty = \frac{1}{m^2}\big\|\sum_{j=1}^{m^2}E_j^2\big\|_\infty=\frac1m$. We use the bound of Lemma 2 with $t=A\log(2m)$, $A\ge 1$, to get that, with probability at least $1-(2m)^{-A}$, $$\|\hat Z-\rho\|_\infty\lesssim Um^{3/2}\sqrt{\frac{A\log(2m)}{n}}\ \bigvee\ Um^2\,\frac{A\log(2m)}{n}.$$ Thus, when the bound on $\|\hat Z-\rho\|_\infty$ is substituted in bound (3.11), it is enough to keep only the first term $Um^{3/2}\sqrt{\frac{A\log(2m)}{n}}$; the second term can be dropped. This implies that, with some constant $C'>0$ (that does not depend on $\rho\in{\mathcal S}_{r,m}$), the inequality $$\|\check\rho-\rho\|_p\le C'\min\Big(r^{1/p}\,Um^{3/2}\sqrt{\frac{A\log(2m)}{n}},\ \Big(Um^{3/2}\sqrt{\frac{A\log(2m)}{n}}\Big)^{1-1/p},\ 1\Big)$$ holds with probability at least $1-(2m)^{-A}$, implying the first bound of Theorem 5. The second bound immediately follows from the inequality $H^2(\rho,\check\rho)\le\|\check\rho-\rho\|_1$ (see (2.2)). Theorem 6 is an immediate consequence of Theorem 5.

The proof of Theorem 4 is very similar. In this case, Assumption 1 holds and it is natural to split $\hat Z-\rho$ into two parts, $$\hat Z-\rho = \Big(\frac{m^2}{n}\sum_{j=1}^n\langle\rho,X_j\rangle X_j-\rho\Big)+\frac{m^2}{n}\sum_{j=1}^n\xi_jX_j, \eqno(3.12)$$ and to bound $\|\hat Z-\rho\|_\infty$ by the triangle inequality. For the first part, an application of the matrix Bernstein inequality of Lemma 2 (with $V=\langle\rho,X\rangle X-{\mathbb E}\langle\rho,X\rangle X$, for which $U_V\le 2U^2$ and $\sigma_V^2\le U^2/m^2$) yields the bound $$\Big\|\frac{m^2}{n}\sum_{j=1}^n\langle\rho,X_j\rangle X_j-\rho\Big\|_\infty\le C\Big(Um\sqrt{\frac{A\log(2m)}{n}}\ \bigvee\ U^2m^2\,\frac{A\log(2m)}{n}\Big) \eqno(3.13)$$ that holds for some absolute constant $C\ge 1$ with probability at least $1-(2m)^{-A}$. As before, if the second term $U^2m^2\frac{A\log(2m)}{n}$ is dominant, the resulting bound is $\gtrsim 1$ and the trivial bound applies. Thus, the second term $U^2m^2\frac{A\log(2m)}{n}$ can be dropped when the bound on $\|\hat Z-\rho\|_\infty$ (of which the right hand side of (3.13) is a part) is substituted in (3.11).
As to the second part of representation (3.12) that involves normal random variables ξ j , it is bounded using another version of matrix Bernstein inequality for not necessarily bounded random matrices (see [16], [15], [18]).
We apply the bound of Lemma 3 in the case when $V:=\xi X$, $\alpha=2$ and $t=A\log(2m)$ for $A\ge 1$. By an easy computation, $\sigma_V^2=\sigma_\xi^2\|{\mathbb E}X^2\|_\infty=\sigma_\xi^2/m$ and $\big\|\|V\|_\infty\big\|_{\psi_2}\lesssim\sigma_\xi U$. This yields the following bound: $$\Big\|\frac{m^2}{n}\sum_{j=1}^n\xi_jX_j\Big\|_\infty\le C\Big(\sigma_\xi m^{3/2}\sqrt{\frac{A\log(2m)}{n}}\ \bigvee\ \sigma_\xi Um^2\,\frac{A\log(2m)}{n}\,\log^{1/2}(4U\sqrt m)\Big), \eqno(3.14)$$ which holds with probability at least $1-(2m)^{-A}$ and with some absolute constant $C\ge 1$.
If the second term in the maximum in the right hand side of (3.14) is dominant, then $Um^{1/2}\sqrt{\frac{A\log(2m)}{n}}\,\log^{1/2}(4U\sqrt m)\gtrsim 1$. Under the condition that $\sigma_\xi\ge Um^{-1/2}$, this implies that also $\sigma_\xi m^{3/2}\sqrt{\frac{A\log(2m)}{n}}\gtrsim 1$. Thus, when the bound in the right hand side of (3.14) (used to control $\|\hat Z-\rho\|_\infty$) is substituted in (3.11), it is enough to keep only the first term in the maximum. Finally, under the assumption $\sigma_\xi\ge Um^{-1/2}$, the first term of bound (3.14) dominates the first term of (3.13), so only this term is needed to control $\|\hat Z-\rho\|_\infty$ in bound (3.11). These considerations imply a bound that holds with some constant $C'>0$ (that does not depend on $\rho\in{\mathcal S}_{r,m}$) and with probability at least $1-(2m)^{-A}$. The first bound of Theorem 4 now follows for all $p\in[1,\infty]$ (which also implies the second bound in view of (2.2)). Here $\|\cdot\|_{\psi_\alpha}$ denotes the $\psi_\alpha$ Orlicz norm in the space of random variables, defined as $\|\eta\|_{\psi_\alpha}:=\inf\{c>0:\ {\mathbb E}\exp(|\eta|^\alpha/c^\alpha)\le 2\}$.

It turns out that for a slightly modified version of the estimator $\check\rho$, the minimax lower bounds are also attained (up to logarithmic factors) in the case of the Kullback--Leibler distance. For $S\in{\mathcal S}_m$ and $\delta\in[0,1]$, define $$S_\delta := (1-\delta)S+\delta\,\frac{I_m}{m}.$$ Clearly, $S_\delta\in{\mathcal S}_m$. Let ${\mathcal S}_{m,\delta}:=\{S_\delta: S\in{\mathcal S}_m\}$. Define the projection $\pi_{{\mathcal S}_{m,\delta}}(Z)$ of $Z\in{\mathbb H}_m$ onto the convex set ${\mathcal S}_{m,\delta}$: $$\pi_{{\mathcal S}_{m,\delta}}(Z):=\mathop{\rm argmin}_{S\in{\mathcal S}_{m,\delta}}\|Z-S\|_2^2.$$ Let $\check\rho_\delta:=\pi_{{\mathcal S}_{m,\delta}}(\hat Z)$, so that $\check\rho_0=\check\rho$. We will prove the following versions of theorems 4, 5 and 6 for the estimator $\check\rho_\delta$.
Then (3.9) and (3.10) hold for the estimator $\check\rho_\delta$. Moreover, for $A\ge 1$ and some constant $c>0$, a matching bound holds for the Kullback--Leibler distance $K(\rho\|\check\rho_\delta)$.

Proof. We start with the following modification of Theorem 8.
Proof. The following formula is straightforward: for $\delta\in[0,1)$, $$\pi_{{\mathcal S}_{m,\delta}}(Z)=(1-\delta)\,\pi_{{\mathcal S}_m}\Big(\frac{Z-\delta m^{-1}I_m}{1-\delta}\Big)+\delta\,\frac{I_m}{m},$$ implying the claim. Let $S\in{\mathcal S}_{r,m}$. Then, for $p\in[1,\infty]$, $$\|\check\rho_\delta-S\|_p\le\|\check\rho_\delta-S_\delta\|_p+\|S_\delta-S\|_p.$$ To control the first term in the right hand side, we use the bound of Theorem 8, which requires bounding $\|\hat Z-S_\delta\|_\infty$ in terms of $\|\hat Z-S\|_\infty$ and $\delta$. Using bounds (3.18), (3.19) along with the bound of Theorem 8, we get the bound of the lemma.
We will use the bound of Lemma 4 to control $\|\check\rho_\delta-\rho\|_p$ for $\rho\in{\mathcal S}_{r,m}$. To this end, we need to bound $\|\hat Z-\rho\|_\infty$ using matrix Bernstein inequalities, exactly as it was done in the proof of theorems 4, 5 and 6 (under the assumptions of these theorems). Denote by $\hat\Delta$ such an upper bound on $\|\hat Z-\rho\|_\infty$ that holds with probability at least $1-(2m)^{-A}$, and let $\Delta:=\min(\hat\Delta,1)$, a bound that holds with the same probability at least $1-(2m)^{-A}$. We can replace $\hat\Delta$ by $\Delta$ since the left hand side satisfies $\|\check\rho_\delta-\rho\|_p\le 2$; for the same reason, we can and do drop the "exponential parts" of the matrix Bernstein bounds, leaving in the definition of $\Delta$ only the "Gaussian parts". For $\delta\lesssim\Delta$, we get $$\|\check\rho_\delta-\rho\|_p\lesssim\min\big(r^{1/p}\Delta,\ \Delta^{1-1/p}\big).$$
The bound on the Kullback--Leibler divergence $K(\rho\|\check\rho_\delta)$ is an immediate consequence of the bound on $\|\check\rho_\delta-\rho\|_1$ and the next lemma, which follows from Corollary 1 in [3].
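The affine reduction used in the proof above makes $\check\rho_\delta$ as easy to compute as $\check\rho$; a minimal numpy sketch (the helper `project_density` re-implements $\pi_{{\mathcal S}_m}$ via the spectral and simplex reduction of Section 4):

```python
import numpy as np

def _project_simplex(z):
    # sort-based Euclidean projection of z onto the probability simplex
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(z) + 1)
    j = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    return np.clip(z + (1.0 - css[j]) / (j + 1), 0, None)

def project_density(Z):
    # pi_{S_m}(Z): project the spectrum, keep the eigenvectors
    lam, V = np.linalg.eigh(Z)
    return (V * _project_simplex(lam)) @ V.conj().T

def project_density_delta(Z, delta):
    """Projection of Z onto S_{m,delta} = {(1-delta) S + delta I/m : S in S_m}.

    Since S_{m,delta} is an affine image of S_m, the projection reduces to
    pi_{S_m} applied to the shifted and rescaled matrix."""
    m = Z.shape[0]
    T = project_density((Z - delta * np.eye(m) / m) / (1.0 - delta))
    return (1.0 - delta) * T + delta * np.eye(m) / m

Z = np.array([[1.2, 0.3], [0.3, -0.4]])
rho_delta = project_density_delta(Z, 0.1)
print(np.trace(rho_delta).real)              # 1.0
print(np.linalg.eigvalsh(rho_delta).min())   # >= delta/m = 0.05
```

The smallest eigenvalue of $\check\rho_\delta$ is at least $\delta/m$, which is exactly what makes $\log\check\rho_\delta$, and hence the Kullback--Leibler bound, well defined.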
We conclude this section with a simple result concerning the least squares estimator $\hat\rho$ defined by (3.3). It shows that the estimators $\check\rho$ and $\hat\rho$ are close in the Hilbert--Schmidt norm. As a result, the bounds of the previous theorems can be applied to the estimator $\hat\rho$ as well (at least, under some additional assumptions).
A similar analysis of the convex optimization problem (3.3) shows that $\hat\rho$ satisfies analogous necessary conditions of minimum, which can be rewritten in a form comparable with those for $\check\rho$. It follows from (3.22) that $\|\hat\rho-\check\rho\|_2$ is controlled by the operator norm of the difference between the empirical and the true Gram operators. Since $\|\rho\|_2\le 1$, we get a bound in terms of $\big\|n^{-1}\sum_{j=1}^n X_j\otimes X_j-{\mathbb E}(X\otimes X)\big\|$. It remains to control this operator norm, for which we can again use the matrix Bernstein inequality of Lemma 2, applying it to $V=X\otimes X-{\mathbb E}(X\otimes X)$. In this case, (3.23) along with the bound of Lemma 2 with $t=A\log(2m^2)$, $A\ge 1$, yield the following inequality: $$\|\hat\rho-\check\rho\|_2\lesssim m\sqrt{\frac{A\log(2m)}{n}}\ \bigvee\ m^2\,\frac{A\log(2m)}{n},$$ which holds with probability at least $1-(2m^2)^{-A}$. Since $\|\hat\rho-\check\rho\|_2\le 2$, the second term $m^2\frac{A\log(2m)}{n}$ in the right hand side can be dropped (if this term is dominant, the bound is $\gtrsim 1$). This completes the proof of the theorem.
Since $\|\hat\rho-\check\rho\|_\infty\le\|\hat\rho-\check\rho\|_2$, the bound of Theorem 12 also holds for $\|\hat\rho-\check\rho\|_\infty$. Combining this with the bound of Theorem 5 for $p=\infty$, it is easy to conclude that, under the conditions of this theorem, the last bound holds (with a proper choice of constant in the relationship $\lesssim$) with probability at least $1-(2m)^{-A}$. In view of Lemma 1, this immediately implies that all the bounds of Theorem 5 also hold for the least squares estimator $\hat\rho$. In the special case of Pauli measurements, this means that Theorem 6 holds for the estimator $\hat\rho$. Concerning Theorem 9, the same conclusion is true under the additional assumption that $\sigma_\xi\ge m^{-1/2}$; a similar remark applies to the corresponding modification $\hat\rho_\delta$ of the estimator $\hat\rho$.

The proofs of the properties of the projection $\pi_{{\mathcal S}_m}$ rely on the following two facts: 1. For any unitary matrix $U$ and any $Z\in{\mathbb H}_m$, $\pi_{{\mathcal S}_m}(UZU^{-1})=U\pi_{{\mathcal S}_m}(Z)U^{-1}$. 2. If $D\in{\mathbb H}_m$ is a diagonal matrix, then $\pi_{{\mathcal S}_m}(D)\in{\mathcal S}_m^d$, the set of diagonal density matrices.
Proof. To prove the first claim, note that, by the unitary invariance of the Hilbert--Schmidt norm, $$\|UZU^{-1}-USU^{-1}\|_2=\|Z-S\|_2.$$ In addition, the mapping $S\mapsto USU^{-1}$ is a bijection from the set ${\mathcal S}_m$ onto itself. This immediately implies the first claim. For an $m\times m$ matrix $A=(a_{ij})_{i,j=1}^m\in{\mathbb H}_m$, let $A^d$ be the diagonal matrix with diagonal entries $a_{ii}$, $i=1,\dots,m$. It is easy to see that if $A$ is a density matrix, then $A^d$ is also a density matrix. Moreover, it is also obvious that, for a diagonal matrix $D$, $\|D-A^d\|_2\le\|D-A\|_2$, with a strict inequality if $A$ is not diagonal. These observations immediately imply the second claim.
We will now state and prove a vector version of Theorem 7 in which the role of the set of density matrices ${\mathcal S}_m$ is played by the simplex $$\Delta_m := \Big\{u=(u_1,\dots,u_m)\in{\mathbb R}^m:\ u_j\ge 0,\ \sum_{j=1}^m u_j=1\Big\}$$ in ${\mathbb R}^m$ (this is equivalent to considering the set of diagonal density matrices). We will then show that the matrix version of the problem reduces to the vector case.
Define $\pi_{\Delta_m}(z):=\mathop{\rm argmin}_{u\in\Delta_m}|z-u|^2$. Since $u\mapsto|z-u|^2$ is strictly convex and $\Delta_m$ is a compact convex set, such a minimizer exists and is unique. In other words, $\pi_{\Delta_m}(z)$ is the projection of the point $z\in{\mathbb R}^m$ onto the simplex $\Delta_m$ (the closest point to $z$ in the set $\Delta_m$ with respect to the Euclidean $\ell_2^m$-distance). The next lemma shows that the same point also minimizes the $\ell_\infty^m$-distance from $z$ to the simplex $\Delta_m$. Proof. This is an immediate consequence of Proposition 1 and the following simple fact. To complete the proof of Theorem 7, observe that, in view of lemmas 6 and 7, the matrix problem reduces to the vector one. Without loss of generality, assume that $d_1\ge\dots\ge d_m$. Let $S\in{\mathcal S}_m$ be a density matrix with eigenvalues $v_1\ge\dots\ge v_m$. Clearly, $v=(v_1,\dots,v_m)\in\Delta_m$. Therefore, $$\|D-S\|_\infty\ge\max_{1\le j\le m}|d_j-v_j|,$$ where to get the last bound we used Weyl's perturbation inequality (see [4], Corollary III.2.6).
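The $\ell_\infty$-optimality of the Euclidean projection onto the simplex, which is the content of the lemma above, can be checked numerically: among random points of $\Delta_m$, none is closer to $z$ in the sup-norm than $\pi_{\Delta_m}(z)$. A small sketch (the sampling parameters are illustrative):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection onto Delta_m (sort-based, O(m log m))."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(z) + 1)
    j = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    return np.clip(z + (1.0 - css[j]) / (j + 1), 0, None)

rng = np.random.default_rng(1)
m = 6
z = rng.normal(size=m)
p = project_simplex(z)
best = np.abs(z - p).max()           # ell_infty distance from z to pi(z)
for _ in range(2000):
    w = rng.dirichlet(np.ones(m))    # random point of the simplex
    # no simplex point beats the Euclidean projection in sup-norm
    assert np.abs(z - w).max() >= best - 1e-9
print(best)
```

This is only a randomized sanity check of the lemma, of course, not a proof; the proof reduces the claim to elementary properties of the shift-and-clip form of the projection.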

5. Comments on computational aspects of the problem.
An advantage of the minimal distance estimator $\check\rho=\pi_{{\mathcal S}_m}(\hat Z)$ is the simplicity of its computational implementation. The computation of the matrix $\hat Z=\frac{m^2}{n}\sum_{i=1}^n Y_iX_i$ requires $O(nm^2)$ operations. It is followed by an eigen-decomposition of $\hat Z$, which requires $O(m^3)$ operations (see [10]); there exist efficient software packages designed for this kind of task, for instance LINPACK, PROPACK, etc. As shown in the previous section, the problem of computing $\pi_{{\mathcal S}_m}(\hat Z)$ then reduces to projecting the vector of eigenvalues of $\hat Z$, arranged in non-increasing order, onto the simplex $\Delta_m$. The last problem has been studied in the literature (see [24], [27], [8]) and it has an explicit solution of computational complexity proportional to $m$ (see the proof of Lemma 6). Thus, the computational implementation of the minimal distance estimator $\check\rho$ requires $O((n+m)m^2)$ operations.
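The full pipeline (form $\hat Z$, eigen-decompose, project the spectrum) can be sketched end to end. The simulation below is an illustrative assumption: a rank-1 target state, a generic orthonormal Hermitian basis in place of the Pauli basis, and Gaussian measurement noise.

```python
import numpy as np

def project_simplex(z):
    """Sort-based Euclidean projection onto the simplex (O(m log m))."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(z) + 1)
    j = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    return np.clip(z + (1.0 - css[j]) / (j + 1), 0, None)

def hermitian_basis(m):
    """An orthonormal basis of H_m (m^2 matrices) in the HS inner product."""
    basis = []
    for i in range(m):
        D = np.zeros((m, m), dtype=complex)
        D[i, i] = 1.0
        basis.append(D)
    for i in range(m):
        for j in range(i + 1, m):
            S = np.zeros((m, m), dtype=complex)
            S[i, j] = S[j, i] = 1.0 / np.sqrt(2)
            basis.append(S)
            T = np.zeros((m, m), dtype=complex)
            T[i, j] = -1j / np.sqrt(2)
            T[j, i] = 1j / np.sqrt(2)
            basis.append(T)
    return basis

def minimal_distance_estimator(X, Y):
    """rho_check = pi_{S_m}(Z_hat), Z_hat = (m^2/n) sum_j Y_j X_j."""
    m = X[0].shape[0]
    Zhat = (m ** 2 / len(X)) * sum(y * x for x, y in zip(X, Y))  # O(n m^2)
    lam, V = np.linalg.eigh(Zhat)                                # O(m^3)
    return (V * project_simplex(lam)) @ V.conj().T               # O(m log m) + O(m^2)

# toy experiment: rank-1 target, noisy Fourier coefficients of a random design
rng = np.random.default_rng(3)
m, n, sigma = 4, 20000, 0.1
psi = rng.normal(size=m)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi)                         # rank-1 density matrix
basis = hermitian_basis(m)
idx = rng.integers(m * m, size=n)
X = [basis[i] for i in idx]
Y = [np.trace(rho @ x).real + sigma * rng.normal() for x in X]
rho_check = minimal_distance_estimator(X, Y)
print(np.linalg.norm(rho_check - rho))           # small Hilbert-Schmidt error
```

Only one eigen-decomposition is ever performed, in line with the $O((n+m)m^2)$ operation count discussed above.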
The matrix version of the LASSO estimator for density matrices is equivalent to solving the optimization problem (5.1) that defines the least squares estimator. Clearly, there is no explicit solution of this optimization problem and it is usually solved by iterative algorithms. For example, a well-known iterative singular value thresholding (SVT) algorithm was proposed in [5] and also implemented in quantum compressed sensing in [9]. The main idea is that (5.1) is equivalent, for any $\tau>0$, to an optimization problem over couples $(S,Z)$, $S\in{\mathcal S}_m$, $Z\in{\mathbb H}_m$, subject to the constraint $S=Z$. The proposed algorithm updates $Z$ and $S$ alternately, with the only constraint for $S$ being that $S\in{\mathcal S}_m$. Therefore, the main ingredient of SVT is an iterative updating rule (5.2) (with the initialization $Z_0=0$), $k=1,2,\dots$, with certain pre-determined step sizes $\delta_k>0$. The algorithm terminates at some step $k=N$ and outputs $S_N\in{\mathcal S}_m$ when $\|S_N-S_{N-1}\|_2\le\epsilon$ for some numerical threshold $\epsilon>0$. It is clear that the minimal distance estimator $\check\rho$ can be produced by the above algorithm with one iteration and the initialization $Z_0=\hat Z$, $\delta_1=0$. When the number of qubits $k$ is not small (for instance, about 20) and the dimension $m$ is very large, the iterative algorithm (5.2) is much more computationally expensive than the algorithm for the minimal distance estimator (since every iteration requires the eigen-decomposition of a high dimensional matrix).