Estimation of low rank density matrices by Pauli measurements

Density matrices are positive semi-definite Hermitian matrices with unit trace that describe the states of quantum systems. Many quantum systems of physical interest can be represented as high-dimensional low rank density matrices. A central problem in {\it quantum state tomography} (QST) is to estimate the unknown low rank density matrix of a quantum system by conducting Pauli measurements. Our main contribution is twofold. First, we establish the minimax lower bounds in Schatten $p$-norms with $1\leq p\leq +\infty$ for low rank density matrix estimation by Pauli measurements. In our previous paper, these minimax lower bounds were proved under the trace regression model with Gaussian noise, where the noise was assumed to have common variance. In this paper, we prove these bounds under the Binomial observation model, which matches the actual measurement model in QST. Second, we study the Dantzig estimator (DE) for estimating the unknown low rank density matrix under the Binomial observation model with Pauli measurements. In previous papers, we studied the least squares estimator and the projection estimator, proving the optimal convergence rates for the least squares estimator in Schatten $p$-norms with $1\leq p\leq 2$ and, under a stronger condition, the optimal convergence rates for the projection estimator in Schatten $p$-norms with $1\leq p\leq +\infty$. In this paper, we show that the results of these two distinct estimators can be simultaneously obtained by the Dantzig estimator. Moreover, better convergence rates in Schatten norm distances can be proved for the Dantzig estimator under conditions weaker than those needed in previous papers. When the objective function of the DE is replaced by the negative von Neumann entropy, we obtain a sharp convergence rate in Kullback-Leibler divergence.


Introduction
Let $\mathbb{H}_m$ be the set of all Hermitian matrices, $\mathbb{H}_m := \{A \in \mathbb{C}^{m \times m} : A = A^*\}$, with $A^*$ denoting the adjoint matrix of $A$. For $A \in \mathbb{H}_m$, $\mathrm{tr}(A)$ denotes the trace of $A$ and $A \succeq 0$ means that $A$ is positive semi-definite (i.e., all its eigenvalues are nonnegative). Let $\mathcal{S}_m := \{S \in \mathbb{H}_m : S \succeq 0,\ \mathrm{tr}(S) = 1\}$ be the set of all positive semi-definite $m \times m$ Hermitian matrices of unit trace, which are called density matrices.
In quantum mechanics, the state of a quantum system is often characterized (or at least approximated) by a density matrix $\rho \in \mathcal{S}_m$. The goal of quantum state tomography (QST) is to estimate the unknown state $\rho$ based on a number of measurements conducted on systems prepared in state $\rho$ (see [7], [6], [11], [3] and references therein). The difficulty of QST is that the dimension $m$ grows exponentially as the system size increases. For instance, for a quantum system consisting of $b$ qubits, its density matrix $\rho \in \mathcal{S}_m$ with $m = 2^b$. Fortunately, the density matrices of many important quantum states (for instance, pure states) are of low rank, which can significantly reduce the complexity of the estimation problem. In this paper, we focus on the following set of low rank density matrices:
$$\mathcal{S}_{m,r} := \{S \in \mathcal{S}_m : \mathrm{rank}(S) \leq r\}. \tag{1.1}$$
Typically, the statistical model of QST is as follows. Given an observable $A \in \mathbb{H}_m$ (in this paper, $A$ represents a Pauli matrix) with spectral representation $A = \sum_{j=1}^{m'} \lambda_j P_j$, where $m' \leq m$, $\lambda_j$ are the distinct eigenvalues of $A$ and $P_j$ are the corresponding mutually orthogonal eigenprojectors, the outcome of a measurement of $A$ on the system prepared in state $\rho$ is a random variable $O$ taking value $\lambda_j$ with probability $\mathrm{tr}(\rho P_j)$. Put differently, $\mathbb{P}(O = \lambda_j) = \langle \rho, P_j \rangle$, $j = 1, 2, \ldots, m'$. It is then easy to check that $\mathbb{E}_\rho O = \mathrm{tr}(\rho A)$, and the variance of the outcome $O$ depends on both $\rho$ and $A$. Usually, given an observable $A$, multiple measurements of $A$ are performed on independently and identically prepared quantum systems and the average outcome $Y$ is taken as the output, whose variance can be significantly smaller than that of the outcome $O$ from a single measurement. For instance, if the observable $A$ is used to conduct measurements on $K$ independently and identically prepared quantum systems, producing the outcomes $O_1, \ldots, O_K$, then $Y$ is taken as $Y := K^{-1}\sum_{k=1}^K O_k$. Typically, there are many possible choices for the observable $A$.
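The averaging step described above is easy to simulate numerically. The following NumPy sketch (not part of the paper's formal development; the function name `measure` is ours) draws $K$ outcomes of an observable $A$ on a system in state $\rho$, each eigenvalue $\lambda_j$ occurring with probability $\mathrm{tr}(\rho P_j)$, and returns the average outcome $Y$:

```python
import numpy as np

def measure(rho, A, K, rng):
    """Simulate K measurements of observable A on systems prepared in state rho
    and return the average outcome Y = (O_1 + ... + O_K) / K.

    A single outcome is an eigenvalue lambda_j of A, drawn with probability
    tr(rho P_j); for a unit eigenvector v_j this probability is <v_j, rho v_j>."""
    lam, V = np.linalg.eigh(A)                       # A = sum_j lam_j v_j v_j^*
    probs = np.real(np.einsum('ji,jk,ki->i', V.conj(), rho, V))
    probs = np.clip(probs, 0.0, None)                # guard tiny negative round-off
    outcomes = rng.choice(lam, size=K, p=probs / probs.sum())
    return outcomes.mean()

rng = np.random.default_rng(0)
rho = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)   # pure state |0><0|
sigma_z = np.diag([1.0, -1.0]).astype(complex)
Y = measure(rho, sigma_z, K=1000, rng=rng)
# the pure state |0> always yields outcome +1 when measuring sigma_z, so Y = 1
```

For a mixed state the average $Y$ concentrates around $\mathrm{tr}(\rho A)$ at rate $1/\sqrt{K}$, which is exactly the variance-reduction effect exploited in the text.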
A common approach is to choose an observable $A$ at random, assuming that it is the value of a random variable $X$ with some design distribution $\Pi$ on a subset of $\mathbb{H}_m$. More precisely, given a sample of $n$ i.i.d. copies $X_1, \ldots, X_n$ of $X$, (multiple) measurements are performed for each of them on quantum systems identically prepared in state $\rho$, resulting in the (average) outcomes $Y_1, \ldots, Y_n$. Based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$, the goal is to estimate the underlying density matrix $\rho \in \mathcal{S}_m$. Clearly, the observations satisfy the trace regression model $Y_j = \mathrm{tr}(\rho X_j) + \xi_j$, $j = 1, \ldots, n$, where the noise $\xi_j$ satisfies $\mathbb{E}(\xi_j \mid X_j) = 0$. Minimax lower bounds for this problem can be stated in Schatten $p$-norms for all $1 \leq p \leq +\infty$. In the trace regression model with Gaussian noise, sharp upper bounds in Schatten $p$-norms were proved in [14] for a least squares estimator (denoted by $\hat{\rho}$ in what follows). In particular, these upper bounds (up to logarithmic terms) in Schatten norms take the form (1.3), assuming that $V(\rho, X_j) \leq \sigma_\xi^2$ for all $X_j$, for some constant $C > 0$. These bounds hold as long as $\sigma_\xi \gtrsim \frac{1}{\sqrt{nm}}$ for Pauli measurements. It is interesting to notice that the second term in (1.3) implies that $\frac{m}{n} \cdot (\text{logarithmic factors}) \to 0$ is enough to guarantee $\|\hat{\rho} - \rho\|_2 \to 0$ as $m, n \to \infty$. Convergence rates in other Schatten norm distances were proved in a recent work [25], where a projection estimator is considered and upper bounds of the form (1.3) are obtained for all $1 \leq p \leq +\infty$. The bounds established in [25] rely crucially on the assumption that $\sigma_\xi \geq \frac{1}{m}$ for Pauli measurements (if $\sigma_\xi \leq \frac{1}{m}$, the bound holds with $\sigma_\xi$ replaced by $\frac{1}{m}$). One question, then, is whether the performances of the least squares estimator and the projection estimator can be simultaneously achieved by a single estimator. Another question is what bounds can be obtained when $\sigma_\xi$ is small. We seek to answer these questions by considering a Dantzig-type estimator for Pauli measurements. Its Schatten $p$-norm convergence rates have the same form as (1.3) for $1 \leq p \leq 2$ as long as $\sigma_\xi \gtrsim \frac{1}{\sqrt{nm}}$.
In addition, when $\sigma_\xi \gtrsim \frac{1}{m}$, its Schatten $p$-norm convergence rates can be obtained for all $1 \leq p \leq +\infty$. In other words, the Dantzig-type estimator achieves the performance of both the least squares estimator in [14] and the projection estimator in [25], under the same conditions. Another advantage is that nontrivial convergence rates in Schatten $p$-norms for $1 \leq p \leq +\infty$ can also be obtained under weaker conditions on $\sigma_\xi$. A summary of these results can be found in Section 5.

Notations
The notation $\langle \cdot, \cdot \rangle$ is used both for the Euclidean inner product in $\mathbb{C}^m$ and for the Hilbert-Schmidt inner product in $\mathbb{H}_m$. $\|\cdot\|_p$, $p \geq 1$ will be used for the Schatten $p$-norms, $\|A\|_p := \big(\sum_j |\lambda_j(A)|^p\big)^{1/p}$, with $\lambda_j(A)$ being the eigenvalues of $A$. In particular, $\|\cdot\|_2$ denotes the Hilbert-Schmidt (or Frobenius) norm, $\|\cdot\|_1$ denotes the nuclear (or trace) norm and $\|\cdot\|_\infty$ denotes the operator (or spectral) norm. In what follows, $\Pi$ will typically be the uniform distribution on an orthonormal basis $\mathcal{E} = \{E_1, \ldots, E_{m^2}\} \subset \mathbb{H}_m$, implying that $\|A\|_{L_2(\Pi)}^2 = \mathbb{E}\langle A, X \rangle^2 = m^{-2}\|A\|_2^2$; so, the $L_2(\Pi)$-norm is just a rescaled Hilbert-Schmidt norm. In addition, let $\Pi_n$ denote the empirical distribution based on the sample $(X_1, \ldots, X_n)$, so that $\|A\|_{L_2(\Pi_n)}^2 = n^{-1}\sum_{i=1}^n \langle A, X_i \rangle^2$. The non-commutative Kullback-Leibler divergence (or relative entropy distance) $K(\cdot\|\cdot)$ is defined as (see also [18]) $K(S_1 \| S_2) := \mathrm{tr}\big(S_1(\log S_1 - \log S_2)\big)$. If $\log S_2$ is not well-defined (for instance, some of the eigenvalues of $S_2$ are equal to $0$), we set $K(S_1 \| S_2) = +\infty$. The symmetrized version of the Kullback-Leibler divergence is defined as $K(S_1; S_2) := K(S_1 \| S_2) + K(S_2 \| S_1)$. $C, C_1, C', c, c'$, etc. will denote constants (that do not depend on the parameters of interest) whose values could change from line to line (or even within the same line) without further notice. For nonnegative $A$ and $B$, $A \lesssim B$ (equivalently, $B \gtrsim A$) means that $A \leq CB$ for some absolute constant $C > 0$, and $A \asymp B$ means that $A \lesssim B$ and $B \lesssim A$ simultaneously. Moreover, by writing $A \lesssim_{\log(m,n,K)} B$, we mean that $A \leq C_1 B \log^{c_1} m \log^{c_2} n \log^{c_3} K$ for some absolute constants $C_1, c_1, c_2, c_3$.
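Both the Schatten $p$-norms and the non-commutative Kullback-Leibler divergence can be computed from eigendecompositions. The following NumPy sketch (helper names are ours) illustrates both, using the convention $0 \log 0 = 0$ and $K(S_1\|S_2) = +\infty$ for singular $S_2$:

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm of a Hermitian matrix: the l_p norm of its eigenvalues."""
    s = np.abs(np.linalg.eigvalsh(A))
    return float(s.max()) if np.isinf(p) else float((s ** p).sum() ** (1.0 / p))

def quantum_kl(S1, S2):
    """K(S1 || S2) = tr(S1 (log S1 - log S2)), with 0 log 0 = 0 and
    K = +infinity whenever S2 is singular."""
    l2, V2 = np.linalg.eigh(S2)
    if l2.min() <= 0:
        return np.inf
    l1, V1 = np.linalg.eigh(S1)
    # matrix logarithms through the spectral decompositions
    log1 = V1 @ np.diag(np.where(l1 > 0, np.log(np.where(l1 > 0, l1, 1.0)), 0.0)) @ V1.conj().T
    log2 = V2 @ np.diag(np.log(l2)) @ V2.conj().T
    return float(np.real(np.trace(S1 @ (log1 - log2))))

mixed = np.eye(2, dtype=complex) / 2          # maximally mixed qubit state
pure = np.diag([1.0, 0.0]).astype(complex)    # rank-one density matrix
assert abs(schatten_norm(pure - mixed, 1) - 1.0) < 1e-12   # eigenvalues +-1/2
assert abs(quantum_kl(pure, mixed) - np.log(2)) < 1e-12    # equals log 2
```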

Sampling from Pauli basis and Binomial observation model
The spin-$\frac{1}{2}$ particle is the simplest example of a two-state quantum system, conventionally called a qubit. The state of a single qubit is determined by its spin: up, down or a superposition of both. The most popular observables for a single qubit system are represented by the Pauli matrices, given as
$$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad \sigma_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \tag{2.1}$$
The matrices $\sigma_x, \sigma_y, \sigma_z$ correspond to the spin along the coordinate axes in $\mathbb{R}^3$. The additional matrix $\sigma_0$ represents a trivial operation on the single qubit system. Define $W_\alpha := \frac{1}{\sqrt{2}}\sigma_\alpha$ for $\alpha = 0, x, y, z$. Then $\{W_0, W_x, W_y, W_z\}$ constitutes an orthonormal basis of $\mathbb{H}_2$, conventionally called the Pauli basis. The Pauli matrices are easily generalized to multi-qubit systems. Indeed, for a system with $b$ qubits, the normalized Pauli matrices are constructed as the tensor products
$$W_{\alpha_1} \otimes \cdots \otimes W_{\alpha_b}, \quad \alpha_1, \ldots, \alpha_b \in \{0, x, y, z\}, \tag{2.2}$$
which actually form an orthonormal basis of $\mathbb{H}_m$ with $m = 2^b$. We rearrange the Pauli matrices in (2.2) as $\mathcal{E} := \{E_1, \ldots, E_{m^2}\}$, where $E_1 := \frac{I_m}{\sqrt{m}}$ and $E_2, \ldots, E_{m^2}$ denote the rest of the Pauli matrices in (2.2). An obvious fact is that $\frac{1}{\sqrt{m}}$ is the only eigenvalue of $E_1$, and $\pm\frac{1}{\sqrt{m}}$ are the eigenvalues of $E_2, \ldots, E_{m^2}$ with the same multiplicity, so that $\mathrm{tr}(E_k) = 0$ for $2 \leq k \leq m^2$. In other words, one has the spectral decomposition $E_k = \frac{1}{\sqrt{m}}P_k^+ - \frac{1}{\sqrt{m}}P_k^-$ with mutually orthogonal projectors satisfying $P_k^+ + P_k^- = I_m$. By measuring $E_k$ on a $b$-qubit system prepared in the state $\rho \in \mathcal{S}_m$ with $m = 2^b$, the outcome $\tau_k$ is a random variable taking values $\pm\frac{1}{\sqrt{m}}$ with probabilities $\langle \rho, P_k^\pm \rangle$, and $\mathbb{E}_\rho \tau_k = \langle \rho, E_k \rangle$ for $1 \leq k \leq m^2$. If we represent $\rho$ in the Pauli basis, then $\langle \rho, P_k^\pm \rangle = \frac{1 \pm \sqrt{m}\langle \rho, E_k \rangle}{2}$. A standard approach in QST is to select $X$ uniformly at random from the Pauli basis $\mathcal{E}$. Then multiple measurements of $X$ are conducted on quantum systems independently prepared in the same (unknown) state $\rho$. Suppose that $K$ measurements are performed, resulting in the outcomes $O_1, \ldots, O_K$. Clearly, $|O_k| = \frac{1}{\sqrt{m}}$ for $1 \leq k \leq K$. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random Pauli matrices sampled uniformly from $\mathcal{E}$ with replacement. For each $X_i$, $K$ measurements are conducted and the outcomes are collected.
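The tensor-product construction (2.2) is straightforward to implement. The sketch below (names are ours) builds the normalized Pauli basis for $b$ qubits and verifies its orthonormality in the Hilbert-Schmidt inner product:

```python
import itertools
import numpy as np

# single-qubit Pauli matrices sigma_0 (identity), sigma_x, sigma_y, sigma_z
sigma = {
    '0': np.eye(2, dtype=complex),
    'x': np.array([[0, 1], [1, 0]], dtype=complex),
    'y': np.array([[0, -1j], [1j, 0]], dtype=complex),
    'z': np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_basis(b):
    """Normalized Pauli basis of H_m, m = 2^b: all b-fold tensor products
    of W_alpha = sigma_alpha / sqrt(2)."""
    basis = []
    for word in itertools.product('0xyz', repeat=b):
        E = np.array([[1.0 + 0j]])
        for a in word:
            E = np.kron(E, sigma[a] / np.sqrt(2))
        basis.append(E)
    return basis

E = pauli_basis(2)                      # m = 4, so m^2 = 16 basis matrices
assert len(E) == 16
# orthonormality in the Hilbert-Schmidt inner product <A, B> = tr(A* B)
G = np.array([[np.trace(A.conj().T @ B) for B in E] for A in E])
assert np.allclose(G, np.eye(16))
```

Note that the first element (the all-`'0'` word) is $I_m/\sqrt{m}$, matching the convention $E_1 = I_m/\sqrt{m}$ used in the text.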
Let $K_i^+$ denote the number of outcomes $+\frac{1}{\sqrt{m}}$ and $K_i^-$ denote the number of outcomes $-\frac{1}{\sqrt{m}}$. Then $K_i^+ + K_i^- = K$. It is clear that $K_i^+$ has a Binomial distribution, $K_i^+ \sim \mathrm{Bin}\big(K, \langle \rho, X_i^+ \rangle\big)$, where $X_i^+$ and $X_i^-$ represent the spectral projectors of $X_i$ corresponding to the eigenvalues $\pm\frac{1}{\sqrt{m}}$. The average outcome $Y_i := \frac{K_i^+ - K_i^-}{K\sqrt{m}}$ then satisfies $Y_i = \langle \rho, X_i \rangle + \xi_i$ with $\mathbb{E}(\xi_i \mid X_i) = 0$ and $\mathrm{Var}(\xi_i \mid X_i) \leq \frac{1}{Km}$. By assuming $\xi_i \mid X_i \sim \mathcal{N}(0, \sigma_\xi^2)$ for all $1 \leq i \leq n$, the minimax lower bounds for estimating low rank $\rho \in \mathcal{S}_{m,r}$ based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ were established in [14]. Clearly, the assumption of Gaussian noise with common variance is not true in practice. In Section 3, we prove the minimax lower bounds based on the data $(X_1, K_1^+), \ldots, (X_n, K_n^+)$, where $K_i^+ \mid X_i$ has a Binomial distribution for $1 \leq i \leq n$. Formally, our model is described as follows.
Assumption 1 (Binomial observation model). Let $\mathcal{E}$ be the Pauli basis as in (2.2). Let $X$ be sampled uniformly from $\mathcal{E}$; then measurements of $X$ are conducted on $K$ quantum systems independently and identically prepared in the state $\rho$. Let $K^+$ be the number of outcomes $+\frac{1}{\sqrt{m}}$ among the $K$ random outcomes. The data $(X_1, K_1^+), \ldots, (X_n, K_n^+)$ consist of $n$ i.i.d. copies of $(X, K^+)$. Let $\mathbb{P}_\rho$ denote the probability distribution of $(X_1, K_1^+), \ldots, (X_n, K_n^+)$.
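Assumption 1 can be simulated directly: since $X$ has eigenvalues $\pm 1/\sqrt{m}$, its positive spectral projector is $X^+ = (I + \sqrt{m}\,X)/2$, and $K^+ \mid X \sim \mathrm{Bin}(K, \langle \rho, X^+ \rangle)$. A minimal NumPy sampling sketch (function name ours):

```python
import numpy as np

def sample_binomial_observations(rho, basis, n, K, rng):
    """Draw n i.i.d. pairs (X_i, K_i^+) under the Binomial observation model:
    X_i uniform on the Pauli basis, K_i^+ ~ Bin(K, <rho, X_i^+>), where
    X_i^+ = (I + sqrt(m) X_i) / 2 projects onto the +1/sqrt(m) eigenspace."""
    m = rho.shape[0]
    idx = rng.integers(0, len(basis), size=n)
    Ks = np.empty(n, dtype=int)
    for t, i in enumerate(idx):
        X_plus = (np.eye(m) + np.sqrt(m) * basis[i]) / 2
        p_plus = float(np.real(np.trace(rho @ X_plus)))
        Ks[t] = rng.binomial(K, min(max(p_plus, 0.0), 1.0))
    return idx, Ks

rng = np.random.default_rng(1)
# one-qubit example: Pauli basis {sigma_0, sigma_x, sigma_y, sigma_z}/sqrt(2)
s = [np.eye(2, dtype=complex),
     np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]], dtype=complex),
     np.array([[1, 0], [0, -1]], dtype=complex)]
basis = [M / np.sqrt(2) for M in s]
rho = np.eye(2) / 2                         # maximally mixed state
idx, Ks = sample_binomial_observations(rho, basis, n=500, K=100, rng=rng)
# for the traceless Pauli matrices, <rho, X^+> = 1/2, so K^+ concentrates at K/2;
# for E_1 = I/sqrt(m), the outcome is +1/sqrt(m) with probability one
```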

Some useful lemmas
The following well-known interpolation inequality for Schatten $p$-norms will be used to extend the bounds proved for $p = 1$ and $p = \infty$ to the whole range of values of $p$. It easily follows from similar bounds in $\ell_p$-spaces.
Lemma 1 (Interpolation inequality). Let $1 \leq p < q < r \leq \infty$ and let $\mu \in [0, 1]$ be such that $\frac{1}{q} = \frac{\mu}{p} + \frac{1-\mu}{r}$. Then, for all $A \in \mathbb{H}_m$, $\|A\|_q \leq \|A\|_p^\mu \|A\|_r^{1-\mu}$.
Given a subspace $L \subset \mathbb{C}^m$, $L^\perp$ denotes the orthogonal complement of $L$ and $P_L$ denotes the orthogonal projection onto $L$. Let $\mathcal{P}_L, \mathcal{P}_L^\perp$ be the orthogonal projection operators in the space $\mathbb{H}_m$ (equipped with the Hilbert-Schmidt inner product) defined as $\mathcal{P}_L^\perp(A) := P_{L^\perp} A P_{L^\perp}$ and $\mathcal{P}_L(A) := A - \mathcal{P}_L^\perp(A)$. These two operators split any Hermitian matrix $A$ into two orthogonal parts, $\mathcal{P}_L(A)$ and $\mathcal{P}_L^\perp(A)$, the first one being of rank at most $2\dim(L)$. Non-commutative (matrix) versions of the Bernstein inequality will be used in what follows. Lemma 2 is an unbounded version of the standard matrix Bernstein inequality (see [21]), whose proof is provided in [11]. Recall that, for any $\alpha \geq 1$, the $\psi_\alpha$-norm of a random variable $Z$ is defined as $\|Z\|_{\psi_\alpha} := \inf\{u > 0 : \mathbb{E}\exp(|Z|^\alpha/u^\alpha) \leq 2\}$. The following lemma will be helpful. It states that, given $S \in \mathcal{S}_m$ with support $L$, the projection of $S_1 - S$ onto $L$ dominates its complement in nuclear norm for all $S_1 \in \mathcal{S}_m$. This is due to the unit trace of all density matrices.
Lemma 3. Let $S \in \mathcal{S}_m$ with $\mathrm{rank}(S) \leq l$ and let $L$ be the linear span of its eigenvectors with nonzero eigenvalues. Then, for any $S_1 \in \mathcal{S}_m$, the following inequality holds:
$$\|\mathcal{P}_L^\perp(S_1 - S)\|_1 \leq \|\mathcal{P}_L(S_1 - S)\|_1. \tag{2.3}$$
Proof. Let $L$ denote the linear space spanned by the first $l$ eigenvectors of $S$ and let $\mathcal{P}_L, \mathcal{P}_L^\perp$ be the corresponding orthogonal projection operators. Since $\mathrm{rank}(S) \leq l$, we have $P_{L^\perp} S P_{L^\perp} = 0$, so that $\mathcal{P}_L^\perp(S_1 - S) = P_{L^\perp} S_1 P_{L^\perp} \succeq 0$ and $\|\mathcal{P}_L^\perp(S_1 - S)\|_1 = \mathrm{tr}(P_{L^\perp} S_1 P_{L^\perp})$, where the last equality is due to the mutual orthogonality between $S$ and $\mathcal{P}_L^\perp(S_1)$. On the other hand, since $\mathrm{tr}(S_1 - S) = 0$, $\|\mathcal{P}_L(S_1 - S)\|_1 \geq |\mathrm{tr}(\mathcal{P}_L(S_1 - S))| = \mathrm{tr}(\mathcal{P}_L^\perp(S_1 - S)) = \|\mathcal{P}_L^\perp(S_1 - S)\|_1$. As a consequence of the definitions above, $\mathcal{P}_L(S_1 - S)$ has rank at most $2l$, which proves the inequality (2.3).
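Lemma 3 is easy to probe numerically. The sketch below (helper names ours) draws random density matrices, forms the projections $\mathcal{P}_L$ and $\mathcal{P}_L^\perp$ with respect to the support of $S$, and checks the nuclear-norm domination (2.3):

```python
import numpy as np

def random_density(m, rank, rng):
    """Random rank-`rank` density matrix G G* / tr(G G*) with Gaussian G."""
    G = rng.standard_normal((m, rank)) + 1j * rng.standard_normal((m, rank))
    S = G @ G.conj().T
    return S / np.real(np.trace(S))

def nuclear(A):
    """Nuclear norm of a Hermitian matrix: sum of absolute eigenvalues."""
    return float(np.abs(np.linalg.eigvalsh((A + A.conj().T) / 2)).sum())

def lemma3_holds(m, l, rng):
    """Check ||P_L^perp(S1 - S)||_1 <= ||P_L(S1 - S)||_1 for the support L of S."""
    S = random_density(m, l, rng)
    S1 = random_density(m, m, rng)
    _, V = np.linalg.eigh(S)
    L = V[:, -l:]                     # top-l eigenvectors span the support of S
    P = L @ L.conj().T                # orthogonal projector onto L
    Pc = np.eye(m) - P
    D = S1 - S
    D_perp = Pc @ D @ Pc              # P_L^perp(D), positive semi-definite here
    D_par = D - D_perp                # P_L(D), rank at most 2l
    return nuclear(D_perp) <= nuclear(D_par) + 1e-9

rng = np.random.default_rng(3)
assert all(lemma3_holds(8, 2, rng) for _ in range(200))
```

The inequality in fact holds with equality replaced by the trace argument in the proof: $\|\mathcal{P}_L^\perp(S_1-S)\|_1$ equals a trace that $\|\mathcal{P}_L(S_1-S)\|_1$ must dominate because $S_1 - S$ is traceless.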

Minimax lower bounds
In this section, we prove the minimax lower bounds in Schatten $p$-norms for estimating $\rho \in \mathcal{S}_{m,r}$ under the Binomial observation model (Assumption 1). These bounds hold for all Schatten $p$-norms with $1 \leq p \leq +\infty$. In addition, we also obtain the minimax lower bound in Kullback-Leibler divergence. Note that the bounds in Theorem 3.1 are equivalent to the bounds proved in [14] under the Gaussian noise model with common variance (by setting $\sigma_\xi^2 \asymp \frac{1}{Km}$ in [14, Theorem 4]).
where $\inf_{\hat\rho}$ is taken over all estimators $\hat\rho$ based on the data $(X_1, K_1^+), \ldots, (X_n, K_n^+)$. Remark 1. Note that $n$ and $K$ are the main parameters of interest in quantum state tomography. The value of $K$ represents the number of quantum systems prepared for each of the Pauli measurements $X_1, \ldots, X_n$. The value of $n$ is closely related to the number of different Pauli measurements that need to be set up in real experiments. Clearly, if $K$ increases, the bounds (3.1) and (3.2) become smaller. In the case $K \asymp m$, the resulting factor $\left(\frac{m}{n}\right)^{1-\frac{1}{p}}$ converges to $0$ for any $p > 1$ as long as $\frac{m}{n} \to 0$. It is worthwhile to compare the bounds (3.1) and (3.2) with the bounds (30) and (31) in [14] under the trace regression model with bounded responses (where each $X_i$ is used as a measurement on only one quantum system). By replacing $n$ with $nK$ in (30) and (31) in [14], we immediately end up with bounds (3.1) and (3.2). If we just focus on the necessary number of identically prepared quantum systems in both models, bounds (3.1) and (3.2) are essentially equivalent to bounds (30) and (31) in [14]. However, bounds (3.1) and (3.2) indicate that the necessary number of different Pauli measurements can be significantly reduced in the Binomial observation model (Assumption 1). For instance, bound (3.2) is nontrivial only when $nK \lesssim m^2 r^2$. Therefore, $nK = O(m^2 r^2)$ random Pauli measurements are needed for bounds (30) and (31) in [14]. Under the uniform sampling scheme, this means that all of the $O(m^2)$ different Pauli measurements will be used. However, in our Binomial observation model (Assumption 1) with $K \asymp m$, bound (3.2) indicates that $O(mr^2)$ different Pauli measurements might be enough to produce an estimate with small error in relative entropy distance. Moreover, if we consider Schatten $p$-norm distances for $p > 1$, the Binomial observation model requires only $n = O(m)$ Pauli measurements, which is significantly smaller than $m^2$.

Dantzig estimator and optimal convergence rates
In view of the minimax lower bounds established in Section 3, it is natural to ask which estimators can achieve these convergence rates under the Binomial observation model (Assumption 1). In [5], [7] and [15], the matrix LASSO estimator and the Dantzig estimator were considered in the setting $n \gtrsim mr\log^6 m$, where convergence rates in Schatten $p$-norms are obtained for $1 \leq p \leq 2$. Those convergence rates match the first term in (3.1) up to logarithmic terms. In our recent paper [14], a least squares estimator was studied and convergence rates in Schatten $p$-norms for $1 \leq p \leq 2$ were proved as long as $K \lesssim n$ (i.e., the noise is large enough). For Schatten $p$-norms with $1 \leq p \leq +\infty$, a different estimator was proposed in [25] based on eigenvalue thresholding, which achieves convergence rates matching the minimax lower bounds (3.2) when $K \lesssim m$. Both the bounds proved in [14] and [25] are nontrivial as long as $n \gtrsim_{\log(m,n)} m$. Clearly, the estimators in [14] and [25] have advantages in different settings. In this section, we show that the advantages of both estimators can be obtained simultaneously by the Dantzig-type estimator. Moreover, nontrivial Schatten $p$-norm bounds for $1 \leq p \leq +\infty$ can be obtained under weaker conditions. Let us begin with the introduction of the Dantzig-type estimator.

Dantzig estimator
Recall that the central problem in QST is to estimate an unknown high-dimensional low rank density matrix $\rho$ based on the data $(X_1, K_1^+), \ldots, (X_n, K_n^+)$ satisfying the Binomial observation model (Assumption 1). Recall that $Y_i := \frac{K_i^+ - K_i^-}{K\sqrt{m}}$. In the following, we consider estimators based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.
The standard matrix Dantzig estimator (or selector) is defined as the solution to the following convex optimization problem:
$$\hat{\rho}^{\epsilon} := \mathop{\mathrm{argmin}}_{S \in \Lambda(\epsilon)} \|S\|_1, \quad \text{where } \Lambda(\epsilon) := \Big\{S : \Big\|\frac{1}{n}\sum_{i=1}^n \big(\langle S, X_i \rangle - Y_i\big)X_i\Big\|_{\infty} \leq \epsilon\Big\}, \tag{4.1}$$
for some $\epsilon \geq 0$. When $\epsilon = 0$, this corresponds to the noiseless setting (i.e., $K = +\infty$, which never happens in reality), where the exact recovery of $\rho$ is the main interest. It was introduced by Candès and Plan [4] for low rank matrix estimation and was applied in quantum state tomography for estimating low rank density matrices; see Liu [15], Gross [6] and Flammia et al. [5]. They also proved sharp (in this paper, "sharp" means optimal up to logarithmic factors) convergence rates in Schatten 1-norm and Schatten 2-norm distances by applying techniques based on the restricted isometry property (RIP), which requires $n \gtrsim_{\log(m)} mr$ Pauli measurements. RIP is a strong assumption, and there are as yet no results on convergence in other Schatten $p$-norms. Moreover, even though the condition $n \gtrsim_{\log(m)} mr$ looks natural for low rank matrix completion or estimation, this condition might not be necessary for density matrix estimation, especially when we focus on Schatten $p$-norms for $p > 1$. The reason is that a density matrix itself essentially has low rank due to its unit trace. Indeed, our results show that $n \gtrsim_{\log(m)} m$ is sufficient to produce a consistent estimate in Schatten $p$-norms with $p > 1$. When $S \in \mathcal{S}_m$, the objective function in (4.1) is identically $1$ and provides no benefit to the optimization problem. Instead, we study the following estimator:
$$\hat{\rho}^{\epsilon} := \mathop{\mathrm{argmin}}_{S \in \mathcal{S}_m \cap \Lambda(\epsilon)} \mathrm{tr}\big(S \log S\big), \tag{4.2}$$
where we have replaced the nuclear norm in (4.1) with the negative von Neumann entropy. The von Neumann entropy of a density matrix $S$ is defined as $E(S) := -\mathrm{tr}(S \log S)$, which is a concave function on $\mathcal{S}_m$, so that (4.2) is actually a convex optimization problem. The von Neumann entropy plays an important role in quantum information theory, and it was used in [11] and [14] as a penalization for the least squares estimator. In this paper, we prove sharp convergence rates of $\hat{\rho}^{\epsilon}$ in all Schatten $p$-norms with $p \in [1, +\infty]$.
It is easy to show that these rates also hold for the standard matrix Dantzig estimator (4.1). As a benefit of the von Neumann entropy in (4.2), we also obtain a sharp convergence rate of $\hat{\rho}^{\epsilon}$ in Kullback-Leibler divergence.
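The feasibility constraint defining $\Lambda(\epsilon)$ is the only data-dependent ingredient of the Dantzig-type programs. In the matrix Dantzig selector literature this constraint reads $\|\frac{1}{n}\sum_i(\langle S, X_i\rangle - Y_i)X_i\|_\infty \leq \epsilon$; assuming that reading, the sketch below (names ours) evaluates the constraint residual for a candidate $S$. Solving the full program would additionally require a semidefinite-programming solver, which we do not attempt here.

```python
import numpy as np

def dantzig_residual(S, Xs, Ys):
    """Spectral norm of (1/n) sum_i (<S, X_i> - Y_i) X_i, the quantity that the
    Dantzig constraint bounds by eps over the feasible set Lambda(eps)."""
    R = np.zeros_like(S, dtype=complex)
    for X, Y in zip(Xs, Ys):
        inner = np.real(np.trace(S.conj().T @ X))    # Hilbert-Schmidt <S, X>
        R += (inner - Y) * X
    R /= len(Xs)
    return float(np.abs(np.linalg.eigvalsh(R)).max())

# noiseless one-qubit example: Xs = the whole Pauli basis, Ys = <rho, X_i>
s = [np.eye(2, dtype=complex),
     np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]], dtype=complex),
     np.array([[1, 0], [0, -1]], dtype=complex)]
Xs = [M / np.sqrt(2) for M in s]
rho = np.diag([1.0, 0.0]).astype(complex)
Ys = [np.real(np.trace(rho @ X)) for X in Xs]
assert dantzig_residual(rho, Xs, Ys) < 1e-12     # rho is feasible for any eps >= 0
assert dantzig_residual(np.eye(2, dtype=complex) / 2, Xs, Ys) > 0.1
```

With noiseless data the true state has residual zero, while a state far from $\rho$ (here the maximally mixed state) violates the constraint for small $\epsilon$, which is exactly how the feasible set localizes the estimator.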

Oracle inequality and Schatten p-norm convergence rates
Theorem 4.1 displays the performance of $\hat{\rho}^{\epsilon}$ through a low rank oracle inequality.
The low rank oracle inequality has been well studied for the (matrix) LASSO estimator (see [10], [12] and [14]). When studying the Dantzig estimator in compressed sensing problems, the sparsity oracle inequality is considered over all oracles in the feasible set (that is, $\Lambda(\epsilon)$ in this paper); see, for example, [9]. It is generally impossible to compare the performance of the estimator with sparse oracles (or low rank oracles in matrix compressed sensing) when they are not in the feasible set. Surprisingly, we can obtain the following low rank oracle inequality for $\hat{\rho}^{\epsilon}$, which actually holds for all oracles in $\mathcal{S}_m$, even when the oracle is infeasible for the optimization problems (4.1) and (4.2). Assume $\rho \in \mathcal{S}_{m,r}$ and let $\hat{\rho}^{\epsilon}$ be defined as in (4.2). For $\epsilon \geq \frac{C_1}{m}\sqrt{\frac{t + \log(2m)}{nK}}$ with any $t \geq 1$ and some large enough constant $C_1 > 0$, there exists a constant $C > 0$ such that (4.4) holds with probability at least $1 - e^{-t}$. If $\epsilon = \frac{C_1}{m}\sqrt{\frac{\log(2m)}{nK}}$, then (4.5) and (4.6) hold with probability at least $1 - \frac{1}{2m}$. Remark 2. The objective function of the optimization problem (4.2) is not involved in the proof of (4.4). Therefore, bound (4.4) also holds for the standard Dantzig estimator (4.1). Moreover, instead of (4.4), we actually prove a stronger bound in Section 6.2, valid for any $S \in \mathcal{S}_m$. Considering $S = \rho$, $t = \log(2m)$ and $\varepsilon = \frac{C_1}{m}\sqrt{\frac{\log(2m)}{nK}}$, it indicates that if $n \geq C' mr\log^3 m \log^3 n$ for a large enough constant $C' > 0$, then (due to Lemma 3) the bound reduces to the canonical result obtained by applying the restricted isometry property (see [15], [4]). This bound depends linearly on $\frac{1}{K}$ (which can be arbitrarily small, even $K = +\infty$); see also Remark 3 after Theorem 4.2 and the discussion in Section 5.
The following corollary is an immediate result of applying an approach similar to [14, Theorem 21]. Note that similar convergence rates (with different logarithmic factors) in Schatten $p$-norms with $1 \leq p \leq 2$ have been proved for the least squares estimator under the condition $K \lesssim n$.

Corollary 1. Let $\hat{\rho}^{\varepsilon}$ be defined as in (4.2) with $\varepsilon := \frac{C}{m}\sqrt{\frac{\log(2m)}{nK}}$ and assume that $K \lesssim n$. Then, for all $1 \leq p \leq 2$, the bound (4.10) holds with probability at least $1 - \frac{1}{2m}$.

Moreover, we get upper bounds for $\|\hat{\rho}^{\varepsilon} - \rho\|_p$ with $1 \leq p \leq +\infty$ if $K \lesssim m$. The bounds in Theorem 4.2 are similar to the bounds proved in [25] for a distinct estimator based on eigenvalue thresholding. Both bounds (4.10) and (4.11) match, up to logarithmic factors, the minimax lower bounds (3.1) in the corresponding settings of $K$. Bound (4.11) holds for all $1 \leq p \leq +\infty$; in the case $K \gtrsim m$, the upper bounds still hold with $K$ replaced by $m$.

Remark 3. Basically, Corollary 1 and Theorem 4.2 indicate that the performances of both estimators in [14] and [25] can be achieved simultaneously by the Dantzig-type estimator $\hat{\rho}^{\varepsilon}$. Moreover, it is worth pointing out that a bound stronger than (4.11) is actually proved in Section 6.2, namely (4.12), which holds with probability at least $1 - \frac{1}{2m}$. Recalling the bound (4.9) when $K \lesssim n$, we then get from (4.12) a bound valid for all $1 \leq p \leq +\infty$. This bound holds for any $n, K$ as long as $K \lesssim n$. It is unclear at this moment whether the least squares estimator in [14] and the projection estimator in [25] can attain this bound in the same settings. Some interesting bounds can be obtained for specific choices of $n$ and $K$. For instance, consider $K \asymp n$ so that $n \asymp \sqrt{nK} \gtrsim mr\log^3 m \log^3 n$; then the resulting bound holds with probability at least $1 - \frac{1}{2m}$. More specifically, for $p = +\infty$ we obtain $\|\hat{\rho}^{\varepsilon} - \rho\|_{\infty} \lesssim_{\log(m,n)} \frac{m^{3/2}\sqrt{r}}{n^{3/2}}$. It is interesting to compare this bound with the bounds established for the projection estimator [25] and the least squares estimator [14]. Since $K \gtrsim m$, it is proved that the spectral norm convergence rate of the projection estimator (similar bounds as in Theorem 4.2) is of the order $\frac{m}{n}$ (with logarithmic factors). Clearly, $\frac{m^{3/2}\sqrt{r}}{n^{3/2}} \leq \frac{m}{n}$. In addition, if we consider the least squares estimator and control its spectral norm convergence rate by its Frobenius norm convergence rate (similar bounds as in Corollary 1), we obtain a simple bound (recall that $K \asymp n$) of order $\frac{m\sqrt{r}}{n}$ (up to logarithmic factors), which clearly dominates $\frac{m^{3/2}\sqrt{r}}{n^{3/2}}$, especially when $n \gtrsim mr^{\alpha}$ for $\alpha > 1$. Basically, we conclude that the Dantzig estimator $\hat{\rho}^{\varepsilon}$ can achieve better convergence rates than the least squares estimator and the projection estimator in the case $n \gtrsim_{\log m} mr$. More discussion is provided in Section 5.

Discussion
The main purpose of this paper is to study the convergence rates of the Dantzig estimator in Schatten $p$-norms for all $1 \leq p \leq +\infty$ and to compare it with the least squares estimator in [14] and the projection estimator in [25]. In this section, we provide a summary of the convergence rates in Schatten norm distances that have been proved in the literature for these estimators. The summary is organized by the different regimes of $n$ and $K$. In the following, consider $n \asymp_{\log(m,n)} mr^{\alpha}$ (assuming that $m$ and $r$ dominate the logarithmic factors).
1. If $0 \leq \alpha < 1$: no estimator has been shown to achieve a nontrivial convergence rate in Schatten 1-norm distance. For Schatten $p$-norms with $p > 1$, the convergence rates are proved in Corollary 1 and Theorem 4.2. These rates can be obtained by the least squares estimator [14] and the projection estimator [25] separately, and simultaneously by the Dantzig-type estimator $\hat{\rho}^{\varepsilon}$; basically, the stated bounds hold with high probability.
2. If $1 \leq \alpha < 2$: assume that $K$ is large enough so that $\sqrt{nK} \gtrsim mr$; then, with high probability, a bound holds whose right-hand side can be equivalently written as $\frac{m}{n}\cdot\frac{1}{r^{\alpha-1}}$. These bounds are nontrivial even when compared with the bounds obtained for the least squares estimator [14] and the projection estimator [25]. Note that there is no upper bound constraint on the value of $K$.
3. If $\alpha \geq 2$: in this case, the convergence rates in Schatten $p$-norms are simple for all $1 \leq p \leq +\infty$ and, together with the minimax lower bounds in Theorem 3.1, they are optimal for all $1 \leq p \leq +\infty$ and $1 \leq K \leq +\infty$.
It is interesting to notice that the condition $n \gtrsim_{\log(m,n)} mr^2$ is also needed to prove the optimal convergence rates in Schatten $p$-norms of the Dantzig estimator for estimating general low rank matrices with Gaussian or Rademacher measurements; see Xia [24]. It is still an open problem whether this condition is necessary.

Acknowledgment
I would like to thank an anonymous reviewer for several helpful suggestions on improving the quality of the paper.

Proof of the minimax lower bounds
Proof of Theorem 3.1. The proof proceeds in several steps. Basically, some techniques from [14] are combined: we construct a subset $\mathcal{S}'_p \subset \mathcal{S}_{m,r}$ whose elements are well separated in Schatten $p$-norm and such that, for each $\rho \in \mathcal{S}'_p$, $|\langle \rho, E_k \rangle| \leq \frac{0.7}{\sqrt{m}}$ for $k = 2, 3, \ldots, m^2$. Then, the minimax lower bounds are established by calculating the Kullback-Leibler divergence between Binomial distributions.
Denote by $\mathcal{G}_{k,l}$ the Grassmann manifold, that is, the set of all $k$-dimensional subspaces $L$ of the $l$-dimensional space $\mathbb{R}^l$. Given such a subspace $L \subset \mathbb{R}^l$ with $\dim(L) = k$, let $P_L$ be the orthogonal projector onto $L$ and let $\mathcal{P}_{k,l} := \{P_L : L \in \mathcal{G}_{k,l}\}$. The set of all $k$-dimensional projectors $\mathcal{P}_{k,l}$ will be equipped with the Schatten $p$-norm distances for all $p \in [1, +\infty]$ (which can also be viewed as distances on the Grassmannian itself): $d_p(Q_1, Q_2) := \|Q_1 - Q_2\|_p$, $Q_1, Q_2 \in \mathcal{P}_{k,l}$. Recall that the $\varepsilon$-packing number of a metric space $(T, d)$ is defined as
$$D(T, d, \varepsilon) := \max\Big\{n : \text{there exist } t_1, \ldots, t_n \in T \text{ such that } \min_{i \neq j} d(t_i, t_j) \geq \varepsilon\Big\}.$$
The packing numbers of $\mathcal{P}_{k,l}$ with respect to the Schatten distances $d_p$ will be needed; they are given in the following lemma (see Pajor [19]).
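The $\varepsilon$-packing number $D(T, d, \varepsilon)$ is bounded below by a greedy construction: scan candidate points and keep each one that is $\varepsilon$-separated from all points kept so far. A small numerical sketch for rank-one projectors in $\mathbb{R}^3$ (names ours):

```python
import numpy as np

def greedy_packing(points, dist, eps):
    """Greedy eps-packing: keep each point at distance >= eps from all kept ones.
    The size of the result is a lower bound on the packing number D(T, d, eps)."""
    kept = []
    for t in points:
        if all(dist(t, s) >= eps for s in kept):
            kept.append(t)
    return kept

rng = np.random.default_rng(4)
l = 3
def rand_rank1_proj():
    v = rng.standard_normal(l)
    v /= np.linalg.norm(v)
    return np.outer(v, v)                 # P_L for a random line L in R^3

d1 = lambda A, B: float(np.abs(np.linalg.eigvalsh(A - B)).sum())  # Schatten 1-distance
pts = [rand_rank1_proj() for _ in range(200)]
packing = greedy_packing(pts, d1, eps=0.5)
# every pair of kept projectors is 0.5-separated in nuclear norm
for i in range(len(packing)):
    for j in range(i + 1, len(packing)):
        assert d1(packing[i], packing[j]) >= 0.5
assert len(packing) >= 2
```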

Lemma 4.
For all integers $1 \leq k \leq l$ such that $k \leq l - k$, and all $1 \leq p \leq \infty$, the following bounds hold with $d = k(l - k)$ and universal positive constants $c, C$.
We will prove the bound (3.1) for $2 \leq r \leq m/2$, since the proof in the case $r = 1$ is simpler. Moreover, the case $r > m/2$ can easily be reduced to the case $r \leq m/2$ by adjusting the constant $c$ in (3.1). According to Lemma 4, there is a subset $\mathcal{D}_p \subset \mathcal{P}_{r-1,m-1}$ such that $\mathrm{Card}(\mathcal{D}_p) \geq 2^{(r-1)(m-r)}$ and, for some positive constant $c'$, $\|Q_1 - Q_2\|_p \geq c'(r-1)^{1/p}$ for any $Q_1, Q_2 \in \mathcal{D}_p$ with $Q_1 \neq Q_2$. Note that any $Q \in \mathcal{P}_{r-1,m-1}$ can be viewed as an $(m-1) \times (m-1)$ positive semi-definite matrix with $\mathrm{tr}(Q) = r - 1$. Then, construct the $m \times m$ density matrix $S_Q$ as in (6.2). It is easy to check that, whenever $\kappa \leq 1$, $S_Q$ is indeed a density matrix with rank at most $r$. Now we take $\kappa := c_1\frac{m(r-1)}{\sqrt{nK}}$ with a small enough absolute constant $c_1 > 0$ and assume that $\kappa \leq \frac{1}{2}$. Define the subset of density matrices $\mathcal{S}_p := \{S_Q : Q \in \mathcal{D}_p\}$; an immediate consequence is $\mathrm{Card}(\mathcal{S}_p) = \mathrm{Card}(\mathcal{D}_p) \geq 2^{(r-1)(m-r)}$ and $\mathcal{S}_p \subset \mathcal{S}_{m,r}$. Moreover, for $S_{Q_1}, S_{Q_2}$ with $Q_1, Q_2 \in \mathcal{D}_p$ and $Q_1 \neq Q_2$, a separation in Schatten $p$-norm holds for some constant $c > 0$. We thus obtain a large enough subset $\mathcal{S}_p$ whose elements are well separated in Schatten $p$-norm, and this holds for any $1 \leq p \leq +\infty$. Recall that $\mathcal{E} := \{E_1, E_2, \ldots, E_{m^2}\}$ is the set of Pauli matrices with $E_1 = \frac{I_m}{\sqrt{m}}$, $I_m$ being the $m \times m$ identity matrix. Now, we construct a subset $\mathcal{S}'_p \subset \mathcal{S}_{m,r}$ such that $\mathrm{Card}(\mathcal{S}'_p) = \mathrm{Card}(\mathcal{S}_p)$ and, for each $S \in \mathcal{S}'_p$, $|\langle S, E_k \rangle| \leq \frac{0.7}{\sqrt{m}}$ for all $2 \leq k \leq m^2$. The following lemma will be needed; its proof is provided in [14, Lemma 9] by choosing $\gamma = 0.2$ there and observing that $\mathrm{tr}(E_k) = 0$ for $2 \leq k \leq m^2$.
Lemma 5. There exists a universal constant $C_1 > 0$ such that, when $m \geq C_1$, there exists a vector $v \in \mathbb{C}^m$ with $\|v\| = 1$ and $|\langle E_k v, v \rangle| \leq \frac{0.2}{\sqrt{m}}$ for all $2 \leq k \leq m^2$. We set $e_1 := v$ from Lemma 5 and extend it to an orthonormal basis $e_1, e_2, \ldots, e_m$. Let $e := [e_1, e_2, \ldots, e_m] \in \mathbb{C}^{m \times m}$ with $e_i$ being the $i$-th column of $e$ for $1 \leq i \leq m$. We then define the subset of density matrices $\mathcal{S}'_p := \{S'_Q : Q \in \mathcal{D}_p\}$. In other words, $S'_Q$ is obtained by viewing $S_Q$ defined in (6.2) as the matrix of a linear transformation in the basis $\{e_1, \ldots, e_m\}$. Since $e$ is a unitary matrix, the Schatten norm separation of $\mathcal{S}_p$ is preserved for $\mathcal{S}'_p$. Moreover, for each $E_k$, $2 \leq k \leq m^2$, we have $|\langle S'_Q, E_k \rangle| \leq \frac{0.7}{\sqrt{m}}$, where we used the fact that $\kappa \leq \frac{1}{2}$. Recall that $\mathbb{P}_\rho$ denotes the probability distribution of $(X_1, K_1^+), \ldots, (X_n, K_n^+)$ with $X_i$ uniformly sampled from $\mathcal{E}$ for each $1 \leq i \leq n$. We are ready to prove the upper bound on the Kullback-Leibler divergence $K(\mathbb{P}_{S'_{Q_1}} \| \mathbb{P}_{S'_{Q_2}})$. Let $\Pi$ denote the distribution of $X$, which is the uniform distribution over $\mathcal{E}$. Recall that $\sqrt{m}\langle S, E_1 \rangle = 1$ for any $S \in \mathcal{S}_m$; as a result, the term in (6.4) equals $0$. To deal with (6.5), we need a simple fact about the Kullback-Leibler divergence between two Binomial distributions.
imsart-ejs ver. 2014/07/30 file: Dantzig_vonNeumann.tex date: January 6, 2017
where the last inequality holds whenever $p, q \in [3/20, 17/20]$. As a result, we obtain the desired bound, where the last inequality holds as long as $c > 0$ is small enough. Then, by [22, Theorem 2.5], there exist universal constants $c, c' > 0$ such that the lower bound holds for all $1 \leq p \leq +\infty$. Since $\mathcal{S}'_p \subset \mathcal{S}_{m,r}$, we get the first term on the right-hand side of bound (3.1). Up to this point, we have assumed $\kappa \leq \frac{1}{2}$. Now, consider $\kappa > \frac{1}{2}$, that is, $c_1\frac{m(r-1)}{\sqrt{nK}} > \frac{1}{2}$. Choose the largest integer $2 \leq r' < r - 1$ such that $c_1\frac{m(r'-1)}{\sqrt{nK}} \leq \frac{1}{2}$. Following the method above, we get (6.6), and the definition of $r'$ implies (6.7). Since $\mathcal{S}_{m,r'} \subset \mathcal{S}_{m,r}$, by combining (6.6) and (6.7), we get the bound (3.1). Following an approach similar to [14, Theorem 4] (by comparing $K(S_1 \| S_2)$ with the squared Hellinger distance), we can get the bound (3.2).
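The "simple fact" used here is that the KL divergence between $\mathrm{Bin}(K, p)$ and $\mathrm{Bin}(K, q)$ equals $K \cdot \mathrm{kl}(p\|q)$, where $\mathrm{kl}$ is the Bernoulli KL divergence, and that $\mathrm{kl}(p\|q) \leq \frac{(p-q)^2}{q(1-q)} \lesssim (p-q)^2$ once $p, q \in [3/20, 17/20]$. A quick numerical check of this inequality (assuming this reading of the garbled display; the constant $400/51 = 1/\min q(1-q)$ on this range):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

grid = np.linspace(3 / 20, 17 / 20, 141)
C = 400 / 51          # 1 / min_{q in [3/20, 17/20]} q(1 - q), since q(1-q) >= 51/400
for p in grid:
    for q in grid:
        # kl(p || q) <= chi^2(p || q) = (p - q)^2 / (q (1 - q)) <= C (p - q)^2
        assert bernoulli_kl(p, q) <= C * (p - q) ** 2 + 1e-12
# KL(Bin(K, p) || Bin(K, q)) = K * bernoulli_kl(p, q), so on this range the
# Binomial KL divergence is at most C * K * (p - q)^2
```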

Proof of Theorem 4.1
The main technical tool of our proof is the following lemma, which gives a probabilistic upper bound on a product empirical process. For any $\Delta \in [0, 1]$, define the set $A(\Delta)$ and the associated quantity as above. Lemma 6. Let $X_1, \ldots, X_n$ be i.i.d. random matrices uniformly sampled from the Pauli basis $\mathcal{E}$. Given $0 < \delta_- < \delta_+$ and $t \geq 1$, let $\bar{t} := t + \log(\log_2(\delta_+/\delta_-) + 3)$.
Then, with some constant $C$ and probability at least $1 - e^{-\bar{t}}$, the stated bound holds for all $\frac{\Delta_1 + \Delta_2}{2} \in [\delta_-, \delta_+]$. Generally, tight upper bounds (generic chaining bounds) for product empirical processes are not easy to derive, due to the nontrivial geometric structure of the indexing classes of the empirical process; see Mendelson [16] and references therein. Even though we suspect that the bound in Lemma 6 might not be sharp, it is sufficient to prove the results we need in this paper. Lemma 6 will be used to prove the oracle inequality (4.4) and the spectral norm (i.e., $p = +\infty$) convergence rate of $\hat{\rho}^{\epsilon}$ in (4.11). The proof of Lemma 6 is given in Section 6.4.
Proof of Theorem 4.1. Denote $\Xi := \frac{1}{n}\sum_{i=1}^n \xi_i X_i$. By Lemma 2, we know that the bound (6.8) holds with probability at least $1 - e^{-t}$ for some constant $C > 0$. Here we used the simple facts $\|\mathbb{E}\xi^2 X^2\|_{\infty}^{1/2} \leq \frac{1}{m\sqrt{K}}$ (see [14]) and $|\xi| \leq \frac{2}{\sqrt{m}}$. The second term in (6.8) is clearly dominated by the first term as long as $n \gtrsim (t + \log(2m))\log(2m)$, which is assumed to be true hereafter. In the case $n \lesssim (t + \log(2m))\log(2m)$, the bounds (4.4), (4.5) and (4.6) are trivial.

Proof of Theorem 4.2
We begin with the proof of the spectral norm bound on $\|\hat{\rho}^{\epsilon} - \rho\|_{\infty}$. Note that it splits into two terms. The first term is upper bounded by $2\epsilon = \frac{2C_1}{m}\sqrt{\frac{\log(2m)}{nK}}$ with probability at least $1 - \frac{1}{2m}$, since $\hat{\rho}^{\epsilon} \in \Lambda(\epsilon)$. By the definition of the spectral norm, the second term is written as in (6.14) (recall the definition of $A(\Delta)$ in Lemma 6). To this end, we apply Lemma 6 with $\delta_- = \frac{1}{2m}$ and $\delta_+ = \frac{1}{m}$, which holds with probability at least $1 - \frac{1}{2m}$. By simply replacing $\|\hat{\rho}^{\epsilon} - \rho\|_1$ with $2$ in (6.14), we get, with probability at least $1 - \frac{1}{2m}$, the stated bound, where we used the condition $K \lesssim m$ (the logarithmic terms get a higher order due to this simplifying step). Since $\|\hat{\rho}^{\epsilon} - \rho\|_{\infty}$ has the trivial upper bound $2$, we conclude that $\|\hat{\rho}^{\epsilon} - \rho\|_{\infty} \leq \frac{Cm\log^3 m \log^3 n}{\sqrt{nK}} \wedge 2$.

Proof of Lemma 6
Proof of Lemma 6. For any $\Delta \in [0, 1]$, define $\beta_n(\Delta)$ as the supremum of the empirical process over the class $A(\Delta)$. For all $A_1 \in A(\Delta_1)$ and $A_2 \in A(\Delta_2)$, the following fact is clear, where the last inequality holds because $\frac{A_1 \pm A_2}{2} \in A\big(\frac{\Delta_1 + \Delta_2}{2}\big)$ for all $A_1 \in A(\Delta_1)$ and $A_2 \in A(\Delta_2)$. Therefore, it suffices to prove an upper bound on $\beta_n(\Delta)$ for $\Delta \in [\delta_-, \delta_+]$. We point out that the upper bound on $\beta_n(\Delta)$ was claimed in our previous paper [14] without proof. Since Lemma 6 is used several times in this paper, we give a simple proof based on Dudley's entropy bound and the $L_\infty(\Pi_n)$ complexity of the unit ball in $\mathbb{H}_m$ equipped with the Schatten 1-norm.