Covariance Estimation under Missing Observations and $L_4$-$L_2$ Moment Equivalence

We consider the problem of estimating the covariance matrix of a random vector from i.i.d. samples in which each entry of the sampled vector is observed independently with probability $p$ (and missing otherwise). Under the standard $L_4$-$L_2$ moment equivalence assumption, we construct the first estimator that is simultaneously optimal with respect to the parameter $p$ and recovers the optimal convergence rate for the classical covariance estimation problem when $p = 1$.


Introduction
High-dimensional covariance estimation is one of the most fundamental problems at the intersection of probability and statistics. On the applied side, it is a basic primitive for PCA and linear regression [31]. On the theoretical side, the non-asymptotic properties of isotropic sample covariance matrices have been extensively studied [2,20,30,29,28,11], spurred by a famous question of Kannan, Lovász and Simonovits [12], and later generalized to the anisotropic case [16,15,1]. Although the sample covariance matrix seems to be the most natural estimator, its performance is suboptimal when the data lacks strong tail decay: the convergence rate with respect to the confidence level δ is quite slow.
Motivated by this fact, a line of work in robust statistics, pioneered by Catoni [5], studies so-called sub-Gaussian estimators: estimators that perform as well as the empirical mean does under a Gaussian distribution. Many estimators have been proposed for the covariance estimation problem (see [13] for a survey); in particular, there are now sub-Gaussian estimators under minimal assumptions on the data distribution [1,23].
On the other hand, data may be corrupted by noise. In [18], Lounici addressed the so-called covariance estimation problem with missing observations, motivated by applications in climate science, gene expression and cosmology. His work considers i.i.d. observations in which each entry is observed with probability p (and "missing" otherwise). We highlight that the missing observations model is a standard notion in the literature, extending well beyond the covariance estimation setting; see [10,17] and the references therein.
The goal of this work is to design an estimator that simultaneously achieves the following properties:

• Missing Observations: We allow the data to have missing observations. We construct an estimator with the minimax optimal convergence rate without assuming any knowledge of p. Remarkably, we show that the dependence on p is universal, meaning that it does not depend on the distribution of the data. This is an important aspect in high-dimensional settings, where the dimension d is at least the sample size N.

• Heavy Tails: We allow the distribution to have heavy tails, requiring only the existence of four moments satisfying minimal assumptions. Moreover, the result is as sharp as if the data were Gaussian (up to an absolute constant).
We begin with the rigorous definition of the model. We say that a centred random vector X satisfies the $L_4$-$L_2$ moment equivalence (hypercontractivity) with constant κ ≥ 1 if, for all v ∈ S^{d−1},
$$\big(\mathbb{E}\,\langle X, v\rangle^4\big)^{1/4} \;\le\; \kappa\,\big(\mathbb{E}\,\langle X, v\rangle^2\big)^{1/2}.$$
Here we always assume that the data satisfies the $L_4$-$L_2$ moment equivalence with an absolute constant κ, i.e., κ is a fixed real number that does not depend on any other parameter. A vast class of distributions satisfies this moment equivalence with κ a small absolute constant. Examples include sub-Gaussian random vectors, sub-exponential random vectors with bounded ψ_α norm, as well as Student's t-distributions with sufficiently many degrees of freedom [21].
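For concreteness, consider the Gaussian case: if X is a centred Gaussian vector, then ⟨X, v⟩ ∼ N(0, v^TΣv) and $\mathbb{E}\langle X, v\rangle^4 = 3\,(\mathbb{E}\langle X, v\rangle^2)^2$, so
$$\big(\mathbb{E}\,\langle X, v\rangle^4\big)^{1/4} = 3^{1/4}\,\big(\mathbb{E}\,\langle X, v\rangle^2\big)^{1/2},$$
i.e., the $L_4$-$L_2$ moment equivalence holds with κ = 3^{1/4} ≈ 1.32.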
We say that the sample Y₁, ..., Y_N is p-sparsified if it is obtained from a sample X₁, ..., X_N of independent copies of X by multiplying each entry of the X_i's by an independent 0/1 Bernoulli random variable with mean p. In concise terminology, we say that the data is sampled from X ⊙ p, where p ∈ {0,1}^d is a random vector with i.i.d. Bernoulli(p) entries and ⊙ denotes the standard entrywise product. The choice of zero to represent missing information is merely for convenience and could be replaced by any other value. We now present the main result of this manuscript.
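As a purely illustrative aside (not part of the formal development), the sampling model is easy to simulate; the function name `sparsify` and the Gaussian choice of X below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(X, p):
    # Multiply each entry by an independent Bernoulli(p) mask:
    # entries are observed with probability p and set to 0 otherwise.
    mask = rng.binomial(1, p, size=X.shape)
    return X * mask

# Example: N samples of a centred Gaussian with covariance Sigma.
N, d, p = 10_000, 5, 0.7
Sigma = np.diag(np.arange(1.0, d + 1.0))
X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
Y = sparsify(X, p)  # the p-sparsified sample Y_1, ..., Y_N
```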
Theorem 1 (Main result). Assume that X is a zero mean random vector in ℝ^d with covariance matrix Σ satisfying the $L_4$-$L_2$ moment equivalence assumption with an absolute constant κ. Fix the confidence level δ ∈ (0, 1). Suppose that Y₁, ..., Y_N are i.i.d. samples distributed as X ⊙ p, where p = (p₁, ..., p_d) ∈ {0,1}^d is a random vector with i.i.d. Bernoulli entries with parameter p. Then there exists an estimator $\widehat{\Sigma}(N, \delta)$, depending only on the sample Y₁, ..., Y_N and δ, such that, with probability at least 1 − δ,
$$\big\|\widehat{\Sigma} - \Sigma\big\| \;\le\; C(\kappa)\,\frac{\|\Sigma\|}{p}\left(\sqrt{\frac{r(\Sigma)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right),$$
where r(Σ) := Tr(Σ)/‖Σ‖ denotes the effective rank of Σ and C(κ) > 0 is a constant depending only on κ.
Literature Review: Several results for covariance estimation under missing observations have been obtained in the literature, for example [18,27,26,25,14,3]. However, none of the previous results simultaneously scales correctly with the factor p and recovers a sub-Gaussian estimator when p = 1, as established in [1,23], even when the data is Gaussian. Moreover, the convergence rate is optimal up to an absolute constant: when p = 1, a classical result of Koltchinskii and Lounici [15, Theorem 4] states that if G₁, ..., G_N are i.i.d. mean zero Gaussian vectors with covariance matrix Σ and N ≥ r(Σ), then
$$\mathbb{E}\,\Big\|\frac{1}{N}\sum_{i=1}^N G_i \otimes G_i - \Sigma\Big\| \;\sim\; \|\Sigma\|\,\sqrt{\frac{r(\Sigma)}{N}}.$$
This essentially shows optimality with respect to the effective rank, as we expect the sample covariance to be sharp in expectation for Gaussian distributions. They also showed that the norm is tightly concentrated around its expectation, and our quantitative convergence rate with respect to δ matches their result up to an absolute constant. Both are indeed optimal among all (measurable) estimators of the covariance; see [21,1] for a more technical discussion. Intuitively, it should not be surprising that we cannot beat the Gaussian decay.
In addition, the dependence on p is also optimal, thanks to a minimax lower bound of Lounici [18, Theorem 2]. In a nutshell, his result shows that there exist absolute constants c₁, c₂ > 0 for which
$$\inf_{\widehat{\Sigma}}\ \sup\ \mathbb{E}\,\big\|\widehat{\Sigma} - \Sigma\big\| \;\ge\; c_1\left(\frac{\|\Sigma\|}{p}\sqrt{\frac{r(\Sigma)}{N}}\ \wedge\ c_2\,\|\Sigma\|\right).$$
Here, the infimum is taken over all estimators that depend only on the data, and the supremum is taken over all distributions with covariance matrix Σ. This implies that our main result captures the optimal dependence with respect to p as well.
It is important to note that the main drawback of our result is that the estimator is not computationally tractable. The primary focus of this work is the information-theoretic limits of covariance estimation; specifically, our main contribution is to demonstrate that it is possible to construct an optimal, data-driven estimator of the covariance under minimal assumptions on the data (albeit a computationally infeasible one).
To the best of our knowledge, there are no computationally efficient estimators achieving such guarantees for the covariance matrix under heavy tails, even in the case without missing observations. We leave this as an important open problem.
Proposed Estimator: The starting point of our construction is the following observation: the expectation of Y ⊗ Y scales differently on the diagonal and off-diagonal parts. More precisely,
$$\mathbb{E}\, Y \otimes Y \;=\; p\,\mathrm{Diag}(\Sigma) \;+\; p^2\,\mathrm{Off}(\Sigma).$$
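This identity follows from a one-line entrywise computation. Writing δ for the Bernoulli mask (so that Y_j = δ_j X_j; the symbol δ is ours, used only here), independence of the mask and X together with the identity δ_j² = δ_j give
$$\mathbb{E}\,Y_i Y_j \;=\; \mathbb{E}[\delta_i \delta_j]\;\mathbb{E}[X_i X_j] \;=\; \begin{cases} p\,\Sigma_{ii}, & i = j,\\[2pt] p^2\,\Sigma_{ij}, & i \ne j.\end{cases}$$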
We can "invert" this equality to relate the true covariance to the data:
$$\Sigma \;=\; p^{-1}\,\mathrm{Diag}\big(\mathbb{E}\,Y \otimes Y\big) \;+\; p^{-2}\,\mathrm{Off}\big(\mathbb{E}\,Y \otimes Y\big).$$
A natural approach would be to replace the unknown term $\mathbb{E}\, Y \otimes Y$ by its sample counterpart, but this is not enough when we consider heavy-tailed data X₁, ..., X_N, as discussed above. Instead, we use the truncation function
$$\psi(x) = \mathrm{sign}(x)\,\min\{|x|, 1\} \tag{1}$$
to robustify our estimator in each direction of the sphere; the idea is to estimate the matrix through its quadratic forms. Next, we describe the estimator's final form. We estimate the diagonal and off-diagonal parts separately, via min-max programs that select $\widehat{\Sigma}_1$ (respectively $\widehat{\Sigma}_2$) in $\mathcal{S}^d_+$ so that the quadratic form v^T S v matches, uniformly over directions v ∈ S^{d−1}, a ψ-truncated empirical average of the quadratic forms v^T Diag(Y_i ⊗ Y_i) v (respectively v^T Off(Y_i ⊗ Y_i) v) at truncation level λ₁ (respectively λ₂). Here $\mathcal{S}^d_+$ is the set of d × d positive semidefinite matrices and S^{d−1} denotes the unit sphere in ℝ^d, and the final estimator becomes
$$\widehat{\Sigma} \;=\; \widehat{p}^{\,-1}\,\mathrm{Diag}\big(\widehat{\Sigma}_1\big) \;+\; \widehat{p}^{\,-2}\,\mathrm{Off}\big(\widehat{\Sigma}_2\big).$$
Here, $\widehat{p}$ is an estimator of the parameter p, and the choice of the truncation levels λ₁, λ₂ will be clarified in what follows.
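For illustration only, here is the natural (non-robust) plug-in approach described above, with the unknown $\mathbb{E}\, Y \otimes Y$ replaced by its empirical counterpart. This sketch is precisely what fails to be sub-Gaussian under heavy tails; it is not the estimator of this paper:

```python
import numpy as np

def naive_plugin(Y, p_hat):
    """Invert E[Y (x) Y] = p*Diag(Sigma) + p^2*Off(Sigma) using the empirical
    second-moment matrix. Non-robust: heavy-tailed data spoils its dependence
    on the confidence level, which the truncated min-max program repairs."""
    S = Y.T @ Y / Y.shape[0]        # empirical E[Y (x) Y]
    D = np.diag(np.diag(S))         # diagonal part
    return D / p_hat + (S - D) / p_hat**2
```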
As mentioned before, the main drawback of our estimator is that it is not computationally tractable. Indeed, $\widehat{\Sigma}_1, \widehat{\Sigma}_2$ do not seem to be computable in polynomial time, as a (sub)gradient descent/ascent type method might get stuck in local optima, and analyzing such a method is beyond the scope of this text. We remark that a similar optimization problem appears in [9], with the quadratic forms replaced by linear forms. Unfortunately, even in that case it is an open problem to come up with an (analyzable) algorithm.
From a more practical perspective, it may be possible, under stronger concentration assumptions on the data, to avoid evaluating the truncation function ψ in every direction v of the Euclidean sphere and to replace the supremum over the sphere in the definition of $\widehat{\Sigma}_1, \widehat{\Sigma}_2$ by another (more tractable) quantity. This would make the optimization problem much easier to solve in polynomial time. We leave this as an interesting question for future work.
The construction of the estimator and its analysis share similarities with the "trimmed covariance" estimator proposed by Zhivotovskiy and the author [1]. However, we need to split into diagonal and off-diagonal parts to take into account the different scalings in p. Indeed, the main technical difficulty lies in controlling the random quadratic form to get the optimal dependence on p, mainly in the off-diagonal case. A direct approach faces the difficulty that we no longer have a positive semidefinite matrix, making it challenging to capture cancellations. Conversely, an indirect approach, expressing the off-diagonal part as the whole matrix minus the diagonal part, leads to suboptimality with respect to p. Thus, we need to carefully balance these two approaches.
Organization. The rest of the paper is organized as follows. In Section 2, we assume knowledge of certain parameters to simplify the analysis of the estimator and derive sharp convergence rates. We then systematically relax these assumptions in Section 3 by estimating each parameter separately in individual subsections. The last subsection of Section 3 is devoted to the formal construction of the estimator and the proof of the main result.
Notation. Throughout this text, C, c > 0 denote absolute constants that may change from line to line. For an integer N, we set [N] = {1, ..., N}. For any two functions (or random variables) f, g defined on a common domain, the notation f ≲ g means that there is an absolute constant c > 0 such that f ≤ cg, and f ∼ g means that f ≲ g and g ≲ f. Let $\mathcal{S}^d_+$ denote the set of d × d positive semidefinite matrices. The symbols ‖·‖, ‖·‖_F denote the operator norm and the Frobenius norm of a matrix, respectively. Let $\mathrm{KL}(\rho, \mu) = \int \log\frac{d\rho}{d\mu}\, d\rho$ denote the Kullback-Leibler divergence between a pair of measures ρ and µ. We write ρ ≪ µ to indicate that the measure ρ is absolutely continuous with respect to the measure µ. For a vector X ∈ ℝ^d, the tensor product ⊗ is defined by X ⊗ X := XX^T ∈ ℝ^{d×d}.

Oracle Estimator
In this section, we prove our main result under the assumption that we know the effective rank r(Σ), the trace Tr(Σ), and the sparsifying factor p. These assumptions will be relaxed in the next section. Our main goal is to prove the following result.

Proposition 1. Assume that X is a mean zero random vector in ℝ^d with covariance matrix Σ satisfying the $L_4$-$L_2$ moment equivalence assumption. Fix the confidence level δ ∈ (0, 1). Suppose that Y₁, ..., Y_N are i.i.d. samples from X ⊙ p. Then there exist λ₁, λ₂ > 0, depending only on Tr(Σ), ‖Σ‖ and p, for which, with probability at least 1 − δ,
$$\big\|\widehat{\Sigma} - \Sigma\big\| \;\le\; C(\kappa)\,\frac{\|\Sigma\|}{p}\left(\sqrt{\frac{r(\Sigma)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right).$$
Here C(κ) > 0 is a constant depending only on κ.
Our analysis is based on the variational principle pioneered by O. Catoni [5,4,8] and further developed in many applications in high-dimensional probability and statistics [4,6,7,8,32,22]. In most applications of the variational principle, the following lemma serves as the key stepping stone.

Lemma 1. Assume that X, X₁, ..., X_N are i.i.d. random variables taking values in some measurable space. Let Θ be a subset of ℝ^p for some p ≥ 1, let µ be a fixed distribution on Θ, and let f be a measurable function of the data and the parameter. Then, with probability at least 1 − δ, simultaneously for all distributions ρ on Θ with ρ ≪ µ,
$$\frac{1}{N}\sum_{i=1}^N \mathbb{E}_{\theta}\, f(X_i, \theta) \;\le\; \mathbb{E}_{\theta}\,\log \mathbb{E}\, e^{f(X,\theta)} \;+\; \frac{\mathrm{KL}(\rho,\mu) + \log(1/\delta)}{N}.$$
Here θ is distributed according to ρ.
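For intuition, the engine behind bounds of this type is the Donsker-Varadhan duality: for any measurable h and any ρ ≪ µ,
$$\mathbb{E}_{\theta \sim \rho}\, h(\theta) \;\le\; \mathrm{KL}(\rho, \mu) + \log \mathbb{E}_{\theta \sim \mu}\, e^{h(\theta)}.$$
Applying this with $h(\theta) = \sum_{i=1}^N f(X_i, \theta) - N \log \mathbb{E}\, e^{f(X, \theta)}$, whose exponential has expectation one over the data for each fixed θ, and controlling $\mathbb{E}_\mu e^{h(\theta)}$ via Fubini and Markov's inequality yields Lemma 1; see [4,32].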
The proof can be found in [4,32] and will be omitted. The next lemma is a technical fact that allows us to "convexify" the truncation function ψ. Indeed, it is easy to see that the function $e^{\psi(x)}$ is bounded by 1 + x + x², whose logarithm is still not convex, but adding a suitable quadratic term makes it convex.

Lemma 2. Let ψ be the truncation function from (1), and let Z be a random variable with finite second moment. Then the following holds:
$$\psi(\mathbb{E}Z) \;\le\; \min\big\{\mathbb{E}\log(1 + Z + Z^2) + \mathbb{E}Z^2/6,\ 1\big\}.$$
Moreover, a complementary bound holds for any a > 0.

This result has been previously used in [4,32,1]. For the sake of completeness, we include a proof at the end of this section. We now turn to the facts specifically derived for the missing observations case. The next result is crucial to establish the right dependence on p. Its proof is also deferred to the end of this section.
Lemma 3. Let Y be as above. For every v ∈ S^{d−1}, we have
$$\mathbb{E}\big(v^T\,\mathrm{Diag}(Y\otimes Y)\,v\big)^2 \le 2\kappa^4\, p\,\|\Sigma\|^2, \qquad \mathbb{E}\big(v^T\,\mathrm{Off}(Y\otimes Y)\,v\big)^2 \le 4\kappa^4\, p^2\,\|\Sigma\|^2.$$
The main idea behind the proof of Proposition 1 is to use the variational principle twice, once for the diagonal part and once for the more delicate off-diagonal part.
Proof. Diagonal Part: We start by defining the parameter space of interest. Choose µ to be a product of two multivariate Gaussian distributions with mean zero and covariance β⁻¹I_d, where β > 0 will be chosen later. For each v ∈ S^{d−1}, let ρ_v be the product of two multivariate Gaussian distributions with mean v and covariance β⁻¹I_d. By construction, if (θ, ν) is distributed according to ρ_v, then E_{ρ_v}(θ, ν) = (v, v). The standard formula for the KL-divergence between two Gaussian measures [24] implies that KL(ρ_v, µ) = β. Let λ₁ > 0 be a free parameter to be optimized later. By the first part of Lemma 2 we obtain the first estimate, and, by symmetry, the analogous one. To see this, observe that the second term is positive with probability one half; conditioned on this event, the first term is positive with probability one half, independently of the second term. We obtain that both are positive simultaneously with probability at least one quarter. We then invoke the second part of Lemma 2. Let us focus on the first term; the goal is to apply Lemma 1 to the function f defined accordingly. Using the numeric inequality log(1 + y) ≤ y, valid for all y ≥ −1, followed by Fubini's theorem and Lemma 3, then setting β := r(Σ) (which is at least one) and applying Lemma 1, it follows that, with probability at least 1 − δ, the bound holds for all v ∈ S^{d−1}. Here R_i denotes an independent copy of R. We proceed to estimate the third term on the right-hand side. Clearly, since the truncated term is bounded by one, its variance is bounded by its expectation; therefore, by Bernstein's inequality, the estimate holds with probability 1 − δ. An analogous computation shows that the same estimate holds (up to an absolute constant) for the term min{1, 2β⁻¹‖Diag(Y ⊗ Y)v‖₂²/6}. Finally, we conclude that there exists an absolute constant C > 0 such that the resulting bound holds with probability at least 1 − δ. We optimize the right-hand side over λ₁ > 0; with this choice of λ₁ we obtain the one-sided bound with probability at least 1 − δ. We repeat the same arguments for ρ_{2,v}, the product of the same two Gaussian measures with the mean of the second factor replaced by −v; the argument follows exactly the same steps because ψ is symmetric. Therefore, the reverse bound also holds with probability 1 − δ. By a union bound, we obtain a two-sided bound with probability at least 1 − δ.

Off-diagonal part: We now proceed to the second part of the proof, dealing with the off-diagonal part. We choose µ and ρ_v as before and write the analogous decomposition. We now have to deal with the quadratic form of the off-diagonal part, which requires a more delicate analysis. In fact, the expansion consists of terms of the form ⟨Y, e_i⟩⟨Y, e_j⟩⟨Y, e_k⟩⟨Y, e_l⟩ θ_i ν_j θ_k ν_l. By independence between θ and ν, it remains to analyze the term E[θ_iθ_k] E[ν_jν_l]. We split the analysis into three cases: the first when k = i and j = l; the second when either (k = i and j ≠ l) or (k ≠ i and j = l); and the third when k ≠ i and j ≠ l. The third case is the simplest. Observe that summing all the terms that do not contain any β factor, we recover the unsmoothed quadratic form. As before, the goal is to apply Lemma 1 to the function f, where C₂ > 0 is a sufficiently large absolute constant. Using again the numeric inequality log(1 + y) ≤ y, Fubini's theorem, and Lemma 3, we find that the first term is equal to p²λ₂ v^T Off(Σ)v. We know that all the terms in the expansion of θ^T Off(Y ⊗ Y)ν without a β factor add up to v^T Off(Y ⊗ Y)v, whose second moment is at most 4p²κ⁴‖Σ‖² by Lemma 3.
Next, we estimate the terms containing β systematically. Using the Cauchy-Schwarz inequality together with the moment equivalence for X, we obtain the first of these bounds, and similarly the second.

It remains to analyze
We estimate the first term on the right-hand side as the second term is identically distributed.
To this end, we apply Hölder's inequality with conjugate exponents 4/3 and 4, together with the moment equivalence, where the last inequality follows from the arithmetic-geometric mean inequality. Putting everything together and setting β = r(Σ), we conclude that there exists an absolute constant C′₂ > 0 for which the corresponding bound holds with probability at least 1 − δ. By Bernstein's inequality, the remainder terms R₂(Y_i) are absorbed by the last term in the sum, exactly as in the diagonal case. We then optimize over λ₂ > 0; with this choice, the one-sided bound holds with probability 1 − δ. We repeat the arguments by changing the mean of ν to −v; this gives the other side of the inequality in the same way it was done for the diagonal part, so the two-sided bound holds with probability 1 − δ. By the triangle inequality, a union bound, and re-scaling the multiplicative constant in δ, the estimator $\widehat{\Sigma}$ satisfies the bound claimed in Proposition 1 with probability 1 − δ. To end this section, we prove the technical facts, Lemmas 2 and 3. We start with the proof of Lemma 3.
Proof. We start with the diagonal case. Observe that the quantity decomposes into two parts; the first, (I), is bounded directly, and the remaining part is handled by the arithmetic-geometric mean inequality. For the off-diagonal term, we need to proceed carefully: the natural idea of decomposing the off-diagonal matrix into the full matrix minus the diagonal part leads to a suboptimal dependence on p. We first expand directly; the factor p² comes from the fact that at least two indices are distinct in each summand. Now, we split the off-diagonal term E(v^T Off(X ⊗ X)v)² into three terms (a), (b) and (c). The last term (c) is negative because both matrices are positive semidefinite, so we can safely discard it. The first term (a) is at most κ⁴(v^TΣv)² ≤ κ⁴‖Σ‖² by the moment equivalence assumption. Finally, the second term (b) is at most κ⁴‖Σ‖² by the same argument.
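To make the appearance of the factor p² explicit: by independence of the mask (again writing δ for it), each summand factors as
$$\mathbb{E}\big[Y_i Y_j Y_k Y_l\big] \;=\; \mathbb{E}\big[\delta_i \delta_j \delta_k \delta_l\big]\,\mathbb{E}\big[X_i X_j X_k X_l\big] \;=\; p^{m}\,\mathbb{E}\big[X_i X_j X_k X_l\big],$$
where m is the number of distinct indices among i, j, k, l (using δ_j² = δ_j). In the off-diagonal expansion one always has i ≠ j and k ≠ l, so m ≥ 2 and the Bernoulli factor is at most p².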
Next, we proceed to prove Lemma 2.
Proof. Notice that ψ(x) ≤ log(1 + x + x²) holds trivially, and we add x²/6 to make the latter function convex. It follows that ψ(EZ) ≤ min{log(1 + EZ + EZ²) + EZ²/6, 1}. Now, we apply Jensen's inequality to conclude the proof of the first part. For the second part, notice that, by Taylor series expansion, the required inequality holds for t ∈ [0, a]. To get the inequality in the statement, we only need to split into the cases where |Z|²/6 is smaller than one and where it is greater than one.
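The convexification step can be checked directly: $\frac{d^2}{dx^2}\log(1+x+x^2) = \frac{1 - 2x - 2x^2}{(1+x+x^2)^2} \ge -\frac{1}{3}$, with equality at x = 1 (and x = −2), so adding x²/6 makes the function convex. A quick numerical sanity check (ours, purely illustrative):

```python
import numpy as np

# g(x) = log(1 + x + x^2) + x^2/6 should be convex, i.e. g'' >= 0.
x = np.linspace(-50, 50, 2_000_001)
g2 = (1 - 2 * x - 2 * x**2) / (1 + x + x**2) ** 2 + 1.0 / 3.0
print(g2.min())  # ~0 up to grid resolution, attained near x = 1 and x = -2
assert g2.min() > -1e-9
```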

Proof of Theorem 1
In the previous section we showed, in Proposition 1, that the proof of the main result boils down to estimating the trace of the covariance matrix, the operator norm, and the sparsifying parameter p. For the trace and the operator norm, it is enough to estimate them up to a multiplicative absolute constant. For the parameter p, on the other hand, we need a more accurate estimator: since we divide by p, an estimator $\widehat{p}$ that does not converge to p would introduce a bias.

Remark 1. The best possible convergence rate is at least of order
$$\frac{\|\Sigma\|}{p}\sqrt{\frac{r(\Sigma)+\log(1/\delta)}{N}}.$$
The trivial estimator $\widehat{\Sigma} = 0$ satisfies $\|\widehat{\Sigma} - \Sigma\| \le \|\Sigma\|$, so in order to have a meaningful result we need
$$\frac{1}{p}\sqrt{\frac{r(\Sigma)+\log(1/\delta)}{N}} < 1.$$
Therefore, without making any further comments, we may assume that $N \ge C\, p^{-2}\big(r(\Sigma) + \log(1/\delta)\big)$ for some well-chosen C > 0.

Estimation of p
The idea here is to exploit the proportion of non-zero entries in the observed data. In any standard data set, a missing value is flagged rather than recorded as a zero; we set it to zero merely for convenience, as we have done throughout the manuscript. When estimating the proportion of missing values, it could be the case that the distribution of the random vector X has non-trivial mass at zero; since missing entries are flagged in the raw data, we can still distinguish a zero coming from the distribution from a zero marking a missing value. Equivalently, we may assume that the marginals of X, namely ⟨X, v⟩ for every v ∈ S^{d−1}, have no atom at zero. The starting point is the following. We collect Y₁, ..., Y_N and compute Z₁, ..., Z_N, where Z_i(j) = 1 if and only if Y_i(j) ≠ 0, and zero otherwise. The goal is to estimate the mean of the random variable
$$R(Z) := \frac{1}{d}\sum_{j=1}^{d} Z(j),$$
as it satisfies $\mathbb{E}\,R(Z) = p$.
Lemma 4. Let Y₁, ..., Y_N be i.i.d. copies of X ⊙ p. There exists an estimator $\widehat{p}$, depending only on the sample and the confidence level δ, satisfying, with probability at least 1 − δ,
$$\big|\widehat{p} - p\big| \;\le\; C\, p\, \sqrt{\frac{\log(1/\delta)}{N}}.$$
As an immediate consequence, if N ≥ C log(1/δ), then (with the same probability guarantee) $\widehat{p} \sim p$. Before we proceed to the proof, we remark that if $\widehat{p} > 1$, then we round it down to $\widehat{p} = 1$.
Proof. Following the notation above, we collect R(Z₁), ..., R(Z_N), i.i.d. copies of R(Z). We invoke a standard sub-Gaussian mean estimator for R(Z) (e.g., the trimmed mean estimator [19, Theorem 1]), together with the fact that Var(R(Z)) ≤ p², to obtain the desired inequality with probability at least 1 − δ.
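A minimal sketch of this estimator, assuming (as in the text) that the marginals of X have no atom at zero; the trimming level `trim` and the trimmed mean as a stand-in for the sub-Gaussian estimator of [19] are our choices:

```python
import numpy as np

def estimate_p(Y, trim=0.1):
    # R(Z_i): fraction of non-zero (i.e. observed) entries of Y_i; E R(Z_i) = p.
    R = np.sort((Y != 0).mean(axis=1))
    k = int(trim * len(R))
    p_hat = R[k:len(R) - k].mean() if len(R) > 2 * k else R.mean()
    return min(p_hat, 1.0)  # cap at 1 as in the text (R already lies in [0, 1])
```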

Estimation of the Trace
To simplify the analysis, we may assume that p is known, because we can estimate it accurately using Lemma 4. Clearly,
$$\mathbb{E}\,\|Y\|_2^2 \;=\; p\,\mathrm{Tr}(\Sigma).$$
To invoke a mean estimator, we need to bound the standard deviation of the random variable on the left-hand side. To this end, we have
$$\mathbb{E}\,\|Y\|_2^4 \;=\; \mathbb{E}\Big(\sum_{i=1}^d \langle Y, e_i\rangle^2\Big)^2 \;\le\; 2\kappa^4\, p\,\mathrm{Tr}(\Sigma)^2,$$
where the latter step follows from the moment equivalence and Hölder's inequality (as we have done several times in this manuscript). Since p is known, one may invoke Theorem 1 of [19] to obtain an estimator $\widehat{\mathrm{Tr}}(\Sigma)$ satisfying, with probability 1 − δ,
$$\big|\widehat{\mathrm{Tr}}(\Sigma) - \mathrm{Tr}(\Sigma)\big| \;\le\; C(\kappa)\,\mathrm{Tr}(\Sigma)\,\sqrt{\frac{\log(1/\delta)}{p\,N}}. \tag{2}$$
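A matching sketch for the trace (again with a trimmed mean standing in for the estimator of [19]; `trim` is our tuning choice):

```python
import numpy as np

def estimate_trace(Y, p_hat, trim=0.1):
    # E ||Y||_2^2 = p * Tr(Sigma), so a robust mean of the squared norms,
    # divided by p_hat, estimates the trace.
    sq = np.sort((Y ** 2).sum(axis=1))
    k = int(trim * len(sq))
    m = sq[k:len(sq) - k].mean() if len(sq) > 2 * k else sq.mean()
    return m / p_hat
```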

Estimation of the Operator Norm
The most delicate part of this section is the estimation of the operator norm. The main lemma is the following.

Lemma 5. Let Y₁, ..., Y_N be i.i.d. copies of X ⊙ p. There exist an absolute constant C_N and an estimator of ‖Σ‖, depending only on the samples and κ, satisfying, with probability at least 1 − δ,
$$c_1\,\|\Sigma\| \;\le\; \widehat{\|\Sigma\|} \;\le\; c_2\,\|\Sigma\|,$$
provided that N ≥ C_N p⁻² max{r(Σ), log(1/δ)}. Here c₁, c₂ > 0 are two constants depending only on κ.
The key idea is to repeat the same analysis as before for each part, with an additional parameter α, and to show that if certain inequalities are satisfied, then α must be of the same order as the operator norm. Throughout the proof, C₁ > 0 is an explicit constant that can be computed by keeping track of the constants in the proofs of Section 2.
Proof. Diagonal Part: As before, we set the parameter space. Now, we slightly change the choice of measures. More accurately, we choose the measure µ to be a product of two multivariate Gaussians with mean zero and covariance β⁻¹I_d. For v ∈ S^{d−1}, let ρ_v be a product of two multivariate Gaussian distributions with mean αv and covariance β⁻¹I_d. The KL-divergence becomes KL(ρ_v, µ) = α²β. To simplify the notation, we write ρ_{v,α} = ρ_v. Following the same lines as the proof of the diagonal part, we obtain the corresponding bound with probability at least 1 − 3δ. Next, we choose β = c_β Tr(Σ), where c_β > 0 is an absolute constant to be chosen later. By Remark 1, we may define a constant C_N > 0 for which N ≥ C_N p⁻² max{r(Σ), log(1/δ)}, and the bound simplifies accordingly.

Off-Diagonal Part: We use the same choice of measures and proceed analogously. We obtain the corresponding bound with probability at least 1 − 5δ.

Everything Together: We define the function g : ℝ → ℝ accordingly. From the above, we obtain that the resulting inequalities hold with probability at least 1 − 8δ. Notice that these error terms can be made arbitrarily small by increasing C_N. In particular, we choose c_β and C_N so that the constraints hold, where L₁, L₂ > 0 are two absolute constants satisfying suitable conditions; the reason for this choice will become clear in what follows. Next, without loss of generality, we assume that P(Y_i = 0) = 0. This is always possible by adding a small amount of Gaussian noise without changing the covariance too much. We construct a vector w ∈ S^{d−1} such that min_{i∈[N]} |⟨Y_i, w⟩| ≠ 0 by sampling the vector from an isotropic Gaussian distribution and normalizing it to have Euclidean norm exactly one. Notice that g(0) = 0 and g is a continuous function. Moreover, if we show that g attains values greater than or equal to one, then by the intermediate value theorem g attains every value within this range. Since w is a unit vector for which min_{i∈[N]} |⟨w, Y_i⟩| ≠ 0, it follows that at least one of the terms on the right-hand side is non-zero. In the case that both are non-zero, we evaluate g at a suitably chosen point, where it is clear that g is at least one. Moreover, observe that the function g is non-negative, as we are allowed to take v = 0 in the supremum of the off-diagonal part. In the case that one term is zero, we simply remove it from (5). Finally, regardless of the case, we choose $\widehat{\alpha}$ such that $g(\widehat{\alpha}) = 1.1 L_2$. This is a valid choice: indeed, recall from (4) that 1.1 L₂ is strictly smaller than one, so the existence of such an $\widehat{\alpha}$ is guaranteed by the intermediate value theorem, as argued before.
Next, (3) implies an inequality that can be interpreted as a parabola in the variable $x := \widehat{\alpha}^2 \|\Sigma\|$ with two real roots. One root is negative and plays no role. The other is a positive absolute constant, implying that there exists a constant c_min(κ) such that the lower bound (6) holds. This translates into a lower bound for ‖Σ‖. We now need an upper bound for ‖Σ‖ in terms of $\widehat{\alpha}$. We repeat the same argument for the product measure ρ_{2,v} between θ and ν, with the mean of the second factor flipped as before. Moreover, since −g(α) is non-increasing on the interval $[0, \widehat{\alpha}]$, and setting $x = \|\Sigma\| \alpha^2$, the resulting inequality holds for all $\alpha \in [0, \widehat{\alpha}]$. The discriminant of the quadratic equation is ∆ = (1 − L₁)² − 8.4 C₁κ⁴L₂, which is (strictly) positive by (4). It follows that the inequality above holds if x ≤ x₁ or x ≥ x₂, where 0 < x₁ < x₂ are the positive roots of the corresponding quadratic equation. We claim that x ≥ x₂ cannot happen. Otherwise, since the inequality holds for all $\alpha \in [0, \widehat{\alpha}]$, it would have to hold for α* such that ‖Σ‖(α*)² ∈ (x₁, x₂), but this contradicts the fact that the parabola assumes negative values on (x₁, x₂). Therefore, we obtain that there exists a constant c_max(κ) > 0 such that $\widehat{\alpha}^2 \|\Sigma\| \le c_{\max}(\kappa)$. Together with (6), we obtain that $c_{\min}(\kappa) \le \widehat{\alpha}^2 \|\Sigma\| \le c_{\max}(\kappa)$.

Completion of the proof of Theorem 1
The final construction of our estimator is the following: 1. Split the sample Y₁, ..., Y_N into four parts of size at least ⌊N/4⌋ each.
2. Estimate the parameter p with the first quarter of the sample using Lemma 4.
3. Estimate the trace Tr(Σ) with the second quarter using (2), and the operator norm ‖Σ‖ with the third quarter using Lemma 5.
4. For the last quarter of the sample, use the estimator from Proposition 1 to estimate the covariance matrix.
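Schematically, and reusing the sketches above (with the intractable min-max step of Proposition 1 replaced by the naive plug-in purely as a placeholder), the four-step construction reads:

```python
import numpy as np

def estimate_covariance(Y):
    N = Y.shape[0]
    folds = np.array_split(np.arange(N), 4)       # 1. four parts, size >= floor(N/4)
    p_hat = estimate_p(Y[folds[0]])               # 2. estimate p
    tr_hat = estimate_trace(Y[folds[1]], p_hat)   # 3. estimate Tr(Sigma); the
    # operator-norm estimate of Lemma 5 would use folds[2] -- together these
    # only set the truncation levels lambda_1, lambda_2 of Proposition 1.
    return naive_plugin(Y[folds[3]], p_hat)       # 4. placeholder for Prop. 1
```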
Before proceeding to the proof, we highlight some features of the data-splitting approach. From a theoretical perspective, it only affects the convergence rate by a constant factor. From a practical standpoint, however, it might be desirable to avoid wasting one quarter of the sample when only a few observations are missing; in that case, one could use the complete data to estimate the covariance by setting $\widehat{p} = 1$.
Moreover, it might be more appropriate to use a smaller fraction of the data to estimate the trace, as it is a one-dimensional quantity and its convergence rate is faster than that of the estimator itself. Unfortunately, due to the intractability of our estimator, we are unable to test these ideas on a real dataset.
We are now in a position to prove our main result, Theorem 1.
Proof. As discussed in Section 2, the proof follows easily once we estimate the parameters defining the truncation levels. Indeed, the truncation levels in Proposition 1 only require knowledge of Tr(Σ), ‖Σ‖ and p up to an absolute constant. The error we need to take into account comes from using the estimated value of p instead of the true value when we divide by p; this is the only reason why we need to estimate the parameter p accurately. To this end, we apply the triangle inequality, and the second term also satisfies the required bound with probability 1 − δ. The same argument holds for the off-diagonal part, as clearly ‖Off(Σ)‖ ≤ 2‖Σ‖; we omit it for the sake of simplicity. Finally, the desired probability guarantee follows from a union bound over a constant number of events.