Multidimensional linear functional estimation in sparse Gaussian models and robust estimation of the mean

: We consider two problems of estimation in high-dimensional Gaussian models. The ﬁrst problem is that of estimating a linear functional of the means of n independent p -dimensional Gaussian vectors, under the assumption that at most s of the means are nonzero. We show that, up to a logarithmic factor, the minimax rate of estimation in squared Euclidean norm is between ( s 2 ∧ n ) + sp and ( s 2 ∧ np ) + sp . The estimator that attains the upper bound being computationally demanding, we investigate suitable versions of group thresholding estimators that are eﬃciently computable even when the dimension and the sample size are very large. An interesting new phenomenon revealed by this investigation is that the group thresholding leads to a substantial improvement in the rate as compared to the element-wise thresholding. Thus, the rate of the group thresholding is s 2 √ p + sp , while the element-wise thresholding has an error of order s 2 p + sp . To the best of our knowledge, this is the ﬁrst known setting in which leveraging the group structure leads to a polynomial improvement in the rate. The second problem studied in this work is the estimation of the common p -dimensional mean of the inliers among n independent Gaussian vectors. We show that there is a strong analogy between this problem and the ﬁrst one. Exploiting it, we propose new strategies of robust estimation that are computationally tractable and have better rates of convergence than the other computationally tractable robust (with respect to the presence of the outliers in the data) estimators studied in the literature. However, this tractability comes with a loss of the minimax-rate-optimality in some regimes.


Introduction
Linear functionals are of central interest in statistics. The problems of estimating a function at given points, predicting the value of a future observation, testing the validity of a hypothesis, finding a dimension reduction subspace are all examples of statistical inference on linear functionals. The primary goal of this paper is to investigate the problem of estimation of a particular form of linear functional defined as the sum of the observed multidimensional signals. Although In particular, one can observe that the ratio of the worst-case risk of the group thresholding procedure and that of the component-wise thresholding might be as small as O(p −1/2 ). Indeed, it is not hard to prove that the performance of the component-wise thresholding estimator is of the order of p times the performance of this same estimator in the scalar case, i.e., when p = 1. But it is proven in Theorem 1 in (Collier, Comminges and Tsybakov, 2017) that the optimal rate of estimation of the (scalar) linear functional is σ 2 s 2 log(n/s 2 ), which is in fact reached by hard-thresholding.
To the best of our knowledge, this is the first known setting in which leveraging the group structure leads to such an important improvement of the rate. In previous results, the improvement was of at most logarithmic order. Another interesting remark is that the group soft thresholding estimator we investigate here has a data-dependent threshold 1 . Finally, note that while the thresholding estimators are natural candidates for solving the problem under consideration in the sparsity setting, the greedy subset selection is a new procedure introduced in this paper to get the best known upper bound on the minimax risk.
A second problem studied in this work is the robust estimation of the mean of a Gaussian vector. As explained in forthcoming sections, this problem has close relations to that of estimation of a linear functional. In order to explain this relation, let us recall that one of the most popular mathematical framework for analyzing robust estimators is the Huber contamination model (Huber, 1964). It assumes that there is a reference distribution P μ , parametrized by μ ∈ M, the precise value of which is unknown, and a contamination distribution Q, which is completely unknown. The data points Y i , i = 1, . . . , n are independent random variables drawn from the mixture distribution P ,μ,Q = (1 − )P μ + Q, where ∈ [0, 1] is the rate of contamination. The goal is then to estimate the parameter μ, see the papers (Chen, Gao and Ren, 2015;Chen, Gao and Ren, 2016) for some recent results. This means that among the n observations, there are s inliers drawn from P μ and (n − s) outliers drawn from Q, all these observations being independent and s being a binomial random variable with parameters n and (1− ). Thus, the specificity of the model is that all the outliers are assumed to be drawn from the same distribution, Q.
We suggest here to consider an alternative model for the outliers. In the general setting, it corresponds to considering the number of outliers, s, as a deterministic value and to assuming that the outliers is of cardinality s) are independent and satisfy Y i ∼ P μ i . Thus, we do not assume in this model that the outliers are all generated by the same random mechanism. This model and the Huber model are two different frameworks for assessing the quality of the estimators. It is quite likely that in real world applications none of these two models are true. However, both of them are of interest for comparing various outlier-robust estimators and investigating optimality properties.
To explain the connection between the robust estimation and the problem of estimation of a linear functional, let us consider the contamination model of the previous paragraph. That is, we assume that the observations Y i are independent and drawn from P μ i , with μ i = μ for every inlier i ∈ O c = {1, . . . , n} \ O. In addition, let μ be the mean of P μ and the family {P μ } be translation invariant (meaning that for every vector a, the random variable Y i − a is drawn from P μ i −a ). If we have an initial estimator μ 0 of μ, which is consistent but not necessarily rate-optimal, then we can define the centered The strategy we propose here is to use an estimator L n -based on the transformed observations Y i -of the linear functional L n = 1 n i∈ [n] θ i and then to update the estimator of μ by the formula μ 1 = 1 n i∈ [n] Y i − L n . This procedure can be iterated using μ 1 as an initial estimator of μ. We elaborate on this approach in the case of the normal distribution, P μ = N p (μ, σ 2 I p ), in the second part of the present work.

Organization
The rest of the paper is organized as follows. Section 2 is devoted to the problem of linear functional estimation. It contains the statements of the main results concerning the risk bounds of different relevant estimators and some lower bounds on the minimax risk. The problem of robust estimation is addressed in Section 3. We summarize our findings and describe some directions of future research in Section 4. The proofs of main theorems are postponed to Section 5, whereas the proofs of technical lemmas are gathered in Section 6. Some wellknown results frequently used in the present work are recalled in Section 7.

Notation
We denote by [k] the set of integers {1, . . . , k}. The k-dimensional vectors containing only ones and only zeros are denoted by 1 k and 0 k , respectively. As usual, u 2 stands for the Euclidean norm of a vector u ∈ R k . The k × k identity matrix is denoted by I k . For every p × n matrix M and every T ⊂ [n], we denote by M T the submatrix of M obtained by removing the columns with indices lying outside T . The Frobenius norm of M, denoted by M F , is defined by M 2 F = tr(M M), and its spectral norm, denoted by M , is defined as square-root of the largest eigenvalue of M T M. We will use the notation L(M) for the linear functional M1 n equal to the sum of the columns of M.

Estimation of a linear functional
We assume that we are given a p×n matrix Y generated by the following model: This means that the deterministic matrix Θ is observed in Gaussian white noise of variance σ 2 . Equivalently, the columns Y i of Y satisfy Our goal is to estimate the vector L(Θ) ∈ R p , where L : R p×n → R p is the linear transformation defined by Let us first explain that this is a nontrivial statistical problem, at least when both p and n are large. In fact, the naive solution to the aforementioned problem consists in replacing in (2) the unknown matrix Θ by the noisy observation Y. This leads to the estimator L = Y1 n , the risk of which can be easily shown to be E Θ L − L(Θ) 2 2 = σ 2 np. When the matrix Θ has at most s nonzero columns with s being much smaller than n, it is possible to design estimators that perform much better than the naive estimator L n . Indeed, an oracle who knows the sparsity pattern S = {i ∈ [n] : θ i = 0} may use the oracle-estimator L S = L(Y S ) which has a risk equal to σ 2 sp. It is not difficult to show that there is no estimator having a smaller risk uniformly over all the matrices Θ with a given sparsity pattern S of cardinality s. Thus, we have two benchmarks: the very slow rate σ 2 np attained by the naive estimator and the fast rate σ 2 sp attained by the oracle-estimator that is unavailable in practice. The general question that we study in this work is the following: what is the best possible rate in the range [σ 2 sp, σ 2 np] that can be obtained by an estimator that does not rely on the knowledge of S?
In what follows, we denote by M(p, n, s) the set of all p × n matrices with real entries having at most s nonzero columns:

Greedy subset selection
Let us consider a greedy estimator that tries to successively recover various pieces of the sparsity pattern S. We start by setting I 1 = [n] and I 1 = J ⊆ I 1 : L(Y J ) 2 2 ≥ 12σ 2 (|J|p + λ|J| 2 ) . If I 1 is empty, then we set J 1 = ∅ and terminate. Otherwise, i.e., when I 1 is not empty, we set J 1 = arg min |J| : J ∈ I 1 and I 2 = I 1 \ J 1 . In the next step, we define I 2 , J 2 and I 3 in the same way using as starting point I 2 instead of I 1 . We repeat this procedure until we get J = ∅ or I +1 = ∅. Then we set The detailed pseudo-code for this algorithm is given in Algorithm 1 below. Update S ← S ∪ J.

10
Update I ← I \ J. 11 until I is empty or J is empty 12 return L GSS ← i∈ S Y i Theorem 1. Let δ ∈ (0, 1) be a prescribed tolerance level. The greedy subset selection estimator with λ = 3 /2 log( 2n /δ) satisfies sup Θ∈M(p,n,s) This result tells us that the worst-case rate of convergence of the GSS estimator over the class M(p, n, s) is σ 2 s(p + s log n). As a consequence, the minimax risk of estimating the functional L(Θ) over the aforementioned class is at most of order σ 2 s(p + s log n). As we will see below, this rate is optimal up to a logarithmic factor.
However, from a practical point of view, the GSS algorithm has limited applicability because of its high computational cost. It is therefore appealing to look for other estimators that can be computed efficiently even though their estimation error does not decay at the optimal rate for every possible configuration on (p, n, s). Let us note here that using standard tools it is possible to establish an upper bound similar to (3) that holds in expectation.

Group hard thresholding estimator
A natural approach to the problem of estimating L(Θ) consists in filtering out all the signals Y i that have a large norm and computing the sum of the remaining signals. This is equivalent to solving the following optimization problem where λ > 0 is a tuning parameter. The estimator Θ GHT , hereafter referred to as group hard thresholding, minimizes the negative log-likelihood penalized by the number of non-zero columns in Θ. One easily checks that the foregoing optimization problem can be solved explicitly and the resulting estimator is Using the group hard thresholding estimator of Θ and the method of substitution, we can estimate L(Θ) by It is clear that this estimator is computationally far more attractive than the GSS estimator presented above. Indeed, the computation of the GHT estimator requires at most O(pn) operations. However, as stated in the next theorem, this gain is achieved at the expense of a higher statistical error.
Using the fact that log(1 + x) ≤ x, we infer from this theorem that the rate of the group hard thresholding for fixed σ is of order s 2 √ p ∧ np + sp, up to a logarithmic factor. Moreover, the rate obtained in this theorem can not be improved, up to logarithmic factors, as stated in the next theorem. Proposition 1. Let us denote by L GHT λ the estimator defined in (4) with a threshold λ > 0. There are two universal constants p 0 ∈ N and c > 0, such that for any p ≥ p 0 and s ≤ n/61, the following lower bound holds inf λ>0 sup Θ∈M(p,n,s) The proofs of these claims being deferred to Section 5, let us comment on the stated results. At first sight the presence of the sparsity s in the definition of the threshold λ in Theorem 2 might seem problematic, since this quantity is unknown in most practical situations. However, one can easily modify the claim of Theorem 2 replacing n/s 2 and np 1/2 /s 2 respectively by n and np 1/2 both in the definition of λ and the subsequent risk bound. This will lead to the rate s 2 p 1/2 + sp, up to a logarithmic factor. In order to obtain the better rate s 2 p 1/2 ∧ (np), it suffices to project the estimator L GHT A second remark concerns the rate optimality. If we neglect the logarithmic factors in this discussion, the rate of the GHT estimator is shown to be at most of order σ 2 (s 2 √ p∧np+sp). This coincides with the optimal rate (and the one of the GSS estimator) when s = O( √ p) and has an extra factor √ p in the worstcase. The latter corresponds to the regime in which s 2 √ p ∧ np + sp = Θ(np) and s 2 ∧ n + sp = Θ(n). This is equivalent to p = O(s 4 /n 2 ∧ n/s). When there is a limit on the computational budget, that is when the attention is restricted to the estimators computable in polynomial (in s, p, n) time, we do not know whether such a deterioration of the risk can be avoided.
An inspection of the proof of Theorem 2 shows that if all the nonzero signals θ i are large enough, that is when min i∈S θ i 2 2 ≥ cp for some constant c > 0, the extra factor √ p disappears and the GHT achieves the optimal rate. Put differently, the signals at which the GHT estimator fails to achieve the optimal rate are those having an Euclidean norm of order p 1/4 . This is closely related to the minimax rate of separation in hypotheses testing. It is known that the separation rate for testing H 0 : θ = 0 against H 1 : θ 2 ≥ ρ, when one observes Y ∼ N (θ, σ 2 I p ) is of order σp 1/4 . This follows for example from Theorem 12 in (Collier, Comminges and Tsybakov, 2017). Our last remark on Theorem 2 concerns the relation with element-wise hard thresholding. The idea is the following: any column-sparse matrix Θ is also sparse in the most common sense of sparsity. That is, the number of nonzero entries of the matrix Θ is only a small fraction of the total number of entries. Therefore, one can estimate the entries of Θ by thresholding those of Y and then estimate L(Θ) by the method of substitution. The statistical complexity of this estimator is quantified in the next theorem, the proof of which is similar to the corresponding theorem in (Collier, Comminges and Tsybakov, 2017).

O. Collier and A. Dalalyan
A striking feature of the problem of linear functional estimation uncovered by Theorem 2 and Theorem 3, is that exploiting the group structure leads to an improvement of the risk which may attain a factor p −1/2 (for the squared Euclidean norm). To the best of our knowledge, this is the first framework in which the grouping is proved to have such a strong impact. This can be compared to the problem of estimating the matrix Θ itself under the same sparsity assumptions. Provable guarantees in such a setting show only a logarithmic improvement due to the use of the sparsity structure (Lounici et al., 2011;Bunea, Lederer and She, 2014).

Group-soft-thresholding estimator
A natural question is whether the results obtained above for the group hard thresholding can be carried over a suitable version of the soft-thresholding estimator. Such an extension could have two potential benefits. First, the soft thresholding is defined as a solution to a convex optimization problem, whereas hard thresholding minimizes a nonconvex cost function. This difference makes the soft thresholding method more suitable to deal with various statistical problems. The simplest example is the problem of linear regression: the extension of the soft thresholding estimator to the case of non-orthogonal design is the LASSO, that can be computed even when the dimension is very large. In the same problem, the extension of the hard thresholding is the BIC-type estimator, the computation of which is known to be prohibitively complex when the dimension is large.
A second reason motivating our interest in the soft thresholding is its smooth dependence on the data. This smoothness implies that the estimator is less sensitive to changes in the data than the hard thresholding. Furthermore, it makes it possible to design a SURE-type algorithm for defining an unbiased estimator of the risk and, eventually, selecting the tuning parameter in a datadriven way.
Finally, anticipating the results of the next section, we would like to stress that the analysis of the group-soft-thresholding estimator prepares the ground for the robust estimator of the mean studied in Section 3.
In the model under consideration, the group soft thresholding estimator Θ GST can be defined as the minimizer of the group-LASSO cost function, that is

2839
This problem has an explicit solution given by It is natural then to define the plug-in estimator as L GST = L( Θ GST ). The next theorem establishes the performance of this estimator.
satisfies, for every Θ ∈ M(p, n, s), The comments made after the statement of Theorem 2 can be repeated here. The dependence of γ on s is not crucial; one can replace s by 1 in the expression for γ, this will not have a strong impact on the risk bound. The bound in expectation can be complemented by a bound in deviation. The rate obtained for the soft thresholding is exactly of the same order as the obtained in Theorem 2 for the group hard thresholding. A notable difference, however, is that in the case of soft thresholding the tuning parameter λ suggested by the theoretical developments is data dependent.

Lower bounds and minimax rate optimality
We now address the question of the optimality of our estimators. In (Collier, Comminges and Tsybakov, 2017), the case p = 1 was solved with lower and upper bounds matching up to a constant. In particular, Theorem 1 in (Collier, Comminges and Tsybakov, 2017) yields the following proposition.
Note that when n = s, this rate is of the order of σ 2 s. It is straightforward that this rate generalizes to σ 2 sp in the multidimensional case. Furthermore, if we knew in advance the sparsity pattern S, then we could restrict the matrix of observations to the indices in S, and we would get the oracle rate σ 2 sp. These remarks are made formal in the following theorem.
Therefore, the greedy subset selector in Section 2.1 is provably rate-optimal in the case s = O( √ n). A question that remains open is the rate optimality when √ n = O(s). The lower bound of Theorem 5 is then of order σ 2 (n + sp), whereas the upper bound of Theorem 1 is of order σ 2 (s 2 + sp). Taking into account the fact that the naive estimator L(Y) has a risk of order σ 2 np, we get that the minimax risk is upper bounded by σ 2 (s 2 ∧ np + sp).Thus, there is a gap of order p when p + √ n = O(s). Note that none of the estimators discussed earlier in this work attain the upper bound σ 2 (s 2 ∧ np + sp); indeed, the latter is obtained as the minimum of the risk of two estimators. Interestingly, one can design a single estimator that attains this rate. Previous sections contain all the necessary ingredients for this. We will illustrate the trick in the case of the GSS estimator, but similar technique can be applied to any estimator for which an "in deviation" risk bound is established.
These two facts imply that with probability at least 1−δ the balls B(L(Y); r 1 ) and B( L GSS ; r 2 ) have nonempty intersection. As a consequence, in this event, we have L(Y) − L GSS 2 ≤ r 1 + r 2 and, therefore, s ≤ s. Now, if 60 s(p + λ s) ≤ 2np + 3n log(2/δ), then L adGSS = L GSS and we have where the last equality follows from the fact that s ≤ s. Thus, we have established the following result.
Let us summarize the content of this section. We have established a lower bound on the minimax risk, showing that the latter is at least of order sp+s 2 ∧n, up to a logarithmic factor. We have also obtained upper bounds, which imply that the minimax risk is at most of order sp + s 2 ∧ (np). Furthermore, this rate can be attained by a single estimator (adaptive greedy subset selection).

The problem of robust estimation
The problem of linear functional estimation considered in the previous section has multiple connections with the problem of robust estimation of a Gaussian mean. In the latter problem, the observations Y 1 , . . . , Y n in R p are assumed to satisfy where I p is the identity matrix of dimension p×p. We are interested in estimating the vector μ, under the assumption that most vectors θ i are equal to zero. All the observations Y i such that i ∈ S = { : θ 2 = 0} are considered as inliers, while all the others are outliers. In this problem, the vectors θ i are unknown, but their estimation is not our primary aim. They are rather considered as nuisance parameters. In some cases, it might be helpful to use the matrix notation of (6): The obvious connection with the problem considered in the previous section is that if we know that μ = 0 p in (6), then we recover model (1). This can be expressed in a more formal way as shown in the next proposition.
Proposition 5. The problem of estimating the linear functional L n (Θ) = ( 1 /n) i∈ [n] θ i in model (7) is not easier, in the minimax sense, than that of estimating μ. More precisely, we have where the sup in the left-hand side and in the right-hand side are taken, respectively, over all Θ ∈ M(p, n, s) and over all (μ, Θ) ∈ R p × M(p, n, s).
Proof. The first inequality is a consequence of the fact that when all the entries of Θ are zero, the optimal estimator of μ in the minimax sense is the sample mean of Y i 's. To prove the second inequality, let L n be an estimator of L n (Θ). We can associate with L n the following estimator of μ: μ( L n ) = L n (Y) − L n . These estimators satisfy 2 ] = pσ 2 /n and the claim of the proposition follows. Another important point that we would like to mention here is the relation between model (6) and the Huber contamination model (Huber, 1964) frequently studied in the statistical literature (we refer the reader to (Chen, Gao and Ren, 2015;Chen, Gao and Ren, 2016) for recent overviews). Recall that in Huber's contamination model, the observations X 1 , . . . , X n are n iid p-dimensional vectors drawn from the mixture distribution (1 − s n )N p (μ, I p ) + s n Q. The particularity of this model is that it assumes all the outliers to be generated by the same distribution Q; the latter, however, can be an arbitrary distribution on R p . In contrast with this, our model (6) allows for a wider heterogeneity of the outliers. On the downside, our model assumes that the outliers are blurred by a Gaussian noise that has the same covariance structure as the noise that corrupts the inliers. The relation between these two models is formalized in the next result.
Proposition 6. Let μ : R p×n → R p be an estimator of μ that can be applied both to the data matrix X = [X 1 , . . . , X n ] from Huber's model and to Y from our model (6). Then, we have risk in the Huber model 2 risk in our model (6) .
The supremum of the left-hand side is over all probability distributions Q on R p such that 3 Q = Q 0 * N p (0, σ 2 I p ), where Q 0 is an arbitrary distribution on R p . The notation B(n, s/n) stands for the binomial distribution.
The proof of this proposition is a simple exercise and is left to the reader. Although some statistical properties of robust estimators in a framework of the same spirit as (6) were already explored in the literature (Dalalyan and Keriven, 2012;Dalalyan and Chen, 2012;Balmand and Dalalyan, 2015;Nguyen and Tran, 2013;Klopp, Lounici and Tsybakov, 2017;Cherapanamjeri, Gupta and Jain, 2016), the entire picture in terms of matching upper and lower bounds was not available until very recently. It has been established in (Chen, Gao and Ren, 2015) that the minimax rate of estimating μ in Huber's contamination model is r all mmx (n, p, s) = σ 2 p n ∨ s 2 n 2 .
As one might expect, this rate is proved to be minimax-optimal in our contamination model as well (Carpentier et al., 2018, row 2, Table 1), at least up to a factor poly-logarithmic in s. Furthermore, Chen, Gao and Ren (2015) showed that this rate is achieved by Tukey's median, i.e., the minimizer of Tukey's depth. An important observation is that the evaluation of Tukey's median is a hard computational problem: there exists no algorithm to date capable of approximating Tukey's median in a number of operations that scales polynomially in p, n and the approximation precision. Simple computationally tractable robust estimators of μ, such as the element-wise median or the geometric median, have a rate of order (Chen, Gao and Ren, 2015;Lai, Rao and Vempala, 2016) σ 2 p n ∨ s 2 p n 2 containing an extra factor p in front of (s/n) 2 . We shall show in this section that a suitable adaptation of the group soft thresholding estimator presented in the previous section leads to a rate that can be arbitrarily close to This shows that if we restrict our attention to the estimators that have a computational complexity that is at most polynomial, the minimax rate satisfies, for every ν ∈ (0, 1/ log p), where means inequality up to logarithmic factors.

Maximum of profile likelihood with group LASSO penalty
A computationally tractable estimator that allows to efficiently deal with structured sparsity and has provably good statistical complexity is the group LASSO (Yuan and Lin, 2006;Lin and Zhang, 2006;Chesneau and Hebiri, 2008;Meier, van de Geer and Bühlmann, 2009;Lounici et al., 2011). We define the group-LASSO estimator by where the λ i are some positive numbers to be defined later. The estimator μ can be seen as the maximum of a profile penalized likelihood, where the penalty is proportional to the 2,1 norm (also known as the group LASSO penalty) of the nuisance parameter Θ. The above optimization problem is convex and can be solved numerically even when the dimension and the sample size are large. It is also well known that μ from (8) is exactly the Huber M-estimator (Donoho and Montanari, 2016, Section 6). In addition, these estimators can also be written as where Π denotes the orthogonal projection in R n onto the orthogonal complement of the constant vector 1 n . Unfortunately, we were unable to establish a risk bound for this estimator that improves on the element-wise median. The best result that we get is the following.
Theorem 6. Consider the estimators of Θ and μ defined in (8) with λ 2 = 32σ 2 p + 256σ 2 log(n/δ). Then, with probability at least 1 − δ and provided that s ≤ n/32, we have This result, proved in Section 5.2, shows that the rate of the profiled penalized likelihood estimator of μ, with a group LASSO penalty, converges at the rate σ 2 s 2 p n 2 ∨ p n , which coincides with the one obtained 4 by (Chen, Gao and Ren, 2015). In the rest of this section, we will propose an estimator which improves on this rate. To this end, we start with obtaining a simplified expression for the group LASSO estimator Θ.
First, using the fact that Πt Recall that L n (Θ) = ( 1 /n)L(Θ). The first-order necessary conditions imply that, for every i such that θ i = 0 p , Furthermore, θ i = 0 p if and only if 2(YΠ) i + 2L n ( Θ) 2 ≤ λ i . We infer that This formula shows the clear analogy between the group LASSO estimator Θ and the soft thresholding estimator studied in the previous section. This analogy suggests to choose the tuning parameters in a data driven way; namely, it is tempting to set Unfortunately, such a choice is impossible to realize since this λ i depends on the solution Θ of the optimization problem, which in turn is defined through λ i . To circumvent this problem, we suggest to use an iterative algorithm that starts from an initial estimator L n of L n (Θ), defines the vectors Z i = (YΠ) i + L n and then updates L n by the formula L n = L n ( Θ), where the columns of the matrix Θ are defined by the second equality in (10). This algorithm, called iterative soft thresholding, is described in Algorithm 3. Prior to stating the theorem that describes the statistical complexity of this estimator, we present a result that explains why such an iterative algorithm succeeds in improving the convergence rate.
Combining all these results, we arrive at the following risk bound for the iterative soft thresholding estimator.
Theorem 7. Let δ ∈ (0, 1), N ∈ N and let L IST n (N ) be the iterative soft thresholding estimator obtained after N iterations. Assume that p ≥ log(8/δ) and N ≥ log log p. There are some universal strictly positive constants c 1 , c 2 , c 3 such that if the condition s ≤ c 1 n is satisfied then, with probability at least 1 − 2δ, the following inequalities hold true: This implies, in particular, that if p ≤ C(n/s) 2−4ν for some ν ∈ (0, 1/2) close to zero, then performing N = log 2 (1/ν) iterations of the IST algorithm we will recover the mean μ of the inliers at an optimal rate (s/n) 2 ∨ (p/n).
To complete this section, let us briefly note that one can use the Lepski method as described in Section 2.4 for getting an estimator of μ that does not require the knowledge of s. This will only increase the error by a factor at worst equal to 3.
Remark 1. From an intuitive point of view, the algorithm described in Algorithm 3 can be seen as an iterative approximation of the estimator for an appropriately chosen tuning parameter γ > 0, where ρ H is the Huber function. Unfortunately, the cost function in the above minimization problem is not convex with respect to the parameter μ. This implies that general purpose guarantees available for approximating solutions of convex programs are not applicable to (14). To the best of our knowledge, there is no efficient algorithm that provably approximates μ * .

Conclusion and perspectives
In this work, we have studied two problems: the problem of estimating a multidimensional linear functional and the one of estimating the mean of p-variate random vectors when the data is corrupted by outliers. In the first problem, we have obtained upper and lower bounds on the minimax risk that match in most situations. More importantly, in both problems, we have studied computationally tractable estimators and have obtained the best known rates of convergence. A surprising outcome of our work is that exploiting the group structure of the sparsity is far more important in the problem of linear functional estimation rather than in the problem of the whole signal. We have also designed a new robust estimator of the mean that iteratively performs group soft thresholding on a suitable transformation of the data. There are several questions related to the present work that remain open. First, it would be interesting to close the gap in the minimax rate of estimation of a linear functional when p + √ n = O(s). Second, in both problems studied in this work, a challenging question for future research is to establish lower bounds on the minimax risk over computationally tractable estimators. For the problem of robust estimation, one may use a suitable version of the median of means (Lerasle and Oliveira, 2011;Minsker, 2015;Devroye et al., 2016;Lecué and Lerasle, 2017). However, in the multidimensional Gaussian model considered in the present work, to our knowledge there are no upper bounds on the risk of these methods that are qualitatively better than those for the element-wise median, in the sense of their dependence on the proportion of outliers.
There are also some recent papers studying the problem of robust mean estimation in the case of adversarial contamination (Lai, Rao and Vempala, 2016;Diakonikolas et al., 2016;Balakrishnan et al., 2017). Note that the methods used in those papers are quite different from ours. In particular, they are all more computationally demanding in that their iterations perform a singular values decomposition of a p × p matrix. Furthermore, the rates of convergence for those methods have always some extra poly-logarithmic factors as compared to the minimax rate.

Proofs of the main results
This section contains the proofs of the main theorems stated in the previous sections.

Proofs of the theorems of Section 2
Proof of Theorem 1. Using the triangle inequality several times, we get To upper bound the three terms of the right hand side, we introduce the event We will show that the following three claims are true for the tuning parameter λ chosen as in the statement of the theorem.
Claim 1: On the event Ω λ , at least half of the elements of each J belong to the true sparsity pattern S. Thus | S| ≤ 2s.
Let us first show that these claims imply the claim of the theorem. Indeed, the second term of the right hand side of (15) is bounded by σ 12s(p + λs) in view of Claim 2. The third term is bounded by σ 2s(p + λs) on the event Ω λ . Concerning the first term, we know that on Ω λ it is bounded by σ 2| S|(p + λ| S|).
In view of Claim 1, | S| ≤ 2s. All these inequalities imply that L GSS − L(Θ) 2 2 ≤ 60σ 2 s(p + λs) on the event Ω λ . This is exactly the claim of the theorem.
Let us prove now Claims 1-3. To prove the first claim, let us assume that there is a set J among J 1 , . . . , J and a subset J 0 ⊂ J of cardinality 5 |J|/2 such that J 0 ⊂ S c . This readily implies that L(Y J ) 2 2 ≥ 12σ 2 |J|(p + λ|J|) and L(Y J\J0 ) 2 2 < 6σ 2 |J|(p + λ|J|/2). Using the additivity of L and the triangle inequality, we get This is in contradiction with the fact that J is one of the sets J 1 , . . . , J L . So, Claim 1 is proved. The proof of Claim 2 is simpler. By construction, the set S \ S is a subset of I L , where L is the number of steps performed by the algorithm. Since the algorithm terminated after the Lth step, this means that I was empty, which implies that L(Y S\ S ) 2 2 ≤ 12σ 2 |S \ S|(p + λ|S \ S|) ≤ 12σ 2 s(p + λs). It remains to prove Claim 3. This can be done using the union bound and tail bounds for χ 2 p -distributed random variables. Indeed, we have where η ∼ χ 2 p . Using the well known bound on the tails of the χ 2 p distribution, we get Therefore, for λ = 3 /2 log( 2n /δ), we obtain that P Θ Ω c λ ≤ δ. This completes the proof of Claim 3 and of the theorem.
Proof of Theorem 2. Recall that Ξ = (ξ 1 , . . . , ξ n ). First, denoting by S λ the set of indices such that Y i 2 ≥ λ, we decompose so that The first term corresponds to the stochastic error of estimating the signal vectors that are correctly identified as nonzero. We can write The second-order moment of the spectral norm of the random matrix Ξ S can be evaluated using well-known upper bounds on the spectral norm of matrices with independent Gaussian, recalled in Lemma 9 below, so that Set η i = θ i ξ i / θ i 2 . We can control the second term in (17) using the following inequality and, if a, b > 0, then by a simple resolution This readily yields The third term in (17) corresponds to the Type II error in the problem of support estimation. Denoting t = (λ 2 − σ 2 p)/σ 2 , and using tail bounds for the chi-squared random variables (see Lemma 6 below), we get Using the fact that t = 4 log(1 + n/s 2 ) ∨ 16p log(1 + n 2 p/s 4 ) 1/2 we arrive at The result follows from the previous upper bounds and the choice of λ.
Proof of Proposition 1. Recall that S λ denotes the set of indices such that Y i 2 ≥ λ. We define Θ as the matrix with entries = σp −1/4 in the first s columns and 0 elsewhere. Using the inequality ∀a, b, (a − b) 2 ≥ a 2 /2 − b 2 and (16), we get Moreover, L(Ξ S λ \S ) being centered and independent of L(Θ S\S λ ), we can develop First assume that λ 2 ≥ σ 2 p + σ 2 √ p and focus on the second term in the righthand side of the last display. Using Jensen's inequality, we have On the other hand, since sgn(ξ i θ i ) is a Rademacher random variable independent of ξ i 2 2 , for every i ∈ S, we have This last probability converges to 1/2, so that it is larger than 1/4 for all p large enough.
In the other case, λ 2 < σ 2 p + σ 2 √ p, we consider the first term: The probability in the right-hand side converges to Φ(2 −1/2 ) ≤ 0.8, so that for p ≥ 64 and s ≤ n/2, we have Finally, according to (18), we have for p large enough so that We have to distinguish between two cases. If s 2 √ p ≤ 196sp, then the result holds in view of Theorem 5. In the opposite case, the result holds as long as s ≤ n/61.
Proof of Proposition 2. Without loss of generality, we assume σ = 1. To ease notation, we set L = L(Θ) and B = {u ∈ R p : u − L(Y) 2 2 ≤ 7nplog n}. On the one hand, we remark that if L ∈ B, since L pr belongs to the same ball, we have L pr − L 2 2 ≤ 28np log n ∧ L − L 2 2 . Therefore, On the other hand, since L pr ∈ B, we have Since L ∈ B is equivalent to Ξ 2 2 > 7np log n and Ξ 2 2 ∼ nχ 2 p , we have where we have used a standard bound on the tails of χ 2 distribution recalled in Lemma 6. For p ≥ 2 and n ≥ 3, it is easy to see that 12np log ne −1.5p log n = 12(n 1−p log n)(pn −p/2 ) ≤ 12 log 3 3 2 3 ≤ 3. Combining this inequality with (21) and (22), and using the Minkowski inequality, we get where we have used Lemma 5. (20) and (23) together imply the claim of the proposition.
Proof of Theorem 4. To ease notation, for every random vector X we write X L2 for (E[ X 2 2 ]) 1/2 . We first notice that so that we only need to bound the expected squared norms of the three terms in the right-hand side. These three terms have the following meanings: the first one is the bias of estimation or the approximation error, the second term is the stochastic error on the support S, whereas the third term is the stochastic error on S .

Evaluation of the approximation error
For the first term, we use the Minkowski inequality as follows The first part can be treated exactly as in (19) of the proof of Theorem 2, i.e., For assessing the second term in the right hand side of (24), we set We consider two cases. The first case corresponds to 2σθ i ξ i + σ 2 ξ i 2 2 − p < θ i 2 2 /2. In this case one easily checks that T 1,i ≤ √ 2 γ. In the second case, This readily implies that = σsγ + 10σs + 3σs(2p) 1/4 .
Evaluation of the stochastic error on S The second term can be treated exactly as in the proof of Theorem 2 using Wishart matrices, i.e.,

Evaluation of the stochastic error on S
For the third term, we write We conclude by Lemma 6 that the last term satisfies Using the fact that γ 2 = 4 log(1 + n/s 2 ) ∨ 16p log(1 + n 2 p/s 4 ) 1/2 we arrive at This completes the proof of the theorem.

Proofs of the theorems of Section 3
This section gathers the proofs all of the results concerning the problem of robust estimation of a Gaussian mean.
Proof of Theorem 6. In view of (9), we have Developing the left-hand side, this yields Using the Cauchy-Schwarz inequality, we have on the event where we used the triangular inequality. The last inequality implies that Furthermore, using the Cauchy-Schwarz inequality, we get We can then apply Lemma 4, since (26) ensures that the condition is satisfied with a = 3: provided that s ≤ n/16, we have The first claim follows now from Lemma 7. To show the second inequality, it suffices to remark that To complete the proof, we use the fact that n L n (Ξ) 2 2 is a χ 2 (p) random variable, which implies that with probability at least 1 − δ/2 it is bounded from above by 2p + 4 log(2/δ).
Using equation (11) and the fact that w i = 0 for every i ∈ S in the event Ω 1 ∩ Ω 2 , we get Replacing Z i by θ i + Δ n + σξ i and using the triangle inequality, we obtain We will now evaluate the first and the third terms of the right-hand side.
This completes the proof.

Some technical lemmas
Lemma 4. Let us introduce the projection matrix Π = I n − 1 n J n , where I n and J n are respectively the n × n identity matrix and the constant matrix with all the entries equal to 1. Let U be a p × n matrix with columns u i ∈ R p satisfying, for some set S ⊂ {1, . . . , n} and some real number a > 0, then Proof. We denote by u i ∈ R n the column vector corresponding to the i-th row of U. On the one hand, since 1 /nJ n is an orthogonal projection matrix, we have by the Pythagorean theorem Πu i 2 2 = u i 2 2 − 1 n 2 J n u i 2 2 .
In particular, this implies that Lemma 5. If η is a random variable drawn from the χ 2 d distribution, then for every x > 0 P η ≥ d + x ≤ e − 1 16d x(x∧4d) .
Proof. First assume that x ≥ 4d. Combining the relation with Lemma 5, we get This implies that the random variable (ΞΠ) i 2 2 is drawn from the n−1 n χ 2 p distribution, and the result follows from Lemma 5. Lemma 8 (Corollary 5.35 in (Vershynin, 2012)). Assume that A is a N × n random matrix with independent standard Gaussian entries. Then, for any t ≥ 0, with probability at least 1 − 2e −t 2 /2 , it holds that We deduce from this the following lemma.
Lemma 9. If A is a N ×n random matrix with independent standard Gaussian entries, then E[ A 2 ] ≤ 3N + 3n + 12.
Proof. It is clear that The result follows from the fact that the last integral is equal to one.