Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators

For the problem of high-dimensional sparse linear regression, it is known that an $\ell_0$-based estimator can achieve a $1/n$ "fast" rate on the prediction error without any conditions on the design matrix, whereas in the absence of restrictive conditions on the design matrix, popular polynomial-time methods only guarantee the $1/\sqrt{n}$ "slow" rate. In this paper, we show that the slow rate is intrinsic to a broad class of M-estimators. In particular, for estimators based on minimizing a least-squares cost function together with a (possibly non-convex) coordinate-wise separable regularizer, there is always a "bad" local optimum such that the associated prediction error is lower bounded by a constant multiple of $1/\sqrt{n}$. For convex regularizers, this lower bound applies to all global optima. The theory is applicable to many popular estimators, including convex $\ell_1$-based methods as well as M-estimators based on nonconvex regularizers, including the SCAD penalty or the MCP regularizer. In addition, for a broad class of nonconvex regularizers, we show that the bad local optima are very common, in that a broad class of local minimization algorithms with random initialization will typically converge to a bad solution.


Introduction
The classical notion of minimax risk, which plays a central role in decision theory, allows the statistician to implement any possible estimator, regardless of its computational cost. For many statistical problems, there are a variety of statistically optimal estimators, which can be ordered in terms of their computational complexity. Given that it is usually feasible only to implement polynomial-time methods, it has become increasingly important to study computationally constrained analogues of the minimax estimator, in which the set of possible estimators is restricted to computationally efficient ones [27]. A fundamental question is when such computationally constrained forms of minimax risk estimation either coincide with, or differ fundamentally from, their classical counterparts.
The goal of this paper is to explore such gaps between classical and computationally-practical minimax risks, in the context of high-dimensional sparse regression. Our main contribution is to establish a fundamental gap between the classical minimax risk and the best possible risk achievable by a broad class of M-estimators based on coordinate-separable regularizers, a class that includes various nonconvex regularizers used in practice.
In more detail, the classical linear regression model is based on a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times d}$ that are linked via the relationship
$$ y = X\theta^* + w, \qquad (1) $$
where the vector $w \in \mathbb{R}^n$ is some form of observation noise. Our goal is to estimate the unknown regression vector $\theta^* \in \mathbb{R}^d$. Throughout this paper, we focus on the standard Gaussian model, in which the entries of the noise vector $w$ are i.i.d. $N(0, \sigma^2)$ variates, and the case of deterministic design, in which the matrix $X$ is viewed as non-random. In the sparse variant of this model, the regression vector is assumed to have a small number of non-zero coefficients. In particular, for some positive integer $k < d$, the vector $\theta^*$ is said to be $k$-sparse if it has at most $k$ non-zero coefficients. Thus, the model is parameterized by the triple $(n, d, k)$ of sample size $n$, ambient dimension $d$, and sparsity $k$. We use $B_0(k)$ to denote the $\ell_0$-"ball" of all $d$-dimensional vectors with at most $k$ non-zero entries. An estimator $\widehat{\theta}$ is a measurable function of the pair $(y, X)$, taking values in $\mathbb{R}^d$, and its quality can be assessed in different ways. In this paper, we focus on its fixed design prediction error, given by $\mathbb{E}\big[\frac{1}{n}\|X(\widehat{\theta} - \theta^*)\|_2^2\big]$, a quantity that measures how well $\widehat{\theta}$ can be used to predict the vector $X\theta^*$ of noiseless responses. The worst-case prediction error of an estimator $\widehat{\theta}$ over the set $B_0(k)$ is given by
$$ M_{n,k,d}(\widehat{\theta}; X) := \sup_{\theta^* \in B_0(k)} \mathbb{E}\Big[\frac{1}{n}\|X(\widehat{\theta} - \theta^*)\|_2^2\Big]. \qquad (2) $$
Given that $\theta^*$ is $k$-sparse, the most direct approach would be to seek a $k$-sparse minimizer of the least-squares cost $\|y - X\theta\|_2^2$, thereby obtaining the $\ell_0$-based estimator
$$ \widehat{\theta}_{\ell_0} \in \arg\min_{\theta \in B_0(k)} \|y - X\theta\|_2^2. \qquad (3) $$
The $\ell_0$-based estimator $\widehat{\theta}_{\ell_0}$ is known [7,23] to satisfy a bound of the form
$$ M_{n,k,d}(\widehat{\theta}_{\ell_0}; X) \;\lesssim\; \sigma^2 \frac{k \log d}{n}, \qquad (4) $$
where $\lesssim$ denotes an inequality up to constant factors (independent of the triple $(n, d, k)$ as well as the standard deviation $\sigma$). However, this estimator is not easy to compute in a brute-force manner, since there are $\binom{d}{k}$ subsets of size $k$ to consider.
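To make the brute-force nature of the $\ell_0$-based estimator (3) concrete, here is a minimal pure-Python sketch that enumerates every support of size $k$ and solves the restricted least-squares problem on each; the helper names `solve` and `l0_estimator` are ours, for illustration only.

```python
import itertools

def solve(A, b):
    """Solve a small linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def l0_estimator(X, y, k):
    """Brute-force minimizer of ||y - X theta||_2^2 over k-sparse theta:
    try every support of size k and solve the restricted least-squares problem."""
    n, d = len(X), len(X[0])
    best_theta, best_rss = None, float("inf")
    for S in itertools.combinations(range(d), k):
        XS = [[X[i][j] for j in S] for i in range(n)]  # n x k submatrix
        # Normal equations on the restricted support S.
        G = [[sum(XS[i][a] * XS[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
        c = [sum(XS[i][a] * y[i] for i in range(n)) for a in range(k)]
        beta = solve(G, c)
        rss = sum((y[i] - sum(XS[i][a] * beta[a] for a in range(k))) ** 2 for i in range(n))
        if rss < best_rss:
            best_rss = rss
            best_theta = [0.0] * d
            for a, j in enumerate(S):
                best_theta[j] = beta[a]
    return best_theta
```

The loop over $\binom{d}{k}$ supports is exactly what makes this estimator intractable at scale.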
The computational intractability of the $\ell_0$-based estimator has motivated the use of various heuristic algorithms and approximations, including the basis pursuit method [9], the Dantzig selector [8], and the extended family of Lasso estimators [25,9,31,2]. Essentially, these methods replace the $\ell_0$-constraint with an $\ell_1$-based relaxation, in either constrained or penalized form. There is now a very large body of work on the performance of such methods, covering different criteria including support recovery, $\ell_2$-norm error, and prediction error (e.g., see the book [6] and references therein for further details).
For the case of fixed design prediction error that is the primary focus here, such $\ell_1$-based estimators are known to achieve the bound (4) only if the design matrix $X$ satisfies certain conditions, such as the restricted eigenvalue (RE) condition [4,26] or the restricted isometry property [8]. Without such conditions, the best known guarantees for $\ell_1$-based estimators are of the form
$$ M_{n,k,d}(\widehat{\theta}; X) \;\lesssim\; \sigma R \sqrt{\frac{\log d}{n}}, \qquad (5) $$
a bound that is valid without any RE conditions on the design matrix $X$ whenever the $k$-sparse regression vector $\theta^*$ has $\ell_1$-norm bounded by $R$ [7,21,23]. The substantial gap between the "fast" rate (4) and the "slow" rate (5) leaves open a fundamental question: is there a computationally efficient estimator attaining the bound (4) for general design matrices? One possibility is that existing analyses of prediction error are overly conservative, and $\ell_1$-based methods can actually achieve the bound (4) without additional constraints on $X$. Foygel and Srebro [13] gave a partial negative answer to this question, specifically by constructing a 2-sparse regression vector and a design matrix for which the Lasso prediction error is lower bounded by $1/\sqrt{n}$. Their construction is specific to the Lasso, and thus does not answer the question more generally. Our own recent work [30] establishes a complexity-theoretic result: more precisely, under a standard complexity-theoretic condition (namely, that NP is not contained in P/poly), there is no polynomial-time algorithm that returns a $k$-sparse vector achieving the fast rate. However, this hardness result does not eliminate the possibility of a polynomial-time estimator that returns dense vectors satisfying the fast rate for prediction error.
In this paper, we provide additional evidence against the polynomial achievability of the fast rate (4), in particular by showing that the slow rate (5) is a lower bound for a broad class of M-estimators, namely those based on minimizing a least-squares cost function together with a coordinate-wise separable regularizer. In particular, we consider estimators that are based on an objective function of the form
$$ L(\theta; \lambda) = \frac{1}{n}\|y - X\theta\|_2^2 + \lambda\,\rho(\theta) $$
for a weighted regularizer $\rho : \mathbb{R}^d \to \mathbb{R}$ that is coordinate-separable. Our first main result (Theorem 1) establishes that there is always a matrix $X \in \mathbb{R}^{n \times d}$ such that for any coordinate-wise separable function $\rho$ and for any choice of weight $\lambda \ge 0$, the objective $L$ always has at least one local optimum $\widetilde{\theta}_\lambda$ whose prediction error is lower bounded by a constant multiple of $\sigma R/\sqrt{n}$. Moreover, if the regularizer $\rho$ is convex, then this lower bound applies to all global optima of the convex criterion $L$. This lower bound is applicable to many popular estimators, including the ridge regression estimator [16], the basis pursuit method [9], the Lasso estimator [25], the weighted Lasso estimator [31], the square-root Lasso estimator [2], and least-squares based on nonconvex regularizers such as the SCAD penalty [12] or the MCP penalty [29].
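The class of M-estimators just described (a least-squares cost plus a coordinate-separable penalty) is simple to write down concretely. Below is a minimal pure-Python sketch that evaluates such an objective for arbitrary univariate penalty functions; the function name and signature are ours, for illustration.

```python
def separable_objective(X, y, theta, lam, rho_list):
    """Evaluate L(theta; lam) = (1/n) * ||y - X theta||_2^2 + lam * sum_j rho_j(theta_j)
    for a coordinate-wise separable regularizer rho(theta) = sum_j rho_j(theta_j)."""
    n = len(X)
    resid = [y[i] - sum(X[i][j] * theta[j] for j in range(len(theta))) for i in range(n)]
    fit = sum(r * r for r in resid) / n
    penalty = sum(rho_j(t) for rho_j, t in zip(rho_list, theta))
    return fit + lam * penalty
```

Passing `rho_list = [abs] * d` recovers the Lasso objective, `lambda t: t * t` gives ridge regression, and per-coordinate choices give the weighted Lasso.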
In the nonconvex setting, it is impossible (in general) to guarantee anything beyond local optimality for any solution found by a polynomial-time algorithm [14]. Nevertheless, to play the devil's advocate, one might argue that allowing an adversary to pick a bad local optimum is overly pessimistic for statistical problems. In order to address this concern, we prove a second result (Theorem 2) demonstrating that bad local solutions are difficult to avoid. Focusing on a class of local descent methods, we show that given a random isotropic initialization centered at the origin, the resulting fixed points have poor mean-squared error: they can only achieve the slow rate. In this way, this paper shows that the gap between the fast and slow rates in high-dimensional sparse regression cannot be closed via standard application of a very broad class of methods. In conjunction with our earlier complexity-theoretic paper [30], it adds further weight to the conjecture that there is a fundamental gap between the performance of polynomial-time and exponential-time methods for sparse prediction.
The remainder of this paper is organized as follows. We begin in Section 2 with further background, including a precise definition of the family of M-estimators considered in this paper, some illustrative examples, and a discussion of the prediction error bound achieved by the Lasso. Section 3 is devoted to the statements of our main results, along with a discussion of their consequences. In Section 4, we provide the proofs of our main results, with some technical lemmas deferred to the appendices. We conclude with a discussion in Section 5.
Background and problem set-up

As previously described, an instance of the sparse linear regression problem is based on observing a pair $(X, y) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n$ linked via the linear model (1), where the unknown regression vector $\theta^*$ is assumed to be $k$-sparse, and so belongs to the $\ell_0$-ball $B_0(k)$. Our goal is to find a good predictor, meaning a vector $\widehat{\theta}$ such that the mean-squared prediction error $\frac{1}{n}\|X(\widehat{\theta} - \theta^*)\|_2^2$ is small.

Least-squares with coordinate-separable regularizers
The analysis of this paper applies to estimators that are based on minimizing a cost function of the form
$$ L(\theta; \lambda) := \frac{1}{n}\|y - X\theta\|_2^2 + \lambda\,\rho(\theta), \qquad (7) $$
where $\rho : \mathbb{R}^d \to \mathbb{R}$ is a regularizer, and $\lambda \ge 0$ is a regularization weight. We consider the following family $\mathcal{F}$ of coordinate-separable regularizers:
(i) The function $\rho : \mathbb{R}^d \to \mathbb{R}$ is coordinate-wise decomposable, meaning that $\rho(\theta) = \sum_{j=1}^d \rho_j(\theta_j)$ for some univariate functions $\rho_j : \mathbb{R} \to \mathbb{R}$.
Let us consider some examples to illustrate this definition.
Bridge regression: The family of bridge regression estimates takes the form
$$ \widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda \sum_{j=1}^d |\theta_j|^\gamma \Big\}. $$
Note that this is a special case of the objective function (7) with $\rho_j(\cdot) = |\cdot|^\gamma$ for each coordinate. When $\gamma \in \{1, 2\}$, it corresponds to the Lasso estimator and the ridge regression estimator, respectively. The analysis of this paper provides lower bounds for both estimators, uniformly over the choice of $\lambda$.
Weighted Lasso: The weighted Lasso estimator [31] uses a weighted $\ell_1$-norm to regularize the empirical risk, and leads to the estimator
$$ \widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda \sum_{j=1}^d \alpha_j |\theta_j| \Big\}. $$
Here $\alpha_1, \ldots, \alpha_d$ are weights that can be adaptively chosen with respect to the design matrix $X$.
The weighted Lasso can perform better than the ordinary Lasso, which corresponds to the special case in which the weights $\alpha_j$ are all equal. For instance, on the counter-example proposed by Foygel and Srebro [13], for which the ordinary Lasso estimator achieves only the slow $1/\sqrt{n}$ rate, the weighted Lasso estimator achieves the $1/n$ convergence rate. Nonetheless, the analysis of this paper shows that there are design matrices for which the weighted Lasso, even when the weights are chosen adaptively with respect to the design, has prediction error at least a constant multiple of $1/\sqrt{n}$.
Square-root Lasso: The square-root Lasso estimator [2] is defined by minimizing the criterion
$$ \frac{1}{\sqrt{n}}\|y - X\theta\|_2 + \lambda \|\theta\|_1. $$
This criterion is slightly different from our general objective function (7), since it involves the square root of the least-squares error. Relative to the Lasso, its primary advantage is that the optimal setting of the regularization parameter does not require knowledge of the standard deviation of the noise. For the purposes of the current analysis, it suffices to note that by Lagrangian duality, every square-root Lasso estimate $\widehat{\theta}_{\mathrm{sqrt}}$ is a minimizer of the least-squares criterion $\|y - X\theta\|_2$ subject to $\|\theta\|_1 \le R$, for some radius $R \ge 0$ depending on $\lambda$. Consequently, as the weight $\lambda$ is varied over the interval $[0, \infty)$, the square-root Lasso yields the same solution path as the Lasso. Since our lower bounds apply to the Lasso for any choice of $\lambda \ge 0$, they also apply to all square-root Lasso solutions.
SCAD penalty or MCP regularizer: Due to the intrinsic bias induced by $\ell_1$-regularization, various forms of nonconvex regularization are widely used. Two of the most popular are the SCAD penalty, due to Fan and Li [12], and the MCP penalty, due to Zhang [29]. The family of SCAD penalties takes the form
$$ \phi_\lambda(t) = \begin{cases} \lambda |t| & \text{for } |t| \le \lambda, \\ \dfrac{2a\lambda|t| - t^2 - \lambda^2}{2(a-1)} & \text{for } \lambda < |t| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{for } |t| > a\lambda, \end{cases} $$
where $a > 2$ is a fixed parameter. When used with the least-squares objective, it is a special case of our general set-up with $\rho_j(\theta_j) = \phi_\lambda(\theta_j)$ for each coordinate $j = 1, \ldots, d$. Similarly, the MCP penalty takes the form
$$ \psi_\lambda(t) = \lambda \int_0^{|t|} \Big(1 - \frac{x}{b\lambda}\Big)_+ \, dx, $$
where $b > 0$ is a fixed parameter. It can be verified that both the SCAD penalty and the MCP regularizer belong to the function class $\mathcal{F}$ previously defined. See Figure 1 for a graphical illustration of the SCAD penalty and the MCP regularizer.
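Both penalties have simple closed forms; the following sketch implements the standard SCAD and MCP formulas (our own function names), with default parameter values $a = 3.7$ and $b = 2.7$ matching those used later in the simulations.

```python
def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li: behaves like lam*|t| near zero, then tapers
    off to the constant lam^2*(a+1)/2 for |t| > a*lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def mcp(t, lam, b=2.7):
    """MCP penalty of Zhang: lam*|t| - t^2/(2b) until |t| = b*lam, constant after."""
    t = abs(t)
    if t <= b * lam:
        return lam * t - t * t / (2 * b)
    return b * lam * lam / 2
```

Both functions are continuous at their breakpoints and constant beyond them, which is what removes the bias that the $\ell_1$-penalty imposes on large coefficients.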

Prediction error for the Lasso
We now turn to a precise statement of the best known upper bounds for the Lasso prediction error. When the design matrix satisfies a restricted eigenvalue (RE) condition [4,26], the Lasso is known to achieve the fast rate (4) for prediction error. As can be shown by information-theoretic techniques [23], for the purposes of obtaining estimates with low $\ell_2$-error $\|\widehat{\theta} - \theta^*\|_2$, such RE conditions cannot be avoided: any method (including the $\ell_0$-based estimator) requires some sort of control on the sparse restricted eigenvalues. However, the problem of finding a good predictor should be easier than recovering $\theta^*$ in $\ell_2$-norm, and strong correlations between the columns of $X$, which lead to violations of the RE conditions, should have no effect on the intrinsic difficulty of the prediction problem. (In fact, Raskutti et al. [23] show that many problems with strongly correlated columns are easy from the prediction point of view.) Recall that the $\ell_0$-based estimator $\widehat{\theta}_{\ell_0}$ satisfies the prediction error upper bound (4) without any constraint on the design matrix.
In the absence of RE conditions, $\ell_1$-based methods are known to achieve the slow $1/\sqrt{n}$ rate, with the only constraint on the design matrix being a uniform column bound. More precisely, letting $X_j \in \mathbb{R}^n$ denote the $j$-th column of the design matrix $X$, we say that $X$ is 1-column normalized if
$$ \max_{j = 1, \ldots, d} \frac{\|X_j\|_2}{\sqrt{n}} \le 1. \qquad (8) $$
Our choice of the constant 1 is to simplify notation; the more general notion allows for an arbitrary constant $C$ in this bound. For any design matrix satisfying this normalization condition, the following result [4] provides a bound on the prediction error of the Lasso estimator:
Proposition 1 (Prediction error for Lasso). Consider the standard linear model for a design matrix $X$ satisfying the column normalization condition (8). Then for any vector $\theta^* \in B_0(k) \cap B_1(R)$ and any $\delta > 0$, the Lasso estimator $\widehat{\theta}_{\lambda_n}$, with $\lambda_n = 4\sigma\sqrt{\frac{2\log d}{n}}$, satisfies the bound
$$ \frac{1}{n}\|X(\widehat{\theta}_{\lambda_n} - \theta^*)\|_2^2 \;\lesssim\; \sigma R \sqrt{\frac{\log d}{n}} + R\,\delta \qquad (9) $$
with probability at least $1 - c_1 d\, e^{-c_2 n \delta^2}$.
By integrating this tail bound, it follows that
$$ \mathbb{E}\Big[\frac{1}{n}\|X(\widehat{\theta}_{\lambda_n} - \theta^*)\|_2^2\Big] \;\lesssim\; \sigma R \sqrt{\frac{\log d}{n}}. \qquad (10) $$
Although this bound holds without RE conditions on the design, note that it is substantially slower than the $1/n$ fast rate for reasonable ranges of $(k, R)$. One might wonder whether the analysis leading to the bound (9) could be sharpened so as to obtain the fast rate. Among other consequences, our first main result (Theorem 1 below) shows that no substantial sharpening is possible.

Main results
Having provided the needed background, we now turn to statements of our main results, and discussion of their consequences.

A general lower bound
Our analysis applies to the set of local minima of the objective function $L$ defined in equation (7). More precisely, a vector $\widetilde{\theta}$ is a local minimum of the function $\theta \mapsto L(\theta; \lambda)$ if there is an open ball $B$ centered at $\widetilde{\theta}$ such that $\widetilde{\theta} \in \arg\min_{\theta \in B} L(\theta; \lambda)$. We then define the set
$$ \Theta_\lambda := \big\{ \widetilde{\theta} \in \mathbb{R}^d \;\big|\; \widetilde{\theta} \text{ is a local minimum of } L(\cdot\,; \lambda) \big\}, $$
an object that depends on the triplet $(X, y, \rho)$ as well as the choice of regularization weight $\lambda$. Since the function $\rho$ might be non-convex, the set $\Theta_\lambda$ may contain multiple elements. A typical descent method applied to the objective $L$ can be expected (at best) to converge to some element of $\Theta_\lambda$. The following theorem provides a lower bound, applicable to any method that always returns some local minimum of the objective function (7).
Theorem 1. For any pair $(n, d)$ in the range $d \ge n \ge 4$, any sparsity level $k \ge 2$ and any radius $R \ge 8\sigma/\sqrt{n}$, there is a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying the column normalization condition (8) such that for any coordinate-separable penalty, we have
$$ \sup_{\theta^* \in B_0(k) \cap B_1(R)}\; \max_{\widetilde{\theta}_\lambda \in \Theta_\lambda} \mathbb{E}\Big[\frac{1}{n}\|X(\widetilde{\theta}_\lambda - \theta^*)\|_2^2\Big] \;\ge\; c \min\Big\{\frac{\sigma R}{\sqrt{n}},\, \sigma^2\Big\} \quad \text{for all } \lambda \ge 0. \qquad (11a) $$
Moreover, for any convex coordinate-separable penalty, we have
$$ \sup_{\theta^* \in B_0(k) \cap B_1(R)}\; \min_{\widetilde{\theta}_\lambda \in \Theta_\lambda} \mathbb{E}\Big[\frac{1}{n}\|X(\widetilde{\theta}_\lambda - \theta^*)\|_2^2\Big] \;\ge\; c \min\Big\{\frac{\sigma R}{\sqrt{n}},\, \sigma^2\Big\} \quad \text{for all } \lambda \ge 0. \qquad (11b) $$
In both of these statements, the constant $c$ is universal, independent of $(n, d, k, \sigma, R)$ as well as the design matrix. See Section 4.1 for the proof.
In order to interpret the lower bound (11a), consider any estimator $\widetilde{\theta}$ that takes values in the set $\Theta_\lambda$, corresponding to local minima of $L$. The result has a game-theoretic flavor: the statistician is allowed to adaptively choose $\lambda$ based on the observations $(y, X)$, whereas nature is allowed to act adversarially in choosing a local minimum for every execution of $\widetilde{\theta}_\lambda$. Under this setting, Theorem 1 implies that the worst-case prediction error remains lower bounded by a constant multiple of $\sigma R/\sqrt{n}$, no matter how the weight $\lambda$ is chosen. For any convex regularizer (such as the $\ell_1$-penalty underlying the Lasso estimator), equation (11b) provides a stronger lower bound, one that holds uniformly over all choices of $\lambda \ge 0$ and all (global) minima. For the Lasso estimator, the lower bound of Theorem 1 matches the upper bound (9) up to a logarithmic factor in $d$, showing that the lower bound is almost tight.

Lower bounds for local descent methods
For any least-squares cost with a coordinate-wise separable regularizer, Theorem 1 establishes the existence of at least one "bad" local minimum such that the associated prediction error is lower bounded by 1/ √ n. One might argue that this result could be overly pessimistic, in that the adversary is given too much power in choosing local minima. Indeed, the mere existence of bad local minima need not be a practical concern unless it can be shown that a typical optimization algorithm will frequently converge to one of them. Steepest descent is a standard first-order algorithm for minimizing a convex cost function [3,5]. However, for non-convex and non-differentiable loss functions, it is known that the steepest descent method does not necessarily yield convergence to a local minimum [11,28]. Although there exist provably convergent first-order methods for general non-convex optimization (e.g., [20,17]), the paths defined by their iterations are difficult to characterize, and it is also difficult to predict the point to which the algorithm eventually converges.
In order to address a broad class of methods in a unified manner, we begin by observing that most first-order methods can be seen as iteratively and approximately solving a local minimization problem. For example, given a stepsize parameter $\eta > 0$, the method of steepest descent iteratively approximates the minimizer of the objective over a ball of radius $\eta$. Similarly, the convergence guarantees of algorithms for non-convex optimization are based on the fact that they decrease the function value in a local neighborhood of the current iterate [20,17]. We thus study an iterative local descent algorithm taking the form
$$ \theta^{t+1} \in \arg\min_{\theta \in B_2(\eta;\,\theta^t)} L(\theta; \lambda), \qquad (13) $$
where $\eta$ is a given parameter, and $B_2(\eta; \theta^t) := \{\theta \in \mathbb{R}^d \mid \|\theta - \theta^t\|_2 \le \eta\}$ is the ball of radius $\eta$ around the current iterate. If there are multiple points achieving the optimum, the algorithm chooses one that is closest to $\theta^t$, resolving any remaining ties by randomization. The algorithm terminates when there is a minimizer belonging to the interior of the ball $B_2(\eta; \theta^t)$, that is, exactly when $\theta^{t+1}$ is a local minimum of the loss function.
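The idealized update (13) can be mimicked on toy problems by exhaustive search over the ball. The sketch below approximates the ball by a finite grid, so it is a crude illustration only (and exponential in the dimension), not the algorithm analyzed in the paper; all names are ours.

```python
import itertools
import math

def local_descent_step(f, theta, eta, grid=41):
    """One step of the idealized update (13): minimize f over the Euclidean ball
    of radius eta around theta, approximated by a finite grid search. Among the
    (approximate) minimizers, return the one closest to the current iterate."""
    d = len(theta)
    offsets = [[-eta + 2 * eta * i / (grid - 1) for i in range(grid)]] * d
    best = None
    for off in itertools.product(*offsets):
        dist = math.sqrt(sum(o * o for o in off))
        if dist > eta:
            continue  # outside the ball B_2(eta; theta)
        cand = [t + o for t, o in zip(theta, off)]
        key = (round(f(cand), 12), dist)  # value first, then tie-break by distance
        if best is None or key < best[0]:
            best = (key, cand)
    return best[1]

def local_descent(f, theta0, eta, max_iter=100):
    """Iterate (13) until the minimizer stays at the current point, i.e. until
    a (grid-)local minimum has been reached."""
    theta = theta0
    for _ in range(max_iter):
        nxt = local_descent_step(f, theta, eta)
        if all(abs(a - b) < 1e-12 for a, b in zip(nxt, theta)):
            return theta
        theta = nxt
    return theta
```

On a convex quadratic this toy version walks to (a grid neighborhood of) the global minimum; on the nonconvex objectives studied here, Theorem 2 says the analogous exact update gets stuck at bad local minima.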
It should be noted that the update (13) defines a powerful algorithm, one that might not be implementable in polynomial time, since it requires the global minimum of a nonconvex program over the ball $B_2(\eta; \theta^t)$. In a certain sense, it is more powerful than any first-order optimization method, since it decreases the function value at least as much as any descent step with stepsize related to $\eta$. Since we are proving lower bounds, these observations only strengthen our result. We impose two additional conditions on the regularizers:
(iv) Each component function $\rho_j$ is continuous at the origin.
(v) There is a constant $H$ such that $|\rho'_j(x) - \rho'_j(\tilde{x})| \le H|x - \tilde{x}|$ for any pair $x, \tilde{x} \in (0, \infty)$.
Assumptions (i)-(v) are more restrictive than assumptions (i)-(iii), but they are satisfied by many popular penalties. As illustrative examples, for the $\ell_1$-norm, we have $H = 0$. For the SCAD penalty, we have $H = 1/(a-1)$, whereas for the MCP regularizer, we have $H = 1/b$. Finally, in order to prevent the update (13) from being so powerful that it reaches the global minimum in a single step, we impose an additional condition on the stepsize, namely that $\eta = O(1/\sqrt{n})$. It is reasonable to assume that the stepsize is bounded by a time-invariant constant, since we can always partition a single update into a finite number of smaller steps, increasing the algorithm's time complexity by a multiplicative constant. Moreover, the $O(1/\sqrt{n})$ stepsize is adopted by popular first-order methods. Under these assumptions, we have the following theorem, which applies to any regularizer $\rho$ satisfying Assumptions (i)-(v).
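Condition (v) can be checked numerically. The sketch below uses the standard closed form of the SCAD derivative on $(0, \infty)$ and a crude grid estimate of its Lipschitz constant, which recovers $H = 1/(a-1)$; the function names are ours.

```python
def scad_grad(t, lam=1.0, a=3.7):
    """Derivative on (0, inf) of the SCAD penalty: constant lam near zero,
    linearly decaying on (lam, a*lam], and zero beyond a*lam."""
    if t <= lam:
        return lam
    if t <= a * lam:
        return (a * lam - t) / (a - 1)
    return 0.0

def lipschitz_estimate(grad, lo, hi, steps=20000):
    """Crude grid estimate of sup |grad(x) - grad(x')| / |x - x'| over (lo, hi)."""
    h = (hi - lo) / steps
    pts = [lo + i * h for i in range(steps + 1)]
    return max(abs(grad(pts[i + 1]) - grad(pts[i])) / h for i in range(steps))
```

The maximal difference quotient is attained in the middle region of the SCAD derivative, whose slope has magnitude exactly $1/(a-1)$.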
Theorem 2. For any pair $(n, d)$ such that $d \ge n \ge 4$, any integer $k \ge 2$, and any scalars $\gamma \ge 0$ and $R \ge \sigma/\sqrt{n}$, there is a design matrix $X \in \mathbb{R}^{n \times d}$ satisfying the column normalization condition (8) and such that (a) The update (13) terminates after a finite number of steps $T$ at a vector $\widehat{\theta} = \theta^{T+1}$ that is a local minimum of the loss function.
(b) Given a random initialization $\theta^0 \sim N(0, \gamma^2 I_{d \times d})$, the local minimum satisfies the lower bound
$$ \mathbb{E}\Big[\frac{1}{n}\|X(\widehat{\theta} - \theta^*)\|_2^2\Big] \;\ge\; c\, \frac{\sigma \min\{R, \sigma\}}{\sqrt{n}}. $$
Theorem 2 shows that local descent methods based on a random initialization do not lead to local optima that achieve the fast rate. This conclusion provides stronger negative evidence than Theorem 1, since it shows that bad local minima not only exist, but are difficult to avoid.

Simulations
In the proofs of Theorems 1 and 2, we construct specific design matrices to make the problem hard to solve. In this section, we apply several popular algorithms to the sparse linear regression problem on these "hard" examples, and compare their performance with the $\ell_0$-based estimator (3). More specifically, focusing on the special case $n = d$, we perform simulations for the design matrix $X \in \mathbb{R}^{n \times n}$ used in the proof of Theorem 2, namely the matrix defined by equations (15a) and (15b). Given the 2-sparse regression vector $\theta^* = (0.5, 0.5, 0, \ldots, 0)$, we form the response vector $y = X\theta^* + w$, where $w \sim N(0, I_{n \times n})$.
We compare the $\ell_0$-based estimator, referred to as the baseline estimator, with three other methods: the Lasso estimator [25], the estimator based on the SCAD penalty [12], and the estimator based on the MCP penalty [29]. In implementing the $\ell_0$-based estimator, we provide it with the knowledge that $k = 2$, since the true vector $\theta^*$ is 2-sparse. For the Lasso, we adopt the MATLAB implementation [1], which generates a Lasso solution path evaluated at 100 different regularization parameters, and we choose the estimate that yields the smallest prediction error. For the SCAD penalty, we choose $a = 3.7$, as suggested by Fan and Li [12]. For the MCP penalty, we choose $b = 2.7$, so that the maximum concavity of the MCP penalty matches that of the SCAD penalty. For the SCAD and MCP penalties (and recalling that $d = n$), we studied choices of the regularization weight of the form $\lambda = C\sqrt{\frac{\log n}{n}}$ for a pre-factor $C$ to be determined. As shown in past work on non-convex regularizers [19], such choices of $\lambda$ lead to low $\ell_2$-error. By manually tuning the parameter $C$ to optimize the prediction error, we found that $C = 0.1$ is a reasonable choice. We used routines from the GIST package [15] to optimize these non-convex objectives.
By varying the sample size over the range 10 to 1000, we obtained the results plotted in Figure 2, in which the prediction error $\mathbb{E}\big[\frac{1}{n}\|X(\widehat{\theta} - \theta^*)\|_2^2\big]$ and the sample size $n$ are both plotted on a logarithmic scale. The performance of the Lasso, the SCAD-based estimate, and the MCP-based estimate are all similar. For all three methods, the prediction error scales as $1/\sqrt{n}$, as confirmed by the slopes of the corresponding lines in Figure 2, which are very close to 0.5. In fact, by examining the estimators' output, we find that in many cases all three estimators output $\widehat{\theta} = 0$, leading to the prediction error $\frac{1}{n}\|X(0 - \theta^*)\|_2^2 = \frac{1}{\sqrt{n}}$. Since the regularization parameters have been chosen to optimize the prediction error, this scaling is the best rate that the three estimators are able to achieve, and it matches the theoretical predictions of Theorems 1 and 2.
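The slope readings from Figure 2 amount to fitting a line to the points $(\log n, \log \text{error})$. A minimal sketch of that calculation, applied here to synthetic error values following the two rates (illustrative data, not the paper's simulation output):

```python
import math

def loglog_slope(ns, errs):
    """Least-squares slope of log(err) against log(n): for err = c * n**(-p),
    the fitted slope is -p, the quantity read off from a log-log plot."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

ns = [10, 100, 1000]
slow = [3.0 / math.sqrt(n) for n in ns]   # err ~ c / sqrt(n): slope -0.5
fast = [3.0 / n for n in ns]              # err ~ c / n:       slope -1.0
```

A fitted slope near $-0.5$ corresponds to the slow rate, and a slope near $-1$ to the fast rate.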
In contrast, the ℓ 0 -based estimator achieves a substantially better error rate. The slope of the corresponding line in Figure 2 is very close to 1. It means that the prediction error of the ℓ 0 -based estimator scales as 1/n, thereby matching the theoretically predicted scaling (4).

Proofs
We now turn to the proofs of our two main theorems. In each case, we defer the proofs of more technical results to the appendices.

Proof of Theorem 1
For a given triple $(n, \sigma, R)$, we define the angle $\alpha := \arcsin\Big(\frac{\sqrt{\sigma}}{n^{1/4}\sqrt{32 R}}\Big)$, and the two-by-two matrix
$$ A := \sqrt{n} \begin{pmatrix} \cos\alpha & -\cos\alpha \\ \sin\alpha & \sin\alpha \end{pmatrix}. \qquad (15a) $$
Using the matrix $A \in \mathbb{R}^{2 \times 2}$ as a building block, we construct a design matrix $X \in \mathbb{R}^{n \times d}$. Without loss of generality, we may assume that $n$ is divisible by 2. (If $n$ is not divisible by 2, appending a row of zeros to an $(n-1) \times d$ design matrix changes the result only by a constant factor.) We then define
$$ X := \begin{pmatrix} A & & & \\ & \ddots & & 0 \\ & & A & \end{pmatrix}, \qquad (15b) $$
where the diagonal contains $n/2$ copies of $A$, and the all-zeroes matrix on the right side has dimensions $n \times (d - n)$. It is easy to verify that the matrix $X$ defined in this way satisfies the column normalization condition (8).
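The block structure (15b) is easy to instantiate. The sketch below assumes a $2 \times 2$ block whose paired columns are nearly anti-parallel unit vectors (angles $\alpha$ and $\pi - \alpha$), rescaled by $\sqrt{n}$ to saturate condition (8); the precise entries are those of (15a), and this version is only meant to show the structure.

```python
import math

def build_design(n, d, alpha):
    """Sketch of the block-diagonal design (15b): n/2 copies of a 2x2 block A on
    the diagonal, padded with an n x (d - n) block of zeros. Assumes a block
    whose paired columns are nearly anti-parallel unit vectors, rescaled by
    sqrt(n); the exact block is given by (15a)."""
    s = math.sqrt(n)
    A = [[s * math.cos(alpha), -s * math.cos(alpha)],
         [s * math.sin(alpha), s * math.sin(alpha)]]
    X = [[0.0] * d for _ in range(n)]
    for blk in range(n // 2):
        for i in range(2):
            for j in range(2):
                X[2 * blk + i][2 * blk + j] = A[i][j]
    return X

def column_norms(X):
    """Euclidean column norms, to check condition (8): ||X_j||_2 <= sqrt(n)."""
    n = len(X)
    return [math.sqrt(sum(X[i][j] ** 2 for i in range(n))) for j in range(len(X[0]))]
```

With $\theta^* = (R/2, R/2, 0, \ldots, 0)$, the paired columns nearly cancel: $\frac{1}{n}\|X\theta^*\|_2^2 = R^2\sin^2\alpha$, which is the small quantity driving the lower bound.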
Next, we prove the lower bound (11a). For any integers $i, j \in [d]$ with $i < j$, let $\theta_i$ denote the $i$-th coordinate of $\theta$, and let $\theta_{i:j}$ denote the subvector with entries $\{\theta_i, \ldots, \theta_j\}$. Since the matrix $A$ appears in the diagonal blocks of $X$, we have
$$ \frac{1}{n}\|X(\theta - \theta^*)\|_2^2 = \sum_{i=1}^{n/2} \frac{1}{n}\big\|A\big(\theta_{2i-1:2i} - \theta^*_{2i-1:2i}\big)\big\|_2^2, \qquad (16) $$
and it suffices to lower bound the right-hand side of this equation.
For the sake of simplicity, we introduce the shorthand $B := \frac{4\sigma}{\sqrt{n}}$, and define the scalars $a_i$ and $w'_i$ for each block $i \in [n/2]$, along with the quantities $\gamma_i$ associated with the penalty functions of each column pair. Without loss of generality, we may assume that $\gamma_1 = \max_{i \in [n/2]} \gamma_i$. If this assumption does not hold, we can simply re-index the columns of $X$ to make it hold. Note that when we swap the columns $2i-1$ and $2i$, the value of $a_i$ does not swap; that is, it is always associated with the column whose regularization term is equal to $\gamma_i$. Finally, we define the regression vector $\theta^* = \big(\frac{R}{2}, \frac{R}{2}, 0, \ldots, 0\big) \in \mathbb{R}^d$. Given these definitions, the following lemma lower bounds each term on the right-hand side of equation (16).
Lemma 1. For any $\lambda \ge 0$, there is a local minimum $\widetilde{\theta}_\lambda$ of the objective function $L(\theta; \lambda)$ such that $\frac{1}{n}\|X(\widetilde{\theta}_\lambda - \theta^*)\|_2^2 \ge T_1 + T_2$. Moreover, if the regularizer $\rho$ is convex, then every minimizer $\widetilde{\theta}_\lambda$ satisfies this lower bound.
See Appendix A for the proof of this claim. Using Lemma 1, we can now complete the proof of the theorem. It is convenient to condition on the event $\mathcal{E} := \big\{\|w_{1:2}\|_2 \le \frac{\sigma}{32}\big\}$. Since $\|w_{1:2}\|_2^2/\sigma^2$ follows a chi-square distribution with two degrees of freedom, we have $\mathbb{P}[\mathcal{E}] > 0$. Conditioned on this event, we now consider two separate cases:
Case 1: First, suppose that $\lambda\gamma_1 > \sigma^2/n$. In this case, we arrive at the lower bound (20a), where the last inequality holds since we have assumed that $R \ge 8\sigma/\sqrt{n}$.
Case 2: Otherwise, we may assume that $\lambda\gamma_1 \le \sigma^2/n$, in which case we arrive at the lower bound (20b). Combining the two lower bounds (20a) and (20b) yields a lower bound involving a quantity $T_3$, where we have used the fact that the variables $\{w'_i\}_{i=2}^{n/2}$ are independent of the event $\mathcal{E}$. Using the inequality $\min\big\{a, \sum_{i=2}^{n/2} b_i\big\} \ge \sum_{i=2}^{n/2} \min\{2a/n, b_i\}$, valid for non-negative scalars $a$ and $\{b_i\}_{i=2}^{n/2}$, together with the definition $B := \frac{4\sigma}{\sqrt{n}}$, we see that $T_3$ is lower bounded by a sum of truncated noise terms. Since $w'_i \sim N(0, \sigma^2/n)$, the probability $\mathbb{P}\big[2\sigma/\sqrt{n} \le w'_i \le 4\sigma/\sqrt{n}\big]$ is bounded away from zero independently of all problem parameters. Hence, there is a universal constant $c_2 > 0$ such that $T_3 \ge c_2 \min\big\{\frac{\sigma R}{\sqrt{n}}, \sigma^2\big\}$. Putting together the pieces completes the proof of the theorem.

Proof of Theorem 2
The proof of Theorem 2 is conceptually similar to the proof of Theorem 1, but differs in some key details. We begin with altered definitions of the quantities $\alpha$ and $B$, stated in terms of the radius $r := \min\{R, \sigma\}$.
Given our assumption $R \ge \sigma/\sqrt{n}$, note that we are guaranteed that $2B = \frac{\sigma}{2\sqrt{n}} \le \frac{r}{2}$. We then define the matrix $A \in \mathbb{R}^{2 \times 2}$ and the matrix $X \in \mathbb{R}^{n \times d}$ by equations (15a) and (15b).

Proof of part (a)
Let $\{\theta^t\}_{t=0}^\infty$ be the sequence of iterates generated by the update (13). We proceed by contradiction, assuming that the sequence does not terminate after finitely many steps. We begin with a lemma.
Lemma 2. If the sequence of iterates $\{\theta^t\}_{t=0}^\infty$ is not finitely convergent, then it is unbounded.
We defer the proof of this claim to the end of this section. Based on Lemma 2, it suffices to show that the sequence $\{\theta^t\}_{t=0}^\infty$ is in fact bounded. Writing $\theta = (\theta_{1:n}, \theta_{n+1:d})$ as a partition of the coordinates of $\theta$, we control the two sequences $\{\theta^t_{1:n}\}_{t=0}^\infty$ and $\{\theta^t_{n+1:d}\}_{t=0}^\infty$ separately.
Beginning with the former sequence, notice that the objective function can be written in the form $L(\theta; \lambda) = \frac{1}{n}\|y - X_{1:n}\theta_{1:n}\|_2^2 + \lambda\rho(\theta)$, where $X_{1:n}$ denotes the matrix formed by the first $n$ columns of $X$. The definitions (15a) and (15b) guarantee that the Gram matrix $X_{1:n}^T X_{1:n}$ is positive definite, which implies that the quadratic function $\theta_{1:n} \mapsto \|y - X_{1:n}\theta_{1:n}\|_2^2$ is strongly convex. Thus, if the sequence $\{\theta^t_{1:n}\}_{t=0}^\infty$ were unbounded, then the associated cost sequence $\{L(\theta^t; \lambda)\}_{t=0}^\infty$ would also be unbounded. But this is not possible, since $L(\theta^t; \lambda) \le L(\theta^0; \lambda)$ for all iterations $t = 1, 2, \ldots$. Consequently, we are guaranteed that the sequence $\{\theta^t_{1:n}\}_{t=0}^\infty$ must be bounded.
It remains to control the sequence $\{\theta^t_{n+1:d}\}_{t=0}^\infty$. We claim that for any $i \in \{n+1, \ldots, d\}$, the sequence $\{|\theta^t_i|\}_{t=0}^\infty$ is non-increasing, which implies the boundedness condition. Proceeding by contradiction, suppose that $|\theta^t_i| < |\theta^{t+1}_i|$ for some index $i \in \{n+1, \ldots, d\}$ and iteration number $t \ge 0$. Under this condition, define the vector $\widetilde{\theta}^{t+1}$ that agrees with $\theta^{t+1}$ in every coordinate except coordinate $i$, where we set $\widetilde{\theta}^{t+1}_i := \theta^t_i$. Since the $i$-th column of $X$ is zero and $\rho_i$ is a monotonically non-decreasing function of $|x|$, we are guaranteed that $L(\widetilde{\theta}^{t+1}; \lambda) \le L(\theta^{t+1}; \lambda)$, which implies that $\widetilde{\theta}^{t+1}$ is also a constrained minimum point over the ball $B_2(\eta; \theta^t)$. In addition, we have $\|\widetilde{\theta}^{t+1} - \theta^t\|_2 < \|\theta^{t+1} - \theta^t\|_2$, so that $\widetilde{\theta}^{t+1}$ is strictly closer to $\theta^t$. This contradicts the specification of the algorithm, in that it chooses the minimum closest to $\theta^t$.
Proof of Lemma 2: Finally, it remains to prove Lemma 2. We first claim that $\|\theta^s - \theta^t\|_2 \ge \eta$ for all pairs $s < t$. If not, we could find some pair $s < t$ such that $\|\theta^s - \theta^t\|_2 < \eta$. But since $t > s$, we are guaranteed that $L(\theta^t; \lambda) \le L(\theta^{s+1}; \lambda)$. Since $\theta^{s+1}$ is a global minimum over the ball $B_2(\eta; \theta^s)$ and $\|\theta^s - \theta^t\|_2 < \eta$, the point $\theta^t$ is also a global minimum over this ball, and this contradicts the definition of the algorithm (since it always chooses the constrained global minimum closest to the current iterate).
Using this property, we now show that $\{\theta^t\}_{t=0}^\infty$ is unbounded. For each $t = 0, 1, 2, \ldots$, introduce the shorthand $B_t := B_2(\eta/3; \theta^t)$ for the Euclidean ball of radius $\eta/3$ centered at $\theta^t$. Since $\|\theta^s - \theta^t\|_2 \ge \eta$ for all $s \ne t$, the balls $\{B_t\}_{t=0}^\infty$ are all disjoint, and hence there is a numerical constant $C > 0$ such that for each $T \ge 1$, we have
$$ \mathrm{vol}\Big(\bigcup_{t=0}^{T} B_t\Big) = \sum_{t=0}^{T} \mathrm{vol}(B_t) = (T+1)\, C\, \Big(\frac{\eta}{3}\Big)^d. $$
Since this volume diverges as $T \to \infty$, we conclude that the set $B := \bigcup_{t=0}^\infty B_t$ must be unbounded. By construction, any point in $B$ is within $\eta/3$ of some element of the sequence $\{\theta^t\}_{t=0}^\infty$, so this sequence must be unbounded, as claimed.

Proof of part (b)
We begin by introducing some shorthand notation, and then define the quantities $a_i$ and $w'_i$ via equations (18). As in the proof of Theorem 1, we assume (without loss of generality, re-indexing as needed) that $\gamma_i = \sup_{u \in (0, B]} \rho'_{2i-1}(u)$. Consider the regression vector $\theta^* := \begin{pmatrix} \frac{r}{2} & \frac{r}{2} & 0 & \cdots & 0 \end{pmatrix}$. Since the matrix $A$ appears in the diagonal blocks of $X$, the error of the algorithm's output $\widehat{\theta}$ decomposes blockwise as in equation (22). Given the random initialization $\theta^0$, we define the events
$$E_0 := \Big\{ \max\{\theta^0_1, \theta^0_2\} \le 0 \Big\} \quad \text{and} \quad E_1 := \Big\{ \lambda\gamma_1 \ge 2\sin^2(\alpha)\,r + \frac{2\|w_{1:2}\|_2}{\sqrt{n}} + 3B \Big\},$$
as well as the (random) index subsets $S_1$ and $S_2$. Given these definitions, the following lemma provides lower bounds on the decomposition (22) for the vector $\widehat{\theta}$ after convergence. Conditioned on the event $E_0$, for any index $i \in S_2$, either the event $E_0 \cap E_1$ holds, or we have a lower bound which means that $i \in S_1$ holds. Applying Lemma 3 yields the claimed lower bound, where the last equality follows from the choice of $\sigma$ and $r$. Since the event $E_0$ is independent of the events $\{i \in S_2\}$ for $i = 2, \ldots, n/2$, we obtain the stated bound, in which step (i) uses the lower bound $\mathbb{P}[E_0] = 1/4$ and the bound $\mathbb{P}[i \in S_2] \ge c$ from Lemma 3. Combined with the decomposition (22), the proof is complete.
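As a sanity check on the probability used in step (i): if the coordinates of the random initialization $\theta^0$ are drawn i.i.d. from a distribution that is symmetric about zero (an assumption consistent with, but not stated in, this excerpt), then

```latex
\mathbb{P}[E_0]
  = \mathbb{P}\big[\theta^0_1 \le 0,\; \theta^0_2 \le 0\big]
  = \mathbb{P}\big[\theta^0_1 \le 0\big] \cdot \mathbb{P}\big[\theta^0_2 \le 0\big]
  = \tfrac{1}{2} \cdot \tfrac{1}{2}
  = \tfrac{1}{4}.
```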

Discussion
In this paper, we have demonstrated a fundamental gap in sparse linear regression: the best prediction risk achieved by a class of $M$-estimators based on coordinate-wise separable regularizers is strictly larger than the classical minimax prediction risk, achieved for instance by minimization over the $\ell_0$-ball. This gap applies to a range of methods used in practice, including the Lasso in its ordinary and weighted forms, as well as estimators based on nonconvex penalties such as the MCP and SCAD penalties. Several open questions remain, and we discuss a few of them here. When the penalty function $\rho$ is convex, the M-estimator minimizing the function (7) can be understood as a particular convex relaxation of the $\ell_0$-based estimator (3). It would be interesting to consider other forms of convex relaxation for the $\ell_0$-based problem. For instance, Pilanci et al. [22] show how a broad class of $\ell_0$-regularized problems can be reformulated exactly as optimization problems involving convex functions of Boolean variables. This exact reformulation allows for the direct application of many standard hierarchies for Boolean polynomial programming, including the Lasserre hierarchy [18] as well as the Sherali-Adams hierarchy [24]. Other relaxations are possible, including those based on introducing auxiliary variables for the pairwise interactions (e.g., $\gamma_{ij} = \theta_i \theta_j$), and then incorporating these constraints as polynomials in the constraint set. We conjecture that for any fixed natural number $t$, if the $t$-th level Lasserre (or Sherali-Adams) relaxation is applied to such a reformulation, it still does not yield an estimator that achieves the fast rate (4). Since a $t$-th level relaxation involves $O(d^t)$ variables, this would imply that these hierarchies do not contain polynomial-time algorithms that achieve the classical minimax risk. Proving or disproving this conjecture remains an open problem.
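To make the idea of an exact Boolean reformulation concrete, here is a schematic version (our illustration, not the specific reformulation of Pilanci et al. [22]): the $\ell_0$-constrained least-squares problem can be rewritten with explicit Boolean support variables as

```latex
\min_{\|\theta\|_0 \le s} \frac{1}{n}\|y - X\theta\|_2^2
  \;=\;
\min_{\substack{u \in \{0,1\}^d \\ \mathbf{1}^T u \le s}}
  \;\; \min_{\substack{\theta \in \mathbb{R}^d \\ \theta_j (1 - u_j) = 0 \;\forall j}}
  \frac{1}{n}\|y - X\theta\|_2^2,
```

after which the inner minimization over $\theta$ can be carried out in closed form, leaving a function of the Boolean vector $u$ alone, to which hierarchy-based relaxations can then be applied.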
Finally, when the penalty function ρ is concave, concurrent work by Ge et al. [14] shows that finding the global minimum of the loss function (7) is strongly NP-hard. This result implies that no polynomial-time algorithm computes the global minimum unless NP = P. The result given here is complementary in nature: it shows that bad local minima exist, and that local descent methods converge to these bad local minima. It would be interesting to extend this algorithmic lower bound to a broader class of first-order methods. For instance, we suspect that any algorithm that relies on an oracle giving first-order information will inevitably converge to a bad local minimum for a broad class of random initializations.
In the proofs to follow, it is convenient to omit reference to the index $i$. In particular, viewing the index $i$ as fixed a priori, we let $u$ and $u^*$ be shorthand for the sub-vectors $(\widehat{\theta}_\lambda)_{2i-1,2i}$ and $\theta^*_{2i-1,2i}$, respectively. We introduce the normalized noise $\varepsilon := w_{2i-1,2i}/\sqrt{n}$. By our construction of the design matrix $X$ in terms of $A$, the vector $\widehat{\theta}_\lambda$ is a local minimizer of the objective function if and only if, for each $i \in [n/2]$, the corresponding sub-vector $u$ is a local minimum of a two-dimensional loss $\ell(u; \lambda)$. Hence, it suffices to find a local minimum of $\ell(u; \lambda)$ such that the bounds (23a) and (23b) hold.
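The decoupling claim can be spelled out. The following display is a sketch, assuming (consistently with the text, but with the normalization chosen for illustration) that $X$ is block-diagonal with $n/2$ copies of the $2 \times 2$ matrix $A$ on its diagonal:

```latex
L(\theta;\lambda)
  \;=\; \sum_{i=1}^{n/2}
  \Big[ \frac{1}{n}\big\| y_{2i-1:2i} - A\,\theta_{2i-1:2i} \big\|_2^2
        + \lambda\big( \rho_{2i-1}(\theta_{2i-1}) + \rho_{2i}(\theta_{2i}) \big) \Big],
```

so that perturbing the pair $\theta_{2i-1:2i}$ affects only the $i$-th summand. Consequently, $\widehat{\theta}_\lambda$ is a local minimizer of $L$ exactly when each sub-vector is a local minimizer of its own two-dimensional term.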
This statement holds for each $i \in [n/2]$. Hence, it suffices to study the update formula (28a).
If the claim is true, then we have
$$\|A\theta^t_{1:2} - A\theta^*_{1:2}\|_2^2 \;=\; \cos^2(\alpha)\,(u^t_1 - u^t_2)^2 + \sin^2(\alpha)\,(r - u^t_1 - u^t_2)^2 \;\ge\; \sin^2(\alpha)\,(r - u^t_1 - u^t_2)^2 \;\ge\; \sin^2(\alpha)\,(r - 2B)^2,$$
where the final inequality follows by substituting the definition of $\alpha$ and using the fact that $2B \le r/2$. Thus, it suffices to prove the claim (29). We prove the claim (29) by induction on the iteration number $t$. It is clearly true for $t = 0$. Assuming that the claim is true for a given integer $t \ge 0$, we establish it for $t + 1$. Our strategy is as follows: suppose that the vector $u^{t+1}$ minimizes the function $\ell(u; \lambda)$ inside the ball $\{u : \|u - u^t\|_2 \le \beta\}$. Then, the scalar $u^{t+1}$