Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization

Least squares fitting is in general not useful for high-dimensional linear models, in which the number of predictors is of the same or even larger order of magnitude than the number of samples. Theory developed in recent years has coined a paradigm according to which sparsity-promoting regularization is regarded as a necessity in such a setting. Deviating from this paradigm, we show that non-negativity constraints on the regression coefficients may be as effective as explicit regularization if the design matrix has additional properties, which are met in several applications of non-negative least squares (NNLS). We show that for these designs, the performance of NNLS with regard to prediction and estimation is comparable to that of the lasso. We argue further that in specific cases, NNLS may have a better $\ell_{\infty}$-rate in estimation and hence also advantages with respect to support recovery when combined with thresholding. From a practical point of view, NNLS does not depend on a regularization parameter and is hence easier to use.


Introduction
Consider the linear regression model
$$y = X\beta^* + \varepsilon, \tag{1.1}$$
where $y$ is a vector of observations, $X \in \mathbb{R}^{n \times p}$ a design matrix, $\varepsilon$ a vector of noise and $\beta^*$ a vector of coefficients to be estimated. Throughout this paper, we are concerned with a high-dimensional setting in which the number of unknowns $p$ is at least of the same order of magnitude as the number of observations $n$. If it is known a priori that $\beta^*$ is non-negative, it is natural to consider the non-negative least squares (NNLS) problem
$$\min_{\beta \succeq 0} \frac{1}{n}\|y - X\beta\|_2^2, \tag{1.2}$$
which is a convex optimization problem that can be solved efficiently [27]. A minimizer $\hat\beta$ of (1.2) will be referred to as an NNLS estimator. Solid theoretical support for the empirical success of NNLS from a statistical perspective scarcely appears in the literature. An early reference is [16], dating back already two decades. The authors show that, depending on $X$ and the sparsity of $\beta^*$, NNLS may have a 'super-resolution' property that permits reliable estimation of $\beta^*$. Rather recently, sparse recovery of non-negative signals in a noiseless setting ($\varepsilon = 0$) has been studied in [7,19,55,56]. One important finding of this body of work is that non-negativity constraints alone may suffice for sparse recovery, without the need to employ sparsity-promoting $\ell_1$-regularization as is usual. On the other hand, it remains unclear whether similar results continue to hold in a more realistic noisy setup. At first glance, the following considerations suggest a negative answer. The fact that NNLS, apart from the sign constraints, consists only of a fitting term fosters the intuition that NNLS is prone to overfitting, specifically in a high-dimensional setting. Usefulness of NNLS in such a setting appears to be in conflict with the well-established paradigm that a regularizer is necessary to prevent over-adaptation to noise and to enforce desired structural properties of the solution, like sparsity of the vector of coefficients. As one of the main contributions of the present paper, we characterize the self-regularizing property which NNLS exhibits for a certain class of designs that turn out to be tailored to the non-negativity constraint, thereby disentangling the apparent conflict above and improving the understanding of the empirical success of NNLS. More precisely, we show that for these designs, NNLS is rather closely related to the non-negative lasso problem
$$\min_{\beta \succeq 0} \frac{1}{n}\|y - X\beta\|_2^2 + \lambda 1^\top \beta, \quad \lambda > 0, \tag{1.3}$$
the sign-constrained counterpart of the popular lasso problem [47], which is given by
$$\min_{\beta} \frac{1}{n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \quad \lambda > 0, \tag{1.4}$$
where $\lambda > 0$ is a tuning parameter. Elaborating on the relation between NNLS and the non-negative lasso, we establish that for the aforementioned class of designs, NNLS achieves comparable performance with regard to prediction, estimation of $\beta^*$, and support recovery when combined with subsequent thresholding. We refer to the monograph [8], the survey [52] and the retrospective [48] for an overview of the appealing performance guarantees established for the lasso in the last decade, which have, along with favourable computational properties, contributed to its enormous popularity in data analysis. In this paper, we argue that, in view of both theoretical considerations and empirical evidence, improvements of NNLS over the (non-negative) lasso are possible, even though they are limited to a comparatively small set of designs. Differences in performance arise from the bias of the $\ell_1$-regularizer in (1.3) and (1.4), which is responsible for an in general sub-optimal rate for estimation of $\beta^*$ in the $\ell_\infty$-norm [60].
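Since (1.2) is a plain convex quadratic program, it can be handed to any off-the-shelf solver. The following minimal sketch (our illustration, not code from the paper; data and dimensions are arbitrary) uses scipy's implementation of the Lawson-Hanson active set method [29] and applies the column normalization $\|X_j\|_2^2 = n$ used throughout:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5

# Simulated non-negative design; columns scaled so that ||X_j||_2^2 = n.
X = rng.uniform(size=(n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

beta_star = np.zeros(p)
beta_star[:s] = 1.0                        # sparse non-negative target
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_hat, _ = nnls(X, y)                   # min_{beta >= 0} ||y - X beta||_2
print("active set size:", np.sum(beta_hat > 0))
```

Note that scipy's solver minimizes $\|y - X\beta\|_2$ rather than the $1/n$-scaled objective; the two problems have the same minimizers.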
Unlike for NNLS, a tuning parameter needs to be specified for the (non-negative) lasso, as is necessary for all regularization-based methods. Selection of the tuning parameter by means of cross-validation increases the computational burden and may be error-prone if done by a grid search, since the grid could have an unfavourable range or too few grid points. Theoretical results on how to set the regularization parameter are often available, but require a sufficient degree of acquaintance with the existing literature and possibly also knowledge of quantities such as the noise level. The latter issue had been a long-standing problem of the lasso until recently: in [4] and [46], two related variants of the lasso are proposed that have similar theoretical guarantees, while the tuning parameter can be set without knowledge of the noise level. NNLS, by contrast, is directly applicable, since it is free of tuning parameters.
Outline and contributions of the paper. The paper significantly extends a previous conference publication [43], which contains the first systematic analysis of NNLS in a high-dimensional setting. Recently, Meinshausen [34] has independently studied the performance of NNLS in such a setting; his work is most closely related to the results we present in Section 3. The paper is organized as follows. In Section 2 we work out the self-regularizing property NNLS may have in conjunction with certain design matrices. Equipped with that property, a bound on the prediction error is stated that resembles a corresponding 'slow rate' bound available for the lasso. Developing the connection to the lasso further, we use techniques pioneered in Bickel et al. [5] to prove bounds on the estimation error $\|\hat\beta - \beta^*\|_q$ in $\ell_q$-norm, $q \in [1,2]$, and an improved bound on the prediction error of NNLS in Section 3. In Section 4, we provide bounds on the sup-norm error $\|\hat\beta - \beta^*\|_\infty$ of NNLS. Hard thresholding of $\hat\beta$ is proposed for sparse recovery, and a data-driven procedure for selecting the threshold is devised. Section 4 also contains a discussion of advantages and disadvantages of NNLS relative to the non-negative lasso. In Section 5, we have a closer look at several designs which satisfy the conditions required throughout the paper. In Section 6, we discuss the empirical performance of NNLS in deconvolution and sparse recovery in comparison to standard methods, in particular the non-negative lasso. The appendix contains most of the proofs, apart from those that have been placed into a supplementary file [45] for space reasons.
Notation. We denote the usual $\ell_q$-norm by $\|\cdot\|_q$. The cardinality of a set is denoted by $|\cdot|$. Let $J, K \subseteq \{1, \ldots, m\}$ be index sets. For a matrix $A \in \mathbb{R}^{n \times m}$, $A_J$ denotes the matrix one obtains by extracting the columns corresponding to $J$. For $j = 1, \ldots, m$, $A_j$ denotes the $j$-th column of $A$. The matrix $A_{JK}$ is the sub-matrix of $A$ obtained by extracting the rows in $J$ and the columns in $K$. Likewise, for $v \in \mathbb{R}^m$, $v_J$ is the sub-vector corresponding to $J$. Identity matrices and vectors of ones of varying dimensions are denoted by $I$ and $1$, respectively. The symbols $\succeq, \preceq$ and $\succ, \prec$ denote componentwise inequalities and componentwise strict inequalities, respectively. In addition, for a matrix $A$, $A \succeq a$ means that all entries of $A$ are at least equal to the scalar $a$. The non-negative orthant $\{x \in \mathbb{R}^m : x \succeq 0\}$ is denoted by $\mathbb{R}^m_+$. The standard simplex in $\mathbb{R}^m$, that is the set $\{x \in \mathbb{R}^m_+ : \sum_{j=1}^m x_j = 1\}$, is denoted by $T^{m-1}$. Lower- and uppercase $c$'s like $c, c', c_1$ and $C, C', C_1$ etc. denote positive constants not depending on the sample size $n$ whose values may differ from line to line. In general, the positive integers $p = p_n$ and $s = s_n$ depend on $n$. Landau's symbols are denoted by $o(\cdot), O(\cdot), \Theta(\cdot), \Omega(\cdot)$. Asymptotics are understood w.r.t. a triangular array of observations $\{(X_n, y_n),\ X_n \in \mathbb{R}^{n \times p_n}\}$, $n = 1, 2, \ldots$.

Normalization.
If not stated otherwise, the design matrix $X$ is considered deterministic, with its columns normalized such that $\|X_j\|_2^2 = n$, $j = 1, \ldots, p$.

General linear position. We say that the columns of $X$ are in general linear position in $\mathbb{R}^n$ if the following condition (GLP) holds:
$$\text{(GLP)}: \quad \forall J \subset \{1, \ldots, p\},\ |J| = \min\{n, p\},\ \forall \lambda \in \mathbb{R}^{|J|}: \quad X_J \lambda = 0 \implies \lambda = 0, \tag{1.5}$$
where for $J \subseteq \{1, \ldots, p\}$, $X_J$ denotes the submatrix of $X$ consisting of the columns corresponding to $J$.

Prediction error: A bound for 'self-regularizing' designs
The main result of this section is a bound on the excess error of NNLS that resembles the so-called slow rate bound of the lasso [3,8,24]. In contrast to the latter, the corresponding bound for NNLS only holds for a certain class of designs. We first show that extra conditions on the design are in fact necessary to obtain such a bound. We then introduce a condition on $X$, referred to as the 'self-regularizing property', which is sufficient to establish a slow rate bound for NNLS. The term 'self-regularization' is motivated by a resulting decomposition of the least squares objective into a modified fitting term and a quadratic term that plays a role similar to an $\ell_1$-penalty on the coefficients. This finding provides an intuitive explanation for the fact that NNLS may achieve performance similar to the lasso, even though no explicit regularization is employed.

A minimum requirement on the design for non-negativity being an actual constraint
In general, the non-negativity constraints in (1.2) may not be meaningful at all, given the fact that any least squares problem can be reformulated as a non-negative least squares problem with the augmented design matrix $[X\ {-X}]$. More generally, NNLS can be as ill-posed as least squares if the following condition (H) does not hold:
$$\text{(H)}: \quad \exists w \in \mathbb{R}^n \text{ such that } X^\top w \succ 0.$$
Condition (H) requires the columns of $X$ to be contained in the interior of a halfspace whose boundary contains the origin. If (H) fails to hold, then $0 \in \operatorname{conv}\{X_j\}_{j=1}^p$, so that there are infinitely many minimizers of the NNLS problem (1.2). If additionally $p > n$ and the columns of $X$ are in general linear position (condition (GLP) in (1.5)), $0$ must be in the interior of $\operatorname{conv}\{X_j\}_{j=1}^p$. It then follows that $C = \mathbb{R}^n$, where $C = \operatorname{cone}\{X_j\}_{j=1}^p$ denotes the polyhedral cone generated by the columns of $X$. Consequently, the non-negativity constraints become vacuous and NNLS yields a perfect fit for any observation vector $y$. In light of this, (H) constitutes a necessary condition for a possible improvement of NNLS over ordinary least squares.

Overfitting of NNLS for orthonormal design
Since NNLS is a pure fitting approach, over-adaptation to noise is a natural concern. Resistance to overfitting can be quantified in terms of $\frac{1}{n}\|X\hat\beta\|_2^2$ when $\beta^* = 0$ in (1.1). It turns out that condition (H) is not sufficient to ensure that $\frac{1}{n}\|X\hat\beta\|_2^2 = o(1)$ with high probability, as can be seen from studying orthonormal design, i.e. $X^\top X = nI$. Let $y = \varepsilon$ be a standard Gaussian random vector. The NNLS estimator has the closed-form expression $\hat\beta_j = \max\{X_j^\top \varepsilon, 0\}/n$, $j = 1, \ldots, p$, so that the distribution of each component of $\hat\beta$ is given by a mixture of a point mass of $0.5$ at zero and a half-normal distribution. We conclude that $\frac{1}{n}\|X\hat\beta\|_2^2 = \|\hat\beta\|_2^2$ is of the order $\Omega(p/n)$ with high probability. Note that orthonormality of $X$ is a much stronger requirement than the obviously necessary halfspace condition (H). In fact, as rendered more precisely in Section 5, orthonormal design turns out to be at the edge of the set of designs still leading to overfitting.
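This calculation is easy to reproduce numerically; a short sketch (ours, assuming the pure-noise model $y = \varepsilon$ and $n = p$ for simplicity) confirms that the fit does not vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
n = p = 500
# Orthonormal design in the paper's scaling: X^T X = n I.
X = np.linalg.qr(rng.standard_normal((n, p)))[0] * np.sqrt(n)
eps = rng.standard_normal(n)                       # y = eps, beta* = 0

beta_hat = np.maximum(X.T @ eps, 0.0) / n          # closed-form NNLS solution
fit = np.sum((X @ beta_hat) ** 2) / n              # equals ||beta_hat||_2^2 here
print(fit, p / (2 * n))                            # both close to 0.5: no vanishing fit
```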

A sufficient condition on the design preventing NNLS from overfitting
We now present a sufficient condition $X$ has to satisfy so that overfitting is prevented. The condition arises as a direct strengthening of (H). In order to quantify the separation required in (H), we define
$$\tau_0 = \max_{\|w\|_2 \le 1} \min_{1 \le j \le p} \frac{\langle X_j, w \rangle}{\sqrt{n}}. \tag{2.2}$$
Note that $\tau_0 > 0$ if and only if (H) is fulfilled. Also note that with $\|X_j\|_2^2 = n\ \forall j$, it holds that $\tau_0 \le 1$. Introducing the Gram matrix $\Sigma = \frac{1}{n}X^\top X$, we have by convex duality that
$$\tau_0^2 = \min_{\lambda \in T^{p-1}} \lambda^\top \Sigma \lambda, \tag{2.3}$$
i.e. in geometrical terms, $\tau_0$ equals the distance of the convex hull of the columns of $X$ (scaled by $1/\sqrt{n}$) to the origin. Using terminology from support vector machine classification (e.g. [41], Sec. 7.2), $\tau_0$ can be interpreted as the margin of a maximum-margin hyperplane with normal vector $w$ separating the columns of $X$ from the origin. As argued below, in case $\tau_0$ scales as a constant, overfitting is curbed. This is e.g. not fulfilled for orthonormal design, where $\tau_0 = 1/\sqrt{p}$ (cf. Section 5).
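Numerically, $\tau_0^2$ is the value of the quadratic program over the simplex in (2.3). A sketch using a general-purpose SQP solver (an interior point method as in [6] would do equally well; the function name is ours):

```python
import numpy as np
from scipy.optimize import minimize

def tau0_squared(X):
    """min over the simplex of lambda' Sigma lambda, with Sigma = X'X / n."""
    n, p = X.shape
    Sigma = X.T @ X / n
    res = minimize(
        lambda lam: lam @ Sigma @ lam,
        np.full(p, 1.0 / p),                        # start at the barycenter
        jac=lambda lam: 2 * Sigma @ lam,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * p,
        constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
    )
    return res.fun

# Orthonormal design: tau_0^2 = 1/p, vanishing with p.
n = p = 50
X = np.linalg.qr(np.random.default_rng(2).standard_normal((n, p)))[0] * np.sqrt(n)
print(tau0_squared(X), 1.0 / p)
```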

Condition 1.
A design $X$ is said to have a self-regularizing property if there exists a constant $\tau > 0$ so that, with $\tau_0$ as defined in (2.2), it holds that $\tau_0 \ge \tau$.
The term 'self-regularization' expresses the fact that the design itself automatically generates a regularizing term, as emphasized in the next proposition and the comments that follow. We point out that Proposition 1 is a qualitative statement preliminary to the main result of the section (Theorem 1) and mainly serves an illustrative purpose.

Proposition 1. Consider the linear model (1.1) with $\beta^* = 0$ and $y = \varepsilon$ having entries that are independent random variables with zero mean and finite variance. Suppose that $X$ satisfies Condition 1. Then the pure noise fitting problem $\min_{\beta \succeq 0} \frac{1}{n}\|\varepsilon - X\beta\|_2^2$ can be decomposed into a fitting term with a modified design matrix $\widetilde{X}$, a quadratic term lower bounded by $\tau^2(1^\top\beta)^2$, and a stochastic term of lower order (2.4); if, in addition, $\|\frac{1}{n}X^\top\varepsilon\|_\infty = o_P(1)$, then any minimizer $\hat\beta$ of (2.4) obeys $\frac{1}{n}\|X\hat\beta\|_2^2 = o_P(1)$.

In Proposition 1, the pure noise fitting problem is thus decomposed into a fitting term with modified design matrix $\widetilde{X}$, a second term that can be interpreted as a squared non-negative lasso penalty $\tau^2(1^\top\beta)^2$ (cf. (1.3)) and an additional stochastic term of lower order. As made precise in the proof, the lower bound on $\tau$ implies that the $\ell_1$-norm of any minimizer is upper bounded by a constant. Prevention of overfitting is then an immediate consequence under the further assumption that the term $\|\frac{1}{n}X^\top\varepsilon\|_\infty$ tends to zero in probability. This holds under rather mild additional conditions on $X$ [33] or more stringent conditions on the tails of the noise distribution. As a last comment, let us make the connection of the r.h.s. of (2.4) to a non-negative lasso problem more explicit. Due to the correspondence of the level sets of the mappings $\beta \mapsto 1^\top\beta$ and $\beta \mapsto (1^\top\beta)^2$ on $\mathbb{R}^p_+$, minimizing the r.h.s. of (2.4) is equivalent to solving a non-negative lasso problem of the form (1.3) with design $\widetilde{X}$ and regularization parameter $\gamma$, where $\gamma$ is a non-negative, monotonically increasing function of $\tau$ (2.5). Proposition 1 in conjunction with (2.5) provides a high-level understanding of what will be shown in the sequel, namely that NNLS may inherit desirable properties of the (non-negative) lasso with regard to prediction, estimation and sparsity of the solution.

Slow rate bound
Condition 1 gives rise to the following general bound on the $\ell_2$-prediction error of NNLS. Note that in Theorem 1 below, it is not assumed that the linear model is correctly specified. Instead, we only assume that there is a fixed target $f = (f_1, \ldots, f_n)^\top$ to be approximated by a non-negative combination of the columns of $X$.

Theorem 1.
Let $y = f + \varepsilon$, where $f \in \mathbb{R}^n$ is fixed and $\varepsilon$ has i.i.d. zero-mean sub-Gaussian entries with parameter $\sigma$. Define $E^* = \min_{\beta \succeq 0} \frac{1}{n}\|f - X\beta\|_2^2$ and let $\beta^*$ denote a corresponding minimizer. Suppose that $X$ satisfies Condition 1. Then, for any minimizer $\hat\beta$ of the NNLS problem (1.2) and any $M \ge 0$, it holds with probability no less than $1 - 2p^{-M^2}$ that the excess error $\frac{1}{n}\|f - X\hat\beta\|_2^2 - E^*$ is bounded by a term of order $O(\|\beta^*\|_1 \sqrt{\log(p)/n})$, with a constant depending on $\tau$, $\sigma$ and $M$ (2.6).

Comparison with the slow rate bound of the lasso. Theorem 1 bounds the excess error by a term of order $O(\|\beta^*\|_1 \sqrt{\log(p)/n})$, which implies that NNLS can be consistent in a regime in which the number of predictors $p$ is nearly exponential in the number of observations $n$. That is, NNLS constitutes a 'persistent procedure' in the spirit of Greenshtein and Ritov [24], who coined the notion of 'persistence' as distinguished from classical consistency with a fixed number of predictors. The excess error bound of Theorem 1 is of the same order of magnitude as the corresponding bound for the lasso [3,8,24] that is typically referred to as the slow rate bound. Since the bound (2.6) depends on $\tau$, it is recommended to solve the quadratic program in (2.3) before applying NNLS, which is roughly of the same computational cost. Unlike Theorem 1, the slow rate bound of the lasso does not require any conditions on the design and is more favourable than (2.6) regarding the constants. In [25,53], improvements of the slow rate bound are derived. On the other hand, the results for the lasso require the regularization parameter to be chosen appropriately.
Remark 1. NNLS has been introduced as a tool for 'non-negative data'. In this context, the assumption of zero-mean noise in Theorem 1 is questionable. In case that the entries of ε have a positive mean, one can decompose ε into a constant term, which can be absorbed into the linear model, and a second term which has mean zero, so that Theorem 1 continues to be applicable.
Fast rate bound for prediction and bounds on the $\ell_q$-error for estimation, $1 \le q \le 2$

Within this section, we further elaborate on the similarity in performance of $\ell_1$-regularization and NNLS for designs with a self-regularizing property. We show that the latter admits a reduction to the scheme pioneered in [5] to establish near-optimal performance guarantees for the lasso and the related Dantzig selector [13] with respect to estimation of $\beta^*$ and prediction under a sparsity scenario. Similar results are shown e.g. in [10,12,13,36,51,58,59], and we shall state results of that flavour for NNLS below. Throughout the rest of the paper, the data-generating model (1.1) is considered for a sparse target $\beta^*$ with support $S = \{j : \beta^*_j > 0\}$, $1 \le |S| = s < n$.
Reduction to the scheme used for the lasso. As stated in the next lemma, if the design satisfies Condition 1, the NNLS estimator $\hat\beta$ has, with high probability, the property that $\delta = \hat\beta - \beta^*$ has small $\ell_1$-norm, or that $\delta$ is contained in the convex cone $\{\delta \in \mathbb{R}^p : \|\delta_{S^c}\|_1 \le c_0 \|\delta_S\|_1\}$, where the constant $c_0$ depends on $\tau$. The latter property is shared by the lasso and the Dantzig selector with different values of the constant $c_0$ [5,13].

The rightmost event is the most favourable, since it immediately yields the assertion of Theorem 2 below. On the other hand, under the leftmost event, one is in a position to carry over techniques used for analyzing the lasso. When combined with the following Condition 2, near-optimal rates with regard to estimation and prediction can be obtained.
Condition 2. Define $\mathcal{J}(s) = \{J \subseteq \{1, \ldots, p\} : 1 \le |J| \le s\}$ and, for $J \in \mathcal{J}(s)$ and $\alpha \ge 1$, the set $\mathcal{R}(\alpha, J) = \{\delta \in \mathbb{R}^p : \|\delta_{J^c}\|_1 \le \alpha \|\delta_J\|_1\}$. We say that the design satisfies the $(\alpha, s)$-restricted eigenvalue condition if there exists a constant $\phi(\alpha, s) > 0$ so that
$$\min_{J \in \mathcal{J}(s)}\ \min_{\delta \in \mathcal{R}(\alpha, J) \setminus \{0\}} \frac{\delta^\top \Sigma \delta}{\|\delta_J\|_2^2} \ge \phi(\alpha, s).$$
Condition 2 is introduced in [5]. It is weaker than several other conditions such as those in [13,36] employed in the same context; for a comprehensive comparison of these and related conditions, we refer to [51]. Moreover, we note that Condition 2 is satisfied with overwhelming probability if $X$ belongs to a rather broad class of random sub-Gaussian matrices with independent rows as long as $n$ scales as $\Omega(s \log p)$ [39].
Using Lemma 1, the next statement follows along the lines of the analysis in [5].
Theorem 2. In addition to the conditions of Lemma 1, assume further that $X$ satisfies the $(3/\tau^2, s)$-restricted eigenvalue condition. It then holds for any $q \in [1, 2]$ and any $M \ge 0$ that
$$\|\hat\beta - \beta^*\|_q = O\big(s^{1/q} \sqrt{\log(p)/n}\big) \quad \text{and} \quad \frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 = O\big(s \log(p)/n\big)$$
with probability no less than $1 - 2p^{-M^2}$, where the constants depend on $\tau$, $\sigma$, $M$ and $\phi(3/\tau^2, s)$.
Theorem 2 parallels Theorem 7.2 in [5], establishing that the lasso adapts to the underlying sparsity, since its performance attains (apart from factors logarithmic in $p$) what could be achieved if the support of $\beta^*$ were known in advance. The rates of Theorem 2 for NNLS are the same as those for the lasso, modulo (less favourable) multiplicative constants.
The required condition on the design is a combination of the self-regularizing property and the restricted eigenvalue condition. At first glance, these two conditions might appear to contradict each other, since the first one is not satisfied if the off-diagonal entries of $\Sigma$ are too small, while for $\alpha \ge 1$ we have $\phi(\alpha, s) \le 2(1 - \max_{j \ne k} \langle X_j, X_k \rangle / n)$. We resolve this apparent contradiction in Section 5 by providing designs satisfying both conditions simultaneously. The use of Condition 2 in place of more restrictive conditions like the restricted isometry property (RIP, [13]) used earlier in the literature turns out to be crucial here, since these conditions impose much stronger constraints on the magnitude of the off-diagonal entries of $\Sigma$, as discussed in detail in [37].
A result in the same spirit as Theorem 2 is shown in the recent paper [34] by Meinshausen, who has independently studied the performance of NNLS for high-dimensional linear models. That paper provides an $\ell_1$-bound for estimation of $\beta^*$ and a fast rate bound for prediction with better constants than those in the above theorem, even though the required conditions are partly more restrictive. The ingredients leading to those bounds are the self-regularizing property, which is termed the 'positive eigenvalue condition' there, and the 'compatibility condition' [51,52], which is used in place of Condition 2. We prefer the latter here, because the 'compatibility condition' is not sufficient to establish $\ell_q$-bounds for estimation for $q > 1$. As distinguished from our Theorem 2, a lower bound on the minimum non-zero coefficient of $\beta^*$ is additionally required in the corresponding result in [34].

Estimation error with respect to the $\ell_\infty$-norm and support recovery by thresholding
In the present section, we derive bounds on the $\ell_\infty$-estimation error of NNLS directly, without resorting to the techniques and assumptions used in the analysis of the lasso. Instead, we build on the geometry underlying the analysis of NNLS for sparse recovery in the noiseless case [17-19,55]. In light of the stated bounds, we subsequently study the performance of a thresholded NNLS estimator with regard to support recovery.

Main components of the analysis
In the sequel, we provide the main steps towards the results stated in this section. Basic to our approach is a decomposition of the NNLS problem into two sub-problems corresponding to the support $S$ and the off-support $S^c$. For this purpose, we need to introduce the following quantities. For a given support $S$, let $\Pi_S$ and $\Pi_S^\perp$ denote the projections on the subspace spanned by $\{X_j\}_{j \in S}$ and its orthogonal complement, respectively. In the context of the linear model (1.1), we then set
$$Z = \Pi_S^\perp X_{S^c}. \tag{4.1}$$
These quantities appear in the following key lemma that contains the aforementioned decomposition of the NNLS problem.
Lemma 2 is used in the proof of Theorem 3 below in the following way. We first study the off-support problem (P1) separately, establishing an upper bound on the $\ell_1$-norm of its minimizer $\hat\beta^{(P1)}$ in dependence on the separating hyperplane constant introduced in the next paragraph. Substituting that bound into (P2), we conclude an upper bound on $\|\beta^*_S - \hat\beta^{(P2)}\|_\infty$ and, in turn, by the lemma, on $\|\hat\beta - \beta^*\|_\infty$. In this second step, we exploit the fact that if $\hat\beta^{(P2)} \succ 0$, it equals the corresponding unconstrained least squares estimator.
Separating hyperplane constant. To establish an upper bound on $\|\hat\beta^{(P1)}\|_1$, we require a positive lower bound on the following quantity, to which we refer as the separating hyperplane constant; it is nothing else than the constant (2.2) introduced in the context of self-regularizing designs in Section 2, applied to the matrix $Z$ in (4.1). The term 'separating hyperplane constant' follows from the geometric interpretation as the margin of a hyperplane that contains the origin and separates $\{X_j\}_{j \in S}$ from $\{X_j\}_{j \in S^c}$. Accordingly, for given $S$, we define
$$\tau(S) = \max_{\|w\|_2 \le 1,\ w \perp \{X_j\}_{j \in S}}\ \min_{j \in S^c} \frac{\langle X_j, w \rangle}{\sqrt{n}}, \tag{4.2}$$
so that by convex duality
$$\tau^2(S) = \min_{\lambda \in T^{p-s-1}} \lambda^\top \frac{1}{n} Z^\top Z\, \lambda. \tag{4.3}$$
The last line highlights the connection to (2.3) in Section 2. Expanding $\frac{1}{n}Z^\top Z$ under the assumption that the submatrix $\Sigma_{SS}$ is invertible, $\tau^2(S)$ can also be written as
$$\tau^2(S) = \min_{\lambda \in T^{p-s-1}} \lambda^\top \big(\Sigma_{S^cS^c} - \Sigma_{S^cS} \Sigma_{SS}^{-1} \Sigma_{SS^c}\big) \lambda. \tag{4.4}$$
It is shown in [44] that $\tau(S) > 0$ is a necessary and sufficient condition for recovery of $\beta^*$ by NNLS in the absence of noise ($\varepsilon = 0$). Thus, the appearance of $\tau(S)$ in the present context is natural.
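The separating hyperplane constant can be computed in the same way as $\tau_0^2$, after projecting the off-support columns onto the orthogonal complement of $\mathrm{span}\{X_j\}_{j \in S}$; a sketch (helper names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def simplex_qp(A):
    """min over the simplex of lambda' (A'A / n) lambda."""
    n, m = A.shape
    G = A.T @ A / n
    res = minimize(lambda l: l @ G @ l, np.full(m, 1.0 / m),
                   jac=lambda l: 2 * G @ l, method="SLSQP",
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda l: l.sum() - 1.0}])
    return res.fun

def tau_squared_S(X, S):
    """Separating hyperplane constant (4.3): tau_0^2 of Z = Pi_S^perp X_{S^c}."""
    Q, _ = np.linalg.qr(X[:, S])            # orthonormal basis of span{X_j, j in S}
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    Z = X[:, Sc] - Q @ (Q.T @ X[:, Sc])     # project onto the orthogonal complement
    return simplex_qp(Z)
```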

Bounds on the ℓ ∞ -error
The upper bound of the next theorem additionally depends on the quantities
$$\beta_{\min}(S) = \min_{j \in S} \beta^*_j \quad \text{and} \quad K(S) = \|\Sigma_{SS}^{-1}\|_\infty, \tag{4.5}$$
where $\|\cdot\|_\infty$ denotes the matrix norm induced by the $\ell_\infty$-norm; these quantities also appear in the upper bound on the $\ell_\infty$-error of the lasso [54].

Theorem 3. Assume that $y = X\beta^* + \varepsilon$, where $\beta^* \succeq 0$ and $\varepsilon$ has i.i.d. zero-mean sub-Gaussian entries with parameter $\sigma$.
If $\beta_{\min}(S) > b$, then the NNLS estimator $\hat\beta$ has the following properties with probability no less than $1 - 4p^{-M^2}$: the off-support coefficients obey an $\ell_1$-bound of the order of the effective noise level $\sigma\sqrt{\log(p)/n}$ divided by $\tau^2(S)$, and $\|\hat\beta_S - \beta^*_S\|_\infty \le b$, where $b$ is of the order of $K(S)/\tau^2(S)$ times the effective noise level.

Discussion. Theorem 3 can be summarized as follows. Given a sufficient amount of separation between $\{X_j\}_{j \in S}$ and $\{X_j\}_{j \in S^c}$ as quantified by $\tau^2(S)$, the $\ell_1$-norm of the off-support coefficients is upper bounded by the effective noise level proportional to $\sqrt{\log(p)/n}$ divided by $\tau^2(S)$, provided that the entries of $\beta^*_S$ are all large enough. The upper bound $b$ depends in particular on the ratio $K(S)/\tau^2(S)$. In Section 5, we discuss a rather special design for which $\tau^2(S) = \Omega(1)$; for a broader class of designs that is shown to satisfy the conditions of Theorem 2 as well, $\tau^2(S)$ roughly scales as $\Omega(s^{-1})$. Moreover, we have $K(S) \le \sqrt{s}/\phi_{\min}(S)$, where $\phi_{\min}(S)$ denotes the smallest eigenvalue of $\Sigma_{SS}$, and the scaling $K(S) = \Theta(\sqrt{s})$ can be attained. In total, the $\ell_\infty$-bound can hence be as large as $O(s^{3/2}\sqrt{\log(p)/n})$ even if $\tau^2(S)$ scales favourably, a bound that may already be implied by the $\ell_2$-bound in Theorem 2. On the positive side, Theorem 3 may yield a satisfactory result for $s$ constant or growing only slowly with $n$, without requiring the restricted eigenvalue condition of Theorem 2.
Towards a possible improvement of Theorem 3. The potentially suboptimal dependence on the sparsity level $s$ in the bounds of Theorem 3 is too pessimistic relative to the empirical behaviour of NNLS as discussed in Section 6. The performance reported there can be better understood in light of Theorem 4 below and the comments that follow. Our reasoning is based on the fact that any NNLS solution can be obtained from an ordinary least squares solution restricted to the variables in the active set $F = \{j : \hat\beta_j > 0\}$, cf. Lemma 3 in Appendix E. For the subsequent discussion to be meaningful, it is necessary that the NNLS solution and thus its active set are unique; a sufficient condition for this is established along the way.
Conditional on that event, if furthermore $\beta_{\min}(S) > b$ as defined in (4.6), then $\|\hat\beta - \beta^*\|_\infty$ obeys a bound proportional to $\{\phi_{\min}(F)\}^{-1/2}\,\sigma\sqrt{\log(p)/n}$.

We first note that for $s/n$ bounded away from 1, condition (4.7) is fulfilled if $n$ scales as $\Omega(\log(p)/\tau^2(S))$. Second, the condition on $\beta_{\min}(S)$ is the same as in the previous Theorem 3, so that the scope of application of the above theorem remains limited to designs with an appropriate lower bound on $\tau^2(S)$. At the same time, Theorem 4 may yield a significantly improved bound on $\|\hat\beta - \beta^*\|_\infty$ compared to Theorem 3 if $\{\phi_{\min}(F)\}^{-1/2}$, the inverse of the smallest singular value of $X_F/\sqrt{n} \in \mathbb{R}^{n \times |F|}$, scales more favourably than $K(S)/\tau^2(S)$, noting that as long as $S \subseteq F$, $\{\phi_{\min}(S)\}^{-1/2} \le \{\phi_{\min}(F)\}^{-1/2}$. In the first place, control of $\{\phi_{\min}(F)\}^{-1/2}$ requires control over the cardinality of the set $F$. In a regime with $|F|$ scaling as a constant multiple of $s$ with $s = \alpha n$, $0 \le \alpha \ll 1$, it is not restrictive to assume that $\{\phi_{\min}(F)\}^{1/2}$, as the smallest singular value of a tall submatrix of $X$, is lower bounded by a positive constant, as has been done in the literature on $\ell_1$-regularization [13,36,58]. That assumption is strongly supported by results in random matrix theory [32,38]. In Section 5, the hypothesis of having $|F| \ll n$ is discussed in more detail for the class of so-called equi-correlation-like designs. For equi-correlated design, it is even possible to derive the distribution of $|F|$ conditional on having $S \subseteq F$ (Proposition 2 in Section 5).

Support recovery by thresholding
The bounds on the estimation error presented in the preceding two sections imply that hard thresholding of the NNLS estimator may be an effective means for recovery of the support $S$. Formally, for a threshold $t \ge 0$, the hard-thresholded NNLS estimator is defined componentwise by $\hat\beta_j(t) = \hat\beta_j 1(\hat\beta_j > t)$, $j = 1, \ldots, p$, and we consider $\hat S(t) = \{j : \hat\beta_j > t\}$ as an estimator for $S$. In principle, the threshold might be chosen according to Theorem 3 or 4: if $t > \bar b$ and $\beta_{\min}(S) > \bar b + \tilde b$, where $\bar b$ and $\tilde b$ denote upper bounds on $\|\hat\beta_{S^c}\|_\infty$ and $\|\hat\beta_S - \beta^*_S\|_\infty$, respectively, one has that $S = \hat S(t)$ with the stated probabilities. This approach, however, is not practical, since the bounds $\bar b$ and $\tilde b$ depend on constants that are not accessible. In the sequel, we propose a data-driven approach as devised in [23] for support recovery on the basis of marginal regression. A central observation in [23] is that direct specification of the threshold can be avoided if the purpose of thresholding is support recovery. In fact, given a ranking $(r_j)_{j=1}^p$ of the predictors $\{X_j\}_{j=1}^p$ so that $r_j \le s$ for all $j \in S$, it suffices to estimate $s$. In light of Theorems 2 to 4, NNLS may give rise to such a ranking by setting $r_j = k$ if $\hat\beta_j = \hat\beta_{(k)}$, where $\hat\beta_{(1)} \ge \cdots \ge \hat\beta_{(p)}$ is the sequence of coefficients arranged in decreasing order (4.10). Theorem 5 below asserts that, conditional on having an ordering in which the first $s$ variables are those in $S$, support recovery can be achieved by using the procedure in [23]. Unlike the corresponding result in [23], our statement is non-asymptotic and comes with a condition that is easier to verify. We point out that Theorem 5 is of independent interest, since it is actually not specific to NNLS, but would equally apply to any estimator yielding the correct ordering of the variables.
Theorem 5. Consider the data-generating model of Theorem 3 and suppose that the NNLS estimator has the property that, with the ranking according to (4.10), it holds that $r_j \le s$ for all $j \in S$. For any $M \ge 0$, set
$$\hat s = \max\big\{k \ge 1 : \delta(k) > t(M)\big\}, \qquad \delta(k) = \|(\Pi(k) - \Pi(k-1))y\|_2^2, \tag{4.11}$$
where $t(M)$ is a cut-off proportional to $\sigma^2 \log p$ with the proportionality factor depending on $M$, and $\Pi(k)$ denotes the projection on the linear space spanned by the variables whose ranks are no larger than $k$ (using $\Pi(0) = 0$). Then, provided $\beta_{\min}(S)$ obeys a moderate lower bound, one has $\hat s = s$ and hence $\{j : r_j \le \hat s\} = S$ with high probability. We note that the required lower bound on $\beta_{\min}(S)$ is rather moderate. Similar or even more stringent lower bounds are required throughout the literature on support recovery in a noisy setup [9,12,33,54,59,60], and are typically already needed to ensure that the variables in $S$ are ranked at the top (cf. also Theorems 2 to 4).
Strictly speaking, the estimate $\hat s$ in Theorem 5 is not operational, since knowledge of the noise level $\sigma$ is assumed. In practice, $\sigma$ has to be replaced by a suitable estimator. Variance estimation in high-dimensional linear regression with Gaussian errors continues to be a topic of active research, with several significant advances made very recently [26]. In our experiments, this issue appears to be minor, because even naive plug-in estimation of the form $\hat\sigma^2 = \frac{1}{n}\|y - X\hat\beta\|_2^2$ yields satisfactory results (cf. Section 6). A nice property of the approach is its computational simplicity: repeated evaluation of $\delta(k)$ in (4.11) can be implemented efficiently by updating QR decompositions. Finally, we note that subsequent to thresholding, it is beneficial to re-compute the NNLS solution using the data $(y, X_{\hat S})$ only, because the removal of superfluous variables leads to a more accurate estimation of the support coefficients.
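The following sketch assembles the pieces of this subsection: ranking by NNLS coefficients, naive plug-in variance estimation, and the decrements $\delta(k)$. The exact cut-off constant is an assumption patterned on (4.11), not the constant of Theorem 5, and for brevity the projections are recomputed from scratch rather than by QR updating:

```python
import numpy as np

def support_by_ranking(X, y, beta_hat, M=1.0):
    """Rank variables by NNLS coefficients, then estimate s from RSS decrements."""
    n, p = X.shape
    order = np.argsort(-beta_hat)                   # ranks r_j via decreasing coefficients
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / n    # naive plug-in noise estimate
    cutoff = 2 * sigma2 * (1 + M ** 2) * np.log(p)  # assumed form of t(M)
    rss_prev = np.sum(y ** 2)                       # RSS for Pi(0) = 0
    s_hat = 0
    for k in range(1, min(n, p)):
        Q, _ = np.linalg.qr(X[:, order[:k]])        # basis for Pi(k)
        rss = np.sum(y ** 2) - np.sum((Q.T @ y) ** 2)
        delta_k = rss_prev - rss                    # ||(Pi(k) - Pi(k-1)) y||_2^2
        if delta_k <= cutoff:
            break
        s_hat = k
        rss_prev = rss
    return set(order[:s_hat])
```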

Comparison of NNLS and the non-negative lasso
Let us recall the non-negative lasso problem (1.3), with minimizer denoted by $\hat\beta^{\ell_1}$. In the present subsection, we elaborate on similarities and differences of NNLS and the non-negative lasso regarding the $\ell_\infty$-error in estimating $\beta^*$. We first state a result according to which the non-negative lasso succeeds in support recovery. We then argue that in general, the non-negative lasso does not attain the optimal rate $O(\sqrt{\log(p)/n})$ in estimation with respect to the $\ell_\infty$-norm, by providing a specific design as a counterexample. We finally conclude by summarizing benefits and drawbacks of the non-negative lasso and NNLS in an informal way.
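For concreteness, the non-negative lasso (1.3) can be solved by cyclic coordinate descent, since each coordinate update has a closed form under the normalization $\|X_j\|_2^2 = n$; a minimal sketch (ours, not the LARS-based implementation used in Section 6):

```python
import numpy as np

def nn_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for min_{beta >= 0} ||y - X beta||_2^2 / n + lam * sum(beta),
    assuming ||X_j||_2^2 = n for all j."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                  # residual y - X beta
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r + n * beta[j]       # X_j' (y - X_{-j} beta_{-j})
            new = max(0.0, (rho - n * lam / 2) / n)   # one-sided soft threshold
            r += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta
```

Setting lam = 0 recovers a (slow) NNLS solver, which makes the formal relationship between the two estimators explicit.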
Non-negative irrepresentable condition. We study the performance of the non-negative lasso estimator $\hat\beta^{\ell_1}$ under a version of the 'irrepresentable condition' that takes into account the non-negativity of the regression coefficients.
The irrepresentable condition has been employed e.g. in [35,52,54,61] to study the (ordinary) lasso from the point of view of support recovery. For given $S \subset \{1, \ldots, p\}$, the non-negative irrepresentable constant is defined as
$$\iota(S) = \max_{j \in S^c} \Sigma_{jS} \Sigma_{SS}^{-1} 1. \tag{4.12}$$
It follows from the analysis in [54] that the non-negative irrepresentable condition $\iota(S) < 1$ is necessary for the non-negative lasso to recover the support $S$, cf. Theorem 6 below.
Remark 2. We point out that the condition $\iota(S) < 1$ can be regarded as a strengthening of the condition
$$\exists w \in \mathbb{R}^n: \quad \langle X_j, w \rangle = 1,\ j \in S, \qquad \langle X_j, w \rangle < 1,\ j \in S^c, \tag{4.13}$$
which is a necessary condition for recovering a non-negative solution with support $S$ as the minimum $\ell_1$-norm solution of an underdetermined linear system of equations [62], which corresponds to a non-negative lasso problem in the absence of noise. The non-negative irrepresentable condition $\iota(S) < 1$ results from (4.13) by additionally requiring the vector $w$ to be contained in the column space of $X_S$. Condition (4.13) highlights a conceptual connection to the separating hyperplane constant as defined in (4.2). Note that, as distinguished from (4.2), the separating hyperplane underlying (4.13) does not contain the origin.
Support recovery with the non-negative lasso. Using the scheme developed in [54], one can show that under the non-negative irrepresentable condition and a suitable choice of the regularization parameter $\lambda$, the non-negative lasso has the property that $\hat\beta^{\ell_1}_{S^c} = 0$, i.e. no false positive variables are selected, and support recovery can be deduced from an appropriate lower bound on the minimum support coefficient $\beta_{\min}(S)$. Along with a bound on $\|\hat\beta^{\ell_1} - \beta^*\|_\infty$, this is stated in the next theorem.
Theorem 6. Assume that $y = X\beta^* + \varepsilon$, where $\beta^* \succeq 0$ has support $S$ and $\varepsilon$ has i.i.d. sub-Gaussian entries with parameter $\sigma$. Suppose further that the non-negative irrepresentable condition $\iota(S) < 1$ according to (4.12) holds. For any $M \ge 0$, a choice of $\lambda$ proportional to $\sigma\sqrt{\log(p)/n}/(1 - \iota(S))$ ensures that, with high probability, $\hat\beta^{\ell_1}_{S^c} = 0$ and $\|\hat\beta^{\ell_1}_S - \beta^*_S\|_\infty \le b$, where $b$ is proportional to $\lambda \|\Sigma_{SS}^{-1} 1\|_\infty$ (4.14).

There is some resemblance of the bound $b$ in (4.14) to that of Theorem 3 for NNLS, with $\tau^2(S)$ playing a role comparable to $1 - \iota(S)$ and $\|\Sigma_{SS}^{-1} 1\|_\infty$ being a lower bound on the quantity $K(S)$ defined in (4.5). On the other hand, Theorem 6 yields a considerably stronger control of the off-support coefficients ($\hat\beta^{\ell_1}_{S^c} = 0$) than does Theorem 3, which only provides an $\ell_1$-bound on $\hat\beta_{S^c}$. Irrepresentable conditions as in the above theorem are regarded as rather restrictive in the literature [36,58,60]. Even in case the condition $\iota(S) < 1$ is fulfilled, the choice of $\lambda$ in (4.14) with $\iota(S)$ possibly close to one may impose a rather stringent lower bound on $\beta_{\min}(S)$ in order to achieve support recovery. At the same time, the choice $\lambda = 2\sigma\sqrt{2\log(p)/n}$ in combination with the restricted eigenvalue condition (Condition 2), which is regarded as far less restrictive than the irrepresentable condition, only yields a bound on $\|\hat\beta^{\ell_1} - \beta^*\|_q$ for $q \in [1, 2]$, and it is no longer guaranteed that $\hat\beta^{\ell_1}_{S^c} = 0$. As a result, two-stage procedures like subsequent thresholding of $\hat\beta^{\ell_1}$ may be needed for support recovery. However, this approach in general entails a sub-optimal condition $\beta_{\min}(S) = \Omega(\sqrt{s\log(p)/n})$ because of the term $\|\Sigma_{SS}^{-1} 1\|_\infty$ scaling as $\Theta(\sqrt{s})$ in the worst case. In Appendix J, this issue is illustrated by providing an explicit example of a design representing that worst case. We point out that optimal rates for estimation in sup-norm of the order $O(\sqrt{\log(p)/n})$ have been established e.g. in [9,12,33] for the lasso under 'mutual incoherence' conditions requiring fairly restrictive upper bounds on the maximum inner product between two distinct columns of $X$.
NNLS vs. the non-negative lasso: pros and cons. To sum up, we list advantages and disadvantages of NNLS and the non-negative lasso, thereby providing some guidance on when to use which approach in practice. While NNLS can formally be seen as a special case of the non-negative lasso with $\lambda = 0$, we suppose for the subsequent discussion that $\lambda \ge \sigma\sqrt{2\log(p)/n}$, as is standard in the literature on the lasso.
• As already stressed previously, reasonable performance of NNLS in a high-dimensional regime is restricted to a specific class of designs, which excludes standard models such as random Gaussian design matrices (cf. the discussion in Section 5). This contrasts with the non-negative lasso, which has at least moderate performance guarantees via a slow rate bound in the spirit of Theorem 1 without further conditions on the design.

• As discussed in the previous paragraph, the non-negative lasso does not always attain the optimal rate for estimating $\beta^*$ in the sup-norm, in which case some room for improvement is left for NNLS. In Section 6, we present two designs for which NNLS empirically yields a better performance both with regard to estimation and support recovery via thresholding, where the $\ell_\infty$-error in estimation enters crucially. On the other hand, as asserted by Theorem 6, the non-negative lasso succeeds in support recovery even without thresholding in certain regimes.

• From the point of view of a practitioner who is little familiar with theoretical results on how to set the regularization parameter of the (non-negative) lasso, NNLS has the advantage that it can be applied directly, without the need to specify or tune a regularization parameter.

Discussion of the analysis of NNLS for selected designs
Our main results concerning the performance of NNLS as stated in Theorems 1 to 4 are subject to the following conditions: the self-regularizing property (Theorem 1), a combination of that property with a restricted eigenvalue condition (Theorem 2), a lower bound on the separating hyperplane constant (Theorem 3), and sparsity of the NNLS solution (Theorem 4). In the present section, we discuss to what extent these conditions are fulfilled for selected designs, which we roughly divide into three classes. The first is the class of non-self-regularizing designs, for which non-negativity constraints on the regression coefficients do not seem to yield any significant advantage. This is in contrast to the third class of equi-correlation-like designs, which are shown to be tailored to NNLS. The second class comprises designs with a block or band structure arising in typical applications.

Non-self regularizing designs
In this paragraph, we provide several common examples of designs not having the self-regularizing property of Condition 1. Consequently, our main results, which rely on that condition, do not apply. Those designs can be identified by evaluating the quantity $\tau_0^2$ (2.3) underlying Condition 1. From
$$\tau_0^2 = \min_{\lambda \in T^{p-1}} \lambda^\top \Sigma \lambda \le \frac{1^\top \Sigma 1}{p^2}, \tag{5.1}$$
we see that the sum of the entries of $\Sigma$ must scale as $\Omega(p^2)$ for Condition 1 to be satisfied. In particular, this requires $\Sigma$ to have $\Omega(p^2)$ entries lower bounded by a positive constant, and a maximum eigenvalue scaling as $\Omega(p)$. Among others, this is not fulfilled for the following examples.
Example 1 (orthonormal design). As already mentioned while motivating the self-regularizing property in Section 2, for $\Sigma = I$, $\tau_0^2$ attains the upper bound in (5.1), which yields $\tau_0^2 = 1/p$.

Example 2 (power decay). Let the entries of $\Sigma$ be given by $\sigma_{jk} = \rho^{|j-k|}$, $j, k = 1, \ldots, p$, with $\rho \in [0, 1)$. Since each row sum of $\Sigma$ is bounded by $(1+\rho)/(1-\rho)$, the upper bound (5.1) yields $\tau_0^2 = O(1/p)$.

Example 3 (random Gaussian matrices). Consider a random matrix $X$ whose entries are drawn i.i.d. from the standard Gaussian distribution. We here refer to the work [19], in which NNLS is studied in the noiseless case. In that work, random Gaussian matrices are considered as part of a broader class termed the centro-symmetric ensemble. In a nutshell, that class encompasses random matrices with independent columns whose entries have mean zero. The authors of [19] point out the importance of Wendel's Theorem [40,57], which provides the exact probability of the columns of $X$ being contained in a halfspace, i.e. of having $\tau_0^2 > 0$. That result implies via Hoeffding's inequality that for $p/n > 2$, $\tau_0^2 = 0$ with probability at least $1 - \exp(-n(p/n - 2)^2/2)$, so that the non-negativity constraints in NNLS become meaningless (cf. the discussion following (2.1)).
In all three of these examples, a similar reasoning applies with regard to the scaling of the separating hyperplane constant $\tau^2(S)$ (4.3), because its role is that of $\tau_0^2$ with respect to the matrix $Z = \Pi_S^\perp X_{S^c}$. As a consequence, it is not hard to see that the scalings of the above examples continue to hold (uniformly in $S$) with $p$ replaced by $p - s$.

Designs with non-negative Gram matrices having a band or block structure
We now present a simple sufficient condition for the self-regularizing property to be satisfied, based on which we will identify a class of designs for which non-negativity of the regression coefficients may be a powerful constraint. Suppose that the Gram matrix has the property that all its entries are lower bounded by a positive constant $\sigma_0$. We then have the following lower bound corresponding to the upper bound (5.1) above:
$$\tau_0^2 = \min_{\lambda \in T^{p-1}} \lambda^\top \Sigma \lambda \ge \sigma_0 \min_{\lambda \in T^{p-1}} (1^\top \lambda)^2 = \sigma_0,$$
i.e. Condition 1 is satisfied with $\tau^2 = \sigma_0$. More generally, in case $\Sigma$ has exclusively non-negative entries and the set of variables $\{1, \ldots, p\}$ can be partitioned into blocks $\{B_1, \ldots, B_K\}$ such that the minimum entries of the corresponding principal submatrices of $\Sigma$ are lower bounded by a positive constant $\sigma_0$, then Condition 1 is satisfied with $\tau^2 = \sigma_0/K$:
$$\min_{\lambda \in T^{p-1}} \lambda^\top \Sigma \lambda \ge \sigma_0 \min_{\lambda \in T^{p-1}} \sum_{l=1}^K \Big(\sum_{j \in B_l} \lambda_j\Big)^2 = \sigma_0 \min_{x \in T^{K-1}} \sum_{l=1}^K x_l^2 = \frac{\sigma_0}{K},$$
where in the last equality we have used that the minimum of the map $x \mapsto \sum_{l=1}^K x_l^2$ over the simplex $T^{K-1}$ is attained for $x = 1/K$. As sketched in Figure 1, such block structure arises e.g. when sampling localized non-negative functions $\phi_j(\cdot) = \phi((\cdot - \mu_j)/h)$, where $\mu_j \in [0, 1]$, $j = 1, \ldots, p$, and $h = 1/K$ for some positive integer $K$. Setting $X = (\phi_j(u_i))_{1 \le i \le n,\ 1 \le j \le p}$ and partitioning the $\{\mu_j\}$ by dividing $[0, 1]$ into the intervals $[0, h], (h, 2h], \ldots, (1 - h, 1]$ and accordingly $B_l = \{j : \mu_j \in ((l-1) \cdot h,\ l \cdot h]\}$, $l = 1, \ldots, K$, we have that $\min_{1 \le l \le K} \frac{1}{n} X_{B_l}^\top X_{B_l} \succeq h$, such that Condition 1 holds with $\tau^2 = h/K = 1/K^2$.
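The construction with localized functions is easy to instantiate; the sketch below samples Gaussian bumps (a specific choice of $\phi$ made here for illustration) and inspects the Gram entries of nearby columns, which are the quantities $\sigma_0$ entering the lower bound $\tau^2 = \sigma_0/K$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, h = 100, 200, 0.1                       # bump width h = 1/K with K = 10

u = np.arange(1, n + 1) / n                   # sampling points in [0, 1]
mu = rng.uniform(size=p)                      # bump locations
X = np.exp(-0.5 * ((u[:, None] - mu[None, :]) / h) ** 2)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # normalize: ||X_j||_2^2 = n

Sigma = X.T @ X / n
# Gram entries of columns whose centers lie within one bandwidth of each other;
# these stay bounded away from zero, giving sigma_0 > 0 within each block B_l.
near = np.abs(mu[:, None] - mu[None, :]) <= h
print("min Gram entry among nearby pairs:", np.min(Sigma[near]))
```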
Applications. As mentioned in the introduction, NNLS has been shown to be remarkably effective in solving deconvolution problems [30,31,42]. The observations there are signal intensities measured over time, location, etc., which can be modelled as a series of spikes (Dirac impulses) convolved with a point-spread function (PSF) arising from the limited resolution of the measurement device. The PSF is a non-negative localized function as outlined in the previous paragraph. Deconvolution of spike trains is studied in more detail in Section 6.1 below. Similarly, bivariate PSFs can be used to model blurring in greyscale images, and NNLS has been considered as a simple method for deblurring and denoising [2].

Equi-correlation-like designs
We first discuss equi-correlated design before studying random designs whose population Gram matrix has equi-correlation structure. While the population setting is limited to having n ≥ p, the case n < p is possible for random designs.
Equi-correlated design. For $\rho \in (0, 1)$, consider equi-correlated design with Gram matrix $\Sigma = (1 - \rho)I + \rho 1 1^\top$. We then have
$$\tau_0^2 = \min_{\lambda \in T^{p-1}} \lambda^\top \Sigma \lambda = (1 - \rho)\min_{\lambda \in T^{p-1}} \|\lambda\|_2^2 + \rho = \frac{1 - \rho}{p} + \rho \ge \rho,$$
so that the design has the self-regularizing property of Condition 1. Let $\emptyset \ne S \subset \{1, \ldots, p\}$ be arbitrary. According to representation (4.4), the corresponding separating hyperplane constant $\tau^2(S)$ can be evaluated similarly to (5.4). We have
$$\tau^2(S) = \min_{\lambda \in T^{p-s-1}} \lambda^\top \big(\Sigma_{S^cS^c} - \Sigma_{S^cS}\Sigma_{SS}^{-1}\Sigma_{SS^c}\big)\lambda = \frac{1 - \rho}{p - s} + \frac{\rho(1 - \rho)}{1 + (s - 1)\rho}, \tag{5.5}$$
where we have used that $1$ is an eigenvector of $\Sigma_{SS}$ corresponding to its largest eigenvalue $1 + (s - 1)\rho$. We observe that $\tau^2(S) = \tau^2(s)$, i.e. (5.5) holds uniformly in $S$. We are not aware of any design for which $\min_{S:\,|S| = s < p/2} \tau^2(S)$ scales at a faster rate than $s^{-1}$, which lets us hypothesize that the scaling of $\tau^2(S)$ in (5.5) uniformly over all sets of a fixed cardinality $s$ is optimal. On the other hand, when not requiring uniformity in $S$, $\tau^2(S)$ can be as large as a constant independent of $s$, as is the case for the following example. Consider a Gram matrix of the block-diagonal form
$$\Sigma = \begin{pmatrix} \Sigma_{SS} & 0 \\ 0 & (1 - \rho)I + \rho 1 1^\top \end{pmatrix}.$$
Combining (5.4) and (5.5), we obtain that $\tau^2(S) = \rho + \frac{1 - \rho}{p - s}$, independently of the specific form of $\Sigma_{SS}$. At the same time, this scaling does not hold uniformly over all choices of $S$ with $|S| = s$, given the equi-correlation structure of the block $\Sigma_{S^cS^c}$.
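The closed-form expression (5.5) can be cross-checked numerically: for the equi-correlated Gram matrix, the Schur complement in (4.4) again has equi-correlation structure, so the minimizing $\lambda$ is the barycenter of the simplex. A sketch:

```python
import numpy as np

p, s, rho = 50, 5, 0.5
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

# Schur complement of Sigma_SS, cf. representation (4.4).
S = np.arange(s); Sc = np.arange(s, p)
Schur = Sigma[np.ix_(Sc, Sc)] - Sigma[np.ix_(Sc, S)] @ np.linalg.solve(
    Sigma[np.ix_(S, S)], Sigma[np.ix_(S, Sc)])

lam = np.full(p - s, 1.0 / (p - s))          # minimizer by symmetry
print(lam @ Schur @ lam,
      (1 - rho) / (p - s) + rho * (1 - rho) / (1 + (s - 1) * rho))  # (5.5)
```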
Sparsity of the NNLS solution for equi-correlated design. Exploiting the specifically simple structure of the Gram matrix, we are able to derive the distribution of the cardinality of the active set $F = \{j : \hat\beta_j > 0\}$ of the NNLS solution $\hat\beta$ conditional on the event $\{\hat\beta_S \succ 0\}$. For the sake of better illustration, the result is stated under the assumption of Gaussian noise. Inspection of the proof shows that, with appropriate modifications, the result remains valid for arbitrary noise distributions.

Proposition 2. Consider equi-correlated design with Gaussian noise, let $z$ be a zero-mean Gaussian random vector whose covariance matrix has equi-correlation structure with parameters determined by $\rho$, $s$ and the factor $1/(1 + (s-1)\rho)$, and let $z_{(1)} \ge \cdots \ge z_{(p-s)}$ denote the arrangement of the components of $z$ in decreasing order. Conditional on the event $\{\hat\beta_S \succ 0\}$, the cardinality of the active set $F = \{j : \hat\beta_j > 0\}$ has the distribution given in (5.6), which is expressed in terms of the order statistics of $z$, derived random variables $\{\zeta_j\}_{j=1}^{p-s-1}$, $j = 1, \ldots, p - s - 1$, and the 'threshold'
$$\theta(s, \rho) = \frac{\rho}{1 + (s - 1)\rho}. \tag{5.6}$$
Proposition 2 asserts that, conditional on having the support of $\beta^*$ included in the active set, the distribution of its cardinality is $s$ plus an extra term, whose distribution depends on that of the random variables $\{\zeta_j\}_{j=1}^{p-s-1}$ and the 'threshold' $\theta(s, \rho)$. In order to better understand the role of these quantities, let us first consider the case $\rho = 0$, i.e. orthonormal design: since $\theta(s, 0) = 0$, the distribution of $|F|$ is equal to $s$ plus the distribution of the number of non-negative components of a $(p - s)$-dimensional Gaussian random vector, i.e. a binomial distribution with $p - s$ trials and a success probability of $1/2$. For $\rho > 0$, the distribution (5.6) is not directly accessible in closed form, but it can be approximated arbitrarily well by Monte Carlo simulation for given $p$, $s$ and $\rho$ (note that the distribution does not depend on the scale of the noise $\varepsilon$). Figure 2 depicts the 0.01-, 0.5- and 0.99-quantiles of the distribution of $|F|$ by virtue of (5.6) for $p = 500$ and various choices of $\rho$ and $s$. The results are based on 10,000 Monte Carlo simulations for each value of $s$. For comparison, for each pair $(s, \rho)$, we generate 100 datasets $(X, y)$ with $n = p = 500$ according to the model of Proposition 2 with standard Gaussian noise (the components of $\beta^*_S$ are set to the lower bound on $\beta_{\min}(S)$ in Proposition 2 to ensure that the event $\{\hat\beta_S \succ 0\}$ has probability close to one). We then solve the corresponding NNLS problems using the active set algorithm of Lawson and Hanson [29] and obtain the cardinalities of the active sets. Figure 2 shows a strong agreement between the predictions regarding the size of the active set based on the distribution of Proposition 2 and the empirical distributions.
Non-negative random designs with equi-correlation structure. We now consider random design matrices whose population Gram matrix is that of equi-correlated design, but with the possibility that $n < p$. It is investigated to what extent these random design matrices inherit properties from the population setting studied in the previous paragraph. Specifically, we consider the following ensemble of random matrices:
$$\text{Ens}_+ = \big\{X = (x_{ij})\ :\ \{x_{ij}\} \text{ drawn i.i.d. from a sub-Gaussian distribution on } \mathbb{R}_+\big\}. \tag{5.7}$$
All random designs from the class $\text{Ens}_+$ share the property that the population Gram matrix $\Sigma^* = E[\frac{1}{n}X^\top X]$ possesses equi-correlation structure after re-scaling the entries of $X$ by a common factor. Denoting the mean of the entries and their squares by $\mu$ and $\mu_2$, respectively, we have
$$\Sigma^* = (\mu_2 - \mu^2) I + \mu^2 1 1^\top,$$
such that re-scaling by $1/\sqrt{\mu_2}$ leads to equi-correlation structure with parameter $\rho = \mu^2/\mu_2$. Since applications of NNLS predominantly involve non-negative design matrices, it is instructive to have a closer look at the class (5.7) as a basic model for such designs. Among others, the class of sub-Gaussian distributions on $\mathbb{R}_+$ encompasses the zero-truncated Gaussian distribution, all distributions on a bounded subset of $\mathbb{R}_+$, e.g. the family of beta distributions (with the uniform distribution as a special case) on $[0, 1]$, Bernoulli distributions on $\{0, 1\}$ or more generally multinomial distributions on the non-negative integers $\{0, 1, \ldots, K\}$, as well as any finite mixture of these distributions.
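Sampling from $\text{Ens}_+$ and re-scaling to equi-correlation structure is straightforward; the sketch below uses uniform entries (one representative choice; $\mu$ and $\mu_2$ are the first two moments of the entry distribution):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 1000

X = rng.uniform(size=(n, p))              # i.i.d. non-negative (sub-Gaussian) entries
mu, mu2 = 0.5, 1.0 / 3.0                  # E[x] and E[x^2] for Uniform[0, 1]
X /= np.sqrt(mu2)                         # re-scale so diag of Sigma* equals 1

rho = mu ** 2 / mu2                       # population correlation: here 3/4
Sigma_hat = X.T @ X / n
print(rho, Sigma_hat[0, 1])               # off-diagonal concentrates around rho
```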
As shown in the sequel, the class (5.7) provides instances of designs for which Theorems 2 to 4 yield meaningful results in the $n < p$ setting. Our reasoning hinges both on theoretical analysis providing bounds on the deviation from the population counterparts and on numerical results.

Self-regularizing property + restricted eigenvalue condition of Theorem 2.
Recall that Theorem 2 requires a combination of the self-regularizing property (Condition 1) and the restricted eigenvalue condition (Condition 2) to be satisfied. This turns out to be the case for designs from $\text{Ens}_+$ in light of the following proposition. The statement relies on recent work of Rudelson and Zhou [39] on the restricted eigenvalue condition for random matrices with independent sub-Gaussian rows.

Proposition 4. Let $X$ be a random matrix from $\text{Ens}_+$ and let $S \subset \{1, \ldots, p\}$, $|S| = s$. Then there exist constants $c, c', C, C' > 0$ depending only on $\rho$ and the sub-Gaussian parameter of the centered entries of $X$ such that for all $n \ge Cs^2\log(p \vee n)$, the conditions of Theorem 2 (Condition 1 and the restricted eigenvalue condition) are satisfied with probability no less than $1 - 6/(p \vee n) - 3\exp(-c'(s \vee \log n))$.
It turns out that the requirement on the sample size as indicated by Proposition 4 is too strict in light of the results of complementary numerical experiments. For these experiments, $n = 500$ is kept fixed and $p \in \{1.2, 1.5, 2, 3, 5, 10\} \cdot n$ and $s \in \{0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5\} \cdot n$ vary. For each combination of $(p, s)$ and several representatives of $\text{Ens}_+$ (re-scaled such that the population Gram matrix has equi-correlation structure), 100 random design matrices are generated. We set $S = \{1, \ldots, s\}$, compute $Z = (I - \Pi_S)X_{S^c}$ using a QR decomposition of $X_S$ and then solve the quadratic program $\min_{\lambda \in T^{p-s-1}} \lambda^\top \frac{1}{n}Z^\top Z \lambda$ with value $\tau^2(S)$ by means of an interior point method [6]. As representatives of $\text{Ens}_+$, we have considered matrices whose entries have been drawn from four distributions, denoted $E_1$ to $E_4$. In order to obtain population Gram matrices of varying correlation $\rho$, we use mixture distributions with one of the two mixture components being a point mass at zero (denoted by $\delta_0$). Note that the larger the proportion $1 - a$ of that component, the smaller $\rho$.
For space reasons, we here only report the results for $E_1$; regarding $E_2$ to $E_4$, the reader is referred to the supplementary file [45]. In brief, the results confirm what is shown here. Figure 3 displays the 0.05-quantiles of $\tau^2(S)$ over the sets of 100 replications. It is revealed that for $\tau^2(S)$ to be positive, $n$ does not need to be as large relative to $s$ as suggested by Proposition 4. In fact, even for $s/n$ as large as 0.3, $\tau^2(S)$ stays sufficiently bounded away from zero as long as $p$ is not dramatically larger than $n$ ($p/n = 10$).
Implications for compressed sensing-type problems. In compressed sensing (CS) [13,15], the goal is to recover a sparse vector $\beta^*$ from a limited number of non-adaptive measurements. In the typical setup, the measurements are linear combinations of $\beta^*$ contaminated with additive noise and thus fall under the linear model (1.1). CS is related to group testing [20,21]. Here, $\beta^* \in \{0, 1\}^p$ indicates the presence of a certain attribute of low prevalence in $p$ objects, e.g. presence of a rare disease in individuals or of a defect in manufactured goods. An effective strategy for locating the affected entities is to form groups, test for prevalence at the group level, discard groups with a negative test result, and repeat the procedure with a new set of groups. The measurements obtained in this way are both adaptive and non-linear and hence do not fit into the conventional framework of CS. They do, however, if it is possible to obtain aggregate measurements (i.e. sums) over arbitrary groups. As exemplified in Figure 4, the information of interest can be retrieved from few aggregate measurements and the associated group assignments. The measurement design then needs to achieve an appropriate amount of overlap between the groups. In fact, it is not possible to recover $\beta^*$ from a reduced number of measurements involving only disjoint groups. At the same time, the overlap has to be limited to ensure that collections of $2s$ columns of the measurement matrix are linearly independent; otherwise, recovery of $\beta^*$ is not possible in general. Without further prior knowledge about the location of the non-zero entries of $\beta^*$, random group assignments, in which each entity $j$, $j = 1, \ldots, p$, is assigned to group $i$, $i = 1, \ldots, n$, independently with probability $\pi$, appear to be reasonable in light of the results of the previous paragraph. Propositions 3 and 4 assert that, with high probability, the resulting measurement matrix suits well a sparse recovery approach based on NNLS.
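A sketch of this measurement scheme ($\pi$, $n$, $p$, $s$ are illustrative choices, not values from the paper): each entity joins each group independently with probability $\pi$, aggregate counts are observed, and NNLS is used for recovery:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(5)
p, n, s, pi = 400, 120, 5, 0.1

B = (rng.uniform(size=(n, p)) < pi).astype(float)   # random group assignments
beta_star = np.zeros(p)
beta_star[rng.choice(p, s, replace=False)] = 1.0    # s affected entities
y = B @ beta_star + 0.01 * rng.standard_normal(n)   # noisy aggregate counts

beta_hat, _ = nnls(B, y)
top = np.argsort(-beta_hat)[:s]                     # s largest NNLS coefficients
print(sorted(top), sorted(np.flatnonzero(beta_star)))  # supports typically agree
```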
Network tomography as discussed in [34] arises as a generalization of the above setting, with measurements of the form $y = BA\beta^* + \varepsilon$, where $B \in \mathbb{R}^{n \times p}_+$ represents the measurement design and $A \in \mathbb{R}^{p \times p}_+$ is the adjacency matrix of the $p$ nodes whose status of interest is contained in $\beta^* \in \mathbb{R}^p_+$. The goal is to spot sources of anomaly within the network. As distinguished from a random design setting, the measurement matrix $X = BA$ cannot be chosen freely, since $A$ is fixed and the choice of $B$ is subject to various constraints. Therefore, it is not clear a priori whether $X$ satisfies our conditions, even though $\Sigma$ is potentially self-regularizing as $X$ is non-negative.
Sparsity of the solution. In Proposition 2, we have characterized the sparsity of the NNLS solution in the population setting. It is of interest to investigate this aspect for random design in the $p > n$ setup, particularly in light of Theorem 4, which implicitly relies on having a sparse NNLS solution. We here provide a sketch of the empirical behaviour within the experimental framework of the previous paragraph. We generate random design matrices ($n \in \{250, 500, 750, 1000\}$, $p/n \in \{2, 5, 10\}$) from $E_1$ for the four values of the parameter $\rho$ given above. For several values of $s/n$ ranging from 0 to 0.3, we generate observations $y = X\beta^* + \varepsilon$, where $\varepsilon$ is a Gaussian noise vector and the components of $\beta^*_S$ are set to the lower bound in Proposition 2. For each combination of $(n, p/n, s/n, \rho)$, 100 replications are considered and the fraction of active variables $|F|/n$ is determined. Figure 5 summarizes the main findings of this experimental study. For fixed $n = 500$, the top panel depicts the empirical distributions of $|F|/n$ over the 100 replications in comparison to the population setting (cf. Figure 2). We observe that for all parameter configurations under consideration, the cardinalities of the active sets stay visibly away from 1, with $|F|/n$ being no larger than $2/3$. The cardinalities of the active sets are larger than in the population case. The higher the sparsity level and the ratio $p/n$, the more pronounced the shift towards larger cardinalities: while for $s/n = 0$ and $\rho = 0.75$ the empirical distribution of $|F|/n$ is rather close to that of the population setting, there is a consistent gap for $s/n = 0.1$. The bottom panel displays how $|F|/n$ scales with $(n, s/n)$. For plotting and space reasons, we restrict ourselves to the 0.95-quantiles over the 100 replications and $p/n = 10$, which, as indicated by the plots in the top panel, is the worst case among all values of $p/n$ considered. The two surface plots for $\rho = 0.1$ and $\rho = 0.75$ are of a similar form; a noticeable difference occurs only for rather small $s/n$. It can be seen that for $s/n$ fixed, $|F|/n$ remains roughly constant as $n$ varies. On the other hand, $|F|/n$ increases rather sharply with $s/n$. For $s/n > 0.25$, we observe a breakdown, as $|F|/n = 1$. We point out that as long as $|F|/n < 1$, the NNLS solution and the active set are unique (with probability one), as follows from Lemma 5 in Appendix G.

Empirical performance
We here present the results of simulation studies in order to compare the performance of NNLS and the non-negative lasso in terms of prediction, estimation and sparse recovery.

Deconvolution of spike trains
We consider a positive spike-deconvolution model as in [30], as it commonly appears in various fields of application. The underlying signal $f$ is a function on $[a, b]$ of the form
$$f = \sum_{k=1}^s \beta^*_k \phi_k, \tag{6.1}$$
with $\phi_k(\cdot) = \phi(\cdot - \mu_k)$, $k = 1, \ldots, s$, where $\phi \ge 0$ is given and the $\mu_k$'s define the locations of the spikes contained in $[a, b]$. The amplitudes $\{\beta^*_k\}_{k=1}^s$ are assumed to be positive. The goal is to determine the positions as well as the amplitudes of the spikes from $n$ (potentially noisy) samples of the underlying signal $f$. As demonstrated below, NNLS can be a first step towards deconvolution. The idea is to construct a design matrix of the form $X = (\phi_j(u_i))$, where $\phi_j = \phi(\cdot - m_j)$ for candidate positions $\{m_j\}_{j=1}^p$ placed densely in $[a, b]$, and $\{u_i\}_{i=1}^n \subset [a, b]$ are the points at which the signal is sampled. We assume an additive noise model with zero-mean sub-Gaussian noise $\varepsilon$, i.e. $y_i = f(u_i) + \varepsilon_i$, $i = 1, \ldots, n$.
If $X$ has the self-regularizing property (cf. Section 2), it follows immediately from Theorem 1 that the $\ell_2$-prediction error of NNLS is bounded as
$$\frac{1}{n}\|f - X\hat\beta\|_2^2 \le E^* + O\big(\|\beta^*\|_1\sqrt{\log(p)/n}\big), \tag{6.2}$$
where $E^* = \min_{\beta \succeq 0} \frac{1}{n}\|f - X\beta\|_2^2$ and $\beta^*$ attains the minimum. Even though it is not realistic to assume that $\{\mu_k\}_{k=1}^s \subset \{m_j\}_{j=1}^p$, i.e. that the linear model is correctly specified, we may think of $E^*$ as negligible as long as the $\{m_j\}_{j=1}^p$ are placed densely enough. This means that NNLS may be suitable for de-noising. Furthermore, the bound (6.2) implies that $\hat\beta$ must have large components only for those columns of $X$ corresponding to locations near the locations $\{\mu_k\}_{k=1}^s$ of the spikes, which can then be estimated accurately by applying a simple form of post-processing as discussed in [42]. On the other hand, the application of fast rate bounds such as that of Theorem 2 or corresponding results for the lasso is not adequate here, because the dense placement of the $\{\phi_j\}_{j=1}^p$ results in a tiny, if not zero, value of the restricted eigenvalue of Condition 2.

For our simulation study, we consider model (6.1) as a starting point. The signal is composed of five spikes with amplitudes between 0.2 and 0.7, convolved with a Gaussian function. The design matrix $X = (\phi_j(u_i))$ contains evaluations of $p = 200$ Gaussians $\{\phi_j\}_{j=1}^p$ at $n = 100$ points $\{u_i\}_{i=1}^n$, where both the centers $\{m_j\}_{j=1}^p$ of the $\{\phi_j\}_{j=1}^p$ and the $\{u_i\}_{i=1}^n$ are equi-spaced in the unit interval. We have by construction that $\{m_j\}_{j=1}^p \supset \{\mu_k\}_{k=1}^s$, so that $E^* = 0$. The standard deviation of the Gaussians is chosen to be roughly twice the spacing of the $\{u_i\}$. At this point, it is important to note that the larger the standard deviation of the Gaussians, the larger the constant $\tau_0^2$ (2.2), which here evaluates to $\tau_0^2 = 0.2876$. According to that setup, we generate 100 different vectors $y$ resulting from different realizations of the noise $\varepsilon$, whose entries are drawn i.i.d. from a Gaussian distribution with standard deviation $\sigma = 0.09$. The left panel of Figure 6 visualizes the setting. We compare the performance of NNLS; the lasso/non-negative lasso with (i) the regularization parameter fixed to $\lambda_0 = 2\sigma\sqrt{2\log(p)/n}$ and (ii) $\lambda$ chosen from the grid $\lambda_0 \cdot 2^k$, $k = -5, -4, \ldots, 4$, by tenfold cross-validation; ridge regression (tuned by tenfold cross-validation); and an oracle least squares estimator based on knowledge of the positions $\{\mu_k\}_{k=1}^s$ of the spikes. The right panel of Figure 6 contains boxplots of the MSEs $\frac{1}{n}\|X\beta^* - X\hat\beta\|_2^2$ over all 100 replications. The performance of NNLS is only slightly inferior to that of the non-negative lasso, which is not far from the oracle, and roughly as good as that of the lasso. All methods improve substantially over ridge regression. The lower part of the left panel provides a summary of the NNLS estimator $\hat\beta$, which is remarkably sparse and concentrated near the positions of the underlying spikes.
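A condensed sketch of this simulation setup (ours; the post-processing of [42] is omitted and only the prediction error is reported):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(6)
n, p, s, sigma = 100, 200, 5, 0.09

u = np.arange(1, n + 1) / n                    # equi-spaced sampling points
m = np.arange(1, p + 1) / p                    # equi-spaced candidate positions
width = 2.0 / n                                # PSF roughly twice the sample spacing
X = np.exp(-0.5 * ((u[:, None] - m[None, :]) / width) ** 2)

k = np.linspace(10, 180, s, dtype=int)         # true spikes on the candidate grid
beta_star = np.zeros(p)
beta_star[k] = rng.uniform(0.2, 0.7, s)        # amplitudes between 0.2 and 0.7
y = X @ beta_star + sigma * rng.standard_normal(n)

beta_hat, _ = nnls(X, y)
print("prediction MSE:", np.sum((X @ (beta_hat - beta_star)) ** 2) / n)
```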

Sparse recovery
We now present the results of simulations in which we investigate the performance of NNLS with regard to estimation and sparse recovery in comparison to that of the non-negative lasso.
Setup. We generate data $y = X\beta^* + \varepsilon$, where $\varepsilon$ has i.i.d. standard Gaussian entries. For the design $X$, two setups are considered.
Design I: Equi-correlation-like design. The matrix $X$ is generated by drawing its entries independently from the uniform distribution on $[0,1]$ and rescaling them such that the population Gram matrix has equi-correlation structure with $\rho = 3/4$. Random matrices of that form have been considered for the numerical results in Section 5.3 (cf. Figures 2 to 5). For given $(n,p,s)$, the target $\beta^*$ is generated by setting

$$\beta^*_j = 6b\,\phi_{\min}^{-1/2}\sqrt{2\log(p)/n}\,(1 + U_j), \quad j = 1,\ldots,s,$$

where $\phi_{\min} = 1 - \rho$ denotes the smallest eigenvalue of the population Gram matrix, the $\{U_j\}_{j=1}^{s}$ are drawn uniformly from $[0,1]$, and we let the parameter $b > 0$ vary. We further set $\beta^*_j = 0$, $j = s+1,\ldots,p$; a code sketch of this construction follows this setup description.

Design II: Localized non-negative functions. The setup leading to the second class of designs can be regarded as a simplification of the deconvolution problem discussed in the previous subsection, adapted to fit the standard sparse recovery framework. Indeed, in the experiments of Section 6.1, recovery of the support of $\beta^*$ fails in the presence of noise, because the $\{\phi_j\}$ are placed too densely relative to the number of sampling points; see [11] for a similar discussion concerning the recovery of mixture components in sparse mixture density estimation. In order to circumvent this issue, we use the following scheme. As in Section 6.1, we consider sampling points $u_i = i/n$, $i = 1,\ldots,n$, in $[0,1]$ and localized functions $\{\phi_j\}_{j=1}^{p}$ with spacing $\Delta$ and width $h$; we here choose $\Delta = h = 2/n$. The design matrix is then of the form $X_{ij} = \phi_j(u_i)/c_j$, $i = 1,\ldots,n$, $j = 1,\ldots,p$, where the $c_j$'s are scaling factors such that $\|X_j\|_2^2 = n$ for all $j$. As for Design I, we generate observations $y = X\beta^* + \varepsilon$, where $\beta^*_j = b \cdot \beta_{\min}(1 + U_j)$, $j = 1,\ldots,s$, and $\beta^*_j = 0$, $j = s+1,\ldots,p$. The $\{U_j\}_{j=1}^{s}$ are drawn uniformly from $[0,1]$, and the choice $\beta_{\min} = 4\sqrt{6\log(10)/n}$ has turned out to yield sufficiently challenging problems.

For both Designs I and II, two sets of experiments are performed. In the first, the parameter $b$ controlling the magnitude of the coefficients on the support is fixed to $b = 0.5$ (Design I) and $b = 0.55$ (Design II), respectively, while the aspect ratio $p/n$ of $X$ and the fraction of sparsity $s/n$ vary. In the second set of experiments, $s/n$ is fixed to 0.2 (Design I) and 0.05 (Design II), respectively, while $p/n$ and $b$ vary. Each configuration is replicated 100 times with $n = 500$.
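As a sketch of the Design I construction (the seed and our reading of the coefficient formula are assumptions): scaling i.i.d. $U[0,1]$ entries by $\sqrt{3}$ yields a population Gram matrix with unit diagonal and off-diagonal entries $3 \cdot (1/2)^2 = 3/4 = \rho$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, rho, b = 500, 1000, 100, 0.75, 0.5

# i.i.d. U[0,1] entries scaled by sqrt(3): E[x^2] = 1, E[x x'] = 3/4 = rho
X = np.sqrt(3.0) * rng.uniform(size=(n, p))

phi_min = 1.0 - rho                            # smallest population eigenvalue
U = rng.uniform(size=s)
beta_star = np.zeros(p)
beta_star[:s] = 6 * b * phi_min ** (-0.5) * np.sqrt(2 * np.log(p) / n) * (1 + U)

y = X @ beta_star + rng.standard_normal(n)     # standard Gaussian noise
```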
Comparison. Across these runs, we compare thresholded NNLS, the non-negative lasso (NN$\ell_1$), the thresholded non-negative lasso (tNN$\ell_1$) and orthogonal matching pursuit (OMP, [49,59]) with regard to their performance in sparse recovery. Additionally, we compare NNLS and NN$\ell_1$ with $\lambda = \lambda_0$ as defined below (both without subsequent thresholding) with regard to estimation of $\beta^*$ in the $\ell_\infty$-norm (Tables 1 and 2) and the $\ell_2$-norm (the results are contained in the supplement [45]). The performance of thresholded NNLS with regard to sparse recovery is assessed in two ways. For the first (referred to as 'tNNLS*'), success is reported whenever $\min_{j \in S} \hat\beta_j > \max_{j \in S^c} \hat\beta_j$, i.e. there exists a threshold that permits support recovery. Second, the procedure of Theorem 5 (with $\sigma$ replaced by the naive estimate $\hat\sigma = (\frac{1}{n}\|y - X\hat\beta\|_2^2)^{1/2}$) is used to determine the threshold in a data-driven manner without knowledge of $S$. This approach is referred to as tNNLS. For tNN$\ell_1$, both the regularization parameter $\lambda$ and the threshold have to be specified. Instead of fixing $\lambda$ to a single value, we give tNN$\ell_1$ a slight advantage by simultaneously considering all solutions with $\lambda \in [\lambda_0 \wedge \widehat\lambda, \lambda_0 \vee \widehat\lambda]$ prior to thresholding, where $\lambda_0 = 2\sigma\sqrt{2\log(p)/n}$ equals the choice of the regularization parameter advocated in [5] to achieve the optimal rate for the estimation of $\beta^*$ in the $\ell_2$-norm, and $\widehat\lambda = 2\|X^\top\varepsilon/n\|_\infty$ can be interpreted as its empirical counterpart. The non-negative lasso modification of LARS [22] is used to obtain the solutions $\{\hat\beta(\lambda): \lambda \in [\lambda_0 \wedge \widehat\lambda, \lambda_0 \vee \widehat\lambda]\}$; we then report success of tNN$\ell_1$ whenever $\min_{j \in S} \hat\beta_j(\lambda) > \max_{j \in S^c} \hat\beta_j(\lambda)$ holds for at least one of these solutions. We point out that the specification of $\lambda_0$ is based on knowledge of the noise variance, which constitutes a second potential advantage for tNN$\ell_1$.
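Both ingredients of tNNLS reduce to a few lines; a minimal sketch, with function names of our choosing, where the first function implements the tNNLS* criterion and the second the naive noise estimate plugged into the procedure of Theorem 5:

```python
import numpy as np

def exists_separating_threshold(beta_hat, S):
    """tNNLS* criterion: some threshold puts all of S above all of S^c."""
    S = np.asarray(list(S))
    Sc = np.setdiff1d(np.arange(beta_hat.size), S)
    return beta_hat[S].min() > beta_hat[Sc].max()

def naive_sigma(y, X, beta_hat):
    """Plug-in estimate (||y - X beta_hat||_2^2 / n)^(1/2) used for sigma."""
    return np.sqrt(np.mean((y - X @ beta_hat) ** 2))
```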
Under the conditions of Theorem 6, NN$\ell_1$ recovers the support directly, without thresholding. In order to judge the usefulness of subsequent thresholding of NN$\ell_1$, we also obtain the set of non-negative lasso solutions $\{\hat\beta(\lambda): \lambda \ge \lambda_0 \wedge \widehat\lambda\}$ and check whether the sparsity pattern of any of these solutions recovers $S$.
Given its simplicity, OMP serves as a basic reference method, as sketched below. Success is reported whenever the support has been recovered after $s$ steps.
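For reference, a bare-bones version of OMP might look as follows; this is a sketch, and the implementations cited in [49,59] may differ in normalization and stopping details.

```python
import numpy as np

def omp(X, y, s):
    """Run s steps of orthogonal matching pursuit; return the active set."""
    active = []
    r = y.copy()
    for _ in range(s):
        j = int(np.argmax(np.abs(X.T @ r)))    # column most correlated with r
        if j not in active:
            active.append(j)
        beta_a, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        r = y - X[:, active] @ beta_a          # refit on active set, update r
    return set(active)

# success is reported iff omp(X, y, s) equals the true support S
```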
Discussion of the results. In summary, Figures 7 and 8 indicate that for the two setups under consideration, NNLS and its thresholded version exhibit excellent performance in sparse recovery. Superior performance relative to the thresholded non-negative lasso is achieved particularly in the more difficult parameter regimes characterized by comparatively small signal strength $b$ or a high fraction of sparsity. The results of the experiments reveal that the non-negative lasso without thresholding may perform well in estimation, but it is not competitive as far as sparse recovery is concerned. This observation is in agreement with the existing literature, in which the restrictiveness of the conditions for the lasso to select the correct set of variables is pointed out and two-stage procedures like thresholding are proposed as a remedy [36,50,60,63,64]. At this point, we stress again that NNLS only requires one parameter (the threshold) to be set, whereas competitive performance with regard to sparse recovery based on the non-negative lasso entails the specification of two parameters.

Let us now give some more specific comments separately for the two designs. For Design I, thresholded NNLS visibly improves over tNN$\ell_1$, predominantly even when the threshold is chosen adaptively without knowledge of $S$. For Design II, noticeable differences between tNNLS* and tNN$\ell_1$ occur for small values of $b$; apart from that, the performance is essentially identical. Even though the results of tNNLS remain competitive, they fall behind those of tNNLS* and tNN$\ell_1$. OMP partially keeps up with the other methods for $s/n$ and/or $b$ small, while NN$\ell_1$ also succeeds in a substantial fraction of cases for small $s/n$. This is to be contrasted with the situation for Design I, in which both OMP and NN$\ell_1$ do not achieve success in a single trial. This outcome is a consequence of the fact that the non-negative irrepresentable condition (cf. Section 4.4), which is necessary for the success of OMP as well [59], fails to hold in all these runs.

The $\ell_\infty$-errors in estimating $\beta^*$ reported in Tables 1 and 2 are in accordance with the sparse recovery results. The smaller $s/n$ and $p/n$, the closer NNLS and NN$\ell_1$ are in performance. An advantage of NNLS arises for more extreme combinations of $(s/n, p/n)$. A similar conclusion can be drawn for the $\ell_2$-errors (cf. supplement [45]).
Second, we bound the r.h.s. of (C.1). We set $A = \max_{1 \le j \le p} \frac{1}{n}|X_j^\top \varepsilon|$ and use the bound (C.3). Inserting the lower bound (C.2) into (C.3), we obtain an intermediate inequality. We may assume that $\delta_P \neq 0$; otherwise, the assertion of the theorem would follow immediately, because $\|\delta_N\|_1$ is already bounded for feasibility reasons, see below. Dividing both sides by $\|\delta_P\|_1$ and re-arranging yields (C.5), where we have assumed that $\|\delta_N\|_1 \le \|\delta_P\|_1$ (if that were not the case, one would obtain $\|\delta_P\|_1 \le \|\delta_N\|_1$, which is stronger than (C.5), since $0 < \tau^2 \le 1$). We now substitute (C.5) back into (C.1) and add $E^* = \frac{1}{n}\|X\beta^* - f\|_2^2$ to both sides of the inequality in order to obtain the assertion, noting that by feasibility of $\hat\beta$ one has $\delta \succeq -\beta^*$ and hence $\|\delta_N\|_1 \le \|\beta^*\|_1$. Using (A.3), the event $\{A \le (1+M)\sigma\sqrt{2\log(p)/n}\}$ holds with probability no less than $1 - 2p^{-M^2}$. The result follows.
Lemma 3 implies that any NNLS solution $\hat\beta$ is, given its active set $F$, a minimizer of a least squares problem subject to the equality constraint $\beta_{F^c} = 0$, that is,

$$\frac{1}{n}\|y - X\hat\beta\|_2^2 = \min_{\beta} \frac{1}{n}\|y - X\beta\|_2^2 \quad \text{subject to } \beta_{F^c} = 0.$$
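This characterization is easy to confirm numerically: solve NNLS, read off the active set $F$, and compare with the unconstrained least squares fit restricted to the columns in $F$. A minimal sketch on synthetic data (dimensions and seed are arbitrary):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
n, p = 50, 30
X = np.abs(rng.standard_normal((n, p)))       # non-negative design
y = rng.standard_normal(n)

beta_hat, _ = nnls(X, y)
F = np.flatnonzero(beta_hat > 0)              # active set

# unconstrained least squares restricted to the active columns
beta_F, *_ = np.linalg.lstsq(X[:, F], y, rcond=None)
assert np.allclose(beta_hat[F], beta_F, atol=1e-6)
```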
Proof of Lemma 2. The NNLS objective can be split into two parts as follows.
Separate minimization of the second summand on the r.h.s. of (E.1) yields $\hat\beta^{(P1)}$. Substituting $\hat\beta^{(P1)}$ for $\beta_{S^c}$ in the first summand and minimizing the latter amounts to solving (P2). In view of Lemma 3, if $\hat\beta^{(P2)} \succ 0$, it coincides with the unconstrained least squares estimator corresponding to problem (P2). This implies that the optimal value of (P2) must be zero, because the observation vector $X_S\beta^*_S + \Pi_S(\varepsilon - X_{S^c}\hat\beta^{(P1)})$ of the non-negative least squares problem (P2) is contained in the column space of $X_S$. Since the second summand in (E.1), corresponding to (P1), cannot be made smaller than its value under separate minimization, we have minimized the non-negative least squares objective.
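For concreteness, one consistent reading of the split (E.1) — assuming $\Pi_S$ denotes the orthogonal projection onto the column space of $X_S$ and $\Pi_S^\perp = I - \Pi_S$ — is

$$\frac{1}{n}\|y - X\beta\|_2^2 = \frac{1}{n}\big\|\Pi_S(y - X_S\beta_S - X_{S^c}\beta_{S^c})\big\|_2^2 + \frac{1}{n}\big\|\Pi_S^\perp(y - X_{S^c}\beta_{S^c})\big\|_2^2.$$

Since $\Pi_S^\perp X_S = 0$, the second summand depends on $\beta_{S^c}$ alone, which is why its separate minimization yields $\hat\beta^{(P1)}$, and substituting into the first summand produces the observation vector of (P2) displayed above.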

Appendix F: Proof of Theorem 3
We here state and prove a result that is slightly more general than Theorem 3, as it also covers the case of an approximately sparse target. Writing $\beta^*_{(1)} \ge \cdots \ge \beta^*_{(p)} \ge 0$ for the sequence of ordered coefficients, let $S = \{j: \beta^*_j \ge \beta^*_{(s)}\}$ be the set of the $s$ largest coefficients of $\beta^*$ (for simplicity, assume that there are no ties). For the result that follows, we think of $\|\beta^*_{S^c}\|_1$ as being considerably smaller than the entries of $\beta^*_S$.

Theorem 7. Consider the linear model $y = X\beta^* + \varepsilon$, where $\beta^* \succeq 0$ and $\varepsilon$ has i.i.d. zero-mean sub-Gaussian entries with sub-Gaussian parameter $\sigma$. If $\beta_{\min}(S) > b$, then the NNLS estimator $\hat\beta$ has the following properties with probability no less than $1 - 4p^{-M^2}$.

The special case of exact sparsity, in which $S$ equals the support of $\beta^*$, is obtained for $\|\beta^*_{S^c}\|_1 = 0$ (cf. Theorem 3).

Proof. First note that in the more general case with $\beta^*_{S^c} \neq 0$, an analog of Lemma 2 holds, with the subproblems (P1) and (P2) modified accordingly. Consider problem (P1).
Step 2: Back-substitution into (P2). Equipped with the bound just derived, we insert $\hat\beta^{(P1)}$ into problem (P2) of Lemma 2 and show that, in conjunction with the assumptions made for the minimum support coefficient $\beta_{\min}(S)$, the ordinary least squares estimator $\bar\beta^{(P2)}$ corresponding to (P2) has only positive components. Lemma 2 then yields $\bar\beta^{(P2)} = \hat\beta^{(P2)} = \hat\beta_S$. Using the closed-form expression for the ordinary least squares estimator, it remains to control the two terms $A_S = \frac{1}{n}\Sigma_{SS}^{-1}X_S^\top\varepsilon$ and $\Sigma_{SS}^{-1}\Sigma_{SS^c}(\hat\beta^{(P1)} - \beta^*_{S^c})$. For the second term, we have

$$\big\|\Sigma_{SS}^{-1}\Sigma_{SS^c}(\hat\beta^{(P1)} - \beta^*_{S^c})\big\|_\infty \le K(S)\big(\|\hat\beta^{(P1)}\|_1 + \|\beta^*_{S^c}\|_1\big). \tag{F.5}$$

Step 3: Putting together the pieces. The two random terms $A$ and $A_S$ are maxima of finite collections of linear combinations of sub-Gaussian random variables, so that (A.2) in Appendix A can be applied by estimating Euclidean norms. For $A$, we use (F.2). Second, we have $(A_S)_j = \frac{1}{n}\langle X_S \Sigma_{SS}^{-1} e_j, \varepsilon\rangle$, where $e_j$ denotes the $j$-th canonical basis vector.

Since the columns of $X$ are in general linear position, the interior of $C$ is non-empty. Note that $\partial C$ equals the union of the facets of $C$, that is, of the sets $C_J$, $J \subset \{1,\ldots,p\}$, $|J| = n-1$, for which there exists $w \in \mathbb{R}^n$ such that $z^\top w = 0$ for all $z \in C_J$ and $z^\top w > 0$ for all $z \in C \setminus C_J$.
From $X\hat\beta \in \partial C$ it hence follows that the active set $F = \{j: \hat\beta_j > 0\}$ has cardinality at most $n-1$. General linear position implies that the linear system $X_F\beta = X\hat\beta$ has exactly one solution $\beta = \hat\beta_F$. Concerning the second part of the lemma, the assertion $|F| \le \min\{n-1, p\}$ is trivial for $p \le n-1$. Conversely, if $p \ge n$, the fact that $\min_{\beta \succeq 0} \frac{1}{n}\|y - X\beta\|_2^2 > 0$ allows us to conclude that $X\hat\beta \in \partial C$, so that the assertion follows from the reasoning above. For the third part, we note that since $y$ has a distribution which is absolutely continuous with respect to the Lebesgue measure, $y$ is not contained in any subspace of dimension smaller than $n$ with probability one, so that $\min_{\beta \succeq 0} \frac{1}{n}\|y - X\beta\|_2^2 > 0$, and the claim follows from part (ii).
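The cardinality bound can be checked empirically. The sketch below (synthetic data, arbitrary seed) solves an overcomplete NNLS problem and verifies that the active set has at most $n-1$ elements whenever the residual is positive:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
n, p = 40, 200                                 # p much larger than n
X = np.abs(rng.standard_normal((n, p)))
y = rng.standard_normal(n)                     # generic y: positive residual a.s.

beta_hat, rnorm = nnls(X, y)
F = np.flatnonzero(beta_hat > 0)
assert rnorm > 0 and len(F) <= n - 1           # |F| <= min{n-1, p}
```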
Part 2: proof of the bound on $\|\hat\beta - \beta^*\|_\infty$. Given uniqueness of the NNLS solution and, in turn, of its active set $F = \{j: \hat\beta_j > 0\}$, the stated bound on $\|\hat\beta - \beta^*\|_\infty$ follows readily once it holds that $S \subseteq F$. In fact, the optimality conditions of the NNLS problem (cf. Lemma 3) then yield that $\hat\beta_F$ can be recovered from a linear system in which the second equality results from $S \subseteq F$. As an immediate consequence, we obtain the stated bound up to a random term. In order to control that random term, we may follow the reasoning below (F.6) to conclude that, for any $M \ge 0$, the corresponding event holds with the stated probability.

Appendix H: Proof of Theorem 5

We first recall that the analysis is conditional on the event

$$\mathcal{E} = \{r_j \le s \text{ for all } j \in S\}, \quad \text{where } r_j = k \Leftrightarrow \hat\beta_j = \hat\beta_{(k)}. \tag{H.1}$$

Our proof closely follows the corresponding proof in [23]. We show in two steps that both $S \setminus \widehat{S} = \emptyset$ and $\widehat{S} \setminus S = \emptyset$. For both steps, we shall need the following observations. Let $V_k$ denote the linear space spanned by the top $k$ variables according to the given ranking, $k = 1,\ldots,p$, and let $V_0 = \{0\}$. Let further $U_k = V_k^\perp \cap V_{k+1}$, $k = 0,\ldots,p-1$, which are subspaces of $\mathbb{R}^n$ of dimension at most one. In case the dimension of $U_k$ is one, let $u_k$ be the unit vector spanning $U_k$, and let $u_k = 0$ otherwise, $k = 0,\ldots,p-1$. Note that $\Pi(k+1) - \Pi(k)$, as appearing in the definition of the $\delta(k)$'s, equals the projection on $U_k$, $k = 0,\ldots,p-1$. In particular, we have

$$\|(\Pi(k+1) - \Pi(k))\varepsilon\|_2 = |\langle u_k, \varepsilon\rangle|, \quad k = 0,\ldots,p-1. \tag{H.2}$$

Step 1: No false negatives. In the sequel, let $\Delta$ denote the threshold of the procedure, so that

$$\widehat{s} = \max\{0 \le k \le p-1 : \delta(k) \ge \Delta\} + 1.$$
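To clarify the quantities involved, the following sketch computes the $\delta(k)$'s and $\widehat{s}$ for a given ranking, assuming the ranked columns are in general position so that a QR factorization provides orthonormal vectors $u_k$ spanning the one-dimensional increments $U_k$ (function names are ours):

```python
import numpy as np

def deltas(X, y, ranking):
    """delta(k) = ||(Pi(k+1) - Pi(k)) y||_2 for k = 0, ..., len(ranking)-1.

    Pi(k) projects onto the span of the top-k ranked columns; column k of
    the QR factor Q spans U_k = V_k^perp intersected with V_{k+1}.
    """
    Q, _ = np.linalg.qr(X[:, ranking])
    return np.abs(Q.T @ y)

def s_hat(delta, Delta):
    """max{k : delta(k) >= Delta} + 1, and 0 if no k qualifies."""
    idx = np.flatnonzero(delta >= Delta)
    return int(idx.max()) + 1 if idx.size else 0
```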
Later in the proof, it will be verified that $\Delta$ can be chosen as asserted in the theorem. We first note that the probability of missing a support variable can be bounded via

$$\mathbb{P}\big(\|(\Pi(s) - \Pi(s-1))X_S\beta^*_S\|_2 < \Delta + |\langle u_{s-1}, \varepsilon\rangle|\big),$$

where $\Pi_S$ and $\Pi_{S \setminus \{j\}}$ denote the projections on the linear spaces spanned by the columns of $X$ corresponding to $S$ and $S \setminus \{j\}$, $j = 1,\ldots,s$, respectively. In order to obtain the second inequality, we have again used that we work conditionally on the event $\mathcal{E}$. As will be established at the end of the proof, the term $\|(\Pi(s) - \Pi(s-1))X_S\beta^*_S\|_2$ admits a suitable lower bound; this will be used below after fixing $\Delta$.
We start by noting that $\Sigma$ is strictly positive definite, so that the NNLS problem is strictly convex. Thus, the NNLS solution and its active set $F = \{j: \hat\beta_j > 0\}$ are unique. Let us first consider the case $s > 0$. Using a slight modification of the scheme employed in the proofs of Theorems 3 and 4, we will show that, under the required condition on $\beta_{\min}(S)$, the event $\{\hat\beta_S = \hat\beta^{(P2)} \succ 0\}$ holds with the stated probability, which proves the first statement of the proposition. Following the proof of Theorem 3, we have, with probability at least $1 - 2p^{-M^2}$, a bound on $\|\hat\beta^{(P1)}\|_1$ in which we have used the closed-form expression for $\tau^2(S)$ in (5.5). In order to verify that $\hat\beta_S = \hat\beta^{(P2)} \succ 0$, we follow the back-substitution step (Step 2 in Appendix F), apart from the following modification. In place of (F.5), we bound

$$\big\|\Sigma_{SS}^{-1}\Sigma_{SS^c}\hat\beta^{(P1)}\big\|_\infty = \frac{\rho}{1 + (s-1)\rho}\,\|\hat\beta^{(P1)}\|_1 \le \frac{2(1+M)\sigma\sqrt{2\log(p)/n}}{1 - \rho}.$$

For the first equality, we have used that $\beta^*_{S^c} = 0$ and the fact that the matrix $\Sigma_{SS^c}$ has constant entries equal to $\rho$. For the second inequality, we have used that $\mathbf{1}$ is an eigenvector of $\Sigma_{SS}$ corresponding to its largest eigenvalue $1 + (s-1)\rho$. Turning to Step 3 in Appendix F, we note that with $\phi_{\min}(S) = 1 - \rho$,

$$\|\beta^*_S - \bar\beta^{(P2)}\|_\infty \le \frac{2(1+M)\sigma\sqrt{2\log(p)/n}}{1 - \rho} + \frac{(1+M)\sigma\sqrt{2\log(p)/n}}{\sqrt{1 - \rho}} \le \frac{3(1+M)\sigma\sqrt{2\log(p)/n}}{1 - \rho},$$

so that $\bar\beta^{(P2)} = \hat\beta^{(P2)} = \hat\beta_S \succ 0$ with probability at least $1 - 4p^{-M^2}$, as claimed.
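The eigenvector fact used for the second inequality is elementary and quick to confirm numerically (arbitrary values of $s$ and $\rho$):

```python
import numpy as np

s, rho = 10, 0.75
Sigma_SS = (1 - rho) * np.eye(s) + rho * np.ones((s, s))
one = np.ones(s)
# the all-ones vector is an eigenvector with eigenvalue 1 + (s-1)*rho
assert np.allclose(Sigma_SS @ one, (1 + (s - 1) * rho) * one)
```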
We now turn to the second statement of the proposition, concerning the (conditional) distribution of the cardinality of the active set. Conditional on the event $\{\hat\beta_S \succ 0\}$, the KKT optimality conditions of the NNLS problem as stated in Lemma 3 imply that a block system of inequalities holds. Resolving the top block for $\hat\beta_S$, we obtain