An unbiased approach to compressed sensing

In compressed sensing a sparse vector is approximately retrieved from an under-determined equation system $Ax=b$. Exact retrieval would mean solving a large combinatorial problem which is well known to be NP-hard. For $b$ of the form $Ax_0+\epsilon$, where $x_0$ is sparse and $\epsilon$ is noise, the `oracle solution' is the one you get if you a priori know the support of $x_0$, and is the best solution one could hope for. We provide a non-convex functional whose global minimum is the oracle solution, with the property that any other local minimizer necessarily has high cardinality. We provide estimates of the type $\|\hat x-x_0\|_2\leq C\|\epsilon\|_2$ with constants $C$ that are significantly lower than for competing methods or theorems, and our theory relies on soft assumptions on the matrix $A$, in comparison with standard results in the field. The framework also allows one to incorporate a priori information on the cardinality of the sought vector. In this case we show that, despite being non-convex, our cost functional has no spurious local minima and the global minimum is again the `oracle solution', thereby providing the first method which is guaranteed to find this point for reasonable levels of noise, without resorting to combinatorial methods.


Background
We consider the classical compressed sensing problem of minimizing the cardinality $\mathrm{card}(x)$ of an approximate solution to an underdetermined equation system $Ax = b$, i.e.
$$\operatorname*{arg\,min}_{x:\ \|Ax-b\|_2<\varepsilon}\ \mathrm{card}(x) \qquad (1)$$
where $\varepsilon > 0$ is some allowed tolerance of the error and $x$ lies in $\mathbb{R}^n$ or $\mathbb{C}^n$. Problem (1) is NP-hard [27] and a popular approach, commonly referred to as "compressed sensing", is to replace $\mathrm{card}(x)$ with the convex function $\|x\|_1$, i.e.
$$\operatorname*{arg\,min}_{x:\ \|Ax-b\|_2<\varepsilon}\ \|x\|_1. \qquad (2)$$
This method goes back (at least) to the 70's (see the introduction of [14] for a nice historical overview) but received increasing attention in the late 90's due to the work by Chen, Donoho and Saunders [18] on what they called basis pursuit, which amounts to solving
$$\operatorname*{arg\,min}_{x}\ \lambda\|x\|_1+\tfrac12\|Ax-b\|_2^2 \qquad (3)$$
for a suitable choice of parameter $\lambda$ (playing the role of $\varepsilon$ in the $\ell_1$-version of (1)). In fact, (3) is the dual problem of (2) in the sense that for each $\varepsilon$ there is a $\lambda$ such that the solutions of (2) and (3) coincide. The method received massive attention after the works of Candès and coworkers in the early 2000s, and the term compressed sensing was coined. In [12], Candès, Romberg and Tao proved the surprising result that, given a sparse vector $x_0$ and a measurement
$$b = Ax_0+\epsilon \qquad (4)$$
where $\epsilon$ is Gaussian noise, solving (2) yields (for a suitable choice of $\varepsilon$) a vector $x$ that satisfies
$$\|x-x_0\|_2\le C_K\|\epsilon\|_2 \qquad (5)$$
where $C_K$ is a constant. Arguing that it is impossible to beat a linear dependence on the noise (even knowing the true support of $x_0$ a priori), the estimate (5) led the authors to conclude that "no other method can significantly outperform this". The result holds given certain assumptions on the matrix $A$, related to the restricted isometry properties of $A$, which in a separate publication (Theorem 1.6, [13]) were shown to hold with "overwhelming probability". The result is indeed surprising; in Figure 1 we display the one-dimensional counterparts of $\mathrm{card}(x)$ and $\|x\|_1$, demonstrating that these are indeed quite different functionals.
However, the notion of "overwhelming probability" is asymptotic in nature, and therefore it is not clear whether (5) is valid in a moderately sized application (it often is not, see Section 2.3). This continues to be the state of the art, see e.g. [1,2,10], which provide asymptotic theorems about when compressed sensing works in concrete setups. Moreover, whereas very strong recovery results were reported e.g. in [11,17,20] for the case of exact data $b = Ax_0$, in the presence of noise (cf. (4)) the method gives a well known bias (see e.g. [21,26]). The $\ell_1$ term not only has the (desired) effect of forcing many entries in $x$ to 0, but also the (undesired) effect of diminishing the size of the non-zero entries. This is clearly visible even in the one-dimensional situation; the function $\mathbb{R}\ni x\mapsto \lambda|x| + \frac12|x - x_0|^2$ has its minimum shifted towards 0 from the sought point $x_0$. This has led to a large number of non-convex suggestions to replace the $\ell_1$-penalty, see e.g. [8,7,6,4,17,30,38,22,35,31,24,23,37,36,25,14,9,21,26].
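The one-dimensional bias can be checked numerically. The following is a minimal sketch, not taken from the paper: the values of `x0` and `lam` are arbitrary illustrations, and the closed form used is the standard soft-thresholding formula for the minimizer of $\lambda|x| + \frac12|x-x_0|^2$.

```python
import numpy as np

def soft_threshold(x0, lam):
    """Closed-form minimizer of lam*|x| + 0.5*(x - x0)**2 over real x."""
    return np.sign(x0) * max(abs(x0) - lam, 0.0)

# Illustrative values (assumptions, not from the paper)
x0, lam = 3.0, 0.5
x_star = soft_threshold(x0, lam)

# Brute-force check of the closed form on a fine grid
grid = np.linspace(-5.0, 5.0, 200001)
objective = lam * np.abs(grid) + 0.5 * (grid - x0) ** 2
x_grid = grid[np.argmin(objective)]

# The minimizer sits at x0 - lam, i.e. it is shifted towards 0 by lam
print(x_star, x_grid)
```

The shift equals $\lambda$ whenever $|x_0| > \lambda$, regardless of how large $x_0$ is, which is the bias referred to in the text.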
These methods are tailor-made for the sparsity problem, and upon changing the non-convex penalty $\mathrm{card}(x)$ it is not clear what to do. We consider now the general problem of minimizing
$$f(x)+\|Ax-b\|_2^2 \qquad (6)$$
where $f$ is some non-convex penalty and $x$ is a vector in some linear space, not necessarily $\mathbb{R}^n$. For example, if the desired cardinality $K$ is known a priori, we can take $f$ to be the indicator function $\iota_{P_K}$ of the set $P_K = \{x : \mathrm{card}(x) \le K\}$, in which case (6) reduces to
$$\operatorname*{arg\,min}_{x\in P_K}\ \|Ax-b\|_2^2. \qquad (7)$$
In [16] the "quadratic envelope" $Q_2(f)$ was introduced, together with the regularized functional
$$Q_2(f)(x)+\|Ax-b\|_2^2, \qquad (8)$$
where $Q_2$ is the "quadratic biconjugate" (apart from the name, this transform was introduced already in [15], see Figure 2 for an illustration). It has the property that $Q_2(f)(x) + \|x\|^2$ is the lower semi-continuous convex envelope of $f(x) + \|x\|^2$, and the relationship between (8) and the original functional in (6) was investigated. The main result is that, given $\|A\| < 1$, the set of local minimizers of (8) is a subset of the local minimizers of (6), and most importantly that the global minimizers coincide. In the particular case of $f(x) = \mathrm{card}(x)$ the functional (8) has previously been introduced by Zhang [36] under the name MCP (Minimax Concave Penalty) and independently by Aubert, Blanc-Feraud and Soubies [33] under the name CE$\ell_0$. It also shows up in earlier publications, for example (2.4) in [21], but it seems that [36] is the first comprehensive performance study and [33] the first publication where the connection with convex envelopes appears. For this choice of $f$, the contribution of the present paper is mainly theoretical, and goes well beyond what was previously known. In particular we show that the global minimizer with the MCP-penalty (i.e. $Q_2(\mathrm{card})$) is the oracle solution (for a certain choice of parameters). On the other hand, $Q_2(\iota_{P_K})$ is a new object that has only appeared previously in earlier publications by the authors of the present article.
In this article we provide theoretical results of the type (5) for the two concrete functionals $Q_2(\mathrm{card})$ and $Q_2(\iota_{P_K})$. A more extensive discussion of previous results concerning MCP/CE$\ell_0$ is found in Section 2.5, as well as other related results on non-convex optimization.

Contributions
A clear drawback of non-convex optimization schemes is that algorithms are bound to get stuck in local minima, and in concrete situations it is hard to determine whether this is the case or not. In the present article we give conditions on $A$ which imply that any local minimum of (8) for $f(x) = \mathrm{card}(x)$ necessarily has high cardinality unless it is the global minimum. Hence, if a sparse local minimum is found one can be sure that it is the global minimum. In the case of $f = \iota_{P_K}$ we take this one step further and give conditions under which (8) has a unique local minimizer, which hence must be the global minimizer as well as the solution to the original problem (7).
In the case when $b = Ax_0 + \epsilon$ and $x_0$ is a sparse vector, we significantly improve the state of the art in compressed sensing in a number of ways. Firstly, the conditions on $A$ hold in greater generality. Secondly, we obtain an estimate corresponding to (5) where the involved constants are significantly smaller than $C_K$. Thirdly (and most importantly), the method seems to work better in practice, at least in the setting when $A$ has normalized Gaussian random columns. In particular, for reasonable values of noise (e.g. SNR $\approx 4$ for the case of a $100\times 200$ matrix $A$, see Section 2.4) we can find the oracle solution using the Forward Backward Splitting (FBS) algorithm.
In Section 2 we present highlights from the theory, show some numerical results and compare with the traditional $\ell_1$-method (3). In Section 2.5 we give a brief review of the field. The remainder of the paper, Sections 3-5, is devoted to developing the theory.

Sparse recovery via $Q_2(\mathrm{card})$
We return to the first problem of minimizing (6) for $f = \mathrm{card}(x)$, i.e.
$$\mathcal{K}(x) = \mu\,\mathrm{card}(x)+\|Ax-b\|_2^2 \qquad (9)$$
(where we introduce the parameter $\mu$ to control the tradeoff between sparsity and data-fit), which we regularize with
$$\mathcal{K}_{reg}(x) = Q_2(\mu\,\mathrm{card})(x)+\|Ax-b\|_2^2. \qquad (10)$$
The graph of $Q_2(\mathrm{card})$ is depicted in Figure 1. Is it true that unique global minimizers exist (recall that they are the same given $\|A\| < 1$)? It is easy to see that this is not the case in general; just consider the case of a $2\times 4$ matrix $A$ such that every pair of columns is linearly independent, and let $\mu$ be such that the global minimum is attained when $Ax-b = 0$. In this case we have $\binom{4}{2} = 6$ choices that all give the global minimum. However, in the above example there exists no "sparse" solution, for 2 equals the row-dimension of the matrix, and by sparse we mean a number much smaller than this. For an $m \times n$ matrix $A$, with columns sampled from the unit sphere of $\mathbb{R}^m$, this is formalized in Lemma 2.1 of [20], where it is shown that (1) with $\varepsilon = 0$ has a unique sparse solution (with probability 1) if $b = Ax_0$ and $\mathrm{card}(x_0) < m/2$, which is an upper bound on how much sparsity one needs in order to have a well posed sparse problem.
In this paper we will study uniqueness of sparse minimizers of (10), in the sense that we give concrete conditions such that if there exists one local minimizer $x'$ of (10) with the property that $\mathrm{card}(x') \ll m$ (in a manner to be made precise), then
• $x'$ is automatically a global minimizer,
• any other stationary point $x''$ of (10) satisfies $\mathrm{card}(x'') \gg \mathrm{card}(x')$.
We remind the reader that $A$ satisfies a Restricted Isometry Property (RIP) for an integer $k$ if any $k$ columns of $A$ behave approximately as an isometry, in the sense that the resulting matrix can be bounded below by $\sqrt{1-\delta_k}\,\mathrm{Id}$ and bounded above by $\sqrt{1+\delta_k}\,\mathrm{Id}$, where $\delta_k$ is the restricted isometry constant for $k$. Classical results from the compressed sensing literature usually require that the numbers $\delta_k$ are small, something which we have found is hard to fulfill in practice. For example, the famous estimate (5) holds under the assumption that
$$\delta_{3K}+3\delta_{4K}<2. \qquad (11)$$
Our numerical evaluation (see Section 2.3) shows that if $K = 5$ this condition is usually not satisfied for a Gaussian random matrix $A$ (with normalized columns) of size $m \times n$, unless $m$ is (at least) around 500. The statement that RIP holds with overwhelming probability [13] is therefore somewhat misleading, since it is based on an asymptotic estimate. For a small size matrix of the type discussed above it typically never applies (more on this in Section 2.3). That the RIP conditions are hard to satisfy in practice is well known by the community, and has led to interesting new contributions about the efficiency of $\ell_1$ without RIP, given that the problem is sampled in a certain way, see e.g. [1,2]. However, these results are asymptotic in nature and do not apply in as general situations as the ones we will present here. We base the theory of this paper on the "Restricted Linear Independence Property" (RLIP), basically constituting the lower estimate of the RIP. More precisely, we define
$$\beta_k = \inf\left\{\frac{\|Ax\|_2}{\|x\|_2} : x\neq 0,\ \mathrm{card}(x)\le k\right\} \qquad (12)$$
for $k = 1, \ldots, n$. We say that $A$ satisfies RLIP with respect to $K$ if $\beta_K > 0$. In other words, $A$ is RLIP with respect to $K$ if and only if any $K$ chosen columns of $A$ are linearly independent.
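For small matrices the constants can be computed by brute force. The sketch below assumes that $\beta_k$ is read as the smallest singular value over all $m\times k$ column submatrices of $A$ (the lower RIP estimate); the function name `rlip_constant`, the seed and the range of `k` are illustrative choices, and the $17\times 25$ size merely mirrors the experiment of Section 2.3.

```python
import itertools
import numpy as np

def rlip_constant(A, k):
    """Brute-force lower isometry constant: the smallest singular value
    over all m x k column submatrices of A (assumed reading of beta_k)."""
    m, n = A.shape
    best = np.inf
    for cols in itertools.combinations(range(n), k):
        s = np.linalg.svd(A[:, list(cols)], compute_uv=False)
        # If k > m the k columns are linearly dependent, so the constant is 0
        smin = s[-1] if k <= m else 0.0
        best = min(best, smin)
    return best

rng = np.random.default_rng(0)
A = rng.standard_normal((17, 25))
A /= np.linalg.norm(A, axis=0)          # normalize the columns

betas = [rlip_constant(A, k) for k in (1, 2, 3)]
print(betas)
```

The number of submatrices grows as $\binom{n}{k}$, which is why such computations are only feasible for a few dozen columns.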
The relationship with RIP is as follows: if $A$ satisfies RIP with constant $\delta_k$ then it satisfies RLIP with $\beta_k \ge \sqrt{1-\delta_k}$, whereas the converse often does not hold. To give an idea of the type of results proven in this paper, we first present two corollaries of theorems in Section 4.
then $x'$ is a unique global minimum of both $\mathcal{K}$ and $\mathcal{K}_{reg}$. Moreover, any other stationary point $x''$ has a larger support.
The statement is a combination of Theorems 4.2 and 4.5 (for the choice $N = 2K$). Upon assuming a bit more in (13) and (14), we may also conclude that $x''$ has substantially larger support. As a curious remark, note that $\beta_{2K} > 0$ forces $m \ge 2K$, which is precisely the upper bound given by Lemma 2.1 in [20] mentioned above.
In the case when $b = Ax_0 + \epsilon$, as discussed earlier, we can say more. Below we state Theorem 4.8 for the particular case $N = 2K$.
Then there exists a unique global minimum $x'$ to $\mathcal{K}_{reg}$ as well as $\mathcal{K}$, with the property that $\mathrm{supp}\, x' = \mathrm{supp}\, x_0$, that $\|x'-x_0\|_2\le \|\epsilon\|_2/\beta_K$, and that $\mathrm{card}(x'') > K$ for any other stationary point $x''$ of $\mathcal{K}_{reg}$.
Note that the conditions on $\epsilon$ and $x_0$ are very natural; if the noise is too large or if the non-zero entries of $x_0$ are too small, there is no hope of correctly retrieving the support.

Known "model order".
We now discuss the situation when the model order, i.e. the number $K$ of nonzero entries, is known. This problem is also known as the $K$-sparse problem and is studied e.g. in [6]. For simplicity we restrict attention to $\mathbb{R}^n$; corresponding results for $\mathbb{C}^n$ are similar but the assumptions on $A$ are slightly more technical. In this case we set
$$\mathcal{K}_K(x) = \iota_{P_K}(x)+\|Ax-b\|_2^2 \qquad (15)$$
(where the subindex $K$ separates the notation from (9)), which we regularize with
$$\mathcal{K}_{K,reg}(x) = Q_2(\iota_{P_K})(x)+\|Ax-b\|_2^2. \qquad (16)$$
The result corresponding to Corollary 2.1 reads as follows.
Corollary 2.3. Let $A$ have normalized columns such that no pair is orthogonal, and assume that $n \ge m + K + 2$. Any local minimizer $x'$ of $\mathcal{K}_{K,reg}$ then lies in $P_K$. Moreover, set $z' = (I - A^*A)x' + A^*b$, let $\tilde z'$ contain the elements of $z'$ sorted by decreasing magnitude, and assume that
$$|\tilde z'_{K+1}| < (2\beta_{2K}^2-1)\,|\tilde z'_K|. \qquad (17)$$
Then $x'$ is the unique global minimum of $\mathcal{K}_K$ and $\mathcal{K}_{K,reg}$.
The result is a combination of Proposition 5.1 and Theorems 5.2 and 5.5. We remark that in the typical compressed sensing application, $A$ is a matrix with $m \ll n$ and $K \ll m$. If the columns of $A$ are normalized random, then the conditions on $A$ are satisfied with probability 1. Moreover, any subset of $2K$ columns will be close to an isometry as long as $2K \ll m$, so it is not unreasonable to expect that $\beta_{2K} \approx 1$. In this case the assumption (17) is quite reasonable since $|\tilde z_{K+1}| \le |\tilde z_K|$ by construction and $2\beta_{2K}^2 - 1 \approx 1$. The size of $\beta_{2K}$ in the above scenario is further discussed in Section 2.3.
We now consider the case when $b = Ax_0 + \epsilon$ and we wish to retrieve $x_0$, where $\mathrm{card}(x_0) = K$. By combining Proposition 5.6 and Theorem 5.7, we have (for $A$ as in the previous corollary); there exists a unique local minimizer $x'$ to $\mathcal{K}_{K,reg}$ with $\mathrm{supp}(x') = \mathrm{supp}(x_0)$. This is the global minimum of both $\mathcal{K}_K$ and $\mathcal{K}_{K,reg}$ and moreover it satisfies $\|Ax' - b\| \le \|\epsilon\|$, $\mathrm{supp}(x') = \mathrm{supp}(x_0)$ and $\|x' - x_0\| \le \|\epsilon\|/\beta_K$.
If $\beta_{2K} \approx 1$ the condition is $|x_{0,j}| \gtrsim 2\|\epsilon\|$ and the conclusion $\|x' - x_0\| \lesssim \|\epsilon\|$. We further remark that $x'$ in Corollaries 2.2 and 2.4 is the so called "oracle solution", i.e. the one you would get if an oracle told you the true support $S$ of $x_0$ and you were to solve the (overdetermined) equation system $A_S x = b$, where $A_S$ denotes the $m \times K$ matrix whose columns are those with indices in $S$ (and then expand $x$ to $\mathbb{R}^n$ by inserting zeroes off $S$). This is clearly the best possible solution one could hope for (as argued also in [12]). If we have a method that would find a vector $x$ with the correct support $S$ (with a bias or not), we can always get this unbiased solution by simply discarding $x$ and following the above procedure to get the oracle solution. Therefore the issue of finding the support is maybe more central than having a good estimate of $\|x - x_0\|$. Indeed, finding the correct support is often used as a measure of success in numerical sections on the topic [7,14,25]. Apart from [7], which studies the minimization of (9) itself (and performs poorly in practice, see Figure 5), we have not been able to locate any results in the literature which claim to find $S$, as Corollaries 2.2 and 2.4 do. In [28] and [29], the minimizers of the un-regularized functionals $\mathcal{K}$ and $\mathcal{K}_K$ are studied. Corollaries 2.1-2.4 provide new results and extensions of this line of research.
Figure 3: Plot of $1/\beta_K$ for a $17\times 25$ matrix $A$ with normalized random columns.
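The oracle solution described above is just a least-squares problem restricted to the known support. The following sketch illustrates it; the function name `oracle_solution`, the random seed and the noise scale are illustrative assumptions, while the sizes $K=10$, $m=100$, $n=200$ and the entry magnitudes between 2 and 4 follow the experiment of Section 2.4.

```python
import numpy as np

def oracle_solution(A, b, support):
    """Least-squares solution of A_S x_S = b, padded with zeros off the support."""
    support = np.asarray(support)
    x = np.zeros(A.shape[1])
    coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
    x[support] = coef
    return x

rng = np.random.default_rng(1)
m, n, K = 100, 200, 10
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                      # normalized columns

support = rng.choice(n, size=K, replace=False)      # true support S
x0 = np.zeros(n)
x0[support] = rng.uniform(2.0, 4.0, size=K) * rng.choice([-1.0, 1.0], size=K)

noise = 0.1 * rng.standard_normal(m)                # illustrative noise level
b = A @ x0 + noise

x_oracle = oracle_solution(A, b, support)
sigma_min = np.linalg.svd(A[:, support], compute_uv=False)[-1]
print(np.linalg.norm(x_oracle - x0), np.linalg.norm(noise) / sigma_min)
```

Since the error on the support equals the pseudo-inverse of $A_S$ applied to the noise, it is bounded by the noise norm divided by the smallest singular value of $A_S$, which is the mechanism behind estimates with constant $1/\beta_K$.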

On the size of RIP/RLIP-constants
We are interested in estimating the size of the constants $\beta_K$ and $\delta_K$, as well as comparing the upper bound $C_K$ in (5) with $1/\beta_K$ in the corresponding estimates of Corollaries 2.2 and 2.4. We focus on matrices $A$ of size $m \times n$ where $n > m$ and the columns are generated from a random Gaussian distribution and then normalized so that $\|A\|_{\infty,col} = 1$, in accordance with [12,20] as well as the assumptions in this paper. Figure 3 shows numerical computations of $1/\beta_K$ for random matrices of size $17 \times 25$. Here we do not plot $C_K$ since the requirement $\delta_{3K} + 3\delta_{4K} < 2$ turned out to almost never be fulfilled when $K > 1$. In contrast, Corollary 2.2 applies whenever $\beta_{2K} > 0$ and Corollary 2.4 when $\beta_{2K} > 1/2$. In order to compare $1/\beta_K$ with $C_K$ for a moderately sized application, we would like to compute these for a $256 \times 512$ matrix, say. However, due to the combinatorial nature of the constants $\delta_k$ and $\beta_k$, it is not possible to compute them for matrices with more than $\approx 30$ columns (on a standard laptop at least). Nevertheless a $256 \times 25$ matrix can be seen as the first portion of a $256 \times 512$ matrix, and from this perspective the values obtained in the $256 \times 25$ case serve as lower bounds of the true values. It turns out that (11) typically does not hold for $K > 2$, while $\beta_{2K} > 1/2$ holds for all $K$ up to 25.
Finally, considering $512 \times 25$ matrices, we do have that (11) is satisfied in general, and Figure 4 plots a graph of $1/\beta_K$ versus $C_K$. From the right graph we also see that the values of $\beta_K$ are very decent, around 0.8, for $K$ near 20.

Numerical Recovery Results
In [14] astonishing results are shown in the noise free case, for example in Figure 2 (of that paper) we see how $K = 130$ non-zero entries are recovered using a matrix $A$ of size $m \times n = 256 \times 512$ (which incidentally is close to the theoretical bound $2K < m$ in the present paper). However, in the presence of noise, performance seems to drop drastically. In Figure 7 (of the same paper) we see an example where $K = 8$, $m = 72$ and $n = 256$. This is in line with the predictions of [25], which use $K = \sqrt{m}$ in their numerical section 4.3.
Figure caption (fragment): ... (15)-(16). The methods based on $Q_2(\mathrm{card})$ and $Q_2(\iota_{P_K})$ work perfectly down to SNR $\approx 4$.
Here we will present numerical results for the case of $K = 10$, $m = 100$ and $n = 200$. We use a matrix $A$ with Gaussian randomly generated columns, which are subsequently normalized, and solve problems (3), (10) and (16) for $b = Ax_0 + \epsilon$ for different levels of noise $\|\epsilon\|$ between 0 and 5. The vector $x_0$ has random entries between 2 and 4 in magnitude, and a total magnitude $\|x_0\| = 11$. To solve the optimization problems we use FBS, which is known to converge to a stationary point (by [4] in combination with Section 2.4 of [15] or Section 6 of [16]). In Section 5 of [4] the convergence of FBS for the unregularized problems (9) and (15) is considered, but with no analysis of performance. This has also been proposed earlier in [6] where it is compared against matching pursuit. For this reason, we also included graphs for the result of minimizing (9) and (15).
Each point on the respective curves is an average over 50 trials, where we have used 1000 iterations and a step-size parameter of $0.9/\|A\|^2$, which is close to the upper theoretical bound given in [4] (which coincides with the bound for the convex case, see e.g. [19]). For the $\ell_1$-problem (3) we used the formula λ = √ n 2 log(n) corresponding to the recommendations in Section 5.2 of [18]. For (9)-(10) we used $\mu = 1$, and $K$ was set to 10 for (15)-(16).
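To make the setup concrete, here is a minimal sketch of FBS for the unregularized $K$-sparse problem (15), where the backward step is the projection onto $P_K$ (keep the $K$ largest entries in magnitude). It is not the regularized method (16), whose proximal operator is not given in this section. The data-fit term is scaled as $\frac12\|Ax-b\|_2^2$ so that the step size $0.9/\|A\|^2$ lies below the inverse Lipschitz constant of the gradient; this scaling, the function names, the seed and the noise level $\|\epsilon\| = 1$ are assumptions made for illustration.

```python
import numpy as np

def project_K_sparse(x, K):
    """Projection onto P_K: keep the K largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-K:]
    out[keep] = x[keep]
    return out

def fbs_K_sparse(A, b, K, n_iter=1000):
    """Forward-backward splitting for 0.5*||Ax-b||^2 subject to card(x) <= K,
    started at x = 0 with step size 0.9/||A||^2."""
    step = 0.9 / np.linalg.norm(A, 2) ** 2      # spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                # forward (gradient) step
        x = project_K_sparse(x - step * grad, K)  # backward (projection) step
    return x

rng = np.random.default_rng(0)
m, n, K = 100, 200, 10
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                  # normalized Gaussian columns

x0 = np.zeros(n)
idx = rng.choice(n, size=K, replace=False)
x0[idx] = rng.uniform(2.0, 4.0, size=K) * rng.choice([-1.0, 1.0], size=K)

eps = rng.standard_normal(m)
eps /= np.linalg.norm(eps)                      # noise level ||eps|| = 1
b = A @ x0 + eps

x_hat = fbs_K_sparse(A, b, K)
print(np.count_nonzero(x_hat), np.linalg.norm(A @ x_hat - b))
```

With this step size each iteration minimizes a quadratic majorizer of the data-fit term over $P_K$, so the residual never increases; whether the limit has the correct support is exactly the question the theory addresses.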
If the values of $\beta_{2K}$ and $\beta_K$ are $\approx 1$, then the conditions in Corollary 2.2 hold given that $2\sqrt{\mu} < \min\{|x_{0,j}| : x_{0,j} \neq 0\}$, which in our case is 2.05, and $\|\epsilon\| \le \sqrt{\mu}$, whereas the conditions in Corollary 2.4 hold as long as $2\|\epsilon\| < 2.05$. In both cases, the estimate for $\|x' - x_0\|$ is $\|x' - x_0\| \le \|\epsilon\|$, which is supposed to hold for $\|\epsilon\| \lesssim 1$. As can be seen from the graph in Figure 5 (left), the true bound (for this particular example) seems to be $\|x' - x_0\| \lesssim \frac13\|\epsilon\|$ for both (10) and (16), whereas the true constant for $\ell_1$ is around 1 (despite $C_{10} = \infty$ as argued earlier). The unregularized cardinality problem (9), a.k.a. iterative hard thresholding, seems to perform poorly (with our parameters) whereas (15) (a.k.a. the $K$-sparse algorithm [6]) seems to do a decent job, similar to $\ell_1$ in performance for higher noise levels. In both cases the corresponding performance for the regularized versions is significantly better, indicating that the regularization by $S_2^2$ indeed has a crucial effect. Note that all 3 methods work for noise levels much greater than stipulated by the theory. We also remark that, rather surprisingly, there is no major difference between (10) and (16) for moderate noise levels. However, both these methods are designed to find the oracle solution $x_S$, not $x_0$, so to evaluate this performance we include in Figure 5 (right) also the graph of $\|x - x_S\|$ versus $\|\epsilon\|$. From this we deduce that both work perfectly until $\|\epsilon\| = 3$, but that (10) deteriorates substantially faster beyond this point. In other words, in this example both methods based on $Q_2(\mathrm{card})$ and $Q_2(\iota_{P_{10}})$ work as expected down to SNR around 4. Note that, in the best case scenario $\beta_K = \beta_{2K} = 1$, a simple computation shows that Corollaries 2.2 and 2.4 apply for SNR down to $2\sqrt{10} \approx 6$, and hence there is almost perfect harmony between theory and numerical results. More precisely, we can allow 50% more noise in practice than predicted by the theory.
In Figure 6 we show the same graphs except that now $A$ has size $50 \times 200$. Clearly this has a significant impact on performance. In particular, although (10) and (16) still do better than traditional $\ell_1$-minimization, there is no longer a significant difference. This could indicate that the convex $\ell_1$-method is more reliable in very difficult scenarios, as opposed to the non-convex methods suggested here, but this would have to be further investigated to be confirmed. A drawback of $\ell_1$-methods is that one often needs to find a suitable $\lambda$, which leads to slow evaluation in practice. Another issue that we have not discussed is the starting point. We have used 0 for all examples above, and (a bit surprisingly) this seems to work better than using the least squares solution $x_{LS}$ of $Ax = b$. In our final graph, Figure 7, we plot a histogram of the cardinality of $x$ over 50 trials with the noise level $\|\epsilon\| = 2.5$, using $Q_2(\mathrm{card})$ and $x_{LS}$ as starting point. For this noise level and starting point, $Q_2(\iota_{P_{10}})$ still works perfectly, which is why its performance is excluded (the histogram hits 50 at $K = 10$). It is interesting to note the following dichotomy: either the cardinality is around 10, or substantially larger, in harmony with the results presented in this paper (e.g. Corollary 2.1 and more generally Theorem 4.2).

Brief review of related results
As previously mentioned, $Q_2(\mathrm{card})$ was introduced previously in [36] and [33], and appeared earlier also e.g. in [21], although in that paper the authors move on to introduce yet another penalty called SCAD, which we discuss further below. Needless to say, we are not the first group to address the shortcomings of traditional $\ell_1$-minimization by use of non-convex penalties. In fact, even before the birth of compressed sensing, the shortcomings of $\ell_1$-techniques were debated and non-convex alternatives were suggested; we refer to [21] for an overview of early publications on this issue. Moreover, shortly after publishing the celebrated result (5), Candès, Wakin and Boyd suggested an improvement called "Reweighted $\ell_1$-minimization" [14] which also became a big success. They provide a theoretical understanding of this algorithm as minimizing the non-convex functional $\sum_j \log(|x_j| + \epsilon)$, where $\epsilon$ is a parameter chosen by the user. Figure 1 shows the functions $\mathrm{card}(x)$, $|x|$ and $\log(0.1 + |x|) - \log(0.1)$ as well as $Q_2(\mathrm{card})$. As is clear to see, $\log(0.1 + |x|) - \log(0.1)$ is closer to $\mathrm{card}(x)$ than $|x|$, which may explain the better performance by reweighted $\ell_1$-minimization reported in [14]. The functional $Q_2(\mathrm{card})$ is even closer to $\mathrm{card}(x)$, and while this certainly is one reason behind the superior theoretical results reported in this paper, it is not clear that it is beneficial in practice since it may lead to an increased probability of getting stuck in local minima. Indeed, suppose one has a non-sparse solution to $Ax = b$ where all non-zero elements of $x$ are in the flat part of $Q_2(\mathrm{card})$; then we clearly have an (undesired) local minimum of $\mathcal{K}_{reg}(x) = Q_2(\mathrm{card})(x) + \|Ax - b\|_2^2$, whose presence for high levels of noise is clearly visible in Figure 7. This is further studied in [33] where it is shown that, for a particular choice of $7 \times 15$ matrix $A$ and other parameters, the original functional (9) has roughly 16000 (!) local minima, whereas around 5000 of these remain as stationary points for the regularization (10). Several of these stationary points turn out to not be local minima, and hence the authors provide a macro algorithm to avoid non-local minima. In the same vein, Zhang [36] proposes to iteratively update relevant parameters to reach the desired global minimum with higher probability.
While these results speak a bit in favor of using another method such as reweighted $\ell_1$ or $Q_2(\iota_{P_K})$, which according to Corollary 2.4 does not suffer from the same drawbacks (under mild assumptions), favorable results for $Q_2(\mathrm{card})$ were reported in [25], which compares the use of $Q_2(\mathrm{card})$/MCP with $\ell_1$ and reweighted $\ell_1$ (called LSP in [25]) as well as SCAD (which has similar performance to $Q_2(\mathrm{card})$). The numerical results in this paper also seem to reconfirm this, despite not employing any algorithm ensuring that we do not converge to an undesired stationary point. At first glance, this seems to contradict the findings of [33] reported above. However, in their experiments they do not use $b = Ax_0$ for some sparse $x_0$ and they also do not use a matrix $A$ with good RLIP-properties.
The best theoretical justification to support the use of $Q_2(\mathrm{card})$ seems to be Corollary 1 in [25] which, under a number of assumptions, proves that $\mathcal{K}_{reg}$ does have a unique stationary point with high probability, and provides an estimate of the type (5). The setting of [25] is rather different and we have not been able to verify reasonable values of the involved constants $c_l$, $c_u$, $c_\infty$, $R$, $\mu$, $\lambda$, $\eta$, $c_1$, $c_2$, $c_3$ and $\gamma$ in order to compare the strength of our Corollary 2.2 with Corollary 1 in [25]. We simply note that they point in the same direction and that Corollary 2.2 holds under simpler conditions. The same remark goes for Theorem 3 in [30] and Theorem 1 in [21], which provide conditions under which a class of non-convex optimization problems gives the same estimate as the "oracle estimator", with high probability.
The papers [4,6] consider (6) for the cases $f(x) = \mathrm{card}(x)$ as well as $f(x) = \iota_{P_K}(x)$, and [4] shows in particular that the FBS-algorithm applied to (6) converges to a stationary point, but a further analysis of this point is not present. Incidentally, this article in combination with Section 2.4 of [15] (Section 6 of [16]) shows that FBS also converges for (8) under very soft assumptions.
For the case $f(x) = \mathrm{card}(x)$, [7] goes a bit further and actually provides an estimate of the type (5) with the good value $C_K = 5$ (independent of $K$), however under the assumption $\delta_{3K} < 1/8$ which, in light of Section 2.3, is not easy to satisfy.
Many other non-convex penalties have been proposed over the years [31,8,17,30,38,22,35,24,23,37,36,25,14,9,21,26], and we make no attempt to review them here. The introduction of [25] contains a recent overview. A common denominator seems to be that the penalty function has the form $p(x) = \sum_j p_j(x_j)$ where $p_j$ are functions on $\mathbb{R}$ (except the recent contribution [31]). In this sense, $Q_2(\iota_{P_K})$ stands out as an interesting deviation.

Uniqueness of minimizers and stationary points with the desired property
We now turn to the heart of the matter, namely uniqueness of sparse minimizers of $\mathcal{K}_{reg}$, more precisely minimizers in a given $K$-sparse set $P_K$. As noted in the introduction, this is not possible without imposing additional conditions on $A$ as well as $b$. In this section we provide such a condition and in the coming ones we show what it entails in practice for the sparsity (and $K$-sparsity) problem. We first introduce the concept of a stationary point for a non-convex functional $g$, which in practice is easier to find than local minimizers. We recall that the Fréchet subdifferential $\hat\partial g(x)$ is the set of vectors $v$ with the property that
$$\liminf_{y\to x,\ y\neq x}\ \frac{g(y)-g(x)-\mathrm{Re}\,\langle v,y-x\rangle}{\|y-x\|}\ \ge 0.$$
We say that a point $x$ is a stationary point of $g$ if $0 \in \hat\partial g(x)$. For the case when $g$ is a sum of a convex function $g_c$ and a differentiable function $g_d$, it is easy to see that $x$ is a stationary point if and only if $-\nabla g_d(x) \in \partial g_c(x)$, where $\partial g_c(x)$ denotes the usual subdifferential. Set
$$G(x) = \tfrac12\left(Q_2(f)(x)+\|x\|_2^2\right), \qquad (18)$$
i.e. $2G$ is the l.s.c. convex envelope of $f(x) + \|x\|_2^2$. We have
$$\mathcal{K}_{reg}(x) = 2G(x)-\|x\|_2^2+\|Ax-b\|_2^2, \qquad (19)$$
which upon differentiation yields that $x$ is a stationary point of $\mathcal{K}_{reg}$ if and only if
$$(I-A^*A)x+A^*b\in\partial G(x). \qquad (20)$$
Given any $x$, we therefore associate with it a new point $z$ via
$$z = (I-A^*A)x+A^*b. \qquad (21)$$
The importance of $z$ is due to the following simple observation.
Proposition 3.1. Let $x'$ and $x''$ be distinct stationary points of $\mathcal{K}_{reg}$, with associated points $z'$ and $z''$ given by (21), such that $x'' - x' \in P_K$. Then
$$\mathrm{Re}\,\langle z''-z',x''-x'\rangle\le(1-\beta_K^2)\|x''-x'\|_2^2. \qquad (22)$$
The above proposition will mainly be used backwards, i.e. we will show that (22) does not hold and thereby conclude that $x'' - x' \notin P_K$.
Proof. We have $z''-z' = (I-A^*A)(x''-x')$, so taking the scalar product with $x'' - x'$ gives
$$\mathrm{Re}\,\langle z''-z',x''-x'\rangle = \|x''-x'\|_2^2-\|A(x''-x')\|_2^2\le(1-\beta_K^2)\|x''-x'\|_2^2,$$
as desired. Note that it is not necessary to take the real part, but we leave it since scalar products in general can be complex numbers.
As we shall see, the point z has a decisive influence on the coming sections. To begin with, it has the following interesting property.
Note the absence of $A$ in the above formula, which in particular implies that
Proof. As noted in (20), $x'$ is a stationary point of $\mathcal{K}_{reg}$ if and only if $z' \in \partial G(x')$. By the same token, $x'$ is a stationary point of $Q_2(f)(x)+\|x-z'\|_2^2$ if and only if $z' \in \partial G(x')$, and since the functional is convex (and clearly has a well defined minimum) the stationary points coincide with the set of minimizers.

The sparsity problem
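We return to the sparsity problem, and consider $f(x) = \mu\,\mathrm{card}(x)$ where $\mu$ is a parameter and $\mathrm{card}(x)$ is the number of non-zero entries in the vector $x$. In this case we have
$$Q_2(\mu\,\mathrm{card})(x) = \sum_{j=1}^n\left(\mu-\left(\max\{\sqrt{\mu}-|x_j|,0\}\right)^2\right). \qquad (23)$$
To recapitulate, we want to minimize (9), i.e. $\mathcal{K}(x) = \mu\,\mathrm{card}(x)+\|Ax-b\|_2^2$, which we replace by (10), i.e. $\mathcal{K}_{reg}(x) = Q_2(\mu\,\mathrm{card})(x)+\|Ax-b\|_2^2$.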
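The defining property of the quadratic envelope can be checked numerically in one dimension. The sketch below assumes the standard MCP-type closed form $Q_2(\mu\,\mathrm{card})(x) = \sum_j \mu - (\max\{\sqrt{\mu} - |x_j|, 0\})^2$; the function name `q2_mu_card` and the grid are illustrative choices.

```python
import numpy as np

def q2_mu_card(x, mu):
    """Assumed closed form of Q_2(mu*card): sum_j mu - max(sqrt(mu)-|x_j|, 0)^2."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.sum(mu - np.maximum(np.sqrt(mu) - np.abs(x), 0.0) ** 2)

mu = 1.0
t = np.arange(-300, 301) / 100.0                 # grid containing exactly 0
penalty = np.array([q2_mu_card(v, mu) for v in t])
card = mu * (t != 0)                             # mu*card in one dimension

# penalty + t^2 should be convex: all second differences are nonnegative
env = penalty + t ** 2
second_diff = env[:-2] - 2.0 * env[1:-1] + env[2:]

print(q2_mu_card([0.0, 0.5, 2.0], mu))           # 0 + 0.75 + 1
```

On the grid, the penalty lies below $\mu\,\mathrm{card}$ and coincides with it at 0 and for $|t| \ge \sqrt{\mu}$, while adding $t^2$ gives a convex function, in line with the convex-envelope property.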

Equality of minimizers for K and K reg
As noted by Aubert, Blanc-Feraud and Soubies (see Theorems 4.5 and 4.8 in [33]), $\mathcal{K}_{reg}$ has the same global minima and potentially fewer local minima than $\mathcal{K}$, provided that $\|A\|_{\infty,col} = \max_i \|a_i\|_2 \le 1$, where $a_i$ denotes the columns of $A$. Below we (essentially) reproduce their statement in the terminology of this paper, and also include a proof for completeness.
Proof. We first establish that $\inf \mathcal{K} = \inf \mathcal{K}_{reg}$. Since $\mathcal{K} \ge \mathcal{K}_{reg}$ it suffices to show that $\inf \mathcal{K} \le \inf \mathcal{K}_{reg}$. Suppose not and let $x_0$ be a point such that $\mathcal{K}_{reg}(x_0) < \inf \mathcal{K}$. Then there must be some index $j$ such that the corresponding value in $Q_2(\mu\,\mathrm{card})(x_0)$ is different from $\mu\,\mathrm{card}(x_{0,j})$, which (see (23)) happens if and only if
$$0<|x_{0,j}|<\sqrt{\mu}. \qquad (27)$$
But then, as a function of the modulus of $x_{0,j}$ alone (all other entries and the phase being fixed), $\mathcal{K}_{reg}$ has second derivative
$$-2+2\|a_j\|_2^2\le 0 \qquad (28)$$
on this range. It follows that we can redefine $x_{0,j}$ to have modulus either 0 or $\sqrt{\mu}$, so that the resulting point $x_1$ satisfies $\mathcal{K}_{reg}(x_1) \le \mathcal{K}_{reg}(x_0)$. We can now continue like this for another index $j$ such that (27) holds (if it exists), and this process must terminate after finitely many steps $N$. Denoting the resulting point by $x_N$, we see that it satisfies $\mathcal{K}(x_N) = \mathcal{K}_{reg}(x_N) < \inf \mathcal{K}$, a contradiction. It is easy to see that $\mathcal{K}$ has global minimizers, and by the above argument these are also global minimizers for $\mathcal{K}_{reg}$. Now let $x_0$ be a global minimizer of $\mathcal{K}_{reg}$ but not of $\mathcal{K}$. As before we have that (27) and (28) hold for some index $j$, and we clearly must have equality in the latter. If $\|A\|_{\infty,col} < 1$ this is a contradiction, and hence we see that the global minimizers of $\mathcal{K}$ and $\mathcal{K}_{reg}$ coincide. If merely $\|A\|_{\infty,col} = 1$, then we can redefine $x_{0,j}$ to have modulus either 0 or $\sqrt{\mu}$ without changing the value of $\mathcal{K}_{reg}$, and as before this process eventually leads to a point $x_N$ which is also a global minimizer for $\mathcal{K}$. By the construction, there are at least two such points.
It remains to prove the statement about local minimizers. If $x_0$ is a local minimizer of $\mathcal{K}_{reg}$ and $\|A\|_{\infty,col} < 1$, we immediately get a contradiction from (28) unless $\mathcal{K}(x_0) = \mathcal{K}_{reg}(x_0)$. In view of $\mathcal{K} \ge \mathcal{K}_{reg}$, this establishes the claim.

On the uniqueness of sparse stationary points
Next we take a closer look at the structure of the stationary points. Given $N$ such that $\beta_N > 0$, we will show that under certain assumptions the difference between two stationary points always has more than $N$ non-zero elements. Hence if we find a stationary point with fewer than $N/2$ non-zero elements, we can be sure that it is the sparsest one. The main theorem reads as follows.

Theorem 4.2. Let $x'$ be a stationary point of $\mathcal{K}_{reg}$, let $z'$ be given by (21), and assume that
$$|z'_i| \notin \left[\beta_N^2\sqrt{\mu},\ \frac{\sqrt{\mu}}{\beta_N^2}\right] \qquad (29)$$
for all $i \in \{1, \dots, n\}$ (where the condition is automatically fulfilled if $\beta_N > 1$). If $x''$ is another stationary point of $\mathcal{K}_{reg}$ then $\mathrm{card}(x'' - x') > N$.
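Note that we allow $\beta_N > 1$ in the above theorem, in which case the condition on $z'$ is automatically satisfied. The proof depends on a sequence of lemmas, and is given at the end of the section. Clearly, we will rely on Proposition 3.1, which requires an investigation of the functional $G$ (18) and in particular its sub-differential. Introducing the function $g$ as
$$g(x) = \begin{cases} \sqrt{\mu}\,|x|, & |x| < \sqrt{\mu}, \\ \tfrac{1}{2}\left(|x|^2 + \mu\right), & |x| \geq \sqrt{\mu}, \end{cases}$$
we get $G(x) = \sum_{j=1}^n g(x_j)$. Its sub-differential is given by
$$\partial g(x) = \begin{cases} \sqrt{\mu}\,D, & x = 0, \\ \left\{\sqrt{\mu}\,\frac{x}{|x|}\right\}, & 0 < |x| < \sqrt{\mu}, \\ \{x\}, & |x| \geq \sqrt{\mu}, \end{cases} \qquad (32)$$
where $D$ is the closed unit disc in $\mathbb{C}$ or, if working over $\mathbb{R}$, $D = [-1, 1]$. In the remainder we suppose for concreteness that we work over $\mathbb{C}$ (but show the real case in pictures). Note that the sub-differential consists of a single point for each $x \neq 0$. Figure 8 illustrates $g$ and its sub-differential. The following two results establish a bound on the sub-gradients of $G$. We begin with some one-dimensional estimates of $g$.

Lemma 4.3. Assume that $z_0 \in \partial g(x_0)$ and $\beta_N < 1$. If
$$|z_0| > \frac{\sqrt{\mu}}{\beta_N^2}, \qquad (33)$$
then for any $x_1, z_1$ with $z_1 \in \partial g(x_1)$ and $x_1 \neq x_0$, we have
$$\mathrm{Re}\,(z_1 - z_0)\overline{(x_1 - x_0)} > (1 - \beta_N^2)\,|x_1 - x_0|^2.$$

Figure 8: The function $g(x)$ (left) and its sub-differential $\partial g(x)$ (right). Note that the sub-differential contains a unique element everywhere except at $x = 0$.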
Proof. By rotational symmetry (i.e. $\partial g(e^{i\phi}x) = e^{i\phi}\partial g(x)$), it is no restriction to assume that $z_0 > 0$. By $1/\beta_N^2 > 1$ and (33), we see that $z_0 > \sqrt{\mu}$, and hence the identity $z_0 \in \partial g(x_0)$ and (32) together imply that $z_0 = x_0$, and in particular that $x_0 \in \mathbb{R}$ and $x_0 > \sqrt{\mu}/\beta_N^2$. To prove the result we now minimize the quotient $\mathrm{Re}\,(z_1 - z_0)\overline{(x_1 - x_0)}/|x_1 - x_0|^2$ and show that it is larger than $1 - \beta_N^2$. There are three cases to consider: $|x_1| = 0$, $0 < |x_1| < \sqrt{\mu}$ and $|x_1| \geq \sqrt{\mu}$. The last case is easy since then $z_1 - z_0 = x_1 - x_0$ and $1 - \beta_N^2 < 1$, which yields the desired conclusion. For the two other cases we first show that $z_1$ and $x_1$ can be assumed to be real. If $x_1 = 0$ then the optimization of the above quotient is equivalent to minimization of $-\mathrm{Re}(z_1)$ over $z_1 \in \sqrt{\mu}D$, since $x_0 = z_0$ is real and positive, and the minimum is clearly attained at $z_1 = \sqrt{\mu}$. For the middle case, $z_1$ and $x_1$ have the same angle with $\mathbb{R}$, and $|x_1| < |z_1| < z_0$. We first hold the radii fixed and only consider the angle as an argument. Recall $z_0 = x_0$ and set $R = |z_1|/|x_1|$. Then the quotient can be written as in (36), which shows that it is minimized when $x_1$ is real and $0 < x_1 < \sqrt{\mu}$ (which then automatically applies to $z_1$ as well). Summarizing the above, we may thus assume that $x_1$ and $z_1$ are real and non-negative and $0 \leq x_1 < \sqrt{\mu}$. This simplifies the quotient (36) to $\frac{x_0 - z_1}{x_0 - x_1}$. We now hold $x_1, z_1$ fixed and consider $x_0$ as the variable. Since $z_1 \geq x_1$, it is easy to see that this is minimized for $x_0$ as small as possible, i.e. $x_0 = \sqrt{\mu}/\beta_N^2$. Summing up, the quotient is larger than $1 - \beta_N^2$, as desired.

Lemma 4.4. Assume that $z_0 \in \partial g(x_0)$ and $\beta_N < 1$. If $|z_0| < \beta_N^2\sqrt{\mu}$, then for any $x_1, z_1$ with $z_1 \in \partial g(x_1)$ and $x_1 \neq x_0$, the estimate of Lemma 4.3 holds.

Proof. The proof is similar to that of the previous lemma. We first note that $x_0 = 0$, $x_1 \neq 0$ and that $z_0$ may be assumed to be in $(0, \beta_N^2\sqrt{\mu})$. As before, the quotient $\mathrm{Re}\,(z_1 - z_0)\overline{x_1}/|x_1|^2$ is smallest when both $x_1$ and $z_1$ are real valued and positive. It is also easy to see that $z_0 = \beta_N^2\sqrt{\mu}$ gives a minimum value for any positive choice of $x_1, z_1$.
This reduces the problem to finding the minimum of $(z_1 - \beta_N^2\sqrt{\mu})/x_1$ over $x_1 > 0$ and $z_1 \in \partial g(x_1)$, which by basic calculus equals $1 - \beta_N^2$, as desired. We are now ready to prove Theorem 4.2.
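The one-dimensional function $g$ and the single-valued part of its sub-differential are simple enough to write down in code. The sketch below is illustrative only. It assumes $g(x) = \tfrac12\big(\mu - \max(\sqrt{\mu}-|x|,0)^2\big) + \tfrac12|x|^2$, i.e. the one-dimensional version of $G = \tfrac12 Q_2(f) + \tfrac12\|x\|^2$ with the closed form of $Q_2(\mu\,\mathrm{card})$ taken as an assumption; the names `g` and `subgrad_g` are hypothetical.

```python
import numpy as np

def g(x, mu):
    """Assumed form: g(x) = 0.5*(mu - max(sqrt(mu)-|x|, 0)^2) + 0.5*|x|^2.

    This equals sqrt(mu)*|x| for |x| < sqrt(mu) and 0.5*(|x|^2 + mu) otherwise.
    """
    r = abs(x)
    return 0.5 * (mu - max(np.sqrt(mu) - r, 0.0) ** 2 + r ** 2)

def subgrad_g(x, mu):
    """The unique sub-gradient of g at x != 0 (real or complex x).

    At x = 0 the sub-differential is the whole disc sqrt(mu)*D, so no single
    value is returned there.
    """
    r = abs(x)
    if r == 0:
        raise ValueError("sub-differential at 0 is the set sqrt(mu)*D")
    if r < np.sqrt(mu):
        return np.sqrt(mu) * x / r   # modulus sqrt(mu), same phase as x
    return x                         # identity branch for |x| >= sqrt(mu)

mu = 4.0
print(g(0.5, mu), g(3.0, mu))                  # sqrt(mu)*0.5 and 0.5*(9 + mu)
print(subgrad_g(-0.5, mu), subgrad_g(3.0, mu))
print(subgrad_g(1j, mu))                       # complex input keeps its phase
```

On the real line the map $x \mapsto$ `subgrad_g(x, mu)` is non-decreasing, which is the monotonicity property of $\partial g$ invoked in the proof of Theorem 4.2.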

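Proof of Theorem 4.2. By Proposition 3.1 it suffices to verify
$$\mathrm{Re}\,\langle z'' - z', x'' - x' \rangle > (1 - \beta_N^2)\,\|x'' - x'\|_2^2. \qquad (39)$$
Suppose first that $\beta_N < 1$. Since $\partial G(x) = \prod_{j=1}^n \partial g(x_j)$ (a Cartesian product), Lemmas 4.3 and 4.4 imply that (39) holds. Suppose now that $\beta_N > 1$. By (39) it suffices to prove that $\mathrm{Re}\,\langle z'' - z', x'' - x' \rangle \geq 0$ for all $x'' \neq x'$. Fix $i$ in $\{1, \dots, n\}$. By rotational symmetry it is easy to see that we can assume that $x'_i, z'_i \geq 0$. Moreover, for fixed values of $|z''_i|$ and $|x''_i|$ (but variable complex phase) it is easy to see that $\mathrm{Re}\,(z''_i - z'_i)\overline{(x''_i - x'_i)}$ achieves its minimum when these are also real, i.e. we can assume that $x''_i, z''_i \in \mathbb{R}$. Since the graph of $\partial g$ is non-decreasing it follows that $\mathrm{Re}\,(z''_i - z'_i)\overline{(x''_i - x'_i)} \geq 0$. It remains to consider the case when $\beta_N = 1$, and as above we reach a contradiction if we prove that $\mathrm{Re}\,\langle z'' - z', x'' - x' \rangle > 0$. Again we can assume that $x'_i, z'_i \geq 0$ and that $x''_i, z''_i \in \mathbb{R}$. Then (29) implies that $z'_i \neq \sqrt{\mu}$ for all $i$. Since $x'' \neq x'$, we must have $x''_i \neq x'_i$ for some $i$. Using that $z'_i \in \partial g(x'_i)$ and $z''_i \in \partial g(x''_i)$, examination of (32) yields that also $z''_i \neq z'_i$. With this at hand we see that the left hand side of (39) is strictly positive, whereas the right hand side equals 0, which again is a contradiction.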

Conditions on global minimality
Theorem 4.5. Let A satisfy A ∞,col ≤ 1, let x be a stationary point of K reg and let z be given by (21). Assume that If then x is the unique global minimum of K and K reg .
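Obviously, it is desirable to pick $N$ as large as possible, which is limited by (40) and the fact that $\beta_N$ decreases with $N$. Also note that $\beta_N \leq 1$ since $\beta_N \leq \|A\|_{\infty,col} \leq 1$.

Proof. Set $K = \mathrm{card}(x')$ and assume that $x'$ is not the unique global minimizer of $\mathcal{K}_{reg}$. By Theorem 4.1, there exists a global minimum $x'' \neq x'$ for both $\mathcal{K}_{reg}$ and $\mathcal{K}$, which hence is a stationary point of $\mathcal{K}_{reg}$. Theorem 4.2 then shows that $\mathrm{card}(x'') \geq N - K + 1$. It follows that $\mathcal{K}(x'') \geq \mu\,\mathrm{card}(x'') \geq \mu(N - K + 1)$, which by the assumption of the theorem is incompatible with $x''$ being a global minimizer. This is a contradiction, and hence $x'$ must be the unique global minimizer of $\mathcal{K}_{reg}$. By Theorem 4.1 it then follows that $x'$ is also the unique minimizer of $\mathcal{K}$.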

Noisy data.
In this final subsection we return to the compressed sensing problem of retrieving a sparse vector $x_0$ given corrupted measurements $b = Ax_0 + \epsilon$, where $\epsilon$ is noise and $x_0$ is sparse. More precisely we set $S = \mathrm{supp}\, x_0$, where we assume that $\#S = K$ is much smaller than $m$, the number of rows of $A$ (i.e. the number of measurements). We let $x_{0,j}$ denote the elements of the vector $x_0$. Let $A_S$ denote the matrix obtained from $A$ by setting columns outside of $S$ to 0, and let $x_S$ denote the least squares solution to $A_S x_S = b$. Note that this is the so-called "oracle solution" discussed in the introduction, which can also be written $x_S = (A_S^* A_S)^\dagger A_S^* b$. The proposition below shows that the oracle solution is, under mild assumptions, a local minimizer of $\mathcal{K}_{reg}$; we denote it by $x'$ for notational consistency.
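for all $j \in S$ then the oracle solution $x' = x_S$ is a strict local minimum of $\mathcal{K}_{reg}$ with $\mathrm{supp}(x') = \mathrm{supp}(x_0)$. We also have $|x'_j| > \sqrt{\mu}$ for $j \in S$, $\|Ax' - b\| \leq \|\epsilon\|$, and $\|x' - x_0\| \leq \|\epsilon\|/\beta_K$.

Proof. Consider the equation $A_S x = Ax_0 + \epsilon$ and note that $Ax_0 = A_S x_0$. The least squares solution is obtained by applying $(A_S^* A_S)^\dagger A_S^*$, which gives the solution $x' = x_0 + \delta$ with $\delta = (A_S^* A_S)^\dagger A_S^* \epsilon$. By construction of the Moore-Penrose inverse, $\mathrm{supp}\,\delta \subset S$, and hence $A\delta = A_S\delta = P_{\mathrm{Ran} A_S}\epsilon$, where $P_{\mathrm{Ran} A_S}$ denotes the orthogonal projection onto the range of $A_S$. In particular, $\beta_K\|\delta\|_2 \leq \|A\delta\|_2 \leq \|\epsilon\|_2$, which establishes the final inequality in the proposition. Also $\|\delta\|_\infty \leq \|\delta\|_2$, which implies $|x'_j| \geq |x_{0,j}| - \|\epsilon\|_2/\beta_K$ for $j \in S$, and by the hypothesis this gives $|x'_j| > \sqrt{\mu}$ (42). This also gives $\mathrm{supp}\, x' = \mathrm{supp}\, x_0$, since we have already shown $\mathrm{supp}\, x' \subset \mathrm{supp}\, x_0 \cup \mathrm{supp}\,\delta \subset S$. We now consider $Ax' - b$, which equals $-P_{(\mathrm{Ran} A_S)^\perp}\epsilon$ (43), and hence $\|Ax' - b\| \leq \|\epsilon\|$.

It remains to prove that $x'$ is a local minimum of $\mathcal{K}_{reg} = Q_2(\mu\,\mathrm{card}) + \|Ax - b\|_2^2$. To this end, consider $\mathcal{K}_{reg}(x' + v)$. Since $|x'_j| > \sqrt{\mu}$ for $j \in S$, the term $Q_2(\mu\,\mathrm{card})$ is flat for the corresponding indices of $v$. This gives the expansion (44). Since $x'$ solves the least squares problem posed initially, the vector $A_S^*(Ax' - b) = A_S^*(A_S x' - b)$ must be 0. With this in mind, (44) simplifies to (45). By the Cauchy-Schwarz inequality and (43) we have $|\langle a_j, Ax' - b\rangle| \leq \|a_j\|_2\|\epsilon\|_2$. It follows that the term $\sum_{j \in S^c}\left(\sqrt{\mu}|v_j| + v_j\langle a_j, Ax' - b\rangle\right)$ in (45) can be estimated from below by $\rho\sum_{j \in S^c} v_j^2$ for some $\rho > 0$, and hence that $\mathcal{K}_{reg}(x' + v) > \mathcal{K}_{reg}(x')$ for $v$ in a neighborhood of 0, as long as $\sum_{j \in S^c} v_j^2 \neq 0$. To have $\mathcal{K}_{reg}(x' + v) \leq \mathcal{K}_{reg}(x')$ we thus need $\mathrm{supp}\, v \subset S$, as seen from (45). But then (45) reduces to $\|Av\|^2 + \mathcal{K}_{reg}(x')$, and since $\beta_K > 0$ it follows that $\|Av\|^2 > 0$ unless $v = 0$. In other words, $x'$ is a strict local minimizer.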
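The oracle solution and the two estimates from the proof above are easy to reproduce numerically. The following is a minimal sketch under stated assumptions: the matrix, support, signal and noise level are made-up test data, `oracle_solution` is a hypothetical helper name, and the smallest singular value of the support columns is used in place of $\beta_K$ (it is an upper bound for $\beta_K$ if $\beta_K$ is the smallest such value over all supports of size $K$, which is an assumption about a definition not restated here).

```python
import numpy as np

def oracle_solution(A, b, support):
    """Least squares solution of A x = b restricted to the given support."""
    x = np.zeros(A.shape[1], dtype=np.result_type(A, b))
    coeffs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
    x[support] = coeffs
    return x

rng = np.random.default_rng(0)
m, n = 20, 40
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)           # unit-norm columns, so the column norm bound is 1

support = np.array([2, 11, 30])          # hypothetical true support S
x0 = np.zeros(n)
x0[support] = [3.0, -2.5, 4.0]
eps = 0.05 * rng.standard_normal(m)      # noise
b = A @ x0 + eps

x_or = oracle_solution(A, b, support)
residual = np.linalg.norm(A @ x_or - b)
error = np.linalg.norm(x_or - x0)
sigma_min = np.linalg.svd(A[:, support], compute_uv=False).min()

print(residual, "<=", np.linalg.norm(eps))             # ||Ax' - b|| <= ||eps||
print(error, "<=", np.linalg.norm(eps) / sigma_min)    # ||x' - x0|| <= ||eps|| / sigma_min
```

The residual bound reflects that $Ax' - b$ is the projection of $-\epsilon$ onto the orthogonal complement of the range of $A_S$, and the error bound is the estimate on $\delta$ from the proof with the support-specific constant.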
In the above proposition, nothing is said as to whether $x'$ is a global minimum or not. To get further, let $z'$ correspond to $x'$ via (21). We need conditions under which (40) holds for $z'$; these are the conditions (46).
We remind the reader that $N$ is a number which preferably is a bit larger than $2K$, where $K$ is the cardinality of $x_0$.
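Proof. Using (43), $z'$ can be expressed through $x'$ and $A^* P_{(\mathrm{Ran} A_S)^\perp}\epsilon$. Since $A^* P_{(\mathrm{Ran} A_S)^\perp}$ is 0 on rows with index $j \in S$ (being a scalar product of a vector in $\mathrm{Ran} A_S$ and another in its orthogonal complement), we see that $z'_j = x'_j$ for such $j$. Combining this with the final estimate of Proposition 4.6, we see that the part of (46) concerning $j \in S$ is true by the assumptions. For the remaining $z'_j$ (i.e. $j \in S^c$), we have $|z'_j| \leq \|\epsilon\|$ (50).

Putting all the results together and combining with simple estimates, we finally get the following. Then the oracle solution $x' = x_S$ is the unique global minimum of $\mathcal{K}_{reg}$ as well as $\mathcal{K}$, with the property that $\mathrm{supp}\, x' = \mathrm{supp}\, x_0$, that $\|x' - x_0\| \leq \|\epsilon\|/\beta_K$, and that $\mathrm{card}(x'') > N - K$ for any other stationary point $x''$ of $\mathcal{K}_{reg}$.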
Proof. All the statements follow from Theorem 4.2, Theorem 4.5 and Proposition 4.6, so we just need to check that these apply. Note that $\beta_N \leq \beta_K \leq \|A\|_{\infty,col} \leq 1$, which will be used repeatedly.
We begin by verifying that Proposition 4.6 applies, which is a direct consequence of the assumptions. Now, to verify that the two theorems apply we need to check the conditions (46), which follow if we show that Proposition 4.7 applies. The estimate on $\epsilon$ is satisfied by assumption, and the other holds if $2\mu K + \beta_N^4\mu \leq \mu N + \mu$, which is clearly the case since $N \geq 2K$.
As usual, a simpler statement is found by setting $N = 2K$, which gives the loosest conditions to verify (see Corollary 2.2). The only difference in the conclusion concerns the cardinality of other local minima, since the estimate on $\|x' - x_0\|$ only depends on $\beta_K$.
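For small matrices the constants $\beta_N$ can be computed by brute force, which makes the monotonicity $\beta_N \leq \beta_K$ for $N \geq K$ concrete. The sketch below rests on an assumption about a definition that is not restated in this section: $\beta_N$ is taken to be the largest constant with $\|Ax\|_2 \geq \beta_N\|x\|_2$ for all $x$ with $\mathrm{card}(x) \leq N$, i.e. the smallest singular value over all column subsets of size $N$. The function name is hypothetical, and the enumeration is exponential in $n$, so this is only usable for toy sizes.

```python
import itertools
import numpy as np

def lower_isometry_constant(A, N):
    """Brute-force beta_N under the assumed definition:
    min over supports of size N of the smallest singular value of A restricted
    to that support."""
    m, n = A.shape
    if N > m:
        return 0.0  # more columns than rows: the restriction has a non-trivial kernel
    best = np.inf
    for support in itertools.combinations(range(n), N):
        smin = np.linalg.svd(A[:, list(support)], compute_uv=False).min()
        best = min(best, smin)
    return float(best)

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 5))
B /= np.linalg.norm(B, axis=0)   # unit columns, so beta_1 = 1 under the assumption
betas = [lower_isometry_constant(B, N) for N in range(1, 5)]
print(betas)                     # non-increasing in N
```

With this reading, $\beta_1$ is the smallest column norm, which is consistent with the chain $\beta_N \leq \beta_K \leq \|A\|_{\infty,col}$ used in the proof.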

Known model order; the K-sparsity problem
Let $P_K = \{x : \mathrm{card}(x) \leq K\}$ where $x$ is a vector in $\mathbb{C}^n$ or $\mathbb{R}^n$. Set $f(x) = \iota_{P_K}(x)$ and note that the problem
$$\arg\min_{\mathrm{card}(x) \leq K} \|Ax - b\|_2^2$$
is equivalent to finding the minimum of
$$\mathcal{K}_K(x) = \iota_{P_K}(x) + \|Ax - b\|_2^2$$
(where we put a subindex $K$ to distinguish from $\mathcal{K}$ in the previous section). Again, we will approach this problem by using
$$\mathcal{K}_{K,reg}(x) = Q_2(\iota_{P_K})(x) + \|Ax - b\|_2^2.$$
This is in some ways much simpler than the situation in the previous sections; for example, all local minimizers of $\mathcal{K}_K$ are clearly in $P_K$. On the other hand, $Q_2(\iota_{P_K})$ turns out to be rather complicated. We recapitulate the essentials, which follow by adapting the computations in [3] (for matrices) to the vector setting. Define $\tilde{x}$ to be the vector $x$ resorted so that $(|\tilde{x}_j|)_{j=1}^n$ is a decreasing sequence. Then
$$Q_2(\iota_{P_K})(x) = \frac{1}{k_*}\Big(\sum_{j > K - k_*} |\tilde{x}_j|\Big)^2 - \sum_{j > K - k_*} |\tilde{x}_j|^2 \qquad (53)$$
where $k_*$ is the largest value in $1, \dots, K$ for which the non-increasing sequence
$$s(k) = \sum_{j > K - k} |\tilde{x}_j| - k\,|\tilde{x}_{K+1-k}| \qquad (54)$$
is non-negative (note that it clearly is non-negative for $k = 1$). Although it is not very clear from the above expression, $Q_2(\iota_{P_K})$ is known to be continuous (see e.g. Proposition 3.2 in [16]), and this will be used without comment below. We first show that the global minima of $\mathcal{K}_{K,reg}$ and $\mathcal{K}_K$ coincide.
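Since the expression for $Q_2(\iota_{P_K})$ is not very transparent, a small numerical sketch may help. It is not taken from the paper. It assumes the closed form $Q_2(\iota_{P_K})(x) = \frac{1}{k_*}\big(\sum_{j>K-k_*}|\tilde x_j|\big)^2 - \sum_{j>K-k_*}|\tilde x_j|^2$ with $s(k) = \sum_{j>K-k}|\tilde x_j| - k|\tilde x_{K+1-k}|$, which is a reconstruction consistent with the properties used in the proofs of this section (non-negativity of $s(1)$, vanishing on $P_K$); the function name is hypothetical.

```python
import numpy as np

def q2_indicator(x, K):
    """Q_2 of the indicator of {card(x) <= K}, under the assumed closed form.

    a holds the moduli of x sorted in decreasing order; 1-indexed a_j in the
    text corresponds to a[j-1] here, for 1 <= K.
    """
    a = np.sort(np.abs(np.asarray(x)))[::-1]
    if K >= a.size:
        return 0.0                      # every vector is K-sparse in this case
    k_star = 1
    for k in range(1, K + 1):
        # s(k) = sum_{j > K-k} a_j - k * a_{K+1-k}
        s_k = a[K - k:].sum() - k * a[K - k]
        if s_k >= 0:
            k_star = k
        else:
            break                       # s is non-increasing, so we can stop here
    tail = a[K - k_star:]
    return float(tail.sum() ** 2 / k_star - np.sum(tail ** 2))

print(q2_indicator([5.0, -3.0, 0.0], K=2))   # vector in P_2
print(q2_indicator([1.0, 1.0], K=1))         # vector outside P_1
print(q2_indicator([3.0, 2.0, 1.0], K=2))
```

For $x = (1, 1)$ and $K = 1$ the value can be cross-checked by hand: the convex envelope of $\iota_{P_1}(y) + \|y\|_2^2$ at $(1,1)$ is $4$, attained by averaging $(2,0)$ and $(0,2)$, and subtracting $\|x\|_2^2 = 2$ leaves $2$.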

Equality of minimizers and K-feasibility
In order to provide a theorem similar to Theorem 4.1, we need a technical condition on the columns $a_1, \dots, a_n$ of $A$. We say that $A$ is $K$-feasible if $\|A\|_{\infty,col} \leq 1$ and, for any subset of $n - K$ columns, we can pick two such that $\|a_i - a_j\|^2 \leq 2$. We say that $A$ is strictly $K$-feasible if the inequality is strict. This condition is easy to satisfy; the following proposition lists conditions that imply $K$-feasibility in $\mathbb{R}$ and $\mathbb{C}$ respectively.

Proposition 5.1. If we work over $\mathbb{R}$, any $A$ with $\|A\|_{\infty,col} \leq 1$ and $n \geq m + K + 2$ is $K$-feasible. If we add the condition that $\langle a_i, a_j\rangle \neq 0$ for all pairs, or that $\|A\|_{\infty,col} < 1$, then strict $K$-feasibility follows. The same follows over $\mathbb{C}$ if $n \geq 2m + K + 2$. Another condition ensuring strict $K$-feasibility, which works in both $\mathbb{R}$ and $\mathbb{C}$, is that $\|A\|_{\infty,col} \leq 1$ and at least $nK$ of the values $\{\mathrm{Re}\,\langle a_i, a_j\rangle\}_{i > j}$ are positive (repetitions allowed).
We remark that it is possible to choose $2m + 1$ vectors in $\mathbb{C}^m$ such that $\|a_i - a_j\|^2 > 2$; just consider a simplex in $\mathbb{R}^{2m}$ with equal side lengths and all corners on the unit sphere. The condition that $n \geq 2m + K + 2$ is a bit unfortunate, since it rules out the common situation $n = 2m$. This is why we added the final part of the proposition.
Proof. If the chosen subset contains a zero vector, the conclusion is immediate, so we can assume that this is not the case. For any $m + 2$ vectors $a_1, \dots, a_{m+2}$ in $\mathbb{R}^m$, we can always pick two such that $\langle a_i, a_j\rangle \geq 0$, which follows from a simple induction argument [34]. The first two conclusions regarding $\mathbb{R}^m$ follow immediately from this observation. Since $\mathbb{C}^m$ is isomorphic to $\mathbb{R}^{2m}$, the corresponding result for $\mathbb{C}$ follows. Finally, the elements above the diagonal in $(\langle a_i, a_j\rangle)_{i,j=1}^n$ are $n(n-1)/2$ in number. When we consider a subset of columns of cardinality $n - K$, we remove a total of $(n - K)K + (K - 1)K/2$ of these. Since $nK$ is a bigger number, we are certain that at least one positive value remains.
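For toy matrices the definition of $K$-feasibility can be verified directly by enumerating all subsets of $n - K$ columns. The sketch below is only illustrative: it assumes that $\|A\|_{\infty,col}$ is the largest Euclidean column norm and that "strict" refers to $\|a_i - a_j\|^2 < 2$; the function name and tolerance are hypothetical, and the enumeration is exponential in $n$.

```python
import itertools
import numpy as np

def is_k_feasible(A, K, strict=False, tol=1e-12):
    """Brute-force check: every subset of n-K columns contains a pair with
    ||a_i - a_j||^2 <= 2 (or < 2 if strict), and all column norms are <= 1."""
    A = np.asarray(A)
    n = A.shape[1]
    if np.any(np.linalg.norm(A, axis=0) > 1 + tol):
        return False
    gram = A.conj().T @ A
    sq_norms = np.real(np.diag(gram))
    # ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 Re <a_i, a_j>
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * np.real(gram)
    close = (sq_dist < 2 - tol) if strict else (sq_dist <= 2 + tol)
    np.fill_diagonal(close, False)      # a column is not paired with itself
    for subset in itertools.combinations(range(n), n - K):
        idx = np.array(subset, dtype=int)
        if not close[np.ix_(idx, idx)].any():
            return False
    return True

I3 = np.eye(3)
print(is_k_feasible(I3, K=1))                      # orthonormal columns: distance^2 equals 2
print(is_k_feasible(I3, K=1, strict=True))
print(is_k_feasible(np.hstack([np.eye(2), np.eye(2)]), K=1, strict=True))
```

For unit-norm columns the condition $\|a_i - a_j\|^2 \leq 2$ is the same as $\mathrm{Re}\,\langle a_i, a_j\rangle \geq 0$, which is why the proposition is phrased in terms of signs of scalar products.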
As a concrete example, which sometimes appears in applications, we consider the concatenation of a discrete Fourier transform matrix and an identity matrix. To see what the above proposition entails in this case, suppose that $m = 4k$ or $m = 4k + 1$. Each column of the Fourier matrix gives rise to at least $k$ positive values in its scalar products with the canonical basis coming from the identity matrix, so in total we have at least $mk$ positive off-diagonal elements. This gives the condition $2mK \leq mk$, i.e. $K \leq m/8$, which is acceptable for relevant applications. We now develop the theory for $K$-feasible matrices, starting with the analog of Theorem 4.1.
Theorem 5.2. Let $A$ be $K$-feasible. Then the global minimizers of $\mathcal{K}_{K,reg}$ and $\mathcal{K}_K$ coincide, and all lie in $P_K$. If $A$ is strictly $K$-feasible, then any local minimizer of $\mathcal{K}_{K,reg}$ lies in $P_K$ and is a minimizer of $\mathcal{K}_K$.
Proof. We first treat global minimizers. Clearly all minimizers of K K are in P K . Since K K ≥ K K,reg and they coincide on P K , it suffices to show that a given global minimizer of K K,reg is in P K . This is annoyingly difficult, it is even a problem to show that K K,reg has global minima. We begin by showing that large values of x are not candidates for global minima if they are in the set We have Q 2 (ι P K )(x) > 0 for all x ∈ U . One way to see this is by Corollary 4.4 in [16] (or Theorem 2.20 in [15]), which states that there exists a direction v such that t → Q 2 (ι P K )(x + tv) > 0 has negative second derivative (since K K,reg (x) < ∞ = K K (x)), and this would imply that Q 2 (ι P K ) could take negative values, contradicting Proposition 3.2 in [16] (or Proposition 2.1 in [15]). This can also be deduced by a more careful analysis of (53), which we will perform below. Put Since we are minimizing a continuous non-zero function over a compact set, > 0. We now show that large values of x ∈ U yield large values of Q 2 (ι P K )(x). Let us write s = s x for (54), when there is a need to make the dependence on x clear. The function s is radially dependent, i.e. s tx = ts x for t ∈ R, and hence k * is radially independent. Looking at the expression for Q 2 (ι P K ) we see that Note that K K (0) = K K,reg (0) = 1 2 b 2 so the global minimum is less than or equal to this. If R is such that R 2 > 1 2 b 2 , it finally follows that no point x ∈ U with x > R can be less than 1 2 b 2 .
We remark in passing that $x = 0$ cannot be a global minimizer of $\mathcal{K}_{K,reg}$ unless $b$ is perpendicular to the range of $A$, in which case the theorem holds trivially. To see this, note that $Q_2(\iota_{P_K})(0) = 0$ and $Q_2(\iota_{P_K}) \geq 0$, so if $x = 0$ is the global minimum of $\mathcal{K}_{K,reg}$, the gradient of $\frac{1}{2}\|Ax - b\|^2$ must necessarily be zero at $x = 0$, i.e. $A^*(-b) = 0$. Now let $G$ be the set of global minimizers for $\mathcal{K}_{K,reg}$ restricted to $[-R, R]^n$. This set is clearly closed, non-void and bounded. Now let $G_n \subset G$ be the subset where $|x_n|$ attains its minimum over $G$, let $G_{n-1} \subset G_n$ be the subset where $|x_{n-1}|$ is minimized, and so on until we reach $G_{K+1}$, which still is closed, non-void and bounded. Suppose that $G_{K+1} \not\subset P_K$ and pick $x \in G_{K+1} \setminus P_K$. We now show that this is impossible. First of all, note that any two columns of $A$ and the corresponding two elements in $x$ may (simultaneously) switch positions and sign without affecting the problem, so it is no restriction to assume that $x = \widetilde{|x|}$, i.e. that the entries of $x$ are non-negative and non-increasing, which we now do.
If $x_{K+1} = R$ then this must also be the case for $x_1, \dots, x_K$, and hence $\|x\| > R$ and $x \in U$, which is impossible by the earlier conclusions. Thus $x_{K+1} < R$.
Use $K$-feasibility of $A$ to pick two indices $j > i > K$ such that $\|a_i - a_j\|^2 \leq 2$. Consider the function
$$x(t) = x + te_i - te_j. \qquad (56)$$
We shall show that this function stays in $G$ for small values of $t > 0$, which contradicts the construction of $G_{K+1}$. Note that it stays in $[-R, R]^n$ for sure, due to $x_{K+1} < R$. A complicating factor is the fact that we may fail to have $\widetilde{|x(t)|} = x(t)$ for $t > 0$, i.e. this vector is not necessarily non-increasing as a function of its index. As long as $x_K > x_i$, we can pick both $i, j > K$ such that $x(t)$ is non-increasing for small values of $t > 0$. We consider this case first. All values of $s_{x(t)}(k)$ in (54) are then unaltered by $t$, and hence small perturbations do not affect $k_*$. With this at hand, it follows that the first term in the expression (53) for $Q_2(\iota_{P_K})(x(t))$ is unaffected by small changes in $t$, whereas the latter term is a quadratic polynomial starting with $-2t^2$. The quadratic term in the expression $\|Ax(t) - b\|^2$, on the other hand, is $\|a_i - a_j\|^2t^2$. Using $\|A\|_{\infty,col} \leq 1$, it follows that (57) holds. Thus $\mathcal{K}_{K,reg}(x(t))$ is linear in a neighborhood of 0, and since $x(0) \in G$ it must actually be constant. It follows that $x(t) \in G$ for small values of $t > 0$, which is a contradiction as we noted before.

It remains to consider the case when $x_K = x_i$. In this case we can make $x(t)$ non-increasing (in its index) for small (fixed) values of $t > 0$, upon changing $i$ so that $x_i$ is the first value to equal $x_K$, but now $i \leq K$ and the independence of $k_*$ is less clear. Let $k_i$ be such that $i = K + 1 - k_i$. We will need the following observations about $s(k)$. If $x_{K+1-k} = x_{K+1-(k+1)}$ then $s(k) = s(k+1)$, so we always have $s_x(1) = s_x(2) = \dots = s_x(k_i)$. Since $x \notin P_K$, we also have $s_x(1) > 0$. Combined, this gives $s_x(k_i) = s_x(1) > 0$ and hence $k_* \geq k_i$. If we now consider $s_{x(t)}(k)$ as functions of $t$, the inequality $s_{x(t)}(k_i) > 0$ is stable with respect to small perturbations.
On the other hand, the values of $s_{x(t)}(k)$ for $k > k_i$ are unaffected by small values of $t$ (they cancel out in the sum of (54)), and so we conclude that $k_*$ is unaffected by small values of $t$. It follows that both terms depending on $t$ in $x(t)$ have indices beyond $K - k_*$, which precisely as before yields that (57) holds, a contradiction.
To sum up, we have so far shown that any element $x \in G_{K+1} \setminus P_K$ necessarily satisfies $|x_{K+1}| < R$ and $|x_n| = 0$. As a grand finale we will now show that this is impossible. Let $j$ be such that $x_j = 0$ and pick $i$ such that $|x_i| \leq |x_K|$ and $\langle a_i, a_j\rangle \neq 0$. That this can be done follows from the basic fact that $m + 1$ non-zero vectors in $\mathbb{R}^m$ cannot be mutually orthogonal (suppose for the moment that no $a_i$ is identically 0). Since $x_j = 0$, it is no restriction to assume that $\langle a_i, a_j\rangle > 0$, which we do for simplicity. Consider again $x(t) = x + te_i - te_j$. The previous analysis can now be repeated without modification, to conclude that (57) holds whether $i > K$ or not. But in this case (57) has a strictly negative value, which contradicts that $x$ is in $G$ to begin with. In the case when one $a_i = 0$, we have $\|a_i - a_j\| \leq 1$ for any other $a_j$, which also yields a contradiction in (57).
We can finally conclude that any minimizer of $\mathcal{K}_{K,reg}$ in $[-R, R]^n$ is in $P_K$ and thus also a minimizer of $\mathcal{K}_K$. Since $R$ could be arbitrarily large, the conclusion holds also in $\mathbb{R}^n$. The set of global minimizers of $\mathcal{K}_{K,reg}$ is thus closed and non-empty. However, the remaining argument becomes easier if we keep $R$ fixed for a while longer. Now consider a path-connected component $H$ of the set of global minimizers in $[-R, R]^n$. Repeating the entire above argument, we see that $P_K \cap H \neq \emptyset$. Assume now that $H$ contains points that are not in $P_K$. Since $\sum_{j=K+1}^n |x_j|$ is a continuous function on $H$ which is 0 on $P_K$, there are points in $H$ with arbitrarily small quotient $\big(\sum_{j=K+1}^n |x_j|\big)/|x_1|$. Recall that we have ruled out the case $0 \in H$ early on in the proof, so the quotient is a continuous function on $H$. Let $I$ be a level set of this function. We repeat the construction of sets $I_n \supset I_{n-1} \supset \dots \supset I_{K+1}$ precisely as we did with $G$. Pick a concrete $x \in I_{K+1}$; as before it is no restriction to assume that $x = \widetilde{|x|}$. As before we see that it is impossible to have $x_n = 0$, and as before we can pick $i, j$ such that $\langle a_i, a_j\rangle \geq 0$ where $i, j > K$. We define $x(t)$ via (56) and establish as before that (57) must hold, whereby we get that $x(t) \in I_{K+1}$ for small values of $t$, and this contradicts how the sets $I_n, \dots, I_{K+1}$ were chosen.
By this we get that all global minimizers of $\mathcal{K}_{K,reg}$ in $[-R, R]^n$ must lie in $P_K$, and since $R$ can be arbitrarily large, the proof about global minimizers is complete.
Finally, if we assume that $A$ is strictly $K$-feasible and that $x$ is a local minimizer, then we can always find $i, j$ such that (57) holds, which is a contradiction since $-4 + 2\|a_i - a_j\|^2 < 0$ in this case.
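The second-order computation that drives the proof above can be written out explicitly. The following is a sketch of that calculation under simplifying assumptions: we work over $\mathbb{R}$ with $x_i \geq 0$ and $x_j > 0$, take $x(t) = x + te_i - te_j$ with both indices $i, j > K - k_*$, assume that $k_*$ does not change for small $t$, and use the closed form $Q_2(\iota_{P_K})(x) = \frac{1}{k_*}\big(\sum_{l > K-k_*}|\tilde x_l|\big)^2 - \sum_{l > K-k_*}|\tilde x_l|^2$ (a reconstruction, not a quotation). The sum $\sum_{l > K - k_*}|\tilde x_l|$ is then unchanged, since $+t$ and $-t$ cancel.

```latex
\begin{align*}
\|Ax(t)-b\|_2^2
  &= \|Ax-b\|_2^2 + 2t\,\langle a_i-a_j,\ Ax-b\rangle + t^2\,\|a_i-a_j\|_2^2, \\
-(x_i+t)^2-(x_j-t)^2
  &= -x_i^2-x_j^2-2t\,(x_i-x_j)-2t^2, \\
\frac{d^2}{dt^2}\,\mathcal{K}_{K,reg}\bigl(x(t)\bigr)
  &= -4+2\,\|a_i-a_j\|_2^2 .
\end{align*}
```

Under these assumptions the second derivative is non-positive exactly when $\|a_i - a_j\|^2 \leq 2$, which is the inequality in the definition of $K$-feasibility, and strictly negative in the strictly $K$-feasible case.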

On the uniqueness of sparse stationary points
We now give a condition, similar to (29) in Section 4.2, which ensures that a sparse stationary point is unique, in the sense that other stationary points must have higher cardinality.
Theorem 5.3. Let $x'$ be a stationary point of $\mathcal{K}_{K,reg}$ with cardinality $K$, let $z'$ be given by (21), and assume that
$$|\tilde{z}'_{K+1}| < (2\beta_{2K}^2 - 1)\,|\tilde{z}'_K|. \qquad (59)$$
If $x''$ is another stationary point of $\mathcal{K}_{K,reg}$ then $\mathrm{card}(x'') > K$.
Again, we allow $\beta_{2K} > 1$ in the above theorem, in which case the condition on $z'$ is automatically satisfied. We begin with a lemma. Recall $G$ given by (18), i.e. $\frac{1}{2}Q_2(\iota_{P_K})(x) + \frac{1}{2}\|x\|^2$.

As before we want to reach a contradiction to Proposition 3.1, i.e. we want to prove $\mathrm{Re}\,\langle z'' - z', x'' - x'\rangle > (1 - \beta_{2K}^2)\|x'' - x'\|_2^2$. Note that the first terms in (62) and (63) are the same, and that $\beta_{2K} > 0$. Since the second and third sums have the same number of terms, it suffices to show the corresponding termwise inequality for any pair $i \in I'$, $i \notin I''$ and $j \notin I'$, $j \in I''$. This in turn will follow upon showing that $|z''_i||x'_i| + |z'_j||x''_j| < \beta_{2K}^2(|x'_i|^2 + |x''_j|^2)$. By Lemma 5.4 it is easy to see that $|z''_i| \leq |z''_j| = |x''_j|$, and by assumption we also have $|z'_j| < (2\beta_{2K}^2 - 1)|z'_i| = (2\beta_{2K}^2 - 1)|x'_i|$. Thus $|z''_i||x'_i| + |z'_j||x''_j| < |x''_j||x'_i| + (2\beta_{2K}^2 - 1)|x'_i||x''_j| = 2\beta_{2K}^2|x'_i||x''_j| \leq \beta_{2K}^2(|x'_i|^2 + |x''_j|^2)$, as desired.

Conditions on global minimality.
The statements in this section are actually a bit stronger than the corresponding ones in Section 4.3.
Theorem 5.5. Let $A$ be $K$-feasible and let $x' \in P_K$ be a stationary point of $\mathcal{K}_{K,reg}$. Let $z'$ be given by (21) and assume that (59) applies. Then $x'$ is the unique global minimizer of $\mathcal{K}_K$ and $\mathcal{K}_{K,reg}$. If $A$ is strictly $K$-feasible, then there are no other local minimizers either.
Proof. By Theorem 5.2 there exists $x'' \in P_K$ which is a global minimizer for both $\mathcal{K}_K$ and $\mathcal{K}_{K,reg}$. If $x'' \neq x'$ this would contradict Theorem 5.3. If in addition $A$ is strictly $K$-feasible, then Theorem 5.2 says that any local minimizer is a stationary point in $P_K$, which is impossible by Theorem 5.3.

Noisy data.
We now assume that $b$ is of the form $Ax_0 + \epsilon$, where $\epsilon$ is noise and $x_0$ is sparse. More precisely we set $S = \mathrm{supp}\, x_0$, where we assume that $\#S = K$. Let $A_S$ denote the matrix obtained from $A$ by setting columns outside of $S$ to 0. By minor modifications of the proof of Proposition 4.6 we obtain the following.

Proposition 5.6. Let $A$ satisfy $\|A\|_{\infty,col} \leq 1$. If $|x_{0,j}| > \left(1 + \frac{1}{\beta_K}\right)\|\epsilon\|$ for all $j \in S$, then the oracle solution $x' = x_S$ is a strict local minimizer of $\mathcal{K}_{K,reg}$ with $\mathrm{supp}(x') = \mathrm{supp}(x_0)$. It also satisfies $|x'_j| > \|\epsilon\|$ for $j \in S$, $\|Ax' - b\| \leq \|\epsilon\|$ and $\|x' - x_0\| \leq \|\epsilon\|/\beta_K$.
Proof. We assume for simplicity that we work over $\mathbb{R}$, and as in Proposition 4.6 we let $x'$ be the oracle solution. All estimates of Proposition 4.6 go through with minor modifications; for example, (42) is replaced by $|x'_j| \geq |x_{0,j}| - \|\epsilon\|/\beta_K > \|\epsilon\|$, which shows that $\mathrm{supp}(x') = \mathrm{supp}(x_0)$ as well as $|x'_j| > \|\epsilon\|$ for $j \in S$. The only real difference is the proof that $x'$ is a local minimizer of $\mathcal{K}_{K,reg}$, so we consider $\mathcal{K}_{K,reg}(x' + v)$, where as usual we can assume that $x' = \widetilde{|x'|}$. Since $x'$ solves the least squares problem posed initially, the vector $A_S^*(Ax' - b) = A_S^*(A_S x' - b)$ must be 0, and so $\mathcal{K}_{K,reg}(x' + v) = Q_2(\iota_{P_K})(x' + v) + 2\sum_{j=K+1}^n v_j\langle a_j, Ax' - b\rangle + \|Av\|^2 + \mathcal{K}_{K,reg}(x')$.
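Let $k_1$ be such that $x'_{K+1-k_1} = x'_K$ but $x'_{K-k_1} > x'_K$, and note that $k_*$ will depend on $v$ but will always satisfy $k_* \leq k_1$. With this in mind, $Q_2(\iota_{P_K})(x' + v)$ can be written out via (53). Upon inspection there is a lot of cancellation, and the expression reduces to $2x'_K\sum_{j=K+1}^n |v_j|$ plus quadratic terms in $v$. Returning to the expression for $\mathcal{K}_{K,reg}(x' + v)$ and collecting all quadratic contributions from $v$ in a term $q(v)$, we see that $\mathcal{K}_{K,reg}(x' + v) - \mathcal{K}_K(x')$ equals $2\sum_{j=K+1}^n\left(x'_K|v_j| + v_j\langle a_j, Ax' - b\rangle\right) + q(v)$. Since $|\langle a_j, Ax' - b\rangle| \leq \|\epsilon\|$ and $x'_K > \|\epsilon\|$, it is easy to see that there exists some constant $\rho > 0$ giving a corresponding lower bound near 0. As in the proof of Proposition 4.6 we conclude that we must have $\sum_{j=K+1}^n v_j^2 = 0$ in order for $\mathcal{K}_{K,reg}(x' + v) \leq \mathcal{K}_{K,reg}(x')$ to be possible. However, for $v$ with $\mathrm{supp}\, v \subset S$, $Q_2(\iota_{P_K})(x' + v) = 0$ and so $\mathcal{K}_{K,reg}(x' + v) = \|Av\|^2 + \mathcal{K}_{K,reg}(x')$, which exceeds $\mathcal{K}_{K,reg}(x')$ unless $v = 0$, and the proof is complete.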
To go from saying that $x'$ is a strict local minimizer to saying that it is the unique global minimizer is now a short step.
Theorem 5.7. Let A be K-feasible and β 2K > 1/ √ 2. If |x 0,j | > 1 2β 2 2K − 1 then the oracle solution x in the above proposition is a global minimum of K K and K K,reg . Moreover, if A is strictly K-feasible then K K,reg has no other local minimizers.
Proof. Proposition 5.6 clearly applies and ensures that $x'$ is a local minimizer. We now check that (59) applies for $z'$ given by (21), i.e. we want to check that $|z'_{K+1}| < (2\beta_{2K}^2 - 1)|z'_K|$. Note that $|z'_{K+1}| \leq \|\epsilon\|$ by the same estimate as (50). Moreover, since $z' \in \partial G(x')$, Lemma 5.4 implies that it suffices to show that $\|\epsilon\| < (2\beta_{2K}^2 - 1)|x'_K|$, which by (65) follows from a lower bound on $|x_{0,j}|$ that is easily seen to be equivalent to the condition in the statement. The desired conclusions now follow from Theorem 5.5.