Computational approaches to non-convex, sparsity-inducing multi-penalty regularization

In this work we consider numerical efficiency and convergence rates of solvers for non-convex multi-penalty formulations when reconstructing sparse signals from noisy linear measurements. We extend an existing approach, based on reduction to an augmented single-penalty formulation, to the non-convex setting and discuss its computational intractability in large-scale applications. To circumvent this limitation, we propose an alternative single-penalty reduction based on infimal convolution that shares the benefits of the augmented approach but scales more gracefully with the problem size. We establish linear convergence rates for both approaches and characterize their dependence on design parameters. Numerical experiments substantiate our theoretical findings.


Introduction
In many real-life applications one is interested in recovering a structured signal from few corrupted linear measurements. One particular challenge lies in separating the ground truth from pre-measurement noise, since any such corruption is amplified during the measurement process, a phenomenon known as noise folding [2] or the input noise model [1]. It commonly appears in signal processing and compressed sensing applications, where noise is added to the signal both before and after the measurement process occurs. This can be modeled as

y = A(u† + v) + ξ,   (1)

where u† ∈ R^n is an s-sparse original signal that we want to recover, v ∈ R^n is the pre-measurement noise, ξ ∈ R^m is the post-measurement noise, and A ∈ R^{m×n} is the measurement matrix. Note that a signal u ∈ R^n is called s-sparse if its support consists of at most s elements, i.e. |supp(u)| = |{i : u_i ≠ 0}| ≤ s. Information-theoretic bounds state that the number of measurements m required for the exact support recovery of u† from (1) needs to scale linearly with n (footnote 4), which leads to poor compression performance [1].
A number of recent studies [3, 15, 16, 21] aim to mitigate these issues through a multi-penalty regularization framework defined as

min_{u,v∈R^n} (1/2)‖A(u + v) − y‖₂² + (α/q)‖u‖_q^q + (β/p)‖v‖_p^p,   (2)

where α, β > 0 are regularization parameters, 0 < q < 2, and 2 ≤ p < ∞. In particular, to promote sparsity of the u component we choose q ≤ 1. A natural way to minimize (2) is via alternating minimization, starting from u⁰, v⁰ ∈ R^n and then iterating as

u^{k+1} ∈ argmin_{u∈R^n} (1/2)‖A(u + v^k) − y‖₂² + (α/q)‖u‖_q^q,
v^{k+1} = argmin_{v∈R^n} (1/2)‖A(u^{k+1} + v) − y‖₂² + (β/p)‖v‖_p^p.   (3)

Whereas the second problem is differentiable and admits an explicit solution, the first problem requires iterative thresholding for q ≤ 1 [21], for each outer iteration k ∈ N, and becomes non-convex if q < 1. Moreover, alternating minimization does not lend itself to an easy analysis of the convergence rate.
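For concreteness, the alternating scheme (3) can be sketched as follows in the convex case q = 1, p = 2, where the u-subproblem is solved by an inner iterative soft-thresholding (ISTA) loop and the v-subproblem in closed form. This is an illustrative sketch under our own parameter choices, not the implementation used in the experiments below.

```python
import numpy as np

def soft_threshold(x, t):
    """Component-wise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def alternating_minimization(A, y, alpha, beta, outer_iters=50, inner_iters=200):
    """Alternating minimization of the multi-penalty objective for q = 1, p = 2:
        min_{u,v} 0.5*||A(u+v) - y||^2 + alpha*||u||_1 + (beta/2)*||v||^2.
    The u-subproblem is solved by inner soft-thresholded gradient steps (ISTA);
    the v-subproblem is a ridge problem with an explicit solution."""
    m, n = A.shape
    u, v = np.zeros(n), np.zeros(n)
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # safe ISTA step size
    AtA_beta = A.T @ A + beta * np.eye(n)
    for _ in range(outer_iters):
        # u-step: ISTA on 0.5*||A u - (y - A v)||^2 + alpha*||u||_1
        r = y - A @ v
        for _ in range(inner_iters):
            u = soft_threshold(u - mu * A.T @ (A @ u - r), mu * alpha)
        # v-step: explicit ridge solution of the smooth subproblem
        v = np.linalg.solve(AtA_beta, A.T @ (y - A @ u))
    return u, v
```

Note how every outer iteration pays for a full inner thresholding loop; this is exactly the inefficiency quantified in the iteration-count experiment of section 3.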

Contribution
In this work we examine the multi-penalty problem (2) for the case 0 < q ≤ 1 and p = 2. We first show that the augmented approach in [16], which decouples the computation of the u and v components of the solution, can easily be extended to q < 1 to obtain an augmented single-penalty iterative thresholding algorithm providing solutions to (2). Since this includes computing the inverse of a possibly high-dimensional matrix, we suggest an alternative single-penalty iterative thresholding algorithm which is based on an infimal convolution formulation of (2) and sidesteps the computational bottleneck of the augmented approach. We show a linear convergence rate for both approaches, in dependence of the design parameters, and in numerical simulations confirm both the rate analysis and the efficiency gap. In particular, we argue that the benefits of faster convergence rates are sometimes offset by the computational demands, which suggests that a preferred method for solving the optimization problem can be chosen with respect to the size of A.

Footnote 4: Assume for simplicity v ⊥ ξ, ξ ∼ N(0, σ² Id_m), and v ∼ N(0, σ_v² Id_n). We now write (1) as y = Au† + w, where w := Av + ξ represents the effective noise. The covariance matrix of w equals σ² Id_m + σ_v² AA^⊤ =: Q. Assuming AA^⊤ ≈ (n/m) Id_m (as is the case, with high probability, for A with zero-mean, 1/m-variance sub-Gaussian entries), and σ_v ≈ σ, we would have Q ≈ σ²(1 + C·n/m) Id_m, for C > 0. Thus, the variance of the noise rises by a factor proportional to n/m, which when m ≪ n can be substantial.

Related work
In [21] the authors approach (2), for 0 < q ≤ 1 and p = ∞, on separable Hilbert spaces by applying iterative thresholding algorithms to each of the sub-problems in (3), and show convergence of the sequence of iterates to stationary points of the underlying problem. The choice p = ∞ is of special interest when v models uniform pre-measurement noise. However, the authors also show that p = 2 exhibits the best (empirical) performance for the reconstruction of u†, for v modelling various common noise types (including uniform noise). It is for this reason that in this paper we are concerned only with the case p = 2. We add, though, that more general noise types might be of interest in very particular cases, and this is a possible topic for future research. In [16] the authors reduce the optimization problem (2) to a single-penalty regularization through an augmented data matrix, for q = 1 and p = 2, and derive conditions for optimal support recovery. The authors provide theoretical and numerical evidence of superior performance of multi-penalty regularization over standard single-penalty approaches for the sparse recovery of solutions to (1). In [15] a principled, data-driven parameter selection approach is derived for q = 1 and p = 2, based on the Lasso path. Instead of through noise folding, a multi-penalty formulation of the objective function can also be motivated by the recovery of a signal that is a superposition of two components, e.g. a sparse and a smooth component; see [12] and references therein. In spite of these and other advances, rigorous results regarding convergence rates and error analysis for (2) have not been established.
Since we reduce (2) to specific single-penalty problems, corresponding convergence results for classical proximal descent methods are of interest. In [9] important insights on support stability and convergence of iterative thresholding algorithms on separable Hilbert spaces have been collected, while [28] proved linear convergence rates of the iterative thresholding algorithm, under certain conditions, when the underlying thresholding operator is not continuous, though the dependence of the rate on the parameters of the optimization scheme is not explicitly derived. Linear convergence of a single-penalty non-convex regularizer with adaptive thresholding was established in [24], where the influence of the restricted isometry property (RIP) of the design matrix on the convergence constant can be inferred. A further survey of non-convex regularizers for sparse recovery can be found in [25].
Lastly, approaches representing regularizers as infimal convolutions can be found in the context of machine learning and signal processing, cf [17, 18]. Therein primal-dual schemes are examined for optimizing functionals penalized via infimal convolutions. The results, however, require piece-wise convexity, which does not hold in our case.

Notation
We restrict boldface lettering to matrices (uppercase), e.g. A, and vectors (lowercase), e.g. u. The ith entry of a vector u is denoted by u_i. For m ∈ N we denote [m] := {1, …, m}. For 0 < q ≤ ∞ the ℓ_q-norm of a vector u = (u_1, …, u_n) ∈ R^n is denoted by ‖u‖_q. The support set of u ∈ R^n is denoted by supp(u) = {i ∈ [n] : u_i ≠ 0}, and the sign vector sgn(u) = (sgn(u_i))_{i=1}^n is defined component-wise by sgn(u_i) = u_i/|u_i| for u_i ≠ 0 and sgn(u_i) = 0 otherwise. For a matrix M ∈ R^{m×n}, we use ‖M‖ to denote its spectral norm and λ_min(M) to denote its smallest singular value. We denote the n × n identity matrix by Id_n. For I ⊂ [n], M_I ∈ R^{m×|I|} represents the submatrix of M containing the columns indexed by I, and u_I ∈ R^{|I|} denotes the subvector of u containing the entries restricted to I. We denote the corresponding orthogonal projection operator onto I by P_I ∈ R^{|I|×n}, so that P_I u = u_I. When indexed by a set T ⊂ R^n, P_T denotes the orthogonal projection onto T. Finally, the set-valued operator ∂ denotes the limiting Fréchet subdifferential, and dom ∂f = {x : ∂f(x) ≠ ∅} is its corresponding domain when applied to a function f : R^n → R ∪ {∞}, cf [20, 23].

Main results
Consider the multi-penalty problem (2) for p = 2, i.e. minimizing

T^q_{α,β}(u, v) := (1/2)‖A(u + v) − y‖₂² + (α/q)‖u‖_q^q + (β/2)‖v‖₂²,   (4)

and denote a corresponding solution pair by (u^q_{α,β}, v^q_{α,β}). As mentioned above, A ∈ R^{m×n}, y ∈ R^m, α, β > 0 are regularization parameters balancing the contributions of the data-fidelity term and the two regularization terms, and 0 < q ≤ 1. Let us introduce two widely known concepts relevant for the forthcoming discussion. First, the Kurdyka-Łojasiewicz (KŁ) property, a well-established tool for analyzing the convergence, and convergence rates, of proximal descent algorithms [4].

Definition 2.1. A function f : R^n → R ∪ {∞} is said to have the KŁ property at x̄ ∈ dom ∂f if there exist η ∈ (0, +∞], a neighbourhood Ω of x̄, and a continuous concave function ϕ : [0, η) → [0, ∞) with ϕ(0) = 0, ϕ continuously differentiable on (0, η), and ϕ′ > 0 on (0, η), such that for all x ∈ Ω with f(x̄) < f(x) < f(x̄) + η one has ϕ′(f(x) − f(x̄)) · dist(0, ∂f(x)) ≥ 1.

The KŁ property is used to describe the speed of convergence through the desingularizing function ϕ. It has been shown that semi-algebraic functions satisfy the KŁ property with ϕ(s) = cs^{1−θ}, where c > 0 and θ ∈ [0, 1) is called the KŁ constant, which characterizes the convergence speed of proximal gradient descent algorithms [4, theorem 11]. As observed in [8], corollary 3.6 in [19] may be used to determine the KŁ constant of piecewise convex polynomials. Even though ‖·‖_q^q has the KŁ property, cf [5, example 5.4], it is not a piecewise convex polynomial for 0 < q < 1, and thus we cannot apply [19, corollary 3.6] to infer the speed of convergence. We will instead adopt and adapt the ideas from [9, 28]. The second concept relevant for this paper is the RIP, which allows one to control the eigenvalues of small submatrices of A ∈ R^{m×n}, and to characterize measurement operators that allow stable and robust reconstruction of sparse signals from m ≪ n measurements.

Definition 2.2.
A matrix A ∈ R^{m×n} satisfies the restricted isometry property of order s (s-RIP) with constant δ_s ∈ (0, 1) if

(1 − δ_s)‖u‖₂² ≤ ‖Au‖₂² ≤ (1 + δ_s)‖u‖₂²   for all s-sparse u ∈ R^n.

Remark 2.3. For a detailed treatment of the RIP, and measurement operators that fulfill it, we refer the reader to [14].

Augmented formulation
It was observed in [16] that for q = 1 the multi-penalty problem (2) reduces to a single-penalty regularization in which the measurement matrix and the datum are adjusted by the regularization parameter β. We include this result, extended to 0 < q ≤ 1, together with the proof (see section A.1), which is analogous to [16, lemma 1].
Lemma 2.4. Let α, β > 0 and 0 < q ≤ 1. Then (u^q_{α,β}, v^q_{α,β}) is a solution pair of (4) if and only if

v^q_{α,β} = (A^⊤A + β Id_n)^{−1} A^⊤ (y − A u^q_{α,β})   (6)

and u^q_{α,β} is the solution of the augmented problem

min_{u∈R^n} F_β(u) := (1/2)‖B_β u − y_β‖₂² + (α/q)‖u‖_q^q,   (7)

with B_β := (Id_m + β^{−1}AA^⊤)^{−1/2} A and y_β := (Id_m + β^{−1}AA^⊤)^{−1/2} y.

Remark 2.5. The noise folding forward model (1) is in [2] written in a whitened form. Notice that this is particularly related to the augmented problem in (7). On an unrelated note, improving on the analysis in [2, proposition 2] one can show (see lemma B.1) that the coherence of the augmented measurement matrix B_β, defined for a matrix M with columns m_i as coh(M) = max_{i≠j} |⟨m_i, m_j⟩| / (‖m_i‖₂ ‖m_j‖₂), satisfies the bound (8), which controls coh(B_β) in terms of ‖A‖ and β. In the compressed sensing literature the magnitude of the coherence of a matrix is an important measure of quality for measurement matrices, cf [14, section 5]. The bound in (8) thus suggests that for small ‖A‖ or large β, the linear measurement process modelled by B_β is as information preserving as the one modelled by A. In addition, lemma B.2 shows that coh(B_β) behaves like the coherence of a conditioned version of A as β → 0. Let us mention that in practice coh(B_β) behaves well for all β, and even for moderate values of ‖AA^⊤‖.
By lemma 2.4, to estimate the solution pair (u^q_{α,β}, v^q_{α,β}) it is sufficient to first solve (7), and then insert the computed solution into (6). Since the fidelity term (1/2)‖B_β u − y_β‖₂² is smooth and the regularization term ‖u‖_q^q non-convex, the common approach is to use iterative thresholding through a forward-backward splitting algorithm [4, 9]. For F_β and the augmented problem (7), the resulting thresholding iterations are readily written as

Set the initial vector u⁰ ∈ R^n and iterate
u^{k+1} = prox_{μ, (α/q)‖·‖_q^q} ( u^k − μ B_β^⊤ (B_β u^k − y_β) ).   (9)

Each iteration in (9) can be viewed as a thresholded Landweber iteration; we first perform a step in the direction of the negative gradient of the data-fidelity term, and then apply the proximal operator of the remaining non-convex term.
The proximal operator of a function Ψ : R^n → R ∪ {∞} is defined by

prox_{μ,νΨ}(x) ∈ argmin_{u∈R^n} { (1/(2μ))‖u − x‖₂² + ν Ψ(u) },   (10)

where μ, ν > 0. For separable mappings Ψ(u) = Σ_{i=1}^n ψ_i(u_i), (10) can be applied component-wise, and we have (prox_{μ,νΨ}(x))_i = prox_{μ,νψ_i}(x_i). In the general case, the proximal operator (10) could be set-valued, since there might be multiple or even no minima. It can be shown, though, that for 0 < q < 1 the (one-dimensional) proximal operator of |·|^q satisfies

prox_{μ,ν|·|^q}(u) = 0   for |u| < τ_μ,   (11)

where the threshold τ_μ > 0 scales like (μν)^{1/(2−q)}, see [9, lemma 5.1], and it is discontinuous with a jump discontinuity at |u| = τ_μ. Note that the proximal operators in (11) are indeed thresholding operators, and as q goes from 0 to 1 they interpolate between hard- and soft-thresholding operators. Moreover, a closed form of the operator prox_{μ,ν|·|^q} is known only in special cases, namely for q = 1/2 and q = 2/3 [26].
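Since a closed form of prox_{μ,ν|·|^q} is available only for special q, a simple way to inspect its thresholding behavior is to evaluate the one-dimensional operator by brute force. The following sketch is our own illustration (the grid-search approach and all names are assumptions, not the method of [9] or [26]); it makes the jump discontinuity visible and reduces to soft-thresholding at q = 1.

```python
import numpy as np

def prox_scalar_lq(x, lam, q, grid=20001):
    """Brute-force evaluation of argmin_z 0.5*(z - x)^2 + lam*|z|^q for 0 < q <= 1.
    The candidate z = 0 is included explicitly, since below a threshold the
    minimizer jumps to zero (hard/soft-thresholding interpolation)."""
    z = np.linspace(-abs(x) - 1.0, abs(x) + 1.0, grid)
    vals = 0.5 * (z - x) ** 2 + lam * np.abs(z) ** q
    cand = np.append(z, 0.0)
    cvals = np.append(vals, 0.5 * x ** 2)   # objective value at z = 0
    return cand[np.argmin(cvals)]
```

For q = 1 this recovers soft-thresholding, sgn(x)·max(|x| − lam, 0); for q < 1 the output stays exactly zero until |x| crosses the threshold and then jumps, matching the discontinuity discussed above.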
It follows easily that if the step-size μ > 0 is small enough (smaller than ‖B_β‖^{−2}), the difference of iterates in (9) decreases, i.e. ‖u^{k+1} − u^k‖₂ → 0 as k → ∞, see [9, proposition 2.1]. Note that the iterations in (9) are quite different from those given by alternating minimization, where for each k we need to compute u^{k+1} through iterative thresholding. The following lemma makes this more precise; it shows that (9) is equivalent to performing only the first step of iterative thresholding when computing u^{k+1} in (3). The proof can be found in section A.2.
Lemma 2.6. The iterations defined in (9) can be rewritten as

u^{k+1} = prox_{μ, (α/q)‖·‖_q^q} ( u^k − μ A^⊤ ( A(u^k + v(u^k)) − y ) ),   (12)

where v(u^k) is given by the right-hand side of (6) evaluated at u^k, which corresponds to a single proximal gradient descent step of (3) starting at u^k.
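The decoupled gradient of lemma 2.6 can be checked numerically. In the sketch below (our own illustration) we assume the whitened form B_β = (Id_m + β^{−1}AA^⊤)^{−1/2}A and y_β = (Id_m + β^{−1}AA^⊤)^{−1/2}y, and verify that B_β^⊤(B_β u − y_β) coincides with A^⊤(A(u + v(u)) − y) for v(u) from (6).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta = 20, 50, 0.5
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = rng.normal(size=m)
u = rng.normal(size=n)

# whitening matrix (Id_m + beta^{-1} A A^T)^{-1/2} via eigendecomposition
lam_, U = np.linalg.eigh(np.eye(m) + A @ A.T / beta)
W = (U * lam_ ** -0.5) @ U.T
B, y_b = W @ A, W @ y

# v(u): explicit minimizer of the smooth v-subproblem, cf (6)
v = np.linalg.solve(A.T @ A + beta * np.eye(n), A.T @ (y - A @ u))

g_aug = B.T @ (B @ u - y_b)        # gradient of the augmented fidelity term
g_dec = A.T @ (A @ (u + v) - y)    # decoupled form from lemma 2.6
assert np.allclose(g_aug, g_dec)

# the admissible step size in (9) reflects ||B_beta||^{-2} = ||A||^{-2} + beta^{-1}
assert np.isclose(np.linalg.norm(B, 2) ** -2,
                  np.linalg.norm(A, 2) ** -2 + 1.0 / beta)
```

The second assertion also illustrates why step sizes up to ‖A‖^{−2} + β^{−1} appear naturally for the augmented formulation.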
2.1.1. Linear convergence. We now show that the iterates in (9) converge at a linear rate to stationary points u⋆ of T^q_{α,β}, i.e. points such that 0 ∈ ∂T^q_{α,β}(u⋆), and characterize the convergence constant in dependence of the design parameters. Let us emphasize that since our analysis is tailored to ℓ_q-regularization, we derive more explicit guarantees (in terms of the involved parameters) than what would follow by directly applying the more general statements of [28] to the augmented formulation (7). The proof can be found in section A.3.
Theorem 2.7. Let α, β > 0 and 0 < q ≤ 1. Assume the matrix A ∈ R^{m×n} has the RIP of order s with a constant δ_s ∈ (0, 1), and let the stepsize μ satisfy 0 < μ < ‖A‖^{−2} + β^{−1}. Moreover, assume (footnote 6) u⋆ ∈ R^n is such that |supp(u⋆)| ≤ s and the iterates (9) satisfy u^k → u⋆. Define I = supp(u⋆) and d_min = min_{i∈I} |u⋆_i|. Then there exist k₀ ∈ N and a contraction factor γ ∈ (0, 1), depending on μ, α, β, q, δ_s and d_min, such that for all k ≥ k₀ we have ‖u^{k+1} − u⋆‖₂ ≤ γ ‖u^k − u⋆‖₂.

Remark 2.8.

(a) To have linear convergence in theorem 2.7, α has to be chosen below a bound α⋆ depending on μ, q and d_min. This resembles basic assumptions of the main result in [28]. One should thus interpret theorem 2.7 as an additional refinement, better capable of predicting numerical behavior.

(b) Theorem 2.7 suggests that the convergence constant depends on the sparsity of the signal and on properties of A. Namely, if the signal is sparser (and thus δ_s smaller), then the convergence constant decreases. Similarly, the constant decreases if we increase the number of measurements.

(c) Assuming α = cα⋆, for c ∈ (0, 1), it is straightforward to check that the rate in theorem 2.7 becomes minimal by choosing μ ≈ ‖A‖^{−2} + β^{−1}.

(d) Since α and β control the strength of regularization in T^q_{α,β}, their choice depends on the expected noise level. Consequently, when setting α and β one needs to make a trade-off between their regularizing effect and the desired convergence speed.

(e) Computing the augmented matrix B_β requires the inverse square root of an m × m matrix, at a cost of O(m^ρ) operations for ρ ∈ [2.37, 3] [11]. Such a computational cost can be prohibitive for high-dimensional applications.

Footnote 6: The sequence u^k converges provably to a stationary point since T^q_{α,β} is, among other things, coercive and has the KŁ property, cf [5, theorem 5.1]. The assumption thus is not about whether u^k converges but about the specific limit point, which mainly depends on the concrete choice of initialization.

Infimal convolution formulation
To overcome the computational limitations observed above, we consider an alternative approach. Define a new program by

min_{w∈R^n} (1/2)‖Aw − y‖₂² + g(w),   (13)

where the infimal convolution g is given by

g(w) := inf_{u∈R^n} { (α/q)‖u‖_q^q + (β/2)‖w − u‖₂² }.   (14)

For a detailed treatment of infimal convolutions and their properties, see [6]. It is straightforward to check, via the substitution w = u + v, that an equivalence between minimizing (4) and (13) holds.
In order to solve (13) via iterative thresholding (i.e. proximal gradient descent), we need to efficiently evaluate the proximal operator of (14). A helpful observation is that (14) can be interpreted as the Moreau envelope of (α/q)‖·‖_q^q, which for a function f and t > 0 is defined as

M_t f(x) := inf_{u∈R^n} { f(u) + (1/(2t))‖x − u‖₂² } = f(prox_{t,f}(x)) + (1/(2t))‖x − prox_{t,f}(x)‖₂²,

where the last equality only holds if prox_{t,f}(x) ≠ ∅; in our case g = M_{1/β}((α/q)‖·‖_q^q). It has been observed in [7, theorem 6.63] that computing the proximal operator of the Moreau envelope reduces to computing the proximal operator of the underlying function. Though stated only for convex functions in [7], it is straightforward to generalize the result.
Lemma 2.10. Let μ, β > 0 and let g be given by (14). Then, for every x ∈ R^n,

prox_{μ,g}(x) = (1/(1 + μβ)) ( x + μβ prox_{μ+1/β, (α/q)‖·‖_q^q}(x) ).

The proof is in section A.4. Define now the proximal gradient descent for (13) by

Set the initial vector w⁰ ∈ R^n and iterate
w^{k+1} = prox_{μ,g} ( w^k − μ A^⊤ (Aw^k − y) ).   (15)

We denote by u^k = prox_{1/β, (α/q)‖·‖_q^q}(w^k) the sequence of minimizers attaining g(w^k), and set v^k = w^k − u^k. Note that with this notation w^k and u^k can also be characterized via

(w^{k+1}, u^{k+1}) ∈ argmin_{w,u∈R^n} { (1/(2μ))‖w − (w^k − μA^⊤(Aw^k − y))‖₂² + (β/2)‖w − u‖₂² + (α/q)‖u‖_q^q }.   (16)

Unlike (15), the representation in (16) does not yield a practically viable algorithm, since w^k and u^k are not decoupled. It does, though, lend itself to a theoretical analysis of the iterations, cf section A.5.

2.2.1. Linear convergence. Though g in (14) is continuous and separable, i.e. g(w) = Σ_{i=1}^n g_i(w_i), it is not continuously differentiable, so we cannot apply [28] to deduce linear convergence of (15). Nevertheless, using the KKT conditions of the objective functions in (16), we get linear convergence of the iterates in (15) by a similar strategy as in theorem 2.7.

Theorem 2.11. Let α, β > 0 and 0 < q ≤ 1. Assume that 0 < μ < ‖A‖^{−2} and w^k → w⋆. Let I ⊂ [n] denote the support of u⋆ = prox_{1/β, (α/q)‖·‖_q^q}(w⋆) and define d_min = min_{i∈I} |u⋆_i|. Then there exist k₀ ∈ N and a contraction factor γ ∈ (0, 1), depending on μ, α, β, q, ‖A‖ and d_min, such that for all k ≥ k₀ we have ‖w^{k+1} − w⋆‖₂ ≤ γ ‖w^k − w⋆‖₂.

The proof of theorem 2.11 is given in section A.5.
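For q = 1 the convex-combination reduction of lemma 2.10 can be written out explicitly, since the proximal operator of the ℓ₁-norm is soft-thresholding. The following sketch is our own illustration, assuming that reduction; one iteration of (15) is then a gradient step followed by this operator.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1 (the case q = 1)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_infconv(x, mu, alpha, beta):
    """Proximal operator of g(w) = min_u alpha*||u||_1 + (beta/2)*||w - u||^2,
    the Moreau envelope of alpha*||.||_1 with parameter 1/beta, via the
    convex-combination reduction of lemma 2.10 for q = 1:
        prox_{mu,g}(x) = (x + mu*beta * S_{(mu + 1/beta)*alpha}(x)) / (1 + mu*beta),
    where S_t denotes soft-thresholding."""
    u = soft_threshold(x, (mu + 1.0 / beta) * alpha)
    return (x + mu * beta * u) / (1.0 + mu * beta)

def infconv_step(w, A, y, mu, alpha, beta):
    """One proximal gradient descent step of (15): two matvecs plus the prox."""
    return prox_infconv(w - mu * A.T @ (A @ w - y), mu, alpha, beta)
```

Note that evaluating the prox costs only the convex combination on top of a plain thresholding, which is the negligible overhead mentioned in the complexity discussion below.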

Remark 2.12. On the one hand, in theorem 2.11 the assumption on μ and the rate differ from theorem 2.7: there is no influence of β on the admissible step-sizes, and the rate splits into two distinct components, corresponding to the index sets I and I^c. On the other hand, since for μ < ‖A‖^{−2} we have ‖P_I − μA_I^⊤A‖ = ‖P_I(Id_n − μA^⊤A)‖ ≤ ‖Id_n − μA^⊤A‖ < 1, the rate in theorem 2.11 suggests choosing β large so that the first component dominates, in which case the assumptions on μ agree in both theorems. Moreover, this reduces the rate to an expression with the same denominator as in theorem 2.7. In light of (17), we then get linear convergence of (15) under a smallness condition on α analogous to remark 2.8(a). As already discussed in remark 2.8, a trade-off between regularization and convergence rate has to be taken into account when choosing α and β.

Remark 2.13. For q = 1, an alternative viewpoint on (16) is given by the chain of identities (18), where we used [22, equation (3.3)] in the last step, meaning that (w^k) is a proximal gradient descent sequence of ‖∇M_{1/β}(α‖·‖₁)(·)‖₂², the squared ℓ₂-norm of the gradient of the smooth Moreau approximation of α‖·‖₁. From this perspective, multi-penalty regularization resembles a Newton-type method by searching for zeros of the derivative of a smooth approximation of the ℓ₁-norm. However, transferring this intuition to the case q < 1 is non-trivial. On a technical level, the equations in (18) break down in the third line, which does not hold for q < 1 due to the non-convexity of ‖·‖_q^q.

Computational complexity.
While (9) requires computing B_β, which can be costly, the infimal convolution formulation (15) does not incur additional computational costs and thus directly inherits the efficiency and linear convergence of the proximal descent method. Indeed, for a fixed number of iterations the number of operations performed in (15) is O(mn) (the additional convex combination when evaluating the proximal operator by lemma 2.10 is negligible). This is considerably lower than O(m^ρ), for ρ ∈ [2.37, 3], which is the computational cost of the augmented formulation, particularly if m is large. In numerical simulations this effect is easy to observe, cf section 3.
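The gap can be made tangible with a small timing sketch (illustrative only; the sizes and the eigendecomposition-based construction of B_β are our assumptions): forming B_β requires a dense m × m factorization, while one gradient step of (15) costs two matrix-vector products.

```python
import time
import numpy as np

rng = np.random.default_rng(1)
m, n, beta = 1500, 3000, 0.5
A = rng.normal(size=(m, n)) / np.sqrt(m)
w = rng.normal(size=n)
y = rng.normal(size=m)

# one-time cost of the augmented approach: an m x m eigendecomposition
# and dense products, O(m^3) for standard dense solvers
t0 = time.perf_counter()
lam_, U = np.linalg.eigh(np.eye(m) + A @ A.T / beta)
B = (U * lam_ ** -0.5) @ U.T @ A
t_aug = time.perf_counter() - t0

# per-iteration cost of the infimal convolution approach: two matvecs, O(mn)
t0 = time.perf_counter()
grad = A.T @ (A @ w - y)
t_ic = time.perf_counter() - t0

print(f"forming B_beta: {t_aug:.3f}s; one gradient step of (15): {t_ic:.4f}s")
```

On any common machine the first timing dominates by orders of magnitude, mirroring figure 2(b).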

Numerical experiments
We now present experimental results that focus on two aspects of our study. First, we examine the convergence rate of the proposed algorithms, confirming linear convergence and in case of the augmented formulation, the dependence of the convergence constant on the parameters of the problem. Second, we examine their efficiency by studying the overall computational effort on larger scale problems.

Convergence rate
Via the RIP constant δ_s, theorem 2.7 gives a direct dependence of the convergence rate on the sparsity of the solution and the properties of the matrix, whereas theorem 2.11 is harder to interpret: it is straightforward to deduce the existence of parameter regimes in which linear convergence occurs, but hard to quantify the rate in terms of the parameters. While numerical evidence for linear convergence of the infimal convolution formulation is presented in section 3.2, we continue by validating theorem 2.7 in two experiments. In both, we take q = 1/2, and add pre- and post-measurement Gaussian noise terms, v and ξ, with noise level ‖v‖₂ = 0.1. We choose an admissible α according to remark 2.8 and tune it such that the reconstructed signal shares its support size with the ground truth. Both illustrations in figure 1 plot the relative error between the iterates u^k and the stationary point u⋆ against the number of proximal gradient descent steps.
Varying the penalty parameter. In the first experiment we take a Gaussian matrix A ∈ R^{200×600}, a 20-sparse signal u†, and vary β. Theorem 2.7 predicts that smaller values of β allow larger stepsizes, though the convergence constants are (essentially) the same. This effect is readily observed in figure 1(a). Note that we can also observe that for smaller β the algorithm reaches the steep part of the curve faster. This is due to the fact that the convergence of the iterates is initially slow (until the support is identified) and larger step-sizes allow the support size to be reduced faster. The overall speed-up allowed by a smaller β can be up to two-fold, in terms of the number of iterations needed to reach the desired accuracy level.
Varying the measurements. In the second experiment we consider a Gaussian matrix A ∈ R m×600 , for m ∈ {100, 200, 300, 400}, and a 20-sparse signal u † . Varying the number of measurements changes the RIP of the measurement matrix (a larger m decreases δ s , see remark 2.3), and per theorem 2.7 should affect the convergence constant. Figure 1(b) shows exactly that. An analogous effect can be observed for different classes of measurement matrices, such as partial Toeplitz, or partial circulant matrices with Rademacher or Gaussian entries, but those results have not been included for the sake of brevity.
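A self-contained sketch of the varying-m experiment is given below, for the convex surrogate q = 1 instead of q = 1/2 (so that the prox is plain soft-thresholding); all parameters and sizes are illustrative, not those used for figure 1, and the whitened form of B_β is our assumption.

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(7)
n, s, beta, alpha = 600, 20, 0.5, 0.05
u_true = np.zeros(n)
u_true[rng.choice(n, size=s, replace=False)] = rng.normal(size=s) + np.sign(rng.normal(size=s))

errors = {}
for m in (100, 200, 300, 400):
    A = rng.normal(size=(m, n)) / np.sqrt(m)
    y = A @ u_true + 0.01 * rng.normal(size=m)
    # augmented quantities, assuming the whitened form of B_beta
    lam_, U = np.linalg.eigh(np.eye(m) + A @ A.T / beta)
    W = (U * lam_ ** -0.5) @ U.T
    B, y_b = W @ A, W @ y
    mu = 0.99 / np.linalg.norm(B, 2) ** 2     # admissible step size < ||B_beta||^{-2}
    u, iters = np.zeros(n), []
    for _ in range(300):
        iters.append(u.copy())
        u = soft_threshold(u - mu * B.T @ (B @ u - y_b), mu * alpha)
    # relative error to the last iterate, as a proxy for the limit point
    errors[m] = [np.linalg.norm(w - u) / np.linalg.norm(u) for w in iters]
```

Plotting errors[m] on a logarithmic scale should reproduce the qualitative behavior of figure 1(b): eventually linear decay which, per theorem 2.7, is steeper for larger m.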

Computational comparison
Iteration count. In order to provide numerical evidence for our initial statement that alternating minimization is highly sub-optimal, in figure 2(a) we look at the decay of the relative error over the number of basic iterations, i.e. the number of thresholded gradient descent steps, of all three discussed approaches: alternating minimization (3), augmented formulation (9), and infimal convolution (15). In this experiment we use a Gaussian matrix A ∈ R^{100×500}, the original signal is 14-sparse, q = 1/2, and the parameters α, β, and μ are selected so that each method returns a 13-sparse vector. The x-axis refers to the number of times the proximal operator is called, while the y-axis shows the relative error. The considerably worse performance of alternating minimization is due to the fact that it requires (too) many thresholded gradient steps to solve, for each k ∈ N, the sub-problems for the u^k component up to a pre-fixed accuracy ε = 10^{−8}. Thus, the algorithm performs hardly any alternating steps.

Figure 2. Left panel: relative error with respect to the number of times the proximal operator is called, for A ∈ R^{100×500} and a 14-sparse u† ∈ R^{500}. Right panel: average running time of the augmented and infimal convolution formulations when reconstructing a 100-sparse signal u† ∈ R^{5000} from m measurements, with m varying from 1000 to 8000.

Computation time.
To now illustrate the differences between the augmented and infimal convolution formulations in terms of computational complexity, we perform the following experiment. We set the parameters generically to α = 0.02, β = 0.2, and μ = 0.1, and reconstruct a 100-sparse signal u† ∈ R^{5000} from measurements y ∈ R^m, for m varying from 1000 (sub-sampling) to 8000 (over-sampling). We again take q = 1/2, and add pre- and post-measurement noise terms, v and ξ, with noise level 0.1. Averaging over 20 random realizations of u†, we record for the augmented (9) and infimal convolution (15) approaches the time needed to perform 50 iterations. After only this many iterations neither of the two algorithms has converged, though this already suffices to make a point regarding the computational cost, since both algorithms incur the same per-iteration cost (i.e. the gap remains the same) in the remaining iterations. As figure 2(b) shows, the additional computation of B_β in (9) causes a massive additional workload, leading to limited applicability of the augmented approach in large-scale settings. In contrast, the infimal convolution formulation is hardly affected by the increase in the number of measurements. Though the augmented approach tends to converge in fewer iterations, cf figure 2(a), the additional iterations needed by the infimal convolution formulation to reach a comparable level of accuracy do not close the gap in computation time. Note that we do not include alternating minimization here since it requires many more iterations (in the sense of single thresholded gradient descent steps) to achieve reconstruction performance similar to both proximal descents, and hence could not compete with those two algorithms.

Discussion
In the present work we discussed the benefits of multi-penalty regularization for support recovery of signals when pre-measurement noise is amplified by the measurement operator, and the numerical challenges in solving the corresponding variational formulation. Since alternating minimization is sub-optimal for this task in terms of both computational efficiency and theoretical analysis, we proposed a novel reduction to single-penalty regularization based on infimal convolution, and compared this new approach to an existing reduction based on augmented formulations. Moreover, we established linear convergence for both single-penalty reductions and showed that our new approach avoids a computational bottleneck that is unavoidable in the augmented approach and causes significant additional computational workload as the number of measurements increases. There are several interesting open questions left for future work.
First, in remark 2.13 we observed, for q = 1, a connection between the infimal convolution formulation and proximal descent on the ℓ₂-norm of the gradient of a Moreau-regularized ℓ₁-functional. As we have not seen a comparable relation in the context of multi-penalty regularization so far, we are curious whether this observation can be extended to the case 0 < q < 1. If so, this might provide valuable insights into non-convex optimization.
Second, as the reader might have noticed, great parts of the arguments we used (support stabilization, sign stabilization, etc.) are not restricted to finite dimensions. In light of more general settings of multi-penalty regularization in [21] and single-penalty regularization in [9], it would be fruitful to transfer our findings to general separable Hilbert spaces as well.
Third, we mention that when using the infimal convolution based approach, in some experiments it was possible to choose μ much larger than suggested by theorem 2.11, while still observing reliable convergence of the iteration. We wonder whether there is an alternative proof leading to a relaxed condition on μ, resembling the assumption in theorem 2.7.
Let us conclude by emphasizing that the infimal convolution formulation can equally be applied when regularizers other than the ℓ_q-norm are used in the multi-penalty problem, e.g. the smoothly clipped absolute deviation [13], the minimax concave penalty [29], and the log-sum penalty [10]. In those cases the more general single-penalty rate analysis in [28] should prove useful as a tool.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.
Appendix A

A.1. Proof of lemma 2.4

The Woodbury identity for invertible matrices V ∈ R^{m×m}, W ∈ R^{n×n} and matrices M₁ ∈ R^{m×n}, M₂ ∈ R^{n×m} reads

(V + M₁W^{−1}M₂)^{−1} = V^{−1} − V^{−1}M₁(W + M₂V^{−1}M₁)^{−1}M₂V^{−1}.   (19)

Minimizing (4) over v for fixed u yields v(u) as in (6). Using (19), this gives the reduced data-fidelity term. Plugging this expression back into T^q_{α,β}(u, v(u)), and extracting the square root, we have T^q_{α,β}(u, v(u)) = F_β(u). Minimizing over u and using the following simple observation gives the conclusion.
Proof. Let u^q_{α,β} be a local minimizer of (7) and assume there exists a sequence (u^k, v^k) converging to (u^q_{α,β}, v(u^q_{α,β})) with T^q_{α,β}(u^k, v^k) < T^q_{α,β}(u^q_{α,β}, v(u^q_{α,β})). Then F_β(u^k) = T^q_{α,β}(u^k, v(u^k)) ≤ T^q_{α,β}(u^k, v^k) < F_β(u^q_{α,β}), where the first inequality follows from the minimality of v(u^k). This contradicts the assumption that u^q_{α,β} is a local minimizer of (7).

A.2. Proof of lemma 2.6
First note that B_β^⊤(B_β u − y_β) = A^⊤(Id_m + β^{−1}AA^⊤)^{−1}(Au − y). Hence, it suffices to show that (Id_m + β^{−1}AA^⊤)^{−1}(Au − y) = A(u + v(u)) − y. Extracting A from the left and using the Woodbury identity (19) with M₁ = A, M₂ = A^⊤, W = βId_n, and V = Id_m, the conclusion follows.

A.3. Proof of theorem 2.7
In order to prove theorem 2.7, we have to control the eigenvalues of B_β^⊤B_β, which characterize the growth of the data-fidelity term in (7).

Lemma A.2. For B_β ∈ R^{m×n} defined as in lemma 2.4, the eigenvalues of B_β^⊤B_β are of the form σ²/(1 + σ²/β) for singular values σ of A; in particular, ‖B_β‖² = (‖A‖^{−2} + β^{−1})^{−1}, which is the Lipschitz constant of the gradient of the augmented data-fidelity term (1/2)‖B_β u − y_β‖₂².

Proof. Let A = UΣV^⊤ denote the SVD of A. This gives B_β^⊤B_β = VΣ^⊤(Id_m + β^{−1}ΣΣ^⊤)^{−1}ΣV^⊤, implying the second claim.
We can now show that all but finitely many iterates (u^k)_{k=1}^∞ generated by (9) share the same support and sign pattern. The proof is standard and follows [9].

Lemma A.3 (Sign and support stability). Assume μ < ‖B_β‖^{−2}. Then the successive differences ‖u^{k+1} − u^k‖₂ converge to zero, and all but finitely many iterates u^k share the same finite support and the same signs.
Proof of theorem 2.7. By lemma A.3 there exists k₀ such that for all k ≥ k₀ the support of u^k is finite, and the support and sign of u^k equal those of u⋆. Thus, by [9, proposition 2.3], u⋆ is a fixed point of (9). Denote I = supp(u⋆) with |I| ≤ s. The definition of the proximal operator in (10) and the Karush-Kuhn-Tucker (KKT) conditions yield optimality conditions for u^{k+1} and for u⋆. Subtracting the two conditions on the index set I, and denoting ψ(u) = (1/q)‖u‖_q^q with ψ′(u) = (sgn(u_i)|u_i|^{q−1})_{i∈[n]} acting entry-wise, we obtain an identity relating u^{k+1}_I − u⋆_I to u^k_I − u⋆_I. Note that since k ≥ k₀ we have sgn(u⋆_I) = sgn(u^{k+1}_I) and ‖u⋆ − u^k‖₂ = ‖u⋆_I − u^k_I‖₂. Taking the inner product of this identity with u^{k+1}_I − u⋆_I, and applying the Cauchy-Schwarz inequality, we can bound ‖u^{k+1}_I − u⋆_I‖₂ by ‖u^k_I − u⋆_I‖₂ up to the term involving ψ′. Since ψ is twice differentiable away from zero, and u^{k+1} and u⋆ have the same sign and support, the mean value theorem gives ψ′(u^{k+1}_i) − ψ′(u⋆_i) = ψ″(C^{k+1}_i)(u^{k+1}_i − u⋆_i), where C^{k+1}_i lies between u^{k+1}_i and u⋆_i, and ψ″(u) = (q − 1)|u|^{q−2}. Since u^k → u⋆, we may assume k₀ sufficiently large to guarantee |u^k_i| ≥ d_min/2 for all i ∈ I and k ≥ k₀. Together with the RIP of A this yields the claim.
Lemma A.4 (Sign and support stability). Assume μ < ‖A‖^{−2}. Then the successive differences ‖w^{k+1} − w^k‖₂, ‖u^{k+1} − u^k‖₂, and ‖v^{k+1} − v^k‖₂ converge to zero, and all but finitely many iterates u^k share the same finite support and the same signs.
Proof. First, note that g is a proper and coercive function. Second, as g(w) = inf_{u∈R^n} f(u, w), for f continuous, we obtain continuity of g at any point w ∈ R^n, since by coercivity of f the infimum can be restricted to a finite ball, and the infimum of continuous functions over a compact set is continuous. Consequently, by [9, corollary 2.1] and the assumption on μ, we have ‖w^{k+1} − w^k‖₂ → 0 for w^{k+1} = prox_{μ,g}(w^k − μA^⊤(Aw^k − y)). By the KKT conditions of (24), we obtain

0 = (w^{k+1} − w^k) + μA^⊤(Aw^k − y) + βμ v^{k+1},
0 = (w^k − w^{k−1}) + μA^⊤(Aw^{k−1} − y) + βμ v^k.
Subtracting the two equations gives ‖v^{k+1} − v^k‖₂ → 0, and u^k = w^k − v^k yields ‖u^{k+1} − u^k‖₂ → 0. The second claim follows as in lemma A.3, since u^k is a thresholded version of w^k.
Proof of theorem 2.11. First note that w^k → w⋆ implies via lemma A.4 that u^k → u⋆ and v^k → v⋆. Furthermore, w⋆ is a fixed point of (15), by [9, proposition 2.3]. By lemma A.4 there exists k₀ such that for all k ≥ k₀ the support of u^k is finite, and the support and sign of u^k equal those of u⋆. Denote I = supp(u⋆). By the KKT conditions of (24), we get optimality conditions for w^{k+1} and w⋆. For ψ(u) = (1/q)‖u‖_q^q with ψ′(u) = (sgn(u_i)|u_i|^{q−1})_{i∈[n]} acting entry-wise, this implies

(w^{k+1} − w⋆)_I + αμ(ψ′(u^{k+1}) − ψ′(u⋆))_I = (w^k − w⋆)_I − μA_I^⊤A(w^k − w⋆)

and

(1 + μβ)(w^{k+1} − w⋆)_{I^c} = (w^k − w⋆)_{I^c} − μA_{I^c}^⊤A(w^k − w⋆).
Repeating the steps as in theorem 2.7, from (25) we obtain bounds on ‖(w^{k+1} − w⋆)_I‖₂ and ‖(w^{k+1} − w⋆)_{I^c}‖₂ in terms of ‖w^k − w⋆‖₂. Squaring and summing the last two estimates, the claim follows by orthogonality of (w^{k+1} − w⋆)_I and (w^{k+1} − w⋆)_{I^c}.