A note on the minimization of a Tikhonov functional with $\ell^1$-penalty

In this paper, we consider the minimization of a Tikhonov functional with an $\ell_1$ penalty for solving linear inverse problems with sparsity constraints. One of the many approaches used to solve this problem uses the Nemskii operator to transform the Tikhonov functional into one with an $\ell_2$ penalty term but a nonlinear operator. The transformed problem can then be analyzed and minimized using standard methods. However, by the nature of this transform, the resulting functional is only once continuously differentiable, which prohibits the use of second order methods. Hence, in this paper, we propose a different transformation, which leads to a twice differentiable functional that can now be minimized using efficient second order methods like Newton's method. We provide a convergence analysis of our proposed scheme, as well as a number of numerical results showing the usefulness of our proposed approach.


Introduction
In this paper, we consider linear operator equations of the form where A : 2 → 2 is a bounded linear operator on the (infinite-dimensional) sequence space 2 . Note that by using a suitable basis or frame, operator equations between separable function spaces such as L p , Sobolev, or Besov spaces can all be transformed into problems of the form (1.1). We assume that only noisy data y δ satisfying are available, where . 2 denotes the standard 2 -norm. We are in particular interested in sparse solutions of the problem, to which end we consider the minimization of the following Tikhonov functional where . 1 denotes the standard 1 -norm. This problem has already been thoroughly studied analytically (compare with Section 2) as well as numerically (see Section 3 for an overview of previously proposed methods). However, the efficient minimization of the Tikhonov functional T α,δ still remains a field of active study, especially since the presence of the 1 -norm makes the functional non-differentiable. One approach to circumvent this issue was proposed in [32], where the authors considered a transformation of the Tikhonov functional into one which is once differentiable. In this paper, we extend their transformation idea by using an approximate transformation approach in order to end up with a functional that is also twice differentiable. This then allows the application of efficient second-order iterative methods for carrying out the minimization. This paper is organized as follows: In Section 2, we review known regularization results concerning sparsity regularization via the Tikhonov functional (1.3) and in Section 3, we discuss some of the existing methods for its minimization. In Section 4, we consider the transformation approach presented in [34] and its extension for obtaining twice differentiable functionals, for which we provide a convergence analysis. Furthermore, in Section 5, we present numerical simulations based on a tomography problem to demonstrate the usefulness of our approach. Finally, a conclusion is given in Section 7.

Sparsity Regularization
In this section, we recall some basic results (adapted from [36,Section 3.3]) concerning the regularization properties of Tikhonov regularization with sparsity constraints.
First of all, concerning the well-definedness of minimizers of T α,δ and their stability with respect to the data y δ , we get the following Theorem 2.1. Let A : 2 → 2 be weakly sequentially continuous, α > 0 and y δ ∈ 2 . Then there exists a minimizer of the functional T α,δ defined in (1.3). Furthermore, the minimzation is weakly subsequentially stable with respect to the noisy data y δ . Theorem 2.2. Let A : 2 → 2 be weakly sequentially continuous, assume that the problem (1.1) has a solution in 1 , and let α(δ) : (0, ∞) → (0, ∞) be chosen such that Moreover, assume that the sequence δ k converges to 0, that y k := y δ k satisfies the estimate y − y k 2 ≤ δ k , and that x k is a sequence of elements minimizing T α(δ k ),y k . Then there exists an 1 -minimum-norm solution x † and a subsequence x kn of x k such that Proof. This is an immediate consequence of [36,Theorem 3.49].
Note that typically, one only gets weak subsequential convergence of the minimizers of the Tikhonov functional to the minimum-norm solution. However, the above theorem shows that for sparsity regularization, one even gets strong subsequential convergence.
Furthermore, note that if A is injective, the 1 -minimizing solution is sparse (i.e., only finitely many of its coefficients are non-zero) and satisfies a variational source condition, then it is possible to prove optimal convergence rates under the a-priori parameter choice α(δ) ∼ δ, both in Bregman distance and in norm [36,Theorem 3.54].

Minimization of the Tikhonov functional
In this section, we review some of the previously proposed methods for the minimization of (1.3). Due to the non-differentiability of the 1 -norm in zero, this minimization problem is a non-trivial task.
Among the first and perhaps the most well-known method is the so-called Iterative Shrinkage Thresholding Algorithm (ISTA), proposed in [10]. Each iteration of this algorithm consists of a gradient-descent step applied to the residual functional, followed by a thresholding step, which leads to the iterative procedure where S αω denotes the component-wise thresholding (shrinkage) operator It was shown that the iterates generated by ISTA converge to a minimizer of the Tikhonov functional (1. 3) under suitable assumptions [5,10]. Unfortunately, this converge can be very slow, which motivated the introduction of Fast ISTA (FISTA) in [2]. Based on Nesterov's acceleration scheme [26], the iterates of FISTA are defined by The convergence analysis presented in [2] as well as many numerical experiments show that the iterates of FISTA converge much faster than those of ISTA, hence making it more practical. This speedup also holds for a generalized version of FISTA, which is applicable to composite (convex) minimization problems [1]. Applied to problem (1.3), it has the same form as (3.2), but with the computation of z δ k replaced by where the choice of α = 3 is common practice.
In the context of compressed sensing, where one tries to recover signals from incomplete and inaccurate measurements in a stable way, minimization problems of the form (1.3) have been analyzed and numerically treated in finite dimensions (see e.g. [8,11,12]). Also in finite dimensions, the minimization problem (1.3) has been tackled sucessfully by using various Krylov-subspace techniques (see e.g. [7,20,24]).
In infinite dimensions, a number of different minimization algorithms for (1.3) have been proposed. For example, the authors of [31][32][33] have proposed a surrogate functional approach, while the authors of [4,6] and [16] have proposed conditional gradient and semi-smooth Newton methods, respectively.
Of particular interest to us is the minimization approach presented in [34,37], which we discuss in detail in Section 4 below. It is based on a nonlinear transformation utilizing a Nemskii operator, which turns the Tikhonov functional (1.3) into one with a standard 2 -norm penalty, but with a nonlinear operator. Since the resulting transformed functional is continuously Fréchet differentiable, one can use standard first-order iterative methods for its minimization. Unfortunately, the functional is not twice differentiable, which prohibits the use of second-order methods, known for their efficiency. Circumventing this shortcoming is the motivation for the minimization approach based on an approximate transformation presented below.

Transformation Approach
As described above, the authors of [34,37] considered a transformation approach for minimizing the Tikhonov functional (1.3). This approach is based on a nonlinear transformation of the functional using the Nemskii operator where the function η p,q is defined by The operator N p,q has for example been used in the context of maximum entropy regularization [14]. Since here we need it only for the special case p = 1 and q = 2, we now define the operator and the function The operator N is continuous, bounded, bijective, and Fréchet differentiable with and is used to define the following nonlinear operator This is then used to transform the problem of minimizing (1.3) into a standard 2 − 2 minimization problem, as shown by the following The following two problems are equivalent: Proof. The proof can be found in [34].
Note that the operator F is nonlinear even if A is linear. However, using the transformed operator has the advantage that the resulting functional J α,δ is differentiable.
Proposition 4.2. The operator F and the functional J α,δ defined in (4.6) and (4.8), respectively, are continuously Fréchet differentiable, with Proof. This is an immediate consequence of the definition of J α,δ and the fact that A is linear and N is differentiable.
Due to the above result, it is now possible to apply gradient based (iterative) methods for minimizing the transformed functional J α,δ , and thus to compute a minimizer of the functional T α,δ , which itself is not differentiable.
Unfortunately, the transformed functional J α,δ is not twice differentiable, due to the fact that N is not twice differentiable (at zero). This prohibits the use of second order methods like Newton's method, which are known to be very efficient in terms of iteration numbers. Hence, we propose to approximate N by a sequence of operators N ε which are twice continuously differentiable, and to minimize, instead of J α,δ , the functional where we define the operator F ε by for a suitable approximation N ε of the operator N . This approximation is based on suitable approximations η ε of the functions η, which we introduce in the following Obviously, η ε → η as ε → 0 and furthermore, we get the following Lemma 4.3. The functions η ε defined by (4.11) are twice continuously differentiable.
Proof. It follows from its definition that η ε is everywhere continuous and that Again it follows that η ε is everywhere continuous and that which is again continuous everywhere, which concludes the proof.
We now use the functions η ε to build the operators N ε via the following For all ε > 0 we define the operators Concerning the well-defined and boundedness of N ε , we have the following and are therefore well-defined as operators from 2 → 2 .
Proof. Let ε > 0 be arbitrary but fixed and take x = (x k ) k∈N ∈ 2 . We have that Therefore, we get that from which we derive that which immediately yields the assertion.
The operators N ε are also continuous, as we see in the following Proposition 4.5. The operators N ε : 2 → 2 defined by (4.12) are continuous.
Proof. Let ε > 0 and x = (x k ) k∈N ∈ 2 be arbitrary but fixed, and consider a sequence x n = (x n k ) k∈N ∈ 2 converging to x. It follows that the norm of x n is uniformly bounded, i.e., there exists a constant c > 0 such that x n ≤ c for all n, from which it also follows that |x n k | ≤ c for all k and n. Furthermore, since the function η ε is continuously differentiable, it follows that it is Lipschitz continuous on bounded sets. This implies that there exists a Lipschitz constant L > 0 such that (4.14) Hence, we get that (4.15) and therefore, which shows the continuity of N ε and concludes the proof.
By their construction, the operators N ε are also twice differentiable, as we see in Proposition 4.6. The operators N ε : 2 → 2 defined by (4.12) are twice continuously Fréchet differentiable, with Proof. This follows from the definition of N ε together with Lemma 4.3.
The approximation properties of the operators N ε are studied in the following Proposition 4.7. For N and N ε be defined by (4.3) and (4.12), respectively, it holds that Proof. Let ε > 0 and x ∈ 2 be arbitrary but fixed. Then it holds that from which it follows that and therefore from which the statement immediately follows.
The above result immediately implies an approximation result for the operators F ε .
Corollary 4.8. Let A : 2 → 2 be a bounded and linear operator and let F and F ε be defined by (4.6) and (4.10), respectively. Then it holds that Proof. By the definition of F and F ε , we have that which, together with Proposition 4.7 now yields the assertion.
Other important properties of the operators F and F ε are collected in the following Proposition 4.9. Let A : 2 → 2 be a bounded linear operator. Then the operators F and F ε defined by (4.6) and (4.10), respectively, are continuous and weakly sequentially closed.
Proof. Since A and, due to Proposition 4.5, N ε are continuous, by its definition also F ε is continuous. In order to show the weak sequential closedness of F ε , note that since its definition space is the whole of 2 , it suffices to show that F ε is weakly continuous. For this, take an arbitrary sequence x n ∈ 2 converging weakly to some element x ∈ 2 . Since in 2 a sequence converges weakly if and only if it converges componentwise and its norm is bounded [9], it follows from the continuity and boundedness of N ε (Lemma 4.4) and Proposition 4.5) that N ε (x n ) converges weakly to N ε (x). Now, as a bounded linear operator, A is also weakly sequentially continuous. Hence, since F ε = A • N ε , it follows that F ε (x n ) converges weakly to F ε (x), which establishes its weak sequential continuity and consequentially also its weak sequential closedness. For the operator F , these result have already been shown in [34]. However, noting that Lemma 4.4 and Proposition 4.5 also hold for the limit case ε = 0, they also follow the same way as above.
Furthermore, the differentiability of N ε immediately translates into the following Proposition 4.10. The operators F ε and thus the functionals J ε α,δ defined in (4.10) and (4.9), respectively, are twice continuously Fréchet differentiable, where Proof. This follows from the definition of F ε and J ε α,δ together with Proposition 4.6. We now consider the problem of minimizing the Tikhonov functional J ε α,δ , whose minimizers we denote by x δ α,ε . Due to the above results, the classical analysis of Tikhonov regularization for nonlinear operators is applicable (see for example [13,15]), and we immediately get the following Theorem 4.11. Let A : 2 → 2 be a bounded, linear operator and let F ε be defined by (4.10).Then for each α > 0, a minimizer x δ α,ε of the functional J ε α,δ defined in (4.9) exists. Furthermore, the minimization of J ε α,δ is stable under perturbations of y δ .
Proof. Since by Proposition 4.9, the operator F ε is continuous and weakly sequentially closed, this follows immediately from [13, Theorem 10.2].
Next, we are interested in the behaviour of the minimizers x δ α,ε as ε → 0. Given a suitable coupling of the noise level δ and the parameter ε, we get the following Theorem 4.12. Assume that F (x) = y has a solution and let α(δ) and ε(δ) satisfy Then x δ α(δ),ε(δ) has a convergent subsequence. Moreover, the limit of every convergent subsequence is a minimum-norm solution of F (x) = y. Furthermore, if the minimumnorm solution x † is unique, then Proof. The proof of this theorem follows the same lines as the classical proof of convergence of Tikhonov regularization [13] and the proof for the case that the operator is approximated by a series of finite dimensional operators [27,29] (in which case a slightly stronger condition than what we can derive from Proposition 4.7 was used). Hence, we here only indicate the main differences in the proof. Note first that due to Proposition 4.7, it follows that This, together with x δ α,ε being a minimizer of J ε α,δ implies that (4.23) Together with (4.20), this implies the boundedness of x δ α,ε and lim δ→0 F ε (x δ α,ε ) − y δ 2 = 0 .
Hence, since then there holds the weak sequential closedness of F implies the convergence of a subsequence of x δ α,ε to a solution of F (x) = y. The remainder of the proof then follows analogously to the one of [13,Theorem 10.3] and is therefore omitted here.
The above result shows that minimizing J ε α,δ instead of J α,δ to approximate the solution of F (x) = y makes sense if ε and the noise level δ are suitably coupled, for example via ε ∼ δ. Furthermore, the assumption that F (x) = y solvable, is for example satisfied if Ax = y has a solution belonging not only to 2 but also to 1 , i.e., is sparse.
Remark. Following the line of the proofs of classical Tikhonov regularization results, it is also possible to derive convergence rate results under standard assumptions. Furthermore, the above analysis also holds for nonlinear operators A which are Lipschitz continuous, since then Corollary 4.8 also holds.

Minimization methods for the Tikhonov functional
In the previous section, we established existence, stability, and convergence of the minimizers of J α,δ and J ε α,δ under standard assumptions. However, there still remains the question of how to actually compute those minimizers in an efficient way.
One way to do this is to interpret the minimization of J α,δ and J ε α,δ as Tikhonov regularization for the nonlinear operator equations F (x) = y and F ε (x) = y, respectively, and to use iterative regularization methods for their solution. Since both the operator F and F ε are continuously Fréchet differentiable, iterative regularization methods like Landweber iteration [23], TIGRA [30], the Levenberg-Marquardt method [17,21] or iteratively regularized Gauss-Newton [3,22] are applicable. Of course, as all of those methods only require a once differentiable operator, it makes sense in terms of accuracy to apply them for the operator F and not for the approximated operator F ε .
Another way is to use standard iterative optimization methods for the (well-posed) problem of minimizing J α,δ or J ε α,δ . In particular, since we have derived in the previous section that J ε α,δ is twice continuously Fréchet differentiable, efficient second order methods like Newton's method are applicable for its minimization.
In this section, we introduce and discuss some details of the minimization methods used to obtain the numerical results presented in Section 6 below.

Gradient descent, ISTA and FISTA
We have seen that the Tikhonov functional J α,δ defined in (4.8) is continuously Fréchet differentiable. Hence, it is possible to apply gradient descent for its minimization.
For this, note first that since N (x)h is a linear operator, it can be written as where G(x) is the infinite dimensional 'matrix' representation of N (x) given by which is called the gradient of N . Similarly, there is an (infinite-dimensional) matrix representation of J α,δ (x), i.e., the gradient ∇J α,δ (x) of J α,δ (x), which is given by where, with a small abuse of notation, A denotes the (infinite-dimensional) matrix representation of the linear operator A, and A T denotes its transpose.
Using the above representations, we can now write the gradient descent algorithm for minimizing J α,δ in the well-known form where ω n is a sequence of stepsizes. If the stepsizes are chosen in a suitable way, for example via the Armijo rule [19], the iterates converge to a stationary point of J α,δ (see e.g. [19,Theorem 2.2]). In order to stop the iteration, we employ the well-known discrepancy principle, i.e., the iteration is terminated with index n * = n * (δ, y δ ), when for the first time where τ > 1 is fixed. Note that since the Tikhonov functional may have several (local and global) minima, convergence to a global minimum is only guaranteed if a sufficiently good initial guess is chosen. The (infinite-dimensional) matrix representations introduced above can also be used to rewrite ISTA (3.1) into the following form which immediately also translates to a similar rewriting of FISTA defined in (3.2).

The Levenberg-Marquardt method
It is well-known that gradient based methods like gradient descent or ISTA are quite slow with respect to convergence speed. Although it is possible to speed them up by using suitable stepsizes (see for example [28,35]) or acceleration schemes like FISTA, it is often advantageous to use second-order methods instead. One such method is the Levenberg-Marquardt method [17,21], which is given by Although this is a second-order method, it only requires the operator F to be once continuously Fréchet differentiable. Using again the (infinite-dimensional) matrix representation of N (x)h from (5.1), the method can be rewritten into the following form In order to obtain convergence of this method, one needs, among other things, a suitably chosen sequence α n converging to 0 as well as a sufficiently good initial guess [17]). As a stopping rule, one usually also employs the discrepancy principle (5.3). The Levenberg-Marquardt method typically requires only very few iterations to satisfy the discrepancy principle. However, in each iteration step the linear operator F (x δ n ) * F (x δ n ) + α n I has to be inverted, which might be costly for some applications. This can be circumvented, though, via approximating the result of this inversion by the application of number of iterations of the conjugate gradient method.
It is possible to add an additional regularization term to the Levenberg-Marquardt method, thereby ending up with the so-called iteratively-regularized Gauss-Newton method [3,22]. Typically behaving very similar in practice, this method can be proven to converge under slightly weaker assumptions than the Levenberg-Marquardt method.

Numerical Examples
In this section, we demonstrate the usefulness of our proposed approximation approach on a numerical example problem based on Computerized Tomography (CT). In particular, we focus on how the Newton approach for the minimization of J ε α,δ introduced in Section 5.3 above performs in comparison to the other methods presented in Section 3.
In the medical imaging problem of CT, one aims to reconstruct the density function f inside an object from measurements of the intensity loss of an X-ray beam sent through it. In the 2D case, for example if one scans a cross-section of the human body, the relationship between the intensity I 0 of the beam at the emitter position and the intensity I L at the detector position is given by [25] log Thus, if one defines the well-known Radon transform operator the reconstruction problem (6.1) can be written in the standard form Expressing f in terms of some basis or frame, and noting that typically one considers objects whose density is equal to 0 on large subparts, the above problem precisely fits into the framework of 1 sparsity regularization considered in this paper.

Discretization and Implementation
In order to obtain a discretized version of problem (6.1), we make use of the toolbox AIR TOOLS II by Hansen and Jorgensen [18]. Therein, the density function f is considered as a piecewise constant function on an m × m pixel grid (see Figure 6.3 for examples). With this, equation (6.1) can be written in the discretized form where the x j denote the value of f at the j-th pixel, I L denote the emitted and detected intensity of the i-th ray, respectively, and a ij denotes the length of the path which it travels through within the j-th pixel cell. Note that since any given ray only travels through relatively few cells, most of the coefficients a ij are equal to 0 and thus the matrix A is sparse. Collecting the coefficients a ij into a matrix A, equation (6.2) can be written as a matrix-vector equation of the form Specifying all required parameters as well as the exact solution which one wants to reconstruct, the toolbox provides both the matrix A and the right-hand side vector y. For our purposes, we used the toolbox function paralleltomo, creating a parallel beam tomography problem with (the suggested default values of) 180 angles and 70 parallel beams for each of them. For the number of pixels we used m 2 = 50 2 , which altogether leads to the dimension 12600 × 2500 for the matrix A. The exact solution (the Shepp-Logan phantom) is depicted in Figure 6.3. In order to obtain noisy data, we used y δ := y +δ y 2 r, where r is a randomly generated, normed vector, andδ denotes the relative noise level.
The implementation of the methods introduced in Section 3 was done in a straightforward way by using their infinite-dimensional matrix representations but for the now finite dimensional matrices. The iterations were stopped using the discrepancy principle (5.3) with the choice τ = 1.1 for all methods, and for the approximation parameter ε in the definition of J ε α,δ , we have used the choice ε = 10 −4 δ, which is conforming with the theory developed above. The stepsize ω in ISTA and FISTA was chosen as a constant based on the norm of A, and for the gradient descent method (5.2), the stepsizes ω n were chosen via the Armijo rule. In the Levenberg-Marquardt method (5.4), we chose α n = 0.6 n δ, which is a sequence tending to 0 in accordance with the convergence theory. All computations were carried out in Matlab on a desktop computer with an Intel Xeon E5-1650 processor with 3.20GHz and 16 GB RAM.

Numerical Results
In this section, we present the results of applying the iterative methods introduced in Section 3 to the tomography problem described above.
The first results, which are related to the computational efficiency of the different methods, are presented in Figure 6.1. One can clearly see that regardless of the noise levelδ, the Newton method and the Levenberg-Marquardt method outperform the gradient based methods, both in terms of computation time and number of iterations n * required to meet the discrepancy principle. Furthermore, as was to be expected, FISTA also performs much better than both ISTA and the gradient descent method. Note also that with the Levenberg-Marquardt and the Newton method, one can satsify the discrepancy principle also for very small noise levels, which becomes infeasible for the other methods due to the too large runtime which would be required for that.
The results depicted in Figure 6.2 show that not only do the Levenberg-Marquardt and the Newton method require less iterations and computation time to satisfy the discrepancy principle, the resulting approximations also have a comparable and even somewhat smaller relative error than for the gradient based methods. This is of course partly due to the fact that each iteration step of those methods is 'larger' than in the other methods, which nevertheless turns out to be an advantage in our case. The resulting approximate solutions for 10% relative noise are shown in Figure 6

Conclusion
In this paper, we presented a minimization approach for a Tikhonov functional with 1 penalty for the solution of linear inverse problems with sparsity constraints. The employed approximate transformation approach based on a Nemskii operator was mathematically analysed within the framework of ill-posed problems, and the fact that the resulting transformed functional is twice continuously Fréchet differentiable served as a basis for the construction of an effective minimization algorithm using Newton's method. Numerical example problems based on the medical imaging problem of computerized tomography demonstrated the usefulness of the proposed approach.

Support
The authors were funded by the Austrian Science Fund (FWF): F6805-N36.