Inexact Fixed-Point Proximity Algorithm for the ℓ0 Sparse Regularization Problem

We study inexact fixed-point proximity algorithms for solving a class of sparse regularization problems involving the ℓ0 norm. Specifically, the ℓ0 model has an objective function that is the sum of a convex fidelity term and a Moreau envelope of the ℓ0 norm regularization term. Such an ℓ0 model is non-convex. Existing exact algorithms for solving these problems require the availability of closed-form formulas for the proximity operator of the convex functions involved in the objective function. When such formulas are not available, numerical computation of the proximity operator becomes inevitable, leading to inexact iteration algorithms. We investigate in this paper how the numerical error at every step of the iteration should be controlled to ensure global convergence of the resulting inexact algorithms. We establish a theoretical result that guarantees that the sequence generated by the proposed inexact algorithm converges to a local minimizer of the optimization problem. We implement the proposed algorithms for three applications of practical importance in machine learning and image science: regression, classification, and image deblurring. The numerical results demonstrate the convergence of the proposed algorithm and confirm that local minimizers of the ℓ0 models found by the proposed inexact algorithm outperform global minimizers of the corresponding ℓ1 models in terms of approximation accuracy and sparsity of the solutions.


Introduction
Sparse learning plays a pivotal role in today's era of big data due to its capacity for significantly decreasing the computational burden through sparse solution representation. Natural images or data streams are often inherently sparse in certain bases or dictionaries, under which they can be approximately expanded by only a few significant elements carrying the most relevant information [7,31,42,49]. Sparsity-promoting regularization is a common and powerful tool to capture the most important features hidden in enormous data. Such a technique has been successfully applied in many areas, for example, compressed sensing [4,14], image processing [15,20,24,32,38], and machine learning [27,40].
The most intuitive choice of regularization term for promoting sparsity is the ℓ0-norm, which counts the number of nonzero components of a vector. The geometric viewpoint of the ℓ0-norm was explored in [50], demonstrating that the ℓ0 regularization term lifts the graph of the target function according to the level of sparsity. Nevertheless, it is challenging to develop an efficient algorithm for the ℓ0-norm based regularization problem due to the discontinuity and non-convexity of the ℓ0 norm. As discussed in [6,18,33], finding a global minimizer of the ℓ0-norm based regularization problem is NP-hard. A widely adopted approach to address this issue is to replace the ℓ0-norm by the ℓ1-norm, which turns the non-convex problem into a convex one. There are plenty of feasible algorithms from convex optimization for solving the ℓ1-norm based regularization problem [8,24,32]. However, the ℓ1-norm, whose proximity operator is the well-known soft-thresholding operator, can introduce unnecessary bias in the components of the solution even when some of them are dominant [16,53], and such a bias may jeopardize the performance of the solution, especially its accuracy on the data. This disadvantage of the ℓ1-norm can in fact be avoided by the ℓ0-norm, which takes the hard-thresholding operator as its proximity operator and can thereby improve the accuracy of the solution on the data. Consequently, there arises a need to devise algorithms that directly address the challenge of solving the ℓ0-norm based regularization problem.
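The contrast between the two thresholding operators can be made concrete with a small numerical sketch (an illustration only; both operators are standard, and the threshold level √(2t) for the ℓ0 case appears in the hard-thresholding formula of Section 3):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximity operator of t*||.||_1: every surviving component is shrunk
    # toward zero by t, which biases even the dominant entries.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_threshold(x, t):
    # Proximity operator of t*||.||_0: components with |x_i| >= sqrt(2t)
    # are kept unchanged, so large entries incur no bias.
    out = x.copy()
    out[np.abs(x) < np.sqrt(2.0 * t)] = 0.0
    return out

x = np.array([5.0, -3.0, 0.4, 0.05])
print(soft_threshold(x, 1.0))  # dominant entries shrunk by 1
print(hard_threshold(x, 1.0))  # dominant entries kept exactly
```

Both operators produce the same sparsity pattern in this example, but only hard thresholding leaves the large components unbiased.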
Recently, there has been notable advancement in the development of convergent algorithms for the ℓ0-norm based regularization problem. A fixed-point proximity algorithm was proposed in [37] in the context of image inpainting, with guaranteed convergence to a local minimizer of the objective function involving the ℓ0-norm. Its proof reveals that the support of the sequence generated by the algorithm remains unchanged after a finite number of iterations, and thus, on the fixed support, the non-convex optimization problem reduces to a convex one. This idea has also been applied to problems of seismic wavefield modeling [48] and medical image reconstruction [52]. Moreover, the numerical examples provided in [37,48,52] affirm that solutions obtained from the ℓ0-norm based regularization model exhibit superior performance compared to those from the ℓ1-norm based regularization model. The algorithm studied in [37,48] usually requires evaluating, at each iteration, the proximity operator of the composition of a convex function with a matrix. The precise value of that proximity operator is difficult to obtain except when the matrix has some special property, such as being the identity matrix or having a circulant block structure [20]. This requirement significantly constrains the applicability of the algorithm in some areas, for example, regression and classification problems in machine learning and deblurring problems in image processing. The fixed-point iterative algorithms proposed in [24,32] facilitate computing the proximity operator of a convex function composed with a general matrix, in which errors are unavoidable with a finite number of iterations. This motivates us to consider an inexact version of the algorithm proposed in [37,48], namely the inexact fixed-point proximity algorithm.
In the field of optimization, inexact methods have attracted considerable attention. Inexact methods for solving convex optimization problems related to monotone operators and the proximal point algorithm were developed in [35]. Moreover, the inexact forward-backward algorithm, the inexact variant of the Krasnosel'skiȋ-Mann iteration, and the inexact implicit fixed-point proximity algorithm were proposed in [39,44], [11,29], and [34], respectively. In [1], inexact descent methods were developed for non-convex optimization problems with the objective function satisfying the Kurdyka-Łojasiewicz (KL) property [21,30], and their convergence to a critical point of the objective function was established.
The goal of this paper is to propose an inexact fixed-point proximity algorithm for the ℓ0-norm based regularization problem and to prove its convergence to a local minimizer of the objective function without assuming the KL property. Specifically, we show that the proposed algorithm first pursues the support of a sparse solution of the non-convex model involving the ℓ0-norm and then searches for the solution on the resulting support set; based on this understanding, we establish the convergence result. We apply the inexact fixed-point proximity algorithm to three application problems: regression, classification, and image deblurring. The convergence and effectiveness of the proposed algorithm are confirmed by numerical experiments.
This paper is organized in six sections. In Section 2, we describe the ℓ0-norm based regularization model and characterize a solution of the model as a fixed-point problem. In Section 3, we propose the inexact fixed-point proximity algorithm to solve the ℓ0-norm based regularization model and demonstrate its feasibility. Section 4 is devoted to convergence analysis of the proposed inexact algorithm. In Section 5, we apply the proposed algorithm to solve regression, classification, and image deblurring problems and demonstrate its performance. Our conclusions are drawn in Section 6.

Sparse Regularization Model
In this section, we describe the sparse regularization model and present a necessary condition for a global minimizer of the resulting minimization problem.
Many application problems may be modeled as sparse regularization problems. The sparsity of a vector is naturally measured by the number of its nonzero components. Specifically, for x ∈ R^n, ∥x∥₀ counts the number of nonzero components of x, and we call ∥x∥₀ the ℓ0-norm of x, even though it is not really a norm. A sparse regularization problem may be described as

min { ψ(Bv) + λ ∥Dv∥₀ : v ∈ R^m }, (2.1)

where the function ψ : R^p → R is proper, convex, continuously differentiable, and bounded below, B ∈ R^{p×m} and D ∈ R^{n×m} are matrices, and λ is a positive regularization parameter. The matrix D appearing in (2.1) is often a redundant system chosen as a mathematical transformation, such as a discrete cosine transform [41], a wavelet transform [13,28,31], or a framelet transform [9,36], depending on the specific application.
In this paper, we are mainly interested in the case of D being a discrete tight framelet system, that is, D ⊤ D = I with D ⊤ being the transpose of D and I the identity matrix.
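The tight-frame identity D⊤D = I (while DD⊤ ≠ I for a redundant system) is easy to check numerically; the following sketch builds a toy tight frame by stacking the identity with an arbitrary orthogonal matrix (an illustrative construction, not an actual framelet system):

```python
import numpy as np

n = 4
# Toy tight frame: stack the identity and an orthogonal matrix Q, each
# scaled by 1/sqrt(2); then D^T D = (I + Q^T Q)/2 = I while D D^T != I.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((n, n)))
D = np.vstack([np.eye(n), Q]) / np.sqrt(2.0)
print(np.allclose(D.T @ D, np.eye(n)))      # True: Parseval (tight) frame
print(np.allclose(D @ D.T, np.eye(2 * n)))  # False: redundant, rank n only
```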
The discontinuity and non-convexity of the ℓ0-norm are major barriers to developing efficient algorithms for solving the optimization problem (2.1). Practically, the ℓ0-norm in the model (2.1) is often replaced by the ℓ1-norm [8,24,32], which is both continuous and convex. The resulting model is

min { ψ(Bv) + λ ∥Dv∥₁ : v ∈ R^m }. (2.2)

The model (2.2) is a special case of the model in [8,24], which may be solved by the first-order primal-dual algorithm [8] and the multi-step fixed-point proximity algorithm [24].
Both of these algorithms require computing the proximity operator of λ∥Dv∥₁, which often has no closed form. Alternatively, one may introduce an auxiliary variable u to free Dv from the non-differentiable norm ∥·∥₁ and add the difference between u and Dv as a penalty term. This yields the model (2.3). This falls into the quadratic penalty approach, which can be traced back to [12] and has been widely used in image reconstruction [43,45,51]. However, the ℓ1-norm may promote biases [16]. For this reason, we prefer using the ℓ0-norm and overcoming the difficulties it brings. Specifically, we adopt the quadratic penalty approach to free Dv from ∥·∥₀. This leads to the following two-variable ℓ0-norm based regularization problem

min { ψ(Bv) + λ ∥u∥₀ + (λ/(2γ)) ∥u − Dv∥₂² : (u, v) ∈ R^n × R^m }, (2.4)

where γ > 0 is the envelope parameter. We remark that, minimized with respect to u, the sum of the last two terms in (2.4) is λ times the Moreau envelope of ∥·∥₀ at Dv and approaches λ∥Dv∥₀ as γ goes to zero. For notational simplicity, we let

F(u, v) := ψ(Bv) + λ ∥u∥₀ + (λ/(2γ)) ∥u − Dv∥₂², (u, v) ∈ R^n × R^m. (2.5)

In this notation, problem (2.4) may be restated as

min { F(u, v) : (u, v) ∈ R^n × R^m }. (2.6)

We next present a necessary condition for a solution of (2.6). To this end, we recall the definition of the proximity operator [2]. For a proper, lower semi-continuous function f : R^m → R ∪ {∞} and a positive number t, the proximity operator prox_{tf} at y ∈ R^m is defined as

prox_{tf}(y) := argmin { (1/(2t)) ∥x − y∥₂² + f(x) : x ∈ R^m }. (2.7)

We remark that the proximity operator prox_{tf} defined by (2.7) is in general set-valued; if f is further assumed to be convex, the corresponding proximity operator is single-valued. In particular, if f is continuously differentiable and convex, then prox_{tf} = (I + t∇f)^{−1}. The remaining part of this section establishes a necessary condition for a solution of (2.6) in the form of a fixed-point of proximity operators. To prepare for this, we present three technical lemmas, the first of which establishes the inequality (2.8).
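Definition (2.7) can be checked directly in one dimension by brute-force minimization; the sketch below (an illustration only, using f = |·|, whose proximity operator is the soft-thresholding operator) recovers the closed-form value on a fine grid:

```python
import numpy as np

def prox_numeric(f, t, y, grid):
    # Definition (2.7) in 1-D: prox_{tf}(y) = argmin_x (1/(2t))(x - y)^2 + f(x),
    # evaluated here by exhaustive search over a grid.
    vals = (grid - y) ** 2 / (2.0 * t) + f(grid)
    return grid[np.argmin(vals)]

grid = np.linspace(-5.0, 5.0, 100001)
t, y = 1.0, 2.5
approx = prox_numeric(np.abs, t, y, grid)
print(approx)  # close to sign(y) * max(|y| - t, 0) = 1.5
```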
Lemma 2 Suppose that f : R^n → R ∪ {∞} is a proper, lower semi-continuous function.
Proof From the definition of the proximity operator, it suffices to prove (2.10). By employing inequality (2.8) of Lemma 1 with a := x − y* and b := x* − y*, we obtain (2.11). Combining (2.11) with the assumption (2.9) leads to the estimate (2.10).

Lemma 3 Suppose that D ∈ R^{n×m} satisfies D⊤D = I. Then for all u ∈ R^n and v ∈ R^m,

∥u − Dv∥₂² = ∥v − D⊤u∥₂² + ∥u∥₂² − ∥D⊤u∥₂². (2.12)

Proof We prove (2.12) by direct computation. Since by assumption D⊤D = I, we may expand ∥u − Dv∥₂² as

∥u − Dv∥₂² = ∥u∥₂² − 2⟨D⊤u, v⟩ + ∥v∥₂²,

which leads to equation (2.12).
The next proposition presents a necessary condition for a solution of (2.6) as a fixed-point of proximity operators. For notational simplicity, we define the operator S : R^m → R^m as

S := prox_{(γ/λ) ψ∘B}.

Proposition 1 Suppose that (u*, v*) is a global minimizer of (2.6). Then (u*, v*) satisfies the fixed-point relations (2.16) and (2.17), the second of which reads v* = S(D⊤u*).

Proof Since (u*, v*) is a global minimizer of (2.6), we obtain (2.18). By noting the definition of F and choosing v := v* in (2.18), we obtain (2.19). Applying Lemma 2 with p := λ/(2γ), q := α, x* := u*, y* := Dv*, and f := λ∥·∥₀, and employing the definition of the proximity operator, we get (2.16). We next take u := u* in (2.18) and find that (2.23) holds for all v ∈ R^m. The necessary condition for a global minimizer of optimization problem (2.6) presented in Proposition 1 suggests an inexact fixed-point proximity algorithm for solving (2.6), which will be discussed in the next section.

Inexact Fixed-Point Proximity Algorithm
In this section, we first describe an exact fixed-point proximity algorithm for the model (2.6), and discuss a computational issue of the exact algorithm.We then propose an inexact fixed-point proximity algorithm to address the computational issue.
One may apply the alternating direction method of multipliers (ADMM) to solve optimization problem (2.1). ADMM was originally proposed for solving convex optimization problems [5,17,19] and was later extended to nonconvex nonsmooth optimization [46]. Convergence of ADMM was established in [46] under the requirements that the augmented Lagrangian function satisfy the Kurdyka-Łojasiewicz property and that the penalty parameter λ/γ in the augmented Lagrangian function defined by (3.5) below be sufficiently large.
To describe ADMM for (2.1), we rewrite problem (2.1) in the equivalent form

minimize ψ(Bv) + λ ∥u∥₀ subject to Dv − u = 0, (3.4)

where (u, v) ∈ R^n × R^m. We then define the augmented Lagrangian function for (3.4) as (3.5), where w ∈ R^n is a dual variable. ADMM minimizes the augmented Lagrangian function (3.5) with respect to the variables u and v alternately and then updates the dual variable w, leading to the iteration scheme

u^{k+1} ∈ argmin_u L(u, v^k, w^k),
v^{k+1} ∈ argmin_v L(u^{k+1}, v, w^k),
w^{k+1} = w^k + (λ/γ)(Dv^{k+1} − u^{k+1}).

Note that the gradient of the augmented Lagrangian function L with respect to the dual variable w is given by Dv − u. ADMM employs the gradient ascent method to update the dual variable w, utilizing a fixed step-size λ/γ. Consequently, at each iteration, the function value L(u, v, w) decreases when the primal variables u and v are updated and increases when the dual variable w is updated. In other words, the gradient ascent step that updates the dual variable in ADMM may lead to oscillations in the values of the augmented Lagrangian function. As such, ADMM may fail to converge for the augmented Lagrangian function L involving the ℓ0 norm: due to the nonconvexity of the ℓ0 norm, ensuring that the objective function decreases monotonically at every iteration step is crucial for convergence. Unlike ADMM, the exact fixed-point proximity algorithm (3.1)-(3.2) consists only of the u-minimization step (3.1) and the v-minimization step (3.2). It was proved in Theorem 2.1 of [51] that the objective function F defined by (2.5) decreases monotonically at every iteration step of (3.1)-(3.2), leading to its convergence. For this reason, we prefer minimizing the objective function F defined by (2.5) rather than the augmented Lagrangian function L. Our goal is to develop a convergence-guaranteed inexact fixed-point algorithm for the optimization problem (2.6) without requiring the objective function to be a Kurdyka-Łojasiewicz function or the penalty parameter λ/γ to be sufficiently large, as required in [46].
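The monotone decrease of F under the exact two-step scheme can be illustrated on a toy instance. The sketch below uses plain alternating minimization (the relaxation parameter α of (3.1)-(3.2) is omitted), D = I, and a quadratic fidelity ψ(z) = ½∥z − b∥₂², so both block minimizations are exact; all symbols are local to the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 8
B, b = rng.standard_normal((6, m)), rng.standard_normal(6)
D = np.eye(n)               # D^T D = I holds trivially
lam, gamma = 0.5, 0.1

def psi(z):                 # quadratic fidelity (an assumed example)
    return 0.5 * np.sum((z - b) ** 2)

def F(u, v):                # objective (2.5)
    return psi(B @ v) + lam * np.count_nonzero(u) \
        + lam / (2 * gamma) * np.sum((u - D @ v) ** 2)

def hard(x, t):             # prox of t*||.||_0 (hard thresholding)
    out = x.copy(); out[np.abs(x) < np.sqrt(2 * t)] = 0.0; return out

u, v = np.zeros(n), np.zeros(m)
vals = [F(u, v)]
for _ in range(30):
    u = hard(D @ v, gamma)  # exact u-step: minimize F over u
    # exact v-step for quadratic psi: (B^T B + (lam/gamma) I) v = B^T b + (lam/gamma) D^T u
    v = np.linalg.solve(B.T @ B + lam / gamma * np.eye(m),
                        B.T @ b + lam / gamma * D.T @ u)
    vals.append(F(u, v))
print(all(vals[i + 1] <= vals[i] + 1e-12 for i in range(len(vals) - 1)))  # True
```

Each block update minimizes F exactly over one variable, so the objective values form a nonincreasing sequence, mirroring the monotonicity result of [51].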
We now consider the implementation of the exact fixed-point proximity algorithm (3.1)-(3.2). It is well known that for u ∈ R^n, the proximity operator of the ℓ0-norm is given componentwise by

prox_{t∥·∥₀}(u) = H_t(u₁) × H_t(u₂) × ··· × H_t(u_n), with t > 0, (3.9)

where, for z ∈ R,

H_t(z) := {z} if |z| > √(2t), {0, z} if |z| = √(2t), and {0} otherwise. (3.10)

Note that prox_{t|·|₀} is the hard-thresholding operator. The closed-form formula (3.9) facilitates the implementation of (3.1). However, the implementation of (3.2) encounters the computational issue of evaluating the operator S. We assume that the operator (I + (γ/λ)∇ψ)^{−1} has a closed form, which is true in many problems of practical importance. In the special case when the matrix B in (3.2) satisfies BB⊤ = I, the operator S also has a closed form. It is known (for example, Theorem 6.15 of [3]) that if BB⊤ = I, then for any x ∈ R^m,

prox_{f∘B}(x) = x + B⊤( prox_f(Bx) − Bx ).

In this case, the update vk+1 in (3.2) can be computed exactly by using the above formula and a closed-form formula for (I + (γ/λ)∇ψ)^{−1}. For general cases where BB⊤ ≠ I, there is no closed form for S, and thus the update vk+1 in (3.2) cannot be computed exactly. According to [32], the operator S may be computed by solving a fixed-point equation. Therefore, we can only obtain approximate values of vk+1. This leads to an inexact fixed-point approach: Suppose that εk+1 ∈ R^m is a predetermined error and α ∈ (0, 1]. Assuming that ũk and ṽk are given, we find ũk+1 and ṽk+1* by (3.11)-(3.12) and define ṽ1* := ṽ1. Equation (3.12) is not directly computable since it is not feasible to calculate ṽk+1* exactly. In practice, computing an approximation of ṽk+1* requires an inner loop, which leads to an inexact algorithm.
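The closed form under BB⊤ = I from Theorem 6.15 of [3], prox_{f∘B}(x) = x + B⊤(prox_f(Bx) − Bx), can be verified numerically. The sketch below takes the quadratic f(z) = ½∥z∥₂² (so prox_f(z) = z/2 and the prox of f∘B also solves a small linear system), which is an assumed example rather than the ψ of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
# B with orthonormal rows, so that B B^T = I as the closed form requires.
B = np.linalg.qr(rng.standard_normal((4, 2)))[0].T   # shape (2, 4)
x = rng.standard_normal(4)

# For f(z) = 0.5*||z||^2, prox_f(z) = z/2, and prox_{f o B}(x) solves the
# optimality condition (I + B^T B) u = x, giving an independent check.
closed_form = x + B.T @ (B @ x / 2.0 - B @ x)
direct = np.linalg.solve(np.eye(4) + B.T @ B, x)
print(np.allclose(closed_form, direct))  # True
```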
We now describe a specific algorithm to compute an approximation of ṽk+1* at the (k+1)-st step, which will serve as an inner loop of the inexact algorithm for solving model (2.6). By Theorem 3.1 of [32], for any x ∈ R^m, we have (3.14), where w satisfies the fixed-point equation (3.15) for any positive parameter q. Note that the introduction of the parameter q allows us to choose its value so that the corresponding fixed-point algorithm for finding w converges. When ũk+1 is available, we can find an update ṽk+1 by employing the Fixed-Point Proximity Algorithm (FPPA) proposed in [24,32] to solve equations (3.14) and (3.15) numerically. For given positive constants p, q and initial points ṽk+1_1, wk+1_1, the FPPA is described in (3.16). From [24,32], iteration (3.16) converges if ∥B∥₂² < pq. Computing an exact value of the proximity operator of ψ∘B would require infinitely many executions of iteration (3.16). In practical implementation, however, iteration (3.16) can be executed only a finite number of times. This means that we can only obtain an approximate surrogate of ṽk+1*. The error between ṽk+1* and the approximation influences the global error, and thus appropriately choosing a stopping criterion for the inner loop is essential.
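Since the inner problem defining ṽk+1* is smooth and strongly convex, any convergent inner solver can stand in for iteration (3.16) in a quick experiment. The sketch below uses plain gradient descent (not the FPPA itself) with a quadratic fidelity ψ(z) = ½∥z − b∥₂², for which the exact answer is available as a linear solve; all names are local to the example:

```python
import numpy as np

def approx_prox_psi_B(x, B, b, iters=500, tol=1e-10):
    # Approximate prox_{psi o B}(x) for psi(z) = 0.5*||z - b||^2 by gradient
    # descent on the strongly convex h(v) = psi(Bv) + 0.5*||v - x||^2,
    # stopping once successive iterates are within tol (an inner-loop criterion).
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1.0)  # 1/L for the gradient of h
    v = x.copy()
    for _ in range(iters):
        v_new = v - step * (B.T @ (B @ v - b) + (v - x))
        if np.linalg.norm(v_new - v) <= tol:
            break
        v = v_new
    return v_new

rng = np.random.default_rng(0)
B, b, x = rng.standard_normal((3, 5)), rng.standard_normal(3), rng.standard_normal(5)
v_hat = approx_prox_psi_B(x, B, b)
exact = np.linalg.solve(B.T @ B + np.eye(5), B.T @ b + x)  # closed form for this psi
print(np.allclose(v_hat, exact, atol=1e-6))  # True
```

Tightening tol reduces the per-step error εk+1, which is exactly the quantity the stopping criterion discussed below is designed to control.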
We next motivate the choice of the stopping criterion for the inner loop. To this end, we define the function H. Note that for fixed ũk+1, H(·; ũk+1) is a continuously differentiable convex function.
From the definition of the operator S, we observe that ṽk+1* equals the minimizer of (3.18). It is advantageous to have an equivalent form of (3.18). Recalling the definition of F, we define the optimization problem (3.19). For the sake of simplicity, we introduce a similarity notation for two functions. We say two functions f and g are similar if their difference is a constant, and we write f ∼ g. Hence, if f ∼ g, then the two functions have the same set of minimizers. Two minimization problems are said to be equivalent if they have the same set of global minimizers.
Proof From Lemma 3, the difference between ∥ũk+1 − Dv∥₂² and ∥v − D⊤ũk+1∥₂² is a constant independent of v. By the definition of F, the minimization problems (3.18) and (3.19) are therefore equivalent.
We next review the notion of strictly convex functions (see, for example, Chapters 8 and 11 of [2]). Let f : R^d → R ∪ {∞} be a proper function. The function f is strictly convex if

f(tx + (1 − t)y) < t f(x) + (1 − t) f(y) for all distinct x, y ∈ dom f and all t ∈ (0, 1).

It is well known that the minimizer of a proper strictly convex function, when it exists, is unique. We use N to denote the set of all positive integers.

Lemma 5 Suppose that α ∈ (0, 1] and (ũk, ṽk, ṽk*)∞k=1 is the sequence defined by (3.11)-(3.12).

Proof We first prove (a). The hypothesis of (a) is equivalent to ũk+1 = ũk and ∇H(ṽk; ũk+1) = 0. Since H(·; ũk+1) is a continuously differentiable convex function and ∇H(ṽk; ũk+1) = 0, we observe that ṽk is the global minimizer of problem (3.18). Since ψ∘B is continuously differentiable, we have that ṽk = S(D⊤ũk+1). This together with ũk+1 = ũk yields (3.22). Equations (3.21) and (3.22) show that (ũk, ṽk) is a fixed-point of (2.16)-(2.17).
The proof is completed by choosing the constants appropriately. Inequalities (3.24) and (3.25) in Lemma 6 may be used as a stopping criterion for the inner loop iteration (3.16). Combining (3.11)-(3.12) and iteration (3.16) for computing the proximity operator of ψ∘B with the stopping criterion provided by Lemma 6, we obtain an executable inexact fixed-point proximity algorithm for model (2.6). We summarize it in Algorithm 1.
Proposition 2 guarantees that Algorithm 1 generates a sequence (ũk, ṽk)∞k=1 satisfying the conditions (3.29) and (3.30). These two inequalities will play a crucial role in the convergence analysis of Algorithm 1, which will be provided in the next section.

Convergence Analysis
In this section, we analyze the convergence of the inexact fixed-point proximity algorithm. Specifically, we show that a sequence (ũk, ṽk)∞k=1 generated by Algorithm 1 converges to a local minimizer of (2.6). Since the sequence (ũk, ṽk)∞k=1 generated by Algorithm 1 is a sequence defined by (3.11)-(3.12) satisfying conditions (3.29) and (3.30), we shall study the convergence property of sequences defined by (3.11) and (3.12). For this purpose, we reformulate the iterations (3.11) and (3.12) as an inexact Picard sequence of a nonlinear map. We then show their convergence by studying the properties of the map. Finally, we establish the convergence of the inexact fixed-point proximity algorithm (Algorithm 1) based on the convergence of iterations (3.11) and (3.12).
We first state the main results of this section.
Theorem 1 Suppose that ψ is a proper, continuously differentiable, and bounded below convex function, B is a p × m matrix, D is an n × m matrix satisfying D ⊤ D = I, λ and γ are positive constants, α ∈ (0, 1), and the operator T defined in (4.11) has at least one fixed-point.
This section is devoted to providing a proof for Theorem 1.To this end, we establish a number of preliminary results.Throughout this section, we assume without further mentioning that ψ : R p → R is a proper, continuously differentiable, and bounded below convex function, B is a p × m matrix, D is an n × m matrix satisfying D ⊤ D = I, λ and γ are positive constants, and α ∈ (0, 1).
Note that in Theorem 1 we assume α ∈ (0, 1), even though Algorithm 1 may still converge for α equal to 1 or slightly larger than 1; see the numerical examples presented in Section 5. For α ≥ 1, convergence of Algorithm 1 is not guaranteed, since the technical proof presented in this section requires α ∈ (0, 1). Moreover, there are examples exhibiting divergence of Algorithm 1 when α > 1.
We prepare for the proof of Theorem 1. The iterations (3.11) and (3.12) have two stages. The first stage essentially searches for the support of the candidate sparse vector u in model (2.6), which involves the ℓ0-norm. Once it is found, the support remains unchanged in subsequent iterations, and thus the original non-convex optimization problem (2.6) reduces to one with u restricted to the fixed support. The restricted optimization problem is convex, since the term λ∥u∥₀ reduces to a fixed number independent of u. For this reason, the second stage of the iterations solves the resulting convex minimization problem.
We now show that the first N steps of iterations (3.11) and (3.12), for a certain positive integer N, pursue the support of the sparse vector u. By the definition of the proximity operator, the right-hand side of (3.11) is the set of minimizers of (4.1). We establish an equivalent form of (4.1). To this end, we consider the optimization problem (4.2).

Lemma 7 The minimization problems (4.1) and (4.2) are equivalent.
Proof We expand the quadratic term in the objective function of minimization problem (4.1). Noting that multiplying an objective function by a positive constant does not change its set of minimizers, the claim follows.
Recalling the definition of F, we find that the minimization problems (4.1) and (4.2) are equivalent.

Proof By the definition of the proximity operator, ũk+1 is a global minimizer of (4.1). It follows from Lemma 7 that the corresponding inequality holds. Adding this inequality to inequality (3.29) yields (4.3). Noting that λ and γ are positive and α ∈ (0, 1), the second term on the left-hand side of the resulting inequality is non-negative, which implies the inequality in part (a). Since F is bounded below, the inequality of part (a) guarantees the existence of lim k→∞ F(ũk, ṽk). This completes the proof of part (a).
It remains to show part (b). The second statement of part (a) ensures that lim k→∞ (F(ũk, ṽk) − F(ũk+1, ṽk+1)) = 0. This together with (4.3) and the assumption implies the claim. Next, we show that the support of the inexact sequence (ũk)∞k=1 remains unchanged after a finite number of steps, which reduces finding a local minimizer of the non-convex optimization problem to solving a constrained convex optimization problem. We first review the support of a vector. For a positive integer n, let Nn := {1, 2, . . ., n}. For x ∈ R^n, we use S(x) to denote the support of x, defined as

S(x) := {i ∈ Nn : x_i ≠ 0}.

Clearly, S(x) is a subset of Nn. It is empty if all components of x are zero, and it equals Nn if all components of x are nonzero. It is known (for example, Lemma 3 of [37]) that if u ∈ prox t∥·∥₀(x) and v ∈ prox t∥·∥₀(y) for x, y ∈ R^n, then (4.4) holds. Under these conditions, there exists a positive integer N such that S(ũk) = S(ũN) for all k ≥ N.
Proof From Lemma 8, we have that lim k→∞ ∥ũk+1 − ũk∥₂ = 0. Hence, there exists an integer N such that ∥ũk+1 − ũk∥₂ is sufficiently small for all k ≥ N. This together with (4.4) leads to S(ũk+1) = S(ũk) for all k ≥ N, which further ensures that S(ũk) = S(ũN) for all k ≥ N.
Proposition 3 reveals that the support of ũk remains unchanged for k ≥ N, so that the influence of ∥·∥₀ disappears for k ≥ N. Throughout this section, we use C ⊆ Nn to denote the unchanged support. Accordingly, the non-convex optimization problem (2.6) reduces to a convex one. To describe the resulting convex optimization problem, we need the notion of the indicator function. For a given index set Λ ⊆ Nn, we define the subspace of R^n

BΛ := {x ∈ R^n : x_j = 0 for all j ∉ Λ},

and define the indicator function δΛ : R^n → R ∪ {∞} as

δΛ(x) := 0 if x ∈ BΛ, and +∞ otherwise. (4.5)

We define a convex function accordingly. Proposition 3 implies that for k ≥ N, the non-convex optimization problem (2.6) boils down to the convex optimization problem (4.7). Specifically, we have the next result, which follows directly from Theorem 4.8 of [50]. We now turn to showing that the inexact sequence (ũk, ṽk)∞k=N converges to a local minimizer of optimization problem (2.6). To this end, we reformulate the sequence as an inexact Picard sequence of a nonlinear operator and then show that the nonlinear operator has a favorable property that guarantees convergence of sequences generated from it. For a given index set Λ ⊆ Nn, we define the projection operator PΛ : R^n → BΛ by

(PΛ x)_j := x_j for j ∈ Λ, and (PΛ x)_j := 0 for j ∉ Λ,

and the diagonal matrix TΛ with diagonal entries (TΛ)_jj := 1 for j ∈ Λ and 0 otherwise. From the definition of the operator PΛ and the matrix TΛ, we have that PΛ x = TΛ x for all x ∈ R^n. We next express the proximity operator of the ℓ0-norm in terms of the linear transformation TC. To this end, we recall that for x, z ∈ R^n, if x ∈ prox t∥·∥₀(z), then x = P S(x)(z) (see Lemma 5 of [48]).
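The support set S(x) and the projection PΛ (equivalently, the diagonal matrix TΛ) are straightforward to realize; a minimal sketch with hypothetical helper names:

```python
import numpy as np

def support(x):
    # S(x): the set of indices of nonzero components of x.
    return set(np.flatnonzero(x).tolist())

def proj_onto(x, Lam):
    # P_Lam zeroes the components outside Lam; as a matrix, this is the
    # diagonal T_Lam with ones on Lam and zeros elsewhere.
    T = np.zeros(x.size)
    T[sorted(Lam)] = 1.0
    return T * x

x = np.array([0.0, 2.0, 0.0, -1.0])
print(support(x))                                             # {1, 3}
print(proj_onto(np.array([4.0, 5.0, 6.0, 7.0]), support(x)))  # [0. 5. 0. 7.]
```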
Proof By the hypothesis of this lemma and Lemma 5 of [48], we see that ũk+1 = P S(ũk+1)(zk). From Proposition 3, we have that S(ũk+1) = C for all k ≥ N. Hence, ũk+1 = PC(zk). Combining this equation with (4.10) completes the proof of this lemma.
We next reformulate iterations (3.11)-(3.12) as an inexact Picard sequence of a nonlinear operator for k ≥ N. We define the operator T : R^n → R^n by (4.11). Substituting (4.13) into the right-hand side of (4.14) and using the linearity of TC, we observe the desired identity. We complete the proof by invoking the definition of T.
The next proposition identifies the relation between a global minimizer of minimization problem (4.7) and a fixed-point of the operator T. We first review the notion of the subdifferential [2]. The subdifferential of a convex function f : R^d → R ∪ {∞} at x ∈ dom f is the set

∂f(x) := {y ∈ R^d : f(z) ≥ f(x) + ⟨y, z − x⟩ for all z ∈ R^d}.

The proximity operator of the indicator function δC defined in (4.5) may be expressed as the projection operator, that is, prox δC = PC [32].

Proof This proof is similar to that of Theorem 2.1 in [24]. From Proposition 16.8 and Corollary 16.48 of [2], for u ∈ R^n and v ∈ R^m, we have the required subdifferential formulas. The Fermat rule (see Theorem 16.3 of [2]) yields that (u*, v*) is a global minimizer of (4.7) if and only if (4.17) holds. From the definition of the indicator function δC, we notice that (4.17) is equivalent to (4.18) for the quantity α appearing in (3.11). We then rewrite (4.18); simplifying the relation (4.18) and solving for v*, we obtain (4.16). We then substitute (4.16) into the right-hand side of (4.20) and get that u* satisfies the fixed-point relation. Thus, by the definition of T, u* is a fixed-point of T.
According to Proposition 6, obtaining a global minimizer of problem (4.7) is equivalent to finding a fixed-point of T. We now turn to the fixed-points of T. Proposition 5 shows that the sequence (ũk)∞k=N generated from iteration (4.12) is an inexact Picard sequence of T. Convergence of inexact Picard iterations of a nonlinear operator, studied in [11,29,35], may be guaranteed by the averaged nonexpansiveness of the operator. Our next task is to show that the operator T is averaged nonexpansive.
We now review the notion of averaged nonexpansive operators, which may be found in [2]. An operator P : R^d → R^d is said to be nonexpansive if ∥P(x) − P(y)∥₂ ≤ ∥x − y∥₂ for all x, y ∈ R^d.
An operator F : R^d → R^d is said to be averaged nonexpansive if there exist κ ∈ (0, 1) and a nonexpansive operator P such that F = κI + (1 − κ)P, where I denotes the identity operator. We also say that F is κ-averaged nonexpansive. It is well known that for any proper, lower semi-continuous convex function f : R^d → R ∪ {∞}, its proximity operator prox f is nonexpansive.
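Nonexpansiveness of the proximity operator of a convex function can be observed numerically; a quick check for the soft-thresholding operator (the prox of the convex t|·|) on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)  # prox of t*|.|
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
# Componentwise 1-D nonexpansiveness: |soft(x) - soft(y)| <= |x - y|.
ok = bool(np.all(np.abs(soft(x, 0.5) - soft(y, 0.5)) <= np.abs(x - y) + 1e-12))
print(ok)  # True
```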
We now recall a general result concerning the convergence of the inexact Picard sequence of F : R^d → R^d. Its proof may be found in [11,29]; see [35] for a special case.
Lemma 10 Suppose that F : R^d → R^d is averaged nonexpansive and has at least one fixed-point, (zk)∞k=1 is the sequence defined by (4.21), and (ek+1)∞k=1 is a positive sequence of numbers. If ∥εk+1∥₂ ≤ ek+1 for k ∈ N and ∑∞k=1 ek+1 < ∞, then (zk)∞k=1 converges to a fixed-point of F.
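Lemma 10 can be illustrated with a small experiment: iterate zk+1 = F(zk) + εk+1 for a ½-averaged operator built from a rotation (a nonexpansive map with a known fixed point) and error norms ∥εk+1∥₂ = 1/k², which are summable. All constructions are illustrative:

```python
import numpy as np

theta = 1.0
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c = np.array([2.0, -1.0])
P = lambda z: Q @ (z - c) + c          # nonexpansive rotation about c; fixed point c
F_op = lambda z: 0.5 * z + 0.5 * P(z)  # 1/2-averaged nonexpansive operator

rng = np.random.default_rng(0)
z = np.array([10.0, 10.0])
for k in range(1, 3001):
    eps = rng.standard_normal(2)
    eps *= 1.0 / (np.linalg.norm(eps) * k ** 2)  # ||eps_k|| = 1/k^2, summable
    z = F_op(z) + eps                            # inexact Picard step
print(np.linalg.norm(z - c))  # small: the iterates approach the fixed point c
```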
We next show that the operator T defined by (4.11) is averaged nonexpansive. To this end, we make the following observation.

Lemma 11
The operator S • D ⊤ is nonexpansive.
Proof Since (γ/λ)ψ∘B is continuously differentiable and convex, the operator prox (γ/λ)ψ∘B is nonexpansive. Therefore, the operator S is also nonexpansive; that is, ∥S(x) − S(y)∥₂ ≤ ∥x − y∥₂ for all x, y. Using the fact that ∥D⊤∥₂ = 1, we have that ∥D⊤x − D⊤y∥₂ ≤ ∥x − y∥₂. The above two inequalities ensure that the operator S∘D⊤ is nonexpansive.
Proof We write T = ((1 − α)/2) I + ((1 + α)/2) T₁ for a suitable operator T₁ and show that T₁ is nonexpansive. The definition of T yields the expression for T₁. Then, for any u₁, u₂ ∈ R^n, it follows from the triangle inequality that (4.23) holds. Since the matrix TC is diagonal with diagonal entries being either 1 or 0, the matrix 2TC − I is also diagonal with diagonal entries being either 1 or −1. Therefore, ∥2TC − I∥₂ = 1. Noting that ∥TC∥₂ ≤ 1 and ∥D∥₂ = 1, we obtain from (4.23)-(4.24) and Lemma 11 the required bound, which proves that T₁ is nonexpansive.
We next establish convergence of the inexact fixed-point iterations (3.11)-(3.12) satisfying condition (3.29). We next prove that (ṽk)∞k=N converges to S(D⊤u*). Notice that, by the triangle inequality, the corresponding bound holds for each k ∈ N. Combining definition (3.12) of ṽk, Lemma 11, and the above inequality, we obtain (4.26). According to the assumption that ∥εk+1∥₂ ≤ ek+1 for k ∈ N and ∑∞k=1 ek+1 < ∞, we obtain that lim k→∞ ∥ṽk − S(D⊤u*)∥₂ = 0. Noting that λ and γ are positive constants, we conclude that all conditions of Proposition 8 are satisfied. Hence, (ũk, ṽk)∞k=1 converges to a local minimizer of optimization problem (2.6) and S(u*) = C.

Numerical Experiments
In this section, we present numerical experiments to demonstrate the effectiveness of the proposed inexact algorithm for solving the ℓ0 models in three applications: regression, classification, and image deblurring. The numerical results verify the convergence of Algorithm 1 and explore the impact of the inner iterations within the algorithm. Furthermore, the numerical results validate the superiority of the solution obtained from the ℓ0 model (2.6) solved by the proposed algorithm over the solution of the ℓ1 model (2.2), across all three examples.
For the regression and classification models, we need a kernel as described in [25,26]. In the two examples to be presented, we choose the kernel as in (5.1), where the bandwidth σ will be specified later.
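The kernel formula (5.1) is not reproduced in this excerpt. A common choice consistent with a bandwidth parameter σ is the Gaussian (RBF) kernel, sketched here purely as an assumption:

```python
import numpy as np

# Assumed form of the kernel: the actual definition (5.1) is not
# reproduced in this excerpt; the Gaussian (RBF) kernel is a standard
# choice parameterized by a bandwidth sigma.
def gaussian_kernel(x, y, sigma):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

print(gaussian_kernel([0, 0], [0, 0], sigma=2.0))  # 1.0 at zero distance
```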
We will use Algorithm 1 to solve model (2.6) for the three examples. The model (2.2) has a convex objective function and hence can be efficiently solved by many optimization algorithms. Here, we employ Algorithms 2 and 3 (described below) to solve (2.2) with D being the identity matrix and with D being the tight framelet matrix [37] (or the first-order difference matrix [32]), respectively. Note that Algorithm 2 was proposed in [24,32] and Algorithm 3 was introduced in [34]. According to [34], the sequence (ṽ^k, w̃^k) generated by Algorithm 3 converges to a global minimizer of problem (2.2) provided that ∑_{k=1}^∞ e_{k+1} < ∞.

Algorithm 2: Fixed-point proximity scheme for problem (2.2) with D being the identity matrix.

We choose the parameters α, λ, and γ in Algorithm 1 as specified below. The parameter λ serves as the regularization parameter in models (2.6) and (2.2). If λ is set too large, prediction accuracy may decrease; conversely, if it is set too small, the resulting solution may lack sparsity, potentially leading to overfitting. Thus, identifying an appropriate range for λ is crucial. Similarly, the parameter p in Algorithms 1 and 2, where 1/p serves as the step size, affects the speed of the algorithms; it is crucial to choose an appropriate step size in both algorithms for optimal performance. The impact of the choice of the parameter α in the exact fixed-point proximity algorithm (3.1)-(3.2) was numerically investigated in [37], revealing its influence on the convergence speed of the algorithm. We will also explore the effect of the parameter α for the inexact fixed-point proximity algorithm (Algorithm 1).

Algorithm 3: Inexact fixed-point proximity scheme for problem (2.2). Input: the function ψ, the matrices D and B in problem (2.2), and ϕ(Dv) := λ∥Dv∥₁; parameters p₁, q₁ > 0 satisfying p₁q₁ > ∥D∥₂² and p₂, q₂ > 0 satisfying p₂q₂ > ∥B∥₂²; a positive sequence (e_{k+1})_{k=1}^∞. Initialization: ṽ¹, z̃¹, w̃¹. For k = 1, 2, . . .

For the regression, classification, and Gaussian noise image deblurring problems, we test Algorithm 1 with different values of α, including values for which the algorithm is not guaranteed to converge (α ≥ 1). In the instances where α ≥ 1, we set the corresponding ρ′ to 0. Moreover, to explore the impact of the inner iteration in Algorithm 1, we design two variants of it. In the first variant, the inner iteration is performed only once per outer step; the generated sequence may then fail to satisfy conditions (3.29) and (3.30), and we use several sets of parameters to explore convergence in this case. In the second variant, we increase the value of ρ′ in Algorithm 1 so that it fails to satisfy the condition of Theorem 1; we use a single set of parameters λ, γ and only increase ρ′, testing it at 10, 100, and 1000 times the choice (5.2). We adopt the following stopping criterion for all iteration algorithms tested in this study: for a sequence (x^k)_{k=1}^∞ generated by an algorithm performed in this section, the iterations are terminated once condition (5.3) is satisfied, where TOL denotes a prescribed tolerance value.
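The stopping test (5.3) is not reproduced in this excerpt; a typical criterion of this form, assumed here for illustration, terminates when the relative change between successive iterates falls below TOL:

```python
import numpy as np

# Criterion (5.3) is not reproduced in this excerpt; a common choice,
# assumed here, is a relative-change test between successive iterates.
def should_stop(x_new, x_old, tol):
    denom = max(np.linalg.norm(x_old), 1e-15)  # guard against division by zero
    return np.linalg.norm(x_new - x_old) / denom < tol

print(should_stop(np.array([1.0, 1.0]), np.array([1.0, 1.000001]), 1e-3))
```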

Regression
In this subsection, we consider solving the regression problem by using the ℓ0 model. Specifically, we explore the convergence behavior of Algorithm 1 and the impact of its inner iterations, and compare the numerical performance of the ℓ0 model with that of the ℓ1 model.
In our numerical experiment, we use the benchmark dataset "Mg" [10] with 1,385 instances, each of which has 6 features. Among the instances, we take m := 1,000 of them as training samples and the remaining 385 instances as testing samples.
We describe the setting of regression as follows. Our goal is to obtain a function that regresses the data set {(x_j, y_j) : j ∈ ℕ_m} ⊂ ℝ⁶ × ℝ by using the kernel method detailed in [25,26], where x_j is the vector of input features and y_j is the corresponding label. In this experiment, we choose the kernel K as defined in (5.1) with σ := √10 and d := 6. The regression function f : ℝ⁶ → ℝ is then defined by (5.4). The coefficient vector v ∈ ℝ^m appearing in (5.4) is learned by solving the optimization problem (5.5), where λ, γ are positive constants and L : ℝ^m → ℝ is the fidelity term. Optimization problem (5.5) may be reformulated in the form of (2.6) by a suitable choice of its components. The ℓ0 model (2.6) and the ℓ1 model (2.2) are solved by Algorithm 1 and Algorithm 2, respectively. For both of these algorithms, the parameters p and λ are varied within the ranges [10⁻¹, 10³] and [10⁻⁷, 10⁻²], respectively, and q := (1 + 10⁻⁶)∥B∥₂²/p. In addition, for Algorithm 1, the parameter γ is varied within the range [10⁻⁷, 10⁻³] and e_{k+1} := M/k² with M := 10¹⁶. The stopping criterion for Algorithm 1 (resp. Algorithm 2) is either that (5.3) is satisfied with x^k := ũ^k (resp. x^k := v^k) and TOL := 10⁻⁶, or that the maximum iteration count of 10⁵ is reached.
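The regression function (5.4) is not reproduced in this excerpt. The usual kernel expansion, assumed here, predicts with f(x) = Σ_j v_j K(x, x_j), so a sparse coefficient vector v directly reduces prediction cost:

```python
import numpy as np

# Assumed form of the regression function (5.4): a kernel expansion over
# the training inputs with a (sparse) coefficient vector v. The Gaussian
# kernel is itself an assumption, since (5.1) is not reproduced here.
def kernel_regressor(x, x_train, v, sigma):
    d2 = np.sum((x_train - np.asarray(x, float)) ** 2, axis=1)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)) @ v)

x_train = np.array([[0.0], [1.0], [2.0]])
v = np.array([1.0, 0.0, 0.0])      # sparse: only one active kernel center
print(kernel_regressor([0.0], x_train, v, sigma=np.sqrt(10)))  # 1.0
```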
We first explore the effect of the parameter α on the convergence of Algorithm 1 by testing different values of α. Values of the objective function of model (2.6) against the number of iterations for the outer loop of Algorithm 1 with different values of α are displayed in Figure 1. We observe from Figure 1a that for α ∈ (0, 2], Algorithm 1 with a larger value of α tends to produce a smaller value of the objective function. This suggests that for α ∈ (0, 2], increasing the value of α can accelerate the convergence of Algorithm 1. However, for α > 2, as shown in Figure 1b, Algorithm 1 diverges. In the remaining part of this example, we choose α := 0.99 so that the hypothesis of Theorem 1 on the range of α is satisfied, which guarantees convergence of Algorithm 1. We then apply Algorithm 1 and its two variants to test their convergence for the ℓ0 model. We use four sets of parameters for Algorithm 1 and the first variant, and a single set of parameters for the second variant. Figures 2a, 2b, and 2c display values of the objective function generated by Algorithm 1, the first variant, and the second variant, respectively. Figure 2a confirms convergence of Algorithm 1, and Figure 2b reveals divergence of the first variant. From Figure 2c, we observe that a slight increase in ρ′ maintains a convergence trend in the objective function value, but a significant increase, for example to 1000 × ρ′, leads to divergence. These results demonstrate that the inclusion of the inner iteration is essential for effectively solving the ℓ0 model. One conceivable explanation for this phenomenon is that in the regression problem, the norm of B in the objective function (2.5) is as large as 863, and thus even slight variations in the input variable can have a significant impact on the resulting objective function value. Therefore, the inner iteration is critical for controlling the error of each inexact step.
In Table 1, we compare the number of nonzero (NNZ) components of the solutions obtained from the ℓ0 model and the ℓ1 model when they have comparable testing accuracy. It is clear from Table 1 that when the mean squared error on the testing dataset (TeMSE) falls within various intervals, the ℓ0 model consistently generates sparser solutions than the ℓ1 model, with comparable mean squared errors on the training dataset (TrMSE).

Classification
In this subsection, we consider solving the classification problem by using the ℓ0 model. Specifically, we investigate the convergence behavior of Algorithm 1 and the impact of its inner iterations, and compare the numerical performance of the ℓ0 model with that of the ℓ1 model. In our numerical experiment, we use the handwritten digits from the MNIST database [22], which is composed of 60,000 training samples and 10,000 testing samples of the digits "0" through "9". We consider classifying the two handwritten digits "7" and "9" taken from MNIST, which are more easily confused with each other than other pairs of digits. We take m := 5,000 instances of these two digits from the dataset as training samples and 2,000 as testing samples. Specifically, we consider the training data set {(x_j, y_j) : j ∈ ℕ_m} ⊂ ℝ^d × {−1, 1}, where x_j are images of the digit 7 or 9.
The setting of classification is described as follows. Our goal is to find a function that separates the training data into two groups with labels y_j := 1 (for digit 7) and y_j := −1 (for digit 9). We employ the kernel method to derive such a function, as detailed in [49]. We choose the kernel K as defined in (5.1) with σ := 4 and d := 784.
The classification function f : ℝ⁷⁸⁴ → {−1, 1} is then defined by (5.7), where sign denotes the sign function, assigning −1 to negative inputs and 1 to non-negative inputs. The coefficient vector v ∈ ℝ^m appearing in (5.7) is learned by solving the optimization problem (5.8), where λ, γ are positive constants and L : ℝ^m → ℝ is the fidelity term. Optimization problem (5.8) may be reformulated in the form of (2.6) by a suitable choice of its components. For both of these algorithms, the parameter p is varied within the range [1, 100] and q := (1 + 10⁻⁶)∥B∥₂²/p. For Algorithm 1, the parameter λ is varied within the range [0.001, 1], while for Algorithm 2 it varies within [0.01, 5]. In addition, we vary the parameter γ within the range [10⁻⁶, 1] and fix e_{k+1} := M/k² with M := 10¹⁶ for Algorithm 1.
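A sketch of a classifier of the form (5.7), f(x) = sign(Σ_j v_j K(x, x_j)), with the sign convention stated above (1 for non-negative scores, −1 otherwise); the Gaussian kernel is an assumption, since (5.1) is not reproduced in this excerpt:

```python
import numpy as np

# Assumed form of the classifier (5.7): sign of a kernel expansion.
# Note numpy's np.sign maps 0 to 0, so the text's convention
# (1 for non-negative inputs) is implemented explicitly.
def kernel_classifier(x, x_train, v, sigma):
    d2 = np.sum((x_train - np.asarray(x, float)) ** 2, axis=1)
    score = np.exp(-d2 / (2.0 * sigma ** 2)) @ v
    return 1 if score >= 0 else -1

x_train = np.array([[0.0, 0.0], [3.0, 3.0]])
v = np.array([1.0, -1.0])          # one center per class for illustration
print(kernel_classifier([0.1, 0.0], x_train, v, sigma=4.0))  # 1
```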
We first investigate the impact of the parameter α on the convergence of Algorithm 1 by testing it with various values of α; the results are displayed in Figure 3. When α falls within the range (0, 1), increasing its value accelerates the algorithm's convergence. However, for α ∈ (1, 2), further increases in α yield only limited speed-up. Notably, Algorithm 1 diverges when α ≥ 3.5. To adhere to the theoretical requirement α ∈ (0, 1) established in Theorem 1, as in the regression problem, we set α := 0.99 for the rest of this example.
We then apply Algorithm 1 and its two variants to solve the ℓ0 model for classification, maintaining the same stopping criterion as used in the previous subsection. Figures 4a, 4b, and 4c display objective function values generated by Algorithm 1, the first variant, and the second variant, respectively. The observed behavior closely resembles that of the previous example, with the key distinction that for the first variant, the objective function value converges for certain parameter choices and diverges for others. The numerical results again demonstrate the critical importance of the inner iteration in solving the ℓ0 model, primarily due to the norm of B being 592.
We compare the solution of the ℓ0 model solved by Algorithm 1 with that of the ℓ1 model solved by Algorithm 2. The stopping criteria for both algorithms remain consistent with those in the previous subsection, except that TOL is adjusted to 10⁻⁴ and the maximum iteration count is set to 5 × 10⁴. In Table 2, we compare the NNZ and the accuracy on the training dataset (TrA) of the solutions obtained from the ℓ0 model and the ℓ1 model when they have the same accuracy on the testing dataset (TeA). We observe from Table 2 that the solutions of the ℓ0 model have almost half the NNZ of those of the ℓ1 model, and the ℓ0 model produces a higher TrA value than the ℓ1 model. In Table 3, we compare the TeA and TrA of the solutions obtained from the ℓ0 model and the ℓ1 model when they have comparable NNZ. We see from Table 3 that the solution of the ℓ0 model has slightly better accuracy on the testing set and about one percentage point higher accuracy on the training set.

Image Deblurring
In this subsection, we consider the ℓ0 model solved by Algorithm 1 applied to image deblurring. Specifically, we compare the performance of models (2.6) and (2.2) for deblurring images contaminated with two different types of noise: Gaussian noise and Poisson noise. We compare numerical results obtained by using the ℓ0 model with those obtained by using two ℓ1 models, which will be described precisely later. Again, we study the impact of the inner iteration on Algorithm 1 when applied to image deblurring. In this regard, we present only the Gaussian noise case, since the Poisson noise case is similar.
In our numerical experiments, we choose six clean images: "clock" with size 200 × 200, "barbara" with size 256 × 256, "lighthouse", "airplane", and "bike" with size 512 × 512, and "zebra" with size 321 × 481, shown in Figure 5. Each tested image is converted to a vector v via vectorization. The blurred image is modeled as x := P(Kv), (5.10) where x denotes the observed corrupted image, K is the motion blurring kernel matrix, and P is the operation that describes the nature of the measurement noise and how it acts on the blurred image Kv.

Gaussian noise image deblurring
In this case, we take the operation P to be contamination by Gaussian noise, and model (5.10) becomes x := N(Kv, σ²I), (5.11) where N(µ, σ²I) denotes Gaussian noise with mean µ and covariance matrix σ²I. The clean images are first blurred by the motion blurring kernel matrix and then contaminated by Gaussian noise with standard deviation σ := 3. The resulting corrupted images are shown in Figure 6. Given an observed image x ∈ ℝ^m, the function ψ in all three models is chosen as the corresponding data-fidelity term. The matrix B appearing in both models (2.6) and (2.2) is specified as the motion blurring kernel matrix K. Models (2.6) and (2.2) with D being the tight framelet matrix constructed from discrete cosine transforms of size 7 × 7 [37] are referred to as "L0-TF" and "L1-TF", respectively. Model (2.2) with D being the first-order difference matrix [32] is referred to as "L1-TV". The L0-TF model is solved by Algorithm 1, while both the L1-TF and L1-TV models are solved by Algorithm 3. In all three models, the parameter λ is adjusted within the interval [0.01, 2]. For the L0-TF model, we vary the parameter γ within the interval [0.1, 6], while fixing p := 0.1, q := (1 + 10⁻⁶) × 4/p, and e_{k+1} := M/k² with M := 10⁶ in Algorithm 1. For both the L1-TF and L1-TV models, the parameter p₁ = p₂ is varied within the interval [0.01, 1] and q₂ := (1 + 10⁻⁶) × 4/p₂, while q₁ := (1 + 10⁻⁶)/p₁ for the L1-TF model and q₁ := (1 + 10⁻⁶) × 8/p₁ for the L1-TV model. Additionally, we set e_{k+1} := M/k² in Algorithm 3, where M := 10⁸ for the L1-TF model and M := 10⁷ for the L1-TV model. The stopping criterion for both algorithms is that (5.3) is satisfied with x^k := ṽ^k and TOL := 10⁻⁵.
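A sketch of the forward model (5.11), x = Kv + Gaussian noise: the paper's motion blurring kernel is not reproduced in this excerpt, so a simple length-5 horizontal motion blur is assumed for illustration:

```python
import numpy as np

# Forward model (5.11): blur then add Gaussian noise with std sigma.
# The exact motion-blur kernel used in the paper is not reproduced here;
# a length-5 horizontal motion blur is an illustrative assumption.
def blur_and_add_noise(img, length=5, sigma=3.0, seed=0):
    kernel = np.ones(length) / length           # horizontal motion blur
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img)
    rng = np.random.default_rng(seed)
    return blurred + sigma * rng.standard_normal(img.shape)

clean = np.full((8, 8), 100.0)                  # toy constant image
corrupted = blur_and_add_noise(clean)
print(corrupted.shape)  # (8, 8)
```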
We explore the effect of α in Algorithm 1 for the images "clock" and "barbara" by testing different values of α; the results are displayed in Figure 7. As observed in Figure 7, when the value of α is close to 1, Algorithm 1 tends to yield smaller objective function values over the same number of iterations. Figure 8 provides zoomed-in views of the objective function values for the last 800 iterations for the images "clock" and "barbara", with α := 1.00 and 1.01. While Algorithm 1 converges for both values of α for the image "clock", it converges for α := 1 but diverges for α := 1.01 for the image "barbara". These numerical results support the condition α ∈ (0, 1) for convergence of Algorithm 1 established in Theorem 1. Consequently, we set α := 0.99 in the rest of this example to ensure convergence of Algorithm 1. We then apply Algorithm 1 and the first variant to solve the L0-TF model. Figures 9a and 9b depict objective function values generated by Algorithm 1 and its first variant, respectively, while Figures 9c and 9d illustrate the PSNR values of the images reconstructed by the two methods. Figure 9a confirms convergence of Algorithm 1 for all six tested images with the displayed values of the parameters λ and γ. Figure 9b shows that the objective function values generated by the first variant exhibit significant oscillation for the first 90 iterations, followed by a convergence trend. These observations are also reflected in the PSNR values depicted in Figures 9c and 9d. The numerical results show that the significance of the inner iteration is limited for image deblurring problems, owing to the fact that the norm of the matrix B is not greater than 2.
We list in Table 4 the highest PSNR values for each of the reconstructed images, with the corresponding values of the parameters λ, γ, p, and p₁ shown in Table 5. The results in Table 4 consistently demonstrate that the L0-TF model achieves the highest PSNR value, surpassing the second-highest value by approximately 1 dB. For visual comparison, we display the deblurred images in Figures 10 and 11. One can see that the L0-TF model suppresses noise on smooth areas more effectively than the L1-TF and L1-TV models. Figure 12 presents a comparison of two specific local areas in the zebra image; we observe that the deblurred images obtained from the L0-TF model preserve finer details and edges more effectively than those obtained from the other two models.
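The PSNR values reported in Tables 4 and 6 can be computed with the standard formula 10 log₁₀(255²/MSE) for 8-bit images:

```python
import numpy as np

# Standard PSNR in dB for 8-bit images: 10*log10(255^2 / MSE).
def psnr(reference, restored, peak=255.0):
    mse = np.mean((np.asarray(reference, float) - np.asarray(restored, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
rec = ref + 1.0          # uniform error of one gray level -> MSE = 1
print(round(psnr(ref, rec), 2))  # 48.13
```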
Poisson noise image deblurring

We show in Table 6 the highest PSNR values for each of the reconstructed images, with the corresponding values of the parameters λ, γ, p, and p₁ listed in Table 7. The results in Table 6 demonstrate that the L0-TF model achieves the highest PSNR value, surpassing the second-highest value by approximately 0.6 dB, except for the "barbara" image, where the PSNR values of the L0-TF model are comparable with those of the L1-TF model. For visual comparison, we display the deblurred images in Figures 14 and 15. One can see that the L0-TF model suppresses noise on smooth areas more effectively than the L1-TF and L1-TV models. In Figure 16, we compare the performance of the three models on two specific local areas in the zebra image and observe that the images restored by the L0-TF model preserve finer details and edges more effectively than those from the other two models.
In summary, the numerical experiments presented above all confirm the convergence of the proposed inexact algorithm and demonstrate the superiority of ℓ0 regularization over ℓ1 regularization in both approximation accuracy and sparsity. The numerical results also confirm that the importance of the inner iteration in the proposed inexact algorithm depends on the magnitude of the norm of the matrix B that appears in the objective function (2.5). In the regression and classification problems, inner iterations are crucial because the norm of B exceeds 500, while in the deblurring problems their significance is limited, as the norm of B is not greater than 2. The numerical results consistently verify that increasing the value of α speeds up convergence of the proposed algorithm when α falls within the range (0, 1), in which convergence is guaranteed by Theorem 1. For α > 1, the impact of α on the algorithm varies depending on the specific application.

Conclusion
The theoretical study and numerical experiments conducted in this paper have shown that the proposed inexact fixed-point proximity algorithms are effective for solving sparse regularization models involving the ℓ0 norm. The proposed inexact algorithm incorporates an inner iteration for the evaluation of the proximity operator of a convex function composed with a linear transformation. We have shown convergence of the resulting inexact algorithm to a local minimizer of the ℓ0 model.

Fig. 1: The effect of choices of the parameter α on the convergence of Algorithm 1 for regression.

Fig. 3: The effect of choices of the parameter α on the convergence of Algorithm 1 for classification.

Fig. 7: The effect of choices of the parameter α on the convergence of Algorithm 1 for the Gaussian noise image deblurring problem for the images "clock" (left) and "barbara" (right).

Fig. 9: Exploring the convergence of Algorithm 1 for image deblurring problems. (a) and (b) depict the objective function values, while (c) and (d) illustrate the PSNR values, for Algorithm 1 and its first variant, respectively.

Fig. 12: Gaussian noise image deblurring: deblurred local zebra images from the corrupted images with MBL being 21. The second, third, and fourth columns correspond to the deblurred local zebra images obtained from L0-TF, L1-TF, and L1-TV, respectively.

Table 1: Comparison of the ℓ0 model and the ℓ1 model for regression: the NNZ and TrMSE results within fixed TeMSE intervals.

Table 2: Comparison of the ℓ0 model and the ℓ1 model for classification: the NNZ and TrA results with fixed TeA.

Table 3: Comparison of the ℓ0 model and the ℓ1 model for classification: the TeA and TrA results within fixed NNZ intervals.

Table 5: Gaussian noise image deblurring: chosen values of the parameters.

Table 7: Poisson noise image deblurring: chosen values of the parameters.