Nonconvex Sparse Representation With Slowly Vanishing Gradient Regularizers

Sparse representation has been widely used over the past decade in computer vision and signal processing to model a wide range of natural phenomena. For computational convenience and robustness against noise, the optimization problem for sparse representation is often relaxed using convex or nonconvex surrogates instead of the l_0-norm, the ideal sparsity penalty function. In this paper, we pose the following question for nonconvex sparsity-promoting surrogates: What is a good sparsity surrogate for general nonconvex systems? As an answer to this question, we suggest that the difficulty of handling the l_0-norm comes not only from its nonconvexity but also from its gradient being zero or not well-defined. Accordingly, we propose desirable criteria for a good nonconvex surrogate and suggest a corresponding family of surrogates. The proposed family admits a simple regularizer, which enables efficient computation. The proposed surrogate embraces the benefits of both the l_0- and l_1-norms and, most importantly, its gradient vanishes slowly, which allows stable optimization. We apply the proposed surrogate to well-known sparse representation problems and benchmark datasets to demonstrate its robustness and efficiency.


I. INTRODUCTION
Recently, sparse representation of signals has been one of the most successful models in many fields, including signal processing, machine learning, and computer vision. Sparse representation has been shown to be a powerful tool for high-dimensional data such as images [1], [2], where the goal is to represent or compress cumbersome data using a few representative samples. A simple sparse representation problem (for a noiseless scenario) can be described as follows:

min_α ‖α‖_0   s.t.   x = Dα,

where ‖α‖_0 = #{i : α_i ≠ 0} is the l_0-norm, x ∈ R^m is an observed data vector, D ∈ R^{m×p} is an overcomplete dictionary (m ≪ p), and α ∈ R^p is the coefficient vector to be estimated. Typical applications of sparse representation include face recognition [3], image restoration [4], and super-resolution [5], to name a few.
Behind these successful outcomes, many efforts have been made to learn sparse representations efficiently [1], [3], [6]-[16], since solving a sparse representation problem using the l_0-norm has two main drawbacks: (1) the computational intractability of a combinatorial search and (2) its noise sensitivity due to the nature of the l_0 ball. One of the most popular algorithms for estimating sparse signals is orthogonal matching pursuit (OMP) [6], which greedily finds the best matching projections onto an overcomplete dictionary. However, this greedy pursuit can return a sub-optimal solution and may even fail to find a reasonable one. Even worse, it can become computationally burdensome when the dictionary is large.
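For illustration, a minimal greedy pursuit in the spirit of OMP can be sketched as follows (a simplified sketch with our own variable names, not the reference implementation of [6]):

```python
import numpy as np

def omp(D, x, k):
    """Greedy pursuit sketch: select at most k atoms of D to approximate x.
    Columns of D are assumed to be (roughly) unit-norm."""
    m, p = D.shape
    residual = x.copy()
    support = []
    alpha = np.zeros(p)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit the coefficients on the selected support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    alpha[support] = coef
    return alpha
```

Even in this simplified form, the per-iteration least-squares refit on the growing support hints at why the greedy approach becomes expensive for large dictionaries.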
There is little doubt that the recent popularity of sparse representation is largely attributable to relaxing the l_0-norm to its convex counterpart, i.e., the l_1-norm [17]. In many cases, the use of the l_1-norm turns the problem into a convex optimization, which can be solved efficiently with theoretical guarantees. In particular, some analyses have shown that l_1-norm-based problems can exactly recover the best sparse solution under certain conditions [2], [18], providing strong justification for the use of the l_1-norm. Accordingly, the l_1-norm has been extensively utilized in many problems under different forms, and many efficient methods, including basis pursuit denoising (BPDN) methods such as FISTA [19], have been proposed to solve l_1-norm minimization problems.
Obviously, the l_1-norm relaxation is beneficial when the relaxed problem or system indeed becomes convex. However, some problems are inherently nonconvex and, for those problems, replacing the sparsity surrogate with a convex one does not necessarily make the overall problem convex. Some well-known examples of such problems are matrix factorization [20] and rank-constrained subspace learning [21]-[23]. For these problems, using the l_1-norm does not bear as much significance as in the examples above. In fact, for general problems aside from the special (convex) cases mentioned above, the constant slope of the l_1-norm, which is also known as a biased penalty function [8], can over-penalize the values of nonzero elements, unlike the l_0-norm, and make the solution deviate from the desired one [8], [9], [12], [13]. This constant slope is what makes the l_1-norm a convex surrogate, which is not really necessary in the nonconvex settings discussed here. Note that there is a tighter convex approximation to the l_0-norm [24], but it also has a constant gradient along each direction.
Motivated by the fact that the l_1-norm may not be the best choice for nonconvex problems, we ask the following question: What is a good nonconvex sparsity surrogate if it is not possible to transform the problem into a convex one? It is easy to see that the l_0-norm is still difficult to handle in nonconvex problems. It is worth emphasizing that the nonconvex nature is not the only difficulty of the l_0-norm; its gradient either being zero (for the most part) or not being well-defined is the true culprit, which necessitates a combinatorial optimization approach. If we can find an approximation of the l_0-norm that has a sufficiently large region with nontrivial gradients, it becomes much easier to apply conventional optimization techniques.
In prior work, there have been attempts to use nonconvex smooth (or possibly nonsmooth) approximations of the l_0-norm [7]-[10], [12], [25], [26]. We discuss the theoretical relevance and differences of the proposed surrogate compared to other nonconvex alternatives in Section III-B.

A. PAPER CONTRIBUTIONS
In this paper, we analyze possible approximations of the l_0-norm. We first propose desirable criteria for a good nonconvex surrogate and present a representative family of curves, termed slowly vanishing gradient (SVG) surrogates, obtained as solutions of a differential equation. The proposed surrogates avoid the rapidly vanishing gradients of other well-known surrogates, which can hinder the progress of optimization. We also show that there is a trade-off between the values of the gradients and their vanishing speed. Locally, the surrogate follows the l_1-norm to reduce the chance of numerous local optima without losing the ability to promote sparsity. Globally, it follows the l_0-norm to reduce the penalty on large values, yet it still possesses slowly vanishing gradients that help draw the solution of an optimization algorithm toward sparse points. Moreover, the surrogate admits an efficient proximity operator. The proposed surrogate function is applied to various applications, such as low-rank approximation (LRA), sparse coding with dictionary learning (SC), and sparse subspace clustering (SSC), to demonstrate its adequacy, and experimental results confirm that our proposal performs favorably against other well-known sparsity surrogates.

B. NOTATIONS
An observation matrix is denoted by X ∈ R^{m×n}, where each column corresponds to a data sample in R^m. We denote matrices, vectors, and scalars by bold upper-case letters, bold lower-case letters, and lower-case letters, respectively, unless stated otherwise. Spaces and subspaces are denoted by bold italic upper-case letters. Throughout this paper, we use ‖A‖_q to denote matrix norms of a matrix A, with q = 1 for the matrix l_1-norm, Σ_{ij} |a_ij|, and q = F for the Frobenius norm, (Σ_{ij} |a_ij|²)^{1/2}. We denote the projection operator by P(·) and the support set of a matrix A by Ω_A. rank(A) denotes the rank of A, and |·| denotes the absolute value of a scalar. The diagonal elements of a matrix A are denoted by diag(A).

II. ANALYSIS ON THE l_0-NORM APPROXIMATION
A. DESIRABLE CRITERIA FOR A NONCONVEX SURROGATE
In this section, we mainly discuss a sparse representation problem whose cost function consists of a data term and a regularizer. As explained earlier, if the problem itself (the data term) has a nonconvex structure, then the convexity of the sparsity surrogate (the regularizer) is not absolutely necessary. In this case, the constant slope of the l_1-norm will not necessarily make the problem convex but will over-penalize nonzero values in the input, which makes the solution deviate from the desired one, especially when the problem assumes the presence of noise. Hence, we are interested in finding a good nonconvex surrogate for such general nonconvex problems. Prior works support the superiority of nonconvex sparsity-promoting surrogates [9], [12], [26]-[31].
If the nonconvexity of the l_0-norm is not a problem, then the only difficulty in handling it is that its value changes only around zero (we can imagine its shape as concentrating an extremely local gradient at the origin). This is highly undesirable from the perspective of conventional optimization: since the derivative of the l_0-norm is zero for nonzero inputs and undefined at zero, there can be undesirable effects both for finding sparse solutions and for discovering a good local optimum.
In order to find a surrogate that has the fewest undesirable effects and can also be handled efficiently, we might consider smooth approximations of the l_0-norm [9], [10], [13]. However, there can be infinitely many such approximations, and we need some criteria for finding a good surrogate. Below are the basic assumptions for a good candidate.
Assumption 1: We pose the following criteria on the surrogate φ(x) (defined on −∞ < x < ∞) we are looking for:
1) Symmetry: The sign of an input does not matter, only its magnitude; hence, we assume φ(x) = φ(−x).
2) Continuity at x = 0: In order to avoid a jump at x = 0, we assume φ(0) = 0.
3) Asymptotic convergence: φ satisfies lim_{x→∞} φ(x) = 1. This prevents φ(x) from penalizing large nonzero inputs as heavily as small ones and makes it closer to the l_0-norm.
4) Monotonicity: In order for φ to be a valid surrogate, φ is a monotonically increasing function on x > 0.
5) Nonzero and finite slope at the origin: The one-sided derivative φ′(0+) is nonzero and finite.

Remark 1:
We give more details for the last criterion. Note that a local optimum of a cost function, consisting of a data term and a continuous, symmetric regularizer term, exists at a sparse point when the sum of the (generalized) subgradients [32] of the two terms contains 0. First, φ′(0+) should be nonzero to promote sparsity. If φ′(0+) = 0 then, due to the symmetry assumption, the derivative of the cost function at a sparse point depends only on the data term. In contrast, if φ is nondifferentiable at x = 0 with φ′(0+) > 0, the chance that a sparse point is optimal increases, since a subgradient of the regularizer may make one of the subgradients of the entire cost function zero. In short, a nondifferentiable regularizer increases the chance of a zero subgradient of the cost function at sparse points more than a differentiable regularizer does. Second, φ′(0+) should be finite, so that the subdifferential at 0 does not contain an infinitely large set of subgradients, i.e., does not become (−∞, ∞); such a subdifferential always makes the subdifferential of the entire cost function contain zero and can thus create too many local optima at sparse points. In other words, we would like to avoid a sparse point becoming a local optimum when the data term has a steep slope at that point.
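As a one-dimensional illustration of this reasoning (our own example, not taken from the references), consider a cost f(x) = h(x) + λφ(x) with a differentiable data term h. At x = 0, the subdifferential is

∂f(0) = h′(0) + λ[−φ′(0+), φ′(0+)],

so 0 ∈ ∂f(0), i.e., x = 0 is a candidate local optimum, exactly when |h′(0)| ≤ λφ′(0+). A nonzero φ′(0+) therefore promotes sparse stationary points, while a finite φ′(0+) prevents every sparse point from becoming stationary regardless of how steep the data term is.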
Aside from the above criteria, we have one more consideration for the choice of φ. As discussed before, the gradient being either 0 or not well-defined is what makes optimization with the l_0-norm difficult. Thus, we aim to find a surrogate with the opposite characteristic: a φ(x) whose gradient is as large as possible across the entire interval. Because of the fifth criterion, this is equivalent to finding a φ(x) with slowly vanishing gradients. If φ′(x) decreases slowly, then the effect of the sparsity surrogate spreads across a large region and helps draw the solution toward sparse points. This can be viewed as mimicking the constant slope of the l_1-norm under the above criteria. Hence, we might try to find the φ(x) with the most slowly decreasing gradient. However, due to the third criterion, the "total amount" of gradient is finite, i.e., ∫_0^∞ φ′(x) dx = lim_{x→∞} φ(x) − φ(0) = 1. This means that we have to distribute this finite amount over 0 < x < ∞.

B. A REPRESENTATIVE FAMILY OF SURROGATES
To analyze the situation discussed above more closely, we present two extreme examples among the possible family of surrogates that satisfy the above criteria. Because of the first criterion, we can assume φ(x) = y(|x|) for some function y on R_+. First, let us see an example that is a smooth relaxation of the l_0-norm but whose gradient vanishes exponentially fast, within a relatively local region. An easy example is

y(x) = 1 − e^{−x},   (3)

which satisfies y(0) = 0, y(∞) = 1, y′(0+) = 1, and all of the above criteria. Its derivative is y′(x) = e^{−x}, which means that the gradient vanishes exponentially. Hence, this surrogate quickly becomes negligible outside a local region near x = 0.
As an opposite example, let us consider a case in which the gradient vanishes very slowly,

y(x) = 1 − (1 + x/a)^{−a},

with very small a > 0. Its derivative is y′(x) = (1 + x/a)^{−(1+a)}, and this also satisfies y(0) = 0, y(∞) = 1, y′(0+) = 1, and all of the above criteria. Here, since a is very small, y′(x) is close to a reciprocal function whose integral over 0 ≤ x < ∞ does not converge; hence, this can be seen as an extreme example with very slowly vanishing gradients. However, (1 + x/a)^{−(1+a)} is very close to 0 for most values of x, which is a natural consequence of spreading a finite value (∫_0^∞ y′(x) dx = 1) over a broad interval. Indeed, we can verify that lim_{a→0} y′(x) = 0 for any fixed x > 0, and the function itself approaches zero, i.e., lim_{a→0} y(x) = 0 for any fixed x. Note that the previous example can be viewed as the opposite extreme in this sense, since lim_{a→∞} y(x) = 1 − e^{−x}. Therefore, there is a trade-off between the spread (vanishing speed) of the gradients and their actual values. Some example curves of y and y′ for various values of a are illustrated in Figure 1.
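These two extremes can also be compared numerically; the short script below (an illustrative sketch of our own, fixing ε = 1 and mirroring the comparison in Figure 1) evaluates y and y′ for several values of a.

```python
import numpy as np

def svg_family(x, a, eps=1.0):
    """Curve y(x) = 1 - (1 + |x|/(a*eps))**(-a) and its derivative, for x >= 0."""
    base = 1.0 + np.abs(x) / (a * eps)
    y = 1.0 - base ** (-a)
    dy = (1.0 / eps) * base ** (-(a + 1))
    return y, dy

x = np.linspace(0.0, 10.0, 1001)
for a in [0.1, 1.0, 10.0, 1000.0]:  # small a: slowly vanishing; large a: close to 1 - exp(-x)
    y, dy = svg_family(x, a)
    print(f"a = {a:7.1f}   y(10) = {y[-1]:.3f}   y'(10) = {dy[-1]:.4f}")
```

For large a the gradient at x = 10 is already close to e^{−10} ≈ 0, while for small a it is still non-negligible, at the price of the curve itself staying far below 1.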
In addition to these extreme examples, there are infinitely many functions that satisfy our criteria. However, the details of the curve shapes do not matter much, because local differences between two curves do not bear significant meaning for general problems. Hence, it suffices to choose a representative family of curves that has a nice interpretation and covers various rates of gradient vanishment, in order to narrow down our choices. In fact, the previous examples are good candidates, since they are solutions of the following differential equation, which has an elegant meaning:

d(1 − y(x))/dx = −(1/ε)(1 − y(x))^{1 + 1/a},   y(0) = 0,   (9)

where a > 0 and ε > 0 are parameters. It is worth noting that (1 − y) on the left side is the difference between the l_0-norm and y; thus, the decreasing speed of (1 − y) is identical to the rate of asymptotic convergence (criterion 3). Therefore, this equation describes the rate of gradient vanishment in terms of the rate of asymptotic convergence. This can be transformed into a Bernoulli equation, and the solution is given as

y(x) = 1 − (1 + x/(aε))^{−a},   x ≥ 0,

which satisfies y′(0+) = 1/ε, y(0) = 0, and y(∞) = 1 for a > 0. We call the corresponding penalty functions satisfying (9) the family of slowly vanishing gradient (SVG) surrogates. As a special case of the family of SVG surrogates, when ε = 1 and a → ∞, the solution leads to (3).
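For completeness, a brief check (our own working, under the parameterization above) that the stated curve indeed solves the differential equation: since 1 − y(x) = (1 + x/(aε))^{−a}, we have

y′(x) = (1/ε)(1 + x/(aε))^{−(a+1)} = (1/ε)[(1 + x/(aε))^{−a}]^{(a+1)/a} = (1/ε)(1 − y(x))^{1 + 1/a},

so −d(1 − y(x))/dx = (1/ε)(1 − y(x))^{1 + 1/a} with y(0) = 0, y′(0+) = 1/ε, and y(∞) = 1; letting a → ∞ with ε = 1 recovers y(x) = 1 − e^{−x}.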

III. PROPOSED NONCONVEX SPARSITY SURROGATE
A. CHOOSING A SIMPLE ONE AMONG THE SVG FAMILY
As explained in the previous section, there is a trade-off between the vanishing speed and the actual values of the gradient. Thus, we can, at best, choose a good compromise between them. Since there is no clear winner among the curves in our SVG family, it is better to choose the simplest one among the reasonable choices. Accordingly, we constrain a to be an integer and pick the one that gives the slowest decreasing rate of gradient, which is a = 1. As a result, we have y(x) = 1 − ε/(x + ε) = x/(x + ε). Based on this function, our proposed sparsity surrogate is given as follows:

‖α‖_SVG = Σ_i |α_i| / (|α_i| + ε),   (10)

where ε > 0 is a weighting parameter that determines the slope at α_i = 0+. The following proposition shows the pointwise convergence of the proposed surrogate to the conventional norms.
Proposition 1: SVG approximates the l_0- and l_1-norms: lim_{ε→0} ‖α‖_SVG = ‖α‖_0 and lim_{ε→∞} ε‖α‖_SVG = ‖α‖_1.
Note that the above properties still hold for the proposed SVG family based on (9). The proof is included in Appendix A. Some example curves of SVG are illustrated in Figure 2 to visualize these properties.
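For a single coefficient, the two limits can be checked directly: |α_i|/(|α_i| + ε) → 1 as ε → 0 whenever α_i ≠ 0 (and equals 0 at α_i = 0), while ε|α_i|/(|α_i| + ε) → |α_i| as ε → ∞; summing over i recovers the l_0-norm and the (rescaled) l_1-norm, respectively.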
Another nice property of SVG is that it possesses a simple proximity operator. Recently, there has been remarkable theoretical progress on convergence analysis for sparse optimization techniques, and nonconvex versions of the accelerated proximal gradient method (nAPG) [33] and the alternating direction method of multipliers (nADMM) [34] have been proposed to solve sparse optimization problems efficiently in nonconvex settings. Hence, even though SVG is nonconvex, having a simple proximity operator is still a great advantage for incorporating the above methods for efficient nonconvex programming. The proximity operator for SVG is defined as follows:

prox_{λ‖·‖_SVG}(x) = argmin_u  λ‖u‖_SVG + (1/2)‖u − x‖_2².   (11)

Note that this problem is separable, so we can solve it for each element of u. Since SVG is a symmetric function of each element, an element of the solution vector û will either have the same sign as the corresponding element of x or be zero. Let us assume, without loss of generality, that x_i, the ith element of x, is positive. Then, either one of the positive roots of the following cubic equation or zero will be the optimal u_i:

g(u_i) = (u_i + ε)²(u_i − x_i) + λε = 0.   (12)

Note that the coefficient of the third-order term of g(u_i) is positive, as is the value g(−ε) = λε > 0. This indicates that g(u_i) has at least one root for u_i < 0, i.e., there can be at most two roots for u_i ≥ 0. If there is no root or only a double root for u_i ≥ 0, then g(u_i) is nonnegative for u_i ≥ 0, i.e., the cost function is monotonically increasing for positive u_i, and the optimal point is 0. If there are two distinct roots, then the larger one is a local minimum, so either this root or zero is the optimal point. In conclusion, the optimal û_i is either the largest positive root of (12) or zero, and we can compare the costs of these two points to find the final solution. This analysis relieves the computational burden of solving the third-order equation.
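A minimal elementwise implementation of this proximity operator is sketched below (an illustration of our own; it solves the cubic (12) with a generic root finder rather than a tuned closed-form solver).

```python
import numpy as np

def svg_prox(x, lam, eps):
    """Elementwise sketch of prox_{lam*||.||_SVG}(x), i.e.
    argmin_u  lam * sum_i |u_i|/(|u_i| + eps) + 0.5 * ||u - x||_2^2."""
    x = np.asarray(x, dtype=float)
    u = np.zeros_like(x)
    for idx, xi in np.ndenumerate(x):
        s, v = np.sign(xi), abs(xi)
        # Stationarity condition (u + eps)^2 (u - v) + lam*eps = 0, expanded to
        # standard cubic form u^3 + c2*u^2 + c1*u + c0 = 0.
        c2 = 2.0 * eps - v
        c1 = eps * eps - 2.0 * eps * v
        c0 = lam * eps - eps * eps * v
        roots = np.roots([1.0, c2, c1, c0])
        # Keep real positive candidates and compare their costs against u = 0.
        cand = [r.real for r in roots if abs(r.imag) < 1e-10 and r.real > 0]
        best, best_cost = 0.0, 0.5 * v * v  # cost of the zero solution
        for r in cand:
            cost = lam * r / (r + eps) + 0.5 * (r - v) ** 2
            if cost < best_cost:
                best, best_cost = r, cost
        u[idx] = s * best
    return u
```

For instance, svg_prox(np.array([3.0, 0.1, -2.0]), lam=1.0, eps=0.5) shrinks the small entry to zero while leaving the large entries almost untouched, which reflects the debiasing behavior discussed above.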

B. RELATIONSHIP WITH OTHER SPARSITY SURROGATES
There are many nonconvex sparsity surrogates, such as the smoothly clipped absolute deviation (SCAD) [8], the minimax concave penalty (MCP) [12], and the capped-l_1 (CapL1) penalty [26], which have been proposed to approximate the l_0-norm. A comprehensive study on nonconvex sparsity surrogates can be found in [27], [35]. In [8], the authors advocate a nonconvex surrogate that has three desired properties: unbiasedness, sparsity, and continuity. More general properties for a good nonconvex surrogate are described in [35] (see Assumption 1). Note that the proposed family of surrogates satisfies these properties and further extends them by introducing an important new criterion: slowly vanishing gradients. In addition, our surrogate provides theoretical guarantees, as shown in Section III-C. The above-mentioned surrogates do not satisfy the criterion of slowly vanishing gradients, since they have large flat regions (zero or quickly vanishing gradients). This may increase the chance of local optima if some local optima of the loss function (data term) are located on the plateau of the surrogate functions (regularizers). Our aim is to mitigate this effect. Unlike the above surrogates, there is another line of surrogates [7], [9], [10], proposed as alternatives to the l_0-norm, that give steadily inclined curves analogous to the proposed surrogate. A typical example is the l_q-norm (0 < q < 1) [7]. However, there is no analysis of the l_q-norm analogous to ours. Even worse, the l_q-norm is known to be difficult to optimize because of the qth power, whereas ours enjoys a simple proximity operator and handles the raised issues efficiently. Analogous to the l_q-norm penalty, the log-sum penalty (LSP) [9] also gives a non-flat curve, but it does not deliver satisfying performance compared to the proposed penalty, as shown in Section IV-A. There has been another attempt to use a smooth approximation of the l_0-norm (SL0) based on an exponential function [10], but no analysis was provided to justify such a choice. Furthermore, our analysis shows that an approximation based on an exponential function has quickly vanishing gradients, which makes it more prone to local optima; accordingly, this approximation does not give satisfactory performance, as shown in Section IV-D.
While preparing this manuscript, we became aware that our proposal, as a special case (a = 1) of the SVG family, coincides with the surrogate proposed by Geman and Yang [25] over two decades ago. However, it is important to note that there are clear differences between their study and ours. First, the specific choice for approximating the l_0-norm is not justified in [25], whose focus is an image reconstruction problem. Second, we provide detailed analyses and the design motivation for the SVG surrogate. Furthermore, the optimization approach in [25] is outdated, while we provide efficient algorithms based on a proximity operator for the proposed surrogate. Lastly, we show comprehensive experimental results compared with existing nonconvex surrogates using many well-known examples in the literature. Overall, our motivation and analysis give a new insight from the optimization perspective on nonconvex sparsity surrogates, and the proposed surrogate provides superior performance compared with existing approximations of the l_0-norm, as described in Section IV.

C. CONNECTION TO EXISTING THEORY
In this section, we provide an analysis of the connection to existing theory. To this end, we first recall the following well-studied assumption.
Assumption 2 ([35]): We consider a scalar variable x for simplicity and define a regularizer as φ_λ : R → R.
Proposition 2: The representative family of surrogates φ_λ designed by our criteria with parameters ε and a satisfies the conditions of Assumption 2 with L = 1 and µ = (a + 1)λ/(aε²).
Corollary 1: The proposed SVG surrogate given in (10) satisfies the conditions of Assumption 2 with L = 1 and µ = 2λ/ε².
The proof of Proposition 2 is included in Appendix B.

D. LEARNING SPARSE REPRESENTATION WITH SVG
The proposed surrogate can be applied to various sparse representation problems to which the l_0-norm and l_1-norm are commonly applied. In this section, we focus on three important problems: low-rank approximation (LRA) [20], sparse coding (SC) [1], and sparse subspace clustering (SSC) [36].

1) SVG FOR MODELING SPARSE ERRORS IN LRA
Sparse representation has been widely used in many applications to filter out outliers in data. One of the most popular applications is the low-rank approximation (LRA) of a matrix in the presence of outliers, where the l_1-norm is usually used to model the sparse outliers [2], [14], [37]. We consider an LRA problem in which the rank is explicitly specified, as in structure reconstruction [38] and photometric stereo [18], to name a few; in this case, the problem becomes nonconvex. For this problem, we apply SVG to model the sparse errors, denoted by E, and the resulting formulation (LRA-SVG) is constructed as

min_{M, E}  ‖E‖_SVG   s.t.   X = M + E,   rank(M) ≤ r.   (13)

This problem can be efficiently solved using the nADMM framework [34], as discussed before. The derivation of LRA-SVG is included in Appendix C.
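Under the augmented Lagrangian used in Appendix C, with multiplier Π and penalty weight γ, one pass of the resulting iteration can be sketched as follows (the update order and any adaptive schedule for γ are implementation details not fixed here):

M ← argmin_{rank(M) ≤ r} ‖(X − E − Π/γ) − M‖_F²   (best rank-r approximation via the truncated SVD),
E ← argmin_E ‖E‖_SVG + (γ/2)‖E − (X − M − Π/γ)‖_F²   (elementwise SVG proximity operator),
Π ← Π + γ(M + E − X).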

2) SVG FOR SPARSE CODING
The proposed surrogate can be applied to another well-known nonconvex sparse representation problem, sparse coding (SC) with dictionary learning [1], [39], which is basically a matrix factorization problem. Here, SVG is used to enforce the sparsity of the encodings. The problem formulation of SC for observation vectors x_1, x_2, ..., x_n, where n is the number of samples, based on SVG (SC-SVG) can be given as follows:

min_{D, {α_i}}  Σ_{i=1}^{n} [ (1/2)‖x_i − Dα_i‖_2² + λ‖α_i‖_SVG ],   (14)

where D and α_i are an overcomplete dictionary and the ith sparse coefficient vector corresponding to x_i, respectively. This problem is solved in an alternating fashion based on the proximal gradient method.
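One alternating pass can be sketched with a step size η as follows (the dictionary update is written as a plain least-squares fit with column renormalization, which is one common choice; the exact schedule may differ):

α_i ← prox_{ηλ‖·‖_SVG}( α_i − η D^T(Dα_i − x_i) ),   i = 1, ..., n,
D ← argmin_D Σ_i ‖x_i − Dα_i‖_2²,   followed by rescaling each column of D to unit norm.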

3) SVG FOR SSC
Subspace clustering is the problem of finding the cluster memberships of data points under the assumption that each point can be represented as a linear combination of other points in the same cluster. Note that this problem can be efficiently solved based on convex optimization; nevertheless, we apply SVG to this problem in order to verify the capability of the proposed surrogate in general problems. We apply SVG to the well-known sparse subspace clustering (SSC) framework [36], where the corresponding formulation (SSC-SVG) under a noisy scenario is given as follows:

min_Z  ‖Z‖_SVG + (λ/2)‖X − XZ‖_F²   s.t.   diag(Z) = 0,   (15)

where Z is an affinity matrix that reveals the cluster membership. This problem can be efficiently solved by nAPG [33]; in particular, we incorporate the nonmonotone update framework of [33] to accelerate the convergence of the algorithm. Note that the initial values of the optimization variables for the proposed algorithm are set to zero, based on the empirical observation that the algorithm is not sensitive to initialization.

IV. EXPERIMENTAL RESULTS
In this section, we report numerical results for the sparse representation algorithms based on SVG: LRA-SVG, SC-SVG, and SSC-SVG. We compare these algorithms with other state-of-the-art algorithms: RPCA [18], ALADM [37], and LRA-L1 (an l_1-norm version of LRA-SVG) for low-rank approximation problems; KSVD [1] and SC [39] for sparse coding problems; and LRR [40], SSC-BP [36], SSC-OMP [41], and SSC-SL0 (SSC based on the smoothed l_0-norm [10]) for the subspace clustering tasks of face clustering and motion segmentation. We also compare with other well-known nonconvex sparsity surrogates, SCAD [8], MCP [12], CapL1 [26], and LSP [9], in order to demonstrate the superiority of the proposed nonconvex surrogate for the problems described above. For the compared algorithms, we used the codes provided by the authors, unless stated otherwise, and each algorithm's parameters were tuned to yield the best performance for each dataset. For the low-rank approximation and sparse coding problems, we compute the reconstruction error based on the Hadamard-weighted difference W ⊙ (M_GT − M), where M_GT and M are the ground-truth and reconstructed matrices or vectors, respectively, W is a weight matrix accounting for missing entries, and ⊙ is the Hadamard product operator. For subspace clustering, we compute the accuracy by the Hungarian method [42] as (1/n) Σ_i δ(p_i, map(q_i)), where p_i and q_i are the ith ground-truth and obtained cluster labels, respectively, δ(a, b) is the Kronecker delta function, and map(·) is a mapping function that permutes the obtained labels to match the ground-truth labels, computed by the Kuhn-Munkres algorithm [42]. We set the parameter ε of SVG to 3 for all experiments except the motion segmentation problem, where ε is set to 0.07. We also provide the settings for λ in the following experiments. All experiments were performed in a MATLAB environment on a desktop computer with 24 GB RAM and a 3.4 GHz quad-core CPU.
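Both metrics can be computed as in the sketch below (our own variable names; we assume SciPy's Hungarian solver for the label matching and a normalized form of the weighted reconstruction error):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reconstruction_error(M_gt, M, W):
    """Relative error on the Hadamard-weighted difference; W masks missing entries."""
    return np.linalg.norm(W * (M_gt - M)) / np.linalg.norm(W * M_gt)

def clustering_accuracy(p, q):
    """Best-match clustering accuracy: permute predicted labels q to ground truth p
    with the Hungarian (Kuhn-Munkres) algorithm, then count agreements."""
    p, q = np.asarray(p), np.asarray(q)
    classes_p, classes_q = np.unique(p), np.unique(q)
    C = np.zeros((len(classes_q), len(classes_p)), dtype=int)
    for i, cq in enumerate(classes_q):
        for j, cp in enumerate(classes_p):
            C[i, j] = np.sum((q == cq) & (p == cp))
    row, col = linear_sum_assignment(-C)  # maximize the total agreement
    return C[row, col].sum() / len(p)
```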

A. EVALUATION FOR NONCONVEX SPARSITY SURROGATES
We first evaluate SVG on synthetic examples to compare it with other nonconvex sparsity surrogates. We used the codes of the compared penalties provided by the work in [13], which solves the nonconvex optimization problems efficiently with a convergence guarantee. Following the practice in [13], we construct a sparse coding problem to find a sparse vector α: min_α (1/2)‖x − Dα‖_2² + λφ(α), where x ∈ R^m is a target vector, D ∈ R^{m×p} is a data/dictionary matrix, and φ(α) is a penalty function. For all experiments in this subsection, we set m = p = 500. We vary the sparsity (0–90%) of a ground-truth vector α_GT, where lower sparsity means a denser representation, and generate an observation x_GT as the product of D and α_GT, both of which are drawn from the standard normal distribution. Based on x_GT, we obtain x by adding Gaussian noise from N(0, 10^{−2}). For each setting in the scenario, we performed k independent runs, where k is set to 30, and we set λ to 0.3. The reconstruction error is averaged over the k runs. Average results of the compared surrogates are shown in Figure 3. As shown in Figure 3(a), the proposed surrogate performs better than the other nonconvex surrogates on average. LSP, which has a similar non-flat curve, gives performance similar to ours when the sparsity ratio is larger than 40%. SCAD and MCP show similar, and the worst, performance in this problem. The average computation times (sec) of the surrogates for the reconstruction problem are 0.15 for CapL1, 0.28 for SCAD, 0.26 for MCP, 0.23 for LSP, and 0.3 for SVG; most of the methods take similar execution times in this problem.
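The synthetic protocol can be reproduced roughly as follows (an illustrative sketch of our own; random seeds and solver settings are not specified above):

```python
import numpy as np

def make_sparse_coding_instance(m=500, p=500, sparsity=0.5, rng=None):
    """Generate (x, D, alpha_gt): standard-normal dictionary and coefficients,
    a fraction `sparsity` of the ground-truth coefficients zeroed out, and
    Gaussian noise N(0, 1e-2) (standard deviation 0.1) added to the observation."""
    rng = np.random.default_rng() if rng is None else rng
    D = rng.standard_normal((m, p))
    alpha_gt = rng.standard_normal(p)
    zero_idx = rng.choice(p, size=int(sparsity * p), replace=False)
    alpha_gt[zero_idx] = 0.0
    x_gt = D @ alpha_gt
    x = x_gt + 0.1 * rng.standard_normal(m)
    return x, D, alpha_gt
```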
We also applied the surrogates to nonnegative sparse coding problems under the same setting as the previous example, except that all coefficients are constrained to be nonnegative. Figure 3(b) shows the l_2 errors between the true coefficient vector and the obtained vectors under different numbers of true nonzero coefficients. Overall, the results show a tendency similar to the previous experiment, with the proposed surrogate recovering all sparse coefficients with the lowest errors.

B. LOW-RANK APPROXIMATION OF MATRICES
We report the results for low-rank approximation problems using both synthetic and real-world data. To generate synthetic examples, we made a 500 × 500 matrix of rank 10, added Gaussian noise from N(0, 10^{−5}), and corrupted randomly chosen elements with outliers of magnitude 10. The outlier ratio is varied from 0% to 50% to verify the robustness of the proposed method. Here, we compare with MCP, CapL1, and LSP under the same LRA framework as ours. The experimental results for 50 independent trials are described in Figure 4(a). From the figure, we observe that the proposed method withstands higher outlier ratios, whereas the other methods fail to find a good solution when the outlier ratio becomes roughly over 30%. The three nonconvex penalty based algorithms perform better than the methods based on the convex l_1-norm on average, but they could not endure as many outliers as ours. Interestingly, the algorithms based on nonconvex penalties tolerate relatively more outliers than the convex penalty based algorithms, particularly as the outlier ratio approaches 50%. The average computation times (sec) of the algorithms are 0.62 for ALADM, 11.74 for RPCA, 1.76 for LRA-L1, 50.24 for LRA-LSP, 13.77 for LRA-MCP, 13.8 for LRA-CapL1, and 3.16 for LRA-SVG.
We performed real-world experiments on two problems: nonrigid motion estimation [43] and photometric stereo [38]. For nonrigid motion estimation, we used the Shark sequence (rank 6) [43]; to simulate missing data, we marked 10% of randomly selected entries in the sequence as missing. For photometric stereo, we used the Static Face dataset (rank 4) [38], which has 42% missing entries. For these problems, we did not evaluate RPCA because they are rank-constrained problems. Figures 4(b) and 4(c) show the average reconstruction errors of the algorithms over 50 independent runs under various outlier ratios (0–50%). From the figures, we confirm that the proposed method outperforms the other methods for both problems. In particular, the proposed method is highly robust against corruptions on the Static Face dataset. Most of the nonconvex penalty based algorithms show lower reconstruction errors than the l_1-norm based algorithms. While LRA-LSP gives results competitive with LRA-SVG on the Shark sequence, it performs worse than ours on Static Face. The l_1-norm approaches, LRA-L1 and ALADM, perform worse than the other nonconvex surrogate based algorithms on average for both datasets. The reconstruction results of three selected algorithms, the proposed method, LRA-LSP, and LRA-L1, for three randomly selected frames of the Shark sequence in the presence of 30% outliers are shown in Figure 5.

C. SPARSE CODING
We conducted experiments for the sparse coding problem (14) based on well-known example images in the literature: Barbara, Lena, Boat, and Peppers, which are shown in Figure 6. Following the practice of [1], we extracted n 64-dimensional word vectors based on 8 × 8 local patches for each image, where n is the number of training samples and was set to n = 15,000. Based on these word vectors, we learned both the dictionary and the sparse code for each sample. For all tested images, the size of the dictionary D was set to 250, i.e., D ∈ R^{64×250}. In each dataset, we added Gaussian noise from N(0, 0.3). The parameter λ for SVG was set to 8 for the images. The average reconstruction errors of the tested algorithms are shown in Table 1. In the table, our algorithm gives excellent results for all cases. KSVD, which uses OMP, performs slightly better than SC based on the l_1-norm, but it is unsatisfactory compared to ours. This experiment also demonstrates the excellence of the proposed surrogate.

D. SUBSPACE CLUSTERING
We applied the proposed method, SSC-SVG, to two benchmark problems for subspace clustering: face clustering [44] and motion segmentation [45].

1) FACE CLUSTERING
We evaluated the proposed surrogate on the Extended Yale B database [44] for face clustering. The dataset consists of 38 subjects, each with 64 frontal face images under varying illumination. We collected the first c subjects, where c ∈ {2, 5, 8, 10}, and performed subspace clustering on the images of these subjects. For each problem, we used PCA to project the images onto 9c-dimensional subspaces to make an overcomplete dictionary. We set the parameter λ to 90. Table 2 shows the clustering accuracy for different numbers of subjects. The proposed method, SSC-SVG, shows better clustering performance than the existing algorithms based on convex or nonconvex regularizers. SSC-OMP performs better than SSC-BP, SSC-SL0, and LRR on average, but it gives lower accuracy than ours in most scenarios; in particular, its performance collapses considerably when c > 5. SSC-SL0 shows the worst performance among the tested algorithms.

2) MOTION SEGMENTATION
The goal of the motion segmentation task is to segment the trajectories of rigidly moving objects based on points tracked along the frames. Since the trajectories collected from a rigid motion lie in a low-dimensional subspace, we can solve motion segmentation as a subspace clustering problem [36]. Hence, we applied SSC-SVG to the well-known benchmark dataset Hopkins 155 [45], which consists of 155 video sequences with two or three motion clusters. The average dimensionality and number of samples per sequence are roughly 60 and 296, respectively. In this problem, we set λ = 7 × 10^{−3}. Four quantitative measures were used for clustering performance: mean, standard deviation (Std.), minimum, and median, following the work in [36]. The average performance of the algorithms is shown in Table 3. As shown in the table, our proposal outperforms the existing algorithms approximating the l_0-norm as well as the dense representation method, LRR. SSC-BP and LRR give similar performance, but they are unsatisfactory compared to ours. The two algorithms approximating the l_0-norm, SSC-OMP and SSC-SL0, show disappointing results in this problem. Some graphical results on the dataset for four selected methods are illustrated in Figure 7.

V. CONCLUSIONS
We have analyzed desirable criteria for a good nonconvex sparsity surrogate and presented a corresponding family of surrogates, obtained as solutions of a differential equation, named slowly vanishing gradient (SVG) surrogates. Among the SVG surrogates, we selected a practical one as the proposed surrogate, which combines the benefits of both the l_0- and l_1-norms. The penalty possesses a simple proximity operator, which allows efficient nonconvex optimization, and it is a good alternative to the l_0-norm thanks to its slowly vanishing gradient property. The proposed surrogate has been tested on various applications in the literature to demonstrate its effectiveness, and empirical results have confirmed the superiority of the proposal.

APPENDIX A
PROOF OF PROPOSITION 1
Since the proposed surrogate, SVG, is a member of our representative family, we prove the properties in Proposition 1 for the whole family. We redefine the family of curves, called SVGF, as follows:

‖x‖_{a,ε}^{SVGF} = Σ_i y(x_i),   y(x_i) = 1 − (1 + |x_i|/(aε))^{−a},   (18)

where a and ε are the parameters of the family as defined in (9). If a = 1, the function in (18) becomes the proposed surrogate.
Proposition 3: SVGF satisfies the following properties:
1) ‖x‖_{a,ε}^{SVGF} ≤ ‖x‖_0 for all a and ε, and lim_{ε→0} ‖x‖_{a,ε}^{SVGF} = ‖x‖_0;
2) ε‖x‖_{a,ε}^{SVGF} ≤ ‖x‖_1 for all a and ε, and lim_{ε→∞} ε‖x‖_{a,ε}^{SVGF} = ‖x‖_1.
Proof: Assume that a and ε in ‖x‖_{a,ε}^{SVGF} are positive. We show the proposition for the scalar case; its extension to the vector case is straightforward. It is easily checked that y(x) = 0 if x = 0 and y(x) ≤ 1 if x ≠ 0; thus, SVGF is always lower than or equal to the l_0-norm for all x regardless of ε. If ε goes to zero, (1 + |x|/(aε))^{−a} → 0 when x ≠ 0. Thus, y(x) → 1, and the asymptotic convergence to the l_0-norm holds.
Note that both y(x) and the l_1-norm are symmetric around zero and nonnegative (with y(0) = 0). Then, εy(x) is lower than or equal to the l_1-norm, since εy′(x) = (1 + x/(aε))^{−(a+1)} ≤ 1 for all nonnegative x; this also holds for x < 0 by symmetry. Finally, in order to show that εy(x) asymptotically converges to |x| as ε → ∞, we use the relation (1 + |x|/(aε))^{−a} = 1 − |x|/ε + o(1/ε), which gives lim_{ε→∞} εy(x) = |x|; summing over the elements yields lim_{ε→∞} ε‖β‖_{a,ε}^{SVGF} = ‖β‖_1.

APPENDIX C
DERIVATIONS OF THE LRA PROBLEMS
For the LRA problem, we apply SVG to model the sparse errors, and the problem formulation, termed LRA-SVG, is constructed as follows:

min_{M, E}  ‖E‖_SVG   s.t.   X = M + E,   rank(M) ≤ r.   (27)

The augmented Lagrangian of (27) is constructed as

L(M, E, Π) = ‖E‖_SVG + ⟨Π, M + E − X⟩ + (γ/2)‖M + E − X‖_F²,   (28)

such that rank(M) ≤ r, where Π is the Lagrange multiplier and γ > 0 is a penalty parameter. Based on (28), we obtain an algorithm that alternates the following steps: M is updated as the best rank-r approximation (truncated SVD) of X − E − Π/γ; E is updated by solving

min_E  ‖E‖_SVG + (γ/2)‖E − (X − M − Π/γ)‖_F²;   (29)

and the multiplier is updated as Π ← Π + γ(M + E − X). For an element e_ij indexed by the support set Ω_X of the observed entries, (29) reduces to the elementwise problem

min_{e_ij}  y_SVG(e_ij) + (γ/2)(e_ij − x_ij + m_ij + π_ij/γ)²,   (33)

where y_SVG(·) is the scalar SVG penalty and x_ij, m_ij, and π_ij are the (i, j)th elements of X, M, and Π, respectively. The solution of (33) can be found by an efficient computation for each element separately, as explained in Section III-A. For an element e_kl indexed by Ω_X^c, the complement of the support set of X, we obtain e_kl ← x_kl − m_kl − π_kl/γ. For the tested algorithms based on the same ADMM framework, such as LRA-L1, LRA-CapL1, and LRA-MCP, we simply switch the penalty function ‖·‖_SVG in (27), (28), and (29) to the corresponding penalty function and solve the resulting optimization problem. As an example, when solving for the variable E, LRA-L1 considers the following optimization problem in the ADMM framework:

min_E  ‖E‖_1 + (γ/2)‖E − Y‖_F²,

and its solution is computed as E ← S_{1/γ}(Y), where Y = X − M − Π/γ and S_τ(t) = sign(t) max(|t| − τ, 0) is the shrinkage operator [46] applied elementwise. The other problems based on the nonconvex penalty functions described in the experimental section can be solved efficiently for E using the work in [13].