The Strength of Nesterov’s Extrapolation

The extrapolation strategy introduced by Nesterov, which can accelerate the convergence rate of gradient descent methods by orders of magnitude for smooth convex objectives, has led to tremendous success in machine learning tasks. In this paper, we theoretically study its strength for the convergence of individual iterates in general nonsmooth convex optimization problems, which we name individual convergence. We prove that Nesterov's extrapolation is capable of making the individual convergence of projected subgradient methods optimal for general convex problems, which is currently a challenging problem in the machine learning community. In light of this consideration, a simple modification of the gradient operation suffices to achieve optimal individual convergence for strongly convex problems, which can be regarded as an interesting step toward the open question about SGD posed by Shamir [26]. Furthermore, the derived algorithms are extended to solve regularized nonsmooth learning problems in stochastic settings. They can serve as an alternative to the most basic SGD, especially for machine learning problems where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Typically, our method is applicable as an efficient tool for solving large-scale l1-regularized hinge-loss learning problems. Several experiments on real data demonstrate that the derived algorithms not only achieve optimal individual convergence rates but also guarantee better sparsity than the averaged solution.


I. INTRODUCTION
IN THIS article, we aim at regularized optimization problems arising in machine learning, where the objective function is the sum of two nonsmooth convex terms: one is the loss from the learning task and the other is a regularizer, such as the l1-norm, for promoting sparsity. For better understanding, we first focus on the constrained optimization problem

min f(w), s.t. w ∈ Q    (1)

where Q ⊆ R^N is a closed convex set and f is a convex function on Q.
A popular method to solve (1) is the projected subgradient (PSG) algorithm. When f is smooth, it was proven by Nemirovski and Yudin in the early 1980s [18] that no first-order method can converge at a rate faster than O(1/t²), where t is the number of iterations performed by the algorithm. However, PSG only has a suboptimal O(1/t) rate of convergence [30]. This created a gap between the guaranteed convergence rate of PSG and what could potentially be achieved. Since then, various efforts have been made to close this gap (see, for instance, [30] and the references therein). One well-known strategy is to perform an extrapolation operation, in which a momentum term involving previous iterates is added to the current iterate. This technique dates back to the pioneering work of Nesterov in 1983 [19]. Specifically, Nesterov showed that the extrapolation step with suitably chosen parameters can accelerate the rate of convergence from O(1/t) to O(1/t²); that is, this strategy makes PSG optimal among first-order methods that access only sequences of gradients [20]. So far, many papers have been devoted to extending Nesterov's accelerated method to stochastic settings for regularized learning problems, where the regularization structure (e.g., sparsity, low rank, etc.) is effectively exploited [4]-[6], [13], [15], [31], [34]. Nesterov's accelerated methods have also been widely adopted in training deep neural networks and have significantly increased their performance [14].
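As a concrete illustration of the basic method discussed above, the following sketch implements the projected subgradient iteration w_{t+1} = P_Q(w_t − a_t g_t). The l2-ball constraint, the toy l1-distance objective, and the 1/√(t+1) step sizes are assumptions made for this example, not the paper's experimental setup.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto the l2 ball {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def psg(grad, w0, steps, a=lambda t: 1.0 / np.sqrt(t + 1), proj=project_l2_ball):
    """Basic projected subgradient method: w_{t+1} = P_Q(w_t - a_t * g_t)."""
    w = w0.copy()
    for t in range(steps):
        w = proj(w - a(t) * grad(w))
    return w

# Toy usage: minimize the nonsmooth f(w) = ||w - c||_1 over the unit l2 ball.
c = np.array([2.0, -0.5])
grad = lambda w: np.sign(w - c)          # a subgradient of the l1 distance
w_hat = psg(grad, np.zeros(2), steps=2000)
```

Note that the returned value is the last individual iterate; as discussed below, for nonsmooth objectives the classical guarantees apply only to the average of the iterates.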
In contrast to the smooth case, there is no specific difficulty preventing almost all subgradient algorithms from achieving the optimal rate of O(1/√t). Despite various progress in convergence rates (see, for instance, [2] and the references therein), several fundamental questions, such as the attainable convergence rates, remain unsolved. Almost all optimal convergence rates in nonsmooth cases are obtained only by averaging all past iterates. However, as far as strongly convex objectives are concerned, simply averaging all past iterates is not enough, as it only yields a suboptimal rate of O(log t/t) as opposed to the expected optimal rate of O(1/t) [25]. Moreover, an example given in [24] shows that the convergence rate of stochastic gradient descent (SGD) with the averaged output is exactly Θ(log t/t) in nonsmooth and strongly convex cases. Due to these facts, an open question was posed in [26]: is the averaging scheme required to reach optimal convergence? During the last few years, there have been many interesting developments to make SGD optimal. One important idea is based on appropriately modifying the averaging scheme. By averaging only a suffix of the iterates [24] or by weighted averaging of all past solutions [17], the logarithmic factor in the suboptimality rate can be removed. While these theoretical results show that we should use improved averaging schemes, in practice people have mostly been using only the last iterate as the final solution [24], [25]. Intuitively, an individual output, directly decided by the gradient step, has the potential advantage of readily enforcing the regularization structure [6]. These facts suggest that we should study the convergence rate of the last individual iterate theoretically, which also accords with traditional optimization theory. In 2013, the first finite-sample bounds on individual iterates of SGD were established [27]. However, they only give a suboptimal O(log t/√t) rate for general nonsmooth convex objective functions and a suboptimal O(log t/t) rate in nonsmooth but strongly convex cases. So far, there still exists a gap between practical performance and theoretical analysis. In this article, we focus on the convergence of the last individual iterate of PSG in nonsmooth cases, which is referred to as individual convergence for simplicity.
Several significant efforts have been made to close the aforementioned gap by using other first-order gradient algorithms. In particular, Chen et al. [32] improved the regularized dual averaging (RDA) method to be self-adaptive and obtained uniformly optimal individual convergence for nonsmooth convex functions [6]. However, their optimal RDA is remarkably different from the regular RDA in computation; that is, the former requires two subgradient computations at each iteration rather than the single subgradient operation used in the latter. As a result, the subgradient operation becomes less intuitive. Recently, by incorporating the averaging scheme into DA directly, Nesterov et al. [22] succeeded in deriving an optimal individual convergence rate of O(1/√t) for nonsmooth convex problems. However, the strongly convex case was not discussed. It should be mentioned that the derivations of optimal individual convergence so far only focus on DA-like methods. Motivated by the averaging step in quasi-monotone DA [22], we recently presented a primal averaging (PA) strategy for PSG [29], in which the subgradient evaluation is imposed on the average of all past iterates. The PA strategy can accelerate the individual convergence of PSG to be optimal for nonsmooth convex problems. In light of this consideration, an optimal rate of individual convergence for strongly convex problems can also be obtained by further modifying the PA step.
Inspired by the great success of Nesterov's extrapolation step in smooth optimization, we study its convergence rate in nonsmooth optimization. In particular, we discover that Nesterov's extrapolation strategy is capable of accelerating the individual convergence of the fundamental PSG to be optimal for general convex problems without the additional assumption of smoothness. Further, a simple modification of the gradient operation suffices to achieve optimal individual convergence for strongly convex problems. These results indicate that the averaging scheme is not necessary for attaining optimal rates. Although whether an optimal individual rate of SGD can be attained is still unknown, we can simply achieve optimal individual convergence with the help of Nesterov's extrapolation, which might be regarded as an interesting step toward the open question about SGD posed by Shamir [26]. Whether or not a smoothness assumption holds, Nesterov's extrapolation consistently has the strength to provide a theoretical guarantee of optimal convergence. The derived algorithms in this article can serve as an alternative to the fundamental SGD, especially in the machine learning community, where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Moreover, the convergence analysis presented in this article reveals more insights about the similarities and differences between smooth and nonsmooth optimization methods, helping us understand how to correctly and provably extend smooth optimization methods to nonsmooth tasks.
It should be noted that the effect of Nesterov's accelerated strategy on nonsmooth problems has been investigated in [16] and [8]. Specifically, a modified mirror descent (MD) algorithm [called accelerated stochastic approximation (AC-SA)] was proven to attain optimal individual convergence for general convex problems [16], in which the same step-size policy as the second accelerated method [30] is employed. Unfortunately, as stated in [7], AC-SA assumes a priori knowledge of the number of iterations to be performed and uses step sizes based on this number. On the other hand, a fast gradient method (FGM), developed originally for smooth optimization in [21], was proven to be optimal for nonsmooth problems in terms of individual convergence [8]. However, FGM suffers from the same drawback as ORDA [6] in needing two gradient evaluations per iteration. Fortunately, our algorithms avoid the disadvantages of both AC-SA and FGM while keeping optimal individual convergence. The key to success lies in our direct use of Nesterov's extrapolation strategy and a suitable selection of the time-varying step-size parameters. Moreover, optimal individual convergence for the challenging strongly convex case can be further derived.
Based upon our convergence analysis for black-box problems, the derived algorithms are extended to regularized and stochastic settings for large-scale machine learning tasks. Unlike the quasi-monotone algorithm [22] or PA-PSG [29], the subgradient-like operation in our method follows the extrapolation evaluation, which brings significant benefits in keeping the regularization structure. Our method is applicable as an efficient tool for solving large-scale l1-regularized hinge-loss learning problems. It not only guarantees the best possible rate of individual convergence but also achieves better sparsity than the averaged solution. Several experiments confirm the correctness of our convergence analysis and illustrate the performance of our algorithms in keeping sparsity.
The rest of this article is organized as follows. Section II reviews the success of Nesterov's extrapolation in smooth optimization. Section III reviews some related first-order subgradient methods in nonsmooth cases. In Sections IV and V, the optimal individual convergence of PSG with Nesterov's extrapolation is proved for general convex and strongly convex problems, respectively. The optimal individual convergence results are extended to regularized and stochastic settings in Section VI. Comparison experiments are conducted in Section VII, and concluding remarks follow in the last section.

II. NESTEROV'S EXTRAPOLATION IN SMOOTH OPTIMIZATION
In this section, we briefly review the strength of Nesterov's extrapolation in smooth optimization. To this end, we assume that the objective function f in (1) is smooth; that is, there exists a constant L ≥ 0 such that

‖∇f(u) − ∇f(w)‖ ≤ L‖u − w‖, ∀u, w ∈ Q.

The iteration of PSG is

w_{t+1} = P(w_t − a_t∇f(w_t))    (2)

or equivalently

w_{t+1} = argmin_{w∈Q} { a_t⟨∇f(w_t), w⟩ + (1/2)‖w − w_t‖² }    (3)

where P is the projection operator onto Q [2], a_t is the step-size parameter, and ∇f(w_t) is the gradient of f at w_t. Let w* denote an optimal solution of problem (1). Specifically, when a_t = 1/L, it holds that [30]

f(w_t) − f(w*) ≤ O(L‖w_0 − w*‖²/t).

In a series of works (see also [30]), Nesterov proposed three methods for solving smooth constrained problems (1) that, at each iteration, use either one or two gradient steps together with extrapolation to accelerate convergence. In particular, the first accelerated method can be formulated as

y_t = w_t + θ_t(θ_{t−1}^{−1} − 1)(w_t − w_{t−1})
w_{t+1} = argmin_{w∈Q} { a_t⟨∇f(y_t), w⟩ + (1/2)‖w − y_t‖² }    (6)

where the parameters θ_t satisfy (1 − θ_{t+1})/θ_{t+1}² ≤ 1/θ_t². Note that this step-size choice of θ_{t+1} in (6) is given by Tseng [30]. It can be regarded as a generalization of the original rule in [19]. According to this generalized rule, one choice of θ_{t+1} is

θ_{t+1} = 2/(t + 2)

which is also employed in SAGE [13]. Another choice of θ_{t+1} is the positive root of θ_{t+1}² = (1 − θ_{t+1})θ_t². Thus y_t, on which the gradient operation is imposed, is an extrapolation from w_{t−1} to w_t. The step-size rule in (6) is referred to as Nesterov's extrapolation in this article.
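A minimal sketch of the first accelerated method (6) for a smooth objective may help fix ideas. The constant step size a_t = 1/L and the choice θ_t = 2/(t+2) follow the generalized rule above; the quadratic test problem and all names are assumptions made for this illustration.

```python
import numpy as np

def nesterov_accel(grad, w0, L, steps, proj=lambda w: w):
    """First accelerated method (6): a gradient step at the extrapolated point
    y_t = w_t + theta_t * (1/theta_{t-1} - 1) * (w_t - w_{t-1}),
    with theta_t = 2/(t+2) and constant step size a_t = 1/L."""
    w_prev = w0.copy()
    w = w0.copy()
    theta_prev = 1.0
    for t in range(steps):
        theta = 2.0 / (t + 2)
        y = w + theta * (1.0 / theta_prev - 1.0) * (w - w_prev)
        w_prev, w = w, proj(y - grad(y) / L)
        theta_prev = theta
    return w

# Toy usage: smooth quadratic f(w) = 0.5*||A w - b||^2; L = largest eig of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A.T @ (A @ w - b)
L = np.linalg.eigvalsh(A.T @ A).max()
w_hat = nesterov_accel(grad, np.zeros(2), L, steps=2000)
```

Here `proj` defaults to the identity, i.e., Q = R^N; passing a projection operator recovers the constrained form of (6).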
The second and third accelerated methods are also described in [30]. Although different extrapolation steps are used, all three methods are equivalent [30] when the Bregman divergence concerned is Euclidean. Without loss of generality, we mainly focus on the first accelerated method (6).
Let {w_t}_{t=1}^∞ be generated by the first accelerated method (6). For any w ∈ R^N, it holds that [30]

f(w_t) − f(w) ≤ O(L‖w_0 − w‖²/t²).

This bound means that the convergence rate of the accelerated method (6) is O(1/t²), which clearly exhibits the strength of Nesterov's extrapolation in obtaining optimal convergence. So far, stochastic Nesterov's accelerated methods have been employed to solve large-scale regularized learning problems [6], [13], [31]. In addition to accelerating the convergence in the case of a smooth loss with a nonsmooth regularizer, each accelerated method obtains an optimal rate of convergence in terms of the individual output, which is highly desirable in nonsmooth optimization.

III. PSG IN NONSMOOTH OPTIMIZATION
This section reviews several significant results about the convergence of PSG. We now assume that the objective function f in (1) is nonsmooth. To derive concrete convergence rates, we require the following.

Assumption 1: Let ∇f(w) denote any subgradient of f at w. Assume that there exists a number M > 0 such that ‖∇f(w)‖ ≤ M, ∀w ∈ Q.

With suitable a_t, an optimal rate of convergence can be derived for nonsmooth problems [22], that is,

f((1/t) Σ_{k=1}^t w_k) − f(w*) ≤ O(1/√t).

Let ĝ_t be an unbiased estimate of a subgradient of f at w_t. The PSG (3) can be simply extended to stochastic settings, in which the key operation is

w_{t+1} = P(w_t − a_t ĝ_t).    (8)

To get convergence rates in stochastic settings, we require the following.

Assumption 2: Let ĝ be an unbiased estimate of a subgradient of f at w. Assume that E‖ĝ‖² ≤ M².

Based on the regret bounds in [35], it is easy to see that the stochastic PSG (8) satisfies

E[f((1/t) Σ_{k=1}^t w_k) − f(w*)] ≤ O(1/√t).

While an optimal rate of convergence is easily attained, it is only in terms of the averaged output (1/t) Σ_{k=1}^t w_k. When f is strongly convex, even if the regret bounds in [11] and [12] are employed, we can only get

E[f((1/t) Σ_{k=1}^t w_k) − f(w*)] ≤ O(log t/t).

This fact indicates that even the averaged output (1/t) Σ_{k=1}^t w_k cannot lead to an optimal rate, due to a logarithmic factor.
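The stochastic PSG (8) with the averaged output can be sketched as follows. The toy expected-l1 objective and the 1/√(t+1) step sizes are assumptions for illustration; the point is that the O(1/√t) guarantee above applies to the running average, not the last iterate.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_psg_averaged(sgrad, w0, steps, proj, a=lambda t: 1.0 / np.sqrt(t + 1)):
    """Stochastic PSG (8), reported with the averaged output (1/t)*sum_k w_k,
    which is how the O(1/sqrt(t)) rate in the text is attained."""
    w = w0.copy()
    avg = np.zeros_like(w0)
    for t in range(steps):
        w = proj(w - a(t) * sgrad(w))
        avg += (w - avg) / (t + 1)      # running average of the iterates
    return avg

# Toy usage: minimize E|w - u| with u uniform on [-1, 1] over Q = [-2, 2];
# the minimizer is the median, w* = 0.
proj = lambda w: np.clip(w, -2.0, 2.0)
sgrad = lambda w: np.sign(w - rng.uniform(-1.0, 1.0))
w_bar = stochastic_psg_averaged(sgrad, np.array([2.0]), steps=20000, proj=proj)
```

Each call to `sgrad` draws a fresh sample, so its output is an unbiased estimate of a subgradient of the expected objective, matching Assumption 2.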
To obtain individual convergence without changing the original algorithms, a general technique was presented in [27] to convert results on averages of iterates into convergence guarantees on individual iterates. The resulting rates are so far the best for stochastic PSG in terms of the individual convergence of (3): O(log t/√t) for nonsmooth convex problems and O(log t/t) for strongly convex problems.
To get first-order gradient algorithms with the expected optimal individual convergence, there are some important works based on Nesterov's DA. Typical examples include optimal RDA [6] and quasi-monotone DA [22]. Specifically, the quasi-monotone subgradient method is formulated as in (10), where the additional averaging parameter γ_t specifies an interpolation step. With suitable a_t and γ_t, it holds that [22]

f(w_t) − f(w*) ≤ O(1/√t).

In order to make PSG optimal in terms of individual convergence for general convex problems, a stochastic PSG with a PA operation (PA-PSG) is given in [29] as (11), where ĝ_t is an unbiased estimate of a subgradient of f at w_t.
It is easy to see that the gradient operation in (11) is imposed at w_t, which is in fact a weighted average of all past iterates. When the objective in (1) is μ-strongly convex, stochastic PA-PSG [29] is modified as (12). With suitable a_t and γ_t [29], PA-PSG (12) attains

E[f(w_t) − f(w*)] ≤ O(1/t).

However, when dealing with regularized learning problems, we do not know whether the regularization structure can be guaranteed by employing (11). One of the main purposes of this article is to make PSG optimal in terms of individual convergence while keeping the regularization structure. The main idea is to employ Nesterov's extrapolation in nonsmooth optimization.

IV. NESTEROV'S EXTRAPOLATION IN NONSMOOTH OPTIMIZATION
In this section, we study the strength of Nesterov's extrapolation in improving the individual convergence of the basic PSG. Naturally, we consider

y_t = w_t + θ_t(θ_{t−1}^{−1} − 1)(w_t − w_{t−1})
w_{t+1} = argmin_{w∈Q} { a_t⟨∇f(y_t), w⟩ + (1/2)‖w − y_t‖² }    (13)

In contrast to regular PSG (3) and PA-PSG (11), we call (13) PSG with Nesterov's extrapolation, or Nesterov's PSG. The only difference between (6) and (13) lies in the selection of the step size. It may seem that the convergence analysis would be completely similar to that of the accelerated methods for smooth optimization in [30]. However, the convergence analysis in smooth optimization [30] heavily depends on the property

f(w_{t+1}) ≤ f(y_t) + ⟨∇f(y_t), w_{t+1} − y_t⟩ + (L/2)‖w_{t+1} − y_t‖²

which no longer holds for nonsmooth functions. To get convergence rates without the smoothness assumption, the key issue is how to deal with the terms caused by the time-varying step size and the additional terms caused by the nonsmooth objective function.
To conduct the convergence analysis, a few lemmas are required. We only give the main theorem here; all the lemmas (Lemmas 1-4) and their proof details are given in the Appendix.
Theorem 1: Let {w_t}_{t=1}^∞ be generated by (13). Let θ_t = 2/(t+1) and a_t = 1/(t+1)^{3/2}. For any w ∈ Q, it holds that

f(w_t) − f(w) ≤ O(1/√t).

Without modifying the regular step-size rule, PSG only has a suboptimal O(log t/√t) individual rate for nonsmooth convex objective functions [27]. By simply replacing the original diminishing step-size rule with Nesterov's extrapolation step, originally designed for smooth optimization, we prove that Nesterov's PSG (13) has optimal individual convergence for nonsmooth problems. Naturally, its stochastic version can be obtained by directly substituting the gradient ∇f(y_t) in (13) with its unbiased estimate ĝ_t. Such a substitution does not affect the optimal individual convergence. In fact, it is not difficult to see that Lemma 1 still holds with ∇f(y_t) replaced by ĝ_t and, by Assumptions 1 and 2 (see [7, proof of Th. 1, p. 9] and [16, eq. (58), p. 21]),

E⟨∇f(y_t) − ĝ_t, y_t − w⟩ = 0.

It then becomes clear that bounds similar to those in Lemma 2 hold for stochastic Nesterov's PSG. Using the same deduction as in Lemmas 3 and 4, we can prove that Nesterov's PSG (13) in the stochastic setting achieves an optimal individual convergence rate in expectation for nonsmooth convex problems.
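The algorithm and step sizes of Theorem 1 can be sketched as follows; the returned value is the last individual iterate, with no averaging. The unconstrained toy l1 objective is an assumption made for illustration.

```python
import numpy as np

def nesterov_psg(subgrad, w0, steps, proj=lambda w: w):
    """Nesterov's PSG (13): the same extrapolation as (6), but with the
    time-varying parameters of Theorem 1:
    theta_t = 2/(t+1) and a_t = 1/(t+1)^{3/2}."""
    w_prev = w0.copy()
    w = w0.copy()
    theta_prev = 1.0
    for t in range(1, steps + 1):
        theta = 2.0 / (t + 1)
        a = 1.0 / (t + 1) ** 1.5
        y = w + theta * (1.0 / theta_prev - 1.0) * (w - w_prev)
        w_prev, w = w, proj(y - a * subgrad(y))
        theta_prev = theta
    return w   # the last individual iterate, not an average

# Toy usage: nonsmooth f(w) = ||w||_1 over Q = R^2; the minimizer is 0.
subgrad = lambda w: np.sign(w)
w_hat = nesterov_psg(subgrad, np.array([1.0, -2.0]), steps=5000)
```

Replacing `subgrad` with a stochastic oracle gives the stochastic version discussed above, with no change to the parameters.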
It should be mentioned that Nesterov's accelerated methods have been applied to nonsmooth optimization [8], [16]. To make a careful comparison, we give some remarks here.
1) The key operation of the AC-SA algorithm in [16] is stated as the second accelerated method in [30] for smooth optimization. It was proven to attain an optimal individual convergence rate for nonsmooth optimization, but it requires β_t to depend on the number of iterations to be performed [16]. As stated in [7], this assumption is not desirable when we want to run a method for a given time rather than for a given number of iterations.
2) FGM was originally developed for smooth optimization [21]. As a byproduct, it was proven to be optimal for nonsmooth problems in terms of individual convergence [8]. However, at each iteration, there are two subgradient steps in FGM, which results in the same disadvantage as in ORDA [6]. In [30], FGM was improved to use only one projection instead of two, which is also called the third accelerated method. However, this accelerated method is limited to smooth optimization problems [30].
3) In contrast to FGM, our method uses time-varying step sizes that do not require knowing the number of iterations in advance. Based on the equivalence between the three accelerated methods [30], we can say that our method avoids the disadvantages of both AC-SA and FGM in coping with nonsmooth optimization. Moreover, optimal individual convergence in strongly convex cases can be further achieved, which will be discussed in the next section.
V. NESTEROV'S EXTRAPOLATION IN STRONGLY CONVEX CASES
In this section, we discuss the strongly convex case; that is, there exists a real number μ > 0 such that

f(u) ≥ f(w) + ⟨∇f(w), u − w⟩ + (μ/2)‖u − w‖², ∀u, w ∈ R^N

or equivalently [2]

f(θu + (1 − θ)w) ≤ θf(u) + (1 − θ)f(w) − (μ/2)θ(1 − θ)‖u − w‖², ∀u, w ∈ R^N, θ ∈ [0, 1].

To utilize the strong convexity of f, we modify the gradient operation in (13), which yields algorithm (15). It is easy to see that (13) is the special case of (15) with μ = 0. In the following, we assume μ > 0. Obviously, in the strongly convex case, the key to deriving optimal individual convergence lies in obtaining stronger results than the analysis in Section IV. Specifically, we give Lemmas 6 and 7 in the Appendix. Based on Lemmas 6 and 7, we have the following.
Theorem 2: Let {w_t}_{t=1}^∞ be generated by (15). Let a_t = 3/(μt²) and let θ_t be suitably chosen. For any w ∈ Q, it holds that

f(w_t) − f(w) ≤ O(1/t).

Without modifying the regular gradient operation and step-size rule, PSG only has a suboptimal O(log t/t) individual rate in nonsmooth but strongly convex cases [27]. By exploiting the strong convexity of the objective function together with Nesterov's extrapolation step, originally designed for smooth optimization, our PSG (15) achieves an optimal rate of individual convergence for strongly convex problems.
As in the general convex case, PSG with Nesterov's extrapolation (15) can also be extended to stochastic settings by substituting the gradient ∇f(y_t) in (15) with its unbiased estimate ĝ_t. Under Assumptions 1 and 2, we can obtain an optimal convergence bound similar to that of Theorem 2 in expectation; that is, by choosing identical θ_t and a_t, we can derive

E[f(w_t) − f(w*)] ≤ O(1/t).

VI. EXTENSION TO REGULARIZED LEARNING
In this section, we study regularized learning problems, that is,

min_{w∈Q} f(w) + r(w)    (17)

where Q ⊆ R^N is a closed convex set, r(w) : R^N → R is a simple convex function, and f is a nonsmooth convex loss. As with the regularization technique for accelerated PSG [30], MD [10], and DA [31], the regularization term should not be linearized. Naturally, our stochastic regularized subgradient (SRSG) method with Nesterov's extrapolation takes the form

w_{t+1} = argmin_{w∈Q} { a_t⟨ĝ_t, w⟩ + a_t r(w) + (1/2)‖w − y_t‖² }    (18)

where ĝ_t is an unbiased estimate of a subgradient of f at y_t. Similarly, when f is μ-strongly convex, SRSG with Nesterov's extrapolation can be formulated as in (19). To derive concrete convergence rates, we require the following assumptions.

Assumption 3: Assume that there exists a number M > 0 such that ‖∇f(w)‖ ≤ M, ∀w ∈ Q.

Assumption 4: Let ĝ be an unbiased estimate of a subgradient of f at w. Assume that E‖ĝ‖² ≤ M².

The regularized reformulation does not affect the optimal individual convergence derived in Theorems 1 and 2. In fact, it is not difficult to see that Lemma 1 still holds with ∇f(y_t) replaced by ĝ_t, where ĝ_t is an unbiased estimate of a subgradient of f at y_t, and that E⟨∇f(y_t) − ĝ_t, y_t − w⟩ = 0. Under Assumptions 3 and 4, bounds similar to those in Lemmas 2 and 7 hold for SRSG with Nesterov's extrapolation (18) and (19). Using the same deduction as in Sections IV and V, we can prove that SRSG with Nesterov's extrapolation (18) and (19) attains optimal individual convergence in expectation for nonsmooth convex and strongly convex problems, respectively; that is, we have the following.

Theorem 3: Let w_0 ∈ Q be an initial point and {w_t}_{t=1}^∞ be generated by SRSG with Nesterov's extrapolation (18). Let θ_t = 2/(t+1) and a_t = 1/(t+1)^{3/2}. We have

E[f(w_t) + r(w_t) − f(w*) − r(w*)] ≤ O(1/√t)

where w* is an optimal solution of problem (17).

Theorem 4: Assume μ > 0 and that f is μ-strongly convex. Let w_0 ∈ Q be an initial point and {w_t}_{t=1}^∞ be generated by SRSG with Nesterov's extrapolation (19). Let a_t = 3/(μt²) and let θ_t be suitably chosen.
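When r(w) = λ‖w‖₁ and Q = R^N, the subproblem in (18) has a closed-form soft-thresholding solution, which is what lets the individual iterate stay sparse. The following sketch makes this concrete; the single-sample hinge oracle and all parameter values are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of min_w tau*||w||_1 + 0.5*||w - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def srsg_l1(sgrad, w0, lam, steps):
    """SRSG with Nesterov's extrapolation (18) for r(w) = lam*||w||_1 and
    Q = R^N: the update has the closed form
    w_{t+1} = soft_threshold(y_t - a_t*g_t, a_t*lam),
    with the step sizes of Theorem 3."""
    w_prev = w0.copy(); w = w0.copy(); theta_prev = 1.0
    for t in range(1, steps + 1):
        theta = 2.0 / (t + 1)
        a = 1.0 / (t + 1) ** 1.5
        y = w + theta * (1.0 / theta_prev - 1.0) * (w - w_prev)
        w_prev, w = w, soft_threshold(y - a * sgrad(y), a * lam)
        theta_prev = theta
    return w   # individual output: zeros produced by thresholding survive

# Toy usage: hinge loss on one separable sample with a strong l1 penalty;
# coordinates the data never touches should remain exactly zero.
x, ylab = np.array([1.0, 0.0, 0.0]), 1.0
sgrad = lambda w: (-ylab * x) if ylab * (w @ x) < 1 else np.zeros_like(w)
w_hat = srsg_l1(sgrad, np.zeros(3), lam=0.5, steps=3000)
```

Because the final output is itself the solution of a thresholded subproblem, its zero pattern is not smeared out by averaging, in line with the discussion above.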
We have

E[f(w_t) + r(w_t) − f(w*) − r(w*)] ≤ O(1/t).

In [29], PA-like algorithms are extended to solve regularized nonsmooth loss optimization problems in stochastic settings. Specifically, PA-PSG (11) is extended as (22), where B is the Bregman divergence and ĝ_t is an unbiased estimate of a subgradient of f at w_t. When solving l1-regularized learning problems, each w_t^+ in (22) is derived by solving the optimization subproblem in closed form and thus has better sparsity. However, the individual solution w_{t+1} in (22) is a linear combination of w_k^+ (k = 1, 2, ..., t). As a result, the structure imposed by the l1-regularizer may not be guaranteed.
In contrast to PA-PSG (22), the individual solution of SRSG with Nesterov's extrapolation is directly generated by the closed-form solution of the optimization subproblem. The sparsity induced by the l1-regularizer is thus effectively preserved.

VII. EXPERIMENTS
In this section, we conduct experiments to verify our theoretical analysis and illustrate the performance of our algorithms in keeping sparsity. To clearly illustrate their advantage over other strategies, we focus on comparison experiments on PSG with different step-size rules in terms of convergence and sparsity (i.e., the percentage of nonzero entries [34]). The four benchmark data sets in Table I are considered; one data set is available at http://www.causality.inf.ethz.ch/data/SIDO.html and the other three are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Let S = {(x_1, y_1), ..., (x_m, y_m)} be a training set, where x_i ∈ R^N and y_i ∈ {+1, −1}, and let f_i(w) be the convex nonsmooth loss induced by (x_i, y_i). At the t-th step of the stochastic algorithms, the stochastic subgradient ĝ_t is evaluated on a single sample (x_t, y_t) chosen uniformly at random from S. To make a fair comparison, we independently repeated each stochastic algorithm ten times and report the results averaged over the ten trials.
For the nonsmooth case, we consider l1-regularized hinge-loss learning problems, that is,

min_w (1/m) Σ_{i=1}^m f_i(w) + λ‖w‖_1    (23)

where f_i(w) = max{0, 1 − y_i⟨w, x_i⟩}. It is well known that the tradeoff parameter λ in (23) influences both convergence and sparsity. In this experiment, we compare our SRSG (18) with stochastic COMID [10], PA-PSG (22), and SRSG (13), whose only difference lies in the step-size rules. The parameters θ_t and a_t of SRSG (13) are selected according to Theorem 3. The convergence of the averaged solution of COMID and of the individual solutions of PA-PSG (22) and SRSG (18) on the four data sets (Table I) is illustrated in Figs. 1-4, respectively. All of these are proved to attain optimal rates. From these figures, it can be observed that they exhibit almost the same behavior.
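A toy version of this experiment can be sketched as follows: SRSG-style updates on a synthetic l1-regularized hinge problem, tracking both the individual iterate and, for contrast, the averaged solution. The synthetic data, λ, and iteration budget are assumptions; behavior on the actual Table I data sets may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic l1-regularized hinge problem (23): 2 informative + 18 noise features.
m, n = 200, 20
X = rng.normal(size=(m, n))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
lam = 0.05

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

w_prev = np.zeros(n); w = np.zeros(n); theta_prev = 1.0
avg = np.zeros(n)
for t in range(1, 20001):
    theta, a = 2.0 / (t + 1), 1.0 / (t + 1) ** 1.5
    yt = w + theta * (1.0 / theta_prev - 1.0) * (w - w_prev)
    i = rng.integers(m)                               # uniform sample (x_t, y_t)
    g = -y[i] * X[i] if y[i] * (yt @ X[i]) < 1 else np.zeros(n)
    w_prev, w = w, soft(yt - a * g, a * lam)          # closed-form l1 prox step
    theta_prev = theta
    avg += (w - avg) / t                              # averaged solution, for contrast

sparsity_last = np.mean(w != 0)                       # fraction of nonzero entries
sparsity_avg = np.mean(avg != 0)
```

Since the average stays nonzero in every coordinate that was ever touched, its support can only grow, whereas the individual iterate's support is re-thresholded at every step; how pronounced the gap is depends on λ and the data.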
The sparsity of the individual outputs of stochastic SRSG (18) and PA-PSG (22), and of the averaged solution of stochastic COMID [10], on the four data sets (Table I) is illustrated in Figs. 5-8, respectively. It can be seen that the individual solution of our stochastic SRSG (18) consistently performs better, while the individual output of PA-PSG (22) achieves almost the same sparsity as the averaged solution. Thus, for regularized nonsmooth problems, we conclude that Nesterov's extrapolation enables PSG to achieve better sparsity than the other strategies while attaining optimal individual convergence.
For the nonsmooth but strongly convex case, we consider the standard support vector machine (SVM) problem, that is,

min_w (λ/2)‖w‖² + (1/m) Σ_{i=1}^m max{0, 1 − y_i⟨w, x_i⟩}

where λ is a tradeoff parameter. The goal of this experiment is to verify the theoretical analysis in Section V. Note that, for strongly convex problems with constraints, one state-of-the-art algorithm is Pegasos, whose high performance has been sufficiently illustrated in [25]. Thus, we compare our Nesterov-PSG (15) with Pegasos (a_t = 1/(λt) in (8)) in terms of convergence. We also compare our Nesterov-PSG (15) with PA-PSG (12), from which it differs only in the step-size rule. For all four data sets, λ = 1/n (n is the number of training samples) and Q is determined by using the strong duality theorem [25]. The parameters θ_t and a_t are selected according to Theorem 2. The individual convergence of Pegasos, PA-PSG (12), and our stochastic PSG (15) on the four data sets (Table I) is illustrated in Figs. 9-12, respectively. From these figures, it can be observed that the three algorithms have almost the same convergence behavior. However, it should be pointed out that a theoretical guarantee for the optimal individual convergence of Pegasos still remains unsolved [26].
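For reference, the Pegasos baseline used in this comparison can be sketched as follows, with the step size a_t = 1/(λt) and a projection onto a ball of radius 1/√λ as in [25]. The synthetic data and iteration budget are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy SVM instance for the strongly convex experiment:
# f(w) = (lam/2)*||w||^2 + (1/m) * sum_i hinge_i(w).
m, n = 200, 5
X = rng.normal(size=(m, n))
y = np.sign(X[:, 0])
lam = 1.0 / m

def pegasos(steps):
    """Pegasos: stochastic PSG (8) on the strongly convex SVM objective,
    using the step size a_t = 1/(lam*t) from [25]."""
    w = np.zeros(n)
    r = 1.0 / np.sqrt(lam)                 # projection radius used by Pegasos
    for t in range(1, steps + 1):
        i = rng.integers(m)
        g = lam * w + (-y[i] * X[i] if y[i] * (w @ X[i]) < 1 else 0.0)
        w -= g / (lam * t)
        norm = np.linalg.norm(w)
        if norm > r:
            w = (r / norm) * w
    return w                               # last individual iterate

w_hat = pegasos(20000)
```

Swapping the 1/(λt) schedule for the a_t and θ_t of Theorem 2 turns this loop into the stochastic Nesterov-PSG (15) used in the comparison.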
Thus, for strongly convex problems, we conclude that our Nesterov-PSG (15) achieves optimal individual convergence.

VIII. CONCLUSION
In the domain of smooth optimization, it is well known that Nesterov's extrapolation strategy can accelerate the convergence rate of gradient-like methods by orders of magnitude. However, as far as we know, its strength in nonsmooth optimization had not been fully investigated. In this article, we prove that Nesterov's extrapolation succeeds in leading to the optimal individual convergence of PSG for nonsmooth convex and strongly convex problems. These results can be regarded as an interesting step toward the open question about SGD posed by Shamir. Besides, the final solution of our algorithm is directly generated by the gradient-like operation, which succeeds in keeping sparsity when dealing with l1-regularized learning problems.
This article only focuses on the basic optimization method PSG. Based on its equivalent reformulation, Nesterov's extrapolation can be employed in MD to achieve optimal individual convergence. It is also interesting to ask whether Nesterov's extrapolation can be extended to other optimization methods, such as the alternating direction method of multipliers (ADMM) [28] and preconditioned SGD [23]; applications like that in [32] are also expected. All these issues will be considered in our future work.

APPENDIX

Lemma 1: Let {w_t}_{t=1}^∞ and {y_t}_{t=1}^∞ be generated by (13). For any w ∈ Q, we have

a_t⟨∇f(y_t), w_{t+1} − w⟩ ≤ (1/2)‖w − y_t‖² − (1/2)‖w − w_{t+1}‖² − (1/2)‖w_{t+1} − y_t‖².

Proof: Note that (13) is equivalent to

w_{t+1} = argmin_{w∈Q} { a_t⟨∇f(y_t), w⟩ + (1/2)‖w − y_t‖² }.

Note that the objective function a_t⟨∇f(y_t), w⟩ + (1/2)‖w − y_t‖² is 1-strongly convex, so the optimality of w_{t+1} yields

a_t⟨∇f(y_t), w_{t+1}⟩ + (1/2)‖w_{t+1} − y_t‖² ≤ a_t⟨∇f(y_t), w⟩ + (1/2)‖w − y_t‖² − (1/2)‖w − w_{t+1}‖².

Thus, Lemma 1 is proven.
In contrast to the proof in smooth optimization [30], −(1/2)‖y_t − w_{t+1}‖² + a_t⟨∇f(y_t), y_t − w_{t+1}⟩ is an extra term caused by the nonsmooth objective function. Using the technique in [10], we can bound these two terms by (1/2)a_t²M² via the Fenchel-Young inequality, that is,

a_t⟨∇f(y_t), y_t − w_{t+1}⟩ − (1/2)‖y_t − w_{t+1}‖² ≤ (1/2)a_t²‖∇f(y_t)‖² ≤ (1/2)a_t²M².

Lemma 2: Let {w_t}_{t=1}^∞ be generated by (13). For any w ∈ Q, we have

a_t(f(w_{t+1}) − f(w)) ≤ (1/2)‖w − y_t‖² − (1/2)‖w − w_{t+1}‖² + (3/2)a_t²M²

where M is defined in Assumption 1.
Proof: According to Lemma 1, the convexity of f, and the Fenchel-Young bound above,

a_t(f(y_t) − f(w)) ≤ a_t⟨∇f(y_t), y_t − w⟩ ≤ (1/2)‖w − y_t‖² − (1/2)‖w − w_{t+1}‖² + (1/2)a_t²M².

From Assumption 1, f is Lipschitz with constant M. Then we can establish the relationship between f(w_{t+1}) and f(y_t), that is,

f(w_{t+1}) ≤ f(y_t) + M‖w_{t+1} − y_t‖ ≤ f(y_t) + a_t M².

Thus, Lemma 2 is proven. Note that (1/2)‖w − y_t‖² − (1/2)‖w − w_{t+1}‖² is not convenient for further recursion; we can use a technique similar to that in smooth optimization [30], that is, Lemma 3.

Lemma 3: Let {w_t}_{t=1}^∞ be generated by (13) and define z_t = w_{t−1} + θ_{t−1}^{−1}(w_t − w_{t−1}). For any w ∈ Q, we have

y_t = (1 − θ_t)w_t + θ_t z_t and w_{t+1} = (1 − θ_t)w_t + θ_t z_{t+1}.

Proof: According to the definition of z_t and (13), substituting the extrapolation step of (13) into y_t = (1 − θ_t)w_t + θ_t z_t verifies the first identity, and the definition of z_{t+1} gives the second. Therefore, Lemma 3 is proven.
In smooth optimization, the constant step size directly enables us to recursively handle ‖w − z_k‖² − ‖w − z_{k+1}‖². However, a_k is time-varying in our nonsmooth setting. Fortunately, we can employ the technique in [35] to cope with this issue.

Lemma 4: Let {w_t}_{t=1}^∞ be generated by (13). For any w ∈ Q, the bounds (29) and (30) hold.

Proof: Using Lemma 3, dividing both sides by a_tθ_t², using (13), and recursively applying the resulting inequality, we obtain the claimed bounds. Letting w = w* and noting that θ_0 = 1 and f(w_{t+1}) ≥ f(w*), Lemma 4 is proven.
Proof of Theorem 1: Using (30) in Lemma 4, a simple deduction shows that there exists a positive number M_0 > 0 such that ‖w* − z_{t+1}‖² ≤ M_0, ∀t > 0.
Since the sequence {‖w* − z_{t+1}‖} is bounded, the claimed rate follows from (29). Thus, Theorem 1 is proven.