Spectral Projected Gradient methods: Review and Perspectives

Over the last two decades, it has been observed that using the gradient vector as a search direction in large-scale optimization may lead to efficient algorithms. The effectiveness relies on choosing the step lengths according to novel ideas that are related to the spectrum of the underlying local Hessian rather than to the standard decrease in the objective function. A review of these so-called spectral projected gradient methods for convex constrained optimization is presented. To illustrate the performance of these low-cost schemes, an optimization problem on the set of positive definite matrices is described.


Introduction
In 1988, a pioneering paper by Barzilai and Borwein [9] proposed a gradient method for the unconstrained minimization of a differentiable function f : R^n → R that uses a novel and nonstandard strategy for choosing the step length. Starting from a given x_0 ∈ R^n, the Barzilai-Borwein (BB) iteration is given by

    x_{k+1} = x_k − λ_k ∇f(x_k),   (1)

where the initial step length λ_0 > 0 is arbitrary and, for all k = 1, 2, ...,

    λ_k = (s_{k−1}^T s_{k−1}) / (s_{k−1}^T y_{k−1}),   (2)

where s_{k−1} = x_k − x_{k−1} and y_{k−1} = ∇f(x_k) − ∇f(x_{k−1}). When f(x) = (1/2) x^T A x + b^T x + c is a quadratic function and A is a symmetric positive definite (SPD) matrix, the step length (2) becomes

    λ_k = (∇f(x_{k−1})^T ∇f(x_{k−1})) / (∇f(x_{k−1})^T A ∇f(x_{k−1})).   (3)

Curiously, the step length (3) used in the BB method for defining x_{k+1} is the one used in the optimal Cauchy steepest descent method [32] for defining the step of the previous iteration. Therefore, the BB method computes, at each iteration, the step that minimizes the quadratic objective function along the negative gradient direction but, instead of using this step at the current iteration, saves it to be used in the next one. The main result in the paper by Barzilai and Borwein [9] is the surprising fact that, for two-dimensional strictly convex quadratic functions, the BB method converges R-superlinearly. This fact raised the question of whether a gradient method with superlinear convergence exists in the general case. In 1990, the possibility of obtaining superlinear convergence for arbitrary n was discarded by Fletcher [51], who also conjectured that, in general, only R-linear convergence should be expected. In 1993, Raydan [83] established global convergence of the BB method for the strictly convex quadratic case with any number of variables. The nonstandard analysis presented in [83] was later refined by Dai and Liao [38] to prove the expected R-linear convergence result, and it was extended by Friedlander et al. [54] to consider a wide variety of delayed choices of step length for the gradient method; see also [85].
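The BB iteration (1)-(2) is simple enough to sketch in a few lines. The following is our own illustrative Python/NumPy sketch (not the paper's code; the test problem and function names are ours), applying the method to a small strictly convex quadratic:

```python
import numpy as np

def barzilai_borwein(grad, x0, lam0=1.0, tol=1e-8, max_iter=500):
    """BB gradient method: x_{k+1} = x_k - lam_k * grad(x_k), where
    lam_{k+1} = s_k^T s_k / s_k^T y_k (no line search; nonmonotone)."""
    x, lam = np.asarray(x0, dtype=float), lam0
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - lam * g
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sty = s @ y
        # safeguard: fall back to lam0 if the curvature estimate fails
        lam = (s @ s) / sty if sty > 0 else lam0
        x, g = x_new, g_new
    return x

# Quadratic f(x) = 1/2 x^T A x + b^T x with SPD A; minimizer solves A x = -b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
x_star = barzilai_borwein(lambda x: A @ x + b, np.zeros(2))
```

On two-dimensional problems like this one the iterates reach the solution very quickly, in line with the R-superlinear result of [9], even though f(x_k) need not decrease at every step.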
For the minimization of general (not necessarily quadratic) functions, a bizarre behavior seemed to discourage the application of the BB method: the sequence of functional values f(x_k) did not decrease monotonically and, sometimes, monotonicity was severely violated. Nevertheless, it was experimentally observed (see, e.g., [51,55]) that the potential effectiveness of the BB method was related to the relationship between the step lengths and the eigenvalues of the Hessian rather than to the decrease of the function value. It was also observed (see, e.g., [78]) that, with the exception of very special cases, the BB method was not competitive with the classical Hestenes-Stiefel [64] Conjugate Gradient (CG) algorithm when applied to strictly convex quadratic functions. On the other hand, starting with the work by Grippo, Lampariello, and Lucidi [59], nonmonotone strategies for function minimization began to become popular when combined with Newton-type schemes. These strategies made it possible to define globally convergent algorithms without monotone decrease requirements. The philosophy behind nonmonotone strategies is that, frequently, the first trial point chosen by a minimization algorithm carries a lot of wisdom about the problem structure, and that such knowledge can be destroyed by imposing a decrease of the objective function.
The conditions were thus given for implementing the BB method for general unconstrained minimization with the help of a nonmonotone strategy. In 1997, Raydan [84] developed a globally convergent method using the Grippo-Lampariello-Lucidi (GLL) strategy [59] and the BB iteration given by (1) and (2). He exhibited numerical experiments showing that, perhaps surprisingly, the method was more efficient than classical conjugate gradient methods for minimizing general functions. These nice comparative numerical results were possible because, although the Conjugate Gradient method of Hestenes and Stiefel remained the method of choice for solving strictly convex quadratic problems, its efficiency is hardly inherited by its generalizations for minimizing general functions. Therefore, there was ample room for variations and extensions of the original BB method.
The Spectral Projected Gradient (SPG) method [24,25,26], for solving convex constrained problems, was born from the marriage of the globalized Barzilai-Borwein (spectral) nonmonotone scheme [84] with classical projected gradient strategies [15,56,74]. In Section 2, we review the basic idea of the SPG method, list some of its most important properties, and describe the usage of an R implementation of the SPG method available on the TANGO Project web page [101]. In Section 3, we illustrate the properties of the method by solving an optimization problem on the convex set of positive definite matrices. In Section 4, we briefly describe some of the most relevant applications and extensions of the SPG method. Finally, in Section 5, we present some conclusions.

Spectral Projected Gradient (SPG) method
Quasi-Newton secant methods for unconstrained optimization [44,45] obey the recursive formula

    x_{k+1} = x_k − B_k^{-1} ∇f(x_k),   (4)

where the sequence of matrices {B_k} satisfies the secant equation

    B_{k+1} s_k = y_k,   (5)

with s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). It can be shown that, at the trial point x_k − B_k^{-1} ∇f(x_k), the affine approximation of ∇f(x) that interpolates the gradient at x_k and x_{k−1} vanishes for all k ≥ 1.
Now assume that we are looking for a matrix B_{k+1} with a very simple structure that satisfies (5). More precisely, if we impose that B_{k+1} = σ_{k+1} I, with σ_{k+1} ∈ R, then equation (5) becomes

    σ_{k+1} s_k = y_k.

In general, this equation has no solution. However, accepting the least-squares solution that minimizes ‖σ s_k − y_k‖_2^2, we obtain

    σ_{k+1} = (s_k^T y_k) / (s_k^T s_k),   (6)

i.e., σ_{k+1} = 1/λ_{k+1}, where λ_{k+1} is the BB choice of step length given by (2). Namely, the method for unconstrained minimization is of the form x_{k+1} = x_k + α_k d_k, where d_k = −λ_k ∇f(x_k) at each iteration, λ_k = 1/σ_k, and formula (6) is used to generate the coefficients σ_k, provided that they are bounded away from zero and that they are not very large. In other words, the method uses safeguards 0 < λ_min ≤ λ_max < ∞ and defines, at each iteration,

    λ_k = min{λ_max, max{λ_min, 1/σ_k}}.   (7)

By the Mean-Value Theorem of integral calculus, one has

    y_k = ( ∫_0^1 ∇²f(x_k + t s_k) dt ) s_k.

Therefore, (6) defines a Rayleigh quotient relative to the average Hessian matrix ∫_0^1 ∇²f(x_k + t s_k) dt. This coefficient lies between the minimum and the maximum eigenvalue of the average Hessian, which motivates the denomination of spectral method [24].
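Both the least-squares computation behind (6) and its Rayleigh-quotient interpretation can be checked numerically. The following is our own small Python/NumPy sketch (the example data are ours) for a quadratic model, where the average Hessian is simply the constant Hessian A:

```python
import numpy as np

rng = np.random.default_rng(0)
# Quadratic model f(x) = 1/2 x^T A x, so the (average) Hessian equals A.
Q = rng.standard_normal((4, 4))
A = Q @ Q.T + 4.0 * np.eye(4)            # SPD Hessian
x_prev, x_curr = rng.standard_normal(4), rng.standard_normal(4)
s = x_curr - x_prev
y = A @ x_curr - A @ x_prev              # gradient difference = A s

# Closed-form minimizer of ||sigma*s - y||_2^2 over the scalar sigma:
sigma = (s @ y) / (s @ s)

# Compare with brute-force 1-D minimization over a fine grid:
grid = np.linspace(sigma - 5.0, sigma + 5.0, 20001)
sigma_grid = grid[np.argmin([np.sum((t * s - y) ** 2) for t in grid])]

# sigma is a Rayleigh quotient s^T A s / s^T s, so it lies in [eig_min, eig_max]:
eigs = np.linalg.eigvalsh(A)
```

The closed-form sigma agrees with the brute-force minimizer, and it falls between the extreme eigenvalues of A, as the Rayleigh-quotient argument predicts.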
Writing the secant equation as H_{k+1} y_k = s_k, which is also standard in the quasi-Newton tradition, we arrive at a different spectral coefficient: (y_k^T y_k)/(s_k^T y_k); see [9,83]. Both this dual and the primal (6) spectral choices of step length produce fast and effective nonmonotone gradient methods for large-scale unconstrained optimization [52,54,84]. Fletcher [52] presents some experimental considerations about the relationship between the nonmonotonicity of BB-type methods and their surprising computational performance, pointing out that the effectiveness of the approach is related to the eigenvalues of the Hessian rather than to the decrease of the function value; see also [42,55,85]. A deeper analysis of the asymptotic behavior of BB and related methods is presented in [39]. The behavior of BB methods has also been analyzed using chaotic dynamical systems [47]. Moreover, in the quadratic case, several spectral step lengths can be interpreted by means of a simple geometric object: the Bézier parabola [14]. All of these interesting theoretical and experimental observations concerning the behavior of BB methods for unconstrained optimization justify the interest in designing effective spectral gradient methods for constrained optimization.
The SPG method [24,25,26] is the spectral option for solving convex constrained optimization problems. As its unconstrained counterpart [84], the SPG method has the form

    x_{k+1} = x_k + α_k d_k,

where the search direction d_k has been defined in [24] as

    d_k = P_Ω(x_k − λ_k ∇f(x_k)) − x_k,

where P_Ω denotes the Euclidean projection onto the closed and convex set Ω and λ_k is the spectral choice of step length (2). A related method with approximate projections has been defined in [26].
It is well known that d_k is a descent direction whenever x_k is not a stationary point, i.e., f(x_k + α d_k) < f(x_k) for α small enough. This means that, in principle, one could define convergent methods by imposing sufficient decrease at every iteration. However, as in the unconstrained case, this leads to very inefficient practical results. A key feature is to accept the initial BB-type step length as frequently as possible while simultaneously guaranteeing global convergence. For this reason, the SPG method employs a nonmonotone line search that does not impose functional decrease at every iteration. In [24,25,26] the nonmonotone GLL [59] search is used (see Algorithm 2.2 below). The global convergence of the SPG method, and of some related extensions, can be found in [26].
The nonmonotone sufficient decrease criterion used in the SPG method depends on an integer parameter M ≥ 1 and imposes a functional decrease every M iterations (if M = 1 then the GLL line search reduces to a monotone line search). The line search is based on a safeguarded quadratic interpolation and aims to satisfy an Armijo-type criterion with a sufficient decrease parameter γ ∈ (0, 1). Algorithm 2.1 describes the SPG method in detail, while Algorithm 2.2 describes the line search. More details can be found in [24,25].
Algorithm 2.1: Spectral Projected Gradient (SPG) method

Step 1. Stopping criterion
If ‖P_Ω(x_k − ∇f(x_k)) − x_k‖_∞ ≤ ε, stop and declare x_k an approximate stationary point.

Step 2. Iterate
Compute the search direction d_k = P_Ω(x_k − λ_k ∇f(x_k)) − x_k, compute the step length α_k using Algorithm 2.2 (with parameters γ, M, σ_1, and σ_2), and set x_{k+1} = x_k + α_k d_k.

Step 3. Compute the spectral step length
Compute s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). If s_k^T y_k ≤ 0, set λ_{k+1} = λ_max; otherwise, set λ_{k+1} = min{λ_max, max{λ_min, (s_k^T s_k)/(s_k^T y_k)}}.

Algorithm 2.2: Nonmonotone line search
Compute f_max = max{ f(x_{k−j}) : 0 ≤ j ≤ min{k, M − 1} } and set α ← 1.

Step 1. Test the nonmonotone GLL criterion
If f(x_k + α d_k) ≤ f_max + γ α ∇f(x_k)^T d_k, set α_k ← α and stop.

Step 2. Compute a safeguarded new trial step length
Compute α_new ∈ [σ_1 α, σ_2 α] (using, e.g., safeguarded quadratic interpolation), set α ← α_new, and go to Step 1.

In practice, it is usual to set λ_min = 10^−30, λ_max = 10^30, σ_1 = 0.1, σ_2 = 0.9, and γ = 10^−4. A typical value for the nonmonotone parameter is M = 10, but the performance of the method may vary with this parameter, and a fine tuning may be adequate for specific applications. An alternative nonmonotone Armijo-like criterion that may be less sensitive to its parameter setting was introduced in [96]. Other nonmonotone techniques were also introduced in [41,60,61,73,90]. The parameter λ_0 ∈ [λ_min, λ_max] is arbitrary. In [24,25] it was set to λ_0 = 1/‖P_Ω(x_0 − ∇f(x_0)) − x_0‖_∞, assuming that x_0 is such that P_Ω(x_0 − ∇f(x_0)) ≠ x_0. In [26], at the expense of an extra gradient evaluation, it was set to

    λ_0 = min{λ_max, max{λ_min, (s̄^T s̄)/(s̄^T ȳ)}} if s̄^T ȳ > 0, and λ_0 = 1 otherwise,   (8)

where s̄ = x̄ − x_0, ȳ = ∇f(x̄) − ∇f(x_0), x̄ = x_0 − α_small ∇f(x_0), α_small = max{ε_rel ‖x_0‖_∞, ε_abs}, and ε_rel and ε_abs are relative and absolute small values related to the machine precision, respectively.
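Putting Algorithms 2.1 and 2.2 together, the whole method fits in a short routine. The following is our own Python/NumPy sketch (the paper's implementations are in R and Fortran 77; here the safeguarded quadratic interpolation of Step 2 is simplified to a fixed reduction inside [σ_1 α, σ_2 α], and the test problem is ours), applied to a box-constrained quadratic:

```python
import numpy as np

def spg(f, grad, proj, x0, gamma=1e-4, M=10, lam_min=1e-30, lam_max=1e30,
        sigma1=0.1, sigma2=0.9, eps=1e-6, max_iter=1000):
    """Sketch of SPG (Algorithms 2.1 and 2.2) with GLL nonmonotone search."""
    x = proj(np.asarray(x0, dtype=float))
    g = grad(x)
    lam = 1.0
    f_hist = [f(x)]
    for _ in range(max_iter):
        if np.linalg.norm(proj(x - g) - x, np.inf) <= eps:     # Step 1
            break
        d = proj(x - lam * g) - x                              # Step 2
        f_max = max(f_hist[-M:])
        alpha, gtd = 1.0, g @ d
        while f(x + alpha * d) > f_max + gamma * alpha * gtd:  # Algorithm 2.2
            alpha *= 0.5 * (sigma1 + sigma2)   # simplified safeguarded step
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g                            # Step 3
        sty = s @ y
        lam = lam_max if sty <= 0 else min(lam_max, max(lam_min, (s @ s) / sty))
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x

# Box-constrained quadratic: min 1/2 x^T A x + b^T x  s.t.  0 <= x <= 1.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-8.0, 1.0])
sol = spg(lambda x: 0.5 * x @ A @ x + b @ x, lambda x: A @ x + b,
          lambda x: np.clip(x, 0.0, 1.0), np.zeros(2))
```

For this example the unconstrained minimizer lies outside the box, and the method stops at the constrained solution (1, 0), where the projected gradient vanishes.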
It is easy to see that SPG needs only 3n + O(1) double precision positions, but one additional vector may be used to store the iterate with the best functional value obtained by the method. This is almost always the last iterate, but exceptions are possible. C/C++, Fortran 77 (including an interface with the CUTEr [58] test set), Matlab, and Octave subroutines implementing the Spectral Projected Gradient method (Algorithms 2.1 and 2.2) are available on the TANGO Project web page [101]. A Fortran 77 implementation is also available as Algorithm 813 of the ACM Collected Algorithms [99].
An R implementation of SPG was developed to illustrate the usage of the SPG method in the present work. Recall that the method's purpose is to seek the least value of a function f of potentially many variables over a closed convex set Ω onto which we know how to project. The objective function and its gradient must be coded by the user in subroutines named evalf and evalg, respectively. Similarly, the user must provide a subroutine, named proj, that computes the projection of an arbitrary vector x onto the feasible set. The way of calling SPG should be clear from the specific example described in Section 3.

A matrix problem on the set of positive definite matrices
Optimization problems on the space of matrices, restricted to the convex set of positive definite matrices, arise in various applications, such as statistics, financial mathematics, model updating, and, in general, matrix least-squares settings; see, e.g., [30,48,50,66,67,95].
To illustrate the use of the SPG method, we now describe a classification scheme that can be written as an optimization problem on the convex set of positive definite matrices. Given a training set of labelled examples (z_i, w_i), i = 1, ..., m, with z_i ∈ R^q and w_i ∈ {1, −1}, we seek an ellipsoid, parameterized by a q × q matrix A and a vector b ∈ R^q, that separates the points labelled w_i = 1 (inside) from those labelled w_i = −1 (outside), by minimizing a measure of the misclassification over (A, b), subject to symmetry and positive definiteness of the q × q matrix A. Moreover, to impose closed constraints on the semi-axes of the ellipsoid, we define a lower and an upper bound on the eigenvalues λ_i(A), 1 ≤ i ≤ q. Given a square matrix A, its projection onto this closed and convex set can be computed in two steps [48,65]. In the first step, one symmetrizes A; in the second step, one computes the spectral decomposition QDQ^T of the symmetrized matrix and replaces its eigenvalues by their projections onto the interval [λ̄_min, λ̄_max].
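The two-step projection just described can be sketched compactly. The following is our own Python/NumPy illustration (the paper's implementation uses R and LAPACK's DSYEVR; function name and example matrix are ours):

```python
import numpy as np

def proj_spd(A, lam_lo=1e-4, lam_hi=1e4):
    """Project a square matrix onto the symmetric matrices whose
    eigenvalues lie in [lam_lo, lam_hi]: symmetrize, then clip the spectrum."""
    S = 0.5 * (A + A.T)                    # step 1: symmetrization
    d, Q = np.linalg.eigh(S)               # step 2: S = Q diag(d) Q^T
    return Q @ np.diag(np.clip(d, lam_lo, lam_hi)) @ Q.T

B = np.array([[0.0, 3.0], [1.0, -5.0]])    # neither symmetric nor PD
P = proj_spd(B)
```

The result is symmetric with all eigenvalues in the prescribed interval, and matrices already in the set are left (numerically) unchanged.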
The number of variables n of the problem is n = q(q + 1), corresponding to the q² elements of matrix A ∈ R^{q×q} plus the q elements of b ∈ R^q. The array of variables x corresponds to the columnwise representation of A followed by the elements of b. The prototypes of the subroutines that define the problem are of the form

evalf <- function(n, x) { [...] list(f = ..., flag = ...) }
evalg <- function(n, x) { [...] list(g = ..., flag = ...) }
proj  <- function(n, x) { [...] list(p = ..., flag = ...) }

In subroutines evalf, evalg, and proj, n and x are input parameters. Parameter n is an integer value that corresponds to the number of variables n, and x is an n-dimensional double precision array containing the point at which the objective function or the gradient must be evaluated, or that must be projected onto the feasible convex set. Subroutine evalf must compute the value of the objective function f at x, i.e., it must return the double precision output parameter f = f(x). Subroutine evalg must compute the gradient of the objective function at x, i.e., it must return the double precision n-dimensional output parameter g = ∇f(x). Subroutine proj must compute the projection of x onto the feasible set Ω, i.e.,
it must return the double precision n-dimensional output parameter p = P_Ω(x). The output parameter flag is an integer scalar that must be used to indicate whether an exception occurred during the computation. For example, if the objective function calculation (or any other calculation) requires a division and the denominator is zero, or if a square root must be taken and the radicand is negative, parameter flag must be set to any nonzero value. In this way, on return, SPG will know that the required quantity was not properly computed, and a warning message will be shown to the user. Depending on the situation, the optimization process may or may not be interrupted. If everything went well in the computations, the output parameter flag must be set to zero. The evaluation of the objective function and its gradient requires, in addition to the input parameters of subroutines evalf and evalg, the dimension of the space q, the number of points m, and the points z_i ∈ R^q and their classifications w_i ∈ {1, −1}, i = 1, ..., m. All these quantities must be set prior to the optimization process and must be global variables accessible to subroutines evalf and evalg. These global variables must also be accessible to subroutine proj, which requires the value of q to map between x and A before and after computing the eigenvalues and eigenvectors of A. The values of the eigenvalue lower and upper bounds λ̄_min and λ̄_max may be set within subroutine proj, since they only take part in the definition of the feasible set. Note that the feasible set is implicitly defined by the effect of the projection subroutine proj. Subroutine proj uses subroutine eigen (which calls subroutine DSYEVR from LAPACK [102]) to compute the eigenvalues and eigenvectors of matrix A.
The prototype of subroutine spg is given by

spg <- function(n, x, epsopt, M, maxit, maxfc, iprint) {
    [...]
    list(x = ..., f = ..., gpsupn = ..., iter = ..., fcnt = ..., spginfo = ...)
}

The description of the input parameters follows. Parameter n is an integer value that corresponds to the number of variables n, and x is an n-dimensional double precision array containing the initial guess x_0. The double precision parameter epsopt corresponds to the tolerance ε in the stopping criterion related to the sup-norm of the projected gradient at Step 1 of Algorithm 2.1.
The integer parameter M is the nonmonotone line search parameter M. Integer parameters maxit and maxfc are the maximum allowed numbers of iterations and functional evaluations, respectively, which determine alternative stopping criteria for the method. Finally, integer parameter iprint controls the amount of information displayed in the output of the method. On output, the n-dimensional double precision array x represents the best approximation to the solution, the double precision value f is the objective function value at x, the double precision value gpsupn is the sup-norm of the projected gradient at the last iterate of the method, and the integer values iter and fcnt give the numbers of iterations and functional evaluations, respectively, used by the method (the number of gradient evaluations and projections is equal to the number of iterations plus one). The integer value spginfo describes the stopping criterion satisfied by the method: spginfo = 0 means that the stopping criterion related to the sup-norm of the projected gradient was satisfied, while spginfo = 1 and spginfo = 2 mean that the maximum allowed number of iterations or functional evaluations, respectively, was reached. The listings of the calling sequence to spg and of the subroutines evalf, evalg, and proj that define the problem are available in [100]. Numerical experiments were conducted using R version 2.14.1 on a 2.67GHz Intel Xeon CPU X5650 with 8GB of RAM running a GNU/Linux operating system (Ubuntu 12.04 LTS, kernel 3.2.0-33).
In the numerical experiments, we considered the default parameters of SPG mentioned in the previous section. Regarding the initial spectral step length, we considered the choice given by (8), and for the stopping criterion we arbitrarily fixed ε = 10^−6. The nonmonotone line search parameter was arbitrarily set to M = 100.
Figure 1 shows solutions to four small two-dimensional illustrative examples. In the four examples we consider q = 2 and m = 10,000 random points z_i ∈ R² uniformly distributed within the box [−100, 100]², and we fix the suitable values λ̄_min = 10^−4 and λ̄_max = 10^4. In the example of Figure 1(a), label w_i = 1 is given to the points inside a circle with radius 70 centered at the origin. If the reader is able to see the figure in colors, points labelled with w_i = 1 (i.e., inside the circle) are depicted in blue, while points outside the circle (with w_i = −1) are depicted in red. The examples in Figures 1(b-d) correspond to a square of side 70 centered at the origin, a rectangle with height equal to 70 and width equal to 140 centered at the origin, and a triangle. Table 1 presents a few figures that represent the computational effort needed by the SPG method to solve the problems. The figures in the table show that the stopping criterion was satisfied in the four cases and that, as expected, the ratio between the number of functional evaluations and the number of iterations is near unity, stressing the high rate of acceptance of the trial spectral step length along the projected gradient direction. A short note regarding the reported CPU times is in order. A Fortran 77 version of the SPG method (available in [100]) was also used to solve the same four problems. Equivalent solutions were obtained using CPU times three orders of magnitude smaller than the ones required by the R version of the method. Of the total required CPU time, approximately 99% is spent computing the objective function and its gradient. This observation is in complete agreement with the fact that SPG iterations have time complexity O(n) while the objective function and gradient evaluations have time complexity O(nm), and that, in the considered examples, we have n = q(q + 1) = 6 and m = 10,000. In this scenario, coding the problem subroutines evalf, evalg, and proj in R and developing an interface to call a compiled Fortran 77 version of spg is not an option. On the other hand, when the evaluation of the objective function and its derivatives is inexpensive compared to the linear algebra involved in an iteration of the optimization method, coding the problem subroutines in an interpreted language like R and developing interfaces to optimization methods written in compiled languages might be an adequate choice. This combination may jointly deliver (a) the fast development supplied by using the user's preferred language to code the problem subroutines, and (b) the efficiency of powerful optimization tools developed in compiled languages especially suited to numeric and scientific computing. This is the case of, for example, the optimization method Algencan [2,3] (see also the TANGO Project web page [101]) for nonlinear programming problems, coded in Fortran 77 and with interfaces to R, AMPL, C/C++, CUTEr, Java, MATLAB, Octave, Python, and TCL.

Applications and extensions
As a general rule, the SPG method is applicable to large-scale convex constrained problems in which the projection onto the feasible set can be inexpensively computed. Moreover, the SPG method is also attractive whenever the convex feasible set is hard to describe with nonlinear constraints, or when the nonlinear constraints needed to describe it have undesirable properties (e.g., being too many, being highly nonlinear, or having a dense Jacobian). Comparisons with other methods for the particular case in which the feasible set is an n-dimensional box can be found in [24] and [20]. Since its appearance, the SPG method has been useful for solving real applications in different areas, including optics [1,8,17,18,33,37,80,81,82,92,93,94], compressive sensing [12,13,49,76], geophysics [10,16,35,43,98], statistics [28,91], image restoration [11,27,29,63], atmospheric sciences [68,79], chemistry [53], and dental radiology [70,71]. The SPG method has also been combined with several schemes for solving scientific problems that appear in other areas of computational mathematics, including eigenvalue complementarity [69], support vector machines [34,40,88], non-differentiable optimization [36], trust-region globalization [77], generalized Sylvester equations [29], nonlinear monotone equations [97], condition number estimation [31], optimal control [19], bound-constrained minimization [5,6,21,22], nonlinear programming [2,3,4,7,23,46,57], non-negative matrix factorization [75], and topology optimization [89]. Moreover, alternative choices of the spectral step length have been considered and analyzed for solving some related nonlinear problems. A spectral approach has been considered to solve large-scale nonlinear systems of equations using only the residual vector as search direction [62,72,73,91,97]. Spectral variations have also been developed for accelerating the convergence of fixed-point iterations, in connection
with the well-known EM algorithm, which is frequently used in computational statistics [86,87].

Conclusions
The SPG method is nowadays a well-established numerical scheme for solving large-scale convex constrained optimization problems whenever the projection onto the feasible set can be performed efficiently. The attractiveness of the SPG method is mainly based on its simplicity: no sophisticated linear algebra codes or additional packages are required, and users can code their own version of the method with relatively little effort. In this review, we presented the basic ideas behind the SPG method, discussed some of its most relevant properties, and surveyed recent applications and extensions. In addition, we illustrated how the method can be applied to a particular matrix problem for which the feasible convex set is easily described by a subroutine that computes the projection onto it.

Figure 1: Solutions to four small two-dimensional illustrative examples.

Table 1: Performance of the R version of SPG in the four small two-dimensional illustrative examples.