Greedy quasi-Newton methods with explicit superlinear convergence


approximating the target operator. In Section 3, we analyze greedy quasi-Newton methods applied to the problem of minimizing a quadratic function. We show that these methods have a global linear convergence rate, comparable to that of standard gradient descent, and also a superlinear convergence rate, which contains a contraction factor depending on the square of the iteration counter. In Section 4, we show that similar results also hold for general nonlinear functions, provided that the starting point is chosen sufficiently close to the solution. The main difficulty here, compared to the quadratic case, is that the Hessian of the objective function is no longer constant, so a special correction strategy is needed to keep the Hessian approximations under control. Finally, in Section 5, we present some preliminary computational results.
Notation. In what follows, E denotes an arbitrary n-dimensional real vector space. Its dual space, composed of all linear functionals on E, is denoted by E*. The value of a linear functional s ∈ E*, evaluated at a point x ∈ E, is denoted by ⟨s, x⟩.
For a smooth function f : E → R, we denote by ∇f(x) and ∇²f(x) its gradient and Hessian, respectively, evaluated at a point x ∈ E. Note that ∇f(x) ∈ E*, and ∇²f(x) is a self-adjoint linear operator from E to E*.
The partial ordering of self-adjoint linear operators is defined in the standard way: we write A ≼ A₁ for self-adjoint linear operators A, A₁ : E → E* if ⟨(A₁ − A)x, x⟩ ≥ 0 for all x ∈ E.

Any self-adjoint positive definite linear operator A : E → E* induces in the spaces E and E* the following pair of conjugate Euclidean norms:
‖x‖_A := ⟨Ax, x⟩^{1/2}, x ∈ E,    ‖s‖*_A := ⟨s, A⁻¹s⟩^{1/2}, s ∈ E*. (1.1)
When A = ∇²f(x), where f : E → R is a smooth function with positive definite Hessian and x ∈ E, we prefer to use the notation ‖ · ‖_x and ‖ · ‖*_x, provided that there is no ambiguity with the reference function f.

Sometimes, in formulas involving products of linear operators, it is convenient to treat x ∈ E as a linear operator from R to E, defined by xα = αx, and x* as a linear operator from E* to R, defined by x*s = ⟨s, x⟩. In this case, xx* is a rank-one self-adjoint linear operator from E* to E, acting as follows: (xx*)s = ⟨s, x⟩x for s ∈ E*. Likewise, any s ∈ E* can be treated as a linear operator from R to E*, defined by sα = αs, and s* as a linear operator from E to R, defined by s*x = ⟨s, x⟩. Then, ss* is a rank-one self-adjoint linear operator from E to E*.

For two self-adjoint linear operators A : E → E* and W : E* → E, we define
⟨W, A⟩ := tr(W A).
Note that W A is a linear operator from E to itself, and hence its trace is well-defined (it coincides with the trace of the matrix representation of W A with respect to an arbitrarily chosen basis in the space E, and the result is independent of the particular choice of the basis). Observe that ⟨·, ·⟩ is a bilinear form, and for any x ∈ E, we have
⟨Ax, x⟩ = ⟨xx*, A⟩. (1.2)
When A is invertible, we also have ⟨A⁻¹, Axx*A⟩ = ⟨Ax, x⟩. If the operator W is positive semidefinite, and A ≼ A₁ for some self-adjoint linear operator A₁ : E → E*, then ⟨W, A⟩ ≤ ⟨W, A₁⟩. Similarly, if A is positive semidefinite and W ≼ W₁ for some self-adjoint linear operator W₁ : E* → E, then ⟨W, A⟩ ≤ ⟨W₁, A⟩. When A is positive definite and R : E → E* is a self-adjoint linear operator, ⟨A⁻¹, R⟩ equals the sum of the eigenvalues of R with respect to the operator A. In particular, if R is positive semidefinite, then all its eigenvalues with respect to A are nonnegative, and the maximal one can be bounded by the trace: λ_max(A⁻¹R) ≤ ⟨A⁻¹, R⟩.
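For readers who prefer matrix notation, the following small Python sketch identifies E and E* with Rⁿ, so that self-adjoint operators become symmetric matrices, rank-one operators become outer products, and the pairing ⟨W, A⟩ becomes a matrix trace. It is purely illustrative and not part of the formal development.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)

# A self-adjoint positive definite operator A : E -> E*, represented as an SPD matrix.
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)

x = rng.standard_normal(n)   # a point x in E
s = rng.standard_normal(n)   # a functional s in E*

val = s @ x                  # <s, x>: the value of the functional s at x

# Conjugate Euclidean norms induced by A, formulas (1.1):
norm_x = np.sqrt(x @ A @ x)                  # ||x||_A
norm_s = np.sqrt(s @ np.linalg.solve(A, s))  # ||s||*_A

# Rank-one operators: xx* maps E* -> E, ss* maps E -> E*.
xx = np.outer(x, x)
ss = np.outer(s, s)

# Trace pairing <W, A> := tr(W A) for W : E* -> E and A : E -> E*.
W = np.linalg.inv(A)
pairing = np.trace(W @ A)    # here <A^{-1}, A> = n

# Identity (1.2): <Ax, x> = <xx*, A>.
assert np.isclose(x @ A @ x, np.trace(xx @ A))
print(norm_x, norm_s, val, pairing)
```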

2 Greedy Quasi-Newton Updates
Let A : E → E* be a self-adjoint positive definite linear operator. In this section, we consider a class of quasi-Newton updating rules for approximating A.
Let G : E → E* be a self-adjoint linear operator such that
A ≼ G, (2.1)
and let u ∈ E be a direction. Consider the following family of updates, parameterized by a scalar τ ∈ R:
Broyd_τ(G, A, u) := τ [ G − (Auu*G + Guu*A)/⟨Au, u⟩ + (⟨Gu, u⟩/⟨Au, u⟩ + 1) Auu*A/⟨Au, u⟩ ] + (1 − τ) [ G − (G − A)uu*(G − A)/⟨(G − A)u, u⟩ ]. (2.2)
(In the degenerate case Gu = Au, we simply set Broyd_τ(G, A, u) := G.) For τ = 0, this is the SR1 update:
SR1(G, A, u) := G − (G − A)uu*(G − A)/⟨(G − A)u, u⟩, (2.3)
and, for τ = 1, it corresponds to the well-known DFP update:
DFP(G, A, u) := G − (Auu*G + Guu*A)/⟨Au, u⟩ + (⟨Gu, u⟩/⟨Au, u⟩ + 1) Auu*A/⟨Au, u⟩. (2.4)
Thus, (2.2) describes the Broyden family of quasi-Newton updates (see [22, Section 6.3]), and can be written as the linear combination of the DFP and SR1 updates:
Broyd_τ(G, A, u) = τ · DFP(G, A, u) + (1 − τ) · SR1(G, A, u). (2.5)
Our main interest will be in the class described by the values τ ∈ [0, 1], i.e. in the convex combinations of the DFP and SR1 updates. Note, in particular, that this subclass includes another well-known update, BFGS. Indeed, for
τ_BFGS := ⟨Au, u⟩/⟨Gu, u⟩ ∈ [0, 1],
we have 1 − τ_BFGS = ⟨(G − A)u, u⟩/⟨Gu, u⟩, and thus
Broyd_{τ_BFGS}(G, A, u) = G − Guu*G/⟨Gu, u⟩ + Auu*A/⟨Au, u⟩ =: BFGS(G, A, u). (2.6)
This is the classic BFGS formula for the direction u.
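The updates above are easy to express in matrix form. The following Python sketch (illustrative only, with E identified with Rⁿ) implements SR1, DFP, BFGS and their combination, and checks numerically that the value τ_BFGS recovers the BFGS formula.

```python
import numpy as np

def dfp(G, A, u):
    """DFP update (2.4) of G toward the target operator A along direction u."""
    Au, Gu = A @ u, G @ u
    a = u @ Au                     # <Au, u>
    g = u @ Gu                     # <Gu, u>
    return (G
            - (np.outer(Au, Gu) + np.outer(Gu, Au)) / a
            + (g / a + 1.0) * np.outer(Au, Au) / a)

def sr1(G, A, u):
    """SR1 update (2.3); G is returned unchanged if Gu = Au."""
    d = (G - A) @ u
    du = d @ u                     # <(G - A)u, u>
    return G.copy() if du == 0.0 else G - np.outer(d, d) / du

def bfgs(G, A, u):
    """BFGS update (2.6)."""
    Au, Gu = A @ u, G @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Au, Au) / (u @ Au)

def broyd(tau, G, A, u):
    """Broyden family (2.5): the combination tau * DFP + (1 - tau) * SR1."""
    return tau * dfp(G, A, u) + (1.0 - tau) * sr1(G, A, u)

# Sanity check: tau_BFGS = <Au, u> / <Gu, u> recovers the BFGS formula.
rng = np.random.default_rng(1)
n = 5
Q = rng.standard_normal((n, n)); A = Q @ Q.T + n * np.eye(n)
G = A + np.eye(n)                  # some G with A <= G
u = rng.standard_normal(n)
tau_bfgs = (u @ A @ u) / (u @ G @ u)
assert np.allclose(broyd(tau_bfgs, G, A, u), bfgs(G, A, u))
```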
Let us show that the Broyden family is monotone in the parameter τ.
Lemma 2.1 If (2.1) holds, then, for any u ∈ E and any τ₁ ≤ τ₂, we have
Broyd_{τ₁}(G, A, u) ≼ Broyd_{τ₂}(G, A, u).
Proof: Without loss of generality, suppose Gu ≠ Au. Then
Broyd_τ(G, A, u) = SR1(G, A, u) + τ ⟨(G − A)u, u⟩ ss*,    where s := Au/⟨Au, u⟩ − (G − A)u/⟨(G − A)u, u⟩.
The claim now follows from the fact that ⟨(G − A)u, u⟩ ss* ≽ 0 in view of (2.1). □
Next, let us show that the relation (2.1) is preserved after applying to G any update from the class of our interest. Moreover, each update from this class does not increase the deviation from the target operator A.

Lemma 2.2 If, for some η ≥ 1, we have
A ≼ G ≼ ηA, (2.7)
then, for any u ∈ E and any τ ∈ [0, 1], we also have
A ≼ Broyd_τ(G, A, u) ≼ ηA. (2.8)
Proof: Without loss of generality, Gu ≠ Au. Denote R := G − A ≽ 0, and let I_E, I_{E*} be the identity operators in the spaces E and E*, respectively. Then
SR1(G, A, u) − A = R − Ruu*R/⟨Ru, u⟩ = (I_{E*} − Ruu*/⟨Ru, u⟩) R (I_E − uu*R/⟨Ru, u⟩) ≽ 0,
and hence, by Lemma 2.1, Broyd_τ(G, A, u) ≽ Broyd_0(G, A, u) = SR1(G, A, u) ≽ A. Thus, the first relation in (2.8) is proved. To prove the second relation, we apply Lemma 2.1, using that τ ≤ 1, to obtain
Broyd_τ(G, A, u) ≼ DFP(G, A, u) = (I_{E*} − Auu*/⟨Au, u⟩) G (I_E − uu*A/⟨Au, u⟩) + Auu*A/⟨Au, u⟩ ≼ η (A − Auu*A/⟨Au, u⟩) + Auu*A/⟨Au, u⟩ ≼ ηA. □
Interestingly, from Lemma 2.1 and Lemma 2.2, it follows that, if (2.1) holds, then
A ≼ SR1(G, A, u) ≼ BFGS(G, A, u) ≼ DFP(G, A, u).
In other words, the approximation produced by SR1 is better than the one produced by BFGS, which is in turn better than the one produced by DFP.
Let us now justify the efficiency of update (2.2) with τ ∈ [0, 1] in ensuring the convergence G → A. For this, we introduce the following measure of progress:
σ_A(G) := ⟨A⁻¹, G − A⟩. (2.9)

Lemma 2.3 Let (2.1) hold. Then, for any nonzero u ∈ E and any τ ∈ [0, 1],
σ_A(G) − σ_A(Broyd_τ(G, A, u)) ≥ ⟨(G − A)u, u⟩ / ⟨Au, u⟩. (2.10)

According to Lemma 2.3, the choice of the updating direction u directly influences the decrease in the measure σ_A. Ideally, we would like to select a direction u which maximizes the right-hand side in (2.10). However, this requires finding an eigenvector corresponding to the maximal eigenvalue of G with respect to A, which might be a computationally difficult problem. Therefore, let us consider another approach.
Let us fix in the space E some basis e₁, …, e_n and, instead of maximizing over all directions, select the best direction among the basis vectors:
ū_A(G) := argmax_{1 ≤ i ≤ n} ⟨(G − A)e_i, e_i⟩ / ⟨Ae_i, e_i⟩. (2.11)
Let us show that the greedy quasi-Newton update, which uses the direction (2.11), decreases the measure σ_A with a linear rate. For this, define
B := ( ∑_{i=1}^n e_i e_i* )⁻¹. (2.12)
Note that B is a self-adjoint positive definite linear operator from E to E*.
Theorem 2.1 Let (2.1) hold, and let µ, L > 0 be such that
µB ≼ A ≼ LB. (2.13)
Then, for any τ ∈ [0, 1], we have
σ_A(Broyd_τ(G, A, ū_A(G))) ≤ (1 − µ/(nL)) σ_A(G). (2.14)
Remark 2.1 A simple modification of the above proof shows that the factor nL in (2.14) can be improved up to ⟨B⁻¹, A⟩. However, to simplify the subsequent analysis, we prefer to work directly with the constant L.
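The contraction of Theorem 2.1 is easy to observe numerically. The following sketch illustrates it for the SR1 update (τ = 0), using the standard basis of Rⁿ, so that B = I and the greedy rule (2.11) amounts to comparing ratios of diagonal entries of G − A and A; it is illustrative only.

```python
import numpy as np

def sr1(G, A, u):
    d = (G - A) @ u
    du = d @ u
    return G if du == 0.0 else G - np.outer(d, d) / du

def sigma(A, G):
    """Measure of progress (2.9): sigma_A(G) = <A^{-1}, G - A>."""
    return np.trace(np.linalg.solve(A, G - A))

def greedy_direction(G, A):
    """Greedy basis vector (2.11) for the standard basis (B = I)."""
    i = np.argmax(np.diag(G - A) / np.diag(A))
    e = np.zeros(A.shape[0]); e[i] = 1.0
    return e

rng = np.random.default_rng(2)
n = 20
Q = rng.standard_normal((n, n)); A = Q @ Q.T + np.eye(n)
evals = np.linalg.eigvalsh(A)
mu, L = evals.min(), evals.max()          # mu*I <= A <= L*I
G = L * np.eye(n)                         # so that A <= G

for k in range(5):
    u = greedy_direction(G, A)
    G_new = sr1(G, A, u)
    print(f"k={k}: sigma ratio {sigma(A, G_new) / sigma(A, G):.4f}"
          f"  (bound {1 - mu / (n * L):.4f})")
    G = G_new
```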

3 Unconstrained Quadratic Minimization
Let us demonstrate how we can apply the quasi-Newton updates, described in the previous section, to minimizing the quadratic function
f(x) := ½ ⟨Ax, x⟩ − ⟨b, x⟩, x ∈ E, (3.1)
where A : E → E* is a self-adjoint positive definite linear operator, and b ∈ E*. Let B be the operator defined in (2.12), and let µ, L > 0 be such that
µB ≼ A ≼ LB. (3.2)
Thus, µ is the constant of strong convexity of f, and L is the Lipschitz constant of the gradient of f, both measured with respect to the operator B.
Consider the following quasi-Newton scheme:
Initialization: Choose x₀ ∈ E. Set G₀ = LB.
For k ≥ 0 iterate:
1. Update x_{k+1} = x_k − G_k⁻¹ ∇f(x_k).
2. Choose u_k ∈ E and τ_k ∈ [0, 1], and set G_{k+1} = Broyd_{τ_k}(G_k, A, u_k). (3.3)
Note that scheme (3.3) starts with G₀ = LB. Therefore, its first iteration is identical to that of the standard gradient method:
x₁ = x₀ − (1/L) B⁻¹ ∇f(x₀).
Also, from (3.2), it follows that A ≼ G₀. Hence, in view of Lemma 2.2, we have A ≼ G_k for all k ≥ 0. In particular, all G_k are positive definite, and scheme (3.3) is well-defined.
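As an illustration, here is a compact Python sketch of scheme (3.3) on a random quadratic, with E = Rⁿ, B = I (so G₀ = L·I), greedy directions (2.11), and τ_k chosen as τ_BFGS (i.e. the GrBFGS variant). It is a sketch only and, for simplicity, re-solves the linear system at every step.

```python
import numpy as np

def bfgs(G, A, u):
    Au, Gu = A @ u, G @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Au, Au) / (u @ Au)

rng = np.random.default_rng(3)
n = 10
Q = rng.standard_normal((n, n))
A = Q @ Q.T + np.eye(n)                # Hessian of the quadratic (3.1)
b = rng.standard_normal(n)             # so that grad f(x) = A x - b

L = np.linalg.eigvalsh(A).max()        # here B = I, so L is the largest eigenvalue of A
x = rng.standard_normal(n)
G = L * np.eye(n)                      # G_0 = L*B

for k in range(50):
    g = A @ x - b
    lam = np.sqrt(g @ np.linalg.solve(A, g))        # lambda_f(x_k)
    if k % 10 == 0:
        print(f"k={k:2d}  lambda_f = {lam:.3e}")
    x = x - np.linalg.solve(G, g)                   # x_{k+1} = x_k - G_k^{-1} grad f(x_k)
    i = np.argmax(np.diag(G - A) / np.diag(A))      # greedy basis vector (2.11)
    e = np.zeros(n); e[i] = 1.0
    G = bfgs(G, A, e)                               # tau_k = tau_BFGS (GrBFGS)
```

A practical implementation would instead maintain G_k⁻¹ or a Cholesky factor of G_k, as discussed in Remark 3.1 below.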
Remark 3.1 To avoid the O(n³) complexity of computing G_k⁻¹∇f(x_k), a practical implementation of scheme (3.3) typically works directly with the inverse operators G_k⁻¹ (or, alternatively, with the Cholesky decomposition of G_k). Due to the low-rank structure of the updates (2.2), these objects can be updated efficiently, in O(n²) operations per iteration.
To estimate the convergence rate of scheme (3.3), let us look at the norm of the gradient of f, measured with respect to A:
λ_f(x) := ‖∇f(x)‖*_A = ⟨∇f(x), A⁻¹∇f(x)⟩^{1/2}.
Note that this measure of optimality is directly related to the functional residual. Indeed, let x* = A⁻¹b be the minimizer of (3.1). Then, using Taylor's formula, we obtain
f(x) − f(x*) = ½ ⟨A(x − x*), x − x*⟩ = ½ λ_f(x)².
The following lemma shows how λ_f changes after one iteration of process (3.3).

Lemma 3.1 Let k ≥ 0, and let η_k ≥ 1 be such that
A ≼ G_k ≼ η_k A.
Then
λ_f(x_{k+1}) ≤ (1 − 1/η_k) λ_f(x_k).
Proof: By Taylor's formula, we have
∇f(x_{k+1}) = ∇f(x_k) + A(x_{k+1} − x_k) = (I_{E*} − A G_k⁻¹) ∇f(x_k).
Therefore,
λ_f(x_{k+1}) = ‖(I_{E*} − A G_k⁻¹) ∇f(x_k)‖*_A.
Note that, in view of the assumption on G_k,
0 ≼ A⁻¹ − G_k⁻¹ ≼ (1 − 1/η_k) A⁻¹.
Consequently, denoting g := ∇f(x_k), we obtain
(λ_f(x_{k+1}))² = ⟨g, (A⁻¹ − G_k⁻¹) A (A⁻¹ − G_k⁻¹) g⟩ ≤ (1 − 1/η_k)² ⟨g, A⁻¹ g⟩ = (1 − 1/η_k)² (λ_f(x_k))²,
and the claim follows. □
Thus, to estimate how fast λ_f(x_k) converges to zero, we need to upper bound η_k. There are two ways to proceed, depending on the choice of the directions u_k in (3.3).
First, consider the general situation, when we do not impose any restrictions on u_k. In this case, we can guarantee that η_k stays uniformly bounded, and λ_f(x_k) → 0 at a linear rate.
Theorem 3.1 For all k ≥ 0, in scheme (3.3), we have
A ≼ G_k ≼ (L/µ) A, (3.8)
and
λ_f(x_k) ≤ (1 − µ/L)^k λ_f(x₀). (3.9)
Proof: Since G₀ = LB, in view of (3.2), we have A ≼ G₀ ≼ (L/µ)A. By Lemma 2.2, this implies (3.8). Applying now Lemma 3.1 with η_k = L/µ, we obtain
λ_f(x_{k+1}) ≤ (1 − µ/L) λ_f(x_k)
for all k ≥ 0, and (3.9) follows. □
Note that (3.9) is exactly the convergence rate of the standard gradient method. Thus, according to Theorem 3.1, the convergence rate of scheme (3.3) is at least as good as that of the gradient method. Now assume that the directions u_k in scheme (3.3) are chosen in accordance with the greedy strategy (2.11). Recall that, in this case, we can guarantee that G_k → A. Therefore, we can expect faster convergence from scheme (3.3).

Theorem 3.2 Suppose that, for each k ≥ 0, the direction u_k in scheme (3.3) is chosen in accordance with the greedy strategy (2.11): u_k := ū_A(G_k). Then, for all k ≥ 0, we have
A ≼ G_k ≼ (1 + (1 − µ/(nL))^k · nL/µ) A, (3.10)
and
λ_f(x_{k+1}) ≤ (1 − µ/(nL))^k · (nL/µ) · λ_f(x_k). (3.11)
Proof: We already know that A ≼ G_k. Hence,
G_k − A ≼ σ_A(G_k) · A,
or, equivalently, G_k ≼ (1 + σ_A(G_k)) A. At the same time, by Theorem 2.1, we have
σ_A(G_k) ≤ (1 − µ/(nL))^k σ_A(G₀) ≤ (1 − µ/(nL))^k · nL/µ,
since σ_A(G₀) = ⟨A⁻¹, LB − A⟩ ≤ n(L/µ − 1) ≤ nL/µ. Thus, (3.10) is proved. Applying now Lemma 3.1 with η_k = 1 + σ_A(G_k), we obtain (3.11). □
Theorem 3.2 shows that the convergence rate of λ_f(x_k) is superlinear. Let us now combine this result with Theorem 3.1 and write down the final efficiency estimate. Denote by k₀ ≥ 0 the number of the first iteration for which
(1 − µ/(nL))^{k₀} · nL/µ ≤ ½.
Clearly, k₀ ≤ (nL/µ) ln(2nL/µ). According to Theorem 3.1, during the first k₀ iterations,
λ_f(x_k) ≤ (1 − µ/L)^k λ_f(x₀).
After that, by Theorem 3.2, for all k ≥ 0, we have
λ_f(x_{k₀+k+1}) ≤ (1 − µ/(nL))^{k₀+k} · (nL/µ) · λ_f(x_{k₀+k}) ≤ (1 − µ/(nL))^k · ½ · λ_f(x_{k₀+k}),
or, more explicitly,
λ_f(x_{k₀+k}) ≤ (1 − µ/(nL))^{k(k−1)/2} · (½)^k · λ_f(x_{k₀}). □
Note that the first factor in this estimate depends on the square of the iteration counter.
To conclude, let us mention one important property of scheme (3.3) with greedily selected u_k. It turns out that, in the particular case when τ_k = 0 for all k ≥ 0, i.e. when scheme (3.3) corresponds to the greedy SR1 method, it identifies the operator A, and consequently the minimizer x* of (3.1), in a finite number of steps. The argument is based on the following observations: denoting R_k := G_k − A ≽ 0, one can check that, for the SR1 update, Ker(R_k) ⊆ Ker(R_{k+1}), while the greedily selected direction (2.11) satisfies u_k ∈ Ker(R_{k+1}) \ Ker(R_k) whenever R_k ≠ 0. Thus, the dimension of Ker(R_k) grows at least by 1 at every iteration. In particular, the dimension of Ker(R_{n+1}) would have to be at least n + 1, which is impossible, since the operator R_{n+1} acts on an n-dimensional vector space; hence G_k = A for some k ≤ n. □ Note that, for other updates, such as (2.4) or (2.6), the inclusion Ker(R_k) ⊆ Ker(R_{k+1}) is, in general, no longer valid.
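This finite-termination property is easy to observe numerically. The following sketch (standard basis, B = I) runs greedy SR1 updates on a random positive definite A and reports when G_k coincides with A up to rounding; it is illustrative only.

```python
import numpy as np

def sr1(G, A, u):
    d = (G - A) @ u
    du = d @ u
    return G if du == 0.0 else G - np.outer(d, d) / du

rng = np.random.default_rng(4)
n = 8
Q = rng.standard_normal((n, n))
A = Q @ Q.T + np.eye(n)
L = np.linalg.eigvalsh(A).max()
G = L * np.eye(n)                               # G_0 = L*B with B = I, so A <= G_0

for k in range(n + 1):
    if np.allclose(G, A):
        print(f"greedy SR1 identified A after {k} updates (k <= n = {n})")
        break
    i = np.argmax(np.diag(G - A) / np.diag(A))  # greedy basis vector (2.11)
    e = np.zeros(n); e[i] = 1.0
    G = sr1(G, A, e)

print("max |G - A| entry:", np.abs(G - A).max())
```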

4 Minimization of General Functions
Now consider a general unconstrained minimization problem:
min_{x ∈ E} f(x), (4.1)
where f : E → R is a twice differentiable function with positive definite Hessian. Our goal is to extend the results obtained in the previous section to problem (4.1), assuming that the methods can start from a sufficiently good initial point x₀.
Our main assumption is that the Hessians of f are close to each other, in the sense that there exists a constant M ≥ 0 such that
∇²f(y) − ∇²f(x) ≼ M ‖y − x‖_z ∇²f(w) (4.2)
for all x, y, z, w ∈ E. We call such a function f strongly self-concordant. Note that strongly self-concordant functions form a subclass of self-concordant functions. Indeed, let us choose a point x ∈ E and a direction h ∈ E. Then, for all t > 0, we have
−M t ‖h‖_x ∇²f(x) ≼ ∇²f(x + th) − ∇²f(x) ≼ M t ‖h‖_x ∇²f(x).
Dividing this inequality by t and computing the limit as t ↓ 0, we obtain
|D³f(x)[h, h, h]| ≤ M ‖h‖_x ⟨∇²f(x)h, h⟩ = M ⟨∇²f(x)h, h⟩^{3/2}
for all h ∈ E. Thus, the function f is self-concordant with constant ½M (see [19, 27]). The main example of a strongly self-concordant function is a strongly convex function with a Lipschitz continuous Hessian. Note, however, that strong self-concordance is an affine-invariant property.
Example 4.1 Let C : E → E* be a self-adjoint positive definite operator. Suppose there exist β > 0 and L₂ ≥ 0 such that the function f is β-strongly convex and its Hessian is L₂-Lipschitz continuous, both measured with respect to the norm ‖ · ‖_C induced by C. Then f is strongly self-concordant with constant M = L₂/β^{3/2}.
Proof: By strong convexity of f, we have ∇²f(z) ≽ βC for all z ∈ E. Therefore, using the Lipschitz continuity of the Hessian, we obtain, for all x, y, z, w ∈ E,
∇²f(y) − ∇²f(x) ≼ L₂ ‖y − x‖_C C ≼ (L₂/β^{1/2}) ‖y − x‖_z C ≼ (L₂/β^{3/2}) ‖y − x‖_z ∇²f(w). □
Let us establish some useful relations for strongly self-concordant functions.
Lemma 4.1 Let f be strongly self-concordant, let x, y ∈ E, and denote r := ‖y − x‖_x. Then
∇²f(x) / (1 + Mr) ≼ ∇²f(y) ≼ (1 + Mr) ∇²f(x), (4.4)
and, for J := ∫₀¹ ∇²f(x + t(y − x)) dt,
∇²f(x) / (1 + Mr/2) ≼ J ≼ (1 + Mr/2) ∇²f(x), (4.5)
∇²f(y) / (1 + Mr/2) ≼ J ≼ (1 + Mr/2) ∇²f(y). (4.6)
Proof: Taking z = w = x in (4.2), we obtain
∇²f(y) − ∇²f(x) ≼ M ‖y − x‖_x ∇²f(x) = Mr ∇²f(x),
which gives us the second relation in (4.4) after moving ∇²f(x) into the right-hand side.
Interchanging now x and y in (4.2) and taking z = x, w = y, we get
∇²f(x) − ∇²f(y) ≼ Mr ∇²f(y),
which gives us the first relation in (4.4) after moving ∇²f(x) into the right-hand side and then dividing by 1 + Mr.
Further, denote h := y − x. Choosing now y = x + th in (4.2) for t ∈ (0, 1], and w = z = x, we obtain
∇²f(x + th) − ∇²f(x) ≼ M t ‖h‖_x ∇²f(x) = M t r ∇²f(x).
This gives us the second relation in (4.5) after integrating over t from 0 to 1 and moving ∇²f(x) into the right-hand side. Interchanging x and y in (4.2) and taking y = x + th for t ∈ (0, 1], z = x, while leaving w arbitrary, we get
∇²f(x) − ∇²f(x + th) ≼ M t r ∇²f(w).
Hence, by integrating over t from 0 to 1,
∇²f(x) − J ≼ (Mr/2) ∇²f(w).
Taking now w = x + th and integrating again, we obtain
∇²f(x) − J ≼ (Mr/2) J,
and the first inequality in (4.5) follows after moving J to the right-hand side and dividing by 1 + Mr/2. Relations (4.6) can be proved similarly to (4.5). □
Let us now estimate the progress of a general quasi-Newton step. As before, for measuring the progress, we use the local norm of the gradient:
λ_f(x) := ‖∇f(x)‖*_x = ⟨∇f(x), ∇²f(x)⁻¹∇f(x)⟩^{1/2}.
Lemma 4.2 Let x ∈ E, and let G : E → E* be a self-adjoint linear operator such that
∇²f(x) ≼ G ≼ η ∇²f(x)
for some η ≥ 1, and let x₊ := x − G⁻¹∇f(x), r := ‖x₊ − x‖_x. Then, in view of Lemma 4.1, the local gradient norm λ_f(x₊) at the new point can be bounded in terms of η, Mr and λ_f(x), which is the content of relation (4.12).
Now we need to analyze what happens with the Hessian approximation after a quasi-Newton update. Let G be the current approximation of ∇²f(x), satisfying, as usual, the condition
∇²f(x) ≼ G. (4.13)
Using this approximation, we can compute the new test point
x₊ := x − G⁻¹∇f(x).
After that, we would like to update G into a new operator G₊, approximating the Hessian ∇²f(x₊) at the new point and satisfying the condition ∇²f(x₊) ≼ G₊. A natural idea is, of course, to set
G₊ = Broyd_τ(G, ∇²f(x₊), u) (4.14)
for some u ∈ E and τ ∈ [0, 1]. However, we cannot do this, since update (4.14) is well-defined only when ∇²f(x₊) ≼ G (see Section 2), which may not be true, even though (4.13) holds. To avoid this problem, let us apply the following correction strategy:
1. Choose some δ ≥ 0, and set G̃ = (1 + δ)G.
2. Use G̃ instead of G in (4.14): G₊ = Broyd_τ(G̃, ∇²f(x₊), u).
Clearly, for some value of δ, the condition ∇²f(x₊) ≼ G̃ will be valid. If, at the same time, this δ is sufficiently small, then the above correction strategy should not introduce too big an error.

Lemma 4.3 Let x ∈ E, and let G : E → E* be a self-adjoint linear operator such that
∇²f(x) ≼ G ≼ η ∇²f(x)
for some η ≥ 1. Let x₊ := x − G⁻¹∇f(x), r := ‖x₊ − x‖_x, and G̃ := (1 + Mr)G. Then
∇²f(x₊) ≼ G̃ ≼ (1 + Mr)² η ∇²f(x₊),
and, for all u ∈ E and τ ∈ [0, 1], we have
∇²f(x₊) ≼ Broyd_τ(G̃, ∇²f(x₊), u) ≼ (1 + Mr)² η ∇²f(x₊).
Proof: By Lemma 4.1, ∇²f(x₊) ≼ (1 + Mr)∇²f(x) ≼ (1 + Mr)G = G̃, and,
G̃ = (1 + Mr)G ≼ (1 + Mr) η ∇²f(x) ≼ (1 + Mr)² η ∇²f(x₊).
Thus, ∇²f(x₊) ≼ G̃ ≼ (1 + Mr)² η ∇²f(x₊), and the claim now follows from Lemma 2.2. □
Let us now make one more assumption about the function f. We assume that, with respect to the operator B, defined by (2.12), the function f is strongly convex and its gradient is Lipschitz continuous, i.e. there exist µ, L > 0 such that, for all x ∈ E, we have
µB ≼ ∇²f(x) ≼ LB. (4.17)
Remark 4.1 In fact, for our purposes, it is enough to require that conditions (4.2) and (4.17) hold only in a neighborhood of the solution, but, for the sake of simplicity, we do not do this.
We are ready to write down the scheme of our quasi-Newton methods. For simplicity, we assume that the constants M and L are available.
Initialization: Choose x₀ ∈ E. Set G₀ = LB.
For k ≥ 0 iterate:
1. Update x_{k+1} = x_k − G_k⁻¹ ∇f(x_k).
2. Compute r_k = ‖x_{k+1} − x_k‖_{x_k} and set G̃_k = (1 + M r_k) G_k.
3. Choose u_k ∈ E and τ_k ∈ [0, 1], and set G_{k+1} = Broyd_{τ_k}(G̃_k, ∇²f(x_{k+1}), u_k). (4.18)
As before, we present two convergence results for scheme (4.18). The first one establishes linear convergence and can be seen as a generalization of Theorem 3.1. Note that, for this result, the directions u_k in method (4.18) can be chosen arbitrarily.
Theorem 4.1 Suppose the initial point x₀ is chosen sufficiently close to the solution. Then, for all k ≥ 0, relations (4.20) and (4.21) hold.
Note that … Applying Lemma 4.2, we obtain that … and … Note (using the inequality 1 − t ≥ e^{−2t}, valid at least for 0 ≤ t ≤ ½) and also (using ln(1 + t) ≤ t, valid for t ≥ 0) that … Hence, … Consequently, … Finally, from Lemma 4.3, it follows that … □
Lemma 4.4 … Then, for any τ ∈ [0, 1], we have …
Proof: We already know from Lemma 4.3 that … (the directions being chosen by the greedy strategy (2.11)). Hence, by Theorem 2.1, we have … Further, … □
Now we can prove superlinear convergence. In what follows, we assume that n ≥ 2.
Theorem 4.2 Suppose that, for each k ≥ 0, the direction u_k in scheme (4.18) is chosen in accordance with the greedy strategy (2.11), and suppose that the initial point x₀ is chosen sufficiently close to the solution (condition (4.28)). Then, for all k ≥ 0, relations (4.29) and (4.30) hold.
Proof: In view of Theorem 4.1, the first relation in (4.29) is indeed true, and also … for all k ≥ 0.
Let us show by induction that, for all k ≥ 0, we have …, (4.32)
where … Since …, we have … ≤ n … Therefore, for k = 0, inequality (4.32) is satisfied. Now suppose that it is also satisfied for some k ≥ 0. Then …, or, equivalently, … Therefore, applying Lemma 4.2, we obtain that … and … Further, by Lemma 4.4, we have … Note that ½ ≤ 1 − µ/(nL), since n ≥ 2. Therefore, … Thus, (4.32) is proved.
Let us now fix some k ≥ 0. Since λ_k ≥ 0, we have … This proves the second relation in (4.29), in view of (4.34). Finally, … and we obtain (4.30). □
Similarly to the quadratic case, combining Theorem 4.1 with Theorem 4.2, we obtain the following final efficiency estimate: …, where k₀ ≤ (nL/µ) ln(2nL/µ).

5 Numerical Experiments
In this section, we present preliminary computational results for greedy quasi-Newton methods, applied to the test function (5.1). We compare scheme (4.18) (which realizes GrDFP, GrBFGS and GrSR1, depending on the choice of τ_k) with the usual gradient method (GM) and the standard quasi-Newton methods DFP, BFGS and SR1.
All the standard methods need access only to the gradient of the function f; for a given point x ∈ Rⁿ, ∇f(x) can be computed in O(mn) operations. Note that, by construction, ∇f(0) = 0, so the unique minimizer of our test function (5.1) is x* = 0. The starting point x₀ for all methods is the same and is generated randomly from the uniform distribution on the standard Euclidean sphere of radius 1/n (this choice is motivated by (4.28)).

Thus, our test function (5.1) has three parameters: the dimension n, the number m of linear functions, and the regularization coefficient γ. Let us present computational results for different values of these parameters. The termination criterion for all methods is f(x_k) − f(x*) ≤ ϵ, where ϵ is the desired accuracy. In the tables below, for each method, we display the number of iterations until its termination. The minus sign (−) means that the method has not been able to achieve the required accuracy after 1000n iterations.

We see that all quasi-Newton methods outperform the gradient method and demonstrate superlinear convergence (from some moment, the difference in the number of iterations between successive rows in the table becomes smaller and smaller). Among the quasi-Newton methods (both the standard and the greedy ones), SR1 is always better than BFGS, while DFP is significantly worse than the other two. At the first few iterations, the greedy methods lose to the standard ones, but later they catch up. However, the classical SR1 method always remains the best. Nevertheless, the greedy methods are quite competitive.

Now let us look at the quality of the Hessian approximations produced by the quasi-Newton methods. In the tables below, we display the desired accuracy ϵ versus the final Hessian approximation error (defined as the operator norm of G_k − ∇²f(x_k), measured with respect to ∇²f(x_k)). We look at the same problems as in Table 1 and Table 3. As we can see from these tables, for the standard quasi-Newton methods the Hessian approximation error always stays at the initial level. In contrast, for the greedy ones, it decreases relatively fast (especially for GrBFGS and GrSR1). Note also that sometimes the initial residual slightly increases at the first several iterations (which is noticeable only for GrDFP). This happens because the objective function is non-quadratic, and we apply the correction strategy.

Note that in all the above tests we have used the same values for the parameters n and m. Let us briefly illustrate what happens when, for example, m > n. It is instructive to compare these tables with Table 1 and Table 3, which contain the results for the greedy methods on the same problems. We see that the randomized methods are slightly slower than the greedy ones. However, the difference is not really significant and, what is especially interesting, the randomized methods do not lose superlinear convergence.

6 Discussion
We have presented greedy quasi-Newton methods that are based on the updating formulas from the Broyden family and use greedily selected basis vectors for updating the Hessian approximations. For these methods, we have established an explicit non-asymptotic rate of local superlinear convergence for the iterates, and also a linear convergence rate for the deviations of the Hessian approximations from the true Hessians.
Clearly, there are a number of open questions. First, at every iteration, our methods need to compute the greedily selected basis vector. This requires additional information beyond just the gradient of the objective function (such as the diagonal of the Hessian). However, many problems that arise in applications possess certain structure (separable, sparse, etc.), for which the corresponding computations have a cost similar to that of the gradient evaluation (such as the test function in our experiments). Nevertheless, ideally it is desirable to dispense with this auxiliary information altogether. A natural idea might be to replace the greedy strategy with a randomized one. Indeed, as can be seen from our experiments, the corresponding scheme (4.18), (5.5) demonstrates almost the same performance as the greedy one. Therefore, one can expect that it should be possible to establish similar theoretical results about its superlinear convergence. Nevertheless, at the moment, we do not know how to do this. Although it is not difficult to show that, in terms of expectations, the randomized strategy still preserves the linear convergence of the Hessian approximations (see [25]), it is not clear how to proceed after this in proving the superlinear convergence of the iterates, even in the quadratic case. The main difficulty arising in the analysis is that, at some point, one needs to take the expectation of a product of random variables with known expectations, but the random variables themselves are not independent.
Second, we have analyzed a whole class of Hessian approximation updates together, by essentially upper bounding all its members via the worst one, DFP. Thus, all the efficiency guarantees that we have established might be too pessimistic for other members of this class, such as BFGS, and especially SR1. Indeed, in our experiments, we have seen that the convergence properties of these three methods can differ quite significantly. It is therefore desirable to refine our current analysis and obtain separate estimates for the different updates.
Third, note that our current results do not prove anything about the rate of superlinear convergence of the standard quasi-Newton methods. Of course, it would be interesting to obtain the corresponding estimates and compare them to the ones that we have established in this work.
Finally, apart from the quadratic case, we have not addressed at all the question of global convergence.
In any case, we believe that the ideas and the theoretical analysis, presented in this paper, will be useful for future advances in the theory of quasi-Newton methods.

Remark 4.2 Similarly to Remark 3.1, in a practical implementation of scheme (4.18), one should work directly with the inverse operators G_k⁻¹, or with the Cholesky decomposition of G_k. Note that the correction step G̃_k = (1 + M r_k)G_k does not affect the complexity of the iteration.
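For illustration, here is a schematic Python sketch of scheme (4.18) in its GrSR1 form, applied to a simple placeholder objective (not the test function (5.1) from Section 5). It assumes the standard basis (so B = I and the greedy rule (2.11) compares diagonal ratios), uses a crude upper estimate M = 1 for the strong self-concordance constant (any valid upper bound keeps the correction step safe), and, contrary to this remark, recomputes matrix factorizations from scratch at every iteration for clarity.

```python
import numpy as np

# A placeholder smooth strongly convex objective (NOT the paper's test function):
# f(x) = 1/2 <Cx, x> + sum_i log(1 + exp(x_i)).
rng = np.random.default_rng(5)
n = 10
Q = rng.standard_normal((n, n)); C = Q @ Q.T / n + np.eye(n)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
grad = lambda x: C @ x + sig(x)
hess = lambda x: C + np.diag(sig(x) * (1.0 - sig(x)))

def sr1(G, A, u):
    d = (G - A) @ u
    du = d @ u
    return G if du <= 0.0 else G - np.outer(d, d) / du

L = np.linalg.eigvalsh(C).max() + 0.25   # hess(x) <= L*I for all x (B = I)
M = 1.0                                  # assumed upper bound on the constant in (4.2)

x = rng.standard_normal(n) / n           # starting point near the solution
G = L * np.eye(n)                        # G_0 = L*B

for k in range(25):
    g = grad(x)
    lam = np.sqrt(g @ np.linalg.solve(hess(x), g))     # lambda_f(x_k)
    if k % 5 == 0:
        print(f"k={k:2d}  lambda_f = {lam:.3e}")
    x_new = x - np.linalg.solve(G, g)                  # quasi-Newton step
    r = np.sqrt((x_new - x) @ hess(x) @ (x_new - x))   # r_k = ||x_{k+1} - x_k||_{x_k}
    G = (1.0 + M * r) * G                              # correction step
    H = hess(x_new)                                    # target operator for the update
    i = np.argmax(np.diag(G - H) / np.diag(H))         # greedy basis vector (2.11)
    e = np.zeros(n); e[i] = 1.0
    G = sr1(G, H, e)                                   # greedy SR1 update (GrSR1)
    x = x_new
```

In the actual method, only the action of ∇²f(x_{k+1}) on u_k and the Hessian diagonal are needed, so the full Hessian formed here is a simplification of the sketch.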

Thus, (4.20), (4.21) are valid for k′ = k + 1, and we can continue by induction. □
Now let us analyze the greedy strategy. First, we analyze how the Hessian approximation measure (2.9) changes after one iteration.