Correction of nonmonotone trust region algorithm based on a modified diagonal regularized quasi-Newton method

In this paper, a new appropriate diagonal matrix estimation of the Hessian is introduced by minimizing the Byrd and Nocedal function subject to the weak secant equation. The Hessian estimate is used to correct the framework of a nonmonotone trust region algorithm with the regularized quasi-Newton method. Moreover, to counteract the adverse effect of monotonicity, we introduce a new nonmonotone strategy. The global and superlinear convergence of the suggested algorithm is established under some standard conditions. The numerical experiments on unconstrained optimization test functions show that the new algorithm is efficient and robust.


Introduction
In this paper, we deal with the following unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1.1)$$
where f : R^n → R is a twice continuously differentiable function. Line search (LS) and trust region (TR) methods are two prominent classes of iterative methods for solving problem (1.1). The LS method, for a given initial point x_0 ∈ R^n, is a procedure that computes a step length α_k along a specific direction p_k and takes the new point x_{k+1} = x_k + α_k p_k. On the other hand, the TR algorithm computes a trial step p_k that is an approximate solution of the following quadratic subproblem:
$$\min_{\|p\| \le \Delta_k} \; g_k^T p + \frac{1}{2} p^T B_k p, \qquad (1.2)$$
in which g_k = ∇f(x_k), B_k ∈ R^{n×n} is the exact Hessian ∇²f(x_k) or a symmetric approximation of it, and Δ_k > 0 is the TR radius. In the rest of the paper, ‖·‖ refers to the Euclidean norm. According to [19] (also see [5, Theorem 8.5]), p* is the exact solution of (1.2) if and only if there exists a λ ≥ 0 such that
$$(B_k + \lambda I) p^* = -g_k, \qquad \lambda (\Delta_k - \|p^*\|) = 0, \qquad \|p^*\| \le \Delta_k,$$
and (B_k + λI) is positive semidefinite. The regularized Newton method (RNM) is another efficient approach for solving problem (1.1) and has good convergence properties; see [7, 16, 21, 23, 24, 26]. At each iteration of the RNM, the trial step p_k is obtained by approximately minimizing the following unconstrained quadratic function:
$$\psi_k(p) = f_k + g_k^T p + \frac{1}{2} p^T B_k p + \frac{\lambda_k}{2} \|p\|^2, \qquad (1.3)$$
where f_k = f(x_k) and λ_k is called the regularization parameter. Here we define
$$\lambda_k = \Delta_k \|g_k\|. \qquad (1.4)$$
It is worth noting that the update rule for Δ_k is similar to that of the TR radius. At each iteration, the RNM obtains the trial step p_k by solving the following regularized Newton equation:
$$(B_k + \lambda_k I) p = -g_k, \qquad (1.5)$$
where I is the identity matrix and (B_k + λ_k I) is positive semidefinite. Therefore, p_k is well defined. If (B_k + λ_k I) is positive definite, then p_k is unique. We can conclude that p_k solves (1.5) if and only if it is the global minimizer of the unconstrained quadratic function (1.3).
If we let Δ_k = ‖p_k‖ = ‖-(B_k + λ_k I)^{-1} g_k‖, then it can be verified (see [19, Theorem 6.1.2]) that p_k is also a solution of the TR subproblem (1.2).
By the famous result given by Powell in [17] (also see [19, Lemma 6.1.3]), we know that
$$-g_k^T p_k - \frac{1}{2} p_k^T B_k p_k \ge \frac{1}{2} \|g_k\| \min\left\{\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right\}.$$
The preceding inequality and (1.3) indicate that
$$f_k - \psi_k(p_k) \ge \gamma \|g_k\| \min\left\{\Delta_k, \frac{\|g_k\|}{\|B_k + \lambda_k I\|}\right\} \qquad (1.6)$$
and
$$g_k^T p_k \le -\gamma \|g_k\| \min\left\{\Delta_k, \frac{\|g_k\|}{\|B_k + \lambda_k I\|}\right\}, \qquad (1.7)$$
where γ ∈ (0, 1) is a constant. Generally, solving the TR subproblem is more expensive than solving the RNM subproblem. In the RNM, only the single linear system (1.5) is solved at each iteration. Hence, the computational cost of obtaining an RNM step is much lower than that of solving a TR subproblem [24].
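Since the regularized Newton equation (1.5) is a single symmetric positive (semi)definite linear system, computing the RNM step amounts to a few lines of linear algebra. The following minimal sketch (not from the paper; the function name and the Cholesky-based solve are illustrative choices) computes the trial step:

```python
import numpy as np

def regularized_newton_step(B, g, lam):
    """Solve the regularized Newton equation (1.5): (B + lam*I) p = -g.

    B is a symmetric Hessian approximation, g the gradient, and lam >= 0
    the regularization parameter.  The Cholesky factorization succeeds
    exactly when B + lam*I is positive definite, in which case p is the
    unique minimizer of the quadratic model (1.3).
    """
    A = B + lam * np.eye(g.shape[0])
    L = np.linalg.cholesky(A)   # raises LinAlgError if A is not positive definite
    # Two triangular solves: L z = -g, then L^T p = z.
    return np.linalg.solve(L.T, np.linalg.solve(L, -g))
```

When B_k is a diagonal matrix, as in Sect. 2, the system is diagonal and the solve reduces to the componentwise division p_i = -g_i / (b_i + λ_k).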
The most common update formula for B_k is the BFGS formula. Numerically, this method requires O(n²) storage, which makes it unsuitable for large-scale problems. The application of quasi-Newton methods to large-scale unconstrained optimization problems has been extended by limited-memory quasi-Newton methods [15] and truncated Newton methods [12, 14]. However, the practical implementation of these methods is sophisticated, and the associated software is quite complex [3]. Therefore, researchers have considered an alternative approach in which a diagonal matrix $B_k = \operatorname{diag}(b_k^{(1)}, b_k^{(2)}, \ldots, b_k^{(n)})$ is used to approximate the Hessian matrix [4, 11, 18, 27]. Observe that in this approach, only O(n) storage is required to store B_k [4].
This paper first introduces an appropriate diagonal matrix estimation of the Hessian by minimizing the Byrd and Nocedal [6] function subject to the weak secant equation of Dennis and Wolkowicz [8]. Subsequently, a new nonmonotone strategy is introduced to overcome the adverse effect of monotonicity. The estimation of the Hessian is used to correct the framework of a new nonmonotone TR algorithm with the regularized quasi-Newton method. The suggested algorithm exploits a stronger nonmonotone strategy far from the solution and a weaker one close to the solution. We prove that the new algorithm is globally and superlinearly convergent.
In the next section, an appropriate diagonal matrix estimation of the Hessian is derived. In Sect. 3, the new nonmonotone strategy and the structure of the suggested algorithm are explained. Section 4 is devoted to the convergence analysis of the new algorithm. In Sect. 5, numerical experiments on a set of unconstrained optimization test problems are reported. Conclusions are given in Sect. 6.

Derivation of new diagonal updating
In the quasi-Newton method framework, the Hessian approximation matrix B_{k+1} is usually required to satisfy the secant equation
$$B_{k+1} s_k = y_k, \qquad (2.1)$$
where s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k. To find an appropriate diagonal matrix estimation of the Hessian, we assume that B_k is positive definite and that s_k^T y_k > 0 for all k. Since it is difficult for a diagonal matrix to satisfy the secant equation (2.1), we instead require that B_{k+1} satisfy the weak secant equation of Dennis and Wolkowicz [8], namely
$$s_k^T B_{k+1} s_k = s_k^T y_k. \qquad (2.2)$$
The motivation for using the weak secant equation (2.2) can be found in [4]. Byrd and Nocedal [6] introduced the function
$$\varphi(A) = \operatorname{tr}(A) - \ln \det(A),$$
defined on positive definite matrices, where ln(·) denotes the natural logarithm. This function is an elegant and efficient tool for analyzing the global properties of quasi-Newton methods. We introduce an appropriate diagonal matrix estimation of the Hessian by minimizing the Byrd and Nocedal [6] function subject to the weak secant equation (2.2), that is,
$$\min \; \varphi(B_{k+1}) \qquad (2.3)$$
subject to
$$s_k^T B_{k+1} s_k = s_k^T y_k. \qquad (2.4)$$
To obtain a new diagonal updating formula, we work with the penalized version (2.5)–(2.6) of (2.3) and (2.4), where s_k^i, i = 1, ..., n, denote the components of the vector s_k. The required solution of (2.3) and (2.4) is a stationary point of the penalized function. Hence, from (2.6) we obtain the stationarity conditions (2.7), and therefore the elements of the diagonal matrix B_{k+1} can be expressed in the closed form (2.8), which is positive and well defined even for s_k^i = 0. Since s_k^T y_k > 0 for all k, to ensure positiveness as well as uniform boundedness of b_{k+1}^i given by (2.8) in general situations, we safeguard the update as in (2.9), where υ is a small positive constant. Therefore, our Hessian approximation is given by (2.10). A crucial issue is the choice of the bounds L_k and U_k. Here we introduce an adaptive strategy to determine them. Let us begin by considering the curvature of f(x) along the direction s_k, which is represented by $s_k^T \bar{G}_k s_k / s_k^T s_k$, where $\bar{G}_k = \int_0^1 \nabla^2 f(x_k + t s_k)\,dt$ is the average Hessian matrix along s_k. Since it is not practical to compute the eigenvalues of the Hessian matrix at each iteration, we can estimate its size by the scalar
$$t_k = \frac{s_k^T y_k}{s_k^T s_k},$$
and the bounds (L_k, U_k) can then be chosen according to the value of t_k. Obviously, the values of the two bounds can be adjusted by σ_1 and σ_2, whose values depend on t_k. According to relation (2.9), there exist two positive constants m and M such that
$$m \|z\|^2 \le z^T B_{k+1} z \le M \|z\|^2 \quad \text{for all } z \in \mathbb{R}^n. \qquad (2.11)$$
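For illustration, the sketch below shows a safeguarded diagonal update in the spirit of (2.8)–(2.10). Since the closed form (2.8) is not reproduced above, the sketch substitutes the classical least-change diagonal update, which also satisfies the weak secant equation (2.2); the bounds `lo` and `hi` stand in for L_k and U_k, and all parameter values are illustrative assumptions rather than the paper's choices:

```python
import numpy as np

def diagonal_weak_secant_update(b, s, y, lo=1e-4, hi=1e4):
    """Safeguarded diagonal Hessian update satisfying s^T B s = s^T y.

    Stand-in for the paper's formula (2.8): the classical least-change
    diagonal update b_i += lam * s_i^2 also satisfies the weak secant
    equation (2.2).  Clipping to [lo, hi] plays the role of the
    safeguarding step (2.9), enforcing positivity and the uniform
    bounds used in (2.11).
    """
    sy = s @ y
    if sy <= 0:                      # curvature condition violated: keep B_k
        return b
    s2 = s * s
    lam = (sy - b @ s2) / (s2 @ s2)  # enforces s^T B_{k+1} s = s^T y
    return np.clip(b + lam * s2, lo, hi)
```

The returned vector represents diag(B_{k+1}); before clipping it satisfies sᵀB_{k+1}s = sᵀy exactly, and the clipping step supplies the uniform boundedness required by the convergence analysis.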

Nonmonotone strategy and new algorithm
Grippo et al. [10] found that enforcing monotonically decreasing objective function values in classical iterative schemes for solving (1.1) may reduce the convergence speed of the TR method, especially in the presence of narrow curved valleys; see also [20]. As a remedy, scholars have put considerable effort into developing nonmonotone strategies that guarantee global convergence [1, 2, 18, 25]. The pioneering nonmonotone LS method was introduced by Grippo et al. [10] as follows:
$$f(x_k + \alpha_k p_k) \le f_{l(k)} + \sigma \alpha_k g_k^T p_k, \qquad f_{l(k)} = \max_{0 \le j \le m(k)} f_{k-j}, \qquad (3.1)$$
in which σ ∈ (0, 1) is a constant, m(0) = 0, 0 ≤ m(k) ≤ min{m(k-1) + 1, N} for k ≥ 1, and N is a nonnegative integer. Despite the good advantages of this strategy, Zhang and Hager [25] found that it suffers from various weaknesses. They therefore proposed a nonmonotone strategy based on a weighted average of previous consecutive iterates. Moreover, using an adaptive convex combination of f_{l(k)} and f_k, Amini et al. [2] introduced an effective substitute for (3.1).
To counteract the adverse effect of monotonicity, here we introduce the following hybrid nonmonotone LS condition:
$$f(x_k + \alpha_k p_k) \le D_k + \delta \alpha_k g_k^T p_k, \qquad (3.2)$$
where δ ∈ (0, 1) is a constant and
$$D_k = \xi_k f_{l(k)} + (1 - \xi_k) f_k, \qquad (3.3)$$
with ξ_k ∈ [0, 1]. As we see, the definition of the mean value D_k implies that each D_k is a convex combination of f_{l(k)} and f_k. For a given ξ_0 ∈ [0, 1], ξ_k is computed by the update formula (3.4). The new nonmonotone LS is performed in a backtracking scheme; that is, the step length α_k is the largest member of {ρ^j β_k}_{j≥0}, with ρ ∈ (0, 1) and β_k > 0, which satisfies inequality (3.2) [2, 19]. Similar to [13], we choose the initial trial step length β_k. Let p_k be the solution of (1.3), in which B_{k+1} is a diagonal matrix. To determine whether a trial step will be accepted, we compute r̂_k as the ratio between the actual reduction of f(x), measured with respect to D_k, and the reduction predicted by the model function ψ_k(p):
$$\hat{r}_k = \frac{D_k - f(x_k + p_k)}{\psi_k(0) - \psi_k(p_k)}, \qquad (3.5)$$
where D_k is computed by (3.3).
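A backtracking implementation of the hybrid nonmonotone rule (3.2) is a small loop. The following sketch assumes D_k has already been formed via (3.3); the function name and parameter defaults are illustrative, not the paper's choices:

```python
import numpy as np

def nonmonotone_backtracking(f, x, p, g, D_k, beta_k,
                             delta=1e-4, rho=0.5, max_backtracks=50):
    """Backtracking line search for the hybrid nonmonotone rule (3.2):
        f(x + alpha*p) <= D_k + delta * alpha * g^T p.

    D_k is the convex combination (3.3) of f_{l(k)} and f_k supplied by
    the caller; beta_k is the initial trial step length.
    """
    gTp = g @ p
    alpha = beta_k
    for _ in range(max_backtracks):
        if f(x + alpha * p) <= D_k + delta * alpha * gTp:
            return alpha
        alpha *= rho                 # shrink: alpha runs over {beta_k * rho^j}
    return alpha                     # fallback after max_backtracks trials
```

Because g^T p_k ≤ 0 by (1.7), the acceptance threshold decreases with alpha, so the loop terminates with the largest admissible member of the trial sequence.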
The new TR ratio implies that the suggested algorithm enjoys the best convergence results, exploiting a stronger nonmonotone strategy far from the solution and a weaker one close to the solution; see [25]. Now we can present the framework of the new algorithm (see Algorithm 1).
Step 4. Find a step length α_k satisfying the LS rule (3.2), then set x_{k+1} = x_k + α_k p_k. Update the TR radius by Δ_{k+1} = c_2 Δ_k.
Step 6. Set k = k + 1 and go to Step 1.
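Only Steps 4 and 6 of Algorithm 1 are reproduced above, so the following skeleton should be read as a hedged reconstruction of the overall loop from the surrounding discussion: trial step from (1.5) with a diagonal B_k, an acceptance ratio in the spirit of (3.5) with D_k simplified to f_k (i.e., ξ_k = 0), the LS fallback (3.2), and a Δ_k update with c_1 < 1 < c_2 as suggested by the convergence analysis in Sect. 4. It is not the authors' exact method:

```python
import numpy as np

def cntr_sketch(f, grad, x, b, Delta=1.0, mu0=0.1, c1=0.5, c2=2.0,
                eps=1e-5, max_iter=10000):
    """Illustrative skeleton of the nonmonotone TR / regularized
    quasi-Newton loop.  b is the positive diagonal of B_k; all parameter
    values are assumptions made for this sketch."""
    fx, g = f(x), grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        lam = Delta * np.linalg.norm(g)          # regularization parameter (1.4)
        p = -g / (b + lam)                       # regularized Newton step (1.5)
        pred = -(g @ p) - 0.5 * ((b * p) @ p) - 0.5 * lam * (p @ p)
        r_hat = (fx - f(x + p)) / pred           # simplified ratio (3.5), D_k = f_k
        if r_hat >= mu0:                         # successful: accept the trial step
            x, Delta = x + p, c1 * Delta         # relax the regularization
        else:                                    # Step 4: backtracking LS (3.2)
            alpha, gTp = 1.0, g @ p
            while f(x + alpha * p) > fx + 1e-4 * alpha * gTp and alpha > 1e-12:
                alpha *= 0.5
            x, Delta = x + alpha * p, c2 * Delta # strengthen the regularization
        fx, g = f(x), grad(x)                    # Step 6: next iteration
    return x
```

Note that pred equals ½pᵀB_kp + ½λ_k‖p‖² for the step from (1.5), so it is strictly positive whenever g_k ≠ 0.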

Convergence analysis
In this section, we examine the convergence properties of the suggested algorithm.To this end, the following standard assumption is needed [18].
Assumption 4.1 The level set Ω(x_0) = {x ∈ R^n : f(x) ≤ f(x_0)} is contained in Ω, where Ω is a closed and bounded set of R^n.
Remark 1 Let f(x) be a twice continuously differentiable function. Then Assumption 4.1 implies that there exists a constant M_1 > 0 such that
$$\|\nabla^2 f(x)\| \le M_1 \quad \text{for all } x \in \Omega.$$
Therefore, using the mean value theorem, one can conclude that ‖y_k‖ ≤ M_1 ‖s_k‖. To establish the global convergence of the iterative scheme x_{k+1} = x_k + α_k p_k with the backtracking LS satisfying (3.2), we assume that Assumption 4.1 holds and that the direction p_k satisfies the following sufficient descent conditions:
$$g_k^T p_k \le -a_1 \|g_k\|^2 \qquad (4.1)$$
and
$$\|p_k\| \le a_2 \|g_k\|, \qquad (4.2)$$
where a_1 and a_2 are two positive real-valued constants. For convenience in the discussion, we consider the two index sets
$$I = \{k : \hat{r}_k \ge \mu_0\} \quad \text{and} \quad J = \{k : \hat{r}_k < \mu_0\}.$$

Proof Using the Taylor expansion together with (2.11) and Assumption 4.1, we get the required estimate. Hence, the proof is complete.
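To see why the sufficient descent conditions (4.1) and (4.2) are natural for the regularized quasi-Newton direction, one can argue as follows; this short verification assumes the uniform bounds (2.11) and, for this sketch only, that λ_k ≤ λ_max on the iterations considered:

```latex
% Sketch: (4.1)-(4.2) for p_k = -(B_k + \lambda_k I)^{-1} g_k, assuming
% m\|z\|^2 \le z^T B_k z \le M\|z\|^2 (2.11) and \lambda_k \le \lambda_{\max}
% (an assumption made here only for illustration).
\[
  g_k^T p_k \;=\; -\, g_k^T (B_k + \lambda_k I)^{-1} g_k
  \;\le\; -\,\frac{\|g_k\|^2}{M + \lambda_{\max}},
  \qquad
  \|p_k\| \;\le\; \big\|(B_k + \lambda_k I)^{-1}\big\|\,\|g_k\|
  \;\le\; \frac{\|g_k\|}{m},
\]
% so (4.1) and (4.2) hold with a_1 = 1/(M + \lambda_{\max}) and a_2 = 1/m.
```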
which shows that f_{k+1} ≤ D_k for all k ∈ I.
Case 2. If k ∈ J, then the trial step is rejected and the LS must be performed. From (1.7) we know that g_k^T p_k ≤ 0 for all k. Therefore, from this inequality along with (3.2), we conclude that f_{k+1} ≤ D_k for all k ∈ J as well. In addition, using the definition of f_{l(k)} and (3.3), we have D_k ≤ f_{l(k)}. From (4.3) and (4.4) along with (4.5), we conclude that x_{k+1} ∈ Ω(x_0).

Proof From the definition of f_{l(k+1)} and Lemma 4.3, we have f_{l(k+1)} ≤ f_{l(k)}. Thus {f_{l(k)}} is a nonincreasing sequence. Also, the boundedness of {f_k} supplies a lower bound. Therefore, the sequence {f_{l(k)}} is convergent.

Lemma 4.5 Suppose that the sequence {x_k} is generated by Algorithm 1. Then for all k ∈ N we have f_k ≤ D_k ≤ f_{l(k)}.

Proof From the definition of f_{l(k+1)}, we have f_{k+1} ≤ f_{l(k+1)} for all k ∈ N. Thus, according to (3.3), we can write f_{k+1} ≤ D_{k+1} ≤ f_{l(k+1)}. This completes the proof of the lemma.

Proof First, suppose by contradiction that there exists k ∈ J for which the LS rule (3.2) fails for every trial step length. Using the Taylor expansion and Lemma 4.5, and then (2.11) and Assumption 4.1, we obtain the corresponding bound. If α_k → 0, then we get (4.6). Since δ ∈ (0, 1), inequality (4.6) leads to g_k^T p_k ≥ 0, which contradicts (1.7). So Step 4 of Algorithm 1 is well defined.

Lemma 4.7 Assume that the sequence {x_k} is generated by Algorithm 1. Then for all k ∈ J, the step length α_k satisfies an explicit positive lower bound.
Now, arguing similarly to the proof of Theorem 3.2 in [1], we can deduce that (4.10) holds.
Case 2. k ∈ J. For k > N, using (3.2) and Lemma 4.3, we can write (4.11). Now, from (1.7) along with (4.2), we have (4.12). Thus, using (4.11) and (4.12), the claimed estimate follows; the remainder of the proof proceeds as in [10] and is omitted here.

holds for all k ∈ N, then the corresponding lower bound holds for all k ∈ N.
Proof We consider two cases as follows:
Case 1. k ∈ I. From (1.6), (2.11), (3.5), and (4.13), we obtain the required estimate.
In this situation, it is possible to prove the following convergence theorems for Algorithm 1.

Proof If Algorithm 1 terminates in finitely many iterations, then the theorem is true. If (4.17) does not hold, then there exists a constant ε > 0 such that (4.13) holds.
Let S = {k : r̂_k ≥ μ_0}. We prove that λ_k → ∞ and Δ_k → ∞ as k → ∞. We consider the following cases:
Case 1. If S is a finite set, then there exists some k̄ > 0 such that r̂_k < μ_0 holds for all k > k̄. Thus Δ_{k+1} ≥ c_2 Δ_k holds for all k > k̄. Since c_2 > 1, we conclude that Δ_k → ∞, and thus λ_k → ∞, as k → ∞.
Case 2. If S is an infinite set, then from Lemma 4.9 we have (4.18).
Case 3. Suppose that S^c denotes the complement of S and that S^c is an infinite set. Now we only need to prove that λ_k → ∞ as k → ∞ with k ∈ S^c. Let I* = {k_i : k_i - 1 ∈ S and k_i ∈ S^c}; then {k_i - 1} is an infinite subset of S. Using (4.20), we have (4.21). Hence, we have (4.22). For any k ∉ S, there exists an index k_i such that k_i ≤ k and all iterations between k_i and k are unsuccessful. According to the construction of Algorithm 1, we can write (4.23).
Now, from (1.5) and (4.24), we can write (4.25) for all k ∈ N. Thanks to (4.25), it follows that p_k → 0. Therefore, using (1.6), (2.11), and Lemma 4.2 as k → ∞, we obtain a limit from which we can deduce a contradiction with (4.13).

Proof Let p_k be the exact solution of (1.3), which satisfies
$$(B_k + \lambda_k I) p_k = -g_k. \qquad (4.28)$$
We show that r̂_k ≥ μ_1 for k sufficiently large and that Δ_k → 0 as k → ∞. First, we define the auxiliary ratio r_k in (4.29)–(4.30) and prove that r_k ≥ μ_1. From (4.29), a direct computation gives (4.31), where the last equality is obtained from (4.28). By the Taylor expansion, we can write (4.32), where ζ_k lies on the segment between x_k and x_{k+1}. From (1.3) and (4.32), we get (4.33); using (4.27) and (4.33), we obtain (4.34). It follows from (4.31), (4.34), and 0 < μ_1 < 1 that r_k ≥ μ_1; therefore, from (4.30), we have r̂_k ≥ μ_1. Now, using (3.5) and Lemma 4.5, we can write the acceptance inequality for k sufficiently large. Using Step 4 of Algorithm 1, we have Δ_{k+1} = c_1 Δ_k, which implies that lim_{k→∞} Δ_k = 0. Therefore, from (1.4), we get lim_{k→∞} λ_k = 0. By the Taylor expansion, we can write (4.36), where ζ_k lies on the segment between x_k and x_{k+1}. So, from (4.28) and (4.36), for k sufficiently large we can write (4.37).

The right-hand side of (4.37) converges to 0 as k → ∞. Since f(x) is twice continuously differentiable, along with Assumption 4.1 we conclude that there exists τ > 0 such that
$$\tau \|x_{k+1} - x^*\| \le \|g_{k+1}\|. \qquad (4.39)$$
From (4.39) it follows that (4.40) holds. In view of (4.38), from (4.40) we have
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0.$$
Therefore, the convergence rate of {x_k} is superlinear.

Numerical experiments
In this section, we report the performance of the proposed algorithm, CNTR, and compare it with the NARQNLS algorithm of Zhang and Ni [24] and the NTRLS algorithm of Qunyan and Dan [18].
The experiments have been performed on a set of unconstrained test functions chosen from Andrei [3]; they are listed in Table 1.
The values of all parameters used for the NARQNLS and NTRLS algorithms are the same as in [24] and [18], respectively. All algorithms were terminated as soon as an iterate satisfied ‖g(x_k)‖ ≤ 10^{-5}, or when k > 10000.
The results obtained are reported in Table 2. The notations listed in the tables are defined as follows: NI, the number of iterations; NF, the number of function evaluations; and (-), the number of iterations exceeded 10000.
A glance at Table 2 shows that the CNTR algorithm solved all the test functions, while the other considered algorithms failed in some cases. Dolan and Moré [9] proposed a method to compare the performance of iterative algorithms through a statistical process by displaying performance profiles.
Figures 1, 2, and 3 show the performance profile of CNTR and the other considered algorithms in terms of the number of iterations (NI), number of function evaluations (NF), and CPU time, respectively.
From Figs. 1 and 2, it can easily be seen that CNTR has the most wins among all considered algorithms. More precisely, the CNTR algorithm is the best in terms of the total number of iterations and function evaluations in more than 47% and 68% of the test functions, respectively. In Fig. 3, we observe that in more than 85% of cases the CNTR algorithm is faster than the other algorithms. Another remarkable feature of these three figures is that the performance profile of the CNTR algorithm grows faster than the other profiles. These observations imply that the CNTR algorithm is more efficient and robust than the other considered algorithms.
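For reference, a Dolan-Moré performance profile [9] can be computed in a few lines. The sketch below is illustrative and not the paper's code; it assumes each problem is solved by at least one solver, with failures marked by infinity. The value of each solver's profile at τ = 1 is exactly the fraction of "wins" quoted above:

```python
import numpy as np

def performance_profile(T):
    """Dolan-More performance profile.

    T is an (n_problems, n_solvers) array of costs (e.g. NI, NF, or CPU
    time), with np.inf marking a failure.  For each solver s, the profile
    rho_s(tau) is the fraction of problems whose performance ratio
    r_{p,s} = t_{p,s} / min_s t_{p,s} is at most tau.
    """
    ratios = T / T.min(axis=1, keepdims=True)           # r_{p,s}
    taus = np.unique(ratios[np.isfinite(ratios)])       # breakpoints
    rho = np.array([[np.mean(ratios[:, s] <= tau) for tau in taus]
                    for s in range(T.shape[1])])        # step functions
    return taus, rho
```

Plotting each row of rho against taus (typically on a logarithmic τ-axis) reproduces figures of the kind shown in Figs. 1–3.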

Conclusion
Minimizing the Byrd and Nocedal [6] function subject to the weak secant equation of Dennis and Wolkowicz [8], we have introduced an appropriate diagonal matrix estimation of the Hessian. The Hessian estimate has been used to correct the framework of a nonmonotone trust region algorithm with the regularized quasi-Newton method. To overcome the adverse effect of monotonicity, we have introduced a new nonmonotone strategy. The global and superlinear convergence of the proposed algorithm has been established under some standard conditions. It has been shown by the Dolan-Moré performance profiles that the suggested algorithm is efficient and robust on the set of unconstrained optimization test functions.

Funding
Not applicable.

Lemma 4.3
Suppose that the sequence {x_k} is generated by Algorithm 1. Then for all k ∈ N ∪ {0}, we have x_k ∈ Ω(x_0). Proof We consider two cases. Case 1. If k ∈ I, then from (1.6) and (3.5), we can write

Lemma 4.4
Suppose that the sequence {x_k} is generated by Algorithm 1. Then the sequence {f_{l(k)}} is convergent.

Lemma 4.6
Suppose that the sequence {x_k} is generated by Algorithm 1. Then Step 4 of the algorithm is well defined.

Lemma 4.8
Suppose that the sequence {x_k} is generated by Algorithm 1. Then we have the following estimate. Proof We consider the following two cases: Case 1. k ∈ I. It follows from (3.5) and Lemma 4.5 that

Corollary 4.1
Suppose that the sequence {x_k} is generated by Algorithm 1. Then we have lim_{k→∞} D_k = lim_{k→∞} f_k. Proof From Lemmas 4.3 and 4.5, we have f_k ≤ D_k ≤ f_{l(k)}. This completes the proof by using Lemma 4.4.

Lemma 4.9
Suppose that the sequence {x_k} is generated by Algorithm 1. If the sequence {x_k} does not converge to a stationary point, i.e., there exists a constant ε > 0 such that ‖g_k‖ > ε, (4.13)

Theorem 4.1
Algorithm 1 either terminates in finitely many iterations, or generates an infinite sequence {x_k} that satisfies $\liminf_{k\to\infty} \|g_k\| = 0$. (4.17)


Figure 1 Performance profile based on NI
Figure 2 Performance profile based on NF
Figure 3 Performance profile based on CPU time

(4.15) Case 2. k ∈ J. Similar to Case 2 in the proof of Lemma 4.3, it follows that f_{k+1} - D_k ≤ δ α_k g_k^T p_k. Now, from (1.7), (2.11), (4.13), and Lemma 4.7, we get the required bound.

Thanks to Corollary 4.1, we get (4.26) for all sufficiently large k. The construction of Algorithm 1 and (4.26) shows that there exists a positive constant Δ* such that Δ_k ≤ Δ* holds for all sufficiently large k, which contradicts (4.24). The proof is completed.

Theorem 4.2 Suppose that the infinite sequence {x_k}, convergent to x*, is generated by Algorithm 1. In addition, assume that ∇²f(x*) is positive definite. If condition (4.27) holds, then {x_k} converges to x* superlinearly.

Table 1
List of test functions

Table 2
Numerical results