A limited memory BFGS subspace algorithm for bound constrained nonsmooth problems

The subspace technique has been widely used to solve unconstrained and constrained optimization problems, and many results have been obtained. In this paper, a subspace algorithm combined with the limited memory BFGS update is proposed for large-scale nonsmooth optimization problems subject to box constraints. The algorithm ensures that all iterates are feasible and that the sequence of objective function values is decreasing. Moreover, rapid changes in the active set are allowed. Global convergence is established under suitable conditions. Numerical results show that the method is effective for large-scale nonsmooth box-constrained optimization, where the largest test problem has 11,000 variables.


Introduction
Consider the following large-scale nonsmooth optimization problem:
$$\min_{x \in B} f(x), \qquad B = \{x \in \mathbb{R}^n : l \le x \le u\}, \tag{1.1}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is supposed to be locally Lipschitz continuous and the number of variables $n$ is supposed to be large. The vectors $l$ and $u$ represent the lower and upper bounds on the variables, respectively. Many practical optimization problems involve nonsmooth functions with large numbers of variables (see, e.g., [1,24]). The active-set method can be generalized naturally when the objective function is nonsmooth. For example, Sreedharan [34] extended the method developed in [33] to solve nonsmooth problems with a special objective function and inequality constraints. It is also quite easy to generalize the ε-active-set method to the nondifferentiable case [25]. Yuan et al. [42,45] used the two-point gradient method and the trust region method to solve nonsmooth problems. Wolfe [35] and Lemaréchal [22] initiated a giant stride forward in nonsmooth optimization with the bundle concept. Kiwiel [20] proposed a bundle variant that is close to the bundle trust method [32]. Some good results on the bundle technique can be found in [19,21,31], etc. The basic assumption of the bundle methods is that, at every point, the objective function value and one arbitrary subgradient can be evaluated.

Results of convex analysis and nonsmooth analysis
Let $f^{MY} : \mathbb{R}^n \to \mathbb{R}$ be the so-called Moreau-Yosida regularization of $f$, defined by
$$f^{MY}(x) = \min_{z \in \mathbb{R}^n} \Big\{ f(z) + \frac{1}{2\lambda}\|z - x\|^2 \Big\}, \tag{2.1}$$
where $\lambda$ is a positive parameter. Then it is not difficult to see that problem (1.1) is equivalent to the following problem:
$$\min_{x \in B} f^{MY}(x). \tag{2.2}$$
The function $f^{MY}$ has some good properties: it is a differentiable convex function and has a Lipschitz continuous gradient even when the function $f$ is nondifferentiable. The gradient of $f^{MY}$ can be proved to be semismooth under some reasonable conditions [12,30]. Based on these features, many algorithms have been proposed for (2.2) (see [2] etc.) for the case $B = \mathbb{R}^n$. Set
$$\theta(z) = f(z) + \frac{1}{2\lambda}\|z - x\|^2$$
and denote $p(x) = \operatorname{argmin}_{z \in \mathbb{R}^n} \theta(z)$. Then $p(x)$ is well defined and unique since $\theta(z)$ is strongly convex. By (2.1), $f^{MY}(x)$ can be expressed as
$$f^{MY}(x) = f\big(p(x)\big) + \frac{1}{2\lambda}\|p(x) - x\|^2.$$
In what follows, we denote the gradient of $f^{MY}$ by $g$. Some features of $f^{MY}(x)$ can be found in [3,8,16]. The generalized Jacobian of $f^{MY}(x)$ and the property of BD-regularity can be found in [6,29], respectively. Some properties are given as follows.
(i) The function $f^{MY}$ is finite-valued, convex, and everywhere differentiable with
$$g(x) = \nabla f^{MY}(x) = \frac{x - p(x)}{\lambda}. \tag{2.3}$$
Moreover, the gradient mapping $g : \mathbb{R}^n \to \mathbb{R}^n$ is globally Lipschitz continuous with modulus $\lambda^{-1}$, i.e.,
$$\|g(x) - g(y)\| \le \frac{1}{\lambda}\|x - y\|, \quad \forall x, y \in \mathbb{R}^n.$$
(ii) If $g$ is BD-regular at $x$, which means that all matrices $V \in \partial_B g(x)$ are nonsingular, then there exist constants $\mu_1 > 0$, $\mu_2 > 0$ and a neighborhood $\Omega$ of $x$ such that, for any $y \in \Omega$ and $V \in \partial_B g(y)$, $V$ is nonsingular with
$$\|V^{-1}\| \le \mu_1 \quad \text{and} \quad \|g(y) - g(x)\| \ge \mu_2 \|y - x\|.$$
Obviously, $f^{MY}(x)$ and $g(x)$ can be obtained through the optimal solution of $\operatorname{argmin}_{z \in \mathbb{R}^n} \theta(z)$. However, $p(x)$, the minimizer of $\theta(z)$, is difficult or even impossible to compute exactly. Hence we cannot use the exact value of $p(x)$ to define $f^{MY}(x)$ and $g(x)$, and numerical methods are often used to approximate it. In the following, we always suppose that the results (i)-(ii) hold unless otherwise noted.
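To make the regularization concrete, the following minimal Python sketch (not the MATLAB code used in our experiments) approximates $p(x)$, $f^{MY}(x)$, and $g(x)$ by minimizing $\theta(z)$ numerically; the test function $f(x) = \|x\|_1$ and the parameter value are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize

def moreau_yosida(f, x, lam=1.0):
    """Approximate p(x), f^MY(x) and g(x) by minimizing
    theta(z) = f(z) + ||z - x||^2 / (2*lam) numerically."""
    theta = lambda z: f(z) + np.dot(z - x, z - x) / (2.0 * lam)
    # A derivative-free solver suits the nonsmooth setting; in practice
    # an inexact minimizer with a controlled tolerance is used instead.
    res = minimize(theta, x, method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-10})
    p = res.x                    # proximal point p(x)
    fmy = res.fun                # f^MY(x) = f(p) + ||p - x||^2 / (2*lam)
    g = (x - p) / lam            # gradient of f^MY, cf. (2.3)
    return p, fmy, g

# Example: f(x) = ||x||_1, whose proximal point is the soft-thresholding
# of x; with lam = 1 and x = (2.0, -0.3), p is close to (1.0, 0.0).
f = lambda z: np.sum(np.abs(z))
p, fmy, g = moreau_yosida(f, np.array([2.0, -0.3]))
```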

L-BFGS update
At every iteration $x_k$, the L-BFGS method stores a small number (say $m$) of correction pairs $\{s_i, y_i\}$ $(i = k-1, \ldots, k-m)$ to obtain $H_{k+1}$, instead of storing the full matrices $H_k$, with
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla h(x_{k+1}) - \nabla h(x_k),$$
where $h : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function. In fact, this method is an adaptation of the BFGS method to large-scale problems (see [4,5,37,41,44] for details). The L-BFGS update formula is defined by
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T,$$
where $\rho_k = \frac{1}{y_k^T s_k}$ and $V_k = I - \rho_k y_k s_k^T$. These correction pairs contain information about the curvature of the function and, in conjunction with the BFGS formula, define the limited memory iteration matrix. This method often provides a fast rate of linear convergence and requires minimal storage.
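For illustration, the standard two-loop recursion, which applies the limited memory matrix to a vector using only the stored pairs, can be sketched in Python as follows (a generic textbook sketch, not this paper's implementation; $H_0 = \gamma I$ plays the role of the basic matrix):

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list, gamma=1.0):
    """Return H*g, where H is the inverse-Hessian approximation
    defined by the stored pairs (s_i, y_i) and H0 = gamma * I."""
    q = g.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append(a)
    r = gamma * q                              # apply the basic matrix H0
    # Second loop: oldest pair to newest.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r
```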
It is well known that positive definiteness of the update matrix $H_k$ is very important for analyzing the convergence of the algorithm. Byrd et al. [4] show that the limited memory BFGS matrix has this property if the curvature condition $s_k^T y_k > 0$ is satisfied. Similarly, Powell [28] proposes that $y_k$ should be replaced by
$$y_k^* = \theta_k y_k + (1 - \theta_k) B_k s_k, \qquad \theta_k = \begin{cases} 1, & s_k^T y_k \ge 0.2\, s_k^T B_k s_k, \\ \dfrac{0.8\, s_k^T B_k s_k}{s_k^T B_k s_k - s_k^T y_k}, & \text{otherwise}, \end{cases}$$
where $B_k$ is an approximation of $\nabla^2 h(x_k)$ and $B_k = H_k^{-1}$. Inspired by the Moreau-Yosida regularization and the limited memory technique, we give a limited memory BFGS method for box-constrained optimization with a nonsmooth objective function. In the given algorithm, we also combine an active-set strategy with the gradient projection method. The techniques of the following algorithm are similar to those in Facchinei and Lucidi [10], Ni and Yuan [26], and Xiao and Wei [36]; the main difference lies in the handling of the nonsmooth objective.
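A short Python sketch of Powell's damping may help; here the product $B_k s_k$ is assumed to be available (for L-BFGS it can be formed from the stored pairs), and the classical threshold $0.2$ is used:

```python
import numpy as np

def powell_damped_y(s, y, Bs, c=0.2):
    """Return the damped vector y* = theta*y + (1-theta)*B*s, so that
    s^T y* >= c * s^T B s > 0 and positive definiteness is preserved.
    Assumes B is positive definite, hence s^T B s > 0."""
    sBs = np.dot(s, Bs)
    sy = np.dot(s, y)
    if sy >= c * sBs:
        theta = 1.0
    else:
        theta = (1.0 - c) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * Bs
```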

Algorithm
Set the feasible region $B = \{x \in \mathbb{R}^n : l_i \le x_i \le u_i,\ i = 1, \ldots, n\}$, where $x_i$ denotes the $i$th component of the vector $x$. A vector $\bar{x} \in B$ is said to be a stationary point for problem (2.2) if it satisfies the following relations:
$$\bar{x}_i = l_i \;\Rightarrow\; g_i(\bar{x}) \ge 0, \qquad \bar{x}_i = u_i \;\Rightarrow\; g_i(\bar{x}) \le 0, \qquad l_i < \bar{x}_i < u_i \;\Rightarrow\; g_i(\bar{x}) = 0. \tag{3.1}$$
In view of the observations of the previous section, we first solve (2.2) and then carry its solution over to problem (1.1). The iterative method used to solve (2.2) is defined by
$$x_{k+1} = x_k + \alpha_k d_k,$$
where $\alpha_k$ is a steplength and $d_k$ is a search direction of $f^{MY}$ at $x_k$. Let $\bar{x} \in B$ be a stationary point of problem (2.2) and define the active constraint sets
$$\Gamma = \{i : \bar{x}_i = l_i\}, \qquad \Psi = \{i : \bar{x}_i = u_i\};$$
thus we can define the set of free variables by
$$\Upsilon = \{1, \ldots, n\} \setminus (\Gamma \cup \Psi).$$
Therefore (3.1) can be rewritten as
$$g_i(\bar{x}) \ge 0 \;\; (i \in \Gamma), \qquad g_i(\bar{x}) \le 0 \;\; (i \in \Psi), \qquad g_i(\bar{x}) = 0 \;\; (i \in \Upsilon).$$
It is reasonable to define the approximations $\Gamma(x)$, $\Upsilon(x)$, and $\Psi(x)$ of $\Gamma$, $\Upsilon$, and $\Psi$, respectively:
$$\Gamma(x) = \{i : x_i \le l_i + a_i(x) g_i(x)\}, \qquad \Psi(x) = \{i : x_i \ge u_i + b_i(x) g_i(x)\}, \qquad \Upsilon(x) = \{1, \ldots, n\} \setminus \big(\Gamma(x) \cup \Psi(x)\big), \tag{3.5}$$
where $a_i$ and $b_i$ are nonnegative, continuous, and bounded from above on $B$, with the property that $a_i(x) > 0$ if $x_i = l_i$ and $b_i(x) > 0$ if $x_i = u_i$. Similar to Theorem 3 in [10] for smooth optimization, we can obtain some results relating $\Gamma(x)$, $\Upsilon(x)$, $\Psi(x)$ to $\Gamma$, $\Upsilon$, and $\Psi$ for nonsmooth problems.
This implies that $l_i = \bar{x}_i = u_i$ and $g_i(\bar{x}) = 0$, which is a contradiction. Hence $\Gamma(x) \cap \Psi(x) = \emptyset$ holds. Now we prove the second conclusion of this theorem. If $i \in \Gamma(x)$, then by the definition of $\Gamma(x)$ and the nonnegativity of $a_i$, $g_i(x) > 0$ holds. Since $g_i$ is continuous in $x$ and the strict complementarity holds, we deduce that $i \in \Gamma$ holds. Thus $\Gamma(x) \subseteq \Gamma$ holds.
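As an illustration of the estimates (3.5), the following Python sketch identifies the index sets with the constant choice $a_i(x) = b_i(x) = 10^{-5}$ that is also used in our experiments; the boolean-mask representation is an implementation choice, not part of the theory:

```python
import numpy as np

def estimate_active_sets(x, g, l, u, a=1e-5, b=1e-5):
    """Return boolean masks for Gamma(x), Psi(x), Upsilon(x) of (3.5),
    given the current point x and the gradient g of f^MY at x."""
    lower = x <= l + a * g        # Gamma(x): estimated active at l_i
    upper = x >= u + b * g        # Psi(x): estimated active at u_i
    free = ~(lower | upper)       # Upsilon(x): estimated free variables
    return lower, upper, free
```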
Theorem 3.1 shows that $\Gamma(x)$, $\Upsilon(x)$, and $\Psi(x)$ are "good" estimates of $\Gamma$, $\Upsilon$, and $\Psi$, respectively. Next, we give the choices of the direction $d_k$ and the stepsize $\alpha_k$ at the current point $x_k \in B$. Consider the sets $\Gamma_k = \Gamma(x_k)$, $\Psi_k = \Psi(x_k)$, and $\Upsilon_k = \Upsilon(x_k)$, and define the search direction $d_k$ componentwise by
$$d_k^i = l_i - x_k^i \;\; (i \in \Gamma_k), \qquad d_k^i = u_i - x_k^i \;\; (i \in \Psi_k), \qquad d_k^{\Upsilon_k} = -\hat{H}_k g_k^{\Upsilon_k},$$
where $d_k^{\Upsilon_k}$ denotes the subspace direction for the inactive variables, $\hat{H}_k = Z^T H_k Z \in \mathbb{R}^{|\Upsilon_k| \times |\Upsilon_k|}$ is an approximation of the reduced inverse Hessian matrix, $H_k$ is an approximation of the full space inverse Hessian matrix, and $Z$ is the matrix with columns $\{e_i \mid i \in \Upsilon_k\}$, $e_i$ being the $i$th column of the identity matrix in $\mathbb{R}^{n \times n}$.
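The assembly of $d_k$ can be sketched as follows; as a simplification, the reduced matrix $\hat{H}_k$ is realized by running the two-loop recursion from the earlier sketch on the restricted pairs $Z^T s_i$, $Z^T y_i$, which is one possible implementation rather than the paper's exact one:

```python
import numpy as np

def search_direction(x, g, l, u, lower, upper, free, s_list, y_list):
    """Assemble d_k: estimated active components move onto their
    bounds, free components take the reduced L-BFGS step -H_k g_k."""
    d = np.zeros_like(x)
    d[lower] = l[lower] - x[lower]     # components in Gamma_k
    d[upper] = u[upper] - x[upper]     # components in Psi_k
    s_red = [s[free] for s in s_list]  # Z^T s_i
    y_red = [y[free] for y in y_list]  # Z^T y_i
    d[free] = -lbfgs_two_loop(g[free], s_red, y_red)
    return d
```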
For smooth optimization problems, several authors use the projected search for quadratic and nonlinear programming problems with bounds (see [27]). The projected search finds a steplength $\alpha_k = \beta^{i_k} > 0$ yielding sufficient decrease of the function $\varphi_k : \mathbb{R} \to \mathbb{R}$, $\varphi_k(\alpha) = f^{MY}\big([x_k + \alpha d_k]^+\big)$, in the sense that
$$f^{MY}\big([x_k + \alpha_k d_k]^+\big) \le f^{MY}(x_k) + \sigma \alpha_k g_k^T d_k, \tag{3.9}$$
where $\beta \in (0,1)$ and $\sigma \in (0, \frac{1}{2})$ are constants and $[\cdot]^+$ is the projection onto $B$ defined by
$$[x]^+_i = \max\big\{l_i, \min\{x_i, u_i\}\big\}, \quad i = 1, \ldots, n. \tag{3.10}$$
In this paper, we also use this technique to determine the steplength $\alpha_k$. Based on these selections of $d_k$ and $\alpha_k$, and letting $m_s \le m$ be the number of correction pairs, we give the steps of the algorithm.
Step 8: Set k = k + 1 and go to Step 2.
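A minimal Python sketch of the projected search used to determine $\alpha_k$, written as an Armijo-type test matching (3.9)-(3.10) (the safeguard bound is ours):

```python
import numpy as np

def projected_backtracking(fmy, x, d, g, l, u, beta=0.1, sigma=0.1):
    """Find alpha = beta^i satisfying the sufficient-decrease test
    f^MY([x + alpha d]^+) <= f^MY(x) + sigma*alpha*g^T d, cf. (3.9)."""
    proj = lambda z: np.clip(z, l, u)   # [.]^+, the projection (3.10)
    f0 = fmy(x)
    gd = np.dot(g, d)
    alpha = 1.0
    while fmy(proj(x + alpha * d)) > f0 + sigma * alpha * gd:
        alpha *= beta
        if alpha < 1e-16:               # safeguard against stalling
            break
    return alpha, proj(x + alpha * d)
```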
Remark The given algorithm can be regarded as an extension of [36] from smooth optimization to nonsmooth optimization.

Assumption B The level set $L(x_0) = \{x \in B : f^{MY}(x) \le f^{MY}(x_0)\}$ is bounded; in particular, $f^{MY}$ is bounded from below on $L(x_0)$.
Similar to the proof techniques of [36] for smooth box-constrained optimization, we can obtain the following lemma. We state it here and omit the proof.
where $\epsilon > 0$ is a constant.

Proof We prove this theorem by contradiction. Let $x^*$ be any accumulation point of $\{x_k\}$; then there exists a subsequence $\{x_{k_i}\}$ satisfying $\lim_{i \to \infty} x_{k_i} = x^*$. Suppose that $x^*$ is not a KKT point. By (3.3) and (3.4), we can conclude that there exists $j \in \Gamma$ or $j \in \Psi$ for which the corresponding stationarity relation fails. By the line search (3.9) and (4.2), we can deduce that the sequence $\{f^{MY}(x_k)\}$ is decreasing. Since $f^{MY}$ is bounded from below by Assumption B, we have $\sum_{k=0}^{\infty} \big( f^{MY}(x_k) - f^{MY}(x_{k+1}) \big) < \infty$; in particular, we get $\lim_{k \to \infty} \alpha_k g_k^T d_k = 0$. Using the definition of $d_k$ and setting $\vartheta_1 = \max_{x \in B} \|g_k\|^2$, we can bound the direction; this bound and (4.1) imply that there exists a constant $\beta^* \in (0, 1)$ bounding the acceptable steplengths from below. Suppose that $\alpha_k$ satisfies the line search (3.9). If $\alpha_k < 0.1 \beta^*$, then there exists an unacceptable steplength $\alpha_{k,i} \le 10 \alpha_k$ with $i \ge 0$ violating the sufficient-decrease test, where $\zeta_2 = \frac{1}{2} \max_{x \in B} \|\nabla^2 f^{MY}(x)\|$. Then we obtain an estimate whose second inequality follows from (4.2), and hence a lower bound on $\alpha_{k,i}$; this, together with (4.6), implies a corresponding bound on $\alpha_k$. By the definition of $\Gamma_k$ and $\Psi_k$, we can conclude that every term on the right-hand side of the resulting decomposition is larger than zero. Thus, by (4.8), there exists some $j \in \Upsilon$ such that (4.4) holds. Combining the above three relations, for all sufficiently large $k$ we obtain a contradiction. This completes the proof.
Remark If the condition $g(x) = 0$ holds, then by (2.3) and the convexity of $f^{MY}(x)$ it is not difficult to obtain $x = p(x)$. Accordingly, the point $x$ is an optimal solution of (1.1).
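The reasoning behind this remark can be written out in one chain. By the optimality condition for the strongly convex function $\theta$ at its minimizer $p(x)$,
$$0 \in \partial f\big(p(x)\big) + \frac{1}{\lambda}\big(p(x) - x\big),$$
so $g(x) = \frac{x - p(x)}{\lambda} = 0$ gives $x = p(x)$, and hence $0 \in \partial f(x)$, i.e., $x$ minimizes the convex function $f$.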

Numerical results
In the experiments, all codes were written in MATLAB R2010a and run on a PC with an Intel(R) Core(TM) i3-3217U CPU at 1.80 GHz, 4.00 GB of RAM, and the Windows 7 operating system. Our experiments are performed on a set of nonlinear box-constrained nonsmooth problems from Karmitsa [17] with the given initial points; these problems are listed in Table 1. In the experiments, we choose $\sigma = \beta = 0.1$, $a_i(x) = b_i(x) = 10^{-5}$ in (3.5), $\theta = 1$, the "basic matrix" in the limited memory BFGS method to be the identity matrix $I$, and $m = 5$. The subproblem (2.1) is solved by the PRP conjugate gradient algorithm, whose iteration and function-evaluation counts are added to those of the main algorithm. The PRP conjugate gradient algorithm for (2.1) is as follows:
Step 5: If $\|g^{sub}_{k+1}\| < \iota$, stop; otherwise compute the search direction by
$$d_{k+1} = -g^{sub}_{k+1} + \beta_k^{PRP} d_k, \qquad \beta_k^{PRP} = \frac{(g^{sub}_{k+1})^T (g^{sub}_{k+1} - g^{sub}_k)}{\|g^{sub}_k\|^2}.$$
Step 6: Set $k = k + 1$ and go to Step 3.
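A generic Python sketch of such a PRP iteration for minimizing $\theta(z)$ is given below; the Armijo backtracking parameters are illustrative assumptions, while the tolerance $\iota$ matches the value used below:

```python
import numpy as np

def prp_cg(func, grad, x0, iota=1e-3, max_iter=1000):
    """PRP conjugate gradient: d_{k+1} = -g_{k+1} + beta_PRP * d_k with
    beta_PRP = g_{k+1}^T (g_{k+1} - g_k) / ||g_k||^2."""
    x, g = x0.copy(), grad(x0)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < iota:
            break
        # Simple Armijo backtracking along d (illustrative choice).
        alpha, f0, gd = 1.0, func(x), np.dot(g, d)
        while func(x + alpha * d) > f0 + 1e-4 * alpha * gd and alpha > 1e-16:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = np.dot(g_new, g_new - g) / np.dot(g, g)  # PRP formula
        x, g, d = x_new, g_new, -g_new + beta * d
    return x
```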
In the PRP conjugate gradient algorithm, $x_0$ is supplied by Algorithm 1, $\iota = 10^{-3}$, and $\tau_0 = 0.25$. Since we aim to apply the given algorithm to large-scale nonsmooth problems, the dimensions of the test problems are 5000, 7000, 10,000, and 11,000. The following Himmelblau stop rule is used: if $|f(x_k)| > 10^{-5}$, set $stop1 = \frac{|f(x_k) - f(x_{k+1})|}{|f(x_k)|}$; otherwise, set $stop1 = |f(x_k) - f(x_{k+1})|$. The program stops if $stop1 < 10^{-4}$ is satisfied. In the experiments, we find that different stop rules noticeably influence the iteration and function-evaluation counts, but not the final function value. Since the iteration counts are stable under the Himmelblau rule, we adopt it. To show the efficiency of the given method, we also test the standard active-set algorithm with the L-BFGS update and compare their performance. From Table 2, it is not difficult to see that both methods are effective for solving these ten box-constrained nonsmooth problems. The iteration and function-evaluation counts do not change as the dimension increases, except for the problems Generalization of MAXQ and Chained Crescent I, which shows that the given algorithm is feasible and stable. The problems Chained Crescent I and Chained Crescent II share many properties, so they attain the same optimal value for each tested dimension. The cpu-time grows as the dimension increases, as one would expect. To visualize the performance of the two algorithms, the performance-profile tool of Dolan and Moré [9] is used. Figure 1 shows the performance profiles of Algorithm 1 and the active-set algorithm with respect to the cpu-time of Table 2. It is not difficult to see that Algorithm 1 wins, since it has the higher probability of being the optimal solver. Figure 1 shows that Algorithm 1 successfully solves 100% of the test problems at $t \approx 2$, whereas the active-set algorithm completely solves the test problems only at about $t \approx 41$. All in all, although the proposed method does not achieve as large an improvement as we expected, we think the enhancement it provides is still noticeable.

Conclusions
(i) It is well known that nonsmooth problems are very difficult to solve, even in the unconstrained case, and especially at large scale. The Moreau-Yosida regularization technique is an effective tool for dealing with such problems. We use this technique and propose a subspace algorithm for solving box-constrained nonsmooth optimization problems.
(ii) In order to reduce the computational workload and memory requirements while retaining good numerical performance, the limited memory BFGS method is employed in the given algorithm. The numerical results on the test problems show that the presented algorithm is promising for large-scale problems. The dimensions range from 5000 to 11,000 variables, larger than those of the unconstrained nonsmooth problems in [18].
(iii) In the experiments, we find that different stop rules obviously influence the iteration and function-evaluation counts, but not the final function values. Since this behavior is tied to the stop rule, further work is to find more suitable stop rules for the given algorithm.
(iv) Inspired by the ideas of [26] and [10], we extend their techniques to nonsmooth problems. The proof methods of this paper are similar to those of [36]. However, all three of these papers concentrate on continuously differentiable optimization problems.
(v) Considering the above discussions, we think there are at least three issues that could lead to improvements. The first is the constant $m$ in the L-BFGS update formula, which could be tuned. Another important point is the choice of the parameters in the active-set identification technique, since the values used here are not the only possible choice. The last and most important issue, suggested by the numerical performance, is whether there are better stop rules, optimality conditions, and convergence conditions for nonsmooth problems. All of these aspects are left for future work.
Overall, we think that the method provides a valid approach for solving large-scale box-constrained nonsmooth problems, since its numerical performance is promising.