An inertial extrapolation method for convex simple bilevel optimization

We consider a scalar objective minimization problem over the solution set of another optimization problem. This problem is known as the simple bilevel optimization problem and has drawn significant attention in the last few years. Our inner problem consists of minimizing the sum of a smooth and a nonsmooth function, while the outer one is the minimization of a smooth convex function. We propose a fixed-point iterative method with inertial extrapolation to solve the problem and establish its convergence. Our numerical experiments show that the method proposed in this paper outperforms the currently best known algorithm for the class of problems considered.

control of the follower. For more details on the vocabulary and connections of problem (1.1)-(1.2) to the standard bilevel optimization problem, see Subsection 2.1 below.
A common approach to solving problem (1.1)-(1.2) is Tikhonov-type regularization [28] (an indirect method), based on solving the regularized problem min ϕ_λ(x) := ϕ(x) + λh(x) (1.3) for some λ > 0. Note that problem (1.1)-(1.2) can be traced back to the work by Mangasarian and Meyer [20] in the process of developing efficient algorithms for large-scale linear programs. The model emerged in turn as a refinement of the regularization technique introduced by Tikhonov [28]. The underlying idea in the related papers by Mangasarian and his co-authors is called the finite-perturbation property, which consists of finding a parameter λ̄ (the Tikhonov perturbation parameter) such that for all λ ∈ (0, λ̄], arg min_{x∈X*} h(x) = arg min_{x∈R^n} ϕ_λ(x) := ϕ(x) + λh(x). (1.4) This property, initially proven in [20] when the lower-level problem is a linear program, was later extended in [16] to the case where it is a general convex optimization problem.
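The finite-perturbation property (1.4) can be seen on a toy instance (our own illustrative example, not taken from [20]): minimize h(x) = ½x² over X* = arg min{−x : 0 ≤ x ≤ 1} = {1}; for every λ ∈ (0, 1] the penalized objective −x + ½λx² is still minimized over [0, 1] at x = 1, while the property fails once λ > 1.

```python
import numpy as np

# Inner problem: min -x over [0, 1], so X* = {1}.
# Outer objective: h(x) = 0.5*x**2, so the argmin of h over X* is {1}.
# Finite-perturbation property: for all lam in (0, 1], the penalized
# problem min over [0,1] of (-x + lam*0.5*x**2) still has {1} as solution set.
xs = np.linspace(0.0, 1.0, 100001)
for lam in [0.1, 0.5, 1.0]:
    x_star = xs[np.argmin(-xs + lam * 0.5 * xs**2)]
    assert abs(x_star - 1.0) < 1e-8  # minimizer unchanged for lam in (0, 1]

# For lam > 1 the minimizer moves to 1/lam, so the property fails there.
x_4 = xs[np.argmin(-xs + 4.0 * 0.5 * xs**2)]
print(x_4)  # prints 0.25, i.e. 1/lam with lam = 4
```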
To the best of our knowledge, the development of solution algorithms specifically tailored to optimization problems of the form (1.1)-(1.2) can be traced back to the work by Cabot [9], where a proximal point method is proposed to solve the problem and its extension to a simple hierarchical optimization problem with finitely many levels. In contrast to the latter paper, where the approximation scheme is only implicit, thus making the method difficult to implement numerically, Solodov [26] proposed an explicit and more tractable proximal point method for problem (1.1)-(1.2). Since then, various proximal point algorithms have been developed to solve the problem under different frameworks; see, e.g., [7,21,25] and references therein.
Inspired by recent results on inertial extrapolation type algorithms for solving optimization problems (see, e.g., [1,4,6,23] and references therein), our aim in this paper is to solve problem (1.1)-(1.2) by introducing an inertial extrapolation step into BiG-SAM (1.5); we call the resulting method iBiG-SAM. We then establish the global convergence of our method under reasonable assumptions. Numerical experiments show that the proposed method outperforms BiG-SAM (1.5) introduced in [25].
For the remainder of the paper, first note that there is a striking similarity between the exact penalization model (1.3) and a corresponding partial penalization approach based on the partial calmness concept [32] often used to solve the general bilevel optimization problem. Both approaches seem to have originated from completely different sources, and their development also seems to have occurred independently of each other until now. In Subsection 2.1, we clarify this similarity and discuss some strong relationships between the two problem classes. In Subsection 2.2, we recall some basic definitions and results that will play an important role in the paper. The proposed method and its convergence analysis are presented in Section 3. Some numerical experiments are given in Section 4. We conclude the paper with some final remarks in Section 5.

Standard bilevel optimization.
In this subsection, we provide a discussion to place the simple bilevel optimization problem introduced above in the general context of bilevel optimization. To proceed, we consider a simple optimistic version of the latter class of problems, which aligns suitably with problem (1.1)-(1.2), namely min_{x,y} h(x, y) subject to y ∈ S(x), (2.1) where h : R^n × R^m → R represents the upper-level objective function and the set-valued mapping S defines the set of optimal solutions of the lower-level problem min_y ϕ(x, y) (2.2) (ϕ : R^n × R^m → R) for any fixed upper-level variable x. Obviously, problem (1.1)-(1.2) is a special case of problem (2.1)-(2.2), where the optimal solution of the leader is simply picked among the optimal solutions of the lower-level problem, which in turn are obtained without any influence from the leader, as is the case in the latter problem.
On the other hand, problem (2.1)-(2.2) can be equivalently written as an optimization problem over an efficient set, where E(R^n × R^m, φ̃, ⪯) denotes the efficient set (i.e., the optimal solution set) of the problem of minimizing a multiobjective function φ̃ (based on ϕ (2.2)) over R^n × R^m w.r.t. a certain order relation ⪯; for examples of choices of the latter function and corresponding order relations, see the papers [14,17]. Obviously, an optimization problem over an efficient set is a generalization of the simple bilevel optimization problem (1.1)-(1.2), and has been extensively investigated since the seminal work by Philip [24]; see [31] for a literature review on the topic.
One common approach to transforming problem (2.1)-(2.2) into a single-level optimization problem is the so-called lower-level optimal value function (LLVF) reformulation min_{x,y} h(x, y) subject to ϕ(x, y) ≤ ϕ*(x), (2.3) where ϕ*(x) = min_y ϕ(x, y) is the optimal value function of the lower-level problem (2.2). Recall that this reformulation is an underlying feature in the development of the link (1.4) between the simple bilevel optimization problem (1.1)-(1.2) and the penalized problem (1.3), as outlined in the corresponding publications; see, e.g., [16]. However, we instead want to point out here an interesting similarity between the finite-perturbation property (1.4) and the partial calmness concept [32] commonly used in the context of standard bilevel optimization. To highlight this, let (x̄, ȳ) be a local optimal solution of (2.3). The problem is partially calm at (x̄, ȳ) if and only if there exists λ > 0 such that (x̄, ȳ) is also a local optimal solution of the penalized problem min_{x,y} h(x, y) + λ(ϕ(x, y) − ϕ*(x)). (2.4) The partial calmness concept does not automatically hold for the simple bilevel optimization problem (1.1)-(1.2). To see this, consider the convex simple bilevel optimization problem of minimizing (x − 1)² subject to x ∈ arg min_y y². It is clear that 0 is the only optimal solution of this problem. But for the corresponding penalized problem (2.4), to minimize (x − 1)² + λx², we can easily check that the optimal solution is x(λ) := 1/(1 + λ) for all λ > 0. Clearly, x(λ) ≠ 0 for all λ > 0.
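The failure of partial calmness in this counterexample is easy to verify numerically; the snippet below (our own sketch) minimizes (x − 1)² + λx² on a grid and compares with the closed-form minimizer x(λ) = 1/(1 + λ), which stays bounded away from the bilevel solution 0.

```python
import numpy as np

# Penalized problem from the counterexample: min (x - 1)**2 + lam*x**2.
# Setting the derivative 2*(x - 1) + 2*lam*x to zero gives x(lam) = 1/(1 + lam),
# which is nonzero for every lam > 0, so no penalty recovers the solution 0.
xs = np.linspace(-1.0, 2.0, 300001)
for lam in [0.5, 1.0, 10.0]:
    x_min = xs[np.argmin((xs - 1.0)**2 + lam * xs**2)]
    assert abs(x_min - 1.0 / (1.0 + lam)) < 1e-4  # matches the closed form
    assert x_min > 0.0  # never equal to the bilevel solution 0
```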
It is also important to note that, possibly unlike the finite termination property (1.4), the partial calmness concept was introduced as a qualification condition to derive necessary optimality conditions for problem (2.3); see [13,32] for some papers where this concept is used, and also the papers [12,15] for new results on simple bilevel optimization problems from the perspective of standard bilevel optimization.

Basic mathematical tools
We state the following well-known lemmas which will be used in our convergence analysis in the sequel.
Assume ∑_n γ_n < ∞. Then the following results hold: (i) If σ_n ≤ α_n M for some M ≥ 0, then {a_n} is a bounded sequence.
We state the formal definition of some classes of operators that play an essential role in our analysis in the sequel.
(b) averaged if and only if it can be written as the average of the identity mapping I and a nonexpansive operator, i.e., T := (1 − β)I + βS with β ∈ (0, 1) and S : R^n → R^n a nonexpansive operator; more precisely, we then say that T is β-averaged; (c) firmly nonexpansive if and only if 2T − I is nonexpansive, or equivalently, ‖Tx − Ty‖² ≤ ⟨x − y, Tx − Ty⟩ for all x, y ∈ R^n. Alternatively, T is said to be firmly nonexpansive if and only if it can be expressed as T := ½(I + S), where S : R^n → R^n is nonexpansive.
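As a quick numerical sanity check of definition (c) (our own illustration, not from the text), the projection onto the nonnegative orthant satisfies the firm-nonexpansiveness inequality ‖Tx − Ty‖² ≤ ⟨x − y, Tx − Ty⟩ on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
proj = lambda x: np.maximum(x, 0.0)  # projection onto {x : x >= 0}

# Firm nonexpansiveness: ||Tx - Ty||^2 <= <x - y, Tx - Ty> for all x, y.
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    tx, ty = proj(x), proj(y)
    lhs = np.dot(tx - ty, tx - ty)
    rhs = np.dot(x - y, tx - ty)
    assert lhs <= rhs + 1e-12
```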
We can see from the above that firmly nonexpansive operators (in particular, projections) are 1/2-averaged. Next, we provide some relevant properties of averaged operators.
Proposition 2.5. (see, e.g., [8]) For given operators S, T , and V defined from R n to R n , the following statements are satisfied: (a) If T = (1 − α)S + αV for some α ∈ (0, 1) and if S is averaged and V is nonexpansive, then the operator T is averaged.
(b) The operator T is firmly nonexpansive if and only if the complement I − T is also firmly nonexpansive.
(d) The composite of finitely many averaged operators is averaged. That is, if for each i = 1, . . . , N, the operator T_i is averaged, then so is the composite operator T_1 T_2 ⋯ T_N. Finally, for the last proposition of this section, we recall the definition of monotonicity of nonlinear operators.
The following proposition gathers some useful results on the relationship between averaged operators and inverse strongly monotone operators.
If T : R^n → R^n is an operator, then the following statements hold:

In this section, we give a precise statement of our method and its convergence analysis. We first state the assumptions that will be needed throughout the rest of this paper. (a) f : R^n → R is convex and continuously differentiable, and its gradient is Lipschitz continuous with constant L_f.
(b) g : R^n → (−∞, +∞] is proper, lower semicontinuous, and convex.
(c) h : R n → R is strongly convex with parameter σ > 0 and continuously differentiable such that its gradient is Lipschitz continuous with constant L h .
(d) The set X * of all optimal solutions of problem (1.2) is nonempty.
Note that the stepsize λ in Assumption (c) above is chosen from a larger interval than that of [25]. Also, our Assumption (a) is weaker than Assumption C of [25], since {α_n} is not required in our Assumption (a) to satisfy lim_{n→∞} α_{n+1}/α_n = 1, as assumed in Assumption C of [25]. Take, for example, α_n = 1/√n when n is odd and α_n = 1/n when n is even. We see that {α_n} satisfies Assumption (a), but α_{n+1}/α_n does not converge to 1.
We next give a precise statement of our inertial Bilevel Gradient Sequential Averaging Method (iBiG-SAM) as follows.
Step 1: Given the iterates x_{n−1} and x_n (with n ≥ 1), choose θ_n such that 0 ≤ θ_n ≤ θ̄_n, with θ̄_n defined by θ̄_n := min{(n − 1)/(n + α − 1), ε_n/‖x_n − x_{n−1}‖} if x_n ≠ x_{n−1}, and θ̄_n := (n − 1)/(n + α − 1) otherwise. (3.1) Step 2: Proceed with the following computations: y_n = x_n + θ_n(x_n − x_{n−1}), s_n = prox_{λg}(y_n − λ∇f(y_n)), z_n = y_n − γ∇h(y_n), x_{n+1} = α_n z_n + (1 − α_n)s_n, n ≥ 1. Also note that Step 1 in our Algorithm 3.1 is easily implemented in numerical computations, since the value of ‖x_n − x_{n−1}‖ is a priori known before θ_n is chosen.
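The two steps above can be sketched in a few lines of code. The toy instance below is our own (not from the paper): the inner problem is a least-squares problem with g ≡ 0 (so prox_{λg} is the identity) whose solution set is a whole line, the outer objective is h(x) = ½‖x‖², and the iterates should approach the minimum-norm inner solution. The step sizes α_n and the inertial tolerance ε_n are illustrative choices, not the tuned values of Section 4.

```python
import numpy as np

# Toy instance (ours): inner problem min_x 0.5*||A@x - b||**2 with g = 0,
# whose solution set is the line {(1, t)}; outer problem min 0.5*||x||**2.
# iBiG-SAM should then return the minimum-norm inner solution (1, 0).
A = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([1.0, 0.0])
grad_f = lambda x: A.T @ (A @ x - b)   # L_f = ||A.T @ A|| = 1, so lam = 1/L_f = 1
grad_h = lambda x: x                   # h = 0.5*||x||**2: L_h = sigma = 1
lam, gamma, alpha = 1.0, 2.0 / (1.0 + 1.0), 3.0

x_prev = x = np.ones(2)
for n in range(1, 5001):
    alpha_n = 1.0 / (n + 1)            # illustrative step sizes with alpha_n -> 0
    eps_n = alpha_n / n                # illustrative inertial tolerance
    d = np.linalg.norm(x - x_prev)
    theta_bar = (n - 1) / (n + alpha - 1)
    theta_n = theta_bar if d == 0 else min(theta_bar, eps_n / d)
    y = x + theta_n * (x - x_prev)     # Step 1: inertial extrapolation
    s = y - lam * grad_f(y)            # Step 2: prox-grad step (prox here = identity)
    z = y - gamma * grad_h(y)          # gradient step on the outer objective
    x_prev, x = x, alpha_n * z + (1 - alpha_n) * s
print(x)  # close to the minimum-norm solution (1, 0)
```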
We are now in a position to discuss the convergence of iBiG-SAM. Let us define the prox-grad mapping T_λ := prox_{λg}(I − λ∇f). (3.3) The next lemma shows that the prox-grad mapping T_λ is averaged. This is an improvement over Lemma 1(i) of [25]. Proof. Observe that the Lipschitz condition on ∇f implies that ∇f is (1/L_f)-ism (see [2]), which then implies that λ∇f is (1/(λL_f))-ism. Hence, by Proposition 2.7(c), I − λ∇f is (λL_f/2)-averaged. Since prox_{λg} is firmly nonexpansive and hence 1/2-averaged, we see from Proposition 2.5(d) that T_λ = (1 − β)I + βT, where β := (2 + λL_f)/4 ∈ [a, b] ⊂ (1/2, 1) and T is a nonexpansive mapping.
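As a numerical illustration of the lemma (our own sketch, not from the paper): for λ = 1/L_f, the prox-grad map T_λ with f(x) = ½‖Ax − b‖² and g = µ‖·‖₁ is averaged and hence, in particular, nonexpansive, which can be checked on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 10))
b = rng.normal(size=20)
L_f = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f, f = 0.5*||Ax-b||^2
lam, mu = 1.0 / L_f, 0.5           # lam*L_f = 1 < 2, so T_lam is averaged

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t*||.||_1
T = lambda x: soft(x - lam * A.T @ (A @ x - b), lam * mu)        # prox-grad map T_lam

# Averaged operators are in particular nonexpansive: ||Tx - Ty|| <= ||x - y||.
for _ in range(500):
    x, y = rng.normal(size=10), rng.normal(size=10)
    assert np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y) + 1e-10
```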
Lemma 1(ii) of [25] showed the equivalence between the fixed points of the prox-grad mapping T_λ (3.3) and the optimal solutions of problem (1.2): that is, x ∈ X* if and only if x = T_λ x. This equivalence will be needed in our convergence analysis in this paper.
Before we proceed with the main result of this section, we first show that the iterative sequence generated by our algorithm is bounded. Proof. From (3.2), for any z ∈ X*, we have z ∈ F(T_λ) = F(T). Observe that sup_{n≥1}(1 − α_n(1 − η))(θ_n/α_n)‖x_n − x_{n−1}‖ exists by Remark 3.4. With this bound, (3.7) becomes a recursion to which Lemma 2.2 applies, and we get that {x_n} is bounded; as a consequence, {y_n} is also bounded. Proof. Start by substituting (3.10) into (3.9), which yields (3.11), where the last inequality follows from the fact that θ_n ∈ [0, 1). Using Lemma 2.1(ii) and (iii), we obtain (3.12) from (3.2). Combining (3.11) and (3.12) and setting Γ_n := ‖x_n − z‖² for all n ≥ 1, we arrive at (3.13) and hence at (3.14). We consider two cases for the rest of the proof.
Case 1: Suppose there exists a natural number n_0 such that Γ_{n+1} ≤ Γ_n for all n ≥ n_0. Then lim_{n→∞} Γ_n exists, and from (3.14) we deduce that lim_{n→∞} ‖T y_n − y_n‖ = 0.
Since {x_n} is bounded, take a subsequence {x_{n_k}} of {x_n} such that x_{n_k} → p ∈ R^n; using the definition of the contraction mapping S_γ in Lemma 3.6, we can estimate the associated lim sup. From y_n = x_n + θ_n(x_n − x_{n−1}) and x_{n_k} → p, we get y_{n_k} → p. Lemma 2.4 then guarantees that p ∈ F(T) = X*. Furthermore, combining (3.8) and (3.16) with the contraction property of S_γ and (3.11), we obtain from (3.14) a bound of the form ((1 − α_n)² + α_n η)(Γ_n + Γ_{n−1}) + 2((1 − α_n)² + α_n η)‖x_n − x_{n−1}‖, and therefore a recursion (3.19) with δ_n := 2(1 − η)α_n/(1 − α_n η) and σ_n collecting the remaining terms. Using Lemma 2.2(ii) and Assumption 3.2 in (3.19), we get Γ_n = ‖x_n − z‖² → 0, and thus x_n → z as n → ∞.
We give some brief comments on the nonasymptotic O(1/n²) convergence rate of some estimates obtained in Theorem 3.8. Full details on the convergence rate of the result in Theorem 3.8 are left for careful investigation in a separate work.
For the numerical implementation of our proposed method of Section 3, we consider the inverse problems tested in [25] and numerically compare the proposed Algorithm 3.1 (iBiG-SAM) with the BiG-SAM method of [25]. The codes are implemented in Matlab. We perform all computations on a Windows desktop with an Intel(R) Core(TM) i7-2600 CPU at 3.4GHz and 8.00 GB of memory. We take α_n = 2κ/(n(1 − β)) with κ = 0.1, which is the best choice for BiG-SAM considered in [25] and β ∈ [0, 1) defined as in (3.5), and θ_n = θ̄_n as in (3.1) with α = 3 and ε_n = α_n/n^{0.01} for iBiG-SAM.
where δ_X is the indicator function of the nonnegative orthant X := {x ∈ R^n : x ≥ 0}. Furthermore, we take the outer objective function as h(x) = ½⟨Qx, x⟩, (4.1) where Q is a positive definite matrix. It is clear that L_f = ‖AᵀA‖ and L_h = ‖Q‖. We choose λ = 1/L_f and γ = 2/(L_h + σ).
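For this example, the proximal map reduces to the projection onto the nonnegative orthant. A minimal sketch of the resulting quantities follows (our own; the random A, b and the stand-in Q are illustrative, not the test data of [25]):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
Q = A[:10, :].T @ A[:10, :] + np.eye(10)  # a positive definite stand-in for Q

# g = delta_X with X = {x : x >= 0}: prox_{lam*g} is the projection onto X.
prox = lambda v: np.maximum(v, 0.0)

L_f = np.linalg.norm(A.T @ A, 2)          # L_f = ||A^T A||
lam = 1.0 / L_f
eigs = np.linalg.eigvalsh(Q)              # ascending: eigs[0] = sigma, eigs[-1] = L_h
gamma = 2.0 / (eigs[-1] + eigs[0])        # gamma = 2/(L_h + sigma) as in the text

def inner_step(y):
    """One prox-grad step s = prox_{lam g}(y - lam*grad f(y)), f(y) = 0.5*||Ay - b||^2."""
    return prox(y - lam * A.T @ (A @ y - b))

s = inner_step(np.zeros(10))
assert np.all(s >= 0)  # the step always lands in the nonnegative orthant
```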
Following [4], we consider three inverse problems, namely Baart, Foxgood, and Phillips [25]. For each of these problems, we generated the corresponding 1,000 × 1,000 exact linear system of the form Ax = b by applying the relevant function (baart, foxgood, and phillips). We then performed the simulation by adding normally distributed noise with zero mean and standard deviation ρ = 0.01 to the right-hand-side vector b. The matrix Q is defined by Q = LᵀL + I, where L is generated by the function get_l(1000,1) from the regularization tools (see http://www.imm.dtu.dk/~pcha/Regutools/) and approximates the first-derivative operator. Following [25], we use the stopping condition (ϕ(x_n) − ϕ*)/ϕ* ≤ 10^{−2} for both methods, where ϕ* is the optimal value of the inner problem, computed in advance by running BiG-SAM for 1000 iterations. In Table 1 we present the average number of iterations and time (over 100 runs) until the algorithms reach the stopping criterion. It can be seen that iBiG-SAM outperforms BiG-SAM (on average by about 20%) in all problems tested.
In Figure 1, we compare the behavior of iBiG-SAM with BiG-SAM for the Baart and Foxgood problems when n = 100.
where A ∈ R^{m×n} is a given matrix, b is a given vector, and µ is a positive scalar. This is the LASSO (least absolute shrinkage and selection operator) problem [27] from compressed sensing.
The proximal map with g(x) = µ‖x‖₁ is given by prox_g(x) = arg min_u { µ‖u‖₁ + ½‖u − x‖²₂ }, which is separable in the indices. Thus, for x ∈ R^n, prox_g(x) = (prox_{µ|·|}(x₁), . . . , prox_{µ|·|}(x_n)) = (β₁, . . . , β_n), where β_k = sgn(x_k) max{|x_k| − µ, 0} for k = 1, 2, . . . , n. As in Example 4.1, we take the outer objective function as in (4.1), with Q similarly being a positive definite matrix. We take µ = 0.5, and the data b is generated as Ax + δe, where A and e are random matrices whose elements are normally distributed with zero mean and variance 1, δ = 0.01, and x is a generated sparse vector. The stopping condition is ‖x_n − x*‖ ≤ ε with ε = 10^{−3}, where x* is computed in advance by running BiG-SAM for 1000 iterations. In Table 2 we present the average number of iterations and time (over 100 runs) until the algorithms reach the stopping criterion, for different choices of α ≥ 3 in spaces of different dimensions. Again, iBiG-SAM outperforms BiG-SAM in all simulations.
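The componentwise formula above is the familiar soft-thresholding operator; a minimal sketch (ours):

```python
import numpy as np

def prox_l1(x, mu):
    """Proximal map of mu*||.||_1: componentwise soft thresholding,
    beta_k = sgn(x_k)*max(|x_k| - mu, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

out = prox_l1(np.array([3.0, -0.2, 0.5]), 0.5)
# entries: 2.5 (shrunk by mu), then 0.0 and 0.0 (magnitudes at or below mu)
```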
In Figure 2, we compare the behavior of BiG-SAM with iBiG-SAM for different values of the parameter α. It appears that iBiG-SAM with α = 3 has an advantage over the other values tested.

The paper has introduced, and proved the global convergence of, an inertial extrapolation-type method for solving simple convex bilevel optimization problems in finite-dimensional Euclidean spaces. One can easily check that the results developed here remain valid in infinite-dimensional Hilbert spaces. Based on the numerical experiments conducted, we illustrated that our method outperforms the best known algorithm, recently proposed in [25], for solving problems of the form (1.1)-(1.2). Our next project in this subject area is to derive the convergence rate of the method proposed in this paper.