A Bilevel Approach for Parameter Learning in Inverse Problems

A learning approach to selecting regularization parameters in multi-penalty Tikhonov regularization is investigated. It leads to a bilevel optimization problem, where the lower level problem is a Tikhonov regularized problem parametrized by the regularization parameters. Conditions that ensure the existence of solutions to the bilevel optimization problem of interest are derived, and these conditions are verified for two relevant examples. Difficulties arising from the possible lack of convexity of the lower level problems are discussed. Optimality conditions are given provided that a reasonable constraint qualification holds. Finally, results from numerical experiments used to test the developed theory are presented.


Introduction
Tikhonov regularization is a well-known method for solving ill-posed inverse problems, see e.g. [2,9,16,13]. Given only a noisy measurement y^δ of some outcome y† ∈ Y, and assuming that the inverse problem is to find u† ∈ U_ad such that

S(u†) = y†, (1.1)

one considers, instead of (1.1), the Tikhonov regularized problem

min_{u ∈ U_ad} ‖S(u) − y^δ‖² + Σ_{i=1}^r α_i Ψ_i(u) (P_{α,y^δ})

for suitable choices of a norm ‖·‖, a vector-valued penalty function Ψ : U → [0, ∞]^r, and a vector of regularization parameters α ∈ (0, ∞)^r. Which norm and penalty functions should be chosen depends heavily on the specific application. For choosing the regularization parameters, many general strategies have been proposed; see e.g. [9] and the references given there. Typically these strategies focus on the case of a single scalar regularization parameter, and they become quite involved when one has to deal with a larger number of parameters.
The learning problem In this paper we consider a basic learning approach for selecting regularization parameters in (P_{α,y^δ}). The idea is to choose regularization parameters based on their performance on a training database. In the simplest case, the database consists of a single vector of data (y†, u†, y^δ), where y^δ is a noisy measurement of y†, and u† is such that S(u†) = y†.
We may think of (y†, u†) as an idealistic ground truth input-output pair, and of y^δ as the associated noisy measurement of the output available in practice. Given such data, for every choice of α we can compute the distance between solutions u_α to the regularized problem (P_{α,y^δ}) and the exact solution u†. This is used in the learning process, where we aim at finding the regularization parameter α* for which a solution u_{α*} to (P_{α*,y^δ}) has the minimal distance to u† over all parameter vectors within an a-priori chosen parameter set I. This leads us to the following problem:

min_{α ∈ I} ‖u_α − u†‖² "where u_α solves (P_{α,y^δ})". (1.2)

The quotation marks are used since, if solutions to the Tikhonov regularized problems (P_{α,y^δ}) are not unique, it is not clear which solutions to choose. One possibility is to look for α such that the minimal distance to the exact solution over all solutions to (P_{α,y^δ}) is small. This is called the optimistic position and leads to the following problem:

min_{α ∈ I} inf { ‖u − u†‖² : u solves (P_{α,y^δ}) }. (1.3)

Another possibility is to look for α such that the maximal distance to the exact solution over all solutions to (P_{α,y^δ}) is small. This is called the pessimistic position and amounts to the following problem:

min_{α ∈ I} sup { ‖u − u†‖² : u solves (P_{α,y^δ}) }. (1.4)

Here we only consider the optimistic position (1.3). From now on we call (1.3) the learning problem, since by solving it regularization parameters should be learned. Conceptually, the learning problem is an optimization problem in two variables, which is constrained by requiring that one variable is a solution of another optimization problem depending on the other variable. In the literature, problems of this type are called bilevel optimization problems; see e.g. [8].
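The optimistic learning idea can be made concrete in a small numerical sketch. The following Python snippet (with made-up discrete operators, not taken from the paper's experiments) selects a single scalar regularization parameter by grid search over an admissible set I; in this linear-quadratic case the lower level problem has a unique solution, so the optimistic and pessimistic positions coincide.

```python
import numpy as np

# Illustrative sketch: a discretized linear forward operator S, a ground-truth
# control u_dag, and a noisy measurement y_delta.  All names are stand-ins.
rng = np.random.default_rng(0)
n = 50
S = np.tril(np.ones((n, n))) / n          # crude integration operator (ill-conditioned)
u_dag = np.sin(np.linspace(0.0, np.pi, n))  # exact control u^dagger
y_delta = S @ u_dag + 0.01 * rng.standard_normal(n)

def solve_lower_level(alpha):
    """Tikhonov solution u_alpha = argmin ||S u - y_delta||^2 + alpha ||u||^2."""
    return np.linalg.solve(S.T @ S + alpha * np.eye(n), S.T @ y_delta)

# 'Learning': pick alpha minimizing the distance to the exact control over a
# logarithmic grid I of admissible parameters.
grid = np.logspace(-8, 0, 50)
errors = [np.linalg.norm(solve_lower_level(a) - u_dag) for a in grid]
alpha_star = grid[int(np.argmin(errors))]
```

The interior minimum of `errors` over the grid is the discrete analogue of (1.3); too little regularization amplifies the noise, too much over-smooths the reconstruction.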
The present work was motivated by a similar learning approach that has successfully been used for imaging problems in [14] and the subsequent works [6,4,7]. In these works, both the cases of smooth and non-smooth lower level problems are studied in a finite and infinite dimensional setting. However, in all these contributions it is required that S is either an identity embedding operator or has closed range. The case of a general linear operator S is considered in [5] in a finite dimensional setting. We are well aware of the potential of the currently heavily investigated deep learning techniques for choosing regularization parameters for inverse problems, and we aim to work in this direction. We hope that a mathematical analysis of the deep learning approach can profit from the present work.
What we aim to do is to use parameter learning for the inverse problems of determining coefficients or controls in partial differential equations. This requires us to consider an infinite dimensional setting with S either a linear operator with non-closed range or even non-linear. Although we develop the theory in a somewhat general setting, throughout this work we have two concrete examples in mind. In the first example, S is the linear solution operator to

−γ∆y + y = u in Ω, and y = 0 on ∂Ω, (1.5)

where γ > 0. In the second example, S is the non-linear solution operator mapping the diffusion coefficient u to the solution y of a second order elliptic equation; see Section 6.2. Let us now give a brief summary of the contents of the following sections. In Section 2 we provide a precise statement of the learning problem and introduce the basic notation. In Section 3 we recall some basic properties of the lower level problem, i.e. of the Tikhonov regularized problem. These properties are in turn used in Section 4 to show that the learning problem has a solution under standard assumptions. In Section 5 we discuss the derivation of optimality conditions for the learning problem. Standard examples for possible applications are presented in Section 6. Finally, in Section 7 we present results from numerical experiments.

Problem statement
In the following we present the general setting of the learning problem to be considered in this work.
where m, r ∈ ℕ, and
• U_ad is a subset of a reflexive Banach space U,
• Y is a reflexive Banach space,
• Ũ is a Hilbert space such that U is continuously embedded in Ũ,
• Ỹ is a Hilbert space such that Y is continuously embedded in Ỹ,
• u† ∈ Ũ is the exact control, and y^δ_j ∈ Ỹ, 1 ≤ j ≤ m, are noisy measurements of the exact state,
• e : Y × U_ad → Z represents equality constraints in a Banach space Z,
• Ψ_i : U → [0, ∞], 1 ≤ i ≤ r, are penalty functionals, and Ψ := (Ψ_1, …, Ψ_r)^T,
• α̲, ᾱ ∈ ℝ^r are bounds for the regularization parameters with 0 < α̲ ≤ ᾱ, where the inequalities should be understood elementwise.
Instead of working with an explicit solution operator S as in the introduction, here we consider a more general implicit formulation by requiring that for feasible (y, u) ∈ Y × U ad it holds that e(y, u) = 0.
If for each u ∈ U_ad there exists a unique y ∈ Y such that (2.1) holds, then a solution operator S can be defined by setting S(u) := y. The so-called lower level problem

min_{(y,u)} J_{α,y^δ}(y, u) := ½ Σ_{j=1}^m ‖y − y^δ_j‖²_Ỹ + Σ_{i=1}^r α_i Ψ_i(u)  s.t.  u ∈ U_ad and e(y, u) = 0, (P_{α,y^δ})

which depends on the parameter α ∈ [α̲, ᾱ], is a multi-penalty Tikhonov regularized inverse problem. We let

F_ad := {(y, u) ∈ Y × U | u ∈ U_ad and e(y, u) = 0}

denote the set of feasible points of the lower level problem. To fix ideas, typical choices for the spaces involved are Sobolev and Lebesgue spaces over a bounded Lipschitz domain Ω. Concrete examples are given in Section 6.
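When the penalties are quadratic, Ψ_i(u) = ½‖K_i u‖², and the solution operator S is linear (as in Section 6.1), the lower level problem reduces to a single linear system. The following discrete sketch uses illustrative stand-in operators, not the paper's setup:

```python
import numpy as np

# Sketch of the multi-penalty lower-level solve when S is linear and the
# penalties are quadratic, Psi_i(u) = 0.5*||K_i u||^2.
n = 40
S = np.tril(np.ones((n, n))) / n               # stand-in forward operator
K1 = np.eye(n)                                 # identity penalty operator
K2 = np.eye(n, k=1) - np.eye(n)                # forward-difference penalty operator
rng = np.random.default_rng(1)
y_delta = S @ np.ones(n) + 0.01 * rng.standard_normal(n)

def lower_level(alpha):
    """u_alpha solving the normal equations (S^T S + sum_i alpha_i K_i^T K_i) u = S^T y_delta."""
    A = S.T @ S + alpha[0] * K1.T @ K1 + alpha[1] * K2.T @ K2
    return np.linalg.solve(A, S.T @ y_delta)

u = lower_level(np.array([1e-6, 1e-6]))
```

For non-quadratic penalties or a non-linear state equation, the lower level problem must instead be solved iteratively, which is what Section 7 does.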

Basic assumptions
The following assumptions are frequently invoked throughout this work.
(H1) The feasible control set U ad is convex and closed in U .
(H2) The feasible set of the lower level problem F ad is non-empty.
(H3) For every sequence (y n , u n ) in F ad with (y n , u n ) ⇀ (y, u) in Y × U it holds that e(y, u) = 0.
(H4) For every sequence (y n , u n ) in F ad it holds that if (u n ) is bounded in U , then (y n ) is bounded in Y .
(H5) Ψ is coercive on U and proper on F ad .
(H6) The penalty functionals Ψ i , 1 ≤ i ≤ r, are weakly lower semicontinuous on U .

The lower level problem
When we discuss existence of solutions and optimality conditions for the learning problem in Sections 4 and 5, respectively, we frequently make use of basic properties of the lower level problem. In this section these properties are derived. Throughout this section we always assume that α ∈ [α̲, ᾱ].

Existence of solutions

Proposition 3.1 If (H1)–(H6) hold, then the lower level problem (P_{α,y^δ}) has a solution.

Proof. By (H1) the set Y × U_ad is closed and convex, and thus weakly closed [3, Theorem 3.7 on p. 60]. It is then a direct consequence of (H3) that F_ad is weakly sequentially closed. From (H4)–(H5) and the assumption that α > 0, it follows that J_{α,y^δ} is coercive on F_ad. The mapping

(y, u) ↦ ½ Σ_{j=1}^m ‖y − y^δ_j‖²_Ỹ

is weakly lower semicontinuous as a convex continuous function [3, Corollary 3.9 on p. 61]. In combination with (H6) this implies that J_{α,y^δ} is weakly lower semicontinuous on Y × U. Since it is well known that a weakly lower semicontinuous and coercive function attains a minimum on a non-empty and weakly sequentially closed subset of a reflexive Banach space, the proof is complete.
Remark 3.1. As an immediate consequence of Proposition 3.1, we obtain that the feasible set of the learning problem (LP) is non empty.

Stability
One of the reasons for regularizing an inverse problem is lack of stability with respect to the data. It is thus expected that stability, at least in some sense, holds for the Tikhonov regularized problem (P_{α,y^δ}). Indeed, as stated below in Corollary 3.1, stability can be guaranteed under reasonable assumptions. Before we begin working towards this result, we need to clarify what we mean by stability, in particular in the context of problems with possibly non-unique solutions.
Definition 3.1 (Stability with respect to the data) We say that (P_{α,y^δ}) is stable with respect to the data if and only if the following holds: For every sequence (y^δ_n) in Ỹ^m such that y^δ_n → y^δ, every sequence (y_n, u_n) of corresponding solutions to (P_{α,y^δ_n}) has a cluster point, and every such cluster point is a solution to (P_{α,y^δ}).
Remark 3.2. If (P_{α,y^δ}) has a unique solution, then it is straightforward to verify that stability with respect to the data is equivalent to requiring that every sequence (y_n, u_n) as in Definition 3.1 converges to the unique solution of (P_{α,y^δ}).
Recall that in the learning problem we minimize the distance to the exact control over the set of all feasible regularization parameters and corresponding solutions to the lower level problem. It is therefore useful to know whether the lower level problem is also stable with respect to the regularization parameters.
Definition 3.2 (Stability with respect to the regularization parameters) We say that (P_{α,y^δ}) is stable with respect to the regularization parameters if and only if the following holds: For every sequence (α_n) in [α̲, ᾱ] such that α_n → α, every sequence (y_n, u_n) of corresponding solutions to (P_{α_n,y^δ}) has a cluster point, and every such cluster point is a solution to (P_{α,y^δ}).
As a first step towards showing stability, we prove the following lemma, which states that under standard assumptions at least weak stability can be guaranteed with respect to both the data and the regularization parameters.

Lemma 3.1 (Weak stability) Assume that (H1)–(H6) hold. Let (α_n) be a sequence in [α̲, ᾱ] with α_n → α, let (y^δ_n) be a sequence in Ỹ^m with y^δ_n → y^δ, and let (y_n, u_n) be corresponding solutions to (P_{α_n,y^δ_n}). Then (y_n, u_n) has a weak cluster point, every weak cluster point (ȳ, ū) solves (P_{α,y^δ}), and along the corresponding subsequence it holds that Ψ_i(u_n) → Ψ_i(ū) for 1 ≤ i ≤ r.

Proof. The proof is divided into three steps.
Step 1: We first aim at showing that the sequence (y_n, u_n) has a weakly convergent subsequence in F_ad. Since F_ad is a weakly closed subset of a reflexive Banach space, for this purpose it is sufficient to show that (y_n, u_n) is bounded. Utilizing (H4), in turn, the boundedness of (y_n, u_n) follows if we can prove that (u_n) is bounded.
To show that (u_n) is bounded, we argue as follows: Since the sequence (y^δ_n) is convergent, there exists M > 0 such that ‖y^δ_{n,j}‖_Ỹ ≤ M for all n ∈ ℕ and 1 ≤ j ≤ m. Since (y_n, u_n) solves (P_{α_n,y^δ_n}), for every (y, u) ∈ F_ad we obtain

Σ_{i=1}^r α̲_i Ψ_i(u_n) ≤ J_{α_n,y^δ_n}(y_n, u_n) ≤ J_{α_n,y^δ_n}(y, u).

Using that Ψ is proper on F_ad, we can choose (y, u) ∈ F_ad such that the right-hand side of this chain of inequalities is finite. Since the right-hand side is bounded independently of n, and α̲ > 0, this shows that Σ_{i=1}^r Ψ_i(u_n) is bounded. Consequently, from (H5) it follows that (u_n) is bounded; and thus the first step is complete.
Step 2: Using the first step, we can assume that there exists a subsequence of (y_n, u_n), which, for simplicity, we again denote by (y_n, u_n), and (ȳ, ū) ∈ F_ad such that (y_n, u_n) ⇀ (ȳ, ū) in Y × U. Our goal in the second step is to show that (ȳ, ū) solves (P_{α,y^δ}). For this purpose, since (y_n, u_n) solves (P_{α_n,y^δ_n}), note that

J_{α_n,y^δ_n}(y_n, u_n) ≤ J_{α_n,y^δ_n}(y, u) (3.1)

for all (y, u) ∈ F_ad and n ∈ ℕ. Using that (α, y^δ, y, u) ↦ J_{α,y^δ}(y, u) is weakly lower semicontinuous on [α̲, ᾱ] × Ỹ^m × Y × U, and that for every (y, u) ∈ F_ad the mapping (α, y^δ) ↦ J_{α,y^δ}(y, u) is continuous on [α̲, ᾱ] × Ỹ^m, taking the limit n → ∞ in (3.1) we arrive at

J_{α,y^δ}(ȳ, ū) ≤ liminf_{n→∞} J_{α_n,y^δ_n}(y_n, u_n) ≤ limsup_{n→∞} J_{α_n,y^δ_n}(y_n, u_n) ≤ J_{α,y^δ}(y, u) for all (y, u) ∈ F_ad. (3.2)

As a consequence of this estimate, we have

lim_{n→∞} J_{α_n,y^δ_n}(y_n, u_n) = J_{α,y^δ}(ȳ, ū) = min_{(y,u) ∈ F_ad} J_{α,y^δ}(y, u), (3.3)

which shows that (ȳ, ū) solves (P_{α,y^δ}). This finishes the second step.
Step 3: In order to complete the proof it remains to show that Ψ_i(u_n) → Ψ_i(ū) for 1 ≤ i ≤ r, which is done now. First, observe that due to weak lower semicontinuity of the involved functions

½ Σ_{j=1}^m ‖ȳ − y^δ_j‖²_Ỹ ≤ liminf_{n→∞} ½ Σ_{j=1}^m ‖y_n − y^δ_{n,j}‖²_Ỹ (3.4)

and

Ψ_i(ū) ≤ liminf_{n→∞} Ψ_i(u_n) for 1 ≤ i ≤ r. (3.5)

We now argue as follows: If for some 1 ≤ i ≤ r it holds that

limsup_{n→∞} Ψ_i(u_n) > Ψ_i(ū),

then in view of (3.4)–(3.5), and using J_{α,y^δ}(ȳ, ū) < ∞, this implies

limsup_{n→∞} J_{α_n,y^δ_n}(y_n, u_n) > J_{α,y^δ}(ȳ, ū).

Since we have already shown that lim_{n→∞} J_{α_n,y^δ_n}(y_n, u_n) = J_{α,y^δ}(ȳ, ū), this leads to a contradiction. Consequently, we must have

lim_{n→∞} Ψ_i(u_n) = Ψ_i(ū) for 1 ≤ i ≤ r. (3.6)

Since (3.6) is also true for every subsequence of (u_n), this implies that Ψ_i(u_n) → Ψ_i(ū) for 1 ≤ i ≤ r, which is what was left to show.
Strong convergence as in Definitions 3.1 and 3.2, and thus stability, can be achieved if the following additional assumptions are satisfied.
(H7) For every sequence (u_n) in U and u ∈ U it holds that, if u_n ⇀ u in U and Ψ_i(u_n) → Ψ_i(u) for 1 ≤ i ≤ r, then it follows that u_n → u in U.
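A typical situation in which weak convergence can be upgraded to strong convergence in this fashion is the following Hilbert space (Radon–Riesz) argument: if u_n ⇀ u in a Hilbert space U and ‖u_n‖_U → ‖u‖_U (which holds, for instance, when one of the penalties is Ψ_i(u) = ½‖u‖²_U and Ψ_i(u_n) → Ψ_i(u)), then

```latex
\|u_n - u\|_U^2
  = \|u_n\|_U^2 - 2\,(u_n, u)_U + \|u\|_U^2
  \;\longrightarrow\; \|u\|_U^2 - 2\|u\|_U^2 + \|u\|_U^2 = 0,
```

since weak convergence gives (u_n, u)_U → ‖u‖²_U.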
(H8) For each u ∈ U ad there exists a unique y(u) ∈ Y such that e(y(u), u) = 0, and the mapping u → y(u) is continuous from U ad to Y .
The following corollary, which under reasonable assumptions guarantees stability for the lower level problem, summarizes the considerations in this subsection.
Corollary 3.1 (Stability) If (H1)–(H8) hold, then (P_{α,y^δ}) is stable with respect to both the data and the regularization parameters.
Proof. This is a direct consequence of Lemma 3.1 in combination with (H7) and (H8).

Optimality conditions
Optimality conditions for the lower level problem can be derived using standard Lagrangian methods. In this subsection we provide the main results needed for our purposes. Thereby we always make the following assumptions.
(A1) e is well-defined and continuously F-differentiable on Y × U .
The following standard result is a special case of a theorem provided in [15].

Proposition 3.2 Let (y*, u*) be a local solution to (P_{α,y^δ}) and assume that e_y(y*, u*) is bijective. Then there exists a unique λ* ∈ Z such that (y*, u*, λ*) is a KKT point of (P_{α,y^δ}). In particular, (y*, u*) satisfies the first order necessary optimality conditions.
Definition 3.4 (Lagrange function) We define the Lagrange function L_α : Y × U_ad × Z → ℝ of the lower level problem by

L_α(y, u, λ) := J_{α,y^δ}(y, u) + ⟨λ, e(y, u)⟩.

Definition 3.5 We say that (y*, u*) satisfies the second order sufficient optimality conditions of (P_{α,y^δ}) if there exist λ* ∈ Z and η > 0 such that (y*, u*, λ*) is a KKT point and

L″_α(y*, u*, λ*)[(δ_y, δ_u)]² ≥ η ‖(δ_y, δ_u)‖²_{Y×U} for all (δ_y, δ_u) ∈ ker e′(y*, u*).

The following result, which can be found in [17], states that the second order sufficient optimality conditions indeed guarantee that (y*, u*) is a strict local solution to (P_{α,y^δ}).
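In the unconstrained case U_ad = U, the KKT conditions referred to above consist of stationarity of the Lagrange function together with feasibility; schematically (a sketch in the notation of Definition 3.4, not a verbatim copy of the paper's system):

```latex
\partial_y \mathcal{L}_\alpha(y^*,u^*,\lambda^*) = 0, \qquad
\partial_u \mathcal{L}_\alpha(y^*,u^*,\lambda^*) = 0, \qquad
e(y^*,u^*) = 0 .
```

The first equation is the adjoint equation determining λ*, the second is the gradient equation in the control, and the third is the state equation.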

Existence of solutions of the learning problem
Using results from the previous section, we can now apply standard arguments to prove that the learning problem has a solution.
Theorem 4.1 If (H1)–(H6) hold, then the learning problem (LP) has a solution.

Proof. We begin by showing that the feasible set of (LP), which is given by

F := {(α, y, u) ∈ [α̲, ᾱ] × Y × U | (y, u) solves (P_{α,y^δ})},

is non-empty and weakly sequentially compact. The non-emptiness of F follows from Proposition 3.1. In order to prove that F is weakly sequentially compact, we argue as follows: As a consequence of the Bolzano–Weierstraß theorem, every sequence (α_n, y_n, u_n) in F has a subsequence (α_{n_k}, y_{n_k}, u_{n_k}) such that

α_{n_k} → α* for some α* ∈ [α̲, ᾱ]. (4.1)

Utilizing that (P_{α,y^δ}) is weakly stable with respect to the regularization parameters (Lemma 3.1), we can assume, possibly after taking another subsequence, that in addition to (4.1)

(y_{n_k}, u_{n_k}) ⇀ (y*, u*) in Y × U

for some (y*, u*) ∈ F_ad which solves (P_{α*,y^δ}). Since (α*, y*, u*) ∈ F, this proves that F is weakly sequentially compact. In view of the fact that a weakly lower semicontinuous function attains a minimum on a non-empty and weakly sequentially compact set (see e.g. [12, Theorem 2.3 on p. 8]), it remains to show that the mapping

(α, y, u) ↦ ‖u − u†‖²_Ũ

is weakly lower semicontinuous on F. This follows from [3, Corollary 3.9 on p. 61], and thus the proof is complete.

Optimality conditions
Throughout this section we make the following assumptions.
(B1) e is well-defined and twice continuously F-differentiable on Y × U .
(B2) U ad = U , i.e. there are no control constraints in the lower level problem.
In a first step towards deriving optimality conditions for the learning problem, we consider its so-called KKT reformulation, in which the lower level problem is replaced by its first order necessary optimality conditions (3.8a)–(3.8c). If the lower level problem is convex for every α ∈ [α̲, ᾱ], then the learning problem (LP) and its KKT reformulation are equivalent. In general this is not the case, since points which satisfy the necessary optimality conditions of the lower level problem are not necessarily solutions to the lower level problem. Before we address this issue, we note that for the KKT reformulation at least, optimality conditions can be derived by standard methods. In the following lemma, the assumption that the second order sufficient optimality condition holds is crucial, and serves as a constraint qualification.
Lemma 5.1 Let (α*, y*, u*, λ*) be a local solution to the KKT reformulation of (LP) with (y*, u*, λ*) satisfying the second order sufficient optimality condition of (P_{α*,y^δ}). Then there exists a unique tuple of adjoint states such that the optimality system (5.1a)–(5.1d) is satisfied.

Proof. A proof is given in the Appendix.
We note that the equalities (5.1b), (5.1c), (5.1d) hold in the spaces Y , U , Z, respectively, and typically represent partial differential equations. Since we want to use the optimality conditions from Lemma 5.1 for the original learning problem, it is important to know when solutions to the learning problem are at least local solutions of its KKT reformulation. This is addressed in the following theorem, where it is implicitly assumed that the lower level problem (P α,y δ ) admits a solution for every α ∈ [ᾱ,ᾱ].
Theorem 5.1 Let (α * , y * , u * ) be a solution to (LP). Assume that the following statements hold.
(i) (P_{α*,y^δ}) is stable with respect to the regularization parameters.

(ii) (y*, u*) satisfies, together with an associated multiplier λ*, the second order sufficient optimality conditions of (P_{α*,y^δ}).

(iii) (P_{α*,y^δ}) has a unique solution.

Then (α*, y*, u*, λ*) is a local solution to the KKT reformulation of (LP).

Proof. A proof is given in the Appendix.
Note that by Corollary 3.1, the first condition in Theorem 5.1 is satisfied if (H1)–(H8) hold. The second condition is also needed to ensure the existence of an optimality system for the KKT reformulation. The third condition seems to be quite restrictive. Unfortunately, as indicated by a counterexample in [10, Example 4.2.1], without the third condition the conclusion of Theorem 5.1 no longer remains true.

Linear state equation
We consider the class of problems with penalty functionals given by

Ψ_i(u) := ½ ‖K_i u‖²_{E_i}, 1 ≤ i ≤ r,

and a linear state equation, i.e.

e(y, u) := Ay − Bu,

where A ∈ L(Y, Z), B ∈ L(U, Z), and K_i ∈ L(U, E_i) with E_i being Hilbert spaces for 1 ≤ i ≤ r. The following assumptions are invoked to ensure that (LP lin) has a solution, and that every solution satisfies appropriate optimality conditions.

(L1) A is continuously invertible.

(L2) There exists ζ > 0 such that Σ_{i=1}^r ‖K_i u‖²_{E_i} ≥ ζ ‖u‖²_U for all u ∈ U.

Proposition 6.1 The learning problem (LP lin) has a solution.

Proof. In view of Theorem 4.1 we only have to verify (H1)–(H6), which can be done using standard arguments.

Optimality conditions
Since the lower level problem in (LP lin ) is strictly convex, the following optimality system can be easily derived from Lemma 5.1 by making a few straightforward computations.
Proposition 6.2 Let (α*, A⁻¹Bu*, u*) be a solution to (LP lin). Then there exists q* ∈ U such that the optimality system (6.1a)–(6.1c) holds. Remark 6.1. Note that (6.1c) is the optimality condition for the reduced lower level problem in the control variable, where we use that (y, u) ∈ F_ad if and only if y = A⁻¹Bu. The adjoint equation for the learning problem is given by (6.1b).
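The adjoint equation of Remark 6.1 gives a practical way to differentiate the bilevel cost with respect to α. The following sketch does this for a reduced discrete linear-quadratic model (all operators and sizes are illustrative, not the paper's) and validates the resulting formula g_i = ⟨K_i q, K_i u⟩ against finite differences.

```python
import numpy as np

# Reduced discrete model: lower level min 0.5*||Shat u - y_delta||^2
# + 0.5*sum_i alpha_i ||K_i u||^2; upper cost G(alpha) = 0.5*||u_alpha - u_dag||^2.
n = 30
rng = np.random.default_rng(2)
Shat = np.tril(np.ones((n, n))) / n                  # stand-in reduced forward operator
K = [np.eye(n), np.eye(n, k=1) - np.eye(n)]          # identity and difference penalties
u_dag = np.cos(np.linspace(0.0, 1.0, n))
y_delta = Shat @ u_dag + 0.01 * rng.standard_normal(n)

def state_and_adjoint(alpha):
    M = Shat.T @ Shat + sum(a * Ki.T @ Ki for a, Ki in zip(alpha, K))
    u = np.linalg.solve(M, Shat.T @ y_delta)         # lower-level solution u_alpha
    q = np.linalg.solve(M, -(u - u_dag))             # adjoint for the upper-level cost
    return u, q

def upper_cost(alpha):
    u, _ = state_and_adjoint(alpha)
    return 0.5 * np.linalg.norm(u - u_dag) ** 2

alpha = np.array([1e-3, 1e-3])
u, q = state_and_adjoint(alpha)
grad = np.array([(Ki @ q) @ (Ki @ u) for Ki in K])   # g_i = <K_i q, K_i u>

# central finite-difference check of the adjoint gradient
eps = 1e-6
fd = np.array([(upper_cost(alpha + eps * e) - upper_cost(alpha - eps * e)) / (2 * eps)
               for e in np.eye(2)])
```

This reduced gradient is exactly what Algorithm 1 below assembles from the Lagrange multiplier q^k in each iteration.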

Bilinear state equation
As an example with a bilinear state equation, we consider the estimation of the diffusion coefficient u in the second order elliptic equation −∇·(u∇y) = f in Ω, y = 0 on ∂Ω, using H^k-regularization, where k = 1 or 2. This leads to the learning problem (LP bil), where Ω ⊂ ℝ^d, for d ∈ ℕ, is a bounded Lipschitz domain, and f ∈ L²(Ω).

Existence of solutions The existence of solutions to (LP bil) can be guaranteed without any additional assumptions.

Proposition 6.3 The learning problem (LP bil ) has a solution.
Proof. In view of Theorem 4.1, it suffices to verify (H1)–(H6). This can be done by applying standard arguments. A proof for the case k = 1 can be found in [10, Proposition 5.4.2]. The same arguments as given there can be used for any k ∈ ℕ.
Optimality conditions We aim at applying the results from Section 5, where we assume (B1)–(B4) to hold. To guarantee (B1) for (LP bil), we require that H^k(Ω) can be continuously embedded into L^∞(Ω), which is the case if and only if k > d/2; see e.g. [1, Theorem 5.4, Examples 5.25 and 5.26]. Recall that the discussion in Section 5 does not cover the case of control constraints. However, here control constraints are needed to ensure that (LP bil) is well-posed. To circumvent this issue, we consider a relaxed version of (LP bil), in which e is replaced by the relaxed state equation ẽ, obtained by substituting φ(u) for the coefficient u. The precise definition of φ is given in the Appendix. One can show that the learning problem with this relaxed state equation fulfills the assumptions of Theorem 4.1, and thus has a solution. For k > d/2, the conditions (B1)–(B4) are satisfied. Consequently, in what follows we can use the results from Section 5 to derive optimality conditions. Let us first state the KKT reformulation of the relaxed problem.
The following result can be seen as a straightforward consequence of Lemma 5.1.

Proposition 6.4 Assume that k > d/2, and let (α*, y*, u*, p*) be a solution to (LP rel). If (y*, u*, p*) satisfies the second order sufficient optimality conditions of (P_{α*,y^δ}) with the relaxed state equation, then there exists a unique multiplier tuple (q*₁, q*₂, q*₃) such that the corresponding optimality system is satisfied.

Numerical experiments
In this section we present results for two numerical experiments regarding learning regularization parameters in weighted H 1 -regularization.

Linear state equation
In the first experiment the inverse problem to be regularized is to estimate the forcing function in a second order elliptic partial differential equation. We let the exact state y† be given as the solution to (1.5) for a prescribed exact control u†. The exact control is shown in Figure 1a. We discretize the problem on a 128 × 128 mesh using the standard five-point stencil for the Laplace operator. Noisy data measurements y^δ_j are generated by pointwise setting

y^δ_j = y† + ε ξ_j, for 1 ≤ j ≤ m,

where ξ_j follows a normal distribution with mean 0 and standard deviation 1, and ε is the product of max |y†| and the relative noise level. We consider the regularization operators K₁ = I, K₂ = ∂_{x₁}, and K₃ = ∂_{x₂}.
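The pointwise noise model described above can be sketched as follows; the exact state, mesh size, seed and noise level below are placeholders, not the experiment's actual data.

```python
import numpy as np

# Measurement model y^delta_j = y^dag + eps*xi_j with eps equal to the
# relative noise level times max|y^dag|.
rng = np.random.default_rng(2012)            # fixed seed, as in the experiments
x = np.linspace(0.0, 1.0, 128)
y_dag = np.sin(np.pi * x)                    # placeholder exact state
noise_level = 0.01                           # 1% relative noise
eps = noise_level * np.max(np.abs(y_dag))
m = 5                                        # number of noisy measurements
y_delta = [y_dag + eps * rng.standard_normal(y_dag.size) for _ in range(m)]
```

Fixing the generator seed makes the m measurements reproducible, which is what allows the different operator choices below to be compared on identical data.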
In Figure 2a, we plot the values of the bilevel cost functional, i.e. the squared distance between the recovered control and the exact control, in dependence on the regularization parameter when using only the single operator K₁ = I (with γ = 0.1), for noise levels of 10%, 5% and 1%. Figure 2b shows the values of the bilevel cost functional in dependence on the regularization parameters when using the operators K₁ and K₂ for 1% noise. Note that in both figures the bilevel cost functional seems to attain a distinct minimum. This supports the feasibility of formulating the choice of regularization parameters as a learning problem. Additionally, the region in which the bilevel cost functional has non-negative curvature seems to be quite small.

Used methods and the solution algorithm
To solve (LP lin) we use a globalized quasi-Newton method. Since modifying the approximate Hessian to be positive definite would result in quite poor performance, we use a different strategy: we perform a regular BFGS update, unless we detect that a descent condition in the BFGS update direction is violated. In that case we instead perform a gradient descent update and reset the approximate Hessian (compare [18, Algorithm 11.5 on p. 60]). In both cases, we perform an Armijo backtracking line search along the search direction. For a warm start, we always begin the iteration with 5 initial gradient descent steps. We terminate the algorithm if the norm of the gradient falls below a certain threshold. In addition, for finer discretizations, we also terminate the algorithm if the Armijo backtracking line search is unsuccessful (which also indicates that we are close to a solution).

Results
We tested the algorithm in MATLAB R2012b for various choices of operators K_i, for different noise levels, as well as for different numbers of available noisy data measurements. To be able to compare results for the different settings, we used a fixed seed for random number generation for each noisy data measurement. We observed the following behavior: • In all tested cases K₁ = I is the best operator to use if only one operator should be used. Using only K₂ = ∂_{x₁} or K₃ = ∂_{x₂} results in quite poor performance (see Figure 3, Tables 1 and 2).
• Adding another regularization operator K_i to any choice of one or two regularization operators improves tracking of the exact control (see Tables 1 and 2).

Algorithm 1: Iterative method for parameter learning with a linear state equation.
Data: Let α⁰ be given. Define H⁰ := I. Compute u⁰ solving the lower level problem for α⁰ and the corresponding Lagrange multiplier q⁰ from the optimality system in Proposition 6.2. For 1 ≤ i ≤ 3, set g⁰_i := ⟨K_i q⁰, K_i u⁰⟩ whenever the operator K_i is used, and set d⁰ := −g⁰, k := 0.
while ‖g^k‖²₂ > tolerance do
    if ⟨g^k, d^k⟩ < −min{c₁, c₂ ‖d^k‖²₂} ‖d^k‖²₂ and k ≥ 5 then
        Perform Armijo backtracking line search along d^k, set k := k + 1 and update α^k;
    else
        H^k := I;
        Perform Armijo backtracking line search along −g^k, set k := k + 1 and update α^k;
    Compute u^k solving the lower level problem for α^k and the corresponding Lagrange multiplier q^k.
    Set g^k_i := ⟨K_i q^k, K_i u^k⟩ whenever the operator K_i is used.
    Update the approximate Hessian H^k and compute the BFGS update direction d^k.
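The globalization strategy of Algorithm 1 (a BFGS step when the descent condition holds and k ≥ 5, a gradient step with Hessian reset otherwise, Armijo backtracking in both cases) can be sketched generically. The constants, the toy objective in the usage below, and the update details are illustrative choices, not the paper's implementation.

```python
import numpy as np

def solve(f, grad, x0, c1=1e-4, c2=1e-4, tol=1e-8, max_iter=200):
    x = x0.astype(float)
    H = np.eye(x.size)              # approximate inverse Hessian
    g = grad(x)
    d = -g
    for k in range(max_iter):
        if g @ g <= tol:
            break
        # descent condition for the BFGS direction; the first 5 steps are gradient steps
        if k >= 5 and g @ d < -min(c1, c2 * (d @ d)) * (d @ d):
            step = d
        else:
            H = np.eye(x.size)      # reset the Hessian approximation
            step = -g
        # Armijo backtracking line search along the chosen direction
        t, fx = 1.0, f(x)
        while f(x + t * step) > fx + 1e-4 * t * (g @ step):
            t *= 0.5
        s = t * step
        x_new = x + s
        g_new = grad(x_new)
        yv = g_new - g
        if yv @ s > 1e-12:          # standard BFGS update of the inverse Hessian
            rho = 1.0 / (yv @ s)
            V = np.eye(x.size) - rho * np.outer(s, yv)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
        d = -H @ g                  # quasi-Newton direction for the next iteration
    return x
```

On a smooth toy objective, e.g. a convex quadratic, the sketch converges to the minimizer; in the paper's setting `f` and `grad` would instead evaluate the bilevel cost and its adjoint-based gradient from Proposition 6.2.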
• Using K₂ = ∂_{x₁} and K₃ = ∂_{x₂} is the best choice amongst the two-operator cases. The performance using these two operators is only slightly inferior to the performance using all three operators (see Tables 1 and 2). This suggests that in this case the additional use of the regularization operator K₁ is not necessary.
• When using multiple noisy data measurements y δ j with the same statistical structure, the ability to track u † is significantly improved, as we would expect (compare Table 1 with Table 2).

Bilinear state equation
In the second numerical experiment the inverse problem is to estimate the diffusion coefficient in a second order elliptic partial differential equation. Note that since we use H 1 -regularization for dimension d = 2, the optimality system used to compute optimal regularization parameters is only obtained by formal computations.
Recall that φ : ℝ → ℝ was introduced in Section 6.2 to avoid control constraints. In the state equation, we choose f ∈ L²(Ω) such that the exact state y† corresponding to the prescribed exact control u† takes a simple explicit form. The exact control is shown in Figure 5a. We discretize the problem on a 64 × 64 mesh using Lagrange P1 finite elements. Noisy data measurements y^δ_j are generated by pointwise setting y^δ_j = y† + ε ξ_j, for 1 ≤ j ≤ m, where ξ_j follows a normal distribution with mean 0 and standard deviation 1, and ε is the product of max |y†| and the relative noise level. We consider the regularization operators K₁ = I, K₂ = ∂_{x₁}, and K₃ = ∂_{x₂}.

Used methods and the solution algorithm We used nearly the same globalized quasi-Newton method as for the linear state equation. The only significant difference is that here a solution to the lower level problem is computed using the sequential programming method (SP method for short) from [11]. We terminated the algorithm if the norm of the gradient fell below a certain threshold. In addition, for finer discretizations, we also terminated the algorithm if the Armijo backtracking line search was unsuccessful (which also indicates that we are close to a solution).
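For intuition, the bilinear state equation can be sketched in one space dimension with a standard finite-difference discretization of −(u y′)′ = f, an illustrative stand-in for the 2D P1 finite element discretization used in the experiments.

```python
import numpy as np

def solve_state(u, f, h):
    """Solve -(u y')' = f on (0,1) with y(0)=y(1)=0 by finite differences.

    u, f are given at the n interior nodes; u is assumed uniformly positive.
    """
    n = f.size
    # arithmetic average of the coefficient u at the cell interfaces
    um = np.empty(n + 1)
    um[1:-1] = 0.5 * (u[:-1] + u[1:])
    um[0], um[-1] = u[0], u[-1]
    # tridiagonal stiffness matrix of the flux form of the equation
    A = (np.diag(um[:-1] + um[1:])
         - np.diag(um[1:-1], 1)
         - np.diag(um[1:-1], -1)) / h**2
    return np.linalg.solve(A, f)

n = 99
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
y = solve_state(np.ones(n), np.full(n, 2.0), h)   # with u = 1: -y'' = 2, so y = x(1-x)
```

The bilinearity is visible in `A`: the system matrix depends linearly on the coefficient u, and the state depends linearly on f, but the map u ↦ y is non-linear.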

Algorithm 2: Iterative method for (LP bil )
Data: Let α⁰ be given. Define H⁰ := I. Compute u⁰ solving the lower level problem for α⁰ using the SP method, and the corresponding Lagrange multiplier p⁰ and (q⁰₁, q⁰₂, q⁰₃) as in Proposition 6.4. For 1 ≤ i ≤ 3, set g⁰_i := ⟨K_i q⁰₂, K_i u⁰⟩ whenever the operator K_i is used, and set d⁰ := −g⁰, k := 0.
while ‖g^k‖²₂ > tolerance do
    if ⟨g^k, d^k⟩ < −min{c₁, c₂ ‖d^k‖²₂} ‖d^k‖²₂ and k ≥ 5 then
        Perform Armijo backtracking line search along d^k, set k := k + 1 and update α^k;
    else
        H^k := I;
        Perform Armijo backtracking line search along −g^k, set k := k + 1 and update α^k;
    Compute u^k solving the lower level problem for α^k using the SP method, and the corresponding Lagrange multiplier p^k and (q^k₁, q^k₂, q^k₃) as in Proposition 6.4.
    For 1 ≤ i ≤ 3, set g^k_i := ⟨K_i q^k₂, K_i u^k⟩ whenever the operator K_i is used.
    Update the approximate Hessian H^k and compute the BFGS update direction d^k.
Results We tested the algorithm in MATLAB R2012b for various choices of operators K_i, for different noise levels, as well as for different numbers of available noisy data measurements. As for the linear state equation, we used a fixed seed for random number generation for each noisy data measurement. We observed the following behaviour: • K₁ = I was the only choice of a single operator which led to meaningful results (see Figure 6a).
• Adding another regularization operator K_i to any choice of one or two regularization operators generally improves tracking of the exact control (see Table 3). There is an outlier to this claim when comparing the use of the single operator K₁ with the use of the two regularization operators K₁ and K₃ for m = 20 (see Table 4).
• Using K₂ = ∂_{x₁} and K₃ = ∂_{x₂} is the best choice amongst the two-operator cases. The performance using these two operators is only slightly inferior to using all three operators (see Table 3). This suggests that in this case the additional use of the regularization operator K₁ is not necessary.
• When using multiple noisy data measurements y^δ_j with the same statistical structure, the ability to track u† is generally improved, as we would expect (compare Tables 3 and 4). Note that this is not the case when comparing m = 5 to m = 20 for the operators K₁ and K₃, but since the data was generated by a random process, this does not contradict the theory.
• Using the operator K 2 = ∂ x 1 seems to be more significant for the quality of the reconstructions than using K 3 = ∂ x 2 (see Table 3 and 4). This is also indicated by observing that the obtained optimal regularization parameter for K 2 is usually larger than the optimal regularization parameter for K 3 when using both operators.
• When we only use unilateral regularization associated to K₂ = ∂_{x₁} or K₃ = ∂_{x₂} together with L²-regularization, the optimal u* usually suffers from over-smoothing in the penalized direction. In contrast, u* can have rapid changes in the direction which is not penalized.

Table 3 Locally optimal α* for different sets of K_i with 3% noise and m = 5.
• We expect difficulties reconstructing u† at stationary points of y† (see [9, p. 24]). A simple computation shows that (x₁, x₂) is a stationary point of y† if and only if one of the following statements is true:
a) x₁ = 0 (line segment along the x₂-axis),
b) |x₁| = 1 and |x₂| = 1 (edges of the domain),
c) |x₁| = 1/2 and x₂ = 0.
Here we have continuously extended the gradient of y† to the boundary of the domain. Difficulties reconstructing u† near the edges of the domain can be seen in Figure 6a. Since in this case there is no additional smoothing in any of the directions, the values of the reconstructed u* near the edges tend to zero. Difficulties reconstructing u† near the x₂-axis can be seen in Figures 6a and 6b. Note that smoothing in the x₁-direction, however, largely prevents the issues near the x₂-axis, as we can see in Figure 6c.

Figure 6d: using K₁ = I, K₂ = ∂_{x₁} and K₃ = ∂_{x₂}, with optimal α* = (9.59 × 10⁻⁸, 2.23 × 10⁻⁷, 6.36 × 10⁻⁸) and ‖u* − u†‖²₂ = 0.13786.

Table 4 Locally optimal α* for different sets of K_i with 3% noise and m = 20.

Outlook
An open question which deserves to be investigated in the future is how learned regularization parameters can be used in structurally related, but different, problems. While in some cases learned parameters might be used directly, we suggest that in other cases they should merely be used as relative weights between multiple penalty terms, with an additional weight for the sum of all penalty terms still being determined by a classical parameter choice strategy. We also point out that the ability to compute optimal regularization parameters provides the opportunity to evaluate how well classical parameter choice strategies are performing. Another direction for further research could be to consider more general learning problems, such as learning filters and problems with non-smooth lower level problems.
As in the proof of Lemma 5.1, one can show bijectivity of the mapping

(δ_y, δ_u, δ_λ) ↦
[ F_yy + λ* e_yy   F_yu + λ* e_yu   e_y* ] [ δ_y ]
[ F_yu + λ* e_yu   F_uu + λ* e_uu   e_u* ] [ δ_u ]
[ e_y              e_u              0    ] [ δ_λ ].
Thus, by the implicit function theorem, there exist neighbourhoods I of α* and V of (y*, u*, λ*) and a continuously F-differentiable function Φ : I → V such that for all α ∈ I and (y, u, λ) ∈ V it holds that

F_y(α, y, u) + λ e_y(y, u) = 0,  F_u(α, y, u) + λ e_u(y, u) = 0,  e(y, u) = 0

if and only if (y, u, λ) = Φ(α). A standard argument can be used to show that I can be chosen such that the second order sufficient optimality conditions of (P_{α,y^δ}) still hold at Φ(α) = (y_α, u_α, λ_α) for every α ∈ I. Consequently, (y_α, u_α, λ_α) is a local solution to the lower level problem for every α ∈ I. We now claim that there exists a neighbourhood J of α* contained in I such that Φ(α) is a global solution to the lower level problem for every α ∈ J. We prove this by contradiction: if the claim were false, there would be a sequence (α_n) in I with α_n → α* and an associated sequence (y_n, u_n, λ_n) of solutions to (P_{α_n,y^δ}) which does not intersect V. Using the stability assumption, the uniqueness assumption, and that e_y(y*, u*) is bijective, it is straightforward to see that (y_n, u_n, λ_n) must converge to (y*, u*, λ*). However, since the sequence was chosen such that (y_n, u_n, λ_n) ∉ V for all n ∈ ℕ, this leads to a contradiction. This shows that (α*, y*, u*, λ*) is a solution to the KKT reformulation restricted to J × V, i.e. a local solution to the KKT reformulation of (LP). This completes the proof.
Remark. We point out that the proofs of Lemma 5.1 and Theorem 5.1 do not rely on the specific structure of the learning problem. Thus, these results can be applied to more general bilevel optimization problems of the form

min_{(α, y_α, u_α) ∈ C × Y × U} G(α, y_α, u_α)  s.t.  (y_α, u_α) ∈ argmin_{(y,u) ∈ Y × U} { F(α, y, u) | e(y, u) = 0 },

for given G, F : X × Y × U → ℝ, e : Y × U → Z, and a closed and convex set C ⊆ X, where Y, U are Hilbert spaces and X, Z are Banach spaces.

C Used functions
The function φ : ℝ → ℝ is constructed such that it is bounded from below by a positive constant, so that the relaxed diffusion coefficient φ(u) stays uniformly positive without constraining u. In particular, one can verify that φ ∈ C³(ℝ, ℝ).
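The paper's exact definition of φ is not reproduced here; as an illustration, one admissible smoothing of this type (smooth, monotone, bounded below by a positive constant c) is a shifted softplus:

```python
import numpy as np

# Illustrative choice, not the paper's definition: phi(t) = c + log(1 + e^t),
# which is C^infinity, strictly increasing, and bounded below by c > 0.
def phi(t, c=0.1):
    return c + np.logaddexp(0.0, t)   # numerically stable log(1 + e^t) plus shift
```

For large negative arguments φ flattens out near c, and for large positive arguments φ(t) ≈ c + t, so the coefficient is only modified where positivity would otherwise be at risk.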