Iterative Surrogate Model Optimization (ISMO): An active learning algorithm for PDE constrained optimization with deep neural networks

We present a novel active learning algorithm, termed iterative surrogate model optimization (ISMO), for robust and efficient numerical approximation of PDE constrained optimization problems. The algorithm is based on deep neural networks; its key feature is the iterative selection of training data through a feedback loop between the deep neural networks and the underlying standard optimization algorithm. Under suitable hypotheses, we show that the resulting optimizers converge exponentially fast (and with exponentially decaying variance) with respect to the number of training samples. Numerical examples for optimal control, parameter identification and shape optimization problems for PDEs are provided to validate the proposed theory and to illustrate that ISMO significantly outperforms a standard deep neural network based surrogate optimization algorithm.


Introduction
A large number of problems of interest in engineering reduce to the following optimization problem:

Find ȳ ∈ Y : ȳ = arg min_{y ∈ Y} (G ∘ L)(y). (1.1)

Here, Y ⊂ R^d, the underlying map (observable) is L : Y → Z, with Z ⊂ R^m or Z a subset of an infinite-dimensional Banach space, and G : Z → R is the cost or objective function. We can think of Y as a design or control space, with any y ∈ Y being a vector of design or control parameters that steers the system to minimize the objective function. Many engineering systems of interest are modeled by partial differential equations (PDEs). Thus, evaluating the desired observable L, and consequently the objective function in (1.1), requires (approximate) solutions of the underlying PDE. The resulting problem is termed a PDE constrained optimization problem; see [4,46] and references therein.
A representative example of PDE constrained optimization is the problem of designing the wing (or airfoils) of an aircraft in order to minimize drag [40,31]. Here, the design variables are the parameters that describe the wing shape. The underlying PDEs are the compressible Euler or Navier-Stokes equations of gas dynamics, and the objective function is the drag coefficient (while keeping the lift coefficient within a prescribed range) or the lift-to-drag ratio. Several other examples of PDE constrained optimization problems arise in fluid dynamics, solid mechanics and the geosciences [4].
A wide variety of numerical methods have been devised to solve optimization problems in general, and PDE constrained optimization problems in particular [4,46]. A large class of methods are iterative and require evaluation of the gradients of the objective function in (1.1) with respect to the design (control) parameters y. These include first-order methods based on gradient descent and second-order methods such as quasi-Newton and truncated Newton methods [13]. Evaluating gradients in the context of PDE constrained optimization often involves the (numerical) solution of adjoints (duals) of the underlying PDE [4]. Other types of optimization methods, such as particle swarm optimization [8] and genetic algorithms [44], are gradient free and are particularly suited for optimization constrained by PDEs with rough solutions.
Despite the well documented success of some of the aforementioned algorithms in the context of PDE constrained optimization, the solution of realistic optimization problems involving a large number of design parameters (d ≫ 1 in (1.1)), with the underlying PDEs set in complex geometries in several space dimensions or describing multiscale, multiphysics phenomena, remains a considerable challenge. This is mostly on account of the fact that applying the above methods entails evaluating the objective function, and hence the underlying map L and its gradients, a large number of times (in the range of 10^2 to 10^4). As a single call to the underlying PDE solver can be very expensive, solving the underlying PDEs (and their adjoints) this many times can be prohibitively expensive.
One possible framework for reducing the computational cost of PDE constrained optimization problems is to use surrogates [14]. This approach consists of generating training data, i.e. computing L(y) for all y ∈ S, with S ⊂ Y denoting a training set. Then, a surrogate model is constructed by designing a surrogate map L̂ : Y → Z such that L̂(y) ≈ L(y) for all y ∈ S. Finally, one runs a standard optimization algorithm, such as gradient descent or its variants, while evaluating the functions and gradients in (1.1) with respect to the surrogate map L̂. The surrogate model is effective as long as L̂ ≈ L in a suitable sense for all y ∈ Y, and the cost of evaluating the surrogate map L̂ is significantly lower than the cost of evaluating the underlying map L. Examples of such surrogate models include reduced order models [34] and Gaussian process regression [38].
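The surrogate workflow can be sketched in a few lines. In the illustration below (a minimal sketch, not the method of this paper), a hypothetical one-dimensional "expensive" map stands in for the PDE solver, and a cubic polynomial fit plays the role of the surrogate map:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical "expensive" parameters-to-observable map (stand-in for a PDE solve).
def L_expensive(y):
    return np.sin(3.0 * y) + y ** 2

# Objective G: squared distance of the observable to a target value.
def G(z, target=0.5):
    return (z - target) ** 2

# 1) Generate training data on a small set S (few expensive evaluations).
S = np.linspace(0.0, 1.0, 12)
data = L_expensive(S)

# 2) Build a cheap surrogate map (here: a cubic polynomial fit).
coeffs = np.polyfit(S, data, deg=3)
L_hat = lambda y: np.polyval(coeffs, y)

# 3) Run a standard optimizer on the surrogate objective instead of the expensive one.
res = minimize(lambda y: G(L_hat(y[0])), x0=[0.5], bounds=[(0.0, 1.0)])
y_star = res.x[0]
```

All expensive evaluations occur in step 1; the optimizer in step 3 only ever queries the cheap surrogate.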
A particularly attractive class of such surrogate models are deep neural networks (DNNs) [15], i.e. functions formed by multiple compositions of affine transformations and scalar non-linear activation functions. Deep neural networks have been extremely successful at diverse tasks in science and engineering [22], such as image and text classification, computer vision, text and speech recognition, autonomous systems and robotics, game intelligence and even protein folding [12].
Deep neural networks are also increasingly used in different contexts in scientific computing. A very incomplete list of the rapidly growing literature includes the use of deep neural networks to approximate solutions of PDEs by so-called physics informed neural networks (PINNs) [21, 35-37, 28, 29] and references therein, solutions of high-dimensional PDEs, particularly in finance [16,11,2] and references therein, and improving the efficiency of existing numerical methods for PDEs, for instance [45,39,27] and references therein.
A different approach is taken in the recent papers [24,23,30], where the authors presented supervised deep learning algorithms to efficiently approximate observables (quantities of interest) of solutions of PDEs, and applied them to speed up existing (quasi-)Monte Carlo algorithms for forward uncertainty quantification (UQ). These papers demonstrated that deep neural networks can serve as effective surrogates for observables in the context of PDEs. Given the preceding discussion, it is natural to take a step further and ask whether the resulting surrogates can be used for PDE constrained optimization. This question constitutes the central premise of the current paper.
Our first aim in this article is to present an algorithm for PDE constrained optimization that combines standard (gradient based) optimization algorithms with deep neural network surrogates for the underlying map in (1.1). The resulting algorithm, which we term DNNopt, is described and examined both theoretically (under suitable hypotheses on the underlying problem) and in numerical experiments. We find that although DNNopt can be a viable algorithm for PDE constrained optimization and converges to a (local) minimum of the underlying optimization problem (1.1), it is not robust enough: it can lead to a high variance, or sensitivity, of the resulting (approximate) minima with respect to the starting values of the underlying optimization algorithm. A careful analysis reveals that a key cause of this high variance (large sensitivity) is the fact that the training set for the deep neural network surrogate is fixed a priori, based on global approximation requirements. This training set may not represent well, if at all, the subset (in parameter space) of minimizers of the objective function in (1.1). The resulting high variance can impede the efficiency of the algorithm and needs to be remedied.
To this end, the second and main aim of this paper is to propose a novel active learning procedure that iteratively augments the training sets used to train a sequence of deep neural networks, each providing a successively better approximation of the underlying minimizers of (1.1). The additions to the training sets, in turn, are the local minimizers of (1.1) identified by running a standard optimization algorithm on the neural network surrogate from the previous step of the iteration. This feedback between training neural networks for the observable L in (1.1) and adding local optimizers as training points leads to the proposed algorithm, which we term iterative surrogate model optimization (ISMO). Thus, ISMO can be thought of as an example of an active learning algorithm [43], where the learner (deep neural network) queries the teacher (standard optimization algorithm) to iteratively identify training data that provides a better approximation of local optima. We analyze ISMO theoretically and find that the proposed algorithm approximates minimizers of (1.1) with remarkably low variance, or sensitivity, with respect to the starting values of the optimization algorithm. In particular, at least in restricted settings, ISMO converges exponentially fast to the underlying minima, with the variance also decaying exponentially fast. This behavior is observed in several numerical experiments. Thus, ISMO is shown to be a very effective framework for PDE constrained optimization.
The rest of this paper is organized as follows: in section 2, we collect preliminary material. The DNNopt and ISMO algorithms for PDE constrained optimization are presented in sections 3 and 4, respectively, and numerical experiments are reported in section 5.

The underlying constrained optimization problem
The basis of our constrained optimization problem is the following time-dependent parametric PDE, written schematically as

∂_t U = L(y, x, U, ∇_x U, ∇_x² U), (x, t) ∈ D(y) × (0, T), L_b U = 0 on ∂D(y) × (0, T), U(x, 0) = U_0(x, y). (2.1)

Here, Y ⊂ R^d is the underlying parameter space that describes the design (control) variables y ∈ Y. For simplicity of notation and exposition, we set Y = [0, 1]^d for the rest of this paper. The spatial domain may depend on the parameters through the map y ↦ D(y) ⊂ R^{d_s}, and U : [0, T] × D × Y → R^n is the vector of unknowns. The differential operator L is of a very generic form and can depend on the gradient and Hessian of U, and possibly on higher-order spatial derivatives; the heat equation as well as the Euler or Navier-Stokes equations of fluid dynamics are specific examples. Moreover, L_b is a generic operator for imposing boundary conditions.
For the parametrized PDE (2.1), we define a generic form of the parameters-to-observable map,

L : Y → R^m, y ↦ L(y) := L(U(·, ·, y)). (2.2)

Finally, we define the cost (objective or goal) function G with G ∈ C²(R^m; R). A very relevant example of such a cost function is given by

G(L(y)) = |L(y) − L̄|^p, (2.3)

for some target L̄ ∈ R^m, with 1 ≤ p < ∞ and |·| denoting a vector norm. The resulting constrained optimization problem is given by:

Find ȳ = arg min_{y ∈ Y} G(L(y)), (2.4)

with the observable defined by (2.2) and a C² objective function, for instance the one defined in (2.3). We also denote, for notational convenience,

G(y) := G(L(y)). (2.5)

Standard optimization algorithms
For definiteness, we restrict ourselves to the class of (quasi-)Newton algorithms as our standard optimization algorithm in this paper. We assume that the cost function G (2.5) satisfies G ∈ C²(Y), and we are interested in computing (approximate, local) minimizers of the optimization problem (2.4). A generic quasi-Newton algorithm has the following form.

Algorithm 2.1. Quasi-Newton approximation of (2.4)
Inputs: Underlying cost function G (2.5), starting value y_0 ∈ Y, starting Hessian approximation B_0 ∈ R^{d×d}, tolerance parameter ε.
Goal: Find a (local) minimizer of the optimization problem (2.4).
Step 1: At iterate y_k, solve the linear system B_k p_k = −∇G(y_k) for the search direction p_k. (2.6)
Step 2: Update the iterate, y_{k+1} = y_k + λ_k p_k, with the step size λ_k determined by a line search. (2.7)
Step 3: Update the Hessian approximation B_{k+1} from the gradients ∇G(y_k), ∇G(y_{k+1}). (2.8)
Terminate the iteration when |∇G(y_k)| < ε.
Different formulas for the Hessian update (2.8) lead to different versions of the quasi-Newton algorithm 2.1. For instance, the widely used BFGS algorithm [13] uses an update formula (2.8) such that the inversion step (2.6) can be computed directly, via the well-known Sherman-Morrison formula, from a recurrence relation. Other versions of quasi-Newton algorithms, in particular truncated Newton algorithms, use iterative linear solvers to compute the solution of (2.6). Note that quasi-Newton methods do not explicitly require the Hessian of the objective function in (2.4), but estimate it from gradients via (2.8). Thus, only information on the gradient of the cost function G (2.5) is required.
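As an illustration of this point, the sketch below (a toy example with an illustrative quadratic objective, not part of the paper's algorithms) maintains the inverse Hessian approximation directly via the BFGS rank-two update, so that the inversion step (2.6) reduces to a matrix-vector product and only gradients of the objective are required:

```python
import numpy as np

def bfgs(grad, y0, max_iter=50, tol=1e-8):
    """Minimal BFGS iteration maintaining an inverse Hessian approximation H_k,
    so that the quasi-Newton system B_k p = -grad is never solved by inversion."""
    y = np.asarray(y0, dtype=float)
    n = len(y)
    H = np.eye(n)                    # initial inverse Hessian approximation
    g = grad(y)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                   # search direction: a matrix-vector product only
        y_new = y + p                # unit step; a line search would be used in practice
        g_new = grad(y_new)
        s, d = y_new - y, g_new - g
        rho = 1.0 / (d @ s)          # curvature term; positive for convex objectives
        I = np.eye(n)
        # Rank-two (Sherman-Morrison type) update of the inverse Hessian.
        H = (I - rho * np.outer(s, d)) @ H @ (I - rho * np.outer(d, s)) + rho * np.outer(s, s)
        y, g = y_new, g_new
    return y

# Quadratic test objective G(y) = 0.5 y^T A y - b^T y, with unique minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
y_min = bfgs(lambda y: A @ y - b, np.zeros(2))
```

Note that only the gradient map `grad` is supplied; the Hessian is estimated internally from successive gradients, as described above.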
We observe that implementing algorithm 2.1 in the specific context of the optimization problem (2.4), constrained by the PDE (2.1), requires evaluating the map L and its gradients multiple times during each iteration. Thus, a large number of iterations of algorithm 2.1 entails a very high computational cost, on account of the large number of calls to the underlying PDE solver.

Neural network surrogates
As mentioned in the introduction, we will employ neural network based surrogates for the underlying observable L (2.2) in order to reduce the computational cost of employing the standard optimization algorithm 2.1 for approximating solutions of the optimization problem (2.4). To this end, we need the following ingredients.

Training set
As is customary in supervised learning ([15] and references therein), we need to generate or obtain data to train the network. To this end, we fix N ∈ N and select a set of points S = {y_i}, 1 ≤ i ≤ N, with each y_i ∈ Y. It is standard in machine learning to choose the points in the training set S randomly from the parameter space Y, independently and identically distributed with respect to the Lebesgue measure. However, following the recent papers [24,30], we can instead choose low-discrepancy sequences [5,33], such as Sobol or Halton sequences, as training points, in order to obtain better rates of convergence of the resulting generalization error. For Y in low dimensions, one can even choose (composite) Gauss quadrature points as viable training sets [28].
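For instance, Sobol training points in Y = [0, 1]^d can be generated with SciPy's quasi-Monte Carlo module; the sketch below (dimension and sample size are arbitrary illustrative choices) also compares their discrepancy, a standard measure of equidistribution, to that of i.i.d. random points:

```python
import numpy as np
from scipy.stats import qmc

d, N = 2, 64                          # parameter dimension and number of training points
sampler = qmc.Sobol(d=d, scramble=False)
S = sampler.random(N)                 # N Sobol points in Y = [0, 1]^d

# Compare against an i.i.d. uniformly random training set of the same size.
rng = np.random.default_rng(0)
S_random = rng.random((N, d))

# Low-discrepancy points cover Y more evenly than i.i.d. random points.
disc_sobol = qmc.discrepancy(S)
disc_random = qmc.discrepancy(S_random)
```

Choosing N as a power of two preserves the balance properties of the Sobol sequence.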

Deep neural networks
Given an input vector y ∈ Y, a feedforward neural network (also termed a multi-layer perceptron), shown in figure 1, transforms it to an output through layers of units (neurons), consisting of either affine-linear maps between units (in successive layers) or scalar non-linear activation functions within units [15], resulting in the representation

L_θ(y) = C_K ∘ σ ∘ C_{K−1} ∘ … ∘ σ ∘ C_1(y). (2.9)

Here, ∘ refers to the composition of functions and σ is a scalar (non-linear) activation function. A large variety of activation functions have been considered in the machine learning literature [15]. Popular choices for the activation function σ in (2.9) include the sigmoid function, the tanh function and the ReLU function, defined by

σ(z) = max(z, 0). (2.10)

When z ∈ R^p for some p > 1, the ReLU function in (2.10) is applied componentwise. For any 1 ≤ k ≤ K, we define the affine maps

C_k z_k = W_k z_k + b_k, with W_k ∈ R^{d_{k+1} × d_k}, b_k ∈ R^{d_{k+1}}. (2.11)

For consistency of notation, we set d_1 = d and d_{K+1} = 1. Thus, in the terminology of machine learning (see also figure 1), our neural network (2.9) consists of an input layer, an output layer and (K − 1) hidden layers, for some 1 < K ∈ N. The k-th hidden layer (with d_{k+1} neurons) receives an input vector z_k ∈ R^{d_k} and transforms it first by the affine-linear map C_k (2.11) and then by the ReLU (or another) componentwise nonlinear activation σ (2.10). A straightforward addition shows that our network contains d + 1 + Σ_{k=2}^{K} d_k neurons. We also denote by

θ := {W_k, b_k}, 1 ≤ k ≤ K, (2.12)

the concatenated set of (tunable) weights for our network, with θ ∈ Θ ⊂ R^M. It is straightforward to check that the number of tunable parameters is

M = Σ_{k=1}^{K} (d_k + 1) d_{k+1}. (2.13)
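A minimal NumPy sketch of the representation (2.9), with randomly initialized affine maps C_k and ReLU activations (the layer widths here are arbitrary illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)       # sigma(z) = max(z, 0), applied componentwise

def init_network(layer_sizes, rng):
    """Random affine maps C_k(z) = W_k z + b_k, one per layer."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(theta, y):
    """L_theta = C_K o sigma o C_{K-1} o ... o sigma o C_1 (y):
    alternate affine maps and activations; the final affine map has no activation."""
    z = y
    for W, b in theta[:-1]:
        z = relu(W @ z + b)
    W, b = theta[-1]
    return W @ z + b

rng = np.random.default_rng(0)
# Input dimension d = 2, two hidden layers of width 8, scalar output.
theta = init_network([2, 8, 8, 1], rng)
out = forward(theta, np.array([0.3, 0.7]))
```

For these widths, the parameter count is (2+1)·8 + (8+1)·8 + (8+1)·1 = 105, consistent with the formula for M above.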

Loss functions and optimization
For any y ∈ S, one can readily compute the output of the neural network L_θ(y) for any weight vector θ ∈ Θ. We define the so-called training loss function as

J(θ) := Σ_{y ∈ S} |L(y) − L_θ(y)|^p, (2.14)

for some 1 ≤ p < ∞.
The goal of the training process in machine learning is to find a weight vector θ ∈ Θ for which the loss function (2.14) is minimized,

θ* = arg min_{θ ∈ Θ} J(θ). (2.15)
The above minimization problem amounts to finding a minimum of a possibly non-convex function over a subset of R^M, for possibly very large M. We follow standard practice in machine learning and (approximately) solve (2.15) either with a full-batch gradient descent algorithm or with variants of mini-batch stochastic gradient descent (SGD), such as ADAM [19].
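As a toy illustration of this training step, the sketch below minimizes a loss of the form (2.14) (with p = 2) by full-batch gradient descent, for a surrogate that is linear in its weights; the map, features and learning rate are illustrative stand-ins, not the networks used in this paper:

```python
import numpy as np

# Training data: observable values L(y_i) at the training points y_i in S.
rng = np.random.default_rng(1)
S = rng.random(20)
targets = np.sin(2.0 * S)            # stand-in for the expensive map L

# A surrogate that is linear in its weights theta:
# L_theta(y) = theta_0 + theta_1 * y + theta_2 * y^2.
features = np.stack([np.ones_like(S), S, S ** 2], axis=1)

def loss(theta):
    """Training loss J(theta) = sum_i |L(y_i) - L_theta(y_i)|^2 (p = 2)."""
    return np.sum((targets - features @ theta) ** 2)

# Full-batch gradient descent on J (a stand-in for SGD/ADAM).
theta = np.zeros(3)
lr = 0.01
for _ in range(5000):
    grad = -2.0 * features.T @ (targets - features @ theta)
    theta -= lr * grad
```

Because the surrogate is linear in θ here, the loss is convex and gradient descent converges; for deep networks the same loop is non-convex and only an approximate local minimum is reached.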
For notational simplicity, we denote the (approximate, local) minimizing weight vector of (2.15) as θ*; the corresponding deep neural network L* = L_{θ*} will be our neural network surrogate for the underlying map L.
The proposed algorithm for computing this neural network is summarized below.

Algorithm 2.2. Deep learning of parameters to observable map
Inputs: Underlying map L (2.2).
Goal: Find neural network L θ * for approximating the underlying map L.
Step 1: Choose the training set S = {y_n}, with y_n ∈ Y for all 1 ≤ n ≤ N, such that the sequence {y_n} is either randomly chosen, a low-discrepancy sequence, or another set of quadrature points in Y. Evaluate L(y) for all y ∈ S by a suitable numerical method.
Step 2: For an initial value of the weight vector θ ∈ Θ, evaluate the neural network L_θ (2.9), the loss function (2.14) and its gradients to initialize the (stochastic) gradient descent algorithm.
Step 3: Run a stochastic gradient descent algorithm until an approximate local minimum θ* of (2.15) is reached. The map L* = L_{θ*} is the desired neural network approximating the map L.

The DNNopt algorithm

Combining the deep learning algorithm 2.2 with the standard optimization algorithm 2.1 yields the following algorithm.

Algorithm 3.1. DNNopt: deep neural network based surrogate optimization
Goal: Compute (approximate) minimizers of the PDE constrained optimization problem (2.4).
Step 1: Given a training set S = {y_i}, with y_i ∈ Y for 1 ≤ i ≤ N, generate the deep neural network surrogate map L* by running the deep learning algorithm 2.2. Set G*(y) = G(L*(y)) for all y ∈ Y, with the cost function G defined in (2.5).
Step 2: Draw N random starting points ỹ_1, …, ỹ_N in Y. For each 1 ≤ i ≤ N, run the standard optimization algorithm 2.1 with cost function G* and starting value ỹ_i, up to the given tolerance, to obtain a set {ȳ_i} of approximate minimizers of the optimization problem (2.4).
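The two steps above can be sketched as follows; here a hypothetical one-dimensional observable and a polynomial fit stand in for the PDE solver and the DNN surrogate, and the spread of the true cost over the random starts illustrates the sensitivity to starting values discussed in the introduction:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical "expensive" observable and cost (stand-ins for the PDE solve and G).
L_true = lambda y: np.sin(4.0 * np.pi * y) * np.exp(-y)
G = lambda z: z ** 2

# Step 1: a priori training set and surrogate (a polynomial stand-in for the DNN L*).
S = np.linspace(0.0, 1.0, 16)
coeffs = np.polyfit(S, L_true(S), deg=10)
G_star = lambda y: G(np.polyval(coeffs, y[0]))

# Step 2: multistart optimization on the surrogate cost G*.
rng = np.random.default_rng(2)
minimizers = [minimize(G_star, x0=[y0], bounds=[(0.0, 1.0)]).x[0]
              for y0 in rng.random(8)]

# Sensitivity check: the spread of the *true* cost over the computed minimizers
# measures the dependence on the starting values.
values_true = [G(L_true(y)) for y in minimizers]
spread = np.std(values_true)
```

Since the surrogate is trained only on the fixed a priori set S, its accuracy near the various local minima varies, which is exactly the effect the ISMO algorithm of section 4 is designed to remove.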
The following remarks about algorithm 3.1 are in order.

Remark 3.2. The quasi-Newton optimization algorithm 2.1 requires derivatives of the cost function with respect to the input parameters y. This boils down to evaluating ∇_y L*. As L* is a neural network of the form (2.9), these derivatives can be computed readily and very cheaply by backpropagation.
Remark 3.3. The overall cost of the DNNopt algorithm 3.1 is expected to be dominated by the cost of generating the training data L(y_i), for y_i ∈ S, as this involves calls to the underlying, and possibly expensive, PDE solver. In practice, the cost of training the network can be much smaller than the cost of generating the training data, whereas the cost of evaluating the map L* and its gradient ∇_y L* is negligible (see Table 5 of the recent paper [24] for a comparison of these costs for a realistic example in aerodynamics). Thus, the cost of running the optimization algorithm 2.1 on the neural network surrogate is negligible, even for a large number of iterations.

Analysis of the DNNopt algorithm 3.1
As discussed in remark 3.3, the cost of the DNNopt algorithm 3.1 is dominated by the cost of generating the training data. How many training samples N are needed in order to achieve a desired accuracy of the optimization procedure? This question is very hard to answer in full generality. However, we investigate it in a special case, by imposing suitable hypotheses on the underlying map L, on the cost function G, on the training procedure for the neural network and on the trained neural network.
For the rest of this paper, we set |·| = |·|_∞, the max vector norm. We start with a suitable hypothesis on the underlying map L in (2.4):

(H1) L ∈ W^{2,∞}(Y), with ‖L‖_{W^{2,∞}(Y)} ≤ C_L.
Next, we have the following hypothesis on the function G that appears in the cost function (2.5):

(H2) G ∈ W^{2,∞}(R^m), with ‖G‖_{W^{2,∞}} ≤ C_G.

Recalling that the cost function in (2.5) is given by G = G(L), the hypotheses H1 and H2 imply that G ∈ W^{2,∞}(Y). However, we need further hypotheses on this cost function:

(H3) G has exactly one global minimum at ȳ and no other local minima;
(H4) G grows quadratically around the minimum, i.e. there exists a constant c > 0 such that G(y) − G(ȳ) ≥ c |y − ȳ|² for all y ∈ Y.

Note that the assumptions H3 and H4 are automatically satisfied by strictly convex functions. Thus, these hypotheses can be considered a slight generalization of strict convexity.
We are interested in analyzing the DNNopt algorithm 3.1 with the underlying map and cost function satisfying the above hypotheses. Our aim is to derive estimates on the distance between the global minimizer ȳ of the cost function G (see hypothesis H3) and the approximate minimizers ȳ_i, 1 ≤ i ≤ N, generated by the DNNopt algorithm 3.1. However, the standard training procedure for neural networks (see algorithm 2.2) consists of minimizing loss functions in L^p-norms, and this may not suffice to control pointwise errors near extrema of the underlying function. Thus, we need to train the neural network by minimizing stronger norms that also control the derivative.
To this end, we fix N ∈ N and choose the training set S = {y_i}, 1 ≤ i ≤ N, with each y_i ∈ Y, such that the training points satisfy

(H5) Y ⊆ ⋃_{i=1}^{N} B(y_i, N^{−1/d}),

with B(y, r) denoting the |·|_∞ ball, centered at y, with diameter r.
The simplest way to enforce hypothesis H5 is to divide the domain Y into N cubes, centered at the points y_i, each with diameter N^{−1/d}. On this training set, we assume that, for any parameter θ ∈ Θ, the neural network satisfies L_θ ∈ W^{2,∞}(Y) (by requiring that the activation function σ ∈ W^{2,∞}(R)), and consider the following Lipschitz loss function,

J_Lip(θ) := max_{1 ≤ i ≤ N} |L(y_i) − L_θ(y_i)| + max_{1 ≤ i ≠ j ≤ N} |(L(y_i) − L_θ(y_i)) − (L(y_j) − L_θ(y_j))| / |y_i − y_j|. (3.5)

Then, a stochastic gradient descent algorithm is run in order to find θ* = arg min_{θ ∈ Θ} J_Lip(θ). Note that the gradients of the Lipschitz loss function (3.5) can be computed by backpropagation. We denote the trained neural network by L* = L_{θ*} and set G*(y) := G(L*(y)). Finally, we need the following assumption on the standard optimization algorithm 2.1:

(H6) for all starting values y_i ∈ S, the optimization algorithm 2.1 converges (in a finite number of iterations) to a local minimum ȳ_i ∈ Y of the cost function G*.

Note that we are not assuming that G* is convex; hence, only convergence to a local minimum can be expected.
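A direct, if naive, evaluation of a Lipschitz-type loss of the form (3.5) can be sketched as follows; the quadratic-time loop over training pairs and the specific maps are illustrative stand-ins only:

```python
import numpy as np

def lipschitz_loss(y_train, L_vals, surrogate):
    """Max pointwise mismatch plus the largest difference quotient of the
    mismatch d(y) = L(y) - L_theta(y) over all pairs of training points."""
    d = L_vals - surrogate(y_train)
    pointwise = np.max(np.abs(d))
    n = len(y_train)
    quotients = [abs(d[i] - d[j]) / abs(y_train[i] - y_train[j])
                 for i in range(n) for j in range(i + 1, n)]
    return pointwise + max(quotients)

y_train = np.linspace(0.0, 1.0, 10)
L_vals = np.sin(y_train)

# A perfect surrogate incurs zero loss; a tilted one is penalized in both terms.
perfect = lipschitz_loss(y_train, L_vals, np.sin)
biased = lipschitz_loss(y_train, L_vals, lambda y: np.sin(y) + 0.1 * y)
```

The second term is what distinguishes this loss from the L^p-loss (2.14): it penalizes errors in the difference quotients, and hence controls the derivative of the mismatch.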
We have the following bound on the minimizers generated by the DNNopt algorithm 3.1 in this framework.

Lemma 3.4. Let the underlying map L in (2.4) satisfy H1, the function G satisfy H2, and the cost function G satisfy H3 and H4. Let L* be a neural network surrogate for L, generated by algorithm 2.2 with training set S = {y_i}, 1 ≤ i ≤ N, such that the training points satisfy H5 and the neural network is obtained by minimizing the Lipschitz loss function (3.5). Furthermore, for starting values y_i ∈ S as inputs to the standard optimization algorithm 2.1, let ȳ_i denote the local minimizers of G* = G(L*) generated by algorithm 2.1 and satisfying H6. Then we have the bound (3.9).

Proof. Using the straightforward calculation ∇G(y) = G′(L(y))∇L(y), and similarly for ∇G*, in the above inequality yields two terms T_1, T_2, which we estimate as follows. As ȳ_j ∈ Y, by the hypothesis H5 on the training set there exists an index 1 ≤ i_j ≤ N such that y_{i_j} ∈ S and |ȳ_j − y_{i_j}| ≤ N^{−1/d}. Hence, we can further estimate T_{1,2} from the resulting inequalities, which together yield the claimed bound (3.9).

In practice, we do not know the exact minimizer ȳ of the cost function G, and hence the minimum cost G(ȳ). Instead, we can monitor the effectiveness of the DNNopt algorithm 3.1 by computing either the range of the computed minima, i.e. range(G) := max_{1 ≤ i ≤ N} G*(ȳ_i) − min_{1 ≤ i ≤ N} G*(ȳ_i), or an empirical approximation of their standard deviation. A straightforward application of the estimate (3.9), together with (3.1), (3.2), yields corresponding bounds on these quantities.

In order to simplify notation while tracking constants, we can assume, by rescaling the underlying functions, that C_L, C_G ≤ 1/3; then the bound (3.9) simplifies to (3.18). Next, we define a well-trained network as a deep neural network L*, generated by the deep learning algorithm 2.2, which further satisfies, for any training set S = {y_i}, 1 ≤ i ≤ N, the assumptions (3.19). Hence, for a well-trained neural network, the bound (3.18) further simplifies to (3.20). In particular, the bound (3.20) ensures convergence of all the approximate minimizers ȳ_i, generated by the DNNopt algorithm 3.1, to the underlying minimizer ȳ of the optimization problem (2.4), as the number of training samples N → ∞.
Discussion on the assumptions in Lemma 3.4

The estimates (3.9), (3.20) need several assumptions on the underlying map, cost function and trained neural network to hold. We have the following comments on these assumptions.

• The assumptions H1 and H2 entail that the underlying map L and function G are sufficiently regular, i.e. W^{2,∞}. A large number of underlying maps for PDE constrained optimization, particularly for elliptic and parabolic PDEs, satisfy these assumptions [4].
• The assumptions H3 and H4 essentially amount to requiring that the cost function G is strictly convex. It is standard in optimization theory to prove guarantees for convex cost functions, while expecting that similar estimates will hold for approximating local minima of non-convex functions; see for instance [19].
• The hypothesis H5 on the training set is an assumption on how the training set covers the underlying domain [9]. It stems from an equidistribution requirement on the training set and can be satisfied, for instance, by low-discrepancy sequences such as Sobol points. It can certainly be relaxed from a deterministic to a random selection of points, at the expense of some further notational complexity; see [6] for space-filling arguments.
• The use of the Lipschitz loss function (3.5) for training the neural network is necessitated by the need to control the neural network pointwise. We could also use a Sobolev loss function, measuring the mismatch L − L_θ in a Sobolev norm of order s > d. By Morrey's inequality, minimizing such a loss function automatically controls the pointwise differences between the (derivatives of the) underlying map and the trained neural network, and an estimate similar to (3.9) can be obtained. Using an L^p-loss function such as (2.14), on the other hand, may be insufficient to control the pointwise differences between minima of the underlying cost function and the approximate minima generated by the DNNopt algorithm 3.1.
• The hypothesis H6 on the standard optimization algorithm 2.1 is made only for notational convenience and can readily be relaxed to an approximate version, i.e. |∇G*(ȳ_i)| ≤ ε for each i and a small tolerance ε ≪ 1. This merely entails a minor correction to the estimate (3.9). Quasi-Newton algorithms are well known to converge to such approximate local minima [13].
• The point of assuming well-trained networks in the sense of (3.19) is to simplify the estimates (3.9), (3.17). The first inequality in (3.19) is an assumption of uniform bounds on the trained neural networks (and their gradients) with respect to the number of training samples. The constant 1/3 is chosen only for notational convenience and can be replaced by any other constant. Note that we do not explicitly ensure that the first inequality in (3.19) is always satisfied; it can, however, be readily monitored in practice and can be enforced by adding suitable regularization terms to the loss function (2.14). On the other hand, the second inequality in (3.19) is an assumption on the training error which greatly simplifies the complexity estimates but is not necessary. The essence of this assumption is the requirement that training errors be smaller than the generalization gap. In general, there are no rigorous estimates on the training error of a stochastic gradient method, as a very high-dimensional non-convex optimization problem is being solved.
• In Lemma 3.4, we have assumed that the neural network L* ∈ W^{2,∞}(Y). This rules out the ReLU activation function, but allows other popular activation functions such as the hyperbolic tangent.

Iterative surrogate model optimization (ISMO)
Even with all the hypotheses on the underlying map, cost function and trained neural network, the bound (3.9) (and the simplified bound (3.20)) highlights a fundamental issue with the DNNopt algorithm 3.1. This can be seen from the right hand sides of (3.9), (3.20), where the error in approximating the minimizer ȳ of the underlying cost function in (2.4) decays only at a rate with a dimension-dependent exponent, significantly increasing the cost of the algorithm and reducing its competitiveness with respect to standard optimization algorithms.
As mentioned in the introduction, one possible cause of the slow rate of error decay in the bound (3.9) lies in the fact that the training set S for the deep learning algorithm 2.2 is chosen a priori and does not reflect any information about the location of (local) minima of the underlying cost function in (2.4). It is very likely that incorporating such information will increase the accuracy of the deep learning algorithm around these minima. We propose the following iterative algorithm to incorporate such information.

Algorithm 4.1. ISMO: iterative surrogate model optimization
Step k = 0: Given an initial training set S_0 = {y_i^0}, with N_0 = #(S_0), train a neural network surrogate L*_0 with the deep learning algorithm 2.2. Draw ΔN random starting points in Y and run the standard optimization algorithm 2.1 on the surrogate cost G*_0 = G(L*_0) to obtain approximate minimizers ȳ_j^0, 1 ≤ j ≤ ΔN.
Step k → k + 1: Augment the training set, S_{k+1} = S_k ∪ {ȳ_j^k}, evaluating L at the newly added points. Retrain the neural network on S_{k+1} to obtain L*_{k+1}, draw ΔN random starting points and rerun the optimization algorithm 2.1 on G*_{k+1} = G(L*_{k+1}) to obtain new approximate minimizers ȳ_j^{k+1}, 1 ≤ j ≤ ΔN. If the computed minima change by less than some tolerance ε ≪ 1, terminate the iteration; else, continue to step k + 1.

Thus, a sequence of neural network surrogates L*_k is generated, with each successive iteration likely producing a DNN surrogate with greater accuracy around local minima of the underlying cost function. Given this feedback between neural network training and optimization of the underlying cost function, we consider ISMO as an example of an active learning algorithm [43], with the deep learning algorithm 2.2 playing the role of the learner, querying the standard optimization algorithm 2.1, which serves as the teacher or oracle, for further training data.
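The ISMO feedback loop can be sketched as follows; as before, a hypothetical one-dimensional observable and a polynomial fit stand in for the PDE solver and the DNN surrogate, with the minimizers found at each step appended to the training set:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-ins for the expensive observable and the cost function.
L_true = lambda y: np.sin(3.0 * y) + y ** 2
G = lambda z: (z - 0.5) ** 2

def train_surrogate(S):
    """Retrain the surrogate on the current training set (polynomial stand-in for the DNN)."""
    c = np.polyfit(S, L_true(np.asarray(S)), deg=min(6, len(S) - 1))
    return lambda y: np.polyval(c, y)

rng = np.random.default_rng(3)
S = list(np.linspace(0.0, 1.0, 8))   # initial training set S_0
delta_N = 4                          # minimizers added per iteration
found = []

for k in range(4):                   # ISMO iterations
    L_star = train_surrogate(S)
    G_star = lambda y: G(L_star(y[0]))
    # Multistart optimization on the current surrogate cost ...
    new_pts = [minimize(G_star, x0=[y0], bounds=[(0.0, 1.0)]).x[0]
               for y0 in rng.random(delta_N)]
    found.extend(new_pts)
    S.extend(new_pts)                # ... and feedback: S_{k+1} = S_k ∪ {minimizers}

y_best = min(found, key=lambda y: G(L_true(y)))
```

Each pass concentrates new training points near the current estimate of the minimizer, so later surrogates are most accurate exactly where the optimizer needs them.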

Analysis of the ISMO algorithm 4.1
Does the ISMO algorithm 4.1 lead to greater accuracy than the DNNopt algorithm 3.1? We investigate this question within the framework used for the analysis of the DNNopt algorithm in section 3.1. We recall that a modified Lipschitz loss function (3.5) was used for training the underlying deep neural network in the supervised learning algorithm 2.2; we continue to use this Lipschitz loss function here. Hence, at any step k of algorithm 4.1, with training set S_k, the corresponding Lipschitz loss function (4.4) is used. Then, a stochastic gradient descent algorithm is run to find the minimizing weights, and we denote the trained neural network by L*_k.

We have the following bound on the minimizers computed with the ISMO algorithm 4.1.

Lemma 4.3. Assume the hypotheses of Lemma 3.4, and let the number of initial training points N_0 = #(S_0) be chosen such that (4.7) holds, with a constant C depending on the constants in (3.9). Given σ_0 in (4.7), define σ_k for k ≥ 1 by the recursion (4.8).

Under the additional assumption (H7) that the local minimizers ȳ_j^k, for 1 ≤ j ≤ ΔN and at every step k ≥ 0, remain close to the minimizers of the previous step, the approximate minimizers ȳ_j^k, for all k ≥ 1 and 1 ≤ j ≤ ΔN, satisfy the bound (4.10), with σ_k defined in (4.8).
In particular, if at every step k the trained neural network L*_k is well-trained, i.e. it satisfies (4.11), then the minimizers generated by the ISMO algorithm 4.1 satisfy the bound (4.12) for all k ≥ 1.

Proof. We observe that step k = 0 of the ISMO algorithm 4.1 is identical to the DNNopt algorithm 3.1.
As in the proof of Lemma 3.4, we can estimate the distance of the computed minimizers to ȳ. By the hypothesis H7 on the approximate minimizers, for each j there exists an index 1 ≤ i_{j,k} ≤ ΔN such that ȳ_{i_{j,k}}^{k−1} is close to ȳ_j^k. Hence, we can further estimate, from the Lipschitz loss function (4.4) and completely analogously to the estimates (3.13) and (3.14), an inequality which, with the definition (4.8), is the desired bound (4.10) at step k.
A straightforward application of the definitions (4.11) of well-trained networks L*_k, for k ≥ 1, together with a recursion on the inequality (4.10), yields the bound (4.12).

Remark 4.4. In addition to the assumptions H1-H6 of section 3.1, we also need the assumption H7 to obtain bounds on the minimizers generated by the ISMO algorithm. This assumption amounts to requiring that, during the iterations, the optimizers at a given step k remain close to the optimizers at the previous step k − 1; it can be seen as a continuity requirement on the minimizers. It is justified as long as the first step k = 0 provides a good approximation of the underlying (global) minimum and successive iterations improve this approximation. On the other hand, for non-convex cost functions, such an assumption could be violated near local minima. However, given that the starting values for the standard optimization algorithm at any iteration step k ≥ 1 of the ISMO algorithm 4.1 are chosen randomly, we expect this to provide a pathway for the algorithm to escape from unfavorable local minima.
Remark 4.5. The assumption of well-trained networks (4.11) at each step of the ISMO algorithm 4.1 is made for conceptual and notational simplicity. An estimate analogous to (4.12) can be proved without assuming (4.11).

Comparison between DNNopt and ISMO
The estimates (3.20) and (4.12) provide a quantitative comparison of accuracy between the DNNopt algorithm 3.1 and the active learning ISMO algorithm 4.1.
To realize this, we further assume that the ISMO algorithm converges to the desired tolerance within K iterations and that the batch size ∆N = cN_0, for some fraction c ≤ 1. Hence, the accuracy of the ISMO algorithm is quantified by (4.12) as,

Numerical experiments
The theory presented in sections 3 and 4 suggests that the approximate minimizers computed by the DNNopt algorithm 3.1 and the ISMO algorithm 4.1 converge to (local) minima of the underlying optimization problem 2.4 as the number of training samples is increased. Moreover, the theory also suggests that the ISMO algorithm converges significantly faster than the DNNopt algorithm and can have significantly lower variance (sensitivity) of the computed minima. However, these theoretical results were obtained under rather stringent regularity hypotheses on the underlying observable L, a convexity hypothesis on the cost function G, the use of Lipschitz loss functions (3.5), (4.4), and further hypotheses on the distribution of optimizers (4.9). Many of these hypotheses may not be valid for PDE constrained optimization problems of practical interest. Will the DNNopt and ISMO algorithms work as predicted by the proposed theory for these realistic problems? In particular, will ISMO significantly outperform DNNopt in terms of accuracy and robustness? We investigate these questions through a series of numerical experiments in this section.
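The contrast between the two predicted scalings can be illustrated with a small numeric sketch. The contraction factor and constants below are hypothetical stand-ins, chosen only to exhibit the qualitative behavior of the bounds, not values derived from the theory.

```python
# Illustrative comparison of the two error scalings:
#   DNNopt: error ~ ((1 + c*K) * N0)**(-1/d)          (algebraic in sample count)
#   ISMO:   error ~ sigma**K, sigma = (c*N0)**(-1/d)  (exponential in K)
# All constants are hypothetical stand-ins; only the trend is meaningful.

def dnnopt_error(N0, c, K, d):
    """Algebraic decay in the total sample count (1 + c*K) * N0."""
    return ((1.0 + c * K) * N0) ** (-1.0 / d)

def ismo_error(N0, c, K, d):
    """Exponential decay in K with contraction factor sigma < 1."""
    sigma = (c * N0) ** (-1.0 / d)
    return sigma ** K

if __name__ == "__main__":
    N0, c, d = 64, 0.25, 10
    for K in (1, 5, 10):
        print(K, dnnopt_error(N0, c, K, d), ismo_error(N0, c, K, d))
```

For d = 10, the algebraic rate barely improves as K grows, while the geometric rate shrinks by a constant factor per iteration, which is the behavior the experiments below probe.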

Details of implementation
The implementation of both the DNNopt algorithm 3.1 and the ISMO algorithm 4.1 is done in Python, utilizing the TensorFlow framework [1] with Keras [7] for the machine learning specific parts, while SciPy [47] is used for handling the optimization of the objective function.
The code is open-source, and can be downloaded from https://github.com/kjetil-lye/iterative_surrogate_optimization. New problem statements can be readily implemented without intimate knowledge of the algorithm. The architecture and hyperparameters for the underlying neural networks in both algorithms are specified in Table 1. To demonstrate the robustness of the proposed algorithms, we train an ensemble of neural networks with 10 different starting values and evaluate the resulting statistics.
By utilizing the job chaining capability of the IBM Spectrum LSF system [18], we are able to create a process-based system where each iteration of the ISMO algorithm is run as a separate job. The novel use of job chaining enables us to differentiate the job parameters (number of compute nodes, runtime, memory, etc.) for each step of the algorithm, without wasting computational resources.

Optimal control of projectile motion

This problem was proposed in a recent paper [23] as a prototype for using deep neural networks to learn observables for differential equations and dynamical systems. The motion of a projectile, subjected to gravity and air drag, is described by the following system of ODEs,

d/dt x(t; y) = v(t; y), x(0; y) = x_0(y), (5.1)

where F_D = ρ C_d π r² v² / (2m) denotes the drag force, the initial velocity is set to v(0; y) = [v_0 cos(α), v_0 sin(α)], and the various parameters are listed in Table 2.
The observable L of interest is the horizontal range x_max of the simulation. As observed in figure 2, we start with a randomly chosen set of points, which carry no information on the minimizers of the underlying function (the parabola in figure 2). At the very first iteration, the standard optimization algorithm identifies optimizers that are clustered near the parabola, and at each successive step, more such training points are identified, which increasingly improve the accuracy and robustness of the ISMO algorithm. We recall from the theory presented in Lemma 4.3 that the initial number of training points N_0 in the ISMO algorithm should be sufficiently high in order to enforce (4.7) and obtain the desired rate of convergence. To check whether a minimum number of initial training points is necessary for the ISMO algorithm to be robust and accurate, we set N_0 = 8 and ∆N = 4 in algorithm 4.1 and present the results in figure 4 (left). As seen from this figure, and contrasting the results with those presented in figure 3, we observe that the ISMO algorithm is actually less robust than the DNNopt algorithm, at least for the first few iterations. Hence, we clearly need enough initial training samples for the ISMO algorithm to be robust and efficient.
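The projectile model (5.1) and its observable can be sketched with a standard ODE solver; the parameter values below are illustrative placeholders, not the values of Table 2.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the projectile ODE (5.1) with quadratic air drag. The parameter
# values below are illustrative placeholders, NOT those of Table 2.
m, rho, Cd, r, g = 0.1, 1.225, 0.45, 0.02, 9.81
v0, alpha = 25.0, np.pi / 6.0

def rhs(t, s):
    # State s = [x, y, vx, vy]; quadratic drag opposes the velocity direction
    x, y, vx, vy = s
    speed = np.hypot(vx, vy)
    k = 0.5 * rho * Cd * np.pi * r ** 2 * speed / m
    return [vx, vy, -k * vx, -k * vy - g]

def hit_ground(t, s):
    return s[1]          # event: vertical position crosses zero
hit_ground.terminal = True
hit_ground.direction = -1

sol = solve_ivp(rhs, (0.0, 60.0),
                [0.0, 0.0, v0 * np.cos(alpha), v0 * np.sin(alpha)],
                events=hit_ground, max_step=0.01)
x_max = sol.y_events[0][0][0]   # the observable L(y): horizontal range
```

Each evaluation of x_max is one call to the forward solver; this is the map that the neural network surrogate approximates in both algorithms.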
Finally, the results presented so far used randomly chosen initial training points for the ISMO algorithm. The results with low-discrepancy Sobol points are presented in figure 4 (right) and show the same qualitative behaviour as those with randomly chosen points. Although the mean and standard deviation of the objective function are slightly smaller with the Sobol points, the differences are minor and we will only use randomly chosen initial training points for the rest of the paper. This numerical experiment is openly available at https://github.com/kjetil-lye/ismo_validation.

Inverse problem for the heat equation
In this section, we consider our first PDE example, the well-known heat equation in one space dimension, given by (5.4). Here, u is the temperature and q ≥ 0 is the diffusion coefficient. We will follow [3] and identify parameters for the following data assimilation problem: for initial data of the form (5.5), with unknown coefficients a_1, . . ., a_L and an unknown diffusion coefficient q, the aim of this data assimilation or parameter identification problem is to compute approximations to a_l, q from measurements of the form u_i = u(x_i, T), i.e., pointwise measurements of the solution field at the final time. This problem can readily be cast in the framework of PDE constrained optimization, section 2, in the following manner. The control parameters are y = [q, a_1, a_2, . . ., a_L] ∈ Y = [0, 1]^{L+1} and the relevant observables are given by (5.6). Following [3] and references therein, we can set up the cost function (5.7), with measurements L̄_i. In this article, we assume that the measurements are noise free.
In order to perform this experiment, we set L = 9, I = 5 and T = 0.001 and generate the measurement data from the following ground truth parameters, [q, a_1, . . ., a_9] = [0.75, 0.1, 0.4, 0.8, 0.25, 0.75, 0.45, 0.9, 0.25, 0.85], and measurement points, [x_1, . . ., x_5] = [0.125, 0.25, 0.5, 0.75, 0.825]. We numerically approximate the heat equation (5.4) with a second-order Crank-Nicolson finite difference scheme with 2048 points in both space and time and generate the measurement data L̄_i, for 1 ≤ i ≤ 5, from this simulation. The DNNopt algorithm 3.1 is run for this particular configuration with the hyperparameters in Table 1. We would like to point out that I deep neural networks, L*_i, for 1 ≤ i ≤ I, are independently trained, corresponding to each of the observables L_i. The algorithm 3.1 and its analysis in section 3.1 can be readily extended to this case.
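A minimal Crank-Nicolson sketch of the forward problem is given below. The sine-series form of the initial datum and the coarse grid are assumptions for illustration; the actual experiment uses 2048 points in space and time.

```python
import numpy as np

# Minimal Crank-Nicolson sketch for u_t = q * u_xx on [0, 1] with homogeneous
# Dirichlet boundaries. The sine-series initial datum and the coarse grid are
# assumptions for illustration only.
def heat_cn(q, a, T, nx=128, nt=128):
    x = np.linspace(0.0, 1.0, nx + 1)
    u = sum(ak * np.sin((l + 1) * np.pi * x) for l, ak in enumerate(a))
    dx, dt = 1.0 / nx, T / nt
    lam = q * dt / (2.0 * dx * dx)
    n = nx - 1  # interior nodes
    A = (np.diag((1.0 + 2.0 * lam) * np.ones(n))
         + np.diag(-lam * np.ones(n - 1), 1) + np.diag(-lam * np.ones(n - 1), -1))
    B = (np.diag((1.0 - 2.0 * lam) * np.ones(n))
         + np.diag(lam * np.ones(n - 1), 1) + np.diag(lam * np.ones(n - 1), -1))
    for _ in range(nt):
        u[1:-1] = np.linalg.solve(A, B @ u[1:-1])
    return x, u

# Pointwise measurements u_i = u(x_i, T), as used in the cost function (5.7)
x, u = heat_cn(q=0.75, a=[0.1, 0.4, 0.8], T=0.001)
measurements = np.interp([0.125, 0.25, 0.5, 0.75, 0.825], x, u)
```

In the actual experiment, each observable L_i(y) is such a pointwise measurement as a function of the control parameters y, and each is approximated by its own neural network.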
Results of the DNNopt algorithm with randomly chosen N_k training points, with N_k = N_0 + k∆N, N_0 = 64 and ∆N = 16, are presented in figure 5. We observe from this figure that the mean Ḡ of the cost function (3.16) barely decays with increasing number of training samples. Moreover, the standard deviation (3.16) is also unacceptably high, even for a very large number of training samples. This is consistent with the bounds (3.20), (4.15), even though the hypotheses of section 3.1 do not necessarily hold in this case. The bounds (3.20), (4.15) suggest a very slow decay of the errors for this 10-dimensional (d = 10) optimization problem.
Next, the ISMO algorithm 4.1 is run on this problem with the same setup as for the DNNopt algorithm, i.e., I = 5 independent neural networks L*_{k,i} are trained, each corresponding to an observable L_i (5.6), at every iteration k of the ISMO algorithm. The results, with N_0 = 64 and ∆N = 16 as hyperparameters of the ISMO algorithm, are shown in figure 5. We observe from this figure that the mean Ḡ of the cost function (3.16) decays with increasing number of iterations, till a local minimum corresponding to a mean objective function value of approximately 0.05 is reached. The standard deviation also decays as the number of iterations increases. In particular, the ISMO algorithm provides a significantly lower mean value of the cost function compared to the DNNopt algorithm. Moreover, the standard deviation is several times lower for the ISMO algorithm than for the DNNopt algorithm. Thus, we expect robust parameter identification with the ISMO algorithm. This is indeed verified in figure 6, where we plot the identified initial condition (5.5) obtained with the ISMO algorithm, with a starting size of 64 and a batch size of 16, and the corresponding solution field at time T (computed with the Crank-Nicolson scheme), superposed with the measurements. We see from this figure that the identified initial data leads to one of the multiple solutions of this ill-posed inverse problem that approximates the measured values quite accurately.
This numerical experiment is openly available at https://github.com/kjetil-lye/ismo_heat.

Shape optimization of airfoils
For the final numerical example in this paper, we choose the optimization of the shape of airfoils (wing cross-sections) in order to improve certain aerodynamic properties. The flow past the airfoil is modeled by the two-dimensional Euler equations (5.8). We will follow standard practice in aerodynamic shape optimization [31] and consider a reference airfoil, with the upper and lower surfaces of the airfoil located at (x, y_{U,ref}(x/c)) and (x, y_{L,ref}(x/c)), with x ∈ [0, c], c being the airfoil chord, and y_{U,ref}, y_{L,ref} corresponding to the well-known RAE2822 airfoil [17]; see figure 7 for a graphical representation of the reference airfoil.
The shape of this reference airfoil is perturbed in order to optimize aerodynamic properties. The upper and lower surfaces of the airfoil are perturbed as in (5.9), with c being the airfoil chord length, i.e., c = |AB| for the airfoil depicted in figure 7.
Although several different parametrizations of the shape or its perturbations have been proposed in the large body of aerodynamic shape optimization literature [41,26], we will focus on the simple and commonly used Hicks-Henne bump functions [26] to parametrize the perturbed airfoil shape: B_k(ξ) = sin³(π ξ^{q_k}), q_k = ln 2 / (ln(d + 4) − ln k).
(5.10) Furthermore, the design parameters for each surface are chosen as in (5.11), for 1 ≤ k ≤ d/2. The transform functions in (5.11) to obtain the design parameters have been chosen to ensure that i) the perturbation is smaller near the narrower tip of the airfoil (point B), and ii) the design envelope is large enough (see figure 8) while avoiding non-physical intersections of the top and bottom surfaces after deformation. Note that the airfoil shape is completely parametrized by the vector y ∈ [0, 1]^d. For our experiments, we choose 10 parameters for each surface, i.e., we set d = 20.
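The bump basis (5.10) is straightforward to evaluate; the following sketch computes it for d = 20 (the surface perturbation weights of (5.11) are not reproduced here). By construction, q_k places the peak of bump k at ξ = k/(d + 4).

```python
import numpy as np

# The Hicks-Henne bump basis (5.10): B_k(xi) = sin^3(pi * xi**q_k), with
# q_k = ln 2 / (ln(d + 4) - ln k), so bump k peaks at xi = k / (d + 4).
# The perturbation weights of (5.11) are not reproduced here.
def hicks_henne(xi, k, d):
    q_k = np.log(2.0) / (np.log(d + 4.0) - np.log(k))
    return np.sin(np.pi * xi ** q_k) ** 3

d = 20
xi = np.linspace(0.0, 1.0, 101)
bumps = np.array([hicks_henne(xi, k, d) for k in range(1, d // 2 + 1)])
```

Since each bump vanishes at ξ = 0 and ξ = 1, the perturbed surfaces automatically match the reference airfoil at the leading and trailing edges.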
The flow around the airfoil is characterized by the free-stream boundary conditions corresponding to a Mach number of M_∞ = 0.729 and an angle of attack of α = 2.31°. The observables of interest are the lift and drag coefficients given by (5.12) and (5.13). As is standard in aerodynamic shape optimization [40,31,25], we will modulate the airfoil shape in order to minimize drag while keeping lift constant. This leads to the optimization problem (5.14). We will compute approximate minimizers of the cost function (5.14) with the DNNopt algorithm 3.1 and the ISMO algorithm 4.1. The training data is generated using the NUWTUN solver to approximate solutions of the Euler equations (5.8) around the deformed airfoil geometry (5.9), with the afore-mentioned boundary conditions. The NUWTUN code is based on the ISAAC code [32] and employs a finite volume solver using the Roe numerical flux and second-order MUSCL reconstruction with the Hemker-Koren limiter. Once the airfoil shape is perturbed, the base mesh (see figure 7) is deformed to fit the new geometry using a thin plate splines based radial basis function interpolation technique [20,10], which is known to generate high quality grids. The problem is solved to steady state using an implicit Euler time integration. From figure 9, we observe that the mean Ḡ of the cost function (5.14) decays, but in an oscillatory manner, with increasing number of training samples. Moreover, the standard deviation (3.16) also decays but remains rather high, even for a large number of iterations. This is consistent with the findings in the other two numerical experiments reported here. These results suggest that the DNNopt algorithm may not yield robust or accurate optimal shapes.
Next, we run the ISMO algorithm 4.1 with the same setup as DNNopt, i.e., with two independent neural networks C*_{L,k}, C*_{D,k}, at every iteration k, for the lift and the drag, respectively. We set the hyperparameters N_0 = 64 and batch size ∆N = 16 and present results with the ISMO algorithm in figure 9. We see from this figure that the mean and the standard deviation of the cost function steadily decay with increasing number of iterations for the ISMO algorithm. Moreover, the mean as well as the standard deviation of the cost function, approximated by the ISMO algorithm, are significantly lower than those of the DNNopt algorithm. Thus, one can expect that the ISMO algorithm provides parameters for robust shape optimization of the airfoil. This is further verified in figure 10, where we separately plot the mean and mean ± standard deviation for the drag and lift coefficients, corresponding to the optimizers at each iteration, for the DNNopt and ISMO algorithms. From this figure, we observe that the mean and standard deviation of the computed drag, with the ISMO algorithm, decay with increasing number of iterations. Moreover, both the mean and standard deviation of the drag are significantly lower than those computed with the DNNopt algorithm. On the other hand, both algorithms result in Hicks-Henne parameters that keep the lift nearly constant (both in mean as well as in mean ± standard deviation), around a value of C_L = 0.90, which is slightly higher than the reference lift. This behavior is completely consistent with the structure of the cost function (5.14), which places the onus on minimizing the drag, while keeping the lift nearly constant. Quantitatively, the ISMO algorithm converges to a set of parameters (see the resulting shape in figure 8) that correspond to a computed minimum drag coefficient of C_D = 0.005202, compared to the drag C_D^{ref} = 0.01156 of the reference RAE2822 airfoil. The corresponding lift coefficient with this optimizer is C_L = 0.8960, when
compared to the reference lift C_L^{ref} = 0.8763. Similarly, the computed Hicks-Henne parameters, with the ISMO algorithm, that lead to a cost function (5.14) closest to the mean cost function Ḡ (see the shape of the resulting airfoil in figure 8), resulted in a drag coefficient of C_D = 0.005828 and a lift coefficient of C_L = 0.8879. Thus, while the optimized lift was kept very close to the reference lift, the drag was reduced by approximately 55% (for the absolute optimizer) and 50% (on average) by the shape parameters computed with the ISMO algorithm.
In order to investigate the reasons behind this significant drag reduction, we plot the Mach number around the airfoil to visualize the flow, for the reference airfoil and two of the optimized airfoils (one corresponding to the absolute smallest drag and the other to the mean optimized drag), in figure 11. From this figure, we see that the key difference between the reference and optimized airfoils is a significant reduction (even elimination) in the strength of the shock over the upper surface of the airfoil. This diminishing of shock strength reduces the shock drag and hence the overall cost function (5.14).
Finally, we compare the performance of the ISMO algorithm for this example with a standard optimization algorithm. To this end, we choose a black box truncated Newton (TNC) algorithm, in its default implementation in SciPy [47]. In figure 12, we plot the mean and the mean ± standard deviation of the cost function with the TNC algorithm, with 1024 starting values chosen randomly, and compare it with the ISMO algorithm 4.1. From figure 12 (left), we observe that the ISMO algorithm attains a significantly lower mean of the objective function than the black box TNC algorithm, even for a very large number (2^11 to 2^12) of evaluations (calls) of the underlying PDE solver. Clearly, the ISMO algorithm outperforms the TNC algorithm by more than an order of magnitude with respect to the mean of the objective function (5.14). Moreover, we also observe from figure 12 (right) that the ISMO algorithm has a significantly (several orders of magnitude) lower standard deviation than the TNC algorithm for the same number of calls to the PDE solver. Thus, this example clearly illustrates that the ISMO algorithm is both more efficient and more robust than standard optimization algorithms such as truncated Newton methods.
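A multistart TNC baseline of the kind used in this comparison can be sketched as follows. The toy cost function is a hypothetical stand-in: in the actual comparison, every evaluation of the objective (5.14) requires a call to the PDE solver.

```python
import numpy as np
from scipy.optimize import minimize

# Multistart truncated-Newton (TNC) baseline, sketched on a toy cost function.
# In the actual experiment each cost evaluation is one PDE solve.
rng = np.random.default_rng(0)

def cost(y):
    # Hypothetical multi-modal stand-in for G(L(y))
    return np.sum((y - 0.3) ** 2) + 0.1 * np.sum(np.sin(8.0 * np.pi * y) ** 2)

d, n_starts = 20, 32
results = [minimize(cost, rng.uniform(0.0, 1.0, d), method="TNC",
                    bounds=[(0.0, 1.0)] * d) for _ in range(n_starts)]
values = np.array([res.fun for res in results])
mean_G, std_G = values.mean(), values.std()  # ensemble statistics as in (3.16)
```

The mean and standard deviation over starting values are exactly the quantities plotted in figure 12; each TNC run consumes many solver calls, which is why the surrogate-based ISMO loop is so much cheaper per unit of accuracy.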

Discussion
Robust and accurate numerical approximation of the solutions of PDE constrained optimization problems represents a formidable computational challenge, as existing methods might require a large number of evaluations of the solution (and its gradients) of the underlying PDE. Since each call to the PDE solver is computationally costly, multiple calls can be prohibitively expensive. Using surrogate models within the context of PDE constrained optimization is therefore an attractive proposition. As long as the surrogate provides a robust and accurate approximation to the underlying PDE solution, while being much cheaper to evaluate computationally, surrogate models can be used inside standard optimization based algorithms to significantly reduce the computational cost. However, finding surrogates with these desirable properties can be challenging for high dimensional problems.
Deep neural networks (DNNs) have emerged as efficient surrogates for PDEs, particularly for approximating observables of PDEs, see [24,23,30] and references therein. Thus, it is natural to investigate whether DNNs can serve as efficient surrogates in the context of PDE constrained optimization. To this end, we propose the DNNopt algorithm 3.1 that combines standard optimization algorithms, such as the quasi-Newton algorithm 2.1, with deep neural network surrogates.
A careful theoretical analysis of the DNNopt algorithm, albeit in the restricted setting presented in section 3.1, reveals a fundamental issue with it, namely that its accuracy and robustness (measured in terms of the range and standard deviation (3.16) with respect to starting values) suffer from a significant curse of dimensionality. In particular, estimates (3.9), (3.17), (4.17) show that the error (and its standard deviation) with the DNNopt algorithm decays only very slowly with increasing number of training samples. This slow decay and high variance are verified in several numerical examples presented here.
We find that fixing training sets a priori for the deep neural networks might lead to poor approximation of the underlying minima for the optimization problem, which correspond to a subset (manifold) of the parameter space. To alleviate this shortcoming of the DNNopt algorithm, we propose a novel algorithm, termed the iterative surrogate model optimization (ISMO) algorithm 4.1. The key idea behind ISMO is to iteratively augment the training set for a sequence of deep neural networks such that the underlying minima of the optimization problem are better approximated. At any stage of the ISMO iteration, the current state of the DNN is used as an input to the standard optimization algorithm to find (approximate) local minima of the underlying cost function. These local minima are added to the training set, and the DNNs for the next step are trained on the augmented set. This feedback algorithm is iterated till convergence. Thus, one can view ISMO as an active learning algorithm [43], where the learner (the deep neural network) queries the teacher or oracle (the standard optimization algorithm 2.1) to provide a better training set at each iteration. In section 4.1, we analyze the ISMO algorithm in a restricted setting and prove that the resulting (approximate) optimizers converge to the underlying minimum of the optimization problem exponentially in terms of the number of iterations of the algorithm, see estimate (4.12). Moreover, the underlying variance is also reduced exponentially (4.10). This exponential convergence (4.16) should be contrasted with the algebraic convergence for the DNNopt algorithm (4.17) and can ameliorate the curse of dimensionality.
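The feedback loop described above can be sketched schematically. The quadratic "observable" and the radial basis function "surrogate" below are placeholders for the PDE solver and the deep neural network, respectively, and Nelder-Mead stands in for the standard optimization algorithm 2.1; only the loop structure reflects algorithm 4.1.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.optimize import minimize

# Schematic of the ISMO feedback loop (algorithm 4.1): fit a surrogate on the
# current training set, let an optimizer query it from random starts, and add
# the returned minimizers to the training set. All models are placeholders.
rng = np.random.default_rng(1)

def observable(y):                     # stand-in for the expensive map L(y)
    return float(np.sum((y - 0.7) ** 2))

def train_surrogate(ys, Ls):           # "training" step: fit L*_k to S_k
    interp = RBFInterpolator(ys, Ls, smoothing=1e-6)
    return lambda y: interp(y.reshape(1, -1))[0]

d, N0, dN, K = 2, 16, 8, 5
ys = rng.uniform(0.0, 1.0, (N0, d))    # initial training set S_0
Ls = np.array([observable(y) for y in ys])
for k in range(K):
    L_star = train_surrogate(ys, Ls)
    starts = rng.uniform(0.0, 1.0, (dN, d))
    new = np.array([minimize(L_star, y0, method="Nelder-Mead",
                             bounds=[(0.0, 1.0)] * d).x for y0 in starts])
    ys = np.vstack([ys, new])          # augment: S_{k+1} = S_k plus new minima
    Ls = np.append(Ls, [observable(y) for y in new])
best = ys[np.argmin(Ls)]
```

Note that the expensive forward model is only evaluated at the N_0 + K∆N training points; all optimizer queries go to the cheap surrogate.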
Although the theoretical results were proved in a restricted setting, we validated the ISMO algorithm and compared it with the DNNopt algorithm for three representative numerical examples, namely for an optimal control problem for a nonlinear ODE, a data assimilation (parameter identification) inverse problem for the heat equation, and shape optimization of airfoils, subject to the Euler equations of compressible fluid dynamics. For all these examples, the ISMO algorithm was shown to significantly outperform the DNNopt algorithm, both in terms of the decay of the (mean) objective function and its greatly reduced sensitivity to starting values. Moreover, the ISMO algorithm outperformed (by more than an order of magnitude) a standard black-box optimization algorithm for aerodynamic shape optimization.
Thus, on the basis of the proposed theory and numerical examples, we can assert that the active learning ISMO algorithm provides a very promising framework for significantly reducing the computational cost of PDE constrained optimization problems, while being robust and accurate. Moreover, it is very flexible, with respect to the choice of the underlying optimization algorithm as well as the architecture of the neural networks, and is straightforward to implement in different settings.
This article is the first to present the ISMO algorithm and readily lends itself to the following extensions.
• We have presented the ISMO algorithm in a very general setting of an abstract PDE constrained optimization problem. Although we have selected three representative examples in this article, the algorithm itself can be applied in a straightforward manner to a variety of PDEs, both linear and non-linear.
• We have focused on a particular type of underlying optimization algorithm, i.e., of the quasi-Newton type. However, the ISMO algorithm 4.1 does not rely on any specific details of the optimization algorithm. Hence, any optimization algorithm can be used as the teacher or oracle within ISMO. In particular, one can also use gradient-free optimization algorithms such as particle swarm optimization and genetic algorithms within ISMO. Thus, even black-box optimization algorithms can be employed within ISMO to significantly accelerate finding solutions of PDE constrained optimization problems.
• Similarly, the ISMO algorithm does not rely on a specific structure of the surrogate model. Although we concentrated on neural networks in this article, one can readily use other surrogates, such as Gaussian process regression, within the ISMO algorithm.
• In this article, we have considered the PDE constrained optimization problem where the objective function in (2.3) is defined in terms of an observable (2.2) of the PDE solution. In some problems, particularly for control problems, it might be necessary to consider the whole solution field and approximate it with a neural network. Such DNNs are also available but might be more expensive to train and evaluate. Thus, one needs to carefully investigate the resulting speedup with the ISMO algorithm over standard optimization algorithms in this context.
• A crucial issue in PDE constrained optimization is that of optimization under uncertainty, see [42] and references therein.In this context, the ISMO algorithm demonstrates considerable potential to significantly outperform state of the art algorithms and we consider this extension in a forthcoming paper.

Figure 1: An illustration of a (fully connected) deep neural network. The red neurons represent the inputs to the network and the blue neurons denote the output layer. They are connected by hidden layers with yellow neurons. Units in successive layers are connected by affine linear maps, followed by nonlinear (scalar) activation functions within each unit.

Remark 4.2.
The key difference between the ISMO algorithm 4.1 and the DNNopt algorithm 3.1 lies in the structure of the training set. In contrast to the DNNopt algorithm, where the training set is chosen a priori, the training set in the ISMO algorithm is augmented at every iteration by adding points that are identified as approximate local minimizers (by the standard optimization algorithm 2.1) of the cost function G*_k, corresponding to the surrogate neural network model L*_k.

Lemma 4.3.
Let the underlying map L in (2.4) satisfy H1 with constant C_L, the function G satisfy H2 with constant C_G, and the cost function G satisfy H3 and H4. Let the neural networks at every step k in algorithm 4.1 be trained with the Lipschitz loss functions (4.4), and let the local minimizers of the cost function G*_k, computed with the standard optimization algorithm 2.1, satisfy H6. Let the trained neural network L*_0 be well-trained in the sense of (3.19).

|ȳ − ȳ^K_j| ≤ C (cN_0)^{−K/d}. (4.14)

As we have assumed that the cost of both the DNNopt and ISMO algorithms is dominated by the cost of generating the training samples, the cost of the ISMO algorithm with K iterations is determined by the total number of training samples, i.e., (1 + cK)N_0. For exactly the same number of training samples, the accuracy of the DNNopt algorithm 3.1 is given by the estimate (3.20) as,

|ȳ − ȳ_j| ≤ C ((1 + cK)N_0)^{−1/d}. (4.15)

For N_0 ≫ 1 and c ≈ 1, as long as C ∼ O(1), we see that the estimate (4.14) suggests an exponential decay of the approximation error of the underlying minimum with respect to the size of the training set for the ISMO algorithm, in contrast to the algebraic decay for the DNNopt algorithm, at least within the framework of Lemmas 3.4 and 4.3. This contrast can also be directly translated to the range (3.15) and standard deviation (3.16) of the approximate minimizers for both algorithms. Straightforward calculations with the definitions (3.15), (3.16) and estimates (4.14), (4.15) reveal an exponential decay for the ISMO algorithm of the form,

max{range(G), std(G)} ∼ O(N_0^{−K/d}), (4.16)

and an algebraic decay for the DNNopt algorithm of the form,

max{range(G), std(G)} ∼ O((K N_0)^{−1/d}). (4.17)

These estimates (4.16) and (4.17) suggest that the ISMO algorithm may have significantly lower variance (sensitivity) of approximate minimizers of the underlying optimization problem (2.4), when compared to the DNNopt algorithm. In particular, these estimates highlight the role of the iterated feedback between the learner (the deep learning algorithm 2.2) and the oracle (the standard optimization algorithm 2.1) in providing the training data, which exponentially improves the approximation of the underlying minima in each successive step.

Figure 2: Evolution of the training sets S_k, for different iterations k, of the ISMO algorithm 4.1 for the optimal control of projectile motion example. The newly added training points S̃_k are the orange dots, while existing training points S_k are blue dots. The exact minimizers lie on the green parabola. Note how most of the newly added training points lie on this parabola.

Figure 4: The objective function vs. number of iterations of the DNNopt and ISMO algorithms for the optimal control of projectile motion. The mean of the objective function and mean ± standard deviation (3.16) are shown. Left: Random initial training points with N_0 = 8 and ∆N = 4 for the ISMO algorithm. Right: Sobol initial training points for the ISMO algorithm with N_0 = 32 and ∆N = 16.

Figure 5: The objective function (5.7) vs. number of iterations of the DNNopt and ISMO algorithms for the parameter identification (data assimilation) problem for the heat equation. The mean of the objective function and mean ± standard deviation (3.16) are shown.

Figure 6: Solution of the data assimilation problem for the heat equation with the ISMO algorithm. The identified initial data (5.5), the resulting (exact) solution at final time, and the measurements (5.6) are shown.

Figure 7: The reference shape of the RAE2822 airfoil. Left: The parametrization used to deform the shape of the airfoil according to (5.9). Right: The initial C-grid mesh used around the airfoil.

E = (1/2) ρ|u|² + p/(γ − 1), (5.8)

Figure 8: Shapes for the airfoil shape optimization problem. Left: The design envelope, where the reference, widest and narrowest airfoils are obtained by setting all design parameters y_i to 0.5, 0 and 1, respectively. Right: The optimized shapes obtained using the ISMO algorithm.

Figure 9: The objective function (5.14) vs. number of iterations of the DNNopt and ISMO algorithms for the airfoil shape optimization problem. The mean of the objective function and mean ± standard deviation (3.16) are shown.
with C_{L,D} being the lift (5.12) and drag (5.13) coefficients, respectively, C_L^{ref} the lift corresponding to the reference RAE2822 airfoil, and P = 10000 a parameter to penalize any deviations from the lift of the reference airfoil. The reference lift is calculated to be C_L^{ref} = 0.8763, and the reference drag is C_D^{ref} = 0.011562.

First, we run the DNNopt algorithm 3.1, with two independent neural networks C*_L and C*_D for the lift coefficient C_L and drag coefficient C_D, respectively. The results of the DNNopt algorithm, with randomly chosen training points in Y = [0, 1]^20, and with N_0 + k∆N training samples, for N_0 = 64 and ∆N = 16, are presented in figure 9.

Figure 10: The drag (5.13) (left) and lift (5.12) (right) vs. number of iterations of the DNNopt and ISMO algorithms for the airfoil shape optimization problem. The mean of the objective function and mean ± standard deviation (3.16) are shown.

Figure 11: Mach number contours for the flow around airfoils. Left: Flow around the reference RAE2822 airfoil. Center: Flow around the optimized airfoil, corresponding to Hicks-Henne parameters with a cost function (5.14) that is closest to the (converged) mean cost function for the ISMO algorithm. Right: Flow around the optimized airfoil, corresponding to Hicks-Henne parameters with the smallest (converged) cost function for the ISMO algorithm.

Figure 12: Comparison of ISMO with the black box TNC algorithm of [47] for the airfoil shape optimization example. Left: Mean of the cost function (5.14) vs. number of calls to the PDE solver. Right: Mean ± standard deviation (3.16) for the cost function (5.14).

Step k = 0: Given the initial training set S_0, generate the deep neural network surrogate map L*_0 by running the deep learning algorithm 2.2. Set G*_0(y) = G(L*_0(y)) for all y ∈ Y, with the cost function G defined in (2.5). For a batch size ∆N ≤ N_0, draw ∆N random starting values ỹ^0_1, . . ., ỹ^0_∆N in Y. Run the standard optimization algorithm 2.1 with ỹ^0_j as starting values and cost function G*_0 till tolerance, to obtain the set S̃_0 = {ȳ^0_j}, for 1 ≤ j ≤ ∆N, of approximate minimizers to the optimization problem 2.4. Set S_1 = S_0 ∪ S̃_0. (4.1)

Step k ≥ 1: Given the training set S_k, generate the deep neural network surrogate map L*_k by running the deep learning algorithm 2.2. Set G*_k(y) = G(L*_k(y)) for all y ∈ Y, with the cost function G defined in (2.5). Draw ∆N random starting values ỹ^k_1, . . ., ỹ^k_∆N. Run the standard optimization algorithm 2.1 with ỹ^k_j, 1 ≤ j ≤ ∆N, as starting values and with cost function G*_k till tolerance, to obtain the set S̃_k = {ȳ^k_j}, for 1 ≤ j ≤ ∆N, of approximate minimizers to the optimization problem 2.4. Set S_{k+1} = S_k ∪ S̃_k. (4.2)

Step k = 0 of the ISMO algorithm 4.1 is identical to the DNNopt algorithm 3.1, with training set S_0 and N_0 training samples. As we have required that all the hypotheses H1 − H6 of Lemma 3.4 hold and that the neural network L*_0 is well-trained in the sense of (3.19), we can directly apply the estimate (3.20).

Table 1:
Training parameters for the experiments done in section 5.

Table 2:
Parameters for the projectile motion experiment in section 5.