Numerical simulation of differential-algebraic equations with embedded global optimization criteria

We consider differential-algebraic equations with embedded optimization criteria (DAEOs) in which the embedded optimization problem is solved by global optimization. This leads to differential inclusions for cases in which there are multiple global optimizers at the same time. Jump events from one global optimum to another result in nonsmooth DAEs and thus in a reduction of the convergence order of the numerical integrator to first order. Implementation of event detection and location as introduced in this work preserves the higher-order convergence behavior of the integrator. This enables the computation of discrete tangent and adjoint sensitivities for optimal control problems.


Introduction
Solving dynamic models that are described by underdetermined systems of differential-algebraic equations (DAEs), i.e., systems with more algebraic states than algebraic equations, often requires embedding an optimization criterion to find a unique solution. These models are called DAEs with embedded optimization criteria (DAEOs).
Prime applications for DAEOs can be found in process systems engineering. Separation processes with thermodynamic equilibrium between phases are often modeled with DAEOs. The DAE describes the dynamic behavior and a nonlinear program represents the phase equilibrium, which is at the minimum of the Gibbs free energy [1,8,29].
For simulation [26,27] and optimization [4,16,28] of DAEOs the nonlinear programs are often substituted with first-order optimality conditions, i.e., Karush-Kuhn-Tucker (KKT) conditions. This local approach only guarantees exact solutions for convex nonlinear programs. For nonconvex nonlinear programs, the solution point might only be locally optimal. Furthermore, substitution of the optimization problem by the KKT conditions yields a nonsmooth DAE due to switching events, i.e., changes of the active set. This kind of problem is solved either with a simultaneous or with a sequential approach [3]. The simultaneous approach results in solving a single large-scale NLP. In contrast, the sequential approach requires interaction of the numerical integrator and the numerical optimizer.
CONTACT Jens Deussen. Email: deussen@stce.rwth-aachen.de

In this paper, we propose the simulation of DAEOs with deterministic global optimization (DGO) methods instead of substituting the optimization problem with first-order optimality conditions. This has the advantage that the solution point of the optimization problem is guaranteed to be a global optimum. Embedding a global optimization problem requires considering differential inclusions as a generalization of differential equations for the cases where several global optimizers exist. Among other things, this is the case when the global optimizer jumps. A jump of the global optimizer is referred to as an event, which can occur due to the nature of dynamic systems. In the presence of an event the resulting DAE system becomes nonsmooth. Using the local optimum of the previous time step even if the time period covers an event amounts to a type of discontinuity locking [25]. Unfortunately, the convergence order of the integrator is reduced to first order without explicit treatment of event locations. To achieve second-order convergence across jumps of the global optimum it is necessary to implement an adaptive time-stepping or event-location procedure. We will describe how to detect and locate events.
An alternative approach for obtaining a higher convergence order of the integrator is presented in [13]. The authors utilize a generalization of algorithmic differentiation (AD) to treat nonsmooth right-hand sides of ordinary differential equations (ODEs). Further information on the theory of AD can be found in [10,24]; for information on nonsmooth AD we refer to [11,17].
The paper is organized as follows: In Section 2, we describe the mathematical formulation of the type of problem we are considering in this paper. This includes the description of the DAEO and assumptions on the DAEO as well as on the events, i.e., jumps of the global minimizer. Section 3 gives an overview of interval computations and deterministic global optimization. Based on [5], we show how to obtain all convex subdomains of a nonconvex objective function that potentially contain a local minimum. The numerical simulation of the DAEO is described in Section 4. This includes a description of how to detect whether an event has happened in a time period (event detection) and how to find the time step at which the event takes place (event location). In Section 5 the presented methods are applied to two examples. For the first example, we derive an analytical solution to examine the convergence behavior of the simulation with and without explicit treatment of the events. Section 6 summarizes the results and gives an outlook on future work.

Theoretical Background
We consider the initial value problem for a differential inclusion with an embedded global optimization problem
$$x'(t) = f(x(t), y(t)), \quad x(t_0) = x_0, \quad y(t) \in \arg\min_y h(x(t), y), \quad (1)$$
where $f : \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}^{n_x}$ is the differential part of the system and $h : \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}$ is the objective function. We use the notation $x' = \partial x/\partial t$ for the derivative with respect to time, $\partial_x = \partial/\partial x$ for the (partial) derivative and $d_x = d/dx$ for the total derivative with respect to $x$ where it is crucial to distinguish it from the partial derivative in the context of implicit differentiation. Second partial derivatives are denoted by $\partial^2_{xy} = \partial^2/\partial x \partial y$. The problem in eq. (1) is a differential inclusion instead of a differential equation because the optimum of the embedded optimization problem is not necessarily unique. In this paper we consider the case where the solution set consists of a finite set of isolated strict local optima that are also potential global optima,
$$\arg\min_y h(x, y) = \{y^1, \ldots, y^S\},$$
with $h(x, y^i) = h(x, y^j)$ for all $i, j \in \{1, \ldots, S\}$ and $h(x, y^i) < h(x, z)$ for all $i \in \{1, \ldots, S\}$, $z \in \mathbb{R}^{n_y} \setminus \arg\min_y h(x, y)$. Notably, this also means that the implicit set-valued map $y(t) = y(x(t))$ is not convex.
Assumption 1. The functions $f$ and $h$ are twice Lipschitz continuously differentiable with respect to $(x, y)$.
For each strict local optimum we have necessary and sufficient local optimality conditions
$$0 = \partial_y h(x, y), \qquad 0 \prec \partial^2_{yy} h(x, y),$$
i.e., the gradient with respect to $y$ is zero and the Hessian is positive definite.
Assumption 2. The initial value problem is posed in such a way that the necessary and sufficient local optimality conditions hold for $(x(t), y^i(t))$ for all $i \in \{1, \ldots, S\}$, $t \in [t_0, T]$.
In order to turn the differential inclusion into a discontinuous differential equation with a unique solution we consider the case where the solution set has size larger than one only for a set of times of measure zero. We make the even stronger assumption that the set of times where the solution set is larger than one is finite in order to make handling these events numerically feasible.

Assumption 3. There exists a finite set of events $0 < t_{e_1} < \ldots < t_{e_K} < T$ at which the solution set contains more than one element. At such a time, the global optimizer either crosses from one local optimum to another or the two optima touch without crossing.

The second case is not relevant for the solution of the differential equation because it occurs on a set of measure zero. We make a transversality assumption that precludes the second case. First, we define the condition of two local optima $y^1$ and $y^2$ attaining the same objective value as the root of the event function
$$H(x(t)) = h(x(t), y^1(x(t))) - h(x(t), y^2(x(t))).$$

Assumption 4. All events are transversal: assuming that $y(x(t_{e_j})) = y^1(x(t_{e_j}-))$, we have $d_t H(x(t_{e_j})) \neq 0$.

With the above assumptions we can write the original differential inclusion in eq. (1) as multiple initial value problems for differential equations for $j = 0, \ldots, K$:
$$x_j'(t) = f(x_j(t), y_j(t)), \quad y_j(t) = \arg\min_y h(x_j(t), y), \quad t \in (t_{e_j}, t_{e_{j+1}}), \quad (2)$$
where $t_{e_0} = t_0$ and $t_{e_{K+1}} = T$. The time periods $(t_{e_j}, t_{e_{j+1}})$, $j = 0, \ldots, K$, will be referred to as phases.
Proposition 2.1. Equation (2) is a DAEO for each phase because $y_j(t)$ is a locally unique implicit function of $x(t)$ that is also Lipschitz continuous in $x(t)$.
Proof. The first-order optimality condition that we assumed to hold for all local optima is $0 = \partial_y h(x, y)$. By the implicit function theorem and using the positive definiteness of the Hessian we have
$$d_x y = -\left(\partial^2_{yy} h(x, y)\right)^{-1} \partial^2_{yx} h(x, y).$$
Because $y$ is only given implicitly, the above problems are really DAEs. We get the DAE formulation by replacing the embedded optimization problem with its first-order optimality condition:
$$x'(t) = f(x(t), y(t)), \qquad 0 = \partial_y h(x(t), y(t)). \quad (3)$$
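As a minimal numerical sketch of this implicit differentiation step, $d_x y$ can be computed by solving a linear system with the Hessian. The concrete objective $h(x, y) = (y - x)^2 + 0.1\,y^4$ and all function names below are assumptions chosen purely for illustration:

```python
import numpy as np

# Hypothetical objective h(x, y) = (y - x)**2 + 0.1 * y**4; any h
# satisfying Assumption 1 works the same way.
def d2h_dyy(x, y):
    # Hessian of h with respect to y (here a 1x1 matrix).
    return np.array([[2.0 + 1.2 * y**2]])

def d2h_dyx(x, y):
    # Mixed second derivative block of h.
    return np.array([[-2.0]])

def dy_dx(x, y):
    # Implicit function theorem: d_x y = -(d^2h/dy^2)^{-1} (d^2h/dydx).
    return -np.linalg.solve(d2h_dyy(x, y), d2h_dyx(x, y))

print(dy_dx(0.0, 0.0))  # -> [[1.]]
```

Positive definiteness of the Hessian at a strict local optimum guarantees that the linear solve is well posed.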

Global search for local optima
Interval arithmetic (IA) [23] evaluations have the property that all values the function can attain on a given domain are reliably contained in the output of the corresponding interval evaluation. To obtain global information on the function value, a single function evaluation in IA suffices instead of multiple function evaluations at several points.
For a (compact) interval variable $[y] = [\underline{y}, \overline{y}]$ with lower and upper bounds $\underline{y}, \overline{y} \in \mathbb{R}^{n_y}$, the united extension (UE) of a function $h$ evaluated on $[y]$ is defined as
$$h([y]) = \{ h(y) \mid y \in [y] \}.$$
The UEs of algebraic operators and elemental functions, i.e., general power, general root, exponential, logarithmic, trigonometric and hyperbolic functions, on compact domains are well known. However, this does not apply to composite functions. To enable IA of composite functions, the natural interval extension (NIE) replaces all algebraic operators and elemental functions in the evaluation procedure by their UEs.
The evaluation of a function $h$ on $[y]$ by the NIE yields
$$[h]([y]) \supseteq h([y]).$$
The superset states that the resulting interval can be an overestimation of the UE. Overestimation can occur if the underlying data format (e.g., floating-point numbers) cannot represent the exact bounds of a computed interval. In this case the IA evaluation rounds towards negative or positive infinity for a lower or upper bound, respectively. Furthermore, overestimation can be caused by the dependency problem: if a function evaluation uses a variable multiple times, IA does not take into account that the actual values taken from these intervals are equal. The larger the intervals are, the more significant the overestimation is. Another challenge for IA is conditional branches that depend on interval arguments, since comparisons of intersecting intervals are ambiguous. Splitting or multi-section [15] of the original domain and evaluation in IA on the subdomains can address this problem. While the NIE converges linearly to the UE [23], so-called mean-value forms [22] converge quadratically. An alternative approach to obtain global information on the function value are McCormick relaxations [20,21]. These relaxations converge quadratically to the convex hull, from which the UE can be determined.
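The dependency problem can be observed with a few lines of interval code. The tiny `Interval` class below is a hypothetical sketch without outward rounding, not a full IA library:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))

y = Interval(-1.0, 2.0)
# Dependency problem: y*y and y - y ignore that both operands are the
# same variable, so the natural interval extension overestimates.
print(y * y)  # Interval(lo=-2.0, hi=4.0); the united extension is [0, 4]
print(y - y)  # Interval(lo=-3.0, hi=3.0); the united extension is {0}
```

Rewriting an expression so that each variable appears only once (where possible) removes this source of overestimation, which is why factored forms and relaxations matter in practice.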
To obtain guaranteed ranges for the derivatives on a given domain, the NIE can be applied to derivative computations, e.g., by AD. In [7] convergence of the interval methods applied to AD models is shown and cases are investigated for which the NIE of the AD models yields the UE. We will denote the interval gradient by $[\partial_y h]([y])$ and the interval Hessian by $[\partial^2_{yy} h]([y])$. This information can be used to exclude subdomains on which the necessary condition is violated and to verify that the sufficient condition is fulfilled.
In [6], it is shown how to sharpen the bounds of the enclosure of the gradient by using McCormick relaxations instead of the NIE. The nonsmooth McCormick relaxations are abs-factorable such that we can apply piecewise linearization as suggested in [11]. Thus, for computing the optima of the McCormick relaxations we use successive piecewise linearization, an optimization method that repeatedly utilizes piecewise linearization models in a proximal-point-type method [12].
A divide-and-conquer algorithm can be utilized to detect all convex subdomains $B([y])$ that potentially contain a local minimum of $h(x, y)$. The existence of these boxes is shown by Proposition 3.2.
Lemma 3.1. The spectral norm of a symmetric positive definite matrix $A$ is its maximum eigenvalue.

Proof. For a symmetric matrix $A$ the spectral norm is $\|A\|_2 = \max_i |\lambda_i(A)|$. Since all eigenvalues of a positive definite matrix are positive, $\|A\|_2 = \lambda_{\max}(A)$.
Proposition 3.2. For every local optimum $y^\diamond \in \arg\min_y h(x, y)$ there is a closed box $B(a, b) = \{ y^\diamond + v \mid a \leq v \leq b \}$ with $a < 0 < b$ such that $\lambda_{\min}(\partial^2_{yy} h(x, y)) \geq \delta$ for some $\delta > 0$ for all $y \in B(a, b)$.
Proof. By Assumption 1 the Hessian is Lipschitz continuous on the box. Using Lemma 3.1 and the triangle inequality we get
$$\|\partial^2_{yy} h(x, y) - \partial^2_{yy} h(x, y^\diamond)\|_2 \leq L \, \|y - y^\diamond\|$$
for some $L > 0$. By equivalence of norms in finite-dimensional vector spaces we have $\|y - y^\diamond\| \leq C \, \|b - a\|_\infty$ for some $C > 0$. It follows that for any box $B(a, b)$ we can find a bound $Z > 0$ such that
$$\|\partial^2_{yy} h(x, y) - \partial^2_{yy} h(x, y^\diamond)\|_2 \leq Z$$
for all $y \in B(a, b)$. We choose $b_i > 0$ and $a_i < 0$, so that $y^\diamond \in \operatorname{int}(B(a, b))$, and we choose them small enough that the bound $Z$ leaves the smallest eigenvalue of the Hessian above $\delta$ on the whole box.

The divide-and-conquer algorithm recursively refines the domain until a subdomain can either be eliminated due to violation of the optimality conditions or be returned due to verification of these conditions. The union of the returned and the eliminated subdomains covers the original domain. The processing of a subdomain $B$ consists of the following main components:

• Branching: Split the current domain $B$ into subdomains $B_j$ with $B = \bigcup_j B_j$.
Performing local searches on the convex subdomains stored in $B(t)$ results in the set $S(t)$ that contains all local minima $y^\diamond$. Some of these subdomains might not contain a local minimum due to the already mentioned overestimation of IA. In that case the local search finds the minimum on the boundary of the subdomain and does not include it in the set $S(t)$.
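A local search on one returned box can be sketched, for instance, as projected gradient descent. The objective $h(y) = (y^2 - 1)^2$, its derivative and all parameters below are assumptions for illustration, not the method used in [5]:

```python
def local_min(dh, lo, hi, y0, step=0.1, iters=200):
    # Plain gradient descent on one convex box, with each iterate
    # clipped (projected) back into the box [lo, hi].
    y = y0
    for _ in range(iters):
        y = min(max(y - step * dh(y), lo), hi)
    return y

# h(y) = (y**2 - 1)**2 has dh(y) = 4*y*(y**2 - 1); the box [0.5, 1.5]
# isolates the local minimum at y = 1.
y_star = local_min(lambda y: 4 * y * (y**2 - 1), 0.5, 1.5, 0.6)
print(round(y_star, 6))  # -> 1.0
```

If the box contains no interior minimum (an overestimation artifact), the projection leaves the iterate stuck on the boundary, which is exactly the case excluded from $S(t)$ above.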
To reduce the computational effort of the complete search algorithm, it might be desirable to implement a bounding step into the algorithm as is done by conventional branch-and-bound algorithms implemented in DGO solvers, e.g., MAiNGO [2] or BARON [18]. Instead of finding all local optima $y^\diamond$, one would focus on those local optima that are close to the global optimum $y^*$. These local optima fulfill $h(x, y^\diamond) \leq h(x, y^*) + \alpha$ for $\alpha > 0$. That leads to additional steps in the processing of subdomain $B$:

• Bounding: Compute a guaranteed lower bound $\underline{h}$ of a convex relaxation of the objective function on $B$.
• Pruning: Evaluate the function at a feasible point $y \in B$, e.g., the midpoint $y^\dagger$ of $B$, and update the current best solution $h^*$ if $h(x, y^\dagger) < h^*$.
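A one-dimensional sketch of the divide-and-conquer scheme (branching plus the necessary-condition test on an interval gradient) might look as follows. The objective $h(y) = (y^2 - 1)^2$, the hand-coded NIE of $h'$ and the tolerance are assumptions for illustration, and the bounding and pruning steps are omitted:

```python
def imul(a, b):
    # Interval multiplication [a] * [b], intervals as (lo, hi) tuples.
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def igrad(box):
    # NIE of h'(y) = 4*y**3 - 4*y for h(y) = (y**2 - 1)**2.
    y3 = imul(imul(box, box), box)
    return (4.0 * y3[0] - 4.0 * box[1], 4.0 * y3[1] - 4.0 * box[0])

def search(box, tol=1e-3, out=None):
    # Discard boxes with 0 not in [h'] (first-order condition violated
    # everywhere in the box); otherwise branch until boxes are small.
    if out is None:
        out = []
    g = igrad(box)
    if g[0] > 0.0 or g[1] < 0.0:
        return out
    if box[1] - box[0] < tol:
        out.append(box)
        return out
    mid = 0.5 * (box[0] + box[1])
    search((box[0], mid), tol, out)   # branching step
    search((mid, box[1]), tol, out)
    return out

boxes = search((-2.0, 2.0))
# Every surviving box clusters around a stationary point y in {-1, 0, 1};
# a second-derivative (interval Hessian) test would additionally discard
# the local maximum at y = 0.
```

The union of discarded and returned boxes covers the original domain, matching the cover property stated above.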

Numerical simulation
In this section, we present how to solve eq. (3) numerically with a temporal discretization for the numerical integrator and how to detect and locate events at which the global optimizer switches.

Time stepping
We select an implicit scheme for the time stepping of the simulation of the dynamic system for stability, and a higher-order scheme for efficiency in terms of step size. The trapezoidal rule, a second-order convergent Runge-Kutta method, computes for discrete time steps $t_k$ and $t_{k+1}$ with $\Delta t_k = t_{k+1} - t_k$
$$x_{k+1} = x_k + \frac{\Delta t_k}{2} \left( f(x_k, y_k) + f(x_{k+1}, y_{k+1}) \right),$$
such that together with the optimality condition of eq. (3) we obtain the discretized system
$$0 = x_{k+1} - x_k - \frac{\Delta t_k}{2} \left( f(x_k, y_k) + f(x_{k+1}, y_{k+1}) \right), \qquad 0 = \partial_y h(x_{k+1}, y_{k+1}), \quad (4)$$
with phase initial values $x_j(t_{e_j}) = x_0$ if $j = 0$ and $x_j(t_{e_j}) = x_{j-1}(t_{e_j})$ otherwise. The linearized system in eq. (4) can be solved by a linear solver, e.g., with LU decomposition. Since the events $t_{e_j}$, $j \in \{1, \ldots, K\}$, are not known a priori, we need a mechanism for event detection.
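The time stepping can be sketched on a toy smooth DAEO. The concrete choices $f(x, y) = y$ and $h(x, y) = (y - x)^2$ (so that the unique optimizer is $y(x) = x$ and the exact solution is $x(t) = e^t$), and the fixed-point solve in place of a Newton/LU solve, are assumptions made purely to expose the second-order behavior:

```python
import math

def optimizer(x):
    # arg min_y (y - x)**2 has the closed form y = x in this toy case.
    return x

def f(x, y):
    return y

def trapezoidal_step(x_k, dt):
    # Solve x_{k+1} = x_k + dt/2 * (f(x_k, y_k) + f(x_{k+1}, y_{k+1}))
    # by fixed-point iteration (a Newton solve in the general case).
    y_k = optimizer(x_k)
    x_next = x_k
    for _ in range(50):
        x_next = x_k + 0.5 * dt * (f(x_k, y_k) + f(x_next, optimizer(x_next)))
    return x_next

def integrate(dt, T=1.0):
    x = 1.0
    for _ in range(round(T / dt)):
        x = trapezoidal_step(x, dt)
    return x

for dt in (0.1, 0.05, 0.025):
    print(dt, abs(integrate(dt) - math.e))  # error shrinks ~4x per halving
```

Halving the step size reduces the error by roughly a factor of four; this is the second-order behavior that the event handling below is meant to preserve across jumps of the global optimizer.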

Event detection
Let us assume we track the dynamic set of multiple strong local optima. We assume that only finitely many isolated switches happen, so we can always choose $\Delta t_k = t_{k+1} - t_k$ small enough that at most one switch happens per time step. The problem is not that a back-and-forth switch might happen between $t_k$ and $t_{k+1}$ that we do not observe (because the optimum at $t_{k+1}$ is again the same as at $t_k$). The problem is to distinguish the elements of the sets $S(t_k)$ and $S(t_{k+1})$ in such a way that it is clear whether one switch or no switch has happened.
The elements of the sets $S(t_k)$ and $S(t_{k+1})$ have no intrinsic order. For example, $y^1(t_k)$ could be the first element of a list representation of $S(t_k)$ but $y^1(t_{k+1})$ the second element of a list representation of $S(t_{k+1})$. The task of event detection comes down to distinguishing elements in the list representation, because then we can check whether $h(y^s(t_{k+1})) < h(y^1(t_{k+1}))$ for some $s \neq 1$.
Let $\hat{y}^1_k = y^1(t_k)$ be the first element of $S(t_k)$. We want to determine which elements $\hat{y}^s$ for $s \in \{1, \ldots, S\}$ could potentially be $\hat{y}^s = y^1(t_{k+1})$. To this end we use that $y^1(t) = y^1(x(t))$ is a locally Lipschitz continuous implicit function of $x$.

Proposition 4.1. Given $y^1(t_k)$, we can bound the values of $y^1(t_k + \Delta t_k)$ by a Lipschitz constant $L > 0$:
$$\|y^1(t_k + \Delta t_k) - y^1(t_k)\| \leq L \, \Delta t_k.$$

Proof. The trajectory $x(t)$ is absolutely continuous due to Carathéodory's existence theorem [14]. We are considering a compact time interval, so absolute continuity implies Lipschitz continuity of $x$ in $t$. The implicit function $y(x)$ is Lipschitz continuous due to strong local convexity and the implicit function theorem in combination with the mean value theorem. The composition of the two Lipschitz continuous maps yields the claim.

The same idea holds in the discrete-time setting when allowing for some error. This is obvious, for example, when using an explicit Euler step, where the change between $x(t_{k+1})$ and $x(t_k)$ is directly given by the dynamics. Instead of the true Lipschitz constant in $t$ we use the derivative at that time step, $(y^1)'(t_k)$.
The event detection can be implemented into an integrator by verifying in every time step that the tracked optimizer is still the global one, i.e.,
$$h(x_{k+1}, y^1(t_{k+1})) \leq h(x_{k+1}, y^s(t_{k+1})) \quad \text{for all } s \in \{2, \ldots, S\}. \quad (5)$$
An event is detected if eq. (5) is violated. In that case, it remains to find the event location.
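A sketch of this check in one dimension follows. The objective `h`, the Lipschitz constant `L` and the candidate lists are illustrative assumptions; a real implementation works on the sets returned by the DGO solver:

```python
def detect_event(h, x_next, y1_prev, S_next, L, dt):
    # Match the tracked optimizer across the step via the Lipschitz
    # bound ||y1(t_{k+1}) - y1(t_k)|| <= L * dt, ...
    candidates = [y for y in S_next if abs(y - y1_prev) <= L * dt]
    y1_next = min(candidates, key=lambda y: abs(y - y1_prev))
    # ... then test whether it is still the global optimum: an event is
    # detected iff some other local optimum now has a lower value.
    event = any(h(x_next, y) < h(x_next, y1_next)
                for y in S_next if y is not y1_next)
    return event, y1_next

# Toy check with h(x, y) = (y**2 - 1)**2 - x*y: the optima sit near
# y = +-1 and the sign of the tilt x decides which one is global.
h = lambda x, y: (y**2 - 1)**2 - x * y
print(detect_event(h, -0.1, 1.0, [1.0, -1.0], L=1.0, dt=0.1))  # (True, 1.0)
```

The Lipschitz matching step is what resolves the arbitrary ordering of the list representations of $S(t_k)$ and $S(t_{k+1})$ discussed above.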

Event location
We integrate the event-location procedure into the time-stepping procedure via the Mannshardt approach [19], which is compatible with our assumptions. Finding the event times $t_{e_j}$ involves numerically solving for the roots of the event function. We use Newton's method to solve $0 = H(x(t_{e_j}))$ for $t_{e_j}$, where $x(t_{e_j})$ is a local numerical approximation of the solution trajectory. Differentiation of $x(t_{e_j})$ with respect to $t_{e_j}$ is straightforward, so we need the gradient of the event function with respect to $x$:
$$d_x H(x) = \partial_x h(x, y^1(x)) + \partial_y h(x, y^1(x)) \, \partial_x y^1(x) - \partial_x h(x, y^2(x)) - \partial_y h(x, y^2(x)) \, \partial_x y^2(x).$$
The individual $\partial_x y^i(x)$ are computed by the implicit function theorem as seen above. Note that $y^2(t_{k+1})$ can be obtained directly from the DGO solver since it is the global optimum, while $y^1(t_{k+1})$ needs to be identified in the set of local optima $S(t_{k+1})$. This can be achieved by local optimization, i.e., a local minimization of $h(x(t_{k+1}), \cdot)$ started from $y^1(t_k)$.
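A minimal sketch of the Newton iteration for the event time follows; taking $x(t)$ as the linear interpolant over the step is an assumption for illustration (Mannshardt's approach uses the integrator's local approximation of the trajectory), and `H` and `dH_dx` stand in for the event function and its gradient:

```python
def locate_event(H, dH_dx, t_k, t_next, x_k, x_next, tol=1e-12):
    # Newton's method on t |-> H(x(t)) with x(t) the linear interpolant
    # of the step; d/dt H(x(t)) = dH_dx(x(t)) * x'(t) by the chain rule.
    dxdt = (x_next - x_k) / (t_next - t_k)
    x = lambda t: x_k + (t - t_k) * dxdt
    t = 0.5 * (t_k + t_next)          # initial guess: middle of the step
    for _ in range(50):
        step = H(x(t)) / (dH_dx(x(t)) * dxdt)
        t -= step
        if abs(step) < tol:
            break
    return t

# Toy event function H(x) = x - 0.5 (root where the two optima tie).
t_e = locate_event(lambda x: x - 0.5, lambda x: 1.0,
                   t_k=0.0, t_next=0.1, x_k=0.4, x_next=0.8)
print(t_e)  # ~0.025
```

Transversality (Assumption 4) guarantees that the denominator of the Newton update does not vanish at the event.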

Numerical experiments
We investigate two example DAEOs with global optima that are discontinuous in time. The first example is a very basic example with two local optima for which an analytical solution can easily be derived. The second example has multiple local optima that emerge and vanish over time. We will compare numerical simulations with and without event detection. Furthermore, we set the time step for the numerical integration to $\Delta t = 0.02$ and use the trapezoidal rule as explained above. For the DGO part of the simulation, we use the solver described in [5].
Example 5.1. As a first example, we consider a DAEO whose objective function $h(x(t), y(t))$ has two local optima at time-independent positions $y^1(t) = 1$ and $y^2(t) = -1$, i.e., $\partial_t y^1(t) = 0$ and $\partial_t y^2(t) = 0$. A surface plot of $h(x(t), y(t))$ is shown in Figure 2. Finding the root of the event function yields the event location at $x(t) = 0.5$, and solving the DAEO analytically in the first phase gives the event time $t_e = -\log(0.5)/3$. The second phase is solved starting from the value of $x$ at $t_e$.

On the left side of Figure 3, the results for the differential variable $x(t)$ are shown. The blue line represents the analytical solution, while the orange circles mark the numerical simulation without event detection and the black crosses mark the one with event detection (and location). It becomes visible that the differential variable is nonsmooth at the event. On the right side of Figure 3 the convergence behavior of the numerical simulations with (black crosses) and without (orange circles) explicit treatment of the event is plotted for decreasing time steps. The version with explicit treatment converges quadratically to the analytical solution, while the version without explicit treatment only converges linearly. Computing the event location and treating it explicitly in the simulation is only possible due to the tracking and detection of switches in the global optimum.
Example 5.2. The second DAEO considers a 1D Griewank-inspired function [9] as embedded optimization problem. A surface plot of the function with the global optimizer $y(t)$ (red line) is given on the left side of Figure 4. It can be seen that there are four events on the time interval $t \in [0, 2]$. The differential variable $x(t)$ (blue), the global optimizer $y(t)$ (red) and the global optimum $h(x(t), y(t))$ (green) are represented as lines in the right graph of Figure 4. The numerical results of the version without (dashed) explicit treatment of the events differ slightly from the results with explicit treatment. This is a result of error propagation due to the behavior of the dynamic system.

Conclusions and further work
We introduced event detection and event location for a jumping global optimizer. The explicit treatment of events and its implementation into the numerical integrator yield a second-order convergent method for DAEOs in the presence of a jumping global optimizer. The second-order integrator enables the computation of discrete tangent and adjoint sensitivities of the dynamic system with respect to parameters, which is crucial for solving optimal control problems of DAEOs. These sensitivities can now be obtained by AD methods. Due to the high computational cost that comes along with the application of DGO in every time step, we aim to track relevant local optima in time instead. The global search is then only required for obtaining an initial list of local optima and for recognizing that a new local optimum emerged during a time step. Vanishing optima should not be a problem with this approach.
for all $y \in B(a, b)$. Let $I$ be the $n_y \times n_y$ identity matrix. The shifted matrix $\partial^2_{yy} h(x, y) - ZI$ has the dominant eigenvalue $\lambda_{\min}(\partial^2_{yy} h(x, y)) - Z$ for all $y \in B(a, b)$. Using Lemma 3.1, the triangle inequality, strong convexity of the Hessian and Lipschitz continuity (Assumption 1) we get

Figure 3.: Analytical (line) and numerical (marks) results for the differential variable $x(t)$ of Example 5.1 on the left. The differential equation is solved by the trapezoidal rule with $\Delta t = 0.02$ without (superscript −) and with (superscript +) explicit treatment of events. The convergence behavior of the two methods is shown on the right.
The DAEO of Example 5.2 is described by
$$x(0) = 1, \quad x'(t) = y(t), \quad \{y(t)\} = \arg\min_y \left( (y(t) - x(t))^2 + \sin(C y(t)) \right),$$
with $C = 5$ and $T = 2$. The objective function has multiple optima that emerge and vanish over time.

Figure 4.: Surface plot of the 1D Griewank-inspired function from Example 5.2 with the global optimizer (red line) shown on the left. The right side shows the differential variable $x(t)$ (blue), the global optimizer $y(t)$ (red) and the value of the global optimum (green) for the versions with (solid) and without (dashed) explicit treatment of events.
$\{y^1(t), \ldots, y^S(t)\}$. No local optimum emerges or vanishes between $t_k$ and $t_{k+1}$, such that we have $|S(t_k)| = |S(t_{k+1})| = S$. Furthermore, we assume that there is an event at $t_e$. The unique global optimum for $t_k \leq t < t_e$ is $y^1(t)$. We have $h(y^1(t)) < h(y^s(t))$ for all $s \in \{1, \ldots, S\} \setminus \{1\}$, $t \in [t_k, t_e)$. An event occurs at $t_e$ when, without loss of generality, $h(y^2(t_e)) = h(y^1(t_e))$. Due to transversality there exists a $t_{k+1} > t_e$ such that $h(y^2(t)) < h(y^s(t))$ for all $s \in \{1, \ldots, S\} \setminus \{2\}$, $t \in (t_e, t_{k+1}]$. Note that we keep the superscripts of the local optima from $t_k$ for $t \in [t_k, t_{k+1}]$. The question of event detection in the discrete-time setting is the following: looking only at randomly ordered list representations of $S(t_k)$ and $S(t_{k+1})$ that are obtained by a (global) optimization algorithm, has a switch of the global optimum happened in the meantime?
Figure 2.: The left image shows a surface plot of $h(x(t), y(t))$ for $t \in [0, 1]$. The position $y(t)$ of the global optimum, which lies at $y^1 = 1$ for $t < t_e$ and at $y^2 = -1$ for $t > t_e$, is marked by the red line while the green lines indicate its value.