Integer optimal control problems with total variation regularization: Optimality conditions and fast solution of subproblems

We investigate local optimality conditions of first and second order for integer optimal control problems with total variation regularization via a finite-dimensional switching point problem. We show the equivalence of local optimality for both problems, which will be used to derive conditions concerning the switching points of the control function. A non-local optimality condition treating back-and-forth switches will be formulated. For the numerical solution, we propose a proximal-gradient method. The emerging discretized subproblems will be solved by employing Bellman's optimality principle, leading to an algorithm which is polynomial in the mesh size and in the number of admissible control levels. An adaptation of this algorithm can be used to handle subproblems of the trust-region method proposed in Leyffer, Manns, 2021. Finally, we present computational results.


Introduction
We investigate the infinite-dimensional mixed-integer optimization problem: Minimize F(u) + β TV(u) such that u(t) ∈ {ν_1, ..., ν_d} for a.a. t ∈ (0, T). (P) Here, the admissible control values satisfy {ν_1, ..., ν_d} ⊂ Z with ν_1 < ν_2 < ... < ν_d, and TV(u) is the total variation of the function u, see Section 2. The first part of the objective is kept rather general and might contain, e.g., the solution operator of a differential equation. Therefore, (P) covers a large class of mixed-integer optimal control problems, and these have an abundance of applications. We refer to Leyffer, Manns, 2021, Severitt, Manns, 2022 and the references therein.
In Leyffer, Manns, 2021, problems of the form (P) have been investigated and a trust-region algorithm has been proposed, with the subproblems being modeled as linear integer programs. Here, we will extend some of the obtained results. For further investigations of mixed-integer optimal control problems, see, e.g., Hante, Sager, 2013, Bestehorn et al., 2020, Kirches, Manns, Ulbrich, 2021 and Sager, Zeile, 2021, which use an approach based on the combinatorial integral approximation decomposition.
At this point, we would like to mention that the total variation term in (P) ensures the existence of minimizers under rather mild assumptions on F. To be precise, it suffices to assume that F : L^1(0, T) → R is lower semicontinuous and bounded from below, see Leyffer, Manns, 2021, Proposition 2.3 and the short argument after Theorem 2.2 below. Since the total variation term penalizes the number (and height) of the switches of the control function u, it is also desirable from an application point of view.
The aim of this paper is threefold. After recalling some properties of the total variation in Section 2, we address optimality conditions for (P) in Section 3. In particular, we verify local optimality conditions of first and second order (Theorem 3.10) and we also formulate some non-local optimality conditions (Section 3.3) in the spirit of the classical mode insertion as in Egerstedt, Wardi, Axelsson, 2006, Section IV. Second, we propose a proximal-gradient method for the solution of (P) in Section 4. Third, we give a fast algorithm for the solution of the proximal-gradient subproblem (Section 4.2) as well as for the subproblem arising in the trust-region method proposed in Leyffer, Manns, 2021 (Section 5). Finally, we illustrate our findings by some numerical experiments in Section 6.

The total variation functional
In this section, we recall the definition of the total variation functional TV : L^1(0, T) → [0, ∞] and give some basic properties. Furthermore, we write TV(u) := TV(u; (0, T)).
The space BV(0, T) of functions with bounded variation is defined as the set of all u ∈ L^1(0, T) with TV(u) < ∞, equipped with the norm ‖u‖_{BV(0,T)} = ‖u‖_{L^1(0,T)} + TV(u).
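For a piecewise constant function u = Σ_{j=1}^n a_j χ_{(t_{j−1}, t_j)}, the total variation reduces to the sum of the absolute jump heights, TV(u) = Σ_{j=1}^{n−1} |a_{j+1} − a_j|, independently of the interval lengths. A minimal sketch (the function name is ours):

```python
def tv_piecewise_constant(values):
    """Total variation of a piecewise constant function given by its
    consecutive level values a_1, ..., a_n: the sum of the absolute
    jump heights |a_{j+1} - a_j| (interval lengths do not matter)."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))
```

For example, a control taking the values 0, 1, 2, 1 on four consecutive intervals has total variation 3.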
Since both F and TV are defined on L 1 (0, T ), we will ignore null sets in the following.
For the next sections, some properties of BV(0, T ) are needed.
Theorem 2.2. The space BV(0, T) and the functional TV have the following properties.
(i) The space BV(0, T) is (isometrically isomorphic to) the dual space of a separable Banach space.
(ii) For a sequence (u_k)_{k∈N} ⊂ BV(0, T), we have u_k ⇀ u in BV(0, T) if and only if u_k → u in L^1(0, T) and (u_k)_{k∈N} is bounded in BV(0, T).
(iv) When u_k ⇀ u in BV(0, T), we have u_k → u in L^p(0, T) for all p ∈ [1, ∞).
(v) If (u_k)_{k∈N} is bounded in BV(0, T), there exists a weak accumulation point of (u_k)_{k∈N}.
(vi) The functional TV is sequentially lower semicontinuous: if u_k → u in L^1(0, T), then TV(u) ≤ lim inf_{k→∞} TV(u_k).
In order to prove (vi), we take a subsequence with lim inf_{k→∞} TV(u_k) = lim_{l→∞} TV(u_{k_l}). For an arbitrary ϕ ∈ C^1_c(0, T) with ‖ϕ‖_{L^∞(0,T)} ≤ 1, we have ∫_0^T u ϕ′ dt = lim_{l→∞} ∫_0^T u_{k_l} ϕ′ dt ≤ lim_{l→∞} TV(u_{k_l}) = lim inf_{k→∞} TV(u_k). Taking the supremum over all these ϕ, we get the desired inequality.
The existence of a solution can be shown by standard arguments: A minimizing sequence (u_k)_{k∈N} ⊂ U_ad is bounded in L^1(0, T) by T max(|ν_1|, |ν_d|), while the boundedness of TV(u_k) follows from the existence of a lower bound for F. Using Theorem 2.2 (v), the existence of a weakly convergent subsequence (u_{k_l})_{l∈N} with u_{k_l} ⇀ ū ∈ BV(0, T) can be derived. Considering Theorem 2.2 (ii), we see that u_{k_l} → ū in L^1(0, T). Thus, there is another subsequence (u_m)_{m∈N} ⊂ (u_{k_l})_{l∈N} with u_m(t) → ū(t) for a.e. t ∈ (0, T). It follows that ū(t) ∈ {ν_1, ..., ν_d} a.e. in (0, T), hence ū ∈ U_ad. Finally, the lower semicontinuity of F and Theorem 2.2 (vi) yield the optimality of ū.

Then, we have
Note that equality in (2.1) need not hold, even in the case t_1 = 0, t_n = T, since jumps at the points t_2, ..., t_{n−1} are ignored by the left-hand side of (2.1).

Optimality conditions
In this section, we discuss optimality conditions for (P). First, we address a switching-point reformulation in Section 3.1. This can be used to derive local optimality conditions of first and second order in Section 3.2. Afterwards, we consider non-local optimality conditions in Section 3.3.

Switching point reformulation
Let v_{t,a} := Σ_{j=1}^n a_j χ_{(t_{j−1}, t_j)} for t ∈ R^{n−1} and a ∈ R^n, where we again use t_0 = 0 and t_n = T.
In Leyffer, Manns, 2021, Corollary 4.4, it is shown that every feasible point of (P) can be written as u = v_{t,a}, where n is chosen as small as possible.
We give a different representation.
Before giving the proof, we explain the meaning of the conditions (i)-(iv). Using conditions (i) and (iii), we can identify u with a piecewise constant function with the switching points t̂_j, j ∈ {1, ..., n − 1}. In contrast to the representation in Leyffer, Manns, 2021, Proposition 4.4, we also allow equality of time steps. By (ii), the equality of two or more t̂_j is needed when u is increasing or decreasing by more than one level. Finally, (iv) prevents unnecessary and repetitive switching between two levels at the same time instance. To illustrate the difference to Leyffer, Manns, 2021, Proposition 4.4, we consider the following example.
The uniqueness of (n, t̂, â) is easy to check.
In what follows, we associate with a given function the representations from Leyffer, Manns, 2021, Proposition 4.4 and from Lemma 3.1.
Here, the value n ∈ N is as small as possible; thus, we refer to u = v_{t,a} as the minimal representation of u.
Finally, we define the index sets (associated with the minimal representation). The set J_+ (J_-) consists of exactly those indices j for which there is an upward (downward) jump at t = t_j which skips over the control levels between a_j and a_{j+1}.
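The passage from the minimal to the full representation can be sketched as follows: a jump of u from a_j to a_{j+1} that skips intermediate admissible levels is replaced by a chain of jumps between adjacent levels, all located at the same time instance (a sketch with our own function name; `levels` holds the sorted admissible values ν_1 < ... < ν_d):

```python
def full_representation(t, a, levels):
    """Expand a minimal representation (switching times t, level values a)
    into the full representation: every jump skipping over intermediate
    admissible levels becomes a chain of jumps between adjacent levels,
    with the corresponding switching times coinciding (cf. Lemma 3.1)."""
    idx = {v: i for i, v in enumerate(levels)}   # level value -> level index
    t_hat, a_hat = [], [a[0]]
    for tj, (lo, hi) in zip(t, zip(a, a[1:])):
        i, k = idx[lo], idx[hi]
        step = 1 if k > i else -1
        for m in range(i + step, k + step, step):
            t_hat.append(tj)                     # repeated switching time
            a_hat.append(levels[m])              # adjacent level reached
    return t_hat, a_hat
```

For the example below with levels 0, 1, 2 and a single jump from 0 to 2 at t = 1, this yields t̂_1 = t̂_2 = 1 and â = (0, 1, 2).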
Using the full representation of a feasible function, the following result can be proved.
Let w be a feasible point of (P) with TV(w) ≤ TV(u) and ‖w − u‖_{L^1(0,T)} ≤ ε. Thus, where λ is the Lebesgue measure. Since u and w are piecewise constant, there exists a nonempty interval (α̂_j, β̂_j) ⊂ [t̂_j, t̂_{j+1}] for every j ∈ {0, ..., n − 1} with t̂_j ≠ t̂_{j+1} on which w = u = â_j. The same is true when considering the minimal representation v_{t,a} of u with t ∈ R^{n−1}, a ∈ R^n, where we get the existence of such an interval in [t_j, t_{j+1}] for every j ∈ {0, ..., n − 1} on which w = u = a_j.
Let w = v_{s,ã} be the minimal representation of w with s ∈ R^{m−1}, ã = (ã_1, ..., ã_m). Since there is an open subinterval (α_j, β_j) of [t_j, t_{j+1}] with w = a_j, we can define the midpoint t̃_j of this interval for every j ∈ {0, ..., n − 1}. By defining ϕ ∈ C^1_c(t̃_j, t̃_{j+1}) as a continuous function with ϕ(t) = −sgn(a_{j+1} − a_j) for t ∈ (β_j, α_{j+1}), we can see that Then, using Lemma 2.3, it follows that Thus, equality holds. In particular, we have implying that w can only ascend or descend from a_j to a_{j+1} in (t̃_j, t̃_{j+1}). Translating this behaviour into the full representation, we see that for every j ∈ {0, ..., n − 1} with t̂_j ≠ t̂_{j+1}, w has to switch to every value between â_j and â_{j+1} exactly once in (α̂_j, β̂_{j+1}). We conclude that the full representation of w is given by v_{ŝ,â} for some ŝ ∈ R^{n−1}. Now, observe that with τ_j = ŝ_j − t̂_j, µ_j = â_{j+1} − â_j and Note that, at every t ∈ (0, T), all non-vanishing addends on the right-hand side of (3.1) share the same sign. Thus, from which, using the equivalence of all norms on R^{n−1}, the statement follows.

Now, we want to derive local optimality conditions for (P) via a reformulation as a switching point optimization problem similar to Leyffer, Manns, 2021, Section 4.2. Given n ∈ N and a ∈ R^n, we consider the problem Note that (ST(n, a)) depends on the chosen values of n ∈ N and a ∈ R^n. We mention that we also utilize (ST(n, â)), where we use the data (n, â) from the full representation of u. The main advantage of using the full representation is the upcoming theorem showing that local optimality of u for (P) is equivalent to local optimality of t̂ for (ST(n, â)).
Theorem 3.5. Let u ∈ BV(0, T) be feasible for (P) and consider the data (n, â, t̂) of its full representation. Then, u is locally optimal for (P) in L^1(0, T) if and only if t̂ is locally optimal for (ST(n, â)). Moreover, u satisfies a local quadratic growth condition for (P) in L^1(0, T) if and only if a local quadratic growth condition is valid for (ST(n, â)) at t̂. To be precise, this means the existence of constants ε, η > 0 with where is the feasible set of (ST(n, â)).
Note that the equivalence of the local optimalities will, in general, not hold if we use the minimal representation.

Local optimality conditions for (P)
In this section, we derive optimality conditions for (P) via the (equivalent) problem (ST(n, â)). To this end, we are going to discuss optimality conditions for the problem (ST(n, a)), and these findings will also be applied to (ST(n, â)). Since (ST(n, a)) is a standard finite-dimensional optimization problem, optimality conditions involving first- and second-order derivatives of the objective of (ST(n, a)) (w.r.t. t) can be formulated. Thus, we are going to investigate these derivatives.
In the upcoming theorem, we need some regularity of F. First, we assume that This yields the second-order Taylor expansion see Cartan, 1967, Theorem 5.6.3. Here, F′(u) and F″(u) are the Fréchet derivatives of first and second order at u, respectively. We investigate the structure of the derivatives. The first-order derivative F′(u) belongs to the dual space of L^1(0, T), which will be identified with L^∞(0, T). Thus, F′(u) is identified with a function ∇F(u) ∈ L^∞(0, T), and we will pose regularity assumptions on this function. Similarly, F″(u) is a continuous bilinear form on L^1(0, T). It is well known that continuous bilinear forms on L^1(0, T) can be identified with functions from L^∞((0, T)^2). In fact, this follows from the (isometric) identifications see Defant, Floret, 1992, Sections 3 and 7 for the results and for the notation. Thus, we will identify F″(u) with a function ∇^2 F(u) from L^∞((0, T)^2) and the evaluation (given by the above identifications) is As for ∇F(u) : (0, T) → R, we are going to postulate regularity assumptions on the function ∇^2 F(u) : (0, T)^2 → R. Finally, we mention the symmetry of F″(u), see Cartan, 1967, Theorem 5.

Theorem 3.6. We consider fixed n ∈ N, a ∈ R^n. Let the vector t ∈ R^{n−1} be feasible for (ST(n, a)) and let τ ∈ R^{n−1} be given such that τ_k ≤ τ_{k+1} whenever t_k = t_{k+1} for all k = 0, ..., n, with the conventions t_0 = 0, t_n = T and τ_0 = τ_n = 0. Then, t + τ is feasible for (ST(n, a)) whenever τ is small enough. Under the regularity assumptions that F : L^1(0, T) → R is twice Fréchet differentiable with ∇F(v_{t,a}) ∈ C^1([0, T]) and ∇^2 F(v_{t,a}) ∈ C([0, T]^2), we have the expansion Here, µ_j = a_{j+1} − a_j is the jump height at t_j.
Proof. The feasibility of t + τ for τ small enough is clear. For brevity, we write v_t and v_{t+τ} instead of v_{t,a} and v_{t+τ,a}, respectively. By definition of v_{t+τ} and v_t, we have Since F is assumed to be twice Fréchet differentiable on L^1(0, T), we get the expansion We study the terms on the right-hand side of the expansion by using the above representation of v_{t+τ} − v_t. First, we have Similarly, where we used the continuity of the function ∇^2 F(v_t). This shows the claim.
We note that the first-order part of the expansion can be obtained by only assuming first-order Fréchet differentiability of F : L^1(0, T) → R at v_{t,a} and continuity of ∇F(v_{t,a}) : [0, T] → R.
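From the perturbation identity v_{t+τ,a} − v_{t,a} = −Σ_j µ_j χ_{(t_j, t_j+τ_j)} (for τ_j > 0), the expansion of Theorem 3.6 plausibly takes the following form; this is a hedged reconstruction under the stated regularity assumptions, not a verbatim quotation of the theorem:

```latex
F(v_{t+\tau,a})
  = F(v_{t,a})
  - \sum_{j=1}^{n-1} \mu_j \, \nabla F(v_{t,a})(t_j)\, \tau_j
  - \frac{1}{2} \sum_{j=1}^{n-1} \mu_j \, \bigl(\nabla F(v_{t,a})\bigr)'(t_j)\, \tau_j^2
  + \frac{1}{2} \sum_{i,j=1}^{n-1} \mu_i \mu_j \, \nabla^2 F(v_{t,a})(t_i, t_j)\, \tau_i \tau_j
  + o(|\tau|^2).
```

The first-order term is consistent with the stationarity condition ∇F(u)(t_j) = 0 appearing in Lemma 3.7, and the two quadratic terms correspond to the contributions of (∇F(v_{t,a}))′(t_j) and ∇^2 F(v_{t,a})(t_i, t_j) used in the second-order conditions.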
Lemma 3.7. We consider fixed n ∈ N, a ∈ R^n. Let the vector t ∈ R^{n−1} be feasible for (ST(n, a)) with t_0 = 0 < t_1 and t_{n−1} < t_n = T. We again use the jump heights µ_j := a_{j+1} − a_j and define We assume µ_j ≠ 0 for all j = 1, ..., n − 1 and we suppose that all jumps at t_i go in the same direction, i.e., sgn(µ_i) = sgn(µ_j) for all i, j ∈ {1, ..., n − 1} with t_i = t_j. Further, we assume that F satisfies the regularity assumptions of Theorem 3.6. If t is a local minimizer of (ST(n, a)), then, for all j = 1, ..., n − 1: On the other hand, if is satisfied, then t is a local minimizer of (ST(n, a)) and a quadratic growth condition is satisfied.
The assumption µ_j ≠ 0 means that there is actually a jump at t = t_j, and the second assumption on µ corresponds to Lemma 3.1 (iv).
Proof. It is straightforward to verify that (ST(n, a)) satisfies the linear independence constraint qualification. This implies that T coincides with the tangent cone of the feasible set at the point t, see Nocedal, Wright, 2006, Lemma 12.2. Next, we are going to employ optimality conditions of first and second order. Note that there is a slight difficulty, since the objective of (ST(n, a)) is only defined on the feasible set, which is a closed set. However, we have proven a Taylor-like second-order expansion in Theorem 3.6. By inspecting the proofs of Nocedal, Wright, 2006, Theorems 12.3, 12.5 and 12.6, we see that this is enough in order to obtain optimality conditions.
To prove the necessary conditions, we assume that t is locally optimal. The first-order optimality condition (Nocedal, Wright, 2006, Theorem 12.3) reads For any j ∈ {1, ..., n}, there exist i, k ∈ {1, ..., n} with i ≤ j ≤ k, Then, the unit vectors −e_i and e_k belong to T and this gives (3.4a) due to sgn(µ_i) = sgn(µ_k). Since the derivative of the objective is zero, the critical cone used for the second-order conditions coincides with the tangent cone T and the Lagrange multipliers are zero. The second-order necessary condition of Nocedal, Wright, 2006, Theorem 12.5 delivers (3.4b).
Remark 3.8. If u ∈ BV(0, T) is feasible for (P) and has a switch across more than one level, i.e., if it switches from ν_i to ν_j with |i − j| > 1, then the minimal representation (t, a) and the full representation (t̂, â) deliver two different instances (ST(n, a)) and (ST(n, â)).
It is easy to check that the first-order part of Lemma 3.7 gives the same conditions, namely ∇F(u)(t) = 0 for all switching times t ∈ (0, T). By means of an example, we verify that the second-order conditions differ.
We consider a setting in which u has a jump from ν_1 to ν_3 at t = 1. The minimal representation of u is given by Consequently, T = R and the condition (3.4b) reads On the other hand, the full representation of u is given by n = 3, â_1 = 0, â_2 = 1, â_3 = 2, t̂_1 = t̂_2 = 1.
It can be checked that the second-order conditions obtained via the full representation of Lemma 3.1 are always stronger than (or equivalent to) the second-order conditions via the minimal representation (Leyffer, Manns, 2021, Corollary 4.4). This is also expected if we compare Theorem 3.5 with the corresponding result Leyffer, Manns, 2021, Theorem 4.14 (3).
We generalize the findings of this example.
Lemma 3.9. Let u ∈ BV(0, T) be feasible for (P) and denote by (n, a, t) and (n̂, â, t̂) the minimal and the full representation of u, respectively. We assume that F satisfies the regularity assumptions from Theorem 3.6. Further, we define the symmetric matrices where µ_j = a_{j+1} − a_j and µ̂_j = â_{j+1} − â_j. Further, we define the cone Proof. For j ∈ {1, ..., n − 1}, we set I_j := {i ∈ {1, ..., n̂ − 1} | t̂_i = t_j}. Note that the sets I_j form a partition of {1, ..., n̂ − 1} and Σ_{i∈I_j} µ̂_i = µ_j. Further, I_j is a singleton if and only if j ∉ J_+ ∪ J_-. Now, let τ ∈ R^{n−1} and τ̂ ∈ R^{n̂−1} be given such that If I_j is a singleton, the last parenthesis vanishes. Otherwise, where we used the convexity of s ↦ s^2, µ_j = Σ_{i∈I_j} µ̂_i and that all µ̂_i, i ∈ I_j, possess the same sign as µ_j. Thus, τ̂^⊤ F̂ τ̂ = where σ_j = Σ_{i∈I_j} µ̂_i τ̂_i^2 − µ_j τ_j^2. Note that ±σ_j ≥ 0 for all j ∈ J_±.
In order to obtain the sign conditions on (∇F(u))′(t_j) for j ∈ J_+ ∪ J_-, it is enough to realize that we can choose τ̂ ∈ T̂ such that the corresponding τ and σ satisfy τ = 0, σ_j = ±1 and σ_{j′} = 0 for j′ ∈ (J_+ ∪ J_-) \ {j}.
Note that the conditions involving the cone T are difficult to verify, since they require positive (semi)definiteness of a matrix over a cone, which is, in general, hard to check. In contrast, the equivalent conditions appearing on the right-hand sides are straightforward to verify.
By combining the above results, we obtain the main result of this section.
Theorem 3.10. Let u ∈ BV(0, T) be feasible for (P) and denote by (n, a, t) the minimal representation of u. We assume that F satisfies the regularity assumptions of Theorem 3.6. We define µ_j := a_{j+1} − a_j for j = 1, ..., n − 1. If u is a local minimizer of (P) in L^1(0, T), then the system (3.10) is satisfied. Moreover, u is a local minimizer of (P) satisfying a quadratic growth condition in L^1(0, T) if and only if (3.11) is satisfied. Note that (3.10d), (3.11d) describe the positive (semi)definiteness of the matrix Furthermore, we mention that (3.10) and (3.11) can be easily checked. Bear in mind that these conditions use the data from the minimal representation of u, but were derived using the full representation of u. Finally, we mention that the gap between the necessary and the sufficient conditions is as small as possible and, moreover, we are able to characterize local quadratic growth in L^1(0, T).
(i) A comparable second-order optimality condition (for bang-bang problems) in the multi-dimensional case was given in Christof, G. Wachsmuth, 2018, Theorem 6.12. Therein, the term |∇ϕ| corresponds to (∇F(v_{t,a}))′ above (since the adjoint state ϕ represents the derivative of the objective w.r.t. the control at the point of interest).
(ii) The results of Theorems 3.6 and 3.10 can be utilized to set up a Newton method for the solution of (ST(n, a)).
(iii) The second-order terms in Theorems 3.6 and 3.10 give rise to the following observations: • The convexity of F is not enough to guarantee that first-order stationary points are (locally) optimal. Indeed, the convexity of F has no influence on the signs of (∇F(u))′(t_j).
• Similarly, optimality of v_{t,a} alone does not determine the sign of (∇F(v_{t,a}))′(t_j) for j ∈ J_+ ∪ J_-, due to the coupling in (3.10d).

Non-local optimality conditions
In Theorem 3.10, we were able to give second-order optimality conditions with a minimal gap. This delivers a good understanding of local optimality for the problem (P).
In this section, we provide two examples of a non-local optimality condition.The first result shows that fast back-and-forth switches can be non-optimal in certain situations.
Note that Theorem 3.12 is concerned with the situation of u switching upwards on (t_2, t_3). A similar argument can be used in the case of a downward switch with u > ν_j on (t_1, t_2) ∪ (t_3, t_4).
Finally, we comment that u can still be locally optimal in the situation of Theorem 3.12.
To see this, note that ‖u − v‖_{L^1(0,T)} = (ν_j − ν_{j−1})(t_3 − t_2) and the radius of local optimality of u could be smaller than this constant.
The next result is concerned with the introduction of an additional switch.
Proof. This follows from arguments similar to those in the proof of Theorem 3.12, but now we have This result shows that it might be worthwhile to introduce jumps to larger/smaller values when ∇F(u) is negative/positive on intervals where u is constant. In contrast to Theorem 3.13, the region (t_2, t_3) on which u is modified cannot be too small, since otherwise the first term in (3.13) dominates.

Proximal-gradient method
In this section, we propose a proximal-gradient method to compute locally optimal points of (P). Originally, this method was proposed for non-differentiable convex optimization problems, but contributions like D. Wachsmuth, 2019 motivate its application to nonconvex problems, also in infinite dimensions.

Theoretical results
Since the proximal-gradient method applies to problems in Hilbert spaces, we will discuss (P) in the space L^2(0, T). Note that the admissible set U_ad is already a subset of L^2(0, T). We start by reformulating (P) as min_{u∈L^2(0,T)} F(u) + G(u), where the indicator function δ_{U_ad} : L^1(0, T) → {0, ∞} is defined by Now, the first addend F in the objective is smooth, whereas the second part G := β TV + δ_{U_ad} is non-smooth and non-convex. As in D. Wachsmuth, 2019, Algorithm 3.21, we use the decrease condition (4.1) with some parameter η > 0 in each step of the proximal-gradient method, see Algorithm 4.1: in step 1, u_{k+1} is computed as a solution of the subproblem (4.2) such that (4.1) is satisfied; in step 2, we set k ← k + 1 and go back to step 1. The existence of solutions u_{k+1} of problem (4.2) can be guaranteed similarly to the discussion after Theorem 2.2. However, since G fails to be convex, there might be multiple solutions. The next result gives some basic properties of sequences generated by Algorithm 4.1.
Theorem 4.1. Let (u_k)_{k∈N} be a sequence generated by Algorithm 4.1. Moreover, let ∇F be Lipschitz continuous from L^2(0, T) to L^2(0, T) with modulus L. Then, the following is true: (i) The sequences (u_k)_{k∈N} and (∇F(u_k))_{k∈N} are bounded in L^2(0, T).
(ii) The sequence (F (u k ) + G(u k )) k∈N is decreasing and converges.
Proof. We adapt the proof of D. Wachsmuth, 2019, Theorem 3.22 to our situation. Since (4.1) can be written as and F, G are bounded from below, (ii) follows. This implies that (G(u_k))_{k∈N} is also bounded. Furthermore, we have Moreover, using the Lipschitz continuity of ∇F, this implies the boundedness of (∇F(u_k))_{k∈N}, which completes the proof of (i).
Taking the sum of (4.1) over k = 1, ..., n for n ∈ N shows that Σ_{k∈N} ‖u_{k+1} − u_k‖²_{L^2(0,T)} converges. Thus, (iii) follows. To show (iv), we note that |u_{k+1} − u_k| does not take values in (0, 1) for all k ∈ N. This leads to the inequality Note that D. Wachsmuth, 2019, Theorem 3.13 states the validity of where u_{k+1} is the solution of (4.2). Hence, the choice τ_k ≥ 2η + L implies that the decrease condition (4.1) is satisfied. Nevertheless, for fast convergence of the algorithm, it is desirable to choose the inverse step length τ_k as small as possible. This can be realized by testing the values τ_0 θ^{−i} for i = 0, 1, 2, ..., with τ_0 > 0 and θ ∈ (0, 1), until the decrease condition is achieved. If τ_0 is already sufficient, it is reasonable to test the smaller values τ_0 θ^i for i = 1, 2, ... until (4.1) is no longer valid.
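One step of this backtracking procedure can be sketched as follows. We assume that the decrease condition (4.1) has the standard sufficient-decrease form F(u⁺) + G(u⁺) + η‖u⁺ − u‖² ≤ F(u) + G(u) and that a routine `prox(u, tau)` solving the subproblem (4.2) for a given inverse step length is available; both the exact form of (4.1) and all names are our assumptions:

```python
def proximal_gradient_step(F, G, prox, u, tau0=0.01, theta=0.5, eta=1e-6,
                           max_tries=60):
    """One backtracking step of the proximal-gradient method: test the inverse
    step lengths tau0 / theta**i, i = 0, 1, 2, ..., until the (assumed) decrease
    condition F(u+) + G(u+) + eta*||u+ - u||^2 <= F(u) + G(u) holds.
    `prox(u, tau)` is assumed to solve the subproblem (4.2); u is a list."""
    norm_sq = lambda w: sum(x * x for x in w)
    tau = tau0
    for _ in range(max_tries):
        u_new = prox(u, tau)
        diff = [a - b for a, b in zip(u_new, u)]
        if F(u_new) + G(u_new) + eta * norm_sq(diff) <= F(u) + G(u):
            return u_new, tau
        tau /= theta   # theta in (0, 1): enlarge tau until the condition holds
    raise RuntimeError("decrease condition not reached")
```

For τ ≥ 2η + L the condition is guaranteed, so the loop terminates after finitely many trials whenever the Lipschitz assumption on ∇F holds.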
Theorem 4.2. Let (u_k)_{k∈N} be a sequence generated by Algorithm 4.1. Further, let ∇F be Lipschitz continuous from L^2(0, T) to L^2(0, T) with modulus L. Then, the weak limit ū of the sequence (u_k)_{k∈N} in BV(0, T) solves the problem (4.3) for every accumulation point τ of (τ_k)_{k∈N}.
Proof. Since u_{k+1} solves (4.2), we have for all v ∈ L^2(0, T) ∩ U_ad = U_ad. Suppose that the subsequence (τ_{k_l}) converges towards τ. The above inequality yields Since v ∈ U_ad was arbitrary, this shows the claim.
Next, we are going to investigate optimality conditions for (4.3). Note that it is not possible to utilize the theory of Section 3, since u ↦ (τ/2)‖u − ū‖²_{L^2(0,T)} is not Fréchet differentiable in L^1(0, T). The following lemma shows that the optimality conditions of (4.3) are weaker than the first-order conditions from Theorem 3.10.

Lemma 4.3. Let ū ∈ BV(0, T) ∩ U_ad and τ ≥ 0 be given such that ū is a solution of (4.3). Further, suppose that ∇F(ū) ∈ C([0, T]). Then, for each switching time t ∈ (0, T), we have in which we use the data (t̂, â) from the full representation, i is the smallest index with t = t̂_i and j is the largest index with t = t̂_j.
Note that the condition (4.4) is weaker than the first-order condition (3.10a) in the case τ > 0. A similar observation has been made in D. Wachsmuth, 2019, Theorem 3.18.

Fast solution of discrete subproblems
The main work of Algorithm 4.1 consists in the solution of the subproblems (4.2), which can equivalently be written as At first glance, these subproblems seem to be very delicate, since we have the integer constraints, some nonlinearity and the coupling in time due to the TV term. However, we will see that it is possible to solve (the discretizations of) these problems very efficiently.
First, we want to restate (4.5). We define the gradient step v_k := u_k − τ_k^{−1} ∇F(u_k). Completing the square, we can use this to rewrite the objective of (4.5). By further omitting the constant terms and by dropping the index k of v_k and τ_k, (4.5) can be rephrased as Note that the solution of (4.6) corresponds to the computation of the proximal point mapping of the non-convex functional G = β TV + δ_{U_ad}.
In order to discretize (4.6), we partition [0, T] via the grid 0 = t_0 < t_1 < ... < t_n = T. For simplicity of the presentation, we assume that we have an equidistant mesh size Δt := T/n, but the following can easily be adapted to non-equidistant mesh sizes. In accordance with this mesh, we discretize the function u as a piecewise constant function, i.e., u = Σ_{j=1}^n u_j χ_{(t_{j−1}, t_j)} with u_j ∈ {ν_1, ..., ν_d}, j = 1, ..., n. For the discretization of v, we choose the mean values v_j = (Δt)^{−1} ∫_{t_{j−1}}^{t_j} v dt. Thus, a discretization of (4.6) is given by or, equivalently, Now, we want to employ the Bellman principle on problem (4.8), stating that, independently of the initial decision, the remaining decisions of an optimal solution have to constitute an optimal policy with regard to the state resulting from the first decision. In this sense, we define a value function (represented by the matrix Φ ∈ R^{d×n}) giving the optimal value of (4.8) restricted to an interval (t_{ι−1}, T) given the choice u_ι = ν_l at t_{ι−1}. That is, we define for all l = 1, ..., d, ι = 1, ..., n. It is easy to see that this gives which is a terminal value for the value function. In order to compute Φ_{l,ι} for ι < n, we have to minimize w.r.t. κ_{ι+1}, ..., κ_n ∈ {1, ..., d}. The first bracket is independent of κ_{ι+2}, ..., κ_n; hence, these values minimize the second bracket and the corresponding minimal value is Φ_{κ_{ι+1},ι+1}. Thus, for all 1 ≤ l ≤ d and 1 ≤ ι < n, (4.9) can be rephrased as Finally, the solution of (4.8) can be found by calculating Φ_{l,1} for every l ∈ {1, ..., d} and comparing these values. As motivated before, this can be achieved by computing Φ_{l,ι} for ι = n, ..., 1 and every l ∈ {1, ..., d}, using (4.10) in the first step (which, in our case, is the last time step) and (4.11) for the following steps. The corresponding minimizer κ_{ι+1} has to be saved for every ι = n − 1, ..., 1 in order to reconstruct the solution once the best initial choice l ∈ {1, ..., d} minimizing Φ_{l,1} has been found. Therefore, we save these values in a matrix U ∈ R^{d×(n−1)} defined by
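The backward recursion (4.10)-(4.11) can be sketched as follows for a subproblem of the generic separable form min Σ_j c_j(u_j) + β Σ_j |u_{j+1} − u_j| with stage costs c_j; for (4.8) one would take, up to constants, c_j(ν_l) = (τ/2)(ν_l − v_j)² Δt (function and variable names are ours). The effort is O(n d²), i.e., polynomial in the number of intervals and of control levels:

```python
def solve_tv_subproblem(costs, levels, beta):
    """Bellman recursion for  min  sum_j costs[j][l_j] + beta * sum_j |u_{j+1} - u_j|
    over u_j = levels[l_j]; costs[j][l] is the stage cost of level l on interval j.
    Returns the optimal value and an optimal level sequence."""
    n, d = len(costs), len(levels)
    Phi = [[0.0] * d for _ in range(n)]    # value function Phi[i][l]
    U = [[0] * d for _ in range(n - 1)]    # optimal successor level indices
    for l in range(d):                     # terminal values, cf. (4.10)
        Phi[n - 1][l] = costs[n - 1][l]
    for i in range(n - 2, -1, -1):         # backward recursion, cf. (4.11)
        for l in range(d):
            best, arg = min(
                (Phi[i + 1][k] + beta * abs(levels[k] - levels[l]), k)
                for k in range(d))
            Phi[i][l] = costs[i][l] + best
            U[i][l] = arg
    l0 = min(range(d), key=lambda l: Phi[0][l])   # best initial level
    seq = [l0]
    for i in range(n - 1):                 # forward pass through U
        seq.append(U[i][seq[-1]])
    return Phi[0][l0], [levels[l] for l in seq]
```

For β = 0 the recursion decouples and simply picks the cheapest level on every interval, while a large β enforces a constant control, mirroring the role of the TV penalty.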

Trust-region algorithm and efficient computation of corresponding subproblems
Similar to Leyffer, Manns, 2021, Section 3.1, locally optimal points of (P) can be calculated using a trust-region algorithm in which the objective is partially linearized around a given feasible point. When employing such an algorithm, one has to solve subproblems of the form Minimize (g, with a given function v ∈ U_ad and g = ∇F(v). In Leyffer, Manns, 2021, this was done by constructing a mixed-integer linear program. For a fine discretization, such an approach may lead to long computing times, which is why we are interested in applying the Bellman principle in a similar manner as in Section 4 to efficiently compute discrete solutions of (TR). In contrast to the method for the proximal-gradient subproblems, it is not possible to adapt the above procedure to general non-equidistant meshes, since the definition of B depends on the uniform mesh size Δt. However, in the important case that all occurring interval lengths t_j − t_{j−1} are integer multiples of a minimal length, it is possible to transfer the ideas.
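Assuming the trust-region constraint is posed in L¹, i.e., ‖u − v‖_{L¹(0,T)} ≤ Δ, the Bellman recursion can be augmented by an integer budget state: on an equidistant mesh with integer levels and level-valued v, the quantity Σ_j |u_j − v_j| Δt only takes integer multiples of Δt. The following is a sketch under these assumptions (the exact objective and all names are ours, not the paper's formulation):

```python
def solve_tr_subproblem(g, v, levels, beta, dt, radius):
    """DP sketch for a linearized trust-region subproblem:
        min  sum_j g[j]*(u_j - v[j])*dt + beta * sum_j |u_{j+1} - u_j|
        s.t. sum_j |u_j - v[j]|*dt <= radius,  u_j in levels.
    State: (interval, level, remaining integer budget).  Assumes integer
    levels, level-valued v and an equidistant mesh size dt."""
    n, d = len(g), len(levels)
    B = int(round(radius / dt))            # budget in units of dt
    INF = float("inf")
    # nxt[l][b]: optimal cost of intervals i..n-1 with u_i = levels[l]
    # and at most b budget units available; initialized at i = n-1
    nxt = [[INF] * (B + 1) for _ in range(d)]
    for l in range(d):
        c = abs(levels[l] - v[n - 1])      # budget used by the last interval
        for b in range(c, B + 1):
            nxt[l][b] = g[n - 1] * (levels[l] - v[n - 1]) * dt
    for i in range(n - 2, -1, -1):         # backward recursion over intervals
        cur = [[INF] * (B + 1) for _ in range(d)]
        for l in range(d):
            c = abs(levels[l] - v[i])
            stage = g[i] * (levels[l] - v[i]) * dt
            for b in range(c, B + 1):
                best = min(beta * abs(levels[k] - levels[l]) + nxt[k][b - c]
                           for k in range(d))
                if best < INF:
                    cur[l][b] = stage + best
        nxt = cur
    return min(nxt[l][B] for l in range(d))
```

The effort is O(n d² B) with B = Δ/Δt, which is polynomial in the number of intervals, levels and budget units.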

Numerical examples
To study the properties and quality of the proximal-gradient (PG) and trust-region (TR) algorithms using the Bellman principle, we consider a Lotka-Volterra fishing problem motivated by Sager, 2012, Chapter 4, as well as a signal reconstruction problem involving a convolution investigated in Leyffer, Manns, 2021.
The problems will be discretized using a grid with n equidistant grid points, where we test different values of n ranging from 256 to 4096. For (PG), we choose the algorithmic parameters η = 10^{−6}, θ = 1/2 and τ_0 = 0.01, while (TR) is initiated with an initial trust-region radius of Δ_0 = 0.4 for the Lotka-Volterra problem and Δ_0 = 0.125 for the signal reconstruction problem. The algorithms are implemented in Julia Version 1.6.3 and all results are computed using an Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz on a Linux OS.
We can write (LV) in the form of (P) by defining an operator S : L^2(0, T) → W^{1,1}(0, T; R^2) mapping a function u ∈ L^2(0, T) to the unique solution of the ordinary differential equation (ODE) in (LV). Thus, we have It can be verified that F is bounded from below by 0 and continuous on L^1(0, T) if S is continuous. The continuity of S together with its Fréchet differentiability can be shown by employing the implicit function theorem, see Appendix A. The derivative S′(u) can be characterized via the adjoint equation corresponding to the ODE in (LV). In the implementation, we solve all occurring ODEs using the explicit Euler method.
In Table 6.1, we can see that (TR) generally produces far better results than (PG) with comparable computing times. This may be due to the fact that (PG) is not well suited for non-convex optimization problems. Indeed, in more than 50% of all cases for every grid size, the solution generated by (PG) is zero in every grid point after 2 iterations of the outer loop, which can be observed by inspecting the distributions of the objective values in Figure 6.1 and the last column of Table 6.1.
The best results can be achieved by starting (TR) with a randomly generated start function u_0 on a grid of size n = 256 and using the corresponding solution as a start function on a refined grid (with halved time step size), which is repeated until arriving at n = 4096. Indeed, using this method and testing again 1000 randomly generated start functions, we arrive at an objective range of [0.6749, 0.6789] with an average computing time of 0.128 s.
Solutions as displayed in Figure 6.2 are competitive, since the optimal objective value for the relaxed problem (allowing u(t) ∈ [0, 1]) without the total variation term (i.e., β = 0) is given by 0.67204, cf. Sager, 2012, Chapter 4.1. Note that ∇F(u) is equal to or close to zero whenever u switches.

Signal reconstruction problem
To compare our results with the SLIP method derived in Leyffer, Manns, 2021, we consider the problem where Ku := k ∗ u for the convolution kernel Furthermore, we use the data ω_0 = π, t_0 = −1, t_f = 1 as well as f(t) := (2/5) cos(2πt). In Leyffer, Manns, 2021, Proposition 5.1, it is shown that F is continuously differentiable, where K* denotes the adjoint operator of K. Since the objective is bounded from below by zero, the problem meets our assumptions.
As described before, the problem will be discretized using a grid {t_0, ..., t_n} with the equidistant mesh size Δt := (t_f − t_0)/n and setting u(t) := Σ_{j=1}^n u_j χ_{(t_{j−1}, t_j)}(t). We further introduce the vectors u = (u_1, ..., u_n)^⊤ and f = (f(t_0), f(t_1), ..., f(t_n))^⊤. In this scenario, the evaluation of the convolution Ku in a grid point t_i, i ∈ {0, ..., n}, can be calculated as a simple matrix-vector product: we can write (Ku)(t_i) = (Ku)_{i+1} for i = 0, ..., n with the matrix K = (k_{lj})_{(l,j)∈I}, I = {1, ..., n + 1} × {1, ..., n}, given by

Since proximal-gradient subproblems can be solved faster than trust-region subproblems, we tried to develop a mixed algorithm, where a trust-region step instead of a proximal-gradient step is performed whenever u_{k+1} = u_k. However, this did not yield satisfactory results.
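The assembly of such a matrix can be sketched as follows; the entries k_{lj} = ∫_{t_{j−1}}^{t_j} k(t_{l−1} − s) ds are approximated here by the midpoint rule, and the paper's specific kernel is replaced by a generic callable (all names are ours):

```python
import numpy as np

def convolution_matrix(kernel, grid):
    """Assemble K with (Ku)(t_i) = (K u)_{i+1} for piecewise constant u on
    the grid t_0 < ... < t_n: entry k_{lj} approximates the integral of
    kernel(t_{l-1} - s) over (t_{j-1}, t_j) by the midpoint rule.
    (A sketch: the specific kernel of the problem is omitted here.)"""
    t = np.asarray(grid, dtype=float)
    mids = 0.5 * (t[:-1] + t[1:])          # interval midpoints
    dt = np.diff(t)                        # interval lengths
    # rows: evaluation points t_0, ..., t_n; columns: intervals 1, ..., n
    return kernel(t[:, None] - mids[None, :]) * dt[None, :]
```

Once K is assembled, evaluating the discretized convolution for any candidate control is a single matrix-vector product, which is what makes the repeated objective evaluations in the subproblem solver cheap.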

Conclusion and outlook
We investigated first- and second-order optimality conditions for integer optimal control problems using a switching point reformulation. The essential tool to show these conditions was the full representation of a piecewise constant function, which allows only switches between adjacent control levels. Non-local optimality conditions involving back-and-forth switches were also derived.
Next, we showed convergence results for a proximal-gradient algorithm and used the Bellman principle to efficiently solve the corresponding subproblems. This method was also adapted to the subproblems of the trust-region method suggested in Leyffer, Manns, 2021.
Testing the algorithms on two numerical examples showed that the proximal-gradient algorithm is not able to produce satisfactory results, while the trust-region method gives a good solution in most cases. Given that the best solutions found for our problems still do not meet the necessary optimality conditions derived in Section 3.2, it may be advantageous to optimize the locations of the switching points of such a solution with second-order methods using the derivatives of Theorem 3.6, combined with the insertion and removal of switches by utilizing Theorems 3.12 and 3.13.
Furthermore, the runtime of the subproblem solver could be improved by adapting the ideas from Severitt, Manns, 2022. To be more precise, given a heuristic to estimate a lower bound for the cost of a path in U, it may be possible to reduce the number of calculations carried out.
In many applications, multiple decisions interact with a system simultaneously, motivating a generalization of the ideas presented in this paper to multidimensional control functions.

Figure 6.1: Distribution of 1000 objective values for (LV) calculated by (PG) (left) and (TR) (right) for different choices of n and random start functions u_0 ∈ U_ad.

Figure 6.3: Distribution of objective values for (SR) calculated by (PG) (left) and (TR) (right) for different choices of n and random start functions u_0 ∈ U_ad.

Table 6.2: Results of applying (PG) and (TR) 10^l times to (SR) with random start point u_0 for different grid sizes n, where l = 2 for every grid size when applying (PG), while l = 2 for n ∈ {256, 512, 1024}, l = 1 for n = 2048 and l = 0 for n = 4096 when applying (TR).