A projected primal-dual gradient optimal control method for deep reinforcement learning

In this contribution, we start from a policy-based Reinforcement Learning ansatz using neural networks. The underlying Markov Decision Process consists of a transition probability representing the dynamical system and a policy realized by a neural network mapping the current state to parameters of a distribution, from which the next control can be sampled. In this setting, the neural network is replaced by an ODE, based on a recently discussed interpretation of neural networks. The resulting infinite-dimensional optimization problem is transformed into an optimization problem similar to the well-known optimal control problems. Afterwards, the necessary optimality conditions are established and a new numerical algorithm is derived from them. The operating principle is shown with two examples. In the first, a moving point is steered through an obstacle course to a desired end position in a 2D plane. The second example shows the applicability to more complex problems: the aim is to control the fingertip of a human arm model with five degrees of freedom and 29 Hill muscle models to a desired end position.


Introduction
Techniques for solving optimal control problems are of interest to practitioners in many research areas because of the prevalence of these problems. For instance, such problems arise naturally if one aims to steer a car to a desired end position or searches for the optimal activations of the Hill muscles of a digital human model. This explains the long-standing history of contributions from mathematicians to this area. In particular, we highlight the development of Pontryagin's maximum principle (cf. [1]) in 1956. For a detailed overview of the history of optimal control, we refer the reader to [2].
While the importance of deep learning and its great potential became apparent over the last decade, new approaches based on deep learning that tackle optimal control type problems have emerged as well. The interest in Reinforcement Learning (see e.g. [3]) has grown rapidly in recent years because of its great potential to control many dynamical systems without knowing the exact equations of motion. This class of techniques, which focuses on closed-loop control problems, is based on Markov Decision Processes and is able to handle systems implemented on a computer as well as real-life systems that can only be observed. Despite the slightly different starting points of classical optimal control theory and Reinforcement Learning, strong connections between these topics are well known (see e.g. [4]). We are convinced that the discussion in this contribution extends the number of connections.
Reinforcement Learning approaches in which neural networks form the policy, also called Deep Reinforcement Learning, are a very common choice because of their flexibility and their well-established training algorithms. Successfully implemented examples are shown, for instance, in [5] and [6]. On the other hand, learning algorithms for neural networks are often criticized for being black-box methods. Although neural networks are motivated by the neurons of the human brain, their internal structure often does not admit an interpretation in the actual application. Thus, the interest in finding connections to well-established areas and new interpretations grows. For instance, the authors of [7] use an ODE interpretation of neural networks in order to discuss the notion of stability for neural networks. In [8], a classification problem is interpreted as an optimal control problem by using the same ODE interpretation.
The possibility to benefit from well-known research areas motivates us to combine the interpretation of the neural network as an ODE with Reinforcement Learning. In the Reinforcement Learning setting, we then derive an optimal control problem similar to the classical one. Besides the ODE interpretation, we focus on transferring necessary optimality conditions for optimal control problems into the Reinforcement Learning setting. As this is one of the main results in the area of optimal control, it may well open powerful new possibilities in the area of Reinforcement Learning.

Reinforcement learning

Markov decision process
We start our discussion by focusing on the underlying structure of Reinforcement Learning (RL) approaches [3]. The whole theory is based on the assumption of having a Markov Decision Process (MDP). Such an MDP consists of a state space S and a control space A. Elements of the former describe the state of an underlying system, such as the posture of a human arm or the position and velocity of a car. The control space contains admissible controls, which can influence these states, like muscle activations, the acceleration of a car or a steering angle. Furthermore, we assume to have an environment and a policy. The environment takes the role of the system which we aim to control: it receives the current control and generates the next state. This state is fed into the policy, which provides the next control. In other words, one can think of a controller in a closed control loop. In the Reinforcement Learning community, this controlling part is known as the agent. However, the MDP differs from a usual closed control loop in the realization of the environment and the policy. The environment is assumed to be given by a transition distribution. Furthermore, the idea of the policy is that it samples the next control from a distribution which depends on the current state. A discussion of defining characteristics of robust and stochastic optimization can be found in [9]. Since we aim to find an optimal policy, we need to specify what optimal means. For this, every state-control pair of a trajectory is rated by a reward function, and the expected sum over the rewards of one trajectory takes the role of an objective function.

Having introduced the basic ideas, we next give a mathematical introduction, which is essentially based on the Handbook of Markov Decision Processes by E.A. Feinberg and A. Shwartz [10]. We allow the state and control spaces to be continuous. This property is necessary since, most of the time, our applications are based on continuous spaces, like a human arm model actuated by muscles which receive activations between zero and one. Overall, we assume that the state space S and the control space A are measurable spaces endowed with corresponding σ-algebras S and A. The reward function r : S × A → R is assumed to be a measurable function on (S × A, S × A). In the following, we assume that the absolute value of the reward is bounded by a constant C_reward. From our point of view, this is not a very restrictive assumption for most practical application cases, since very undesirable (respectively desirable) state-control pairs can still be rated with a large negative (positive) reward. Only unbounded rewards, which most of the time appear solely for state-control pairs that are nonrepresentative and far from optimal, are no longer possible. Furthermore, we require a transition probability from (S × A, S × A) to (S, S). This transition probability is denoted by K(B, x, u) for (x, u) ∈ S × A and B ∈ S and is assumed to fulfill the following two properties. Firstly, K(·, x, u) is a probability measure on (S, S) for any (x, u) ∈ S × A. Secondly, K(B, ·, ·) is measurable on (S × A, S × A) for any B ∈ S. At this point, we stress that, despite the fact that we introduced the MDP for general measurable spaces, in this contribution we restrict ourselves to the state space R^{n_s} and the control space R^{n_a}, or at least subsets of these, which are suitable for most application cases. For these spaces, we can use the Borel σ-algebra.
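To make the sampling mechanism concrete, the following minimal Python sketch draws one trajectory from an MDP with continuous state and control spaces. The Gaussian transition and policy densities, the target position inside the reward and all function names are illustrative assumptions, not part of the formal definitions above.

```python
# Minimal sketch of sampling a trajectory tau_n = x_0 u_0 x_1 u_1 ... x_n.
# The concrete densities (Gaussian here) and all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def transition_sample(x, u):
    """Draw x_{k+1} ~ K(. , x_k, u_k); here a Gaussian around a simple drift."""
    return x + 0.1 * u + 0.01 * rng.standard_normal(x.shape)

def policy_sample(x, mu_fn, sigma=0.1):
    """Draw u_k ~ pi(. | x_k); here a Gaussian with mean mu_fn(x_k)."""
    return mu_fn(x) + sigma * rng.standard_normal(x.shape)

def reward(x, u):
    """Bounded reward r(x_k, u_k), e.g. negative distance to a target point."""
    return -np.linalg.norm(x - np.array([4.5, -2.5]))

def rollout(x0, mu_fn, n):
    """Generate one trajectory of length n and collect the per-step rewards."""
    traj, x, rewards = [x0], x0, []
    for _ in range(n):
        u = policy_sample(x, mu_fn)
        rewards.append(reward(x, u))
        x = transition_sample(x, u)
        traj += [u, x]
    return traj, rewards
```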
Additionally, we assume for the further discussion that the transition probability is given by a transition density p. Having this at hand, we can define histories, respectively trajectories, τ_n = x_0 u_0 x_1 u_1 x_2 u_2 . . . x_n, n ∈ N. The set H_n = S × (A × S)^n contains all possible trajectories of length n, and we also write H_n for the corresponding σ-algebra. For the sake of completeness, we introduce H_∞ = (S × A)^∞ as the set of all trajectories of infinite length τ = x_0 u_0 x_1 u_1 x_2 u_2 . . . . Then, the policy π is introduced as a transition density, from which the next control can be sampled. The policy is called Markov if this transition density does not depend on the whole history but only on the current state. This means that π is a transition density from (S, S) to (A, A). Finally, the Ionescu-Tulcea theorem [11] ensures the existence of a unique strategic measure P^π_{p_0}, which is a transition probability from (S, S) to (H_∞, H_∞) (respectively P^{π,n}_{p_0} from (S, S) to (H_n, H_n)) for a given initial density p_0 of the first state and any given policy π. From this, the expected value of a measurable function R : H_n → R can be written as an iterated integral over the state and control spaces. Since this expression is bulky and not very handy, we introduce a new notation representing parts of it: instead of writing many integral symbols, we write only one integral symbol with H_n in the index, representing the integral over the state and control spaces, and we replace dx_n . . . dx_2 du_1 dx_1 du_0 dx_0 by dτ_n.

Goal of reinforcement learning
Having introduced the MDP in Sect. 2.1, we have everything at hand to formulate the optimization problem that forms the starting point of Reinforcement Learning [3].
As mentioned, the reward function is defined in order to rate the current situation. For example, if we are interested in controlling a car to a desired end position as fast as possible, the reward function might be minus the distance between the current position of the car and the end position. In the end, we aim to maximize the summed-up reward of one trajectory. This so-called total reward is computed as R(τ_n) := Σ_{k=0}^{n} r(x_k, u_k), where τ_n = x_0 u_0 x_1 u_1 x_2 u_2 . . . x_n is a trajectory. The optimization problem then reads as the maximization of the expected total reward E[R(τ_n)] over all policies π. We formulate the optimization problem for trajectories with finite length n ∈ N and focus on application cases which have a certain finishing time. This is a necessary assumption for further considerations, where we need to define a function mapping from the space of trajectories.
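As the expected total reward cannot be evaluated exactly, it is typically estimated by a sample mean over generated trajectories. A minimal sketch, assuming lists of per-step rewards such as those returned by the rollout sketched above (helper names are illustrative):

```python
# Hedged sketch of a Monte Carlo estimate of the RL objective E[R(tau_n)].
import numpy as np

def total_reward(rewards):
    """Total reward R(tau_n) = sum_{k=0}^{n} r(x_k, u_k) of one trajectory."""
    return float(np.sum(rewards))

def objective_estimate(reward_lists):
    """Sample-mean estimate of the expected total reward over trajectories."""
    return float(np.mean([total_reward(r) for r in reward_lists]))
```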
Up to now, it is not clear what the optimal policy looks like, and this makes the optimization problem difficult. Thus, we focus on parameterized policies in this contribution.
To be more precise, we assume that the current state, which is fed into the policy, is passed through a neural network. The outputs of the neural network are the parameters of a Gaussian distribution, from which the next control can be sampled. This ansatz has already been considered and discussed, inter alia, in [3, 5]. This way, we no longer search for the optimal policy directly. Instead, we optimize the weights and biases (parameters θ) of the neural network in order to obtain an optimal policy. The policy defined by the parameters θ is denoted by π_θ, and the optimization problem can be written accordingly with π_θ in place of π. Techniques for solving such an optimization problem are essentially based on the idea of initializing a policy, using this policy in the system simulation in order to generate trajectories, and updating the policy based on this data afterwards. Here, one distinguishes between model-based and model-free techniques. Model-free methods use the observations directly to improve the policy. In the model-based case, the observations of simulations are used for system identification, while the identified system generates the trajectories for the policy update. These techniques are said to be more sample efficient than model-free techniques (see e.g. [12]). Furthermore, one can classify Reinforcement Learning techniques as policy-based or value-function-based. The latter make use of a value function, which is successively improved and defines the policy indirectly; this can be observed in the famous Q-learning technique [13]. In contrast, policy-based methods directly update the (parameterized) policy. The well-known REINFORCE algorithm [14], based on a simple gradient-ascent idea, and the approach introduced in this contribution belong to this class of techniques.
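For orientation, the gradient-ascent baseline mentioned above can be sketched as follows for a Gaussian policy whose mean is produced by a parameterized function. The helper names (`mu_fn`, `mu_jacobian`) and the plain-gradient formulation are assumptions made for illustration; this is a REINFORCE-style update, not the algorithm derived later in this paper.

```python
# Hedged sketch of a REINFORCE-style policy-gradient step [14] for a
# Gaussian policy with mean mu_fn(x, theta) and fixed standard deviation.
import numpy as np

def log_prob_grad_mu(u, mu, sigma):
    """d/d(mu) log N(u; mu, sigma^2 I) = (u - mu) / sigma^2."""
    return (u - mu) / sigma**2

def reinforce_update(theta, trajectories, mu_fn, mu_jacobian, sigma=0.1, lr=1e-2):
    """One gradient-ascent step on E[R(tau_n)] from sampled trajectories.

    trajectories: list of (states, controls, rewards) triples.
    mu_fn(x, theta): policy mean; mu_jacobian(x, theta): d mu / d theta.
    """
    grad = np.zeros_like(theta)
    for states, controls, rewards in trajectories:
        total = np.sum(rewards)                      # total reward R(tau_n)
        for x, u in zip(states, controls):
            mu = mu_fn(x, theta)
            # chain rule: d log pi / d theta = (d mu / d theta)^T d log pi / d mu
            grad += mu_jacobian(x, theta).T @ log_prob_grad_mu(u, mu, sigma) * total
    grad /= len(trajectories)
    return theta + lr * grad                         # ascent on the expected reward
```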

Parameterized policy
Here, we take a closer look at the parameterized policy π_θ mentioned in the previous subsection. It is very common to build a policy out of a neural network and a distribution which receives its parameters from the neural network. For the further discussion, we define the mapping NN_θ : S → A representing the neural network for fixed weights and biases θ. It receives the current state x_k as input. Then, the activation functions, which are in our case hyperbolic tangents, and the affine linear functions are applied successively, as will be further explained in Sect. 2.4. The outputs μ^θ_k := NN_θ(x_k) are used as parameters of the distribution ρ(u_k|μ^θ_k), which forms the second component of the policy. From this distribution, the next control u_k can be sampled. For a later discussion, we introduce the function F_θ, which collects these evaluations and maps the states of a trajectory to the corresponding distribution parameters μ^{θ,n}. The newly introduced notation, i.e. the structure of the policy and the corresponding mathematical functions, is visualized in Fig. 1 (schematic representation of the policy). Before we go on and use these notations to rewrite the RL optimization problem into a new form, we introduce one additional definition: the density P(τ_n|μ^{θ,n}_{τ_n}) of a trajectory given the distribution parameters. Obviously, only the policy depends on μ^{θ,n}. This means that if the density ρ_θ(u_k|μ^θ_k) is differentiable with respect to its parameters μ^θ_k, the derivative of P(τ_n|μ^{θ,n}_{τ_n}) with respect to μ^{θ,n}_{τ_n} exists. Typically, ρ_θ(·|μ^θ_k) is the density function of a Gaussian distribution with mean value μ^θ_k and a suitable fixed variance σ_k. In this case, the differentiability is ensured. Thus, from now on we assume this differentiability, which we will need in later discussions.
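The following sketch illustrates the structure of the policy from Fig. 1: a small tanh network computes μ^θ_k = NN_θ(x_k), and the control u_k is drawn from a Gaussian with this mean. Layer shapes, the fixed standard deviation and all function names are illustrative assumptions.

```python
# Minimal sketch of the parameterized policy pi_theta: network mean + Gaussian.
import numpy as np

rng = np.random.default_rng(1)

def nn_theta(x, weights, biases):
    """NN_theta(x_k): alternate affine linear maps and tanh activations."""
    z = x
    for W, b in zip(weights, biases):
        z = np.tanh(W @ z + b)
    return z                                   # mu^theta_k = NN_theta(x_k)

def sample_control(x, weights, biases, sigma=0.1):
    """Draw u_k ~ rho(. | mu^theta_k) with mu^theta_k = NN_theta(x_k)."""
    mu = nn_theta(x, weights, biases)
    return mu + sigma * rng.standard_normal(mu.shape)

def gaussian_log_density(u, mu, sigma=0.1):
    """log rho(u_k | mu^theta_k), needed for derivatives w.r.t. the mean."""
    d = u - mu
    return -0.5 * (d @ d) / sigma**2 - 0.5 * len(u) * np.log(2 * np.pi * sigma**2)
```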

Neural network as an ODE
In this part, we take a closer look at continuous interpretations of neural networks, which we are going to use in order to motivate a new policy ansatz consisting of an ODE instead of a neural network. A classical neural network is defined by weight matrices W_i, biases b_i, where i denotes the layer, and an activation function σ. From a mathematical point of view, a feedforward network works by the computation rule z_{i+1} = σ(W_i z_i + b_i). Another type of network is the residual neural network. The difference is that the activation function together with the affine linear function, σ(W_i z_i + b_i), does not describe z_{i+1} but the residuum between z_i and z_{i+1}. In this case, the computation rule reads z_{i+1} = z_i + σ(W_i z_i + b_i). The authors of [7] make use of the similarities between this computation rule and an explicit Euler scheme with step size one for the ODE ż(s) = σ(W(s) z(s) + b(s)). Based on this connection between residual neural networks and ODEs, they discuss the notion of stability, which is defined for ODEs, in the neural network context. This motivates us to replace the neural network in the policy by this ODE. We will see in later sections that this leads us to a continuous optimal control problem. Keep in mind that the above-mentioned connection can only be deduced if the state and control dimensions are the same. We do not see this as a strong restriction, since we can, for instance, use an encoder and decoder realized with neural networks [15] to circumvent this assumption. For better readability, we derive the results in this contribution with residual neural networks. Note, however, that using the connection between ODEs and feedforward neural networks, which can be seen by replacing (10), or other types of neural networks leaves the following discussion unaffected. The replacement of the neural network by the corresponding ODE leads to a deviation in the generated actions. The size of this deviation depends on the solution strategy for the ODE. If the ODE is solved numerically by an explicit Euler method with step size one, the ODE-policy and the policy based on the neural network cannot be distinguished from each other. If the ODE is solved analytically, classical error estimation results for the explicit Euler method hold.
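The following small check illustrates the connection used above: one residual layer z_{i+1} = z_i + tanh(W_i z_i + b_i) coincides with one explicit Euler step of step size one for ż = tanh(W z + b). The concrete numbers are arbitrary.

```python
# One residual layer equals one explicit Euler step with step size h = 1.
import numpy as np

def residual_layer(z, W, b):
    """z_{i+1} = z_i + tanh(W_i z_i + b_i)."""
    return z + np.tanh(W @ z + b)

def euler_step(z, W, b, h=1.0):
    """One explicit Euler step for z' = tanh(W z + b) with step size h."""
    return z + h * np.tanh(W @ z + b)

z0 = np.array([0.3, -0.7])
W = np.array([[0.5, -0.2], [0.1, 0.4]])
b = np.array([0.05, -0.1])
assert np.allclose(residual_layer(z0, W, b), euler_step(z0, W, b, h=1.0))
```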
In order to derive the continuous optimal control problem and to discuss the benefits of this connection in the Reinforcement Learning case, we introduce a solution operator. This solution operator F : H^S_n × L^∞ → W^{1,∞} shall take the same role as F_θ in the neural network case. This means that the state-trajectory τ^x_n, whose components are the inputs for the neural network (see (6)), is now the initial value of the ODE which we solve with the solution operator. For the right-hand side, we define f componentwise for the trajectory entries, where θ(s) collects W(s) and b(s) and z_k are the entries of z that correspond to x_k of τ^x_n. Overall, the solution operator with z = F(z_0, θ) is defined such that the corresponding ODE initial-value problem is fulfilled. The first argument of the solution operator is the initial value. As we already mentioned, the initial value is the state-trajectory τ^x_n, and we need this solution operator for many different trajectories. The second argument is the control θ ∈ L^∞. The operator F is well-defined for a bounded θ, since for a fixed initial value the Picard-Lindelöf theorem (see e.g. [16]) guarantees the existence and uniqueness of a solution of the initial value problem. The necessary assumption of this theorem, that the function f satisfies a Lipschitz condition, is fulfilled since the activation function we consider is the hyperbolic tangent. Other common activation functions like the sigmoid or the ReLU function would also lead to a fulfilled Lipschitz condition.
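As a rough numerical sketch of the solution operator F, the following code integrates the ODE for a piecewise-constant parameterization of θ(s) with small explicit Euler steps. The piecewise-constant parameterization, the interval bounds and the function names are assumptions made for illustration.

```python
# Hedged sketch of z(s_f) = F(z_0, theta)(s_f) for piecewise-constant theta(s).
import numpy as np

def f_rhs(z, W, b):
    """Right-hand side f(z, theta(s)) = tanh(W(s) z + b(s))."""
    return np.tanh(W @ z + b)

def solve_F(z0, Ws, bs, s0=0.0, sf=10.0, substeps=10):
    """Integrate the ODE with explicit Euler steps and return z(s_f)."""
    z = np.asarray(z0, dtype=float)
    h = (sf - s0) / (len(Ws) * substeps)
    for W, b in zip(Ws, bs):          # loop over the pieces of theta(s)
        for _ in range(substeps):     # explicit Euler inside each piece
            z = z + h * f_rhs(z, W, b)
    return z
```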

Reinforcement learning and optimal control
In the previous subsections, we discussed the underlying optimization problem of RL, possible policies and solution strategies. At this point, we aim to establish a novel connection to optimal control problems with differential equations as constraints. We start by reformulating the optimization problem (5) based on the neural network policy introduced first. Afterwards, we replace the parts where the neural network appears in the policy and obtain an infinite-dimensional optimization problem. We expect that this connection opens up great possibilities to learn from an old and well-established field. We start with the optimization problem in (5) and plug in our definition of the expected value. First, we replace the maximization problem by a minimization problem by introducing a minus sign in the objective function. Then, we split the policy as discussed in Sect. 2.3. Thus, the objective function becomes a function depending on μ^{θ,n}, for which additional constraints need to be added. Now, we use that the neural network can be interpreted as an ODE (see Sect. 2.4). This means that, for each state-trajectory τ^x_n, an ODE with right-hand side as in (12) is solved in order to obtain the corresponding μ^{θ,n}_{τ_n}. We make use of the solution operator of this differential equation, which we introduced in Sect. 2.4: in our optimization problem, F replaces F_θ. This induces the need for the pseudo time s. While the optimization problem (16)-(17) has finitely many parameters, now θ ∈ L^∞ depends on s, and we no longer have a finite-dimensional optimization problem. Overall, we obtain the problem (20)-(23) with the constraint z^{θ,n}_{τ_n}(s) = F(τ^x_n, θ)(s) for all τ^x_n ∈ H^S_n.
The constraint (23) represents our assumption that θ is bounded, which we need for the existence of the solution operator. Typically, this will be a box constraint. We conclude this section by summarizing the assumptions we have needed so far and which we assume to hold in the remainder of this paper:
1. The state and action spaces are (subsets of) Euclidean spaces,
2. the considered time horizon is finite,
3. the reward is bounded by a constant C_reward,
4. the control is bounded, θ(s) ∈ U a.e. on [s_0, s_f],
5. the activation functions are sufficiently smooth (e.g. tanh, sigmoid), and
6. the distribution embedded into the policy is differentiable w.r.t. its parameters (e.g. the Gaussian distribution is differentiable w.r.t. its mean value).

Optimal control problems
At this point, we will discuss the connection to well-established optimal control theory.
To this end, we make a short excursus to the optimal control formulation and the form of the optimization problem typically considered in this case. The information about the goal is mainly encoded in the objective function J : R^{n_z} → R. It is defined by a mapping Φ : R^{n_z} → R describing the costs at the end position and a mapping φ : R × R^{n_z} × R^{n_θ} → R representing the costs at each time step. The natural numbers n_z and n_θ describe the dimensions of the state space and the control space. The constraint Ψ : R^{n_z} → R^{n_Ψ} for some n_Ψ ∈ N can be used to define a desired end state of the underlying system. Overall, we end up with an optimization problem of the form (24)-(27). Here, F* : L^∞ → W^{1,∞} denotes the solution operator of an initial value problem ż = f*(z, θ), z(s_0) = z_0, where W^{1,∞} represents the set of absolutely continuous functions z : [s_0, s_f] → R^{n_z}. The problem statement of this infinite-dimensional optimization problem and discretization methods are discussed in [17]. There, one can also find a derivation and discussion of a local minimum principle, which we state here, since the conditions resulting from this minimum principle can be used to obtain candidates for the optimal solution; the quality of those candidates can then be verified using the original optimization problem. Using the Hamiltonian function H : R × R^{n_z} × R^{n_θ} × R^{n_z} × R → R with H(s, z, θ, λ, l_0) := l_0 φ(s, z, θ) + λ^T f(z, θ), these conditions take the following form.

Theorem 1 (Necessary optimality conditions, [17]) Let U ⊆ R^{n_θ} be closed and convex and let θ̂ ∈ L^∞ be a local minimum of problem (24)-(27). Then, there exist multipliers l_0 ∈ R, σ ∈ R^m and λ ∈ W^{1,∞} satisfying the adjoint equation, the transversality condition and the local minimum principle.
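For orientation, these conditions typically take the following standard textbook form; the sketch below is stated under the usual regularity assumptions and should be compared with the precise statement in [17]:

```latex
% Sketch of the standard local minimum principle for (24)-(27), cf. [17]:
\begin{align*}
  \dot{\lambda}(s) &= -\nabla_z H\bigl(s,\hat z(s),\hat\theta(s),\lambda(s),l_0\bigr)
      && \text{a.e. on } [s_0,s_f] \quad \text{(adjoint equation)},\\
  \lambda(s_f) &= l_0\,\nabla\Phi\bigl(\hat z(s_f)\bigr)
      + \Psi'\bigl(\hat z(s_f)\bigr)^{\top}\sigma
      && \text{(transversality condition)},\\
  \nabla_\theta H\bigl(s,\hat z(s),\hat\theta(s),\lambda(s),l_0\bigr)^{\top}
      \bigl(\theta-\hat\theta(s)\bigr) &\ge 0
      && \forall\,\theta\in U,\ \text{a.e. on } [s_0,s_f],
\end{align*}
% together with l_0 >= 0 and (l_0, sigma, lambda) != 0.
```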
The two mentioned optimization problems ((20)-(23) and (24)-(27)) share many common properties. First, we remark that the objective function of the former problem is a special case of the classical objective function; to see this, one can define Φ(z(s_f)) as the negated expected total reward expressed in terms of the terminal ODE states. Furthermore, we remark that the constraint (26) does not appear in the Reinforcement Learning optimization problem. The biggest difference, and the reason why we cannot apply the results of the local minimum principle to this optimization problem directly, lies in the differential equations. In the classical case, we only have one differential equation with one fixed initial value. In the RL case, we have infinitely many differential equations, since one appears for each trajectory. Note that this is a consequence of the assumption that the state and control spaces are continuous sets. If this were not the case and the state and control spaces were finite sets, the integrals would vanish and be replaced by sums. In the constraints, we would have one ODE for each trajectory, but the number of trajectories would be finite and the ODEs could be collected in one ODE of huge dimension. In this case, the classical necessary optimality conditions could be applied, since the RL optimization problem would be a special case of the classical one. In our case, however, we need to spend additional effort to obtain similar results.

Derivation of the necessary optimality conditions
This section is devoted to the derivation of a type of necessary optimality conditions for the RL optimization problem (20)-(23).

Basic results from functional analysis
The proof of the necessary optimality conditions in the previous section is essentially based on the Fritz-John conditions for optimization problems in Banach spaces. Since we take inspiration from this proof for the necessary optimality conditions of the RL problem, we recall the Fritz-John conditions here. The starting point is an optimization problem on Banach spaces of the form (35)-(38). Here, X, Y and Z are Banach spaces and J : X → R, G : X → Y, H : X → Z. Furthermore, S ⊂ X is a closed and convex subset and K ⊂ Y is a closed and convex cone with vertex 0_Y. Neither S nor K is allowed to have an empty interior. While the theorem of Weierstraß ([17], p. 78) guarantees that a lower semi-continuous J attains its minimum on a compact set {x ∈ X | x ∈ S, G(x) ∈ K, H(x) = 0_Z}, the Fritz-John theorem provides optimality conditions (cf. [18, 19]).

Theorem 2 (Fritz-John conditions) Let J and G be Fréchet-differentiable and let H be continuously Fréchet-differentiable. Let x̂ be a local minimum of problem (35)-(38).
Assume that Im(H'(x̂)) ⊂ Z is not a proper dense subset. Then, there exist non-trivial multipliers (0, 0, 0) ≠ (l_0, λ*, μ*) ∈ R × Y* × Z* with l_0 ≥ 0, λ*(G(x̂)) = 0 and λ* ∈ K+ such that the variational inequality (39) holds. Here, a Banach space equipped with a star denotes its dual space. The dual cone K+ := {x* ∈ Y* | x*(x) ≥ 0, ∀x ∈ K} represents the set of functionals of the dual space which map all elements of the cone K to nonnegative real numbers. This result will be the starting point for deriving the necessary optimality conditions.

Necessary optimality conditions
Finally, in this subsection, we derive the necessary optimality conditions. The most challenging part is to show that the assumptions of the Fritz-John conditions are fulfilled. In particular, the Fréchet-differentiability of our objective function needs to be shown. Because of the chain rule, we also need the Fréchet-derivative of the newly introduced solution operator F with respect to the optimization variable θ. This is the next step. Fortunately, our solution operator is a special case of the one discussed in [20]. There, the author shows the Fréchet-differentiability of a solution operator for an initial value problem and determines the derivative. Despite the fact that our solution operator additionally depends on the initial value, the proof can be applied here, since we can consider the initial value as varying but fixed.

Lemma 1 ([20], pp. 56-59) For a fixed z_0 ∈ H^S_n and a sufficiently smooth right-hand side f, the Fréchet-derivative of F(z_0, ·) with respect to θ in a direction δθ is given by δz_{z_0}, where δz_{z_0} is given by the linearized ODE δż_{z_0}(s) = f_z(z(s), θ(s)) δz_{z_0}(s) + f_θ(z(s), θ(s)) δθ(s) with initial value δz_{z_0}(s_0) = 0.

The smoothness assumption in Lemma 1 allows, for instance, the activation functions hyperbolic tangent and sigmoid, but prohibits using the ReLU function. Nevertheless, in the case of a ReLU function, the following discussion can be carried out for a smooth approximation of it.
We have seen in Theorem 1 that the necessary optimality conditions are described with the help of the adjoint variable λ, which itself needs to fulfill an ODE. In our case as well, we make use of an auxiliary definition in order to express the necessary optimality conditions. As in the case of F, we define a solution operator for an ODE and need to allow varying initial values. We introduce the operator G, which maps an end value, a state trajectory z and a control θ to the solution λ of the adjoint end-value problem with right-hand side -λ^T f_z(z, θ). Again, existence and uniqueness are guaranteed for bounded θ by the Picard-Lindelöf theorem (see e.g. [16]). To see this, one has to notice that -λ^T f_z(z, θ) satisfies the Lipschitz property w.r.t. λ, since it is linear in λ and f_z is bounded.
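A numerical counterpart of the operator G could look as follows: given an end value, a state trajectory and piecewise-constant parameters, the adjoint ODE is integrated backwards in the pseudo time s. The discretization and all function names are assumptions for illustration.

```python
# Hedged sketch of the adjoint solve lambda = G(lambda_f, z, theta).
import numpy as np

def f_z(z, W, b):
    """Jacobian of f(z, theta) = tanh(W z + b) with respect to z."""
    return (1.0 - np.tanh(W @ z + b) ** 2)[:, None] * W

def solve_G(lambda_f, zs, Ws, bs, s0=0.0, sf=10.0):
    """Integrate lambda'(s) = -f_z(z, theta)^T lambda backwards from lambda(s_f).

    zs[i], Ws[i], bs[i] are the state and parameters on the i-th piece of [s0, sf].
    """
    lam = np.asarray(lambda_f, dtype=float)
    h = (sf - s0) / len(Ws)
    for z, W, b in zip(reversed(zs), reversed(Ws), reversed(bs)):
        lam = lam + h * f_z(z, W, b).T @ lam   # explicit Euler step backwards in s
    return lam
```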
At this point, we have everything at hand to derive the necessary optimality conditions. We stress that parts and ideas of the proof of the classical necessary optimality conditions in Theorem 1 inspire the following proof. For example, we use the Fritz-John conditions in Theorem 2 as the starting point and continue by rewriting the inequality condition afterwards. Parts of these steps are motivated by [20] and [17]. Nevertheless, crucial parts of the following proof are new. The most important innovation is the derivation of the Fréchet-derivative of the objective function, which is needed in order to satisfy the assumptions of the Fritz-John conditions.

Theorem 3 Let U ⊆ R^{n_θ} for n_θ ∈ N be closed and convex and let θ̂ ∈ L^∞ be a local minimum of the optimal control problem (20)-(23). Furthermore, we assume that sup_θ |∇_z P(τ_n|z)|_{z=F(τ^x_n,θ)(s_f)} (δz_{τ^x_n}(s_f))| is Lebesgue integrable, where δz_{τ^x_n} denotes the Fréchet derivative of the solution operator from Lemma 1 with z_0 = τ^x_n. Then, the local minimum principle (42) holds, i.e. the expected value of λ(τ_n, θ̂)(s)^T f_θ(z(τ_n, θ̂)(s), θ̂(s)) applied to (θ - θ̂(s)) is greater than or equal to zero for all θ ∈ U and almost every s ∈ [s_0, s_f], with z(τ_n, θ̂) = F(τ^x_n, θ̂) and λ(τ_n, θ̂) = G(-∇_z log P(τ_n|z)^T |_{z=F(τ^x_n,θ̂)(s_f)}, F(τ^x_n, θ̂), θ̂).
Proof First, we take a closer look at the problem statement (35)-(38) and recognize that we have a special case of such a problem. We also have an objective function J : L^∞ → R, which we want to minimize. The functions G and H in (37)-(38) do not appear in our optimization problem, and since we do not have such functions in the constraints, we do not need a convex cone K. Overall, we have a special case of this problem statement and thus we can check the assumptions of Theorem 2. We notice that, if we can ensure that J is Fréchet-differentiable, the assumptions are satisfied. We claim that J'(θ̂)(δθ) = -∫_{H_n} ∇_z P(τ_n|z)|_{z=F(τ^x_n,θ̂)(s_f)} (δz_{τ^x_n}(s_f)) R(τ_n) dτ_n is the Fréchet-derivative of the objective function. We remark that it is linear and continuous with respect to δθ, since only δz_{τ^x_n}(s_f) depends on δθ and this is the Fréchet derivative from Lemma 1. It remains to show that the defining equation of a Fréchet derivative is fulfilled. Using the fact that ∫_{H_n} |P(τ_n|F(τ^x_n, θ)(s_f)) R(τ_n)| dτ_n ≤ (n + 1) C_reward, we can make use of the dominated convergence theorem. This provides that the whole expression (43) is an element of o(‖δθ‖). Overall, we have shown the Fréchet-differentiability and determined the derivative of the objective function. Now, all assumptions of the Fritz-John necessary optimality conditions are fulfilled and we can apply them to our optimization problem. From the left-hand side of the inequality (39), only the objective function remains. In the end, the Fréchet-derivative at θ̂ applied to a small change δθ := θ - θ̂, θ ∈ L^∞, needs to be greater than or equal to zero. We plug in the derivative which we determined at the beginning of this proof. Afterwards, we make use of the fact that the derivative of the logarithm w.r.t. its argument is one over the argument (see (49)-(50)), and finally the definition of the end value of λ(τ_n, θ̂) leads us to equation (51). Before we continue, we make two remarks. First, we take a closer look at the integration by parts in (52). As a second remark, we recap what we know from Lemma 1, namely (53). Using integration by parts and the property in (53) in order to rewrite (51) is motivated by the proof of the necessary conditions for an optimal control problem with a DDAE initial value problem in ([20], pp. 88-94). Thus, we plug (52) and (53) into (51). This, together with the ODE (40) defining λ, explains the following transformations, at the end of which we once more obtain an expected value. At this point, we are almost done. It remains to show that the claim of this theorem follows from what we have proved above.
In order to show this, inspired by ([17], p. 117), we introduce a control θ_ε ∈ L^∞ depending on ε ≥ 0, which obviously satisfies the inequality discussed so far in this proof. For a fixed s ∈ [s_0, s_f] and a fixed but arbitrary θ ∈ U, this control coincides with θ on a small interval of length ε around s and equals θ̂ elsewhere.
We will show that, by plugging θ_ε into the inequality from above, we can deduce that the inequality claimed in this theorem is satisfied as well. Before we can show this, we need to define two auxiliary functions h_1 and h_2. Note that both h_1 and h_2 are differentiable with respect to the pseudo time s, and it holds that h_1'(s) = λ^T_{τ_n}(s) f_θ(z_{τ_n}(s), θ̂(s)) and h_2'(s) = λ^T_{τ_n}(s) f_θ(z_{τ_n}(s), θ̂(s)) θ̂(s), respectively. Having these preparations at hand, we start with the inequality (58); the definition of θ_ε then simplifies this expression. Simple rearrangements leave us with an expression in which we can identify the difference quotients of the auxiliary functions. Since these functions are differentiable, the limits for ε going to zero exist. At this point, we stress that the limit for ε going to zero can be moved inside the expected value; the argument for this is again provided by the dominated convergence theorem and the assumed Lebesgue integrability. In the last step, we extract (θ - θ̂(s)) from the expected value. This is possible since θ and θ̂(s) are independent of the trajectories.
In the case that θ̂(s) lies in the interior of U ⊆ R^{n_θ}, the inequality in (42) becomes an equality. In the next step, we will use this newly achieved result in order to derive a new algorithm.

Algorithm based on necessary optimality conditions and its challenges
The necessary optimality conditions can be used to solve the optimization problem: one solves the optimality conditions and obtains potential candidates for an optimal solution. The conditions that need to be solved consist of three main parts: an initial value problem for an ordinary differential equation represented by the solution operator F, its adjoint end-value problem represented by G, and the minimum principle. In order to solve such conditions, it turns out to be worthwhile, at least in the classical theory, to initialize all unknown quantities and iterate through these conditions. This means that, starting with random initial parameters θ_0, we first solve the initial value problem and afterwards the adjoint problem. In the end, these trajectories are used to determine the left-hand side of the minimum principle, which turns out to be a good search direction in which we update the parameters.
However, we need to keep in mind that the expected value in our minimum principle cannot be computed exactly, because it is an expected value over infinitely many trajectories: we would need to solve the initial value problem and the end-value problem for infinitely many trajectories, which is not practicable. Instead, we make use of the procedure many model-free Reinforcement Learning approaches are based on: initialize the parameters, sample a finite batch of trajectories, and approximate the expected value by the corresponding sample mean.
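Putting the pieces together, the overall iteration can be sketched as the following projected-gradient loop over sampled trajectories. All helper functions are placeholders corresponding to the operators discussed above; the sketch is meant to convey the structure of the method, not to reproduce the authors' implementation verbatim.

```python
# Hedged sketch of the iteration through the necessary optimality conditions.
import numpy as np

def project_onto_box(theta, lower, upper):
    """Projection onto the box constraints representing theta(s) in U."""
    return np.clip(theta, lower, upper)

def training_loop(theta, sample_trajectories, solve_forward, solve_adjoint,
                  search_direction, lower, upper, lr=1e-2, iterations=100,
                  batch_size=10):
    """Sample trajectories, solve forward/adjoint ODEs, take projected steps."""
    for _ in range(iterations):
        batch = sample_trajectories(theta, batch_size)   # Monte Carlo sampling
        directions = []
        for tau in batch:
            z = solve_forward(tau, theta)                # z = F(tau_x, theta)
            lam = solve_adjoint(tau, z, theta)           # lambda = G(..., z, theta)
            directions.append(search_direction(tau, z, lam, theta))
        step = np.mean(directions, axis=0)               # sample mean of the expectation
        theta = project_onto_box(theta - lr * step, lower, upper)
    return theta
```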

Application
We show the applicability of the newly developed algorithm in two test cases. The first one is a simple model in which a point is steered through a two-dimensional plane with obstacles. This example helps to recall all important parts of the introduced approach and how it is actually applied. As a second test case, we consider a human arm model with 29 muscle models and five degrees of freedom. This demonstrates the applicability to complex models with many control inputs compared to the number of degrees of freedom.

Moving point in the 2D plane
We start with the simple dynamical model. The goal is to move a point to a desired end position. The position of the point at time step k is described by the two-dimensional vector x_k. The two-dimensional vector u_k represents the speed and direction of motion. This quantity forms our control, so we aim to find a controller which provides the velocity for the current position in order to achieve our goal. Overall, we have x_{k+1} = x_k + 0.05 · u_k, for k = 0, . . . , n.
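A minimal sketch of this environment and its reward, following the description of the reward given below; the handling of the forbidden areas and of the terminal bonus is simplified to placeholder flags, and the terminal squared-distance term is omitted.

```python
# Minimal sketch of the moving-point environment: x_{k+1} = x_k + 0.05 * u_k.
import numpy as np

TARGET = np.array([4.5, -2.5])

def step(x, u):
    """One time step of 0.05 min: x_{k+1} = x_k + 0.05 * u_k."""
    return x + 0.05 * u

def reward(x, u, in_forbidden_area=False, reached_target=False):
    """Negative distance to the target, plus penalty and terminal bonus."""
    r = -np.linalg.norm(x - TARGET)
    if in_forbidden_area:
        r -= 50.0          # entering a forbidden area is penalized
    if reached_target:
        r += 10.0          # reaching the desired position is rewarded
    return r
```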
Since Reinforcement Learning approaches do not need the explicit equations of motion, we only need these equations in order to simulate forward in time and collect trajectories. We use time steps of 0.05 min and define the final time to be 2.5 min (i.e. n = 50). The reward function returns the negative distance between the current position and the target position [4.5, -2.5]^T. Additionally, the final position of one trajectory is rewarded by minus the square of ten times the distance, and reaching the desired position leads to a positive reward of 10. Furthermore, we define two forbidden areas, visualized by red rectangles in Fig. 2; entering these areas is penalized by a negative reward of 50. In Figs. 2 and 3, the starting position is marked with a yellow triangle and the desired end position is located at the blue rectangle.

The policy consists of the ODE introduced above (cf. Fig. 1) and a Gaussian distribution with a mean value determined by the final state of the ODE. We consider the ODE on the interval [0, 10]. Overall, we initialize θ randomly and simulate trajectories with the introduced discrete equations of motion and the ODE-based policy depending on θ. We generate ten trajectories in each iteration. Afterwards, for each forward ODE we solve the corresponding adjoint ODE. In the end, we use the collected data to compute the update direction and update the parameters in this direction with step size 10^-2. After this training, we have a controller which is able to robustly control the point through the obstacle course. We applied the algorithm with different methods for solving the ODE in the policy numerically: an explicit Euler method with step size 1 (NOC_1000), 0.5 (NOC_0500) and 0.25 (NOC_0250). Trajectories produced with the trained controller can be observed in Fig. 3. Additionally, we ran the same example with REINFORCE [14] for comparison. Most of the time, the trajectories lie on top of each other or at least close to each other. We stress that, along these trajectories, the controls are randomly disturbed; this emphasizes the robustness of the controller. The reward during the training can be seen in Fig. 4.

Human arm model
At this point, we focus on a more complex model. The human arm model, which was introduced by the authors of [21] and is discussed in [22], consists of 29 Hill muscle models and five degrees of freedom. Three of the five degrees of freedom are located in the shoulder, where a spherical joint is modeled. The other two are in the elbow, defined by two revolute joints. A visualization of the arm can be seen in Fig. 5(a). The gray cylinders represent the upper arm, the forearm and the hand. The red, violet and blue lines portray the muscles, where the color stands for the activation of the muscle: if a muscle is not activated (activation = 0), it is blue; if it is fully activated (activation = 1), it is red. On the fingertip, a small ball is shown. This is the marker which is used to determine the distance from the fingertip to its desired end position. Keep in mind that the dimensions of the control and state space are no longer the same. This is handled by using an encoder implemented with neural networks. In [23], the authors discussed RL techniques as a possibility to control this biomechanical model. In the present contribution, we apply the newly introduced technique instead of the trust region policy optimization (see [5]) used in [23]. The goal of our optimization is to reach a desired final position with the fingertip. In Figs. 5(b)-5(e), one can see snapshots of the movement of the arm. The fingertip reaches the desired end position, which shows that the task of reaching a certain point is fulfilled. Furthermore, the movement looks human-like and is plausible from this point of view. As in the previous example, we look at the summed-up reward during the training in Fig. 6. Here, we run the training several times (black lines) and, in the end, calculate the mean and plot it as a red line. We can see in each run that the value of the objective function, which we maximize, increases iteratively up to a saturation limit.