Reinforcement Learning Control of Constrained Dynamic Systems with Uniformly Ultimate Boundedness Stability Guarantee

Reinforcement learning (RL) is promising for complicated stochastic nonlinear control problems. Without using a mathematical model, an optimal controller can be learned from data evaluated by certain performance criteria through trial and error. However, the data-based learning approach is notorious for not guaranteeing stability, which is the most fundamental property for any control system. In this paper, the classic Lyapunov method is explored to analyze uniformly ultimate boundedness (UUB) stability solely based on data, without using a mathematical model. It is further shown how RL with a UUB guarantee can be applied to control dynamic systems with safety constraints. Based on the theoretical results, an off-policy and an on-policy learning algorithm are proposed respectively. As a result, optimal controllers can be learned that guarantee UUB of the closed-loop system both at convergence and during learning. The proposed algorithms are evaluated on a series of robotic continuous control tasks with safety constraints. In comparison with existing RL algorithms, the proposed method achieves superior performance in terms of maintaining safety. As a qualitative evaluation of stability, our method shows impressive resilience even in the presence of external disturbances.


Introduction
The recent progress in reinforcement learning (RL) [1] has produced many interesting and impressive results in control problems, and RL has proven effective in finding optimal controllers for nonlinear stochastic systems modeled by a Markov decision process (MDP), for which the traditional control methods are hardly applicable. However, the learning methods are notorious for not guaranteeing stability. Given a control system, stability is one of the most important properties, because an unstable system is typically useless and potentially dangerous. This presents a major bottleneck for broad control engineering applications. Stability analysis has a long history in control engineering, in which Lyapunov's method plays a central role [2][3][4]. However, the classical control methods rely on full or partial knowledge of the system dynamics to design controllers and are largely limited to systems with simple dynamics. Thus, it is a natural move to combine RL with control theory to develop learning control methods with a stability guarantee [5,6]. (This paper was not presented at any IFAC meeting. Corresponding author: W. Pan. Email: wei.pan@tudelft.nl.)

Among the various definitions of stability, uniformly ultimate boundedness (UUB) stability has been extensively studied for dynamic systems [7,8]. UUB generally says that the trajectories will enter a neighborhood of the equilibrium within finite time and never escape from this set thereafter (see the trajectory in Fig. 1). Intuitively, this property is consistent with the requirement of many control tasks with constraints on the states, where the states are required to stay in a certain region which is safe. Thus in this paper, as an application scenario, a controller with a UUB guarantee is learned to solve control tasks with safety constraints.
In terms of learning a controller, UUB is well known in the context of adaptive dynamic programming (ADP), concerning (1) UUB of the states or tracking error in uncertain systems [9][10][11] and (2) UUB of the estimation error of the critic function and controller [9,12,13]. However, in ADP, the model structure, though not necessarily the model parameters, should be known a priori, and the input structure of the nonlinear system is often assumed to be affine. The UUB analysis for the general class of nonlinear stochastic systems by solely using data has not been addressed and remains an open problem [5,14].
Control tasks with safety constraints on the state have been extensively studied in the model predictive control (MPC) literature [15,16], and the results have been applied in various industrial processes [17,18]. In the context of RL, control problems with safety constraints are also well studied. In [19], the authors proposed a safety-constrained policy optimization (CPO) approach based on the trust-region method, which guarantees constraint satisfaction given a safe initial policy, but it is restricted to on-policy algorithms and suffers from low sample efficiency. In [20], a Lyapunov-based approach for solving constrained control tasks is proposed with a novel way of constructing the Lyapunov function through linear programming. In [21], the above result is further generalized to continuous control tasks. However, the above results can only guarantee that the cumulative sum of a designed constraint function is kept under a threshold. Moreover, none of these results provides stability guarantees of any kind. Since the property of attraction is missing, simply satisfying the safety constraints does not imply stability, and the agent may easily violate constraints in the presence of slight disturbances.
The application of machine learning algorithms to control constrained dynamic systems has been advancing recently. In [22], a model-based RL method is proposed to deal with Lipschitz continuous deterministic nonlinear systems. Nevertheless, safety is ensured by validating the stability condition on discretized points in a subset of the state space with the help of a learned model, limiting its application to rather simple and low-dimensional systems. The combination of RL with control barrier functions (CBF) has attracted much attention in recent years [23,24]. The general idea is to incorporate the RL controller with a model-based controller using CBFs. [23] exploits the nominal model to design a CBF-based controller to ensure safety; the unknown dynamics are then learned using a Gaussian process, and the RL controller further improves the return performance. In [24], based on a controller designed using the nominal model, RL is exploited to solve safe control problems in control-affine nonlinear systems under model uncertainty. Another common way of solving constrained control tasks is to incorporate MPC with online model-learning techniques, such as [25][26][27]. [28] proposes an RL framework that can exploit expert knowledge to safely improve control performance without violating safety constraints. Different from these approaches, in this paper, neither a nominal model nor expert knowledge is needed to learn a safe controller.
In this paper, a novel data-based UUB theorem that does not require a mathematical model is proposed. Based on the theoretical result, an off-policy actor-critic algorithm and a policy optimization algorithm are developed respectively to learn controllers with the UUB guarantee. The contributions of this paper can be summarized as follows:
• A novel and principled method is proposed to construct Lyapunov functions based on data to analyze the closed-loop stability of stochastic nonlinear systems characterized by an MDP.
• The classical definition of UUB is generalized to deal with control tasks with safety constraints on the states.
• Practical algorithms are designed to search for the optimal safe policy with a UUB guarantee, where safety is guaranteed both during learning and exploitation.
In a series of high-dimensional continuous control tasks with safety constraints such as locomotion for legged robots and manipulators, as well as a quadrotor, the proposed algorithms outperform the existing (safe) RL algorithms [19,21] in terms of both performance and safety. Besides, it is empirically shown that the controller with the UUB guarantee is more capable of dealing with perturbations and disturbances in comparison with those without such guarantees.
The remainder of this paper is organized as follows: the preliminaries and problem formulation are introduced in Section 2; the main theoretical result is presented in Section 3; an off-policy algorithm and a policy optimization algorithm with the UUB guarantee are described in Section 4; the experiments that validate the proposed algorithms are presented in Section 5; Section 6 concludes this work.

Preliminaries
In RL, a dynamical system is often characterized by a Markov decision process (MDP), in which the next state depends only on the current state and action. In an MDP, $s_t \in \mathcal{S} \subseteq \mathbb{R}^n$ is the state vector at time $t$, where $\mathcal{S}$ denotes the state space. The agent takes an action $a_t \in \mathcal{A} \subseteq \mathbb{R}^m$ according to a stochastic policy/controller $\pi(a_t|s_t)$, resulting in the next state $s_{t+1}$. The transition of the state is governed by the transition probability density function $p(s_{t+1}|s_t, a_t)$, which denotes the probability density of the next state $s_{t+1}$. In an MDP, a reward function $r(s_t, a_t)$ is used to measure the immediate performance of a state-action pair $(s_t, a_t)$. The goal is to find a $\pi$ that maximizes the objective function (return) $J(\pi) \triangleq \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t}\left[\gamma^t r(s_t, a_t)\right]$. Additionally, for control problems with safety constraints, a continuous non-negative constraint function $c(s_t, a_t)$ is introduced to measure how safe a state-action pair is. A state-action pair can be viewed as safe if $c(s_t, a_t)$ is lower than a designed threshold.
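As a concrete illustration, the return $J(\pi)$ above can be estimated by Monte-Carlo averaging over sampled trajectories. The sketch below is a minimal stand-alone helper; the reward sequences are hypothetical inputs, and the code follows the paper's convention that discounting starts at $t = 1$:

```python
def discounted_return(rewards, gamma):
    """Sum_{t=1}^{T} gamma^t * r_t for one sampled trajectory.

    Note: following the paper's summation starting at t = 1, the first
    reward is weighted by gamma (not gamma^0).
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def estimate_J(trajectories, gamma=0.99):
    """Monte-Carlo estimate of J(pi) over a batch of reward sequences."""
    return sum(discounted_return(tr, gamma) for tr in trajectories) / len(trajectories)
```

For example, with two toy trajectories `[[1.0], [0.0]]` and `gamma=0.5`, the estimate averages a discounted reward of 0.5 and 0.0.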
Next, the notation of this paper will be introduced in two parts. First, the concept and definition of UUB studied in this paper are introduced. Then constrained control problems, i.e., the constrained Markov decision process (CMDP), are described as a particular application scenario of UUB.

Uniformly Ultimate Boundedness (UUB) Stability
First, the classical definition of UUB stability is given as follows.
Definition 1 [29] A system is said to be uniformly ultimately bounded with ultimate bound $\eta$ if there exist positive constants $b$ and $\eta$ such that for every $\epsilon < b$ there exists $T(\epsilon, \eta)$ satisfying
$$\|s_{t_0}\| < \epsilon \implies \|s_t\| < \eta, \quad \forall t > t_0 + T.$$
If this holds for arbitrarily large $\epsilon$, the system is globally uniformly ultimately bounded.
The classical definition of UUB generally says that for trajectories starting from a point where the norm of the state is less than $b$, the state will eventually enter and stay inside the set where $\|s\| \leq \eta$ after $T$ time steps. However, the above definition is of limited use for the general class of control tasks with safety constraints, where the constraint function $c(s_t, a_t)$ is not necessarily the norm of the state $\|s\|$. For example, safety-critical applications like autonomous vehicles may use the distance from the working area or the central lane as the safety constraint; altitude control of a drone may require the sine of an angle to be less than a certain threshold. Therefore, the classical definition of UUB is extended to the more general case as follows.
Definition 2 A system is said to be uniformly ultimately bounded with respect to $c_\pi(\cdot)$ if there exist positive constants $b$ and $\eta$ such that for every starting state with $c_\pi(s_{t_0}) < b$ there exists a finite $T$ satisfying $c_\pi(s_t) < \eta$ for all $t > t_0 + T$, where $c_\pi(s) \triangleq \mathbb{E}_{a\sim\pi}\, c(s, a)$ denotes the constraint function under the controller $\pi$. In the rest of this paper, UUB refers to the property defined above.
The difference between Definition 1 and Definition 2 lies merely in the substitution of the norm of the state with a constraint function. The general idea of UUB in Definition 2 is demonstrated in Figs. 1 and 2 in the state space and time domain respectively. With a proper choice of $c(s, a)$, UUB in terms of the constraint function implies recoverability from danger within finite time. For example, a vehicle will recover to the road within finite time if it is accidentally disturbed and runs into a risky area; a motor can recover to normal status within finite time if the kinetic energy or torque accidentally exceeds a dangerous threshold.
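Definition 2 can be checked empirically on a recorded trajectory of constraint values: after some finite entry time $T$, $c_\pi(s_t)$ must stay below $\eta$ for the rest of the record. A minimal sketch (the trajectories and thresholds below are illustrative, not from the paper's experiments):

```python
def entry_time(c_values, eta):
    """First instant T after which the recorded constraint values stay
    strictly below eta for the remainder of the record, or None if the
    bound is never permanently satisfied."""
    T = None
    for t, c in enumerate(c_values):
        if c >= eta:
            T = None          # bound violated; reset the candidate entry time
        elif T is None:
            T = t             # candidate instant of entering the inner set
    return T


def is_uub(c_values, eta):
    """The recorded trajectory is consistent with UUB iff it eventually
    enters and never leaves the set {c < eta}."""
    return entry_time(c_values, eta) is not None
```

For instance, the decaying sequence `[2.0, 1.0, 0.5, 0.3]` with `eta=1.0` enters the inner set at instant 2 and stays there, while `[0.5, 2.0]` leaves it and is not consistent with UUB on the recorded horizon.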

Control of Constrained Dynamic System
In this section, the control problems with safety constraints will be formulated. It will be further shown how safety constraints can be ensured as a consequence of the UUB guarantee.
The safety constraints are measured by a continuous non-negative constraint function $c(s_t, a_t)$. In safety-constrained control tasks, the objective is to find a controller $\pi$ that not only maximizes $J(\pi)$ but also satisfies the safety constraint in expectation,
$$\max_\pi J(\pi) \quad \text{s.t.} \quad \mathbb{E}_{p(s|\rho,\pi,t)}\, c_\pi(s) \leq d, \quad \forall t, \tag{1}$$
where $d$ is a designed safety threshold. An MDP with such safety constraints is called a constrained Markov decision process (CMDP) [30].
If the system is UUB in terms of the constraint function with an ultimate bound $\eta < d$, the value of the constraint function is guaranteed to converge and stay under $\eta$ after $T$ time steps. To further ensure safety during the $T$ time steps, it also needs to be ensured that the expectation of the constraint function stays below $d$. Later it will be shown that this is an inherent property of a system being UUB. Thus solving the control problem (1) is equivalent to finding an optimal controller that ensures the closed-loop system is UUB.
Before proceeding, some notation is introduced. The action-value function $Q^\pi(s,a) \triangleq \sum_{t=1}^{\infty}\mathbb{E}_{s_t,a_t}\left[\gamma^t r(s_t,a_t)\,|\,s_1 = s, a_1 = a\right]$ denotes the subsequent return under the controller $\pi$ after taking action $a$ at state $s$. The value function with respect to the constraint function, $V_c^\pi(s) \triangleq \sum_{t=1}^{\infty}\mathbb{E}_{s_t,a_t}\left[\gamma^t c(s_t,a_t)\,|\,s_1 = s\right]$, denotes the discounted sum of the constraint function starting from state $s$ under the controller $\pi$. $\rho(s_1)$ denotes the probability density function of starting states, which is continuous and takes positive values on the starting set $\Gamma \triangleq \{s \mid c_\pi(s) \leq b\}$. The closed-loop transition probability is denoted as $p_\pi(s'|s) \triangleq \int_{\mathcal{A}} \pi(a|s)p(s'|s,a)\,\mathrm{d}a$. The closed-loop state distribution at instant $t$ is denoted as $p(s|\rho,\pi,t)$, which can be defined iteratively: $p(s'|\rho,\pi,t+1) = \int_{\mathcal{S}} p_\pi(s'|s)p(s|\rho,\pi,t)\,\mathrm{d}s$, $\forall t \in \mathbb{Z}_{\geq 1}$, and $p(s|\rho,\pi,1) = \rho(s)$.
Two important sets are exploited in this paper, as shown in Fig. 1. First, the edge set $\Delta \triangleq \{s \mid c_\pi(s) \geq \eta\}$ is composed of the states that are unsafe. Conversely, the inner set $B \triangleq \{s \mid c_\pi(s) < \eta\}$ is the set of absolutely safe states, and $B \cup \Delta = \mathcal{S}$.
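The edge/inner partition can be computed directly from sampled states once $c_\pi$ can be evaluated (e.g., by averaging $c(s,a)$ over sampled actions). A toy sketch, using a hypothetical one-dimensional state with $c_\pi(s) = |s|$:

```python
def partition_states(states, c_pi, eta):
    """Split sampled states into the edge set Delta = {s : c_pi(s) >= eta}
    and the inner set B = {s : c_pi(s) < eta}; together they cover the
    sampled portion of the state space."""
    edge = [s for s in states if c_pi(s) >= eta]
    inner = [s for s in states if c_pi(s) < eta]
    return edge, inner
```

With `c_pi = abs` and `eta = 1.0`, the states `[-2.0, 0.5, 3.0]` split into edge set `[-2.0, 3.0]` and inner set `[0.5]`.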

Theoretical Results
In this section, the main assumptions and a novel data-based UUB theorem for stochastic systems are proposed. The UUB theorem enables analyzing the UUB of the system in a data-based manner, which further enables the policy learning in the next section.
First, we make the following assumption (Assumption 1). This assumption easily holds with the typical choices of constraint function in practice, such as kinetic energy, altitude, etc. It is also assumed that the Markov chain induced by the policy $\pi$ is ergodic with a unique stationary distribution $q_\pi$, $q_\pi(s) = \lim_{t\to\infty} p(s|\rho,\pi,t)$, which is a common assumption in the optimal control of Markov decision processes [31][32][33][34].
Our approach utilizes the Lyapunov function to prove the stability condition. Lyapunov's method has long been used in control theory for stability analysis and controller design [35], but mostly exploited along with a known model of a dynamic system, whether deterministic or probabilistic [7,29,36].
In this paper, instead of using a model, we will present the following theorem as a sufficient condition for UUB solely based on samples.
Theorem 1 If there exists a function $L(s): \mathcal{S} \to \mathbb{R}_+$ and positive constants $\alpha_1, \alpha_2, \alpha_3, \eta$, such that
$$\alpha_1 c_\pi(s) \leq L(s) \leq \alpha_2 c_\pi(s), \quad \forall s \in \mathcal{S}, \tag{2}$$
and
$$\mathbb{E}_{s\sim\mu_N}\left[\mathbb{E}_{s'\sim p_\pi} L(s')\mathbb{1}_\Delta(s') - L(s)\mathbb{1}_\Delta(s)\right] < -\alpha_3\, \mathbb{E}_{s\sim\mu_N}\left[c_\pi(s)\mathbb{1}_\Delta(s)\right], \tag{3}$$
where $\mu_N(s) \triangleq \frac{1}{N}\sum_{t=1}^{N} p(s|\rho,\pi,t)$ denotes the average distribution of $s$ over the finite $N$ time steps, and $N$ is the maximum instant at which the probability of being in the edge set is greater than zero, $N = \max\{t : \mathbb{P}(s\in\Delta|\rho,\pi,t) > 0\}$; $N = \infty$ if for any $\delta$ there exists an instant $t > \delta$ such that $\mathbb{P}(s\in\Delta|\rho,\pi,t) > 0$. $\mathbb{1}_\Delta(s)$ denotes the indicator function, which equals $1$ if $s \in \Delta$ and $0$ otherwise. Then the following hold: i) the system is uniformly ultimately bounded with ultimate bound $\eta$; ii) the expectation $\mathbb{E}_{p(s|\rho,\pi,t)}\, c_\pi(s)$ is bounded during the $N$ time steps,
$$\mathbb{E}_{p(s|\rho,\pi,t)}\, c_\pi(s) \leq \frac{\alpha_2 b}{\alpha_3} + \eta, \quad \forall t \in [1, N].$$
Proof. The proof can be divided into two steps. First, we prove that $N$ is finite based on the conditions and assumptions; then we prove the boundedness of the expectation of $c_\pi$ during the $N$ steps. For the first step, we assume that $N$ is infinite and prove by contradiction.
In that case, the finite-horizon sampling distribution $\mu_N(s)$ turns into the infinite-horizon sampling distribution
$$\mu(s) \triangleq \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} p(s|\rho,\pi,t).$$
The existence of $\mu(s)$ is guaranteed by the existence of $q_\pi(s)$: since the sequence $\{p(s|\rho,\pi,t),\, t \in \mathbb{Z}_+\}$ converges to $q_\pi(s)$ as $t$ approaches $\infty$, by the Abelian theorem the sequence $\{\frac{1}{T}\sum_{t=1}^{T} p(s|\rho,\pi,t),\, T \in \mathbb{Z}_+\}$ also converges, and $\mu(s) = q_\pi(s)$. One then naturally has that the sequence $\{\mu_T(s)L(s)\mathbb{1}_\Delta(s),\, T \in \mathbb{Z}_+\}$ converges pointwise to $q_\pi(s)L(s)\mathbb{1}_\Delta(s)$.
Let $g_\pi(s)$ denote $\mathbb{E}_{a\sim\pi}\, g(s,a)$. According to Assumption 1 and (2), $L(s) \leq \alpha_2 c_\pi(s) \leq \alpha_2 g_\pi(s)$, so the sequence $\{\mu_T(s)L(s)\mathbb{1}_\Delta(s)\}$ is dominated by the integrable function $\alpha_2 g_\pi(s)$. According to Lebesgue's dominated convergence theorem [37], if a sequence $f_n(s)$ converges pointwise to a function $f$ and is dominated by some integrable function $g$ in the sense that $|f_n(s)| \leq g(s)$, then
$$\lim_{n\to\infty}\int f_n(s)\,\mathrm{d}s = \int \lim_{n\to\infty} f_n(s)\,\mathrm{d}s = \int f(s)\,\mathrm{d}s.$$
Applying this theorem to the first term on the left-hand side of (3) yields
$$\lim_{N\to\infty}\mathbb{E}_{s\sim\mu_N}\left[\mathbb{E}_{s'\sim p_\pi}L(s')\mathbb{1}_\Delta(s')\right] = \mathbb{E}_{s\sim q_\pi}\left[\mathbb{E}_{s'\sim p_\pi}L(s')\mathbb{1}_\Delta(s')\right].$$
It follows that on the left-hand side of (3),
$$\lim_{N\to\infty}\mathbb{E}_{s\sim\mu_N}\left[\mathbb{E}_{s'\sim p_\pi}L(s')\mathbb{1}_\Delta(s') - L(s)\mathbb{1}_\Delta(s)\right] = \mathbb{E}_{s\sim q_\pi}\left[\mathbb{E}_{s'\sim p_\pi}L(s')\mathbb{1}_\Delta(s')\right] - \mathbb{E}_{s\sim q_\pi}\left[L(s)\mathbb{1}_\Delta(s)\right] = 0, \tag{4}$$
where the last equality holds because $q_\pi$ is the stationary distribution of the closed-loop transition kernel $p_\pi$. Note that $c_\pi(s)\mathbb{1}_\Delta(s)$ is greater than $\eta$ if $s \in \Delta$ and equals zero if $s \notin \Delta$, which can be summarized as $c_\pi(s)\mathbb{1}_\Delta(s) \geq \eta\mathbb{1}_\Delta(s)$. Now let us look into the right-hand side of (3). Since $\mu(s) = q_\pi(s)$, the right-hand side of (3) converges to
$$-\alpha_3\,\mathbb{E}_{s\sim q_\pi}\left[c_\pi(s)\mathbb{1}_\Delta(s)\right] \leq -\alpha_3\eta\int_{\mathcal{S}} q_\pi(s)\mathbb{1}_\Delta(s)\,\mathrm{d}s = -\alpha_3\eta\lim_{t\to\infty}\mathbb{P}(s\in\Delta|\rho,\pi,t).$$
Combining the above inequality with (3) and (4), one has $\lim_{t\to\infty}\mathbb{P}(s\in\Delta|\rho,\pi,t) < 0$, which contradicts the fact that $\mathbb{P}(s\in\Delta|\rho,\pi,t)$ is non-negative. Thus there exists a finite $N$ such that $\mathbb{P}(s\in\Delta|\rho,\pi,t) = 0$ for all $t > N$, which concludes the proof of UUB.
Additionally, it will be shown that the expectation of $c_\pi(s)$ is bounded by a finite value during the $N$ time steps. As $N$ is finite, writing out the expectation over $\mu_N$ in (3) and telescoping over $t$ implies
$$\alpha_3\frac{1}{N}\sum_{t=1}^{N}\mathbb{E}_{p(s|\rho,\pi,t)}\left[c_\pi(s)\mathbb{1}_\Delta(s)\right] \leq \frac{1}{N}\left(\mathbb{E}_{\rho}\left[L(s)\mathbb{1}_\Delta(s)\right] - \mathbb{E}_{p(s|\rho,\pi,N+1)}\left[L(s)\mathbb{1}_\Delta(s)\right]\right) \leq \frac{\alpha_2 b}{N},$$
where the last inequality uses (2) and the fact that $c_\pi(s) \leq b$ on the starting set $\Gamma$. Then for any instant $n \in [1, N]$, since every summand is non-negative, one has
$$\mathbb{E}_{p(s|\rho,\pi,n)}\left[c_\pi(s)\mathbb{1}_\Delta(s)\right] \leq \frac{\alpha_2 b}{\alpha_3}.$$
Note that the expectation of $c_\pi(s)$ at instant $n$ equals
$$\int_{\mathcal{S}} p(s|\rho,\pi,n)c_\pi(s)\,\mathrm{d}s = \int_{\Delta} p(s|\rho,\pi,n)c_\pi(s)\,\mathrm{d}s + \int_{B} p(s|\rho,\pi,n)c_\pi(s)\,\mathrm{d}s.$$
The first term is bounded by $\alpha_2 b/\alpha_3$ as shown above, and the second term is bounded by $\eta$ since $c_\pi(s) < \eta$ on $B$. Then the bound of the expectation of $c_\pi(s)$ at instant $n$ is derived as
$$\mathbb{E}_{p(s|\rho,\pi,n)}\, c_\pi(s) \leq \frac{\alpha_2 b}{\alpha_3} + \eta,$$
which concludes the proof.
Some discussion and explanation of the above theorem is needed. First, (2) confines the properties that the Lyapunov function needs to satisfy. (3) is the data-based energy-decreasing condition, whose evaluation requires sampling data according to the sampling distribution $\mu_N(s)$. Although $\mu_N(s)$ is defined on the state space $\mathcal{S}$, the indicator function $\mathbb{1}_\Delta(s)$ only takes a non-zero value on the edge set $\Delta$. Thus (3) requires the Lyapunov value to be decreasing on average in the edge set $\Delta$, so that the state eventually enters the inner set $B$.
Remark 1 While results on UUB of various systems are well known [7,29,36], these results require full knowledge of the dynamic model of the system. On the contrary, the proposed UUB theorem enables a data-based approach to analyzing the stability of the system, i.e., collecting state transition pairs and evaluating (3) through the Monte-Carlo method. In the data-based stability analysis, the system can be a complete black box, as long as its dynamics satisfy the Markov property.
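The Monte-Carlo check described in Remark 1 can be sketched as follows, assuming an energy-decrease condition of the schematic form "average decrease of $L\cdot\mathbb{1}_\Delta$ over sampled transitions is at most $-\alpha_3$ times the average of $c_\pi\cdot\mathbb{1}_\Delta$" (the exact condition (3) in the text may differ in details). The functions `L`, `c_pi`, and `in_edge` are hypothetical stand-ins for the learned Lyapunov function, the constraint function, and the edge-set indicator:

```python
def uub_condition_holds(transitions, L, c_pi, in_edge, alpha3):
    """Monte-Carlo check of a data-based energy-decrease condition:
    the Lyapunov value, restricted to the edge set by the indicator,
    must decrease on average faster than alpha3 * c_pi there.
    `transitions` is a list of sampled (s, s_next) pairs."""
    n = len(transitions)
    lhs = sum(L(sn) * in_edge(sn) - L(s) * in_edge(s) for s, sn in transitions) / n
    rhs = -alpha3 * sum(c_pi(s) * in_edge(s) for s, _ in transitions) / n
    return lhs <= rhs
```

On a contracting 1-D toy system (states move toward the origin), the check passes; on an expanding one, it fails.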
Some connections are to be drawn between safety-constrained control problems and the proposed UUB theorem.
If the system is UUB with ultimate bound $\frac{\alpha_2 b}{\alpha_3} + \eta < d$, then it is guaranteed that the system satisfies the safety constraint in (1). These conditions can be satisfied by choosing the hyperparameters $\alpha_2$, $\alpha_3$, and $\eta$. (3) is the condition whose satisfaction requires training of the control policy, which will be discussed in detail in the following section.

Reinforcement Learning Algorithms with UUB Stability Guarantee
In this section, combined with the theoretical result in Theorem 1, an off-policy and an on-policy RL algorithm are proposed respectively. First, based on the result in Theorem 1, an actor-critic RL algorithm called Lyapunov-based soft actor-critic (LSAC) is given, where two critic functions are used. The first is the standard critic function Q(s, a) in actor-critic RL, which evaluates the performance in terms of the cumulative return. The other critic is introduced to evaluate the UUB condition (3); we call it the Lyapunov critic function L_c(s, a). Then a trust-region policy optimization algorithm, Lyapunov-based constrained policy optimization (LCPO), is developed. LCPO ensures that at each update step the UUB condition is satisfied and the cumulative return increases monotonically, so that safety during training can also be guaranteed. In both LSAC and LCPO, a Lyapunov candidate is first specified to direct the learning of the Lyapunov function. The controller is then updated to ensure that the energy-decreasing UUB condition (3) holds for the learned Lyapunov function.

Learning a Lyapunov Critic Function
In Theorem 1, the Lyapunov function L(s) is essential to the stability analysis. However, it is not directly applicable in an existing actor-critic learning framework, since the gradient of L with respect to the controller π is unavailable. To enable actor-critic learning, the Lyapunov critic function L_c(s, a) is introduced, so that the stability condition can be evaluated and the learned controller π can guarantee the stability of the closed-loop system. L_c depends on both the state s and the action a, and satisfies L(s) = E_{a∼π} L_c(s, a), such that it can be exploited in evaluating (3).
In this paper, L c is constructed by using a fully connected deep neural network (DNN) parameterized by φ. A ReLU activation function is used in the output layer of the DNN to ensure positive output.
From a theoretical point of view, some functions, such as the norm of the state and the value function, naturally satisfy the basic requirement (2) of a Lyapunov function. These functions are referred to as Lyapunov candidate functions. However, Lyapunov candidate functions are conceptual functions without any parameterization. Since their gradients with respect to the controller are not tractable, they are not directly applicable in an actor-critic learning process. In the proposed framework, the Lyapunov candidate acts as a supervision signal during the training of $L_c$: $L_c$ is updated to approximate a target function $L_{target}$ related to the chosen Lyapunov candidate by minimizing the following least-squares objective,
$$J(L_c) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\left(L_c(s,a) - L_{target}(s,a)\right)^2\right], \tag{5}$$
where $\mathcal{D} = \{(s, a, s', r, c)\}$ is the set of transition tuples observed under the controller $\pi$.
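The least-squares fit of the Lyapunov critic can be sketched with a toy linear-in-features parameterization standing in for the DNN (the features and data below are hypothetical; only the squared-error objective mirrors the text):

```python
def fit_lc(data, features, lr=0.1, epochs=500):
    """Fit L_c(s, a) = w . phi(s, a) by full-batch gradient descent on the
    mean squared error against the target values, a linear stand-in for
    the DNN update. `data` is a list of ((s, a), target) pairs."""
    dim = len(features(*data[0][0]))
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for (s, a), target in data:
            phi = features(s, a)
            err = sum(wi * pi for wi, pi in zip(w, phi)) - target
            grad = [g + err * pi for g, pi in zip(grad, phi)]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w
```

For example, fitting targets generated by `s + a` with features `[s, a]` recovers weights close to `[1, 1]`.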
The choice of the Lyapunov candidate plays an important role in learning a controller. In control theory, the sum of quadratic polynomials, e.g., $L(s) = s^\top P s$ where $P$ is a positive definite matrix, is often used. Such Lyapunov functions can be efficiently discovered by semi-definite programming (SDP) solvers, with certain limited conservatism, for control tasks. In the context of RL [20,22], the value function $V_c^\pi$ has been proven to be a valid Lyapunov candidate. Meanwhile, the constraint function $c_\pi$ is also a valid Lyapunov candidate due to its non-negativity. The value function and the constraint function are chosen as the Lyapunov candidates in this paper, while other potential choices are left for future study.
With the value function chosen as the Lyapunov candidate, the target function $L_{target}$ in (5) is
$$L_{target}(s, a) = c(s, a) + \gamma L_c'(s', a'), \quad a' \sim \pi(\cdot|s'), \tag{6}$$
where $L_c'$ is a network with the same structure as $L_c$ but parameterized by a different set $\phi'$. These parameters are updated through an exponential moving average controlled by a hyperparameter $\tau \in (0, 1)$, $\phi'_{k+1} \leftarrow \tau\phi_k + (1-\tau)\phi'_k$, as is typical in actor-critic algorithms [38]. It should be noted that when the constraint function is chosen as the Lyapunov candidate, the target function is much simpler, $L_{target}(s, a) = c(s, a)$, and the $L_c'$ network and moving-average update are not needed.
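The target-network update is a one-liner worth making explicit; the sketch below uses the convention that the target parameters move a fraction τ toward the online parameters at each step (parameters are represented as flat lists for illustration):

```python
def ema_update(target_params, params, tau):
    """Exponential moving average used for the target network:
    phi' <- tau * phi + (1 - tau) * phi', elementwise over parameters.
    Small tau makes the target track the online network slowly."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

Starting from a target of `[0.0]` and an online parameter of `[1.0]` with `tau = 0.1`, one step gives `[0.1]` and a second step gives `[0.19]`, slowly approaching the online value.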
Remark 2 This remark collectively summarizes the Lyapunov terms used in the section above for clarity. i) L refers to the Lyapunov function. ii) Lyapunov candidates are functions that can potentially be used as a Lyapunov function, such as the value function and the norm of the state. iii) The Lyapunov critic function L_c is a function dependent on the state and action, and satisfies L(s) = E_{a∼π} L_c(s, a). iv) The target function L_target refers to the supervision signal used in the training of L_c, which takes different forms when different Lyapunov candidates are chosen. v) L_c' is a network that shares the same structure as L_c, but its parameters are updated by applying a moving average to the parameters of L_c. This is a typical trick in the RL literature, designed to improve the stability of the learning process.
Remark 3 In this paper, the Lyapunov functions are parameterized using neural networks. It is also possible to choose a parameterization that can be updated using SDP, such as a quadratic in s and a. However, such a parameterization may result in a large approximation error when the constraint function involves multiple types of nonlinearities that cannot be approximated by a single polynomial, e.g., the constraint functions with non-differentiable nonlinearities that will be used in Section 5. Alternatively, neural networks are powerful function approximators that can theoretically approximate nonlinear functions to any desired precision; thus we exploit neural networks to show the general applicability of the proposed method, and leave other parameterizations for future studies.

Lyapunov-based Safe Off-Policy RL Algorithm
A novel off-policy RL algorithm based on the actor-critic framework, the Lyapunov-based soft actor-critic (LSAC), is proposed to learn a controller π that maximizes the return while guaranteeing UUB for the closed-loop system.
The controller $\pi_\theta$ is parameterized by a DNN $f_\theta(s, \epsilon)$ depending on $s$ and a Gaussian noise $\epsilon$. The goal is to learn $\theta$ that maximizes $J(\pi)$ in (1) while satisfying the UUB condition (3) simultaneously. In this paper, we build the actor-critic algorithm on the maximum entropy framework [39], which enhances the exploration of the controller during learning and has been proven to substantially improve the robustness of the learned controller [40,41].
The learning problem is summarized as follows: maximize the return objective subject to the UUB condition (3), evaluated on samples from the edge set, and an entropy constraint
$$\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta}\left[-\log\pi_\theta(a|s)\right] \geq \mathcal{H}_t, \tag{9}$$
where (9) sets the entropy of the controller to be larger than a designed threshold $\mathcal{H}_t$. By exploiting the Lagrange method, solving the above constrained optimization problem is equivalent to minimizing the following objective function,
$$J(\pi_\theta) = \mathbb{E}_{\mathcal{D}}\left[\beta\log\pi_\theta(a|s) - Q(s, f_\theta(s,\epsilon))\right] + \lambda\,\mathbb{E}_{\mathcal{D}_\Delta}\left[L_c(s', f_\theta(s',\epsilon)) - L_c(s,a) + \alpha_3 c(s,a)\right], \tag{10}$$
where $\beta$ and $\lambda$ are positive Lagrange multipliers, both adjusted through the gradient descent/ascent method. $\mathcal{D}_\Delta$ denotes the transition pairs collected from the sampling distribution $\mu_N(s)$.
In our implementation, the double Q-learning technique [42] is exploited, where two Q-functions $\{Q_1, Q_2\}$ are parameterized by neural networks with parameters $\nu_1, \nu_2$. The Q-function with the lower value is exploited in the learning process [43], which is useful in mitigating the performance degradation caused by bias in the value estimation. Taking these techniques into consideration, the gradient with respect to $\theta$ is obtained as
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\mathcal{D}}\left[\beta\nabla_\theta\log\pi_\theta(a|s) + \left(\beta\nabla_a\log\pi_\theta(a|s) - \nabla_a \min_{i}Q_i(s,a)\right)\nabla_\theta f_\theta(s,\epsilon)\right] + \lambda\,\mathbb{E}_{\mathcal{D}_\Delta}\left[\nabla_{a'} L_c(s',a')\nabla_\theta f_\theta(s',\epsilon)\right]. \tag{11}$$
The gradient is composed of two parts: (1) the gradient estimated by the Q-function and the entropy of the controller based on the samples from the replay buffer $\mathcal{D}$, and (2) the gradient estimated by the Lyapunov critic based on the samples from the edge buffer $\mathcal{D}_\Delta$. (11) is the basis of the actor-critic algorithm and enables the update of the controller with observed transition pairs.
In the actor-critic algorithm, the Q-functions are updated by gradient descent on the following objective,
$$J(Q_i) = \mathbb{E}_{\mathcal{D}}\left[\left(Q_i(s,a) - \left(r + \gamma\left(\min_{j}Q_j'(s',a') - \beta\log\pi_\theta(a'|s')\right)\right)\right)^2\right], \quad a' \sim \pi_\theta(\cdot|s'),$$
where $Q_i'$ is the target network that has the same structure as $Q_i$ and is parameterized by $\nu_i'$, updated through the moving average.
Finally, the values of the Lagrange multipliers $\beta$ and $\lambda$ are adjusted by gradient ascent to maximize the following objectives, respectively,
$$J(\beta) = \beta\,\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta}\left[\log\pi_\theta(a|s) + \mathcal{H}_t\right],$$
$$J(\lambda) = \lambda\,\mathbb{E}_{\mathcal{D}_\Delta}\left[L_c(s', f_\theta(s',\epsilon)) - L_c(s,a) + \alpha_3 c(s,a)\right],$$
so that $\beta$ grows when the policy entropy falls below $\mathcal{H}_t$ and $\lambda$ grows when the sampled UUB condition is violated. The pseudocode of LSAC is summarized in Algorithm 1.
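The dual updates can be sketched as plain projected gradient ascent on the two multipliers. This is a schematic stand-in for the stochastic-gradient version used in practice; the scalar `entropy` and `constraint_value` arguments are assumed to be sample estimates:

```python
def update_multipliers(beta, lam, entropy, target_entropy,
                       constraint_value, lr=1e-2):
    """One gradient-ascent step on the dual variables: beta grows when the
    policy entropy is below its threshold, lam grows when the sampled
    Lyapunov (UUB) constraint value is positive (violated); both are
    projected back to the non-negative orthant."""
    beta = max(0.0, beta + lr * (target_entropy - entropy))
    lam = max(0.0, lam + lr * constraint_value)
    return beta, lam
```

With an entropy deficit the temperature rises, and with a strongly satisfied (very negative) constraint the safety multiplier decays toward zero.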

Algorithm 1 Lyapunov-based Safe Actor-Critic Algorithm (LSAC)
Set iteration index i ← 0 and learning rate δ
repeat
    Sample s_1 according to ρ
    for each time step do
        Sample a_t from π(a_t|s_t) and step forward
        Observe and store (s_t, a_t, r_t, c_t, s_{t+1}) in D
    end for
    Record the largest instant N at which s_N ∈ ∆
    Store all tuples (s_t, a_t, r_t, c_t, s_{t+1}), t ≤ N, in D_∆
    for each update step do
        Sample mini-batches of transitions from D and D_∆ and update parameters with gradients
    end for
    i ← i + 1
until (3) is satisfied and i exceeds a designed threshold

Lyapunov-based On-Policy RL Algorithm
Off-policy algorithms can exploit data collected under a different controller and update the controller much more frequently than on-policy algorithms, and thus are more favorable in terms of data efficiency and convergence speed. However, safety can hardly be guaranteed during the training process of off-policy algorithms [21]. When the controller is trained online in the real world and data are collected directly from physical systems, safety needs to be guaranteed even during training. To this end, an on-policy algorithm called Lyapunov-based constrained policy optimization (LCPO) is proposed for these safety-critical scenarios.
In comparison with LSAC (11), instead of approximating the policy gradient using the approximated critics, LCPO is a trust-region style method built upon the linearization of the constraint and objective functions locally around the current parameter θ. In trust-region methods, a local constrained optimization problem is solved at each update step, and the parameter takes a small step towards improving the objective while satisfying the constraints. Consequently, the policy optimization needs more samples than actor-critic methods to correctly approximate the constrained optimization problem, as well as more computation time for the line search at each update [19]. On the other hand, this also guarantees approximate monotonic improvement of the performance and safety during training given an initially safe policy [44,45]. Following such a procedure, it is possible to directly train a controller on the real system with safety assured. The update of the policy parameter θ at the k-th iteration can be formulated as follows,
$$\theta_{k+1} = \arg\max_\theta\; J(\pi_\theta) \tag{12}$$
$$\text{s.t.}\quad \mathbb{E}_{s\sim\mu_N}\left[\mathbb{E}_{s'\sim p_{\pi_\theta}}L(s')\mathbb{1}_\Delta(s') - L(s)\mathbb{1}_\Delta(s) + \alpha_3 c_{\pi_\theta}(s)\mathbb{1}_\Delta(s)\right] \leq 0 \tag{13}$$
$$\phantom{\text{s.t.}\quad} D_{KL}\left(\pi_\theta \,\|\, \pi_k\right) \leq \delta. \tag{14}$$
Here, $\pi_k$ denotes the policy parameterized by $\theta_k$ at the $k$-th update. In the trust-region method, the local policy search constraint (14) prevents the policy from taking unreasonably large update steps and ensures that the post-update policy stays in the neighborhood of the previous policy specified by $\delta$. Here, $D_{KL}(p\|q)$ denotes the KL-divergence between two distributions $p$ and $q$, $D_{KL}(p\|q) \doteq \mathbb{E}_p \log(p/q)$. The KL-divergence is a measure of the difference between two distributions and is commonly used in trust-region methods [19,46]. At each update step, the above constrained optimization problem is solved analytically. Since the search for the policy is constrained around the previous policy $\pi_k$ by (14), it is possible to linearize the objective function (12) and the safety constraint (13) around $\pi_k$ and approximate the local policy search constraint (14) by a second-order expansion [19].
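For Gaussian policies, the KL-divergence bounded by the trust region in (14) has a closed form. A minimal sketch for diagonal Gaussian policies (the means and standard deviations below are illustrative):

```python
import math

def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """D_KL(p || q) for diagonal Gaussian policies, summed over action
    dimensions; this is the kind of quantity bounded by delta in (14)."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl
```

The divergence is zero between identical policies and grows quadratically with the mean shift, which is why small trust-region radii keep successive policies close.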
The approximated optimization problem is as follows,
$$\max_\theta\; g_Q^\top(\theta - \theta_k) \quad \text{s.t.}\quad h + g_L^\top(\theta - \theta_k) \leq 0, \quad \tfrac{1}{2}(\theta - \theta_k)^\top H(\theta - \theta_k) \leq \delta, \tag{15}$$
where $g_Q$ and $g_L$ are the gradients of the objective function and the safety constraint function with respect to $\theta$ at $\theta_k$, $h$ is the value of the safety constraint function at $\theta_k$, and $H$ is the Hessian of the KL-divergence. Note that the Fisher information matrix $H$ is guaranteed to be positive semi-definite; thus the above optimization problem is convex, and its dual is
$$\min_{\lambda\geq 0,\,\beta\geq 0}\; \frac{1}{2\lambda}\left(M - 2\beta Z + \beta^2 N\right) - \beta h + \lambda\delta, \tag{16}$$
where $M = g_Q^\top H^{-1} g_Q$, $Z = g_Q^\top H^{-1} g_L$, and $N = g_L^\top H^{-1} g_L$. Suppose that the original problem is feasible and $\lambda^*$ and $\beta^*$ are the solutions to (16); then the optimal solution to the primal problem (15) is given by
$$\theta_{k+1} = \theta_k + \frac{1}{\lambda^*} H^{-1}\left(g_Q - \beta^* g_L\right). \tag{17}$$
If the optimization problem (15) is not feasible, then a recovery update step is needed. For safety-constrained tasks, it is important that the policy $\pi$ recovers to the set of safe policies as soon as possible, while (14) still needs to be satisfied as the basis of the local approximate optimization. The recovery step is equivalent to solving the following optimization problem,
$$\min_\theta\; g_L^\top(\theta - \theta_k) \quad \text{s.t.}\quad \tfrac{1}{2}(\theta - \theta_k)^\top H(\theta - \theta_k) \leq \delta. \tag{18}$$
The optimal solution to the above recovery optimization problem is
$$\theta_{k+1} = \theta_k - \sqrt{\frac{2\delta}{g_L^\top H^{-1} g_L}}\, H^{-1} g_L. \tag{19}$$
In LCPO, the Lyapunov function L(s) is also a DNN parameterized by φ. The Lyapunov critic function L_c is not needed, since LCPO does not involve actor-critic updates. Meanwhile, the Lyapunov function candidates are still valid, following a similar approximation procedure as in LSAC. The pseudocode of LCPO can be found in Algorithm 2.
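The recovery update of the kind in (19) steps along the natural gradient of the safety constraint until the trust-region radius is exhausted. A NumPy sketch (the gradient, Hessian, and radius below are toy placeholders, not values from the paper):

```python
import numpy as np

def recovery_step(theta, g_L, H, delta):
    """Step that maximally decreases the linearized safety constraint while
    keeping the quadratic KL trust-region constraint
    (1/2) d^T H d <= delta active at the boundary."""
    Hinv_gL = np.linalg.solve(H, g_L)          # natural-gradient direction
    scale = np.sqrt(2.0 * delta / (g_L @ Hinv_gL))
    return theta - scale * Hinv_gL
```

With `H` the identity, the step reduces to moving a distance `sqrt(2*delta)` against `g_L`, and the quadratic trust-region constraint holds with equality.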

Algorithm 2 Lyapunov-based Constrained Policy Optimization (LCPO)
for i = 1, 2, . . . do
    Sample s_1 according to ρ
    for each time step do
        Sample a_t from π(a_t|s_t) and step forward
        Observe and store (s_t, a_t, r_t, c_t, s_{t+1}) in D
    end for
    Record the largest instant N at which s_N ∈ ∆
    Store all tuples (s_t, a_t, r_t, c_t, s_{t+1}), t ≤ N in D_∆
    Estimate g_Q, g_L, h, H with D and D_∆
    if (15) is feasible then
        Calculate the optimal λ* and β*
        Calculate the proposal θ* with (17)
    else
        Calculate θ* with (19)
    end if
    Update θ_{k+1} by backtracking line search to satisfy the sample estimate of (13)
    Clear D and D_∆
end for

Remark 4 The comparison between LCPO and LSAC in terms of data efficiency is worth making. As shown in the pseudo-code of Algorithm 1, LSAC performs multiple update steps after a trajectory has been sampled; it is even possible to update at every step after observing a new state-action pair, though this is not adopted in LSAC. In comparison, LCPO performs only one update per iteration of observing a trajectory. This is due to the nature of on-policy algorithms: after one update of θ, the collected data become off-policy data, i.e., data generated by a different controller, and cannot be used by on-policy algorithms anymore. Thus, at the end of each iteration, LCPO needs to empty the sets of transition pairs D and D_∆. On the contrary, LSAC repeatedly makes use of the data collected by different controllers. As a result, LSAC possesses better data efficiency than LCPO.
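The backtracking line search step of Algorithm 2 can be sketched as follows. This is a minimal stand-alone sketch; `constraint_estimate` is a hypothetical stand-in for the sample estimate of the safety constraint (13):

```python
import numpy as np

def backtracking_linesearch(theta_k, direction, constraint_estimate,
                            shrink=0.5, max_backtracks=10):
    """Shrink the proposed step until the sampled safety constraint holds.

    constraint_estimate(theta) returns the sample-based estimate of the
    safety constraint value; a candidate is accepted when it is <= 0.
    """
    step = 1.0
    for _ in range(max_backtracks):
        theta_new = theta_k + step * direction
        if constraint_estimate(theta_new) <= 0.0:
            return theta_new
        step *= shrink
    return theta_k  # reject the update if no safe step size was found

# Toy usage: the "constraint" is satisfied only for small parameter norms.
theta = np.array([0.1, 0.1])
proposal = np.array([2.0, 2.0])
result = backtracking_linesearch(theta, proposal,
                                 lambda th: np.linalg.norm(th) - 0.5)
```

Rejecting the update (returning θ_k) when no safe step exists mirrors the conservative behavior needed to keep the post-update policy inside the trust region.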
Remark 5 Another distinguishing difference between LCPO and LSAC is the safety guarantee during training, which is a key distinction between off- and on-policy algorithms. LCPO is implemented based on trust-region optimization, solving an approximated optimization problem locally and assuring that every post-update policy is safe. If the initial policy is safe, safety will be ensured throughout learning; otherwise, LCPO will first try to find a safe policy using the recovery update in Eq. (19). In comparison, LSAC does not hold this property and can only assure safety at the end of training. From a practical point of view, LSAC is suitable for training a safe policy in a virtual environment that is then deployed to a real system, while LCPO is directly applicable to online training. Furthermore, a potential choice is to combine the strengths of LSAC and LCPO: use LSAC to learn an initial policy in the virtual environment, then transfer to LCPO for further online training.

Further Discussion
Before proceeding to show the effectiveness of LSAC and LCPO in various environments, we would like to further discuss the possible adverse effects of the sample-based approximation and the limitations of the UUB analysis.
First, in both LSAC and LCPO, the safety constraint is evaluated using a sample estimate of the inequality (3), which unavoidably introduces estimation error unless the number of samples is infinite. Therefore, a possible research direction is to establish a quantitative relation between the reliability of the safety guarantee and the number of samples. Furthermore, stronger safety constraints can be introduced by considering the estimation error.
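As one simple illustration of such a quantitative relation (our own back-of-the-envelope calculation, not an analysis from this paper), Hoeffding's inequality gives the number of i.i.d. samples needed so that the sample mean of a bounded constraint value stays within ε of its expectation with probability at least 1 − α:

```python
import math

def hoeffding_sample_size(eps, alpha, value_range=1.0):
    """Samples needed so that |sample mean - true mean| <= eps with
    probability at least 1 - alpha, for values in an interval of width
    value_range, by Hoeffding's inequality: n >= R^2 ln(2/alpha) / (2 eps^2)."""
    return math.ceil(value_range**2 * math.log(2.0 / alpha) / (2.0 * eps**2))

# e.g. constraint values in [0, 1], tolerance 0.05, confidence 95%
n = hoeffding_sample_size(eps=0.05, alpha=0.05)
```

The quadratic dependence on 1/ε suggests that tightening the reliability of a sample-based safety check quickly becomes expensive, which motivates the research direction mentioned above.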
Second, as both LSAC and LCPO rely on sample-based gradient approximation in controller training, the learning algorithms may temporarily take update steps in undesirable directions due to approximation errors (such as reducing the return or violating the safety constraints). Nevertheless, this effect can be mitigated by choosing reasonable hyperparameters, such as the learning rate and batch size, so that undesirable updates do not affect the convergence of the learning process.
Finally, different from classical model-based controller design methods, the proposed method does not need a dynamic model to design the controller. Instead, it is model-free, meaning that only data from trial and error are used to learn the controller until a satisfactory one is found. This process is undoubtedly time-consuming and hardly applicable directly to a real-world system. Thus, in practice, it is favorable to train the controller virtually first, then transfer and fine-tune it in the real world [47][48][49].

Experiments
In our experiments, we would like to address the following questions:
• How does our approach perform compared with other existing safe RL algorithms for CMDP tasks?
• Can the algorithms converge with different parameter initializations and learn safe control policies in the presence of function approximation error?
• Does the policy with UUB guarantee ensure that the system recovers to the inner set under perturbation and disturbance?
• Under what circumstances may the Lyapunov-based safe RL algorithms fail?

Experiment Setups
In the following, five CMDP tasks are set up, ranging from simulated robot locomotion in the MuJoCo simulator [50] to motion planning of a simulated quadrotor: (i) Cartpole-Safe: the agent is rewarded for sustaining the pole vertically at a target position while limited to a safe region; (ii & iii) HalfCheetah-Safe [21] and Ant-Safe: the agent is rewarded for running while the speed is limited for safety; (iv) Point-Circle [19]: the agent is rewarded for following a circular trajectory while limited to a safe region |x| < x̄; (v) Quadrotor-Safe: the agent is rewarded for tracking a spiral trajectory while constrained to stay under a certain altitude. The details of the control tasks are described in the following.

CartPole-Safe
In this experiment, the agent is expected to sustain the pole vertically at a target position x = 6. This is a modified version of CartPole with a continuous action space [51]. The action is the horizontal force F on the cart, F ∈ [−20, 20]. While tracking the target position, the safety constraint x < 4 needs to be ensured. The state space of x is [0, 10], and an episode ends if the cart moves outside this region. The agent is initialized randomly with x ∈ [0, 4]. The reward function is given by r = 20 × sign(1 − |x − 6|) × (1 − |x − 6|)² + sign((π/2 − |θ|)/(π/2)) × ((π/2 − |θ|)/(π/2))². The constraint function is given by c = max(|x| − 3.2, 0)²/5. The episodes are set at a length of 250.
It should be noted that the target position is outside the safe region. As a result, the optimal safe behavior is to sustain the pole vertically at the edge of the safe region. This setup is deliberate in order to test whether safety can outweigh the return under the control of the trained agents.
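The reward and constraint functions above can be transcribed directly; the following is a sketch of our reading of the formulas:

```python
import math

def _sign(v):
    """Sign function with sign(0) = 0."""
    return math.copysign(1.0, v) if v != 0 else 0.0

def cartpole_reward(x, theta):
    """Reward peaks when the cart is at the target x = 6 with the pole upright."""
    pos = 1.0 - abs(x - 6.0)                       # position term, max 1 at x = 6
    ang = (math.pi / 2 - abs(theta)) / (math.pi / 2)  # angle term, max 1 when upright
    return 20.0 * _sign(pos) * pos**2 + _sign(ang) * ang**2

def cartpole_cost(x):
    """Safety cost grows quadratically once |x| exceeds 3.2."""
    return max(abs(x) - 3.2, 0.0)**2 / 5.0
```

Note that the cost starts accumulating at |x| = 3.2, inside the hard safety bound x < 4, so the learner is penalized before the constraint is actually violated.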

Point-Circle
This task is borrowed from [19], where the agent controls a mass point to run in a wide circle while staying in a safe region. The agent is initialized at (0, 0) and is rewarded for moving the mass point counter-clockwise along a circle of radius 15 m, r = (y·v_x + x·v_y)/(1 + |√(x² + y²) − 15|). We constrain the magnitude of the x-axis position to be less than 2.4, with constraint function c = max(|x| − 2.4, 0). The episodes are set at a length of 65.
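A direct transcription of the Point-Circle reward and constraint (the reward formula here is our reconstruction of the paper's garbled expression, so treat the signs as an assumption):

```python
import math

def point_circle_reward(x, y, vx, vy):
    """Rewards motion along the circle of radius 15, as reconstructed
    from the paper's formula."""
    return (y * vx + x * vy) / (1.0 + abs(math.hypot(x, y) - 15.0))

def point_circle_cost(x):
    """Penalizes leaving the safe slab |x| < 2.4 (linearly, unlike the
    quadratic costs of the locomotion tasks)."""
    return max(abs(x) - 2.4, 0.0)
```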

HalfCheetah-Safe
HalfCheetah-Safe is taken from [21]. The task is to control a HalfCheetah (a 2-legged simulated robot with 17 degrees of freedom) to run as fast as possible with the maximum velocity limited for safety. The reward function is r = v − 0.1 × ‖a‖², where v is the forward speed of the HalfCheetah and a is the control input. The constraint function is c = max(|v| − 2.7, 0)². The episodes are set at a length of 200.

Ant-Safe
In Ant-Safe, the agent controls an Ant (a quadrupedal simulated robot) to run as fast as possible while satisfying the safety constraint on the forward speed, v < 2.7. The reward function is r = v − 0.5 × ‖a‖² − cost_contact + 1, where cost_contact equals 1 if the robot touches the ground and 0 otherwise. The constraint function is c = max(|v| − 2.7, 0)². The reward function and other settings are otherwise the same as the defaults in OpenAI Gym. The episodes are set at a length of 200.

Quadrotor-Safe
This task is to control a drone to track a spiral trajectory in 3D space. In the meantime, the altitude is limited to be under a threshold to avoid hitting the ceiling. The simulation environment is modified from an open-source Crazyflie simulator in Matlab (https://github.com/yrlu/quadrotor) into OpenAI Gym's structure. The simulator has three parts: the Crazyflie model parameters, the PD controller, and the dynamic equations. The control proceeds as follows: the quadrotor simulator outputs the next state of the quadrotor given the force, torques, and the current state; the controller maps the observation of the current state to a desired step, and the desired state equals the current state plus the desired step; last, the PD controller converts the current state and desired state to force and torques, and the loop continues. The state includes the quadrotor's position, velocities, attitude, angular velocities and reference trajectory, s_t = [x, y, z, ẋ, ẏ, ż, p, q, r, ṗ, q̇, ṙ, x_target, y_target, z_target]. The policy outputs desired relative changes in position and velocity from the current state, a_t = [Δx, Δy, Δz, Δẋ, Δẏ, Δż]. An episode ends when the current position is too far from the reference trajectory. For this task, the reward is r = −‖d‖ + 1, where d is the distance to the reference trajectory. The constraint function is c = 100 × max(|z| − 0.4, 0). For this experiment, the episodes are set at a length of 2000.

[Figure: the five environments: Ant-Safe, HalfCheetah-Safe, Point-Circle, Quadrotor-Safe, Cartpole-Safe.]
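The simulator-controller loop described above can be sketched as follows. All components here are toy stand-ins (a constant policy, a scalar PD law, and a double-integrator in place of the Crazyflie dynamics), so the structure, not the physics, is what this illustrates:

```python
import numpy as np

def policy(state):
    """Toy policy: desired relative change in position and velocity (a_t)."""
    return 0.1 * np.ones(6)

def pd_controller(state, desired_state, kp=1.0, kd=0.5):
    """Toy PD law mapping (current, desired) to a generalized force vector."""
    err = desired_state - state
    return kp * err[:3] + kd * err[3:6]  # position error + velocity error

def dynamics(state, force, dt=0.02):
    """Toy double-integrator stand-in for the quadrotor dynamic equations."""
    pos, vel = state[:3], state[3:6]
    vel = vel + dt * force
    pos = pos + dt * vel
    return np.concatenate([pos, vel])

# One step of the loop: policy -> desired state -> PD controller -> dynamics.
state = np.zeros(6)            # [x, y, z, xdot, ydot, zdot] (attitude omitted)
desired = state + policy(state)
force = pd_controller(state, desired)
state = dynamics(state, force)
```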

Evaluation and Comparison Analysis
In this part, the performance of LSAC on the CMDP tasks is evaluated and compared with an existing safe RL algorithm, safe soft actor-critic (SSAC) [21], with optimized hyperparameters. SSAC is a safety-constrained variant of the original algorithm obtained through the Lagrange relaxation procedure [52]. The soft actor-critic (SAC) [40] is also included to show that the optimal behaviors are generally unsafe in the CMDP setting. In the meantime, the performance of LCPO is compared with constrained policy optimization (CPO) [19], a state-of-the-art trust-region method for CMDP tasks. Since the trust-region methods (LCPO, CPO) require large batch sizes and need more samples than the gradient-based methods to reach convergence, these two classes of methods are compared separately. The sum of the constraint function over an episode (safety cost) and the count of constraint violations are used as the measures of safety, and the return is used to evaluate performance. The goal is to suppress these measures to zero while maximizing the return, i.e., the expectation of the constraint function at each time step and the expected number of constraint violations are required to be zero.
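The two safety measures used above can be computed from an episode's constraint values as follows (a simple sketch):

```python
def safety_metrics(costs):
    """Episode safety cost (sum of constraint values) and violation count."""
    safety_cost = sum(costs)
    violations = sum(1 for c in costs if c > 0)
    return safety_cost, violations

# e.g. an episode of four steps with two constraint violations
cost_trace = [0.0, 0.0, 0.5, 1.2]
sc, nv = safety_metrics(cost_trace)  # sc = 1.7, nv = 2
```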

Algorithm hyperparameters
The hyperparameters for LSAC and LCPO can be found in the Appendix.
For LSAC, there are three networks: the policy network, the Q network, and the Lyapunov network. For the policy network, we use a fully-connected MLP with two hidden layers of 256 units, outputting the mean and standard deviation of a Gaussian distribution. For the Q network and the Lyapunov network, we use fully-connected MLPs with two hidden layers of 256 units, outputting the Q value and the Lyapunov value, respectively. All the hidden layers use the ReLU activation function, and we adopt the same invertible squashing function technique as [40] at the output layer of the policy network. The hyperparameters can be found in Fig. 8.
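The policy network structure described above (a Gaussian whose mean and standard deviation come from a two-hidden-layer ReLU MLP, squashed to a bounded action by an invertible tanh, as in [40]) can be sketched in plain numpy. The weights here are random placeholders, and the dimensions are examples, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
S, HID, A = 4, 256, 2  # state dim, hidden width, action dim (examples)

# Random placeholder weights for a 2-hidden-layer ReLU MLP (biases omitted).
W1 = rng.normal(0, 0.1, (S, HID))
W2 = rng.normal(0, 0.1, (HID, HID))
W_mu = rng.normal(0, 0.1, (HID, A))
W_logstd = rng.normal(0, 0.1, (HID, A))

def sample_action(s):
    h = np.maximum(s @ W1, 0.0)                    # ReLU hidden layer 1
    h = np.maximum(h @ W2, 0.0)                    # ReLU hidden layer 2
    mu, log_std = h @ W_mu, np.clip(h @ W_logstd, -20, 2)
    u = mu + np.exp(log_std) * rng.normal(size=A)  # reparameterized Gaussian sample
    return np.tanh(u)                              # invertible squashing to (-1, 1)

a = sample_action(rng.normal(size=S))
```

The tanh squashing keeps every sampled action inside the bounded action space while remaining invertible, which is what allows the log-likelihood to be corrected by a change of variables in [40].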

For LCPO, there are three networks: the policy network, the value network, and the Lyapunov network. The policy network has two hidden layers of sizes (64, 32) with tanh activation functions, outputting the mean and standard deviation of a Gaussian distribution. The value network has two hidden layers of sizes (256, 128) with ReLU activation functions, outputting the value. The Lyapunov network has two hidden layers of sizes (256, 256) with ReLU activation functions, outputting the Lyapunov value. The hyperparameters can be found in Fig. 9.
The implementation of all the algorithms is based on TensorFlow [53].

Comparison with SSAC
As demonstrated in Fig. 4, though SSAC can maintain state constraint satisfaction on some of the tasks despite some major violations during training (see HalfCheetah-Safe and Ant-Safe), on the other tasks it fails to find a safe policy either at convergence or during training. On the other hand, our method (LSAC) quickly converges to safe policies across all the tasks (Fig. 4 (a)-(e)) while maintaining a reasonable return. Besides, LSAC maintains low safety costs (almost zero) and few constraint violations throughout training, with low variance in all the tasks, even though all the policies are randomly initialized.

Comparison with CPO
CPO and LCPO are compared on the same safe environments as used for the off-policy methods. As shown in Fig. 5, LCPO performs stably in terms of both safety cost and constraint violation counts in all of the environments, approximately achieving zero constraint violations during training and at convergence. In terms of return, LCPO performs comparably with CPO on Point-Circle and Quadrotor-Safe. In the other tasks CPO performs slightly better than LCPO, however at the cost of violating the safety constraints considerably, as shown in Fig. 5 (c). In CartPole and Point-Circle, CPO is unsafe both during training and at convergence, while in the other experiments it may violate constraints during training. As discussed in [54], the discounted-sum constrained methods suffer from tricky tuning of the safety threshold when handling state-constrained problems. To show this, we tested different threshold settings for CPO on HalfCheetah and Ant. In the HalfCheetah-Safe task, CPO's behavior swings under different settings of the safety threshold: it either fails to find a feasible policy or reaches convergence after being unsafe for a long period (more than 500,000 steps). One more thing to note is that CPO requires additional cost shaping, i.e., a network evaluating the chance of constraint violation, to achieve its best performance, while our approach does not need such techniques.

Recoverability to Inner Set
As shown in Fig. 1 and Fig. 2, UUB assures that the system can recover to the inner set when it is wrongly initialized or accidentally disturbed into the edge set. Here we evaluate the recoverability of the agents trained by LSAC and the baselines when perturbed by an unseen exogenous disturbance in CartPole-Safe and Point-Circle. Impulsive forces are applied to the robot to push it outside the inner set, and we observe whether it can recover to the inner set. The performance is measured by the recovery rate, i.e., the probability of successful recovery after impulsive disturbances. Under different impulse magnitudes, the policies trained by LSAC and the baselines are evaluated 500 times, and the results are shown in Fig. 6.
As observed in the figure, LSAC achieves the best performance in terms of recoverability when perturbed by forces of different magnitudes, while SSAC is less likely to recover under the same circumstances. Note that the agent cannot recover from arbitrarily large disturbances, since only local UUB is assured.

Ablation on Constraints
We want to test how the proposed algorithm trades off between performance and safety. In CartPole, the safety constraint is contradictory to the performance, i.e., being safer decreases the return. Thus, we gradually strengthen the constraint and observe how LSAC reacts and when it fails to find a feasible policy. Specifically, the size of the inner set {x | x ∈ [0, x̄]} is reduced by assigning x̄ the values {0, 1, 2, 3, 4}.
The performance of LSAC in CartPole-Safe with different sizes of the inner set is compared in Fig. 7. As x̄ approaches zero, the average return of LSAC decreases while safety is maintained. However, when x̄ = 0 and only the origin is safe, the agent fails to sustain the pole and dies almost immediately. This implies that LSAC may fail when the safety constraints are too strong.

Conclusion
In this paper, a novel data-based approach for analyzing the uniformly ultimate boundedness stability of a learning control system is proposed. Based on the theoretical results, two model-free reinforcement learning algorithms are developed, namely Lyapunov safe actor-critic (LSAC) and Lyapunov constrained policy optimization (LCPO). The proposed algorithms are evaluated on a series of robotic continuous control tasks with safety constraints. In comparison with the existing RL algorithms, the proposed method can reliably assure safety in various challenging continuous control tasks. As a qualitative evaluation of stability, our method shows impressive resilience even in the presence of external disturbances.
For future work, we would like to explore the following directions: i) automating the design of the constraint function; ii) extending the Lyapunov-based approach to the model-based setting; iii) optimizing the performance upper bound while assuring the stability guarantee.