Adaptive neural network optimized control using reinforcement learning of critic-actor architecture for a class of non-affine nonlinear systems

In this article, an optimized tracking control using a critic-actor reinforcement learning (RL) strategy is investigated for a class of non-affine nonlinear continuous-time systems. Since the control input of a non-affine system enters the dynamic equation implicitly, it is a more general modeling form than the affine case, which also makes the optimized control more challenging and rewarding. However, most existing RL-based optimal controllers are algorithmically complex because their actor and critic training laws are obtained by implementing gradient descent on the square of the Bellman residual error, which is equivalent to an approximation of the Hamilton-Jacobi-Bellman (HJB) equation; hence these methods are difficult to extend to non-affine systems. In this optimized control, the RL algorithm is produced by implementing gradient descent on a simple positive-definite function, which is derived from the partial derivative of the HJB equation. As a result, the proposed control algorithm is significantly simpler, which alleviates the computational burden. Finally, a typical numerical simulation is carried out, and the results further confirm the effectiveness of the proposed control scheme.


I. INTRODUCTION
In the control community, non-affine system control has always played an important role because many practical engineering systems must be modeled in a non-affine dynamic form. Unlike an affine system, whose control input appears explicitly, a non-affine system's control input is implicit, so the notions of control gain and control direction are unavailable. As a result, studying non-affine control systems is challenging and rewarding [1]. Several mathematical tools are available to help find an equivalent affine system, such as the mean value theorem, the implicit function theorem, and Taylor series expansion, which can be used to convert the system into an affine form [2]-[4]. In particular, in [4], the adaptive NN controller is derived by first transforming the non-affine single-input single-output (SISO) nonlinear system into an affine form via Taylor series expansion.
In recent decades, optimal control has been a hot and attractive academic topic in the control community, especially the optimal control of nonlinear systems [5]-[8]. In general, the nonlinear optimal control problem requires solving a nonlinear partial differential equation known as the Hamilton-Jacobi-Bellman (HJB) equation [9]. Due to the inherent nonlinearity of the equation, its analytical solution is difficult or even impossible to obtain. To address this challenge, Bellman proposed the famous dynamic programming (DP) method [10]. However, this technique has an inevitable disadvantage: the amount of computation increases exponentially with the system dimension, which makes practical application difficult. In order to address this difficulty of the DP method so that nonlinear optimal control can be effectively achieved, Werbos developed an adaptive algorithm by taking advantage of NN approximators, which is called approximate/adaptive dynamic programming (ADP) [11]. Up to now, the ADP method has received increasing attention and spawned many other schemes, such as adaptive critic design [12], [13], neural dynamic programming [14] and the like. Additionally, in [15], Liu et al. proposed an iterative ADP strategy to address the infinite-horizon optimal control problem of nonlinear systems. In [16], a complex ADP approach was developed to solve the infinite-horizon optimal problem of complex-valued nonlinear systems.
In fact, ADP can be regarded as a class of reinforcement learning (RL) [17]. RL is a machine learning strategy that modifies agent behavior based on the response from the environment [18]. A common structure of RL is the critic-actor architecture, in which the critic evaluates the control performance according to the interaction with the environment and returns feedback for the actor, while the actor executes the continuously improving control operations. Since RL enables an agent to learn autonomously from its own experience [19]-[21], it is a universal strategy in nonlinear optimal control [22]-[24]. In [22], to solve the infinite-horizon optimal control problem, Yang et al. developed an adaptive optimal control strategy by using RL with an identifier-critic architecture. In [23], by using NN-approximator-based RL, Wen et al. proposed a decentralized optimized formation control for a class of nonlinear multi-agent systems, and a significant breakthrough in that work is that two common requirements, known dynamics and persistence of excitation, are removed. In [24], the RL optimized control was extended to stochastic nonlinear systems.
Because neural networks (NNs) and fuzzy logic systems (FLSs) are effective approximators [25]-[27], several adaptive nonlinear approaches based on FLS or NN have been proposed in recent years [28]-[32]. By using NN to estimate the solution of the HJB equation, RL-based optimal control of nonlinear systems was further developed, and many outstanding achievements have been made recently [33]-[35]. In [33], [34], for the optimal control of nonlinear strict-feedback systems, a new technique, optimized backstepping (OB), was proposed for the first time. Its basic idea is to design the actual control and all virtual controls as the optimal solutions of the corresponding backstepping steps, so that the overall system control can be optimized. In [35], the OB technique was applied to surface vessel control. However, the above optimized control methods require complete system knowledge in the RL training. In fact, many systems have unknown dynamics. To solve this problem, many notable approaches have been presented, such as [23], [36], [37]. In [36], an observer-based optimal control scheme was developed, in which unknown dynamics are compensated by an adaptive observer. In [23], [37], an optimal formation control of nonlinear multi-agent systems was addressed, where the identifier technique was employed to overcome the difficulty of unknown dynamics. Inspired by the above discussions, an optimized control using the RL strategy is presented for a class of non-affine continuous-time nonlinear systems in this article. The primary contributions of this work can be summarized as follows.
1) The optimized control approach is developed for a class of non-affine nonlinear systems, which is a significant extension in the optimal control area.
2) The optimized control is significantly simpler than the existing methods, so that it can be readily implemented in engineering.
3) The optimized control is easy to implement and apply, because it relaxes the persistence-of-excitation condition required by most existing optimal control methods.

II. PROBLEM FORMULATION
Consider the following non-affine nonlinear continuous-time system, which is stabilizable [33]:

ẋ(t) = F(u(t), x(t)), (1)

where x(t) ∈ R^m and u ∈ R^m are, respectively, the system state and the control input, and F(u, x) ∈ R^m with F(0, 0) = 0_m is an unknown nonlinear vector-valued function. The term F(u, x) is assumed to be Lipschitz continuous on a set Ω containing the origin, so that the solution of system (1) is unique for any control u and bounded initial value x(0). Since the control u is implicitly contained in the dynamic function F(u, x), the control cannot be constructed directly from system (1). To overcome this difficulty, a Taylor series expansion is applied so that the relation between the control and the dynamics becomes explicit [38]:

F(u, x) = F(u_0, x) + [∂F(u, x)/∂u]|_{u=u_0} (u − u_0) + Δ, (2)

where Δ ∈ R^m denotes the higher-order remainder term, which is bounded by a constant µ as 0 ≤ ∥Δ∥ ≤ µ, and u_0(x) ∈ R^m is an unknown smooth function. Furthermore, by choosing u_0 = 0, equation (2) is expressed as

F(u, x) = f(x) + g(x)u + Δ, (3)

where f(x) = F(0, x) and g(x) = [∂F(u, x)/∂u]|_{u=0}. Inserting (3) into the system dynamics (1) yields

ẋ(t) = f(x) + g(x)u + Δ. (4)

Assumption 1 ([38], [39]): The matrix g(x) in system (4) is non-singular and bounded, i.e., it is an invertible matrix and there exist two constants ξ̄ > ξ > 0 such that ξ̄ > ∥g(x)∥ > ξ. This implies that g(x) is either strictly positive or strictly negative. Without loss of generality, we assume ξ̄ > g(x) > ξ.
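As a concrete illustration of the expansion above (a hypothetical scalar example, not taken from the paper), consider F(u, x) = x + u + 0.1 sin u. With u_0 = 0,

```latex
f(x) = F(0,x) = x, \qquad
g(x) = \left.\frac{\partial F(u,x)}{\partial u}\right|_{u=0} = 1 + 0.1\cos 0 = 1.1, \qquad
\Delta = F(u,x) - f(x) - g(x)u = 0.1(\sin u - u),
```

and on any compact operating set of u the remainder satisfies ∥Δ∥ ≤ µ for a finite µ, as required.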
The desired tracking trajectory is denoted by y(t) ∈ R^m; define the tracking error as z(t) = x(t) − y(t). From (4), we obtain the following error dynamics:

ż(t) = f(x) + g(x)u + Δ − ẏ(t). (5)

Assumption 2 ([3], [40]): The reference tracking trajectory y(t) and its derivative ẏ(t) are assumed to be bounded.

Let us introduce the performance index as follows:

J = ∫_t^∞ r(z(τ), u(τ)) dτ, (6)

where r(z, u) = z^T(t)z(t) + u^T u is the local cost function.

Definition 1 ([9]): A control policy u associated with (1) is admissible on Ω, denoted by u ∈ Φ(Ω), if u is continuous, u(0) = 0, u stabilizes (1), and u makes the performance index (6) finite on Ω.
Optimal Control: An admissible control u ∈ Φ(Ω) for the system (1) is said to be optimal if it minimizes the performance index (6). According to (6), define the performance index function as

J(z(t)) = ∫_t^∞ r(z(τ), u(τ)) dτ. (7)

Denoting the optimal control by u*, the optimal value function is generated as

J*(z(t)) = min_{u∈Φ(Ω)} ∫_t^∞ r(z(τ), u(τ)) dτ. (8)

Taking the time derivative on both sides of the optimal value function (8), the HJB equation is obtained as follows:

H(z, u*, J*_z) = J*_z^T(z)(f(x) + g(x)u* + Δ − ẏ) + z^T z + u*^T u* = 0, (9)

where J*_z(z) = dJ*(z)/dz ∈ R^m. Assuming the solution of (9) exists and is unique, the optimal control u* can be obtained by solving the equation ∂H(z, u*, J*_z)/∂u* = 0, which yields

u* = −(1/2) g^T(x) J*_z(z). (10)

Define a function K*(z, x) as

K*(z, x) = g^T(x) J*_z(z); (11)

then the optimal control described in (10) can be rewritten as

u* = −(1/2) K*(z, x). (12)

Substituting (12) into (9), we get

J*_z^T(z)(f(x) + Δ − ẏ) + z^T z − (1/4) K*^T(z, x) K*(z, x) = 0. (13)

Since the optimal control (10) contains the uncertain term J*_z(z), it is unavailable for the non-affine system (1). To derive an available optimal control, the gradient term J*_z(z) should be obtained by solving the HJB equation (13). However, solving this equation is difficult or even impossible because of its strong nonlinearity. To overcome this difficulty, the critic-actor RL algorithm based on NN approximation is usually considered.
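The stationarity step that produces the optimal control expression can be spelled out explicitly; this reconstruction uses only the local cost r(z, u) = z^T z + u^T u and the HJB Hamiltonian defined above:

```latex
\frac{\partial H}{\partial u^{*}}
= \frac{\partial}{\partial u^{*}}\Big(J^{*T}_{z}(z)\big(f(x)+g(x)u^{*}+\Delta-\dot y\big)+z^{T}z+u^{*T}u^{*}\Big)
= g^{T}(x)J^{*}_{z}(z)+2u^{*}=0
\;\Longrightarrow\;
u^{*}=-\tfrac{1}{2}\,g^{T}(x)J^{*}_{z}(z).
```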

A. REINFORCEMENT LEARNING DESIGN
To construct the critic-actor architecture RL, rewrite the term K*(z, x) as

K*(z, x) = 2kz + K_0(z, x), (14)

where k > 0 is a design parameter and K_0(z, x) = K*(z, x) − 2kz. Substituting (14) into (10), the optimal control becomes

u* = −kz − (1/2) K_0(z, x). (15)

In the adaptive control field, NN has become a popular tool for solving the unknown-dynamics problem because of its universal function approximation ability: it can approximate a continuous function to any desired accuracy over a compact set (for a detailed introduction, see [26], [31]). Since the term K_0(z, x) is unknown but continuous, an NN can approximate it over a compact set Ω_K in the following form:

K_0(z, x) = ω*_K^T Π_K(z, x) + ε_K, (16)

where ω*_K ∈ R^{n×m} is the ideal NN weight, Π_K(z, x) ∈ R^n is the basis function vector, and ε_K ∈ R^m is the NN approximation error satisfying ∥ε_K∥ ≤ τ, where τ is a constant.
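The basis function vector Π_K can be illustrated with a Gaussian radial-basis sketch in Python; the 12 nodes and the centers equally spaced on [−6, 6] follow the simulation section, while the Gaussian form and the width value are assumptions for illustration:

```python
import numpy as np

def rbf_basis(v, centers, width=2.0):
    """Gaussian radial basis function vector Pi(v) for an input v.

    Each entry is exp(-||v - eta_i||^2 / width^2); the width value is an
    assumption, the centers eta_i follow the simulation setup.
    """
    v = np.asarray(v, dtype=float)
    return np.exp(-np.sum((centers - v) ** 2, axis=1) / width**2)

# 12 centers equally spaced on [-6, 6] (two identical coordinates,
# mirroring eta_i in R^2 from the simulation section)
grid = np.linspace(-6.0, 6.0, 12)
centers = np.stack([grid, grid], axis=1)      # shape (12, 2)

phi = rbf_basis([0.0, 0.0], centers)          # Pi_K evaluated at the origin
```

Each basis value lies in (0, 1] and peaks at the center nearest to the input, which is the usual behavior exploited by the NN approximation (16).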
Inserting (16) into (14) and (15), we get

K*(z, x) = 2kz + ω*_K^T Π_K(z, x) + ε_K, (17)

u* = −kz − (1/2)(ω*_K^T Π_K(z, x) + ε_K). (18)

It should be noted that the NN weight ω*_K is an unknown constant weight introduced only for analytical purposes; therefore, the optimal control (18) cannot be directly adopted for system (1). To obtain a valid control, the critic and actor NNs for implementing RL are constructed in accordance with (17) and (18).
The critic NN is designed to evaluate the control performance as

K̂(z, x) = 2kz + ω̂_c^T(t) Π_K(z, x), (19)

where K̂(z, x) is the estimation of K*(z, x) and ω̂_c(t) ∈ R^{n×m} is the critic NN weight matrix.
The tuning law for the critic NN weight is given in (20), where γ_c > 0 is the critic design parameter.

VOLUME 4, 2016
The actor NN is designed to perform the control behavior as

u(t) = −kz − (1/2) ω̂_a^T(t) Π_K(z, x), (21)

where u is the estimation of u* and ω̂_a(t) ∈ R^{n×m} is the actor NN weight matrix. The tuning law for the actor NN weight is given in (22), where γ_a > 0 is the actor design parameter.

Remark 1: The critic and actor updating laws (20) and (22) are analyzed below. Substituting (20) and (22) into (9), the approximated HJB equation is generated as (23). Using the HJB equation (13) and its approximation (23), define the Bellman residual error e(t) as

e(t) = H(z, u, Ĵ_z) − H(z, u*, J*_z) = H(z, u, Ĵ_z). (24)

Based on the previous analysis, the optimized solution u(z) is required to satisfy e(t) = H(z, u, Ĵ_z) → 0. If H(z, u, Ĵ_z) = 0 holds and has a unique solution, then (25) is true. Define a positive function P(t) as in (26); obviously, P(t) = 0 is equivalent to equation (25). The updating laws (20) and (22) are then designed from the following fact.
The time derivative of P(t) computed along (20) and (22) satisfies the inequality (27), which indicates that the RL update laws (20) and (22) can finally achieve P(t) = 0; therefore, (25) can be established.
The main advantages are as follows: 1) in contrast to the existing methods, this optimized control algorithm is much simpler; 2) it can remove the persistent-excitation condition.

Remark 2: In this paper, RL is adopted for optimal control (as shown in Fig. 1); it is an iterative process that synchronously trains both the critic and the actor. Therefore, the challenge of the control design is mainly focused on the derivation of the critic and actor updating laws. In the existing optimal methods, the critic and actor updating laws are designed based on the square of the Bellman residual. Because that equation is a complex nonlinear equation, the complexity of the control design is inevitably increased. In this paper, the RL algorithm is designed based on a simple positive function that is equivalent to the HJB equation; as a result, the complexity of the control design is greatly reduced.
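The core mechanism of Remark 2, gradient-descent training on a simple positive function rather than on the squared Bellman residual, can be sketched numerically. The paper's actual P(t) and the laws (20), (22) are not reproduced here; the quadratic P below and the update forms are stand-in assumptions chosen only to show that negative-gradient updates drive such a positive function to zero:

```python
import numpy as np

# Stand-in positive function (assumed for illustration, not the paper's P):
#   P(wc, wa) = 0.5*(Pi.wc)^2 + 0.5*(Pi.(wa - wc))^2
def P(wc, wa, Pi):
    return 0.5 * (Pi @ wc) ** 2 + 0.5 * (Pi @ (wa - wc)) ** 2

def train(Pi, steps=200, gamma_c=0.1, gamma_a=0.1):
    wc = np.array([1.0, -1.0, 0.5, 0.3])   # critic weights (arbitrary start)
    wa = np.array([0.5, 0.2, -0.4, 1.0])   # actor weights (arbitrary start)
    history = [P(wc, wa, Pi)]
    for _ in range(steps):
        # negative-gradient (gradient-descent) updates of P w.r.t. wc and wa
        grad_c = Pi * (Pi @ wc) - Pi * (Pi @ (wa - wc))
        grad_a = Pi * (Pi @ (wa - wc))
        wc = wc - gamma_c * grad_c
        wa = wa - gamma_a * grad_a
        history.append(P(wc, wa, Pi))
    return np.array(history)

Pi = np.array([0.5, 0.5, 0.5, 0.5])   # fixed basis vector, ||Pi|| = 1
hist = train(Pi)
```

With a fixed basis vector and small learning rates, P decreases monotonically toward zero, mirroring the role played by inequality (27) in the analysis.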

B. STABILITY ANALYSIS
Lemma 1 ([41]): If a positive definite continuous function G(t) ∈ R satisfies Ġ(t) ≤ −pG(t) + q, where p > 0 and q > 0 are two constants, then the following inequality holds:

G(t) ≤ G(0) e^{−pt} + (q/p)(1 − e^{−pt}). (28)

Theorem 1: Consider the non-affine nonlinear system (1) under a bounded initial condition. If the proposed RL optimized tracking control is performed by the critic and actor NNs (19) and (21) with the training laws (20) and (22), and the design constants k, γ_a and γ_c are chosen to satisfy the condition (29), then the proposed optimized approach can guarantee the following objectives: 1) all the error signals are semi-globally uniformly ultimately bounded (SGUUB); 2) the system state x(t) tracks the trajectory y(t) with the desired accuracy.

Proof: Choose a Lyapunov function candidate L(t) as in (30), where ω̃_c(t) = ω̂_c(t) − ω*_K and ω̃_a(t) = ω̂_a(t) − ω*_K are the critic and actor NN weight errors, respectively.
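The bound in Lemma 1 follows from a standard integrating-factor (comparison) argument:

```latex
\frac{d}{dt}\big(e^{pt}G(t)\big)=e^{pt}\big(\dot G(t)+pG(t)\big)\le q\,e^{pt}
\;\Longrightarrow\;
e^{pt}G(t)-G(0)\le\frac{q}{p}\big(e^{pt}-1\big)
\;\Longrightarrow\;
G(t)\le G(0)e^{-pt}+\frac{q}{p}\big(1-e^{-pt}\big),
```

so G(t) is ultimately bounded by q/p; this is the mechanism behind the SGUUB conclusion of Theorem 1.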
Taking the time derivative of L(t) along (5), (20) and (22) yields (31). Substituting (21) into (31) gives (32). By the Cauchy-Schwarz and Young's inequalities, the bounds in (33) follow, and substituting the inequalities (33) into (32) leads to (34). Using the facts ω̃_c(t) = ω̂_c(t) − ω*_K and ω̃_a(t) = ω̂_a(t) − ω*_K, the equations following (34) can be obtained, and applying these results to the inequality (34) yields (40). Finally, applying Lemma 1 to (40) gives (41). According to (41), all error signals are SGUUB, and when the design parameter k is chosen large enough, the tracking error converges to the desired accuracy. □

IV. SIMULATION EXAMPLE
Consider a numerical example of a non-affine nonlinear system. Corresponding to the control protocol (21), the parameter is chosen as β = 18. An NN with 12 nodes is employed for the NN approximation (16). The NN centers η_i ∈ R^2, i = 1, 2, ..., 12, of the basis function vector are equally spaced on the interval from −6 to 6.
Corresponding to the two updating laws (20) and (22), the parameter γ_c = 16 is used for the critic updating and the parameter γ_a = 14 for the actor updating, and the initial weights are written as ω̂_c = [0.2, ...]. The simulation results are shown in Figs. 2-5. Fig. 2 describes the tracking performance, and Fig. 3 shows that the tracking error is convergent; from Figs. 2-3, the system states follow the desired reference trajectory. Fig. 4 shows the boundedness of the critic and actor NN weights, and the cost function is displayed in Fig. 5. The tracking capability of the controller is demonstrated by these simulation results.
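A closed-loop simulation in the spirit of this section can be sketched in Python. The paper's actual example dynamics were not recovered here, so the scalar non-affine system below (ẋ = 0.5x + u + 0.1 sin u), the reference y(t) = sin t, and the omission of the NN compensation term are all illustrative assumptions; only the gain value 18 and the Euler-integration setup echo the text:

```python
import numpy as np

def simulate(k=18.0, dt=1e-3, T=5.0):
    """Euler-integrate a hypothetical scalar non-affine system
    x_dot = 0.5*x + u + 0.1*sin(u) under u = -k*z (the NN term of the
    actor control is frozen at zero), tracking the reference y(t) = sin(t)."""
    x, t = 1.0, 0.0
    x_hist, y_hist = [], []
    n = int(T / dt)
    for _ in range(n):
        z = x - np.sin(t)            # tracking error z = x - y
        u = -k * z                   # -k*z part of the actor control (21)
        x += dt * (0.5 * x + u + 0.1 * np.sin(u))
        t += dt
        x_hist.append(x)
        y_hist.append(np.sin(t))
    return np.array(x_hist), np.array(y_hist)

x_hist, y_hist = simulate()
```

With the gain set to 18, the tracking error settles to a small residual set, consistent with the SGUUB conclusion; a larger gain shrinks the residual further.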
We compare the computation time with the method of reference [9] by using the "tic"/"toc" timing functions of MATLAB. The averaged times are 0.1132 s for the proposed method and 0.3779 s for the method of [9], respectively. It is obvious that the proposed method requires less computational time.

V. CONCLUSION
In this article, an optimized control method is developed for a class of continuous-time non-affine nonlinear systems.
Since the control input of the system is implicit, the system needs to be transformed into an affine-like form to reveal the control. Based on the transformed system, the optimal control is derived by employing NN-based RL. Since the RL updating laws are derived from the negative gradient of a simple positive function, which is designed based on the partial derivative of the HJB equation, the proposed optimized control is significantly simpler than the existing RL optimal methods. Moreover, it can remove the persistence-of-excitation condition. Finally, it is proven that the control objectives are achieved with the desired control performance. The effectiveness of the proposed optimization method is demonstrated by the theoretical proof and the simulation.
The disadvantages of the method mainly involve two aspects: 1) the control scheme is designed for an abstract mathematical model rather than a specific practical dynamic system, hence we will extend this method to practical engineering systems; 2) the optimal control of the non-affine nonlinear system is focused on the first-order case, and we will consider developing the method for second-order non-affine nonlinear systems.

XUE YANG received the B.S. degree from Qilu University of Technology (Shandong Academy of Sciences) in 2015. She is currently pursuing the Master's degree at Qilu University of Technology (Shandong Academy of Sciences). Her main research interests are optimal control and adaptive control.

He is currently an associate professor with the College of Science, Binzhou University, Shandong Province, China. His research interests include adaptive control, optimal control, multi-agent control, nonlinear systems, reinforcement learning, neural networks and fuzzy logic systems.