Online adaptive optimal tracking control for model-free nonlinear systems via a dynamic neural network

This paper presents an online adaptive approximate solution to the optimal tracking control problem for model-free nonlinear systems. First, a dynamic neural network identifier with properly designed weight-updating laws is developed to identify the unknown dynamics. Then an adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The composite Lyapunov method is used to analyse the stability of the closed-loop system. Two simulation examples are presented to demonstrate the effectiveness of the proposed method.


Introduction
The basic idea of classical adaptive control is to update the model parameters and the control law, directly or indirectly, such that the control error is minimized; however, the resulting control is generally not optimal. On the other hand, the main drawback of the classical optimal control approach is that the system dynamics must be precisely known in order to solve the Hamilton-Jacobi-Bellman (HJB) equation in an off-line manner [1]. Hence, by merging knowledge from adaptive control and optimal control, the adaptive optimal control approach has been developed during the past decade; surveys of this research can be found in [2][3][4].
To develop online adaptive optimal control, Werbos [5] introduced the general actor-critic (AC) framework. The critic neural network (NN) approximates the evaluation function, mapping states to an estimated measure of the value function, whereas the actor NN approximates an optimal control law and generates the actions or control signals. Since then, various adaptive optimal control algorithms have been proposed, including model-based methods (heuristic dynamic programming, HDP [6], and dual heuristic programming, DHP [7]) and model-free methods (action-dependent heuristic dynamic programming, ADHDP [8], and Q-learning [9]). However, most previous work on adaptive optimal control has focused on discrete-time systems. Extending this research to continuous-time systems poses challenges in proving stability and convergence and in deriving model-free online updating laws [10].
Discretizing a continuous-time system is generally not accurate, especially for high-dimensional systems, where it hinders the learning process. Hence, online policy-iteration-based algorithms have been proposed to solve the linear [11] and nonlinear [12] continuous-time infinite-horizon optimal control problems, which involve synchronous adaptation of both actor and critic NNs. Furthermore, ref. [10] extended the idea in refs. [11,12] by designing a novel AC-identifier architecture to approximate the HJB equation without knowledge of the system drift dynamics, although knowledge of the input dynamics is still required. The recent research in [13] removes this requirement by using the experience replay technique. Based on ref. [10], a simple identifier-critic structure-based optimal control method is proposed in [14,15], where just a critic NN is used to approximate the solution of the HJB equation and to calculate the optimal control action. In [16], an optimal control method for nonzero-sum differential games of continuous-time nonlinear systems is designed directly from the critic NN instead of the actor-critic dual network, which greatly simplifies the algorithm architecture.
Most existing adaptive optimal control studies focus on regulation problems rather than trajectory tracking problems. Considering both aspects together ensures not only trajectory tracking and stabilization but also satisfaction of a prescribed performance index (such as minimization of the trajectory error, fuel consumption, etc.). In [17], a new data-based iterative optimal learning control scheme is developed to solve a coal gasification optimal tracking control problem in the discrete-time domain. For continuous-time systems, linear quadratic tracking control of partially unknown systems using reinforcement learning is presented in [18], and a nonlinear approximately optimal trajectory tracking method with exact model information is developed in [19]. To relax the requirement of an explicit model, a steady-state control in conjunction with an optimal control for nonlinear continuous-time systems is developed in [20], which stabilizes the error dynamics in an optimal way.
Most of the above-mentioned adaptive optimal control methods are based on affine nonlinear systems. To the best of our knowledge, only [21] addressed the adaptive optimal control of unknown non-affine nonlinear systems in the discrete-time domain, and [22] introduced an adaptive recursive control for the model-based non-affine nonlinear continuous-time system. The optimal control of an unknown non-affine nonlinear continuous-time system is still a challenging task, which motivates this paper.
The main contributions of this paper are listed as follows.
(1) The optimal tracking control of unknown non-affine nonlinear systems based on the critic-identifier architecture is proposed for the first time. The model-free property is achieved by a neuro-identifier in conjunction with novel updating laws for both the weights and the linear-part matrix, which is usually assumed to be a known Hurwitz matrix in conventional black-box nonlinear system identification. (2) An adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The online solution of the optimal control term is obtained directly from a single critic NN that approximates the optimal cost function of the HJB equation, instead of the conventional actor-critic dual network, which greatly reduces complexity and saves computation time. A novel learning law driven by the filtered parameter error is proposed for the critic NN. The stability of the entire closed-loop system is proved by a properly designed composite Lyapunov method.
The organization of the paper is as follows. The problem formulation is given in Section 2. The DNN identifier is designed in Section 3. The optimal control strategy, based on the critic-identifier architecture, is presented in Section 4. Two simulation examples are presented to verify the proposed scheme in Section 5, and the conclusion is drawn in Section 6.

Problem formulation
Consider the following non-affine nonlinear continuous-time system

ẋ(t) = f(x(t), u(t))    (1)

where x(t) = (x_1(t), . . . , x_n(t))^T ∈ R^n is the state vector, u(t) = (u_1(t), . . . , u_m(t))^T ∈ R^m is the control input vector and f(·) is an unknown continuous smooth nonlinear function of x(t) and u(t).
The objective of the optimal tracking control problem is to design an optimal controller for (1) that ensures the state vector x(t) tracks the specified trajectory x_r(t) while minimizing the following infinite-horizon performance cost function:

V(e(t)) = ∫_t^∞ r(e(τ), u_e(e(τ))) dτ    (2)

where the tracking error is defined as e(t) = x(t) − x_r(t) and the utility function, with symmetric positive definite matrices Q and R, is defined as r(e, u_e) = e^T Q e + u_e^T R u_e. From basic optimal control theory, we define the Hamiltonian of (1) as

H(e, u_e, V_e) = r(e, u_e) + V_e^T ė    (3)

where V_e = ∂V/∂e denotes the partial derivative of the cost function V(e(t)) with respect to e(t).
The optimal cost function V*(e(t)) is given as

V*(e(t)) = min_{u_e ∈ ψ(Ω)} ∫_t^∞ r(e(τ), u_e(e(τ))) dτ    (4)

and it satisfies the HJB equation

min_{u_e ∈ ψ(Ω)} H(e, u_e, V_e*) = 0    (5)

where the control u is defined to be admissible for (2) on a compact set Ω ∈ R^n, denoted by u ∈ ψ(Ω). Theoretically, the optimal control for the nonlinear system (1) can be obtained from Equations (4) and (5). In practice, however, it cannot be obtained, for two reasons: (1) the optimal cost function V*(e(t)) has to be obtained by solving the HJB equation (5), yet it is usually difficult to solve this high-order nonlinear partial differential equation (PDE) analytically for general nonlinear systems, and the unknown nonlinear dynamics f(·) make the solution unavailable in any case; (2) the optimal control u*(t) cannot be derived by solving ∂H(e, u*, V*)/∂u* = 0 due to the unavailability of V*(e(t)).
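As a sanity check on (4) and (5), consider a scalar linear-quadratic special case (illustrative values, not taken from the paper): error dynamics ė = ae + bu, cost ∫(qe² + ru²)dτ. Then V*(e) = pe², the HJB equation reduces to the scalar algebraic Riccati equation 2ap − b²p²/r + q = 0, and the stationarity condition ∂H/∂u = 0 gives u* = −(bp/r)e. A short numerical verification that this u* zeroes the HJB residual:

```python
import numpy as np

a, b, q, r = -1.0, 2.0, 1.0, 1.0       # illustrative scalar problem data
# positive root of the scalar algebraic Riccati equation 2*a*p - b^2*p^2/r + q = 0
p = r * (a + np.sqrt(a**2 + q * b**2 / r)) / b**2
for e in np.linspace(-2.0, 2.0, 9):
    u_star = -(b * p / r) * e                               # from dH/du = 0
    Ve = 2.0 * p * e                                        # dV*/de for V*(e) = p*e^2
    residual = q*e**2 + r*u_star**2 + Ve*(a*e + b*u_star)   # HJB residual
    assert abs(residual) < 1e-9                             # HJB equation holds
```

The same two obstacles noted above appear here in miniature: p requires solving the Riccati equation (the scalar stand-in for the HJB PDE), and u* requires a and b, i.e. exact model knowledge.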
In this paper, we develop a critic-identifier to solve the optimal control of an unknown non-affine nonlinear continuous-time system, all the learning processes can be updated online.

Adaptive model-free identifier
We employ the following dynamic neural network (DNN) model to approximate the nonlinear dynamic system (1):

ẋ̂(t) = Âx̂(t) + Ŵ_1 σ(V x̂(t)) + Ŵ_2 φ(V x̂(t)) u(t)    (6)

where x̂(t) is the DNN state, W_1 ∈ R^{n×m} and W_2 ∈ R^{m×n} are the weights of the output layers, V is the weight of the hidden layer, A ∈ R^{n×n} is the matrix of the linear part of the NN, and σ(·), φ(·) are the activation functions. It has been proved in [23] that a DNN of the form ẋ(t) = Ax(t) + Wσ(Vx(t)) can approximate the nonlinear system (1) to any degree of accuracy if the hidden layer V is large enough. Here, to simplify the analysis, we consider the simplest structure (i.e. m = n, V = I, φ(·) = I).
Then the nonlinear system (1) can be modelled by the DNN as follows:

ẋ(t) = A*x(t) + W_1*σ(x(t)) + W_2*u(t) + ξ_1(t)    (7)

where A*, W_1*, W_2* are the nominal unknown matrices, W_1* and W_2* are bounded (with known positive definite symmetric bounding matrices), and ξ_1 is regarded as the modelling error or disturbance and is assumed to be bounded.
Throughout, the weighted norm ‖x‖_D^2 = x^T D x is used, where D = D^T > 0 is a known normalizing matrix.
Then, from (6) and (7), we can obtain the error dynamics (8) of the identification error x̃ = x̂ − x.

Theorem 3.1: Consider the identification scheme (6) for (1). The updating laws (9), where k_1, k_2 and k_3 are positive constants, guarantee the following stability properties: (1) for the precise-identifier case, i.e. ξ_1 = 0, the identification error and the weight estimates remain bounded and the identification error converges asymptotically; (2) for bounded modelling error and disturbances, i.e. ‖ξ_1‖ ≤ ξ̄_1, the identification error dynamics are input-to-state stable and all signals remain bounded.
Proof: Consider the Lyapunov function candidate (11). Differentiating (11) and using (8) yields (12). Using the updating laws (9) and the facts Ȧ̃ = −Ȧ̂ and Ẇ̃_{1,2} = −Ẇ̂_{1,2}, (12) simplifies. We then use the matrix inequality

X^T Y + Y^T X ≤ X^T Λ X + Y^T Λ^{-1} Y

where X, Y ∈ R^{j×k} are any matrices and Λ is any positive definite matrix of compatible dimension. Together with Assumption 3.1, one obtains (14). Substituting (14) then yields (15). Defining R = W̄_1 and Q = D + Q_o, if Q_o is selected so that Q satisfies the conditions in Lemma 3.1, there exists a matrix P satisfying Equation (16), and hence (15) becomes (17).

Case 1: From (17) we get x̃, Ŵ_{1,2}, Â ∈ L_∞, and from the error dynamics (8) we have ẋ̃ ∈ L_∞. Integrating both sides of (17) from 0 to ∞ then shows that the identification error converges asymptotically.

Case 2: For bounded modelling error and disturbances, i.e. ‖ξ_1‖ ≤ ξ̄_1, Equation (16) can be represented as (18). By [24], the identification error dynamics (8) are input-to-state stable, which implies x̃, W̃_{1,2}, Ã ∈ L_∞. This completes the proof of Theorem 3.1.
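The identifier idea can be sketched numerically. The paper's updating law (9) is not reproduced here; as a stand-in under stated assumptions, a series-parallel identifier with identification-error-driven gradient updates and an added observer gain L (our addition, not in the paper) is applied to an illustrative scalar plant of the form (7):

```python
import numpy as np

# "True" unknown scalar plant (stand-in for (7)): x_dot = a*x + w1*tanh(x) + w2*u
a_t, w1_t, w2_t = -1.0, 0.8, 1.5
# Identifier (series-parallel, illustrative -- not the paper's law (9)):
#   xhat_dot = a_h*x + w1_h*tanh(x) + w2_h*u - L*(xhat - x)
a_h = w1_h = w2_h = 0.0
L = 5.0                        # observer gain stabilizing the error dynamics
k1 = k2 = k3 = 20.0            # adaptation gains (stand-ins for k1..k3 in (9))
dt, T = 1e-3, 30.0
x, x_hat = 0.5, 0.0
for i in range(int(T / dt)):
    t = i * dt
    u = np.sin(t) + 0.5 * np.sin(3 * t)    # persistently exciting input
    e = x_hat - x                          # identification error x_tilde
    x_dot = a_t * x + w1_t * np.tanh(x) + w2_t * u
    xh_dot = a_h * x + w1_h * np.tanh(x) + w2_h * u - L * e
    # gradient-type updates driven by the identification error
    a_h -= dt * k1 * e * x
    w1_h -= dt * k2 * e * np.tanh(x)
    w2_h -= dt * k3 * e * u
    x += dt * x_dot
    x_hat += dt * xh_dot
assert abs(x_hat - x) < 0.1    # identifier locks onto the plant trajectory
```

For this update, the Lyapunov function V = e²/2 + θ̃ᵀθ̃/(2k) gives V̇ = −Le², mirroring (in miniature) the composite-Lyapunov argument of Theorem 3.1.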

Optimal control design
In this section, the adaptive optimal control is designed based on the DNN identifier. From Section 3, we know that the nonlinear system (1) can be represented by the DNN with the updating law (9) as (19), where the model error ξ_1 is still assumed to be bounded, ‖ξ_1‖ ≤ ξ̄_1, and x̃ and W̃_{1,2} are bounded as in Theorem 3.1.
Then (19) can be further rewritten as (20), where ξ_2 collects the residual identification terms; for bounded ξ_1 and x̃, ξ_2 is bounded as well, i.e. ‖ξ_2‖ ≤ ξ̄_2. To achieve optimal tracking control, the control action u is designed as u = u_r + u_e, where u_r is the steady-state control, which keeps the tracking error at the steady state, and u_e is the adaptive optimal control, which minimizes the infinite-horizon performance index. Since u_r should compensate for the nonlinear dynamics in (20), let u_r be given by (21), where e = x − x_r denotes the state tracking error, K is the feedback gain and W_2^+ denotes the generalized inverse of W_2.
From (20) and (21), the error dynamics become

ė = −Ke + Ŵ_2 u_e + ξ_2    (22)

In this way, the tracking problem for (20) is transformed into the regulation problem for (22), and the adaptive optimal control u_e is designed to stabilize (22) optimally. Hence, rewrite the infinite-horizon performance cost function (2) as (23), where r(e, u_e) = e^T Q e + u_e^T R u_e is the utility function with the optimal control u_e.
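The role of the steady-state term can be illustrated as follows. Assuming the cancellation form u_r = W_2^+(ẋ_r − Âx − Ŵ_1σ(x) − Ke) — our reading of (21), with all model matrices chosen as illustrative values — the identified-model error dynamics collapse to ė = −Ke whenever Ŵ_2 is square and invertible:

```python
import numpy as np

n = 2
# illustrative identified model matrices (not from the paper's examples)
A = np.array([[-1.0, 0.2], [0.1, -2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.4]])
W2 = np.array([[1.0, 0.1], [0.0, 1.5]])
K = 3.0 * np.eye(n)
sigma = np.tanh

x = np.array([0.4, -0.2])
xr = np.array([np.sin(0.0), np.cos(0.0)])          # reference at t = 0
xr_dot = np.array([np.cos(0.0), -np.sin(0.0)])     # reference derivative
e = x - xr

# steady-state control: cancel the identified dynamics and inject -K e
u_r = np.linalg.pinv(W2) @ (xr_dot - A @ x - W1 @ sigma(x) - K @ e)
x_dot = A @ x + W1 @ sigma(x) + W2 @ u_r           # model under u = u_r (u_e = 0)
e_dot = x_dot - xr_dot
assert np.allclose(e_dot, -K @ e)                  # error dynamics reduce to e_dot = -K e
```

With the residual ξ_2 and the optimal term u_e included, one recovers exactly the regulation problem (22) that the critic is then asked to solve.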
According to the optimal regulator design in [25], an admissible control policy u_e should be designed such that the infinite-horizon cost function (23) associated with (22) is minimized. The Hamiltonian of (22) is therefore designed as (24), where V_e = ∂V(e)/∂e is the partial derivative of the value function with respect to e.
Then we define the optimal cost function as

V*(e) = min_{u_e ∈ ψ(Ω)} ∫_t^∞ r(e(τ), u_e(e(τ))) dτ    (25)

and it satisfies the following HJB equation:

min_{u_e ∈ ψ(Ω)} H(e, u_e, V_e*) = 0    (26)

The optimal control u_e* for (22) can then be obtained by solving ∂H(e, u_e, V_e*)/∂u_e = 0, which gives (27), where V*(e) is the solution of the HJB equation (26).
From (27), we can see that the optimal control u_e* depends on the optimal value function V*(e). However, it is difficult to solve the nonlinear partial differential HJB equation (26) to obtain V*(e). The usual method is to obtain an approximate solution via a critic NN, as in [4,5,25]. A single-layer NN is used to approximate the optimal value function, V*(e) = W_3^{*T} ψ(e) + ξ_3, and its derivative is given by (28), where W_3* ∈ R^I is the nominal weight vector, ψ(e) ∈ R^I is the activation function, ξ_3 is the approximation error and I is the number of neurons. ∇ψ(e) = ∂ψ(e)/∂e and ∇ξ_3 = ∂ξ_3/∂e are the partial derivatives of ψ(e) and ξ_3 with respect to e, respectively.
Substituting (28) into (27), one obtains (29) and (30). The critic NN is approximated as (31), where Ŵ_3 is the estimate of the nominal W_3*. The approximate optimal control (32) can then be obtained from (30) and (31).
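A minimal sketch of computing u_e directly from the critic, assuming the standard critic-only form u_e = −½ R^{-1} Ŵ_2^T ∇ψ(e)^T Ŵ_3 for (32), the quadratic activation vector ψ(e) = [e_1², e_1e_2, e_2²] used later in Example 5.1, and made-up weight values:

```python
import numpy as np

# Illustrative data: R and W2 taken as identity; critic weights are made-up values
R_inv = np.eye(2)                      # inverse of R = I
W2 = np.eye(2)
W3_hat = np.array([0.5, 0.1, 0.7])     # critic weight estimates (illustrative)

def grad_psi(e):
    """Jacobian of psi(e) = [e1^2, e1*e2, e2^2] w.r.t. e, shape (3, 2)."""
    e1, e2 = e
    return np.array([[2*e1, 0.0],
                     [e2,   e1],
                     [0.0,  2*e2]])

def u_e(e):
    # critic-only approximate optimal control: u_e = -1/2 R^-1 W2^T grad_psi^T W3_hat
    return -0.5 * R_inv @ W2.T @ grad_psi(e).T @ W3_hat

print(u_e(np.array([0.2, -0.1])))
```

No actor network appears anywhere: the control is a closed-form function of the critic weights, which is precisely the simplification claimed over the dual-network architecture.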

Remark 4.2:
Available adaptive optimal control methods are usually based on the dual-NN architecture, where a critic NN and an action NN approximate the optimal cost function and the optimal control policy, respectively. The complicated structure and computational burden make practical implementation difficult. In the following, we calculate the optimal control action directly from the critic NN instead of the actor-critic dual network.
Then (33) can be written in the general identification form (34), where X = ∇φ(e)[−Ke + W_2^+ u_e] and Y = e^T Q e + u_e^T R u_e. According to the least-squares learning rule, the estimate of the nominal W_3* is Ŵ_3 = −(XX^T)^{-1} X Y^T when the residual HJB equation error is zero. However, ξ_HJB is not always zero, and it is also difficult to carry out the subsequent closed-loop stability analysis based on the least-squares method. Inspired by [14,26], we therefore develop a novel robust estimation method for W_3*. Equation (35) is used to identify (34), where ξ_HJB is treated as the model error and unknown disturbance. For (35), the filtered version of Y is defined by (36), where the filter constant is positive and z is an auxiliary variable.
We further define the auxiliary variables z_f, Y_f, X_f and ξ_HJB1f as (37), where η is a filter parameter. It should be noted that the fictitious filtered variable ξ_HJB1f is used only for analysis.
Then we obtain (38). From the first equation in (36), one obtains (39). According to (38), (39) and (40), we have (41). Furthermore, we define the auxiliary regression matrix E ∈ R^{l×l} and vector F ∈ R^l as (42), whose solution is derived as (43). Finally, we denote a vector M as (44). The adaptive law for updating Ŵ_3 is then provided by (45), where μ is the learning gain.
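The filtered-regressor estimation idea can be sketched for a generic linear-in-parameters model Y = X^T W_3*. The forgetting-factor integrals below play the role of E and F in (42), and the update Ẇ̂_3 = −μ(EŴ_3 − F) is our reading of (44)-(45); since F = EW_3* when the model is exact, the update is driven by the parameter error E(Ŵ_3 − W_3*) rather than by an instantaneous prediction error:

```python
import numpy as np

dt, T = 1e-3, 20.0
eta, mu = 1.0, 50.0                        # forgetting factor and learning gain
theta = np.array([1.5, -0.7])              # "true" parameters (role of W3*)
W = np.zeros(2)                            # estimate W3_hat
E = np.zeros((2, 2))
F = np.zeros(2)
for i in range(int(T / dt)):
    t = i * dt
    X = np.array([np.sin(t), np.cos(2*t)])     # persistently exciting regressor
    Y = X @ theta                              # scalar measurement, Y = X^T theta
    # forgetting-factor integrals: E_dot = -eta*E + X X^T, F_dot = -eta*F + X*Y
    E += dt * (-eta * E + np.outer(X, X))
    F += dt * (-eta * F + X * Y)
    M = E @ W - F                              # M = E(W - theta): filtered parameter error
    W += dt * (-mu * M)                        # adaptive law W_dot = -mu*M
assert np.allclose(W, theta, atol=1e-2)        # estimate converges to the true parameters
```

Under persistent excitation, E becomes positive definite, so the error dynamics W̃̇ = −μEW̃ are exponentially stable — the property exploited in the stability analysis below.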

Proof: The Lyapunov function is selected as (46). Substituting (42) into (44), one obtains (47). It can be seen from [26] that persistent excitation (PE) of X makes the matrix E defined in (43) positive definite, i.e. λ_min(E) > σ > 0. Then, using Ẇ̃_3 = −Ẇ̂_3, the derivative of (46) is calculated as (48), and W̃_3 converges into a compact set.

Theorem 4.1: Consider system (1) with the adaptive optimal control signal u given by (21) and (32) and the adaptive laws (9) and (45). Then the tracking error e is uniformly ultimately bounded, and the optimal control u_e in (32) converges to a small bound around its ideal optimal solution u_e* in (30).

Proof: Design the composite Lyapunov function as L = L_I + L_o + L_C, where L_I is given by (10) and, from (18), its time derivative satisfies inequality (49); L_o is defined as in (46) and its derivative is obtained from (48), giving (50). Using the basic inequality ab ≤ a²δ/2 + b²/(2δ) with δ > 0, we can rewrite (50) as (51). L_C is defined as (52), where V*(e) is the optimal cost function defined in (25) and κ > 0 is a positive constant. Substituting (32) into (22), one obtains (53), and the time derivative of (52) can then be deduced from (28) and (53) as (54). From (49), (51) and (54), the time derivative L̇ = L̇_I + L̇_o + L̇_C satisfies inequality (55). If the parameters are chosen to satisfy condition (56), then (55) can be further represented as (57), where the coefficients are all positive constants under condition (56).
Moreover, we have (59). When t → ∞, the upper bound of (59) is given by (60), where ζ depends on the DNN identification approximation error and the critic NN weight error W̃_3. The structure diagram of the control scheme is illustrated in Figure 1.
A summary of the ADP-based optimal tracking control algorithm is as follows.
(1) Select the proper activation functions σ(·) and φ(·) in Equation (6) and the updating gains k_1, k_2, k_3 in Equation (9) for the identifier. σ(·) is usually selected as the sigmoidal function σ(·) = a/(1 + e^{−bx}) − c, where a, b and c are design constants, and φ(·) is selected as φ(·) = I. Â, Ŵ_1 and Ŵ_2 are tuned online according to Equation (9), so there is no need to select their initial values. Meanwhile, select the proper activation function ψ(·) in Equation (31) and the updating gain μ in Equation (45) for the critic NN; ψ(·) is usually selected as a smooth function consisting of combinations of the state tracking errors.
(2) The input/output data of the unknown non-affine nonlinear system (1) are used to train the identifier.
(3) The adaptive optimal tracking control law, consisting of the steady-state control law in Equation (21) and the optimal control law in Equation (32), is obtained based on the first two steps.
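The two-term control structure of step (3) can be sketched on a scalar plant whose dynamics are treated as already identified (all parameter values illustrative). For the resulting linear error dynamics ė = −ke + bu_e, the optimal term is computed exactly from the scalar Riccati equation rather than from the online critic — a simplification for illustration only:

```python
import numpy as np

dt, T = 1e-3, 10.0
a, w1, b = -1.0, 0.8, 2.0      # identified model: x_dot = a*x + w1*tanh(x) + b*u
k, q, r = 2.0, 1.0, 1.0        # steady-state feedback gain and cost weights
# error dynamics under u_r: e_dot = -k*e + b*u_e ; LQ-optimal u_e = -(b*p/r)*e
p = r * (-k + np.sqrt(k**2 + q * b**2 / r)) / b**2
x = 0.8
for i in range(int(T / dt)):
    t = i * dt
    xr, xr_dot = np.sin(t), np.cos(t)          # reference trajectory
    e = x - xr
    u_r = (xr_dot - a*x - w1*np.tanh(x) - k*e) / b   # steady-state term
    u_e = -(b * p / r) * e                           # optimal term for error dynamics
    x += dt * (a*x + w1*np.tanh(x) + b*(u_r + u_e))
print(abs(x - np.sin(T)))      # final tracking error
```

The steady-state term does the model cancellation; the optimal term then shapes how the residual error decays, exactly the division of labour between (21) and (32).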

Simulations
We consider the following two examples to illustrate the theoretical results in this section.
Example 5.1: Consider the non-affine nonlinear system (61). The matrices Q and R of the performance index function are chosen as identity matrices. The control objective is to make the states x_1 and x_2 follow the desired trajectories x_{1r} = sin t and x_{2r} = cos t − sin t. First, a DNN identifier (6) with the updating law (9) is used to identify the non-affine nonlinear system. The parameters are selected as k_1 = k_2 = k_3 = 1, and the activation function is selected as σ(·) = 2/(1 + e^{−2x}) − 0.5. The identification error is shown in Figure 2; the proposed identifier models the non-affine nonlinear system accurately. Then, with the identified model, the adaptive optimal tracking controller is implemented for the unknown non-affine nonlinear continuous-time system (61). Define the trajectory errors as e_1 = x_1 − x_{r1} and e_2 = x_2 − x_{r2}. The activation function of the critic NN is selected as ψ = [e_1^2, e_1e_2, e_2^2], the adaptive gain of the critic NN is selected as μ = 100, and the steady-state control gain is selected as K = 1200. Figures 3 and 4 show the trajectory tracking, and the convergence of the critic NN weights is shown in Figure 5, which demonstrates that the proposed adaptive optimal tracking controller ensures satisfactory tracking performance for an unknown non-affine nonlinear continuous-time system.

Example 5.2:
The classical 2-DOF single-track vehicle model, as shown in Figure 6, is commonly used in AFS/DYC control design [27]. The parameter notations (e.g. the tire-road friction coefficient, the yaw rate γ about the z-axis, the wheel slip angles and slip ratios, and the front-wheel steer angle δ_f) are listed in Table 1.
The mathematical model of Figure 6, considering the uncertain parameters, is expressed as (62), where x = [β γ]^T, β is the sideslip angle and γ is the yaw rate; u = [δ_c M_c]^T, where δ_c is the active steer angle and M_c is the corrective yaw moment; and δ_f is the driver steer input. The main objective of vehicle stability control is to design a controller that makes the actual vehicle yaw rate and sideslip follow the desired responses. The reference model is usually selected as (63), where τ_r and τ_β are the designed time constants of the yaw rate and sideslip angle, respectively. The variation and uncertainty of the tire cornering stiffness are described by (64), where C_f0, C_r0 and C_f, C_r are the nominal and actual cornering stiffnesses of the front and rear tires, respectively, ΔC_f0, ΔC_r0 are the deviation magnitudes, and ρ_f, ρ_r are perturbations. The simulation parameters of the vehicle system are selected as m = 1704 kg, C_f = 63224 N/rad, C_r = 84680 N/rad, I_z = 3048 kg m^2, l_f = 1.135 m and l_r = 1.555 m. A 28-degree step-steer manoeuvre at an initial speed of 80 km/h is simulated to verify the proposed method. The time-varying parameters C_f and C_r are obtained from (64) by selecting the deviation magnitudes as the constant 0.5 and ρ_f, ρ_r as band-limited white noise with amplitude ±0.01. As shown in Figures 7 and 8, the proposed method demonstrates strong robustness and self-adaptive performance, i.e. small tracking errors for the yaw rate and sideslip angle, even under time-varying cornering stiffness in the step-steer manoeuvre.
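The uncontrolled part of (62) can be sketched in state-space form using the standard single-track equations (an assumption on our part; the paper's (62) additionally includes the control inputs δ_c and M_c and the stiffness perturbations, omitted here) with the simulation parameters above:

```python
import numpy as np

# Standard 2-DOF single-track (bicycle) model with the paper's parameters
m, Iz = 1704.0, 3048.0            # mass [kg], yaw inertia [kg m^2]
lf, lr = 1.135, 1.555             # CG-to-axle distances [m]
Cf, Cr = 63224.0, 84680.0         # front/rear cornering stiffness [N/rad]
v = 80.0 / 3.6                    # vehicle speed [m/s] (80 km/h)

# x = [beta, gamma]^T, driver steer input delta_f:  x_dot = A x + B delta_f
A = np.array([
    [-(Cf + Cr) / (m * v),      (lr * Cr - lf * Cf) / (m * v**2) - 1.0],
    [(lr * Cr - lf * Cf) / Iz,  -(lf**2 * Cf + lr**2 * Cr) / (Iz * v)],
])
B = np.array([Cf / (m * v), lf * Cf / Iz])
# at 80 km/h the open-loop model is stable (negative real parts)
print(np.linalg.eigvals(A))
```

This makes the control task concrete: the AFS/DYC inputs must reshape the response of this (stable but lightly damped) pair of states to track the reference model (63).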
To show the identification performance of the proposed algorithm, the Root Mean Square (RMS) performance index of the state errors is adopted for comparison.

Conclusions
In this paper, we develop an adaptive optimal controller with a critic-identifier structure to solve the trajectory tracking problem for an uncertain non-affine nonlinear continuous-time system. First, a model-free DNN identifier is designed to reconstruct the unknown dynamics. Then, based on the identified model, an adaptive optimal controller is presented, which realizes trajectory tracking and stabilizes the error dynamics optimally. In addition, a critic NN is introduced to approximate the optimal value function, and a novel robust tuning law is established to update the critic NN weights. The stability of the closed-loop system is proved by the Lyapunov approach. Simulation results for two examples verify the validity of the proposed approach.

Disclosure statement
No potential conflict of interest was reported by the author(s).