Disturbance Compensation-Based Optimal Tracking Control for Perturbed Nonlinear Systems

This paper investigates disturbance compensation-based optimal tracking control for nonlinear systems in the presence of uncertain dynamic drift and extraneous disturbance by using adaptive dynamic programming (ADP). First, an extended state observer (ESO)-based disturbance rejection controller is designed to estimate the comprehensive disturbances of the system. Then, a novel composite controller capable of online learning is developed from the disturbance rejection controller and an optimal regulation law, where the optimal regulation law is derived within the ADP framework to stabilize the tracking-error dynamics and minimize a predefined value function. In particular, an improved critic-only weight-updating algorithm is embedded in the ADP scheme to ensure finite-time convergence of the critic weights, without resorting to the traditional actor-critic structure and its considerable computational burden. Based on Lyapunov analysis, it is proved that the tracking errors and the weight estimation errors of the critic network are uniformly ultimately bounded and that the pursued controller approximates the optimal policy. Finally, simulation results verify the superiority of the proposed strategy: the value function is decreased by 25% with consistent tracking performance.


I. INTRODUCTION
In recent years, tracking control for nonlinear systems has been investigated for many real-world plants, such as unmanned aerial vehicles [1], [2], robots [3], and quadrotors [4], [5]. To realize tracking performance, a variety of control algorithms have been employed for tracking control problems, including backstepping control [6], sliding mode control [7], [8], and model predictive control [9]. In practice, it is inevitable that modeling errors and external disturbances exist. Since classical control is sensitive to model variations, the tracking performance is weakened and the system stability cannot be guaranteed. Thus, achieving disturbance accommodation and nominal tracking performance are the chief concerns in current controller designs.
To ensure robust tracking and performance recovery, a large number of control methods and tools have emerged to deal with uncertainties, such as sliding mode control [10], [11], H∞ control [12], [13], neural networks (NNs) [14], [15], fuzzy logic systems (FLSs) [16], [17], and so on. In [11], a finite-time sliding mode controller is proposed for robotic manipulators, where improved performance and robust tracking are derived and the chattering phenomenon is eliminated, but the specific bound of the uncertainties is required. In [12], a robust H∞ tracking controller is considered for a nonlinear multi-UAV networked system to achieve stability by transforming the network system into a leader-follower structure; such controllers typically consume unnecessary control actions due to the worst-case consideration. In [15], an NN-based backstepping controller is established based on distance for nonlinear multiagent systems to ensure formation and tracking control, where the unknown nonlinearity of the system dynamics is recovered by an NN. In addition, FLS is employed in backstepping-based control of stochastic nonlinear systems to estimate the unknown dynamic drift, while a reduced-order state observer reconstructs the immeasurable states, such that the tracking errors belong to a compact set in the mean-square sense [17]. However, a heavy computational burden usually accompanies FLS, since its heuristic parameter adjustments depend closely on trial and error, which is hard to apply in practice. To achieve efficient closed-loop tracking, active disturbance rejection control (ADRC), proposed by Han [18], is a widespread and easily implemented solution for handling uncertainties and disturbances.
The key module of ADRC is the so-called extended state observer (ESO), which approximates and compensates for the negative influence of the total uncertainty in a timely fashion and has been successfully validated in both theoretical and industrial settings. In general, the compensated system is converted into a chain of integrators with a nominal controller, and a common choice in the existing paradigms is a proportional-derivative controller to achieve reference tracking [19], [20], [21], [22], [23], [24], [25]. Although the reported ADRC-based results achieve nominal tracking against a wide class of uncertainties, the time-invariant control structure lacks the ability to optimize tracking performance and control consumption online, and few works explore optimizing the overall control performance to retain optimality. This motivates the investigation of controllers with improved performance and low control consumption.
Optimal tracking control can be obtained via dynamic programming by minimizing a predefined quadratic function of the control input and tracking errors subject to nonlinear dynamics [26]. Noting that the corresponding Hamilton-Jacobi-Bellman (HJB) equation for nonlinear systems takes the form of a nonlinear partial differential equation, it is not easy to obtain analytical solutions directly, and the curse of dimensionality may arise, in contrast to the linear case. Recently, reinforcement learning (RL) has provided an efficient way to approximate the optimal value function and optimal control by interacting with the environment in pursuit of the maximum reward. Adaptive dynamic programming (ADP), a representative branch of RL in the control community, demonstrates that the optimal control can be derived by replacing the analytic solution of the HJB equation with NN approximators. In particular, actor-critic networks are the prevailing structures in ADP, wherein the critic NN approximates the value function while the actor NN approximates the optimal policy. For instance, under the condition that the system model is known, the policy iteration in [27] updates the critic and actor NNs synchronously, and the system state and NN errors are proved to be uniformly ultimately bounded (UUB). However, such iteration algorithms are only effective under the assumption that the model is precisely known. Thus, many scholars have shed light on ADP methods for nonlinear uncertain systems. In [28], an identifier-critic-actor NN is proposed, using the identifier NN to reconstruct the model uncertainty. To avoid excessive learning time for suppressing modeling errors, the identifier in [29] is designed with a functional-link NN for a permanent-magnet synchronous motor. However, the use of triple neural structures may lead to heavy computation and a slow convergence rate.
To mitigate this difficulty, a single critic NN can approximate both the optimal value function and the optimal control input while the unknown dynamics are reconstructed by an identifier, avoiding the complexity of introducing an actor NN [30]. Besides, sliding mode control, which is insensitive to external disturbance, has been employed within the ADP framework for uncertain nonlinear systems. In [31], ADP with an actor-critic structure learns online the optimal control of the sliding-mode dynamics, where lumped disturbances consisting of an unknown term and input disturbance are estimated by an NN and a disturbance observer. In [32], a modified value function with control inputs and disturbances from neighboring nodes is considered for multiagent systems to derive the optimal control in a critic-actor framework such that the consensus errors are asymptotically stable, where a distributed event-triggered integral sliding mode controller eliminates the matched disturbance. Further, actor-critic ADP is conducted in [33] by incorporating a novel fast nonsingular sliding mode into the controller design, with input constraints handled by a nonlinear anti-windup compensator, leading to convergence of the tracking error in fixed time. Besides, a novel ADP in [34] derives the optimal strategy under a zero-sum game framework; however, the upper bound of the disturbance is required before the controller design, which is hard to guarantee in real scenarios. Nevertheless, the above-mentioned controllers must introduce an actor NN for weight convergence to induce the optimal control. Note that the ESO structure does not depend on the upper bound of the lumped disturbances. To the best of our knowledge, no work has considered ESO-based optimal tracking control within a critic-only ADP scheme, and it is challenging for a single NN to address the strong coupling among the ESO disturbance observation error, the NN weight-update deviation, and the reconstruction error.
Inspired by the foregoing observations, the above controller schemes exhibit the following shortcomings: the traditional control designs are carried out without consideration of control consumption, while the optimal controllers rely on the actor-critic framework for weight convergence. This article focuses on disturbance compensation-based optimal tracking control for nonlinear systems in the presence of uncertain dynamic drift and extraneous disturbance with critic-only ADP. The main contributions of this work are summarized as follows: 1) Unlike [10], [11], [12], [13], [14], [15], [16], [17], which focus only on disturbance rejection without caring about control consumption, herein an efficient compromise between tracking performance and control cost is achieved under the devised ADP framework, attaining performance optimality and robustness simultaneously. In addition, the disturbance rejection controller is introduced to deal with the nonzero reference trajectory, such that the optimal tracking problem is converted into an optimal stabilization problem.
50620 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
2) In contrast to [28], [29], and [31], which use multiple NNs to design the optimal policy for perturbed nonlinear systems, a critic-only ADP with a lower computational burden is established. Additionally, unlike the nonlinear disturbance observer [31] and the sliding mode controller [32], which require disturbance upper-bound information, the proposed optimal anti-disturbance control based on the uncertainty estimates offered by the ESO can be derived without resorting to such prior information, and it is less conservative and computationally more convenient than the game scheme in [34].
3) Different from [28], [29], [31], [32], [33], [35], and [36], where the Bellman error is minimized via prevailing gradient descent or least squares methods, an improved neural weight-updating rule is ingeniously designed to achieve finite-time convergence by extracting the weight errors from designated intermediate arguments and including them in the update. Owing to this weight rule, the tracking errors converge to a neighborhood of the origin under the critic-only ADP framework with synthesized consideration of the ESO disturbance observation error, the NN approximation error, and the tracking error. In addition, the persistent excitation (PE) condition can be easily checked by judging the minimum eigenvalue of a constructed matrix, which provides an online feasible examination avenue compared with [28] and [29].
The rest of the paper is organized as follows. The problem formulation is given in Sect. II. The disturbance rejection controller is described in Sect. III, where the composite controller is designed based on the estimate of the comprehensive disturbances and the optimal regulation law. Simulations on quadrotor tracking verify the effectiveness of the investigated method in Sect. IV. Finally, the conclusion is given in Sect. V.
Notation: Bold characters denote vectors or matrices. 0 and I denote the zero and identity matrices with dimensions determined by the context. diag(·) indicates a diagonal matrix. λ_max(A) and λ_min(A) represent the maximum and minimum eigenvalues of a matrix A, respectively.

II. PROBLEM FORMULATION
Consider the nonlinear affine multi-input multi-output (MIMO) system

ẋ = f(x) + g(x)u + d(t), (1)

where x ∈ R^n and u ∈ R^m are the system state and control input, and d(t) ∈ R^n denotes an unknown time-varying disturbance. f(x) ∈ R^n is an unknown continuously differentiable function bounded on a compact set Ω, and the control gain matrix g(x) ∈ R^{n×m} is known, bounded, and invertible. The tracking error e can be expressed as

e = x − x_d, (2)

where x_d is a bounded and differentiable trajectory to be specified. From [27], the system is stabilizable by designing a continuous control input u to track the desired trajectory x_d. Following (1), the error dynamics can be computed as

ė = f(x) + g(x)u + d(t) − ẋ_d. (3)

The control objective is to design the controller u such that x tracks the desired trajectory x_d in an optimal way by minimizing the following function consisting of tracking errors and control inputs:

V = ∫_0^∞ (e^T Q e + u^T R u) dτ, (4)

where Q ∈ R^{n×n} and R ∈ R^{m×m} are positive definite matrices. Noting that the unknown term and the disturbance both appear in (3), they are regarded as comprehensive disturbances in the disturbance rejection controller.

Assumption 1: The comprehensive disturbance δ = f(x) + d(t) and its time derivative δ̇ are bounded by an unknown positive constant.
Remark 1: The model (1) considered in this paper covers a wide range of plants in applications, such as unmanned aerial vehicles [2], robotic manipulators [11], MEMS gyroscopes [22], and quadrotors [37]. For high-order systems, one can use iterative backstepping design to convert the high-order dynamics into first-order systems of the form (1); thus, the proposed controller is also feasible for stabilizing high-order tracking-error systems. In addition, similar to numerous preceding ESO works, the assumption is necessary in the disturbance compensation framework, as found in [22], [24], and [37]. Moreover, the bound of δ̇ is only utilized for proving the convergence of the disturbance approximation rather than for the control design, as in [11] and [31].
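As a concrete illustration of the cost in (4), the following sketch numerically evaluates the quadratic value function for a logged error/input trajectory. This is a minimal sketch on assumed toy data: the function name, the decaying trajectories, and the choice Q = R = I are illustrative, not from the paper.

```python
import numpy as np

# Minimal sketch (assumed toy data): numerically evaluate the quadratic
# cost of (4), V = ∫ (eᵀQe + uᵀRu) dt, along a logged trajectory.

def value_function(e_traj, u_traj, Q, R, dt):
    """Riemann-sum approximation of the tracking cost in (4)."""
    cost = 0.0
    for e, u in zip(e_traj, u_traj):
        cost += (e @ Q @ e + u @ R @ u) * dt
    return cost

t = np.arange(0.0, 10.0, 0.01)
e_traj = np.exp(-t)[:, None] * np.array([1.0, -0.5])  # decaying error e(t) in R^2
u_traj = 0.3 * e_traj                                  # proportional input u(t)
Q, R = np.eye(2), np.eye(2)
V = value_function(e_traj, u_traj, Q, R, dt=0.01)
```

A smaller V indicates a better compromise between tracking accuracy and control effort, which is the quantity compared across controllers in Section IV.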
Further, the following lemma and definition are prepared for the weight convergence analysis.
Lemma 1 [38]: Assume that there exists a positive function V satisfying

V̇ ⩽ −ηV^β + ℓ,

with ℓ > 0, 0 < β < 1, and a positive bounded constant η. Then V(t) converges to a neighborhood of the origin before a settling time T ⩽ V^{1−β}(t_0)/(η(1−β)).

Definition 1 [31]: If there exist positive constants ε and τ such that ∫_t^{t+τ} Φ Φ^T dτ ⩾ εI holds for all t, the variable Φ satisfies the PE condition.
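Lemma 1 can be checked numerically in the unperturbed case ℓ = 0 with β = 1/2, where the settling-time bound reduces to T ⩽ 2V^{1/2}(0)/η. The sketch below integrates this scalar differential inequality at equality and compares the observed settling time with the bound; all parameter values are arbitrary.

```python
import numpy as np

# Numerical check of the finite-time bound of Lemma 1 in the unperturbed
# case (ℓ = 0, β = 1/2): if V̇ = -η V^(1/2), then V reaches zero no later
# than T = 2 V(0)^(1/2) / η. Parameter values are arbitrary.

def settle_time(V0, eta, dt=1e-4):
    """Forward-Euler integration of V̇ = -η √V until V vanishes."""
    V, t = V0, 0.0
    while V > 1e-10:
        V = max(V - eta * np.sqrt(V) * dt, 0.0)
        t += dt
    return t

V0, eta = 4.0, 0.5
T_bound = 2.0 * np.sqrt(V0) / eta   # analytic settling-time bound (= 8.0)
t_settle = settle_time(V0, eta)
```

The linear term −ℓV in comparable lemmas only accelerates convergence, so the β-power term alone already yields the finite settling time exploited in the proof of Theorem 1.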

III. MAIN RESULTS
In this section, a novel composite controller capable of online learning is developed based on a disturbance rejection controller and an optimal regulation law. First, in light of the uncertainty estimates of the ESO, a disturbance rejection controller is constructed. Then, to simultaneously accomplish a minimum tracking error and the lowest control cost, an optimal regulation law is established by devising a critic-only ADP technique with a novel adaptive law such that the preset quadratic value function is minimized. The whole architecture of the proposed control is summarized in Figure 1.
A. DISTURBANCE REJECTION CONTROLLER DESIGN
Let x̂ and δ̂ denote the estimates of the state x and the comprehensive disturbance δ provided by the ESO (7), and define the estimation errors x̃ = x − x̂ and δ̃ = δ − δ̂. From (1) and (7), the dynamics of the estimation errors ε = [x̃^T, δ̃^T]^T can be rewritten in the compact form

ε̇ = Hε + Bδ̇,

where H ∈ R^{2n×2n} collects the observer gains and B = [0, I]^T. Observing that H is Hurwitz, there exists a positive definite matrix P_x ∈ R^{2n×2n} satisfying H^T P_x + P_x H = −I. According to Assumption 1, the convergence of δ̃ can be easily inferred from [22] and [24]. Thus, the dynamics of the tracking errors (3) can be rewritten as

ė = g(x)u + δ̂ + δ̃ − ẋ_d. (11)

In order to drive e → 0, the composite control is divided into two parts,

u = u_d + u_e, (12)

where the disturbance rejection controller u_d is designed to compensate for the effect of the disturbance:

u_d = (g^T g)^{−1} g^T (−δ̂ + ẋ_d − Ke), (13)

where (g^T g)^{−1} g^T represents the generalized inverse of g and the positive diagonal gain matrix K ∈ R^{n×n} is selected to keep the tracking error close to zero at the steady-state stage. Furthermore, substituting (13) into (11), the reconstituted error dynamics become

ė = −Ke + gu_e + δ̃. (14)

Thus, the optimal regulation law u_e will be considered within the ADP framework to balance tracking performance and control expense.

Remark 2: By employing the disturbance rejection controller u_d to compensate for the disturbance and the nonlinear time-varying desired trajectory, the optimal control considered in subsection B is designed as a stabilization problem, avoiding the construction of augmented systems for bounded value functions.
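The ESO-based compensation described above can be sketched on a scalar toy plant ẋ = δ(t) + u, where δ lumps the drift and disturbance as in Assumption 1. The observer below is a standard linear ESO with the usual bandwidth parameterization β1 = 2ω_o, β2 = ω_o²; the plant, gains, and disturbance signal are illustrative assumptions, not the paper's design.

```python
import numpy as np

# Standard linear ESO sketch for ẋ = δ(t) + u: the extended state z2
# tracks the total disturbance δ, and u = -z2 implements the disturbance
# rejection part of the composite control. All values are illustrative.

def run_eso(wo=20.0, dt=1e-3, T=5.0):
    beta1, beta2 = 2.0 * wo, wo ** 2      # bandwidth-parameterized gains
    x, z1, z2 = 0.0, 0.0, 0.0
    for k in range(int(T / dt)):
        t = k * dt
        delta = np.sin(t) + 0.5           # unknown total disturbance
        u = -z2                           # disturbance rejection term
        x += (delta + u) * dt             # plant integration
        err = x - z1
        z1 += (z2 + u + beta1 * err) * dt # state estimate
        z2 += (beta2 * err) * dt          # extended state, z2 ≈ δ
    return abs(delta - z2)                # final estimation error

final_err = run_eso()
```

Raising the observer bandwidth ω_o shrinks the residual estimation error δ̃ at the price of noise sensitivity, which mirrors the bandwidth study in Section IV-B.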

B. OPTIMAL REGULATION LAW DESIGN 1) OPTIMAL CONTROL
To stabilize system (14), the value function is predefined to balance control cost and tracking performance. Thus, the value function (4) can be restated in terms of u_e as

V(e) = ∫_t^∞ (e^T Q e + u_e^T R u_e) dτ. (15)

To derive the optimal control policy, the Hamiltonian is defined as

H(e, u_e, V_e) = e^T Q e + u_e^T R u_e + V_e^T (−Ke + gu_e + δ̃), (16)

where V_e = ∂V(e)/∂e. Taking the derivative of (15) along (14) and following the optimality principle [39], the HJB equation can be derived as

0 = min_{u_e} H(e, u_e, V*_e), (17)

with V*_e = ∂V*(e)/∂e, where the optimal value function V*(e) is given as

V*(e) = min_{u_e} ∫_t^∞ (e^T Q e + u_e^T R u_e) dτ. (18)

The optimal control u*_e can be obtained from the stationarity condition ∂H(e, u*_e, V*_e)/∂u*_e = 0 as

u*_e = −(1/2) R^{−1} g^T V*_e. (19)

By virtue of (19), the HJB equation (17) can be represented as

e^T Q e + V*_e^T (−Ke + δ̃) − (1/4) V*_e^T g R^{−1} g^T V*_e = 0. (20)

Note that (20) is a nonlinear partial differential equation in the optimal value function; it is difficult to obtain its analytical solution due to the complex dynamics and nonlinearity. Here, the RL strategy is considered for approximating V*(e) by introducing an NN. With the aid of a critic NN, the optimal value function V*(e) and its gradient can be expressed as

V*(e) = W^T φ(e) + ε(e), (21)

and

∇V*(e) = ∇φ^T W + ∇ε, (22)

where l represents the number of neurons in the hidden layer, W ∈ R^l is the ideal weight of the critic NN, φ ∈ R^l is the regressor vector, and ε is the critic NN reconstruction error, which converges to zero as l becomes sufficiently large [27]. Moreover, ∇φ = ∂φ/∂e ∈ R^{l×n} and ∇ε = ∂ε/∂e ∈ R^n denote the partial derivatives of φ and ε, satisfying ∥∇φ∥ ⩽ φ̄ and ∥∇ε∥ ⩽ ε̄ with positive constants φ̄ and ε̄. Thus, the optimal control u*_e can be given as

u*_e = −(1/2) R^{−1} g^T (∇φ^T W + ∇ε). (23)

Given the unavailability of W, the optimal value V* can be estimated in the form

V̂(e) = Ŵ^T φ(e), (24)

where Ŵ ∈ R^l is the estimate of W; its online update will be given later to realize the optimal control and the optimal value function simultaneously.
According to (23) and (24), the related optimal control can be approximated by

u_e = −(1/2) R^{−1} g^T ∇φ^T Ŵ. (25)

Remark 3: In most ADP schemes, the Hamiltonian is considered for the nominal system [27], [28], [29]; in particular, the HJB equation is structured for the nominal system to solve the optimal control. In contrast to these methods, the HJB equation (17) here is constructed for (11), which contains the estimation error δ̃, leading to challenges in adjusting the weight and deducing the optimal regulation law. Differing from the existing ADP designs for the nominal system with an actor-critic NN framework, which utilize a critic NN and an actor NN for approximating the optimal value function and optimal policy [28], [29], the optimal value function and the optimal control policy here are updated simultaneously by the critic-only NN with an adaptive weight update, significantly reducing the computational consumption of the actor NN.
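The critic-only control evaluation can be sketched as follows: given a critic weight estimate Ŵ and the regressor gradient ∇φ(e), the control is −(1/2)R^{−1}g^T∇φ^TŴ as in (25). The quadratic regressor φ(e) = [e₁², e₁e₂, e₂²] used here is a hypothetical example for n = 2, not the paper's basis.

```python
import numpy as np

# Sketch of the critic-only control evaluation in (25) with a
# hypothetical quadratic regressor φ(e) = [e1², e1·e2, e2²] for n = 2.

def grad_phi(e):
    e1, e2 = e
    return np.array([[2 * e1, 0.0],
                     [e2,     e1],
                     [0.0, 2 * e2]])     # ∇φ ∈ R^{l×n}, l = 3

def u_opt(e, W_hat, g, R):
    """Evaluate u_e = -1/2 R^{-1} gᵀ ∇φ(e)ᵀ Ŵ."""
    return -0.5 * np.linalg.inv(R) @ g.T @ grad_phi(e).T @ W_hat

e = np.array([1.0, -1.0])
g, R = np.eye(2), np.eye(2)
W_hat = np.array([1.0, 0.0, 1.0])        # corresponds to V̂ = e1² + e2²
u = u_opt(e, W_hat, g, R)                # = -1/2 ∇V̂ = (-1, 1)
```

With V̂ = e₁² + e₂² this reduces to u_e = −e, i.e., the familiar gradient-descent feedback, which gives a quick sanity check of the sign conventions in (25).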

2) IMPROVED UPDATING RULE FOR CRITIC NN
In this subsection, the update of the critic NN weight Ŵ is considered so that the optimal regulation law u_e not only stabilizes the system (14) but also attains optimality. Based on NN tools, the HJB equation (17) can be rewritten as

e^T Q e + u_e^T R u_e + W^T ∇φ(−Ke + gu_e) + ε_HJB = 0, (26)

where ε_HJB = ∇ε^T(−Ke + gu_e + δ̃) + (W^T ∇φ)δ̃ is the residual error of the HJB equation.
For brevity, denote ψ = e^T Q e + u_e^T R u_e and Φ = ∇φ(−Ke + gu_e); then the HJB equation (26) can be expressed as

ψ + W^T Φ + ε_HJB = 0. (27)

Instead of minimizing a quadratic function of the Bellman error, we design an improved method to update the weight Ŵ with finite-time stability. To describe the update rule, we multiply both sides of (27) by Φ, resulting in

Φ Φ^T W = −Φψ − Φ ε_HJB. (28)

Thus, for a constant c > 0, the auxiliary matrix N ∈ R^{l×l} and vector S ∈ R^l are introduced to perform a filter operation, defined as

Ṅ = −cN + ΦΦ^T, N(0) = 0,
Ṡ = −cS + Φψ, S(0) = 0, (29)

where the positive constant c is chosen to ensure the boundedness of S and N. From (27), there exists an auxiliary variable P ∈ R^l satisfying

P = NŴ + S. (30)

Thus, the improved update rule for the weight is

Ŵ̇ = −Γ P/∥P∥, (31)

where Γ ∈ R^{l×l} is a positive definite diagonal matrix. A singularity may arise in the norm-normalized term; this issue has been well investigated via saturation or sign functions among sliding mode controllers, and one can employ a similar technique to ensure weight boundedness (a discussion of the sign function and the norm-normalized term can be found in [40]). With the aid of the improved update rule (31), the finite-time weight convergence of the critic NN is established in the following theorem.

Theorem 1: Consider the value function (15) for the nonlinear system (14), and let the estimate of the value function be updated with the approximate weight Ŵ given by (31). Then the weight approximation error W̃ = W − Ŵ converges to a neighborhood of the origin in finite time.
Proof: Introduce the auxiliary variable ς related to the HJB error as ς = −∫_0^t e^{−c(t−r)} Φ(r) ε_HJB(r) dr, so that S = −NW + ς follows from (28). Further, since W̃ = W − Ŵ, the estimation error of W is implicitly contained in P:

P = NŴ + S = −NW̃ + ς, (33)

which is crucial to guarantee weight convergence. Suppose that the PE condition on Φ holds; one can then infer from [30] that σ = λ_min(N) > 0, with λ_min(N) denoting the minimum eigenvalue of N. The candidate Lyapunov function is selected as

V_W = (1/2) W̃^T Γ^{−1} W̃.

Substituting (31) and (33) into the time derivative of V_W, one obtains

V̇_W ⩽ −η V_W^{1/2} + ε_ς,

with a positive constant ε_ς assured to satisfy ∥ς∥ ⩽ ε_ς and η = σ√(2/λ_max(Γ^{−1})), where λ_max(Γ^{−1}) represents the maximum eigenvalue of Γ^{−1}. Thus, one can conclude that the weight estimation error W̃ converges to a neighborhood of the origin after the finite time T ⩽ 2V_W^{1/2}(0)/η by recalling Lemma 1 with β = 1/2.
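The filtered-regressor update proved above can be sketched numerically. Assuming a synthetic persistently exciting regressor Φ(t), data generated from a known "true" weight W (the noise-free case ε_HJB = 0), and illustrative gains c, Γ, and step size, the normalized rule (31) drives Ŵ to W:

```python
import numpy as np

# Sketch of the norm-normalized weight update (31) with the auxiliary
# filters N and S. Data come from a known "true" weight (ε_HJB = 0),
# so Ŵ should reach W in finite time. All gains are illustrative.

l, c, dt = 3, 1.0, 1e-3
W_true = np.array([1.0, -2.0, 0.5])
Gamma = 5.0 * np.eye(l)
N, S, W_hat = np.zeros((l, l)), np.zeros(l), np.zeros(l)

for k in range(int(20.0 / dt)):
    t = k * dt
    Phi = np.array([np.sin(t), np.cos(2 * t), np.sin(3 * t) + 0.5])  # PE signal
    psi = -W_true @ Phi                        # Bellman identity ψ + WᵀΦ = 0
    N += (-c * N + np.outer(Phi, Phi)) * dt    # filter of ΦΦᵀ
    S += (-c * S + Phi * psi) * dt             # filter of Φψ
    P = N @ W_hat + S                          # equals -N(W - Ŵ) here
    if np.linalg.norm(P) > 1e-8:
        W_hat += -(Gamma @ P / np.linalg.norm(P)) * dt

err = np.linalg.norm(W_true - W_hat)
```

The unit-norm direction makes the estimate move at a constant speed set by Γ, which is what yields the finite settling time; in discrete time it leaves a small chattering residual of the order of the step size, consistent with convergence to a neighborhood rather than the exact weight.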
Remark 4: In most ADP frameworks, the Bellman error is minimized by gradient descent or least squares methods [28], [29]. Unlike these methods, the weight errors here are extracted from designated intermediate arguments via a filter operation, so that the proposed adaptive rule (31), driven by the weight errors, achieves finite-time weight convergence and avoids the hysteresis of weight convergence caused by minimizing the Bellman error. Furthermore, the approximate optimal regulation law can be derived from the weight estimate, with the coupling among the disturbance observation error, the NN approximation error, and the tracking error explicitly accounted for. In addition, the PE condition is the basic assumption guaranteeing NN weight convergence in ADP frameworks [28], [29], yet it is not easy to verify in simulation. By introducing the auxiliary matrix N, the PE condition can be judged by computing whether σ > 0.
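The online PE check suggested in Remark 4 can be sketched by filtering the regressor outer product and monitoring σ = λ_min(N). Below, a sinusoidal stand-in regressor passes the check while a constant, rank-deficient one fails; the signals and gains are illustrative, not the paper's Φ.

```python
import numpy as np

# Online PE check from Remark 4: filter ΦΦᵀ into N and monitor
# σ = λ_min(N); PE holds while σ stays above zero.

def min_eig_of_filter(phi_fn, c=1.0, dt=1e-3, T=10.0, l=2):
    N = np.zeros((l, l))
    for k in range(int(T / dt)):
        Phi = phi_fn(k * dt)
        N += (-c * N + np.outer(Phi, Phi)) * dt
    return np.linalg.eigvalsh(N)[0]   # smallest eigenvalue σ

sigma_pe = min_eig_of_filter(lambda t: np.array([np.sin(t), np.cos(t)]))
sigma_bad = min_eig_of_filter(lambda t: np.array([1.0, 1.0]))  # rank deficient
```

Because N is an exponentially weighted integral of positive semidefinite matrices, σ can never go negative; it collapses toward zero exactly when the regressor stops exciting some direction, which makes the scalar σ a convenient online indicator.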
The proposed controller with critic-only ADP is summarized in Algorithm 1, which lists the specific steps and collects the key equations.

3) STABILITY ANALYSIS
Theorem 2: For the nonlinear system (1), consider the composite control (12) consisting of the disturbance rejection controller (13) and the optimal regulation law (25), where the ESO (7) and the improved adaptive rule (31) are provided for updating δ̂ and Ŵ. If Φ satisfies the PE condition, then the tracking errors remain within a neighborhood of the origin, and u_e converges to a neighborhood of the optimal regulation law u*_e.

Proof: It follows from (14) and (25) that the closed-loop error dynamics are ė = −Ke + gu_e + δ̃. Consider the following Lyapunov function:

V_L = K′ V*(e) + (Γ′/2)∥e∥²,

where V* is the optimal value function given by (18) and K′ > 0, Γ′ > 0 are constants. According to (18), taking the derivative of V_L along the closed-loop dynamics and applying Young's inequality ab ⩽ a²/2 + b²/2 to the cross terms yields a bound of the form

V̇_L ⩽ −(K′λ_min(R) − Γ′∥g∥²)∥u_e∥² − v∥e∥² + γ′. (38)
Under the condition that K′λ_min(R) − Γ′∥g∥² ⩾ 0, (38) can be rewritten as

V̇_L ⩽ −v∥e∥² + γ′. (39)

Noting that γ′ contains the weight estimation error W̃, the ESO estimation error δ̃, and the NN approximation error ∇ε from the critic NN, the constant γ′ is bounded by Theorem 1 and the boundedness of ∇ε and δ̃. In addition, the parameters Γ′ and K′ are chosen as shown in the equation at the bottom of the page, which guarantees the positiveness of v and thereby the stability implied by (38).
Since the disturbance estimation errors and critic NN approximation errors are bounded by finite values, γ′ > 0. For any ∥e∥ > √(γ′/v), we can derive V̇_L < 0 from (39). By the Lyapunov theorem, the tracking errors are UUB, and one can conclude that they converge to the region ∥e∥ ⩽ √(γ′/v). In addition, according to the definitions of u*_e in (23) and u_e in (25), the boundedness of u_e − u*_e can be concluded from

∥u_e − u*_e∥ = (1/2)∥R^{−1} g^T (∇φ^T W̃ + ∇ε)∥ ⩽ (1/2)∥R^{−1}∥∥g∥(φ̄∥W̃∥ + ε̄).

The proof is completed.

IV. SIMULATION RESULTS
To verify the effectiveness and advantages of the proposed disturbance compensation-based optimal tracking control, simulations are conducted on a typical class of perturbed nonlinear systems. The quadrotor, one of the indispensable aircraft widely applied in many fields [24], [37], has been investigated by many scholars owing to its nonlinearity, unknown dynamic model, and wind disturbance. As modeled in [37], the position loop of the quadrotor is described as

ṗ = v,
mv̇ = u_p H − G − Λv + d, (42)

where p = [p_x, p_y, p_z]^T denotes the position of the quadrotor, v = ṗ = [v_x, v_y, v_z]^T represents the velocity, m is the mass of the quadrotor, and G = [0, 0, mg]^T with g standing for the gravitational acceleration. The coupling matrix H = [cos ϕ cos ψ sin θ + sin ϕ sin ψ, cos ϕ sin ψ sin θ − sin ϕ cos ψ, cos ϕ cos θ]^T describes the relationship between translational and rotational motion, with ϕ, θ, ψ being the roll, pitch, and yaw angles. Λ and d stand for the unknown positive definite aerodynamic matrix and a bounded disturbance. To facilitate the controller design, we reformulate (42) into the compact affine form (1) with state X and virtual control inputs u_x = [u_p(cos ϕ cos ψ sin θ + sin ϕ sin ψ)]/m, u_y = [u_p(cos ϕ sin ψ sin θ − sin ϕ cos ψ)]/m, and u_z = [u_p cos ϕ cos θ − mg]/m. Since f(X) and D(t) are unknown, we treat them as the comprehensive disturbance δ = f(X) + D(t). The model parameters of the quadrotor and the disturbance are listed in Table 1.
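The virtual-control mapping that puts the position loop (42) into affine form can be sketched as below. Dividing each channel by the mass m is our assumption for symmetry across the three axes; the thrust and angle values are arbitrary test inputs.

```python
import numpy as np

# Sketch of the virtual-control mapping for the position loop (42).
# The division of each channel by m is an assumption made here for
# symmetry with the model; inputs are arbitrary test values.

def virtual_controls(u_p, phi, theta, psi, m, g=9.81):
    ux = u_p * (np.cos(phi) * np.cos(psi) * np.sin(theta)
                + np.sin(phi) * np.sin(psi)) / m
    uy = u_p * (np.cos(phi) * np.sin(psi) * np.sin(theta)
                - np.sin(phi) * np.cos(psi)) / m
    uz = (u_p * np.cos(phi) * np.cos(theta) - m * g) / m
    return np.array([ux, uy, uz])

# hover check: level attitude with thrust u_p = m·g gives zero net input
u_hover = virtual_controls(u_p=2.0 * 9.81, phi=0.0, theta=0.0, psi=0.0, m=2.0)
```

Under this mapping the translational dynamics become three decoupled double integrators plus the lumped disturbance, which is exactly the structure the composite controller of Section III expects.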
The trajectory reference command is p_d = [10(1 − cos(πt/10)), 5 sin(πt/5), 9]^T. Based on the aforementioned design, the proposed controller is built from the disturbance rejection controller and the optimal regulation law: the disturbance rejection controller is designed based on the ESO to compensate for the impact of the comprehensive disturbances, while the optimal regulation law is derived by minimizing the value function within the ADP framework equipped with the adaptive law (31) for the weight update. To approximate the optimal value function by the critic NN, a 6-15-1 structure is established: six input neurons, fifteen hidden neurons, and one output neuron, as shown in Figure 2.
Specifically, we select Gaussian basis functions φ = [φ_1, φ_2, . . . , φ_15]^T as regressors. For a fair comparison, the control parameters are chosen by trial and error such that the disturbance is estimated precisely and the steady-state accuracy is consistent across the following contrastive controller schemes.
3) Critic-actor ADP [29] (abbreviated as ADP2): The controller in [29] is realized by an identifier NN and critic-actor NNs within the ADP scheme, a state-of-the-art ADP framework also found in [28]. For a fair comparison, we augment the critic-actor ADP with the disturbance rejection controller. The learning gains of the critic NN and actor NN are chosen as Γ_c = I and Γ_a = I.
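Returning to the critic regressor: a hypothetical Gaussian radial-basis construction of this kind, together with the analytic gradient ∇φ required by the control law (25), can be sketched as follows. The centers c_i and width b are illustrative choices, not the paper's tuning.

```python
import numpy as np

# Hypothetical Gaussian radial-basis regressor for the critic, with the
# analytic gradient ∇φ needed by the control law. Centers/width are
# illustrative, not the paper's tuning.

def make_rbf(centers, b):
    def phi(e):
        d = e[None, :] - centers                       # (l, n) offsets
        return np.exp(-np.sum(d ** 2, axis=1) / b ** 2)
    def grad_phi(e):
        d = e[None, :] - centers
        return (-2.0 / b ** 2) * d * phi(e)[:, None]   # ∇φ ∈ R^{l×n}
    return phi, grad_phi

centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
phi, grad_phi = make_rbf(centers, b=1.0)
vals = phi(np.zeros(2))   # the basis centered at the query point returns 1
```

Having ∇φ in closed form avoids numerical differentiation inside the control loop, which matters because (25) evaluates the gradient at every sampling instant.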

A. COMPARISON WITH DIFFERENT CONTROLLERS
The simulation results are presented in Figures 3-8. The evolution of the position tracking errors of the above-mentioned controllers is depicted in Figure 3, showing that all controllers ensure that the position tracking errors converge to a neighborhood of the origin in a short time despite the comprehensive disturbance consisting of unknown dynamic drift and extraneous disturbance. The comprehensive disturbances are well estimated by the ESO, as shown in Figure 4: the estimates track the real lumped disturbances in a short time. The control inputs are shown in Figure 5.
By virtue of the ADP framework, the optimal controllers ADP1 and ADP2 are derived from different NN structures and update rules. The transient tracking of the proposed ADP1 controller is faster than that of ADP2, which can be attributed to the implicit representation and employment of the weight errors, such that the proposed adaptive law drives the weight errors to zero in a prescribed time. The weight convergence profiles of both methods are shown in Figures 6-8. The proposed update rule drives the weight estimates to the real weights within 2 s, which further induces the convergence of the optimal value function and the optimal policy; this also indicates that ADP1 is capable of learning the optimal control policy, as the weights tend to constant values. By contrast, the traditional weight update in ADP2 is designed by minimizing the Bellman error via gradient descent or least squares methods, resulting in weight convergence with an obvious lag in the presence of uncertainties after 2 s. Further, taking tracking errors and control consumption into consideration, the value function is used to evaluate the performance of the different controllers, with positive definite matrices Q = I and R = I, where I is the identity matrix. The evolution of the value function under the different controllers is shown in Figure 9. It shows intuitively that the proposed controller possesses online optimization ability in terms of tracking performance and control consumption compared with the DR controller: the optimization capability of the critic-only ADP decreases the value function by 25%, and its performance is slightly better than that of the critic-actor architecture.

B. EFFECT OF DIFFERENT OBSERVER BANDWIDTH ON PERFORMANCE AND CONSUMPTION
To explore the optimization ability of the proposed optimal tracking controller, we make simulation comparisons with the DR controller under different observer bandwidths. To highlight the disturbance immunity of the proposed controller, we impose the following disturbance on the system: d(t) = [sin(4t) − cos(t), 2(cos(4t) + sin(2t) − cos(t)), sin(3t)]^T. Although the disturbance estimation errors decrease with increasing bandwidth, among the DR controllers with different observer bandwidths the tracking errors can only be reduced at the cost of high control consumption, as revealed in Figures 10-11. Thanks to the disturbance compensation of the ESO and the online learning ability that updates the weights to reduce the coupling errors, the tracking errors are minimized by the whole integrated scheme. To demonstrate the necessity of the optimal regulation law u_e for improving tracking performance and reducing cost, we employ indices measuring the steady-state tracking performance and the control cost, with T_i being the time of entering steady state and e_0 representing the mean error; herein, we choose T_i = 4 s. As exhibited in Table 2, the ADP1 controller achieves better tracking errors, since the optimal regulation law acts as the part of the controller that minimizes the tracking errors, while the control consumption does not increase significantly, which further demonstrates the superiority of the investigated ADP framework. With the help of the optimal regulation law u_e, the proposed controller achieves comparable tracking errors even in the presence of disturbance estimation errors. To further compare tracking performance and control costs, the overall value functions during the whole optimization procedure are depicted in Figure 12.
The optimization ability can be clearly observed: the inserted ADP1 scheme attains a smaller value function than the DR controllers regardless of the ESO bandwidth.
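Steady-state performance and cost indices of the kind described above can be sketched as below. The exact index formulas are our assumption (a mean error norm after the settling time T_i and an accumulated squared control norm), since they are not reproduced in this excerpt.

```python
import numpy as np

# Hedged sketch of steady-state performance/cost indices of the kind
# used in Section IV-B: mean tracking-error norm after the settling
# time T_i, and accumulated squared control norm. The exact formulas
# are our assumption, since they are not reproduced here.

def indices(t, e_traj, u_traj, T_i=4.0):
    dt = t[1] - t[0]
    mask = t >= T_i
    e0 = np.mean(np.linalg.norm(e_traj[mask], axis=1))     # mean error
    Ju = np.sum(np.linalg.norm(u_traj, axis=1) ** 2) * dt  # control cost
    return e0, Ju

t = np.arange(0.0, 10.0, 0.01)
e_traj = np.exp(-t)[:, None] * np.ones((1, 3))   # decaying 3-axis error
u_traj = 0.5 * np.ones((len(t), 3))              # constant control effort
e0, Ju = indices(t, e_traj, u_traj)
```

Evaluating both indices on logged trajectories makes the trade-off of Table 2 explicit: a controller is only preferable if it lowers e_0 without a significant increase in J_u.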

V. CONCLUSION
In this paper, disturbance rejection-based optimal tracking control, comprising a disturbance rejection controller and an optimal regulation law, is proposed for perturbed nonlinear systems by utilizing the constructed ESO and the proposed ADP technique. Therein, the constructed ESO well estimates the extraneous disturbance of the system, so that the established disturbance rejection controller ensures satisfactory disturbance compensation. Then, an optimal regulation law is designed to trade off the control input and tracking performance by minimizing the preassigned cost function. It should be pointed out that a critic-only NN is used to approximate the value function in the suggested framework, with a novel adaptive rule for updating the weights, which has less computation and better control performance than existing ADP schemes with critic-actor NNs. Additionally, the convergence of the weights is assured in finite time with the proposed improved update rule, and the tracking errors are guaranteed to be UUB; the investigated optimal regulation law approaches the optimal controller. The simulations show the effectiveness and advantages on a practical example. Noting that the proposed scheme is time-triggered, we will employ the event-triggered mechanisms of [42] and [43] to investigate event-triggered optimal tracking control in future work.