Intelligent adaptive optimal control using incremental model-based global dual heuristic programming subject to partial observability

The scarcity of information regarding dynamics and full-state feedback increases the demand for a model-free control technique that can cope with partial observability. To deal with the absence of prior knowledge of system dynamics and perfect measurements, this paper develops a novel intelligent control scheme by combining global dual heuristic programming with an incremental model-based identifier. An augmented system consisting of the unknown nonlinear plant and unknown varying references is identified online using a locally linear regression technique. The actor–critic is implemented using artificial neural networks, and the actuator saturation constraint is addressed by exploiting a symmetrical sigmoid activation function in the output layer of the actor network. Numerical experiments are conducted by applying the proposed method to online adaptive optimal control tasks of an aerospace system. The results reveal that the developed method can deal with partial observability with performance comparable to the full-state feedback control, while outperforming the global model-based method in stability and adaptability.


Introduction
Conventional controller design for aerospace systems is commonly based on known linearized models at different equilibrium or trimmed conditions and on PID controllers with human-scheduled gains to cover the complete operating envelope [1]. However, more complex demands such as optimization and adaptation have emerged recently, which cannot be tackled within this traditional scheme. The demand for optimization involves optimal control, in which the dynamic programming (DP) principle plays a fundamental role [2], and the Hamilton-Jacobi-Bellman (HJB) equation is often involved [3]. However, as a partial differential equation, the HJB equation is arduous to solve analytically due to its nonlinear nature. Besides, DP-based approaches are by nature offline, backwards-in-time planning approaches and generally require full knowledge of the system dynamics [4]. For complex systems, however, sometimes not only the internal dynamics but also the information needed to infer the internal states can be inaccessible, i.e., full-state feedback (FSF) is no longer available [5][6][7]. These factors prevent traditional optimal control methods from further applications. On the other hand, adaptive control is another focal point of aerospace systems control [1,8], which is generally considered a separate paradigm from optimal control [9]. Adaptive control concentrates on how the controller can adapt to uncertain system dynamics and to changing environments and tasks, and does not feature optimality as its paramount target. Both optimal control and adaptive control can be significant for aerospace systems. Therefore, the purpose of this paper is to develop an adaptive optimal control approach that improves the optimal tracking performance without known system dynamics or perfect measurements.
Reinforcement learning (RL) is a class of bio-inspired artificial intelligence techniques, by which the agent improves its policy to maximize the received reward (or minimize the penalty) during interaction with the environment [10]. From a theoretical point of view, RL is closely linked with adaptive optimal control methods [11,12]. A fruitful cross-fertilization of RL and control theory produces adaptive/approximate dynamic programming (ADP), whose essential goal lies in approximating the solutions of DP [13,14]. With two essential ingredients, the temporal difference (TD) error and value function approximation (VFA) [12], ADP is a class of effective approaches to deal with adaptive optimal control problems. ADP divides the learning process into two parts, namely policy evaluation and policy improvement, which, in comparison to conventional DP, enables the controller to be computed forward in time and makes online computation feasible. Linear ADP (LADP) is a widely used technique to deal with linear optimal control problems with a quadratic performance index function [5,6,15,16], for which an explicit solution can be constructed [17]. Nevertheless, relying on the assumption that the dynamic system is linear time-invariant (LTI), LADP is not suitable for dealing with nonlinear or time-varying systems [6]. What is more, LADP is restricted to a linear quadratic cost, i.e., x^T Qx + u^T Ru, where Q is a positive semi-definite weight matrix and R is a positive definite weight matrix. This prevents LADP from further applications with different demands, such as addressing input saturation constraints [18][19][20][21][22], or relaxing the input constraints [7,[23][24][25], i.e., allowing R to be only positive semi-definite.
As an expansion of LADP, adaptive critic designs (ACDs) break the linear quadratic constraint that exists in LADP, and have demonstrated impressive success in adaptive optimal control problems [26,27]. ACDs normally exploit artificial neural networks (ANNs) to approximate the evaluation (critic) and improvement (actor) of the control policy, and consequently they can be applied to nonlinear system control problems with complicated rewards. Based on the information utilized by the critic network, ACDs are generally categorized as heuristic dynamic programming (HDP), dual heuristic programming (DHP), and global dual heuristic programming (GDHP) [28]. Among them, GDHP combines the information used by HDP and DHP and thus takes advantage of both methods [24,25,[29][30][31]. There are several structures of the critic network for GDHP [28]; the most widely used is the straightforward form that approximates the performance index function and its derivatives simultaneously [24,30,31]. However, this structure can introduce undesired inconsistent errors, so this paper employs the explicit analytical calculations derived in [25] to eliminate these inconsistent errors.
Directly learning from unknown real systems usually requires many trials or episodes [27] and may even lead to misconvergence or divergence in some extreme cases [32]. Therefore, for complex and delicate systems, such as aerospace systems, information about the state transition is required. For instance, for a time-invariant affine system, given explicit information of the control effectiveness, a convergent control policy can be generated based on some assumptions [21,33]. However, sometimes the system dynamics are completely unknown. Consequently, an extra structure, such as an ANN, is introduced to approximate the system model in some literature [29,31,[34][35][36]. Because training ANNs requires considerable effort before the parameters converge, the model network is trained offline and kept unchanged for online application in [29,31,35], which lacks the capability to adapt if the system changes, whereas in [34,36], information on part of the system dynamics is still required for online identification.
To tackle the limits of learning global system models and to achieve fast online adaptation, an incremental model is introduced in this paper. Based on a local linear regression (LLR) technique [37], the incremental model only approximates the local dynamics of the original nonlinear system instead of the global model, on the assumption of a sufficiently high sampling frequency [25]. The incremental technique has been successfully combined with various classic control methods to obtain adaptive nonlinear control approaches, such as incremental nonlinear dynamic inversion (INDI) [38] and incremental sliding mode control (ISMC) [39,40]. These approaches have shown success in reducing model dependency and improving fault tolerance, but do not address optimality. On the other hand, the synthesis of incremental techniques and ACDs leads to incremental model-based adaptive critic designs (IACDs) [23][24][25]. These approaches have been applied to several flight control problems and performed well in generating adaptive optimal controllers with FSF. Nevertheless, real applications are often more complex and FSF can be unrealistic, which results in a partial observability (PO) problem. According to [6,41], the methods coping with deterministic systems and measurements are often regarded as output feedback methods [5,7,9,16,36,42,43], whereas if stochastic time-varying dynamics are involved, the control problems are linked with partially observable Markov decision processes (POMDPs) [6,41,44,45]. PO often occurs in aerospace systems, whose internal dynamics can be difficult to obtain and may be time-varying or stochastic, such as liquid sloshing in spacecraft with fuel tanks, infrared camera tracking with unpredictable target maneuvering, and unforeseen damage to aircraft structures changing the system dynamics suddenly [6,41].
In [6], the incremental model is applied for the first time to a flight control problem in which only the tracking errors are directly measurable, by improving LADP; the approach is extended by combining it with HDP in [41]. However, HDP has shown inferiority in convergence speed and control precision compared to DHP and GDHP [23,25]. In addition, the convergence of the identification technique has not been analyzed in any of the existing literature adopting the incremental model.
The main contribution of this paper is dealing with the PO condition in adaptive optimal tracking control of unknown nonlinear systems by introducing an augmented incremental model into the GDHP algorithm, such that an incremental model-based GDHP (IGDHP) approach is developed. The principal advantage of the IGDHP approach lies in that the incremental model accelerates the online policy learning without knowing the global system dynamics or training a model network offline, which allows for quick adaptation to system changes. Although some previous works are based on the incremental model [5][6][7][23][24][25]29,41], this paper is the first to analyze the convergence of the identification technique and to achieve a 100% success ratio. The output layer of the actor network exploits a symmetrical sigmoid activation function, scaled by a fixed bound vector, to satisfy the demands for tackling input saturation constraints. Different from [25], this paper focuses on the PO situation, and improves the previous IGDHP approach with an augmented incremental model so as to deal with the unavailability of information on the internal system states and the unknown time-varying reference. The present research aims at bridging the gap between the discussed algorithms and real-world systems, by taking more realistic application scenarios into consideration for verification, including sensor noise, fault-tolerant tasks, parameter variations, load disturbances, and combination with other controllers in higher-level control.
The remainder of this paper is structured as follows. Section 2 presents the basic formulation of the continuous optimal tracking control problem subject to input constraints. In Section 3, the incremental technique is introduced for online identification in both FSF and PO conditions. Section 4 presents the IGDHP algorithm with explicit analytical calculations and addresses the input constraints via the actor network. Then Section 5 verifies the developed IGDHP method by applying it to various control problems of an aerospace system. Finally, the conclusion and future research are presented in Section 6.

Problem statement
Consider a nonlinear continuous system described by:

ẋ(t) = f[x(t), u(t)], (1)

where x(t) ∈ R^n and u(t) ∈ R^m are the state vector and control vector, respectively, and f[x(t), u(t)] ∈ R^n describes the physical evolution of the state vector over time. Assume that f is Lipschitz continuous on a set Ω_s ⊂ R^n and that the system (1) is controllable on Ω_s.
The output of the nonlinear system is represented as:

y(t) = h[x(t)], (2)

where y(t) ∈ R^p and h[x(t)] ∈ R^p is the Lipschitz continuous output function. The system is also assumed to be observable.
The problem investigated in this study falls within the framework of optimal tracking control, so the objective of the controller is to minimize the tracking error between the system output y(t) and the reference trajectory y_ref(t), which is defined as:

e(t) = y(t) − y_ref(t), (3)

where e(t) ∈ R^p and y_ref(t) ∈ R^p.
In the ADP scheme, the performance index function, also called cost-to-go, of the optimal tracking control problem is usually presented as the cumulative sum of future costs from any initial time t:

J(e(t), u(t : ∞)) = ∫_t^∞ γ^(τ−t) r(e(τ), u(τ)) dτ, (4)

where u(t : ∞) = {u(τ) : t ≤ τ < ∞} denotes the system control produced by the control law µ(e(τ)) ∈ R^m from time instant t to ∞, r(e(t), u(t)) ∈ R denotes the cost at time instant t, and γ ∈ (0, 1] is the discount factor that indicates the extent to which the short-term or long-term cost is emphasized. For simplicity, J(e(t), u(t : ∞)) is denoted by J(t) and r(e(t), u(t)) by r(t) in the following.
Input constraints are taken into account in this paper, which cannot be tackled merely by the linear quadratic cost. A non-quadratic functional is employed in [18,20,22] for regulation optimal control problems with input constraints. Nevertheless, this non-quadratic functional can considerably increase the complexity of the GDHP technique, since the backpropagation processes need to compute its partial derivatives. Moreover, in the existing standard solution to optimal tracking control problems, a transformation is conducted with the aid of a desired control input u_d(t) to build a regulation optimal control formulation concerning the tracking error e(t) and the feedback input u_e(t) = u(t) − u_d(t). However, as claimed in [19], it is impossible to encode the input constraints into this new control problem simply by a non-quadratic functional, since only the feedback part of the control input, u_e(t), can be directly obtained by minimizing the performance function. Therefore, a saturation function is directly imposed upon the control commands to satisfy the input constraints, which will be addressed by modifying the structure of the actor network in Section 4. In this way, the tracking problem is transformed into a regular optimal control problem subject to input constraints.
Based on the TD technique [12], the cost-to-go can also be represented as:

J(t) = ∫_t^{t+T} γ^(τ−t) r(e(τ), u(τ)) dτ + γ^T J(t + T), (5)

where T > 0 is a time horizon. According to Bellman's optimality principle, the optimal cost-to-go is given as:

J*(t) = min_{u(t:t+T)} [ ∫_t^{t+T} γ^(τ−t) r(e(τ), u(τ)) dτ + γ^T J*(t + T) ], (6)

where •* stands for the optimal value of •. Therefore, the optimal control law can be expressed as:

µ*(e(t)) = arg min_{u(t:t+T)} [ ∫_t^{t+T} γ^(τ−t) r(e(τ), u(τ)) dτ + γ^T J*(t + T) ]. (7)

For nonlinear systems, the solution of Eq. (6) is usually intractable to obtain analytically. Therefore, an IGDHP algorithm is introduced to iteratively solve this optimal control problem.

Incremental model implementation
The IGDHP algorithm requires the information of the cost at the next time instant, so the predictability of the system states is significant. This paper considers a PO situation, where, although the system is observable, the only measurement is the tracking error, and even the reference can be unknown and changing. This scenario can occur in real aerospace control applications. For instance, the docking sensors for an automated transfer vehicle and the International Space Station are infrared cameras, which only measure the relative distance and angles between them as the navigation information [6]. Therefore, it is desired to build a new module that provides a mapping from the system input to the observation, which will be dealt with using the incremental model in this section.

Incremental model with FSF
The derivation of the incremental model starts from the FSF condition. For all incremental techniques, the following assumption is a prerequisite:

Assumption 1. The sampling frequency is sufficiently high, i.e., the sampling time ∆t is sufficiently small, and the system dynamics are relatively slow time-varying.

Remark 1.
Assumption 1 has two important implications. Firstly, a discrete incremental model can represent the continuous nonlinear plant with sufficiently high precision. Secondly, the discretization does not change the properties of the original system, including controllability and observability.
It is assumed that the system is first-order continuous with respect to time around time instant t − ∆t (denoted by t₀). Then, taking the first-order Taylor series expansion, the system dynamics of Eq. (1) around time instant t₀ can approximately be linearized as follows:

ẋ(t) ≈ ẋ(t₀) + F(t₀)∆x(t) + G(t₀)∆u(t) + O((∆x(t))², (∆u(t))²), (8)

where F(t₀) = ∂f/∂x |_{x(t₀),u(t₀)} ∈ R^{n×n} and G(t₀) = ∂f/∂u |_{x(t₀),u(t₀)} ∈ R^{n×m} are the local system and control effectiveness matrices, ∆x(t) = x(t) − x(t₀), ∆u(t) = u(t) − u(t₀), and O((∆x(t))², (∆u(t))²) collects the higher-order terms. For ADP-based methods, the control policy cannot be determined in advance, and therefore the higher-order term can behave as a perturbation affecting the closed-loop performance. Nevertheless, as claimed in [39,46], the higher-order term satisfies:

lim_{∆t→0} ∥O((∆x(t))², (∆u(t))²)∥₂ = 0, (9)

which means the norm of the higher-order term is negligible given a sufficiently high sampling frequency. Eq. (9) also indicates that ∀ Ō > 0, ∃ ∆t̄ > 0, such that for all 0 < ∆t ≤ ∆t̄, ∀x ∈ R^n, ∀u ∈ R^m, ∀t ≥ t₀, ∥O((∆x(t))², (∆u(t))²)∥₂ ≤ Ō, i.e., there exists a ∆t that guarantees the boundedness of the higher-order term, and the bound can be further diminished by increasing the sampling frequency. Besides, the LLR technique is adopted, so the linearization errors do not accumulate but only affect the local system identification. Furthermore, real-world experiments, including ground robots [38] and aerospace systems [40,47], have been successfully carried out based on this linearization process. Therefore, in the following, the higher-order term is omitted for the convenience of controller design.
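As a minimal numerical illustration of this first-order linearization (the scalar plant f below is hypothetical, not one of the aerospace models used later), the incremental prediction F∆x + G∆u matches the true change of the dynamics up to a higher-order residual:

```python
import numpy as np

# Illustrative only: a hypothetical scalar nonlinear plant x_dot = f(x, u).
def f(x, u):
    return -np.sin(x) + 0.5 * u

x0, u0 = 0.3, 0.1               # operating point at t0
F = -np.cos(x0)                 # local Jacobian df/dx at (x0, u0)
G = 0.5                         # local Jacobian df/du at (x0, u0)

dx, du = 0.002, 0.001           # small increments (high sampling frequency)
true_increment = f(x0 + dx, u0 + du) - f(x0, u0)
incremental_model = F * dx + G * du
# The gap between the two is the omitted higher-order term, of order dx^2.
print(abs(true_increment - incremental_model))
```

The residual shrinks quadratically as the increments shrink, which is what justifies dropping the higher-order term under Assumption 1.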
Assuming the states and state derivatives of the system are measurable, i.e., ∆ẋ(t), ∆x(t), ∆u(t) are measurable, an incremental model can be utilized to describe the system (8):

∆ẋ(t) ≈ F(t₀)∆x(t) + G(t₀)∆u(t). (10)

Despite the fact that physical systems are usually continuous, modern processors work in a discrete way, leading to discrete measurements and computations [24,25]. Consequently, given a sufficiently small sampling time ∆t, based on Assumption 1, the plant model (10) can be represented approximately in a discrete form:

x_{t+1} ≈ x_t + F_{t−1}∆x_t + G_{t−1}∆u_t, (11)

in which the subscript t stands for the current sampling time instant, and F_{t−1} = I_n + ∆t · ∂f/∂x |_{x_{t−1},u_{t−1}} ∈ R^{n×n} and G_{t−1} = ∆t · ∂f/∂u |_{x_{t−1},u_{t−1}} ∈ R^{n×m} denote the system transition matrix and the input distribution matrix at time instant t − 1 for the discretized system, respectively, where I_n denotes an identity matrix and the subscript n shows its dimension. From Eq. (11), the following incremental form of the new discrete system can be obtained:

∆x_{t+1} ≈ F_{t−1}∆x_t + G_{t−1}∆u_t, (12)

where ∆x_t = x_t − x_{t−1} and ∆u_t = u_t − u_{t−1}.
In the FSF situation, the matrices F_{t−1} and G_{t−1} can be identified online with a recursive least squares (RLS) algorithm [25], and each update only requires the latest data.

Augmented incremental model
This subsection focuses on the construction of the local incremental model using tracking-error and input measurements based on the augmented state.
Considering Eq. (2), the output of the system around time instant t₀ can be linearized with a Taylor expansion:

y(t) ≈ y(t₀) + ∂h[x(t)]/∂x(t) |_{x(t₀)} [x(t) − x(t₀)], (13)

where ∂h[x(t)]/∂x(t) |_{x(t₀)} ∈ R^{p×n} denotes the observation matrix. Given a sampling time ∆t, the incremental dynamics of the system output can be written as:

∆y_{t+1} ≈ H_t ∆x_{t+1}, (14)

in which H_t = ∂h(x)/∂x |_{x_t} ∈ R^{p×n} denotes the discrete observation matrix. It has been shown that, if a nonlinear system is completely observable with its output, then the system can still be regarded as deterministic [5,6], suggesting that the unmeasurable internal states can be reconstructed from adequate observations to provide transition information [7].

Lemma 1. Given the measured input/output data over a long-enough time horizon [t − N + 1, t], N ≥ n/p, the output increment ∆y_{t+1} can uniquely be determined as follows:

∆y_{t+1} = F_t ∆y_{t,N} + G_t ∆u_{t,N}, (15)

where F_t ∈ R^{p×Np} denotes the extended discrete system transition matrix, G_t ∈ R^{p×Nm} denotes the extended discrete input distribution matrix, and ∆u_{t,N} and ∆y_{t,N} are the stacked input/output data of the N previous steps, respectively.
Proof. Based on Assumption 1, the new discrete system described by Eqs. (12) and (14) is observable. Then the detailed proof can be found in [5,15] and is omitted here.
If the reference signal is slowly varying in comparison to the system dynamics, then in the time horizon [t − N + 1, t] the increment of the reference signal can be ignored, i.e., ∆y_t ≈ ∆e_t. Accordingly, considering Eqs. (3) and (15), the output tracking error at the next time instant can be written as:

e_{t+1} = e_t + ∆y_{t+1} − ∆y_{ref,t+1} ≈ e_t + F_t ∆e_{t,N} + G_t ∆u_{t,N}, (16)

where e_{t+1} ∈ R^p and y_{ref,t+1} ∈ R^p. However, it is impossible to directly identify the matrices F_t and G_t, since the reference is unknown; put another way, the system output cannot be measured separately. Furthermore, the last approximation in Eq. (16) relies on the assumption that the reference remains constant within the time horizon [t − N + 1, t], which can be invalid in numerous scenarios.
Consequently, a more general situation corresponding to a POMDP is taken into account, and the following assumption is given:

Assumption 2. The bandwidth of the reference signal is comparable with that of the system dynamics, and the dynamics of the reference signal can be represented as:

ẏ_ref(t) = f_ref[y_ref(t)], (17)

where f_ref is Lipschitz continuous on a set Ω_r ⊂ R^p, and differentiable almost everywhere except for finitely many isolated points.
The reference signal is often independent of the system output, while in some other cases the reference can partially be determined by the system output, such as moving targets equipped with anti-tracking systems [6]. Eq. (17) provides a general reference description that can also be expressed as a time-based function, as long as the reference signal is continuous and piecewise differentiable. Similar to Eq. (12), the reference signal can be represented in a discrete incremental form by Taylor expansion and discretization:

∆y_{ref,t+1} ≈ F_{ref,t−1} ∆y_{ref,t}, (18)

where F_{ref,t−1} ∈ R^{p×p} denotes the discrete reference transition matrix at time instant t − 1. Accordingly, an augmented system that consists of the system state and reference dynamics can be constructed by combining the system representation, Eqs. (12) and (14), and the reference representation, Eq. (18) [16].
Define the augmented state z_t = [x_t^T, y_{ref,t}^T]^T ∈ R^{n+p}, and then the following augmented system can be obtained:

∆z_{t+1} = F_{z,t−1} ∆z_t + G_{z,t−1} ∆u_t, (19)

and

∆e_{t+1} = H_{z,t} ∆z_{t+1}, (20)

where F_{z,t−1} = diag(F_{t−1}, F_{ref,t−1}), G_{z,t−1} = [G_{t−1}^T, 0]^T, and H_{z,t} = [H_t, −I_p]. Hence, given the current time instant t, the increment of the system state and output reference can uniquely be represented by the historical data as an augmented state equation (21). Similarly, the tracking error can be represented by the previous data (22). The vectors ∆e_{t,M} and ∆u_{t,M} are the increments of the observation and input sequences over the time interval [t − M + 1, t], which represent the available measured data. Then the following lemma is given:

Lemma 2. Given the measured input/observation data over a long-enough time horizon [t − M + 1, t], with M no smaller than the observability index K of the augmented system, the augmented state increment ∆z_{t+1} can uniquely be determined from ∆e_{t,M} and ∆u_{t,M}.

Proof. Since the augmented system is observable, there exists a K, the observability index, such that rank(V_M) < n + p for M < K, and rank(V_M) = n + p for M ≥ K. Therefore, let M ≥ K; then there exists a matrix M̄ ∈ R^{(n+p)×pM} such that:

M̄ V̄_M = I_{n+p}. (23)

Since V̄_M has full column rank, its left inverse V̄_M^left is given by:

V̄_M^left = (V̄_M^T V̄_M)^{−1} V̄_M^T, (24)

and then

M̄ = M̄_0 + M̄_1 = V̄_M^left + Z(I_{pM} − V̄_M V̄_M^left) (25)

holds for any matrix Z, with M̄_0 denoting the minimum-norm solution and M̄_1 being the projection onto the range perpendicular to V̄_M [15].
Note, however, that M̄_1 V̄_M = 0, so that M̄ V̄_M ∆z_{t−M+1} = M̄_0 V̄_M ∆z_{t−M+1}. Applying M̄_1 to Eq. (22) shows that the reconstruction holds independently of M̄_1. Then from Eq. (21), ∆z_{t+1} can be obtained accordingly. This result expresses the increment of the system state and reference ∆z_{t+1} in terms of the inputs and observations from time instant t − M + 1 to time instant t, which ends the proof.
Lemma 2 provides a deterministic relationship between the historical data and future states. To build a direct mapping from the historical observations and inputs to the future observations, regardless of the internal states, the following theorem is presented based on Lemma 2:

Theorem 1. Let the augmented system described by Eqs. (19) and (20) be observable, and let the measured data ∆e_{t,M} and ∆u_{t,M} be given over the time horizon [t − M + 1, t] with M ≥ K. Then the tracking error increment ∆e_{t+1} can uniquely be determined as:

∆e_{t+1} = F_t ∆e_{t,M} + G_t ∆u_{t,M}, (31)

where F_t ∈ R^{p×Mp} and G_t ∈ R^{p×Mm} denote the extended discrete transition and input distribution matrices of the augmented system.

Proof. Substitute Eq. (30) into Eq. (20), and the dynamics of the measurement, Eq. (31), can directly be obtained. This completes the proof.
Theorem 1 has a similar form to Lemma 1 but includes the reference signal in its representation, which enables the incremental model to predict the tracking error without knowing the reference function. The matrices F_t and G_t in Eq. (31) can be identified using the RLS algorithm, and then the one-step prediction of the tracking error can be made as:

ê_{t+1} = e_t + F̂_{11,t}∆e_t + F̂_{12,t}∆e_{t−1,M−1} + Ĝ_{11,t}∆u_t + Ĝ_{12,t}∆u_{t−1,M−1}, (32)

where •̂ stands for the estimated or approximated value, F̂_{11,t} ∈ R^{p×p} and F̂_{12,t} ∈ R^{p×(M−1)p} are partitioned matrices from F̂_t, and Ĝ_{11,t} ∈ R^{p×m} and Ĝ_{12,t} ∈ R^{p×(M−1)m} are partitioned matrices from Ĝ_t. In this way, the original continuous non-affine system is transformed approximately into a new discrete affine system, based on which the IGDHP algorithm can design the control increment ∆u_t.
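A shape-level sketch of this prediction step may help; the dimensions p, m, M, the random "identified" matrices, and the history vectors below are illustrative placeholders, not values from the paper's experiments:

```python
import numpy as np

# Sketch of the one-step tracking-error prediction with partitioned matrices.
p, m, M = 2, 1, 3                     # outputs, inputs, data-window width
rng = np.random.default_rng(0)
F_hat = rng.normal(size=(p, M * p))   # stand-in for the identified F matrix
G_hat = rng.normal(size=(p, M * m))   # stand-in for the identified G matrix

de_hist = rng.normal(size=M * p)      # stacked error increments over the window
du_hist = rng.normal(size=M * m)      # stacked input increments over the window
e_t = rng.normal(size=p)              # current tracking error

# One-step prediction: e_hat_{t+1} = e_t + F_hat @ de_hist + G_hat @ du_hist
e_next = e_t + F_hat @ de_hist + G_hat @ du_hist

# Affine-in-du_t structure used by the controller: the first m columns of
# G_hat multiply the yet-to-be-chosen control increment du_t.
G11 = G_hat[:, :m]
print(e_next.shape, G11.shape)
```

The partition G11 is what makes the transformed system affine in the new control increment, which the IGDHP backpropagation exploits.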

Online identification with RLS algorithm
An RLS algorithm is applied online to the pending matrices F_t and G_t [6,25]. For convenience, Eq. (31) is represented in a row-by-row form as follows:

∆e_{t+1}^T = x_t^T Θ_t, (33)

with x_t = [∆e_{t,M}^T, ∆u_{t,M}^T]^T ∈ R^{(p+m)M} as the input information of the augmented incremental model identification, and Θ_t = [F_t, G_t]^T ∈ R^{M(p+m)×p} as the pending augmented matrix to be determined using the RLS algorithm. A sliding window technique is employed to store sufficient historical data for online identification [7,37]. Considering the demands for fast computation, identification and adaptation, the width of the data window should be as small as possible with guaranteed accuracy. Consequently, according to Lemma 2 and Theorem 1, let M = (n + p)/p.
The main procedure of the RLS algorithm is presented as follows [24,48]:

∆ê_{t+1}^T = x_t^T Θ̂_{t−1}, (35)
ϵ_t = ∆e_{t+1} − ∆ê_{t+1}, (36)
Θ̂_t = Θ̂_{t−1} + (Cov_{t−1} x_t)/(γ_RLS + x_t^T Cov_{t−1} x_t) ϵ_t^T, (37)
Cov_t = (1/γ_RLS) [Cov_{t−1} − (Cov_{t−1} x_t x_t^T Cov_{t−1})/(γ_RLS + x_t^T Cov_{t−1} x_t)], (38)

where ϵ_t ∈ R^p denotes the prediction error, Cov_t ∈ R^{(p+m)M×(p+m)M} stands for the estimation covariance matrix, which is symmetric and positive definite, and γ_RLS ∈ (0, 1] is the forgetting factor of the RLS algorithm. It is noted that ∆ê_{t+1} is computed with Θ̂_{t−1} during the implementation, because Θ̂_t is obtained by the RLS algorithm only after ∆e_{t+1} is observed. Assumption 1 implies that within a certain horizon A = [1, t], M ≤ t ≤ P, M ≪ P < ∞, the slowly varying augmented system dynamics can be regarded as a linear plant with constant pending parameters. Hence, based on the following assumption [48], the locally approximate convergence of the RLS algorithm is analyzed:

Assumption 3. Over the horizon A, the observations can be modeled as:

∆e_{t+1}^T = x_t^T Θ + e_{o,t}^T, (39)

where e_{o,t} is the equivalent plant noise, independent of the samples x_t.

Theorem 2. If Assumptions 1-3 hold, and the RLS algorithm is implemented using Eqs. (35)-(38), the approximate augmented matrix Θ̂_t tends to converge to the locally optimal matrix Θ.
Proof. Because the optimal augmented matrix Θ is valid over A, the previous observations can uniformly be written in the stacked form of Eq. (39), with X_t = [x_M, . . . , x_t] collecting the previous samples. The persistent excitation (PE) condition is assumed for the convergence analysis, which guarantees that X_t X_t^T is positive definite. According to [48], it can be obtained that Cov_t^{−1} = X_t Γ_t X_t^T, where Γ_t = diag([γ_RLS^{t−M}, . . . , 1]), and diag(•) reshapes a vector to a diagonal matrix. Therefore, the approximate augmented matrix Θ̂_t can be represented as:

Θ̂_t = Θ + Θ̃_t, (41)

where Θ̃_t is the approximation error. Define the approximation error correlation matrix as:

L̄_t = E(Θ̃_t Θ̃_t^T), (42)

where E(•) is the expectation operator. Substituting Eq. (41) into Eq. (42), and noticing that both Cov_t and Γ_t are symmetric matrices, we can obtain Eq. (43). Recalling the independence of e_{o,t} and x_t, and the white noise property of e_{o,t}, yields Eq. (44), where σ_o² is the variance of e_{o,t}. Rigorous evaluation of Eq. (44) is intractable. Hence, Assumption 3 is utilized to facilitate an approximate evaluation of L̄_t [48].
It can be found that Cov_t^{−1} is a weighted sum of the outer products x_t x_t^T, . . . , x_M x_M^T. Therefore, based on Assumption 3, the following approximation holds:

Cov_t^{−1} ≈ (1 − γ_RLS^{t−M+1})/(1 − γ_RLS) E_o, (45)

where E_o = E(x_t x_t^T) is the correlation matrix of the observations. If the PE condition is satisfied, x_t x_t^T is positive definite and E_o^{−1} can be expected to exist.
Substituting Eq. (45) into Eq. (44) yields Eq. (46). In the steady state, i.e., t → P → ∞, the following holds:

L̄_P ≈ σ_o² (1 − γ_RLS)/(1 + γ_RLS) E_o^{−1}. (47)

It can be found that if γ_RLS is very close to 1, then L̄_P → 0, which means the approximate augmented matrix converges to the optimal matrix. This completes the proof.
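The identification procedure above can be sketched as a standard RLS recursion with a forgetting factor; the regressor dimension, the synthetic data, and the noiseless setting below are illustrative, and make the convergence claimed in Theorem 2 easy to observe:

```python
import numpy as np

# Minimal RLS sketch with forgetting factor for one row of the augmented matrix.
rng = np.random.default_rng(1)
d = 6                                  # regressor dimension, (p+m)*M in the text
theta_true = rng.normal(size=d)        # one row of the "true" augmented matrix
theta_hat = np.zeros(d)
cov = np.eye(d) * 100.0                # large initial covariance
gamma_rls = 0.99                       # forgetting factor in (0, 1]

for _ in range(300):
    x = rng.normal(size=d)             # regressor [de_hist; du_hist] (synthetic)
    y = x @ theta_true                 # noiseless scalar observation
    eps = y - x @ theta_hat            # prediction error
    k = cov @ x / (gamma_rls + x @ cov @ x)          # gain vector
    theta_hat = theta_hat + k * eps                  # parameter update
    cov = (cov - np.outer(k, x @ cov)) / gamma_rls   # covariance update

print(np.linalg.norm(theta_hat - theta_true))  # near zero for noiseless data
```

With persistently exciting (here random) regressors and no noise, the estimate converges quickly; with noise, a forgetting factor close to 1 trades adaptation speed against the steady-state error of Eq. (47).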

The IGDHP algorithm
Since the incremental model identifies the system dynamics discretely, it is also necessary to design the controller in a discrete manner. The optimal cost-to-go (6) and optimal control law (7), which are presented in the continuous domain, have similar discrete representations [12]. Letting the time horizon in Eqs. (6) and (7) be equivalent to the sampling time, i.e., T = ∆t, we can discretize Eqs. (6) and (7) as:

J*(e_t) = min_{u_t} [ r_t + γ J*(e_{t+1}) ], (48)

and

µ*(e_t) = arg min_{u_t} [ r_t + γ J*(e_{t+1}) ], (49)

where r_t is the one-step cost function of the discrete-time design and can still be formulated in a linear quadratic form:

r_t = e_t^T Q e_t + u_t^T R u_t, (50)

where both Q ∈ R^{p×p} and R ∈ R^{m×m} are positive semi-definite.
The weight matrix R is used to penalize the energy cost; note that it does not have to be positive definite. With the incremental model technique, the IGDHP algorithm can iteratively solve this discrete-time optimal control problem with an actor-critic scheme. Based on the current information, the actor network generates control inputs for both the real system and the plant model. The incremental model predicts the tracking errors at the next time instant, which are utilized by the critic network to approximate the cost-to-go, whose derivatives are computed analytically. The structure of the IGDHP algorithm is illustrated in Fig. 1. For simplicity, the ANNs introduced in both the critic and actor networks are fully connected and feedforward, and consist of only three layers of nodes: an input layer, a hidden layer and an output layer. The activation function employed in the input layer is a unit-proportion linear function, and that in the hidden layer is a symmetrical sigmoid function, denoted by σ. In the following detailed implementations, the variables or pathways corresponding to the critic network, the actor network and the incremental model are denoted by the subscripts c, a, and m, respectively.
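As a small numerical sketch of the quadratic one-step cost and the discrete TD target r_t + γJ_{t+1} (all matrices and values below are illustrative, not taken from the experiments):

```python
import numpy as np

# Quadratic one-step cost r_t = e^T Q e + u^T R u and the discrete TD target.
Q = np.diag([1.0, 0.5])      # positive semi-definite output-error weight
R = np.array([[0.01]])       # positive semi-definite input weight
gamma = 0.9                  # discount factor

e_t = np.array([0.2, -0.1])  # current tracking error
u_t = np.array([0.5])        # current control input

r_t = e_t @ Q @ e_t + u_t @ R @ u_t   # linear quadratic one-step cost
J_next = 1.3                           # placeholder critic estimate of J_{t+1}
target = r_t + gamma * J_next          # TD target used to train the critic
print(r_t, target)
```

Setting R to zero is allowed here: the control still influences future tracking errors through the incremental model, so it remains penalized indirectly.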

The critic network
Although the one-step cost function consists of e_t and u_t, only e_t is set as the input of the critic network. The main reason is that introducing u_t as an input would significantly increase the complexity of the backpropagation. Besides, u_t is also derived from e_t by the actor network, and it can still affect the approximation indirectly through the cost function. There is no bias term in the critic input, because the critic output is desired to be 0 when e_t is a zero vector in this case.
In the structure with explicit analytical calculations, the critic network only directly outputs the approximated cost-to-go Ĵ_t, and its output layer utilizes a unit-proportion linear activation function:

Ĵ_t = w_{c2,t}^T σ(w_{c1,t}^T e_t), (51)

where w_{c1,t} and w_{c2,t} denote the critic weights between the input layer and the hidden layer, and between the hidden layer and the output layer, respectively, and σ(•) denotes the activation function of the hidden layer. Then explicit analytical calculations [25] are carried out to compute λ̂_t directly from Ĵ_t:

λ̂_t = ∂Ĵ_t/∂e_t = w_{c1,t} [σ′(w_{c1,t}^T e_t) ⊙ w_{c2,t}], (52)

where ⊙ is the Hadamard product, and σ′(•) denotes the first-order derivative of the function σ(•).
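A minimal sketch of this explicit analytical calculation, with tanh standing in for the symmetrical sigmoid and random placeholder weights, cross-checked against finite differences:

```python
import numpy as np

# Critic that outputs only J_hat; lambda_hat = dJ_hat/de is computed in
# closed form from the same weights, avoiding a second, inconsistent output.
rng = np.random.default_rng(2)
p, n_c = 2, 8                        # error dimension, hidden neurons
w_c1 = rng.normal(size=(p, n_c))     # input-to-hidden weights (placeholders)
w_c2 = rng.normal(size=n_c)          # hidden-to-output weights (placeholders)

def critic(e):
    z = w_c1.T @ e                   # hidden pre-activation
    J_hat = w_c2 @ np.tanh(z)        # linear output layer
    lam = w_c1 @ ((1 - np.tanh(z) ** 2) * w_c2)  # analytic dJ_hat/de
    return J_hat, lam

e = np.array([0.3, -0.2])
J_hat, lam = critic(e)

# Cross-check the analytic derivative with finite differences.
eps = 1e-6
fd = np.array([(critic(e + eps * np.eye(p)[i])[0] - J_hat) / eps
               for i in range(p)])
print(np.max(np.abs(lam - fd)))      # should be tiny
```

Because λ̂ is derived from the same weights that produce Ĵ, the two outputs can never contradict each other, which is the point of the explicit-calculation structure.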
A TD method is introduced to iteratively update the critic network [10]; its principle is to minimize the TD error, i.e., the error between the current and successive estimates of the state value. Similar to the transformation of the optimal cost-to-go, Eq. (5) can also be transformed into a discrete form:

J_t = r_t + γ J_{t+1}. (53)

Therefore the TD errors produced by the critic network are given as follows:

e_{c1,t} = Ĵ_t − r_t − γ Ĵ_{t+1}, (54)

and

e_{c2,t} = λ̂_t − ∂r_t/∂e_t − γ (∂ê_{t+1}/∂e_t)^T λ̂_{t+1}, (55)

where e_{c1,t} denotes the TD error of the estimated cost-to-go Ĵ_t with the current network weights, and e_{c2,t} denotes the TD error of the computed derivatives λ̂_t with the current network weights. It is noted that Ĵ_{t+1} and λ̂_{t+1} are predicted using the weights at time instant t.
The computation of the terms in Eq. (55) is worth analyzing. Considering Eq. (50), the second term on the right-hand side of Eq. (55) can be computed by:

∂r_t/∂e_t = 2Q e_t + 2 (∂u_t/∂e_t)^T R u_t, (56)

where ∂u_t/∂e_t goes through the actor network, which will be introduced in the next subsection. Then ∂ê_{t+1}/∂e_t requires careful handling, since there exist two pathways for e_t to influence ê_{t+1}. As shown in Fig. 1, one pathway goes through the incremental model directly (pathway 3.a), while the other first passes through the actor network and then goes through the incremental model (pathway 3.b). Then, IGDHP combines both kinds of TD errors in an overall error function E_{c,t}:

E_{c,t} = β (1/2) e_{c1,t}² + (1 − β) (1/2) e_{c2,t}^T e_{c2,t}, (58)

where β is a scalar within the range [0, 1]. If β = 1, the method becomes pure IHDP. If β = 0, the tuning of the weights merely depends on the TD error of the computed derivatives λ̂_t, and consequently it is equivalent to IDHP.
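The blending of the two TD errors can be sketched as follows; the exact form of E_c,t with the 1/2 factors is an assumption of this sketch, and the error values are made up:

```python
import numpy as np

# GDHP error blending: beta = 1 recovers IHDP (cost error only),
# beta = 0 recovers IDHP (derivative error only).
def overall_critic_error(e_c1, e_c2, beta):
    # e_c1: scalar TD error of J_hat; e_c2: vector TD error of lambda_hat
    return beta * 0.5 * e_c1 ** 2 + (1 - beta) * 0.5 * float(e_c2 @ e_c2)

e_c1 = 0.4
e_c2 = np.array([0.1, -0.3])
print(overall_critic_error(e_c1, e_c2, 1.0))   # pure IHDP (cost error only)
print(overall_critic_error(e_c1, e_c2, 0.0))   # pure IDHP (derivative error only)
print(overall_critic_error(e_c1, e_c2, 0.5))   # GDHP blend of both
```

Intermediate β values let the critic benefit from the smooth cost signal of HDP and the richer gradient signal of DHP simultaneously.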
Given a learning rate η_c, the critic weights w_{ci}, where i = 1, 2, are updated with a gradient-descent algorithm to minimize the overall error E_{c,t}:

w_{ci,t+1} = w_{ci,t} − η_c ∂E_{c,t}/∂w_{ci,t},

where

∂E_{c,t}/∂w_{ci,t} = β e_{c1,t} ∂Ĵ_t/∂w_{ci,t} + (1 − β) e_{c2,t}^T ∂λ̂_t/∂w_{ci,t},

in which ∂λ̂_t/∂w_{ci,t} represents the second-order mixed gradient of the estimated cost-to-go Ĵ_t; how to compute it is given without derivation in [25], in terms of the Kronecker product ⊗. For i = 1, the expression additionally involves the number of hidden-layer neurons n_c, a commutation matrix K of w_{c1,t}, and the vector reshaping function vec(·). Note that the tensor operation can reduce the dimensionality of a matrix, and therefore a dimensionality analysis is involved to reshape the results after computing ∂λ̂_t/∂w_{ci,t}.

The actor network
Although a single-critic network structure is utilized in [35], since the IGDHP algorithm is actually implemented in a discrete way, an actor network is required for approximating the optimal control. Safety is vital for real physical systems, and thus some restrictions are usually imposed on the system control. In this paper, the output layer of the actor network employs a symmetrical sigmoid activation function and is multiplied element-wise by a constant weight vector u_b = [u_{b1}, ..., u_{bm}]^T, where u_{bi} ≥ 0 for i = 1, ..., m, so that the system control u_t = [u_{1,t}, ..., u_{m,t}]^T output by the actor network is bounded by u_b, i.e., |u_{i,t}| < u_{bi} for i = 1, ..., m, as shown in Fig. 2. Consequently, the actor network is presented as:

u_t = u_b ⊙ σ_s(w_{a2,t} σ(w_{a1,t} [e_t^T, b_a]^T)),    (64)

where σ_s(·) denotes the symmetrical sigmoid activation function of the output layer, b_a is a constant bias term, which is introduced because the system control may not be a zero vector given a zero tracking error, and w_{a1,t} and w_{a2,t} are the weights of the actor network, defined similarly to w_{c1,t} and w_{c2,t}. The purpose of the actor network is to generate a near-optimal control policy that minimizes the future approximated cost-to-go Ĵ_{t+1}:

E_{a,t} = ½ e_{a,t}^2,    (65)

where E_{a,t} denotes the overall actor error and e_{a,t} stands for the error between the approximated future cost-to-go Ĵ_{t+1} and the target cost-to-go of 0, i.e., e_{a,t} = Ĵ_{t+1}. The system control u_t is also an input of the incremental model, so even though u_t does not appear in the reward function, i.e., R is zero, it has an influence on the critic output at the next time instant. Therefore, a gradient-descent algorithm can be implemented to iteratively solve Eq. (65) along the 4th pathway, starting from Ĵ_{t+1} through ê_{t+1} to u_t. Different from the straightforward form, whose back-propagation pathways for the actor network can start from either Ĵ_{t+1} or λ̂_{t+1}, there is only one back-propagation pathway for IGDHP with explicit analytical calculations to update the actor weights.
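A minimal sketch of such a bounded actor output, assuming tanh as the symmetrical sigmoid and hypothetical layer sizes and bound u_b (the paper's actual values differ), could look like:

```python
import numpy as np

u_b = np.array([25.0])  # hypothetical saturation bound (deg), e.g. an elevator limit

def actor_forward(e, w_a1, w_a2, b_a=0.01):
    """Actor output bounded by u_b through a symmetrical sigmoid (tanh)
    output layer: u = u_b * tanh(w_a2 sigma(w_a1 [e; b_a])),
    so |u_i| < u_b_i for any input and any weights."""
    x = np.concatenate([e, [b_a]])   # constant bias term appended to the input
    h = np.tanh(w_a1 @ x)            # hidden layer
    return u_b * np.tanh(w_a2 @ h)   # bounded control

rng = np.random.default_rng(1)
w_a1 = rng.uniform(-0.1, 0.1, size=(25, 3))   # 2 tracking errors + bias input
w_a2 = rng.uniform(-0.1, 0.1, size=(1, 25))
u = actor_forward(np.array([5.0, -2.0]), w_a1, w_a2)
```

Because tanh maps into (−1, 1), the saturation constraint holds structurally, regardless of how large the weights grow during learning.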
Nevertheless, this explains exactly why the structure with explicit analytical calculations surpasses the straightforward structure: it lifts the restriction that the scalar β be 0 or 1. Specifically, for the straightforward form, if β = 0, the elements of w_c2 linked to Ĵ_{t+1} will never be updated, and if the actor network is trained through the pathway leading from Ĵ_{t+1}, the back-propagation cannot be carried out. Similarly, if the back-propagation channel of the actor network starts from λ̂_{t+1}, then β ≠ 1 is mandatory for the straightforward form. In contrast, the structure with explicit analytical calculations has no such limitations on β, because even if β = 0, the critic network can still be trained.
As presented in Fig. 1, the actor weights are updated along the 4th back-propagation pathway with a learning rate η_a:

w_{ai,t+1} = w_{ai,t} − η_a ∂E_{a,t}/∂w_{ai,t},

where i = 1, 2, and

∂E_{a,t}/∂w_{ai,t} = Ĵ_{t+1} λ̂_{t+1}^T (∂ê_{t+1}/∂u_t)(∂u_t/∂w_{ai,t}).

So far, the implementation of the proposed IGDHP with PO control scheme has been introduced. The procedure is briefly summarized in Algorithm 1, where line 6 is the dividing line between the current time instant and the next time step.

Remark 2.
The convergence analysis of the online identification has been presented in Section 3.3, and the convergence analysis of the ADP scheme has been investigated in [29,30]. However, as stated in [25], it is currently infeasible to theoretically prove the closed-loop stability of the IGDHP algorithm due to its completely online implementation. Accordingly, repeated numerical experiments are carried out in the next section for verification.

Algorithm 1: Design procedure of the IGDHP with PO control scheme.
1 Initialization: initialize the system states and the parameters of the identifier, the critic network and the actor network;
2 while the terminal condition is not triggered do
3   compute the control input u_t using the actor network (Eq. (64)) with the current observation;
4   predict the one-step observation ê_{t+1} using the identifier (Eq. (32)) with the stored data;
5   evaluate the current control policy using the critic network (Eqs. (51) and (52));
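Since the extracted listing above is truncated, the overall online loop can be sketched in Python. Everything here is a hedged stand-in: the function names, arguments and the ordering of the steps after line 5 are assumptions based on the surrounding text, not the original listing.

```python
def igdhp_po_loop(actor, identifier, critic, plant, update_networks,
                  update_identifier, e0, n_steps):
    """Skeleton of the online IGDHP-PO procedure (cf. Algorithm 1)."""
    e, history = e0, [e0]
    for _ in range(n_steps):
        u = actor(e)                          # control from the current observation
        e_next_hat = identifier(e, u)         # one-step prediction
        J_hat, lam_hat = critic(e_next_hat)   # evaluate the current policy
        update_networks(e, u, J_hat, lam_hat) # critic and actor weight updates
        # ---- dividing line: the next time instant starts here ----
        e = plant(e, u)                       # apply u_t and observe the new error
        update_identifier(e, u)               # RLS update of the incremental model
        history.append(e)
    return history

# toy stand-ins, just to show the call pattern (not a real controller)
hist = igdhp_po_loop(actor=lambda e: -0.5 * e,
                     identifier=lambda e, u: 0.9 * e + u,
                     critic=lambda e: (e * e, 2 * e),
                     plant=lambda e, u: 0.9 * e + u,
                     update_networks=lambda *a: None,
                     update_identifier=lambda *a: None,
                     e0=1.0, n_steps=10)
```

With the toy closures above, the closed loop contracts the error by a factor of 0.4 per step, so the sketch also doubles as a smoke test of the call pattern.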

Numerical experiments
This section assesses the developed IGDHP algorithm on a practical aerospace application. Firstly, traditional GDHP with PO (GDHP-PO), IGDHP with FSF (IGDHP-FSF) and IGDHP with PO (IGDHP-PO) are compared by applying them to an attitude tracking control task. Then the IGDHP-PO algorithm is adopted for an altitude control problem in combination with a hierarchical technique and a PID controller.

Aerospace system model
A nonlinear aircraft model is set up using public data [49], and only the longitudinal dynamics are taken into consideration. Unless otherwise specified, all parameters in this paper are determined by trimming the aerodynamic model in a steady wings-level flight condition at 15000 ft and 600 ft/s, which will be referred to as the benchmark condition.
In this longitudinal control problem, there are 6 states, namely altitude h, airspeed v, pitch angle θ, angle of attack (AOA) α, flight path angle γ F and pitch rate q. Nevertheless, for a specific task, it is feasible to select only some main states to reduce computation. For instance, for the AOA tracking task, which is an attitude control problem, only pitch rate q, the basic state in the rate loop, is chosen as the additional feature state in [25], while other states are considered as parts of the black box to be identified.
For the longitudinal aerodynamic model, there are 3 control inputs, namely leading-edge flap deflection δ_lef, engine thrust T and elevator deflection δ_e. The leading-edge flap cannot be directly changed by the pilot [49] and is therefore regarded as inner unknown dynamics. For simplicity of implementation and convenience of analysis, a simple PID thrust controller that maintains the airspeed is designed in a separate control loop [39], such that only one control input, the elevator deflection δ_e, is considered, and a mapping between one control input and one final output can be constructed. Before the elevator deflection is practically adjusted, the system control u_t generated by the actor, called the elevator deflection command δ_e^c in this aerospace application, has to go through the actuator.

Fig. 3. The dynamics of the actuator, a first-order system with rate and position limits [25].

Table 1
Magnitudes of the noises affecting the controller design. Source: Adapted from [50].
The actuator is modeled as a first-order system with rate and position constraints, as illustrated in Fig. 3 [39,49].
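As an illustration, one Euler step of such a rate- and position-limited first-order actuator can be sketched as follows. The time constant and the limits are hypothetical placeholders, not the values from [49].

```python
import numpy as np

def actuator_step(delta, delta_cmd, dt, tau=0.0495, rate_lim=60.0, pos_lim=25.0):
    """One Euler step of a first-order actuator with rate and position limits.
    tau (s), rate_lim (deg/s) and pos_lim (deg) are illustrative values."""
    rate = (delta_cmd - delta) / tau            # first-order lag dynamics
    rate = np.clip(rate, -rate_lim, rate_lim)   # rate limit
    delta = delta + rate * dt
    return float(np.clip(delta, -pos_lim, pos_lim))  # position limit

# drive the actuator with a large step command at a 1 kHz sampling rate
dt, delta = 1e-3, 0.0
for _ in range(100):
    delta = actuator_step(delta, 30.0, dt)
```

With a 30 deg command the rate limit dominates, so after 0.1 s the deflection has moved by at most 60 deg/s × 0.1 s = 6 deg, which is the kind of lag the learning controller has to cope with.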
In real-world applications, noise is unavoidable, and unforeseen changes of the system dynamics or even sudden failures might be encountered. Consequently, there is a need to approximately model these uncertainties to verify the developed method. Non-zero-mean white measurement noise acting on the feedback signals and the actuator is taken into account, and the magnitudes of the real-world phenomena used for simulation are given in Table 1 [50]. This noise can degrade the performance of the controller, but can also act as exploration noise to better satisfy the persistent excitation (PE) condition [25]. It is noted that only the PO methods need to consider measurement noise, while in the FSF condition, the acquired data is assumed noise-free.
Sudden structural damage may be encountered during flight, which requires fault-tolerant control (FTC) [14]. Fault diagnosis is a significant part of FTC, but will not be discussed in this paper. Several kinds of sudden faults and their aftermath have been scrutinized in [39]. It has been found that moving mechanisms, such as actuators, are more likely to be affected by unexpected faults. Four faults are taken into consideration, namely reduction of bandwidth, strengthening of the rate limitation, output deviation and reduction of control effectiveness, where the last can also be triggered by structural damage that changes the aerodynamics [25]. As for faults of the static structure, damage to the horizontal stabilizer is selected for investigation. This damage has a significant impact on both the static and dynamic stability of the longitudinal dynamics, and mass loss is often encountered simultaneously, which instantaneously changes the center of gravity.

Simulation results
An AOA tracking problem is first taken into consideration. How to implement the GDHP method can be found in [25] and is omitted here. For a fair comparison, the settings of IGDHP-FSF, IGDHP-PO and GDHP-PO are kept as consistent as possible, and the best possible parameters are ascertained by repeating the experiments. The actor, critic and model networks are all fully connected and consist of three layers of nodes. To balance accuracy with computational efficiency, the number of hidden-layer neurons in all three networks is experimentally set to 25, and a sigmoid function is adopted as the activation function. Both the actor network and the model network introduce a bias term as an input, with b_a = b_m = 0.01. Large random initial weights can significantly affect the learning performance, whereas small ones decrease the initial exploration efficiency. Therefore, as a trade-off, all weights are randomly initialized within a small range of [−0.1, 0.1] and bounded within the range of [−20, 20]. With the information of derivatives, the DHP technique generally surpasses HDP in tracking precision, convergence speed and success rate [6], and therefore β is chosen to be 0.1 to take advantage of the derivative information. We experimentally choose Q = 8, R = 1 and γ = 0.99995 for the reward function and the critic. It should be noted that R can also be 0, but it is set positive definite for the purpose of decreasing the energy cost. As mentioned above, the system state only includes the AOA and the pitch angle, and therefore the minimum width of the sliding window is 3. To enhance stability, the window width is set to 4 for IGDHP-PO and IGDHP-FSF, and 5 sets of data are utilized by GDHP-PO to ensure the same level of data use, in that both IGDHP-PO and IGDHP-FSF employ the increment information, which is acquired from 5 time instants. F̂_t is initialized to be composed of 4 identity matrices and Ĝ_t starts from a zero matrix. γ_RLS is chosen to be 0.99995.
Cov_t is initially a diagonal matrix with all main diagonal elements set to 10^7 [48], since the trace of Cov_t indicates the magnitude of the estimation errors, which can initially be very large due to insufficient or incorrect knowledge of the system [51]. Besides, a descending method is applied to guarantee effective learning: the learning rates are initially set to large values and gradually decrease at each time step, being multiplied by a descending coefficient until a lower bound is reached; the corresponding parameters are given in Table 2.
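The descending learning-rate mechanism can be sketched as follows; the initial rate, the descending coefficient and the lower bound are illustrative stand-ins for the values in Table 2.

```python
def descending_lr(eta0, kappa, eta_min, steps):
    """Learning-rate schedule: start large, multiply by a descending
    coefficient kappa at each time step until the lower bound is reached.
    eta0, kappa and eta_min are illustrative, not the values of Table 2."""
    eta, schedule = eta0, []
    for _ in range(steps):
        schedule.append(eta)
        eta = max(eta * kappa, eta_min)  # geometric decay, floored at eta_min
    return schedule

s = descending_lr(eta0=1.0, kappa=0.9, eta_min=0.05, steps=50)
```

The schedule decays geometrically and then stays flat at the bound, so early exploration is aggressive while late updates remain small but non-zero for continued online adaptation.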
All experiments are carried out with a sampling frequency of 1 kHz and computed using Euler's method. In order to satisfy the PE condition, a probing signal is introduced at the initial exploration stage. How to design a good probing signal is still an open problem, but in this paper a 3211 disturbance signal, which changes its sign over time in a particular proportion [5,25], is introduced to excite the system modes. Firstly, the system is supposed to track a human-designed AOA reference signal online using IGDHP-FSF, IGDHP-PO and GDHP-PO, respectively. The AOA reference signal varies around the trimmed condition, namely 2.6638 deg. The comparison of the performance is given in Fig. 4, where the system is initialized at the benchmark condition. If successfully implemented, all three methods can complete the tracking task. Among them, IGDHP-FSF shows the best performance, with a control policy that converges quickly. In the PO condition, both the IGDHP-PO and GDHP-PO methods can track the reference after a brief period of oscillation and achieve similar performance. The settling time of IGDHP-PO is slightly smaller than that of the GDHP-PO method, but the difference is negligible in practical applications. Due to the influence of measurement noise, the tracking precision of these PO-based methods is lower than that of IGDHP-FSF. As presented in subfigure (b), the tracking errors (denoted by ∆α) of IGDHP-PO and GDHP-PO keep oscillating around the mean value of the AOA sensor noise, while the tracking error of IGDHP-FSF is approximately 0 at the stable stage. Fig. 5 illustrates the elevator deflection command produced by the IGDHP-PO method starting from the benchmark condition. From the inset, it can be seen that a 3211 disturbance signal is applied to kick off the learning process. The control input command becomes nearly 0 after the probing signal vanishes, since the initial weights of the actor network are tiny random values, as presented in Fig. 6.
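For illustration, a 3211 probing signal of the kind described, holding its sign for 3, 2, 1 and 1 time units, can be generated as follows; the amplitude and the unit duration are hypothetical, and the paper's actual probing signal may differ.

```python
import numpy as np

def probe_3211(t, amplitude=1.0, unit=0.5, t0=0.0):
    """A 3-2-1-1 probing signal: +A for 3 time units, -A for 2 units,
    +A for 1 unit and -A for 1 unit, then 0. Parameters are illustrative."""
    tau = t - t0
    edges = np.array([3, 5, 6, 7]) * unit  # cumulative segment boundaries
    if tau < 0 or tau >= edges[3]:
        return 0.0
    if tau < edges[0]:
        return amplitude
    if tau < edges[1]:
        return -amplitude
    if tau < edges[2]:
        return amplitude
    return -amplitude
```

The sign changes at unequal intervals excite several frequency bands at once, which is why the 3211 pattern is a common choice for initial excitation.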
Then both the critic and actor networks are excited, and the actor–critic scheme behaves as a high-gain controller during the exploration stage until the weights converge. The learning process is performed entirely online and continues even after the policy has converged.
In addition, although the initial condition can have an impact on the controller, the proposed IGDHP-PO approach can deal with a wide range of initial states within [−10 deg, 15 deg] without loss of precision. As presented in Fig. 7, the AOA can track the given reference signal α_ref in less than 2 s for all initial conditions using the IGDHP-PO approach, which is indicative of its competent adaptability and robustness.
Nevertheless, the results presented above are only meaningful when the task is successfully carried out. Random factors, such as the initial weights of the ANNs and measurement noise, can impact the performance and occasionally even trigger divergence and failure. To compare the robustness of these algorithms in various aspects, the concept of success ratio is introduced as a performance index. For a successful implementation, the AOA trajectory must remain within a range of [−15 deg, 20 deg] over the whole control period, the system must be able to track the reference signal after 2π seconds, and the tracking errors must not exceed 1 deg thereafter. Keeping all parameters unchanged except for the probing signal, 1000 Monte Carlo simulations are implemented to assess the performance of these approaches.
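The success criterion just described can be encoded directly. This is a sketch: the function name and the synthetic trajectories are illustrative, while the real check runs on simulated closed-loop data.

```python
import numpy as np

def is_success(t, alpha, alpha_ref, settle_time=2 * np.pi,
               lo=-15.0, hi=20.0, tol=1.0):
    """Success check used as a performance index: the AOA trajectory must
    stay within [lo, hi] deg over the whole run, and the tracking error must
    remain below tol deg after settle_time seconds."""
    t, alpha, alpha_ref = map(np.asarray, (t, alpha, alpha_ref))
    if np.any(alpha < lo) or np.any(alpha > hi):
        return False                       # amplitude bound violated
    settled = t >= settle_time
    return bool(np.all(np.abs(alpha[settled] - alpha_ref[settled]) <= tol))

# synthetic example around the trimmed AOA of 2.6638 deg
t = np.linspace(0.0, 10.0, 1001)
ref = 2.6638 + np.sin(t)
good = ref + 0.5 * np.exp(-t)  # converges well before 2*pi seconds
```

Averaging this boolean over the 1000 Monte Carlo runs yields the success ratio reported in Table 3.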
The experiments are executed with 7 different initial AOAs and the same reference to evaluate the robustness of these approaches towards initial tracking errors, and the results regarding the success ratio are presented in Table 3. Both IGDHP-FSF and IGDHP-PO have a success ratio of 100% in the benchmark condition, which implies that these approaches are stable for this tracking control problem. On the other hand, the success ratio of GDHP-PO in the benchmark condition is merely 51.2%, and for all initial states, the success ratios of IGDHP-FSF and IGDHP-PO exceed those of GDHP-PO, demonstrating that the incremental technique can significantly improve the system identification compared to the traditional global ANN-based model in this online application. The success ratios of IGDHP-PO are generally smaller than those of IGDHP-FSF, which is intuitively predictable, while the differences are less than 1% for more than half of the initial states, which shows that IGDHP-PO can cope with PO situations. The success ratios with α_0 = −5 deg and α_0 = 0 deg decrease by 35.5% and 21.5% for IGDHP-FSF, and by 53.7% and 26.5% for IGDHP-PO, compared to the benchmark condition, respectively. Besides the algorithms themselves, this reduction of the success ratio can also be caused by the collective effect of the nominal reference signal and the inherent dynamics of the system. It should also be noted that in multiple cases the success ratio is not 100%, because it is difficult to achieve an optimal PE condition due to the circular dependency among the PE condition, accurate system information and a stable control policy [25]. Although some cases can achieve a success ratio of 100%, random factors prevent the controller from consistently perfect tracking. Nevertheless, there is still the prospect of full success, and the results presented in Table 3 are obtained under the current settings.
Improvements in various aspects, such as sensor precision, probing signal design, parameter initialization and learning rates, can benefit the stability and improve the success ratio. How to improve these factors is still an open problem, and therefore this paper concentrates on the assessment of the robustness of IGDHP-FSF, IGDHP-PO and GDHP-PO. Adaptability is one of the most significant aspects through which ACD approaches show their superiority over other conventional optimal control methods. Sound policies can be learned automatically by updating the weights of the networks, which enables ACDs to be applied to FTC problems. Therefore, the following part investigates and compares the adaptability of IGDHP-FSF, IGDHP-PO and GDHP-PO by applying them to two fault scenarios, in which the system is supposed to track a sinusoidal reference signal. Considering that the moment at which a fault happens can influence the results, the faults are designed to take place at the instants of 4π seconds and 5π seconds, respectively.
The first fault scenario examined is partial damage to the elevator actuator. Specifically, the elevator suddenly loses 30% of its effectiveness and 50% of its bandwidth, and the bound of the deflection rate degrades by 1/3 to [−60 deg/s, 60 deg/s]. Besides, a constant disturbance acts on the actuator, making the real deflection 2 deg larger than the produced command. The results are illustrated in Figs. 8 and 9, where the malfunction of the actuator does not excessively affect the performance of these adaptive controllers, especially IGDHP-FSF, which returns completely to normal behavior after a small oscillation, without a significant increase of the tracking errors. The impacts of the sudden damage on IGDHP-PO and GDHP-PO are more obvious, escalating their tracking errors to varying degrees. The tracking errors of both approaches increase after the instantaneous damage, while IGDHP-PO relatively surpasses GDHP-PO, whose errors are larger. Nevertheless, all approaches are capable of handling this fault scenario.
The second fault scenario investigated is damage to the left horizontal stabilizer. With the right horizontal stabilizer intact, the center of gravity inevitably shifts from its normal position, which leads to an increment of the pitching moment. This structural damage can also affect the closed-loop performance by degrading the longitudinal damping and the stability margin. As presented in Figs. 10 and 11, the static structural fault caused by the damage of the horizontal stabilizer has a bigger impact than the considered actuator fault. At large positive values of AOA, the tracking performance of all three approaches significantly degrades to different extents. On the whole, all approaches can adapt to this sudden fault, in that their performance improves with time. IGDHP-PO still outperforms GDHP-PO in adaptation: after one period of the reference signal, the tracking errors of IGDHP-PO have declined to less than 1.5 deg, while the tracking errors of GDHP-PO are still beyond 2 deg. Synthesizing the two fault scenarios, it can be concluded that the incremental technique has an advantage in quick adaptation over the conventional ANN-based global model.
Then, more practical application scenarios are investigated, in which an altitude control loop assisted by a PID controller is introduced to generate a more realistic reference signal. The pitch angle of the system is selected as the controlled variable of the intermediate loop, i.e., from the outer loop to the inner loop, the controlled variables are the altitude, the pitch angle and the AOA, respectively, as illustrated in Fig. 12. The other system states are regarded as unmeasured inner states. Although the flight path angle is more widely used as the intermediate variable, in this experiment choosing the pitch angle shows better performance, which is also feasible in the real world. In practice, proportional control is often applied alone, and in this paper k_h = 3 and k_θ = 5, while all other parameters are kept unchanged. Parameter variations can affect the dynamic response of the control system. Hence, different discount factors are examined to further verify the robustness of the developed IGDHP-PO algorithm. Figs. 13 and 14 demonstrate the tracking performance with different discount factors, where the subscripts 1, 2 and 3 stand for the cases of γ = 0.99995, γ = 0.9 and γ = 0.85, respectively. As can be seen, if successfully implemented, comparable performance can be obtained with different discount factors. Because of the initially random policy and the completely online learning, the tracking performance at the beginning is also imperfect. Due to the measurement uncertainties and the proportional controllers, the generated AOA reference oscillates over the altitude control task. Despite this, the developed approach manages to keep the tracking errors mostly within 1 deg and controls the system to track the designed altitude reference. Nevertheless, it is also observed that with all other parameters unchanged, different discount factors can lead to different success ratios, specifically 99.1% for γ = 0.99995, 97.3% for γ = 0.9, and 70.7% for γ = 0.85.
This shows that the developed IGDHP-PO approach is robust to the discount factor to a certain extent.
In addition, load disturbances are often encountered in flight control due to turbulence. Therefore, the tracking performance in the presence of a sudden load disturbance is examined. Although turbulence can act on the whole aircraft, for the purpose of simplicity and reproducibility, the considered load disturbance is set as an equivalent square-wave disturbance acting on the real elevator deflection, i.e., δ_e. As presented in Figs. 15 and 16, although the sudden disturbance has an impact on the tracking performance, the proposed IGDHP-PO approach can adapt with a fast-changing deflection command. The largest impact happens at the instant when the disturbance load appears for the first time. When an equivalent 5 deg deflection disturbance is encountered, the controller generates an opposite action to stabilize the system. Due to the overshoot, oscillations are initiated, but the AOA signal manages to track the reference. After the onset, the subsequent changing disturbance becomes less influential, which demonstrates the robustness and the online learning capability of the proposed method.

Conclusion
This paper develops an incremental model-based global dual heuristic programming (IGDHP) approach by combining global dual heuristic programming (GDHP) and augmented incremental techniques, which solves the partial observability problem. Moreover, the input saturation constraint is overcome by utilizing a symmetrical sigmoid function as the output layer activation function of the actor network, which frees the matrix R in the reward from having to be positive definite.
Various numerical simulation studies are conducted with an aerospace system, including attitude control problems with different initial states and sudden structural changes, and an altitude control problem combined with PID controller and hierarchy technique. The results uniformly demonstrate that the developed IGDHP algorithm can effectively deal with partial observability and surpass conventional GDHP in online stability, robustness to different initial states, and adaptability when encountering unforeseen faults. The applications to altitude control demonstrate that the developed IGDHP algorithm is robust to parameter variations and load disturbance, and has the potential to be applied to realistic complex scenarios combined with other techniques.
Future research should continue to bridge the gap between the algorithms and real-world systems, and approaches and techniques to better satisfy the persistent excitation (PE) condition are especially recommended.