Online Multi-Objective Model-Independent Adaptive Tracking Mechanism for Dynamical Systems

The optimal tracking problem is addressed in the robotics literature using a variety of robust and adaptive control approaches. However, these schemes face implementation limitations, such as restricted applicability in uncertain dynamical environments when complete or partial model-based control structures are used, complexity and integrity issues in discrete-time environments, and limited scalability in complex coupled dynamical systems. An online adaptive learning mechanism is developed to tackle these limitations and provide a generalized solution platform for a class of tracking control problems. This scheme minimizes the tracking errors and optimizes the overall dynamical behavior using simultaneous linear feedback control strategies. Reinforcement learning approaches based on value iteration processes are adopted to solve the underlying Bellman optimality equations. The resulting control strategies are updated in real time in an interactive manner without requiring any information about the dynamics of the underlying systems. Adaptive critics are employed to approximate the optimal solving value functions and the associated control strategies in real time. The proposed adaptive tracking mechanism is illustrated in simulation by controlling a flexible wing aircraft in an uncertain aerodynamic learning environment.


Introduction
Adaptive tracking control algorithms employ challenging and complex control architectures under prescribed constraints about the dynamical system parameters, initial tracking errors, and stability conditions [1,2]. These schemes may include cascade linear stages or over-parameterize the state feedback control laws to solve the tracking problems [3,4]. Among the challenges associated with this class of control algorithms, is the need to have full or partial knowledge of the dynamics of the underlying systems, which can degrade their operation in the presence of uncertainties [5,6]. Some approaches employ tracking error-based control laws and cannot guarantee overall optimized dynamical performance. This motivated the introduction of flexible innovative machine learning tools to tackle some of the above limitations. In this work, online value iteration processes are employed to solve optimal tracking control problems. The associated temporal difference equations are arranged to optimize the tracking efforts as well as the overall dynamical performance. Linear quadratic utility functions, which are used to evaluate the above optimization objectives, result in two model-free linear feedback control laws which are adapted simultaneously in real time. The first feedback control law is flexible to the tracking error combinations (i.e., possible higher-order tracking error control structures compared to the traditional continuous-time Proportional-Derivative (PD) or Proportional-Integral-Derivative (PID) control mechanisms), while the second is a state feedback control law that is designed to obtain an optimized overall dynamical performance, while affecting the closed-loop characteristics of the system under consideration. This learning approach does not over-parameterize the state feedback control law and it is applicable to uncertain dynamical learning environments. 
The resulting state feedback control laws are flexible and can be adapted to observe only a subset of the dynamical variables or states, which is convenient when it is hard or expensive to measure all dynamical variables. Owing to the straightforward adaptation laws, the tracking scheme can be employed in systems with coupled dynamical structures. Finally, the proposed method can be applied to nonlinear systems without requiring output feedback linearization.
To showcase the concept in hand and to highlight its effectiveness under different modes of operation, a trajectory-tracking system is simulated using the proposed machine learning mechanism for a flexible wing aircraft. Flexible wing systems are described as two-mass systems interacting through kinematic constraints at the connection point between the wing system and the pilot/fuselage system (i.e., the hang-strap point) [7][8][9][10]. The modeling approaches of the flexible wing aircraft typically rely on finding the equations of motion using perturbation techniques [11]. The resulting model decouples the aerodynamics according to the directions of motion into the longitudinal and lateral frames [12]. Modeling this type of aircraft is particularly challenging due to the time-dependent deformations of the wing structure even in steady flight conditions [13][14][15][16]. Consequently, model-based control schemes typically degrade the operation under uncertain dynamical environments. The flexible wing aircraft employs weight shift mechanism to control the orientations of the wing with respect to the pilot/fuselage system. Thus, the aircraft pitch/roll orientations are controlled by adjusting the relative centers of gravities of these highly coupled and interacting systems [7,8].
Optimal control problems are formulated and solved using optimization theories and machine learning platforms. Optimization theories provide rigorous frameworks to solve control problems by finding the optimal control strategies and solving the underlying Bellman optimality equations or the Hamilton-Jacobi-Bellman (HJB) equations [17][18][19][20][21]. These solution processes guarantee optimal cost-to-go evaluations. A tracking control mechanism that uses time-varying sliding surfaces is adopted for a two-link manipulator with variable payloads in [22]. It is shown that a reasonable tracking precision can be obtained using approximate continuous control laws, without experiencing undesired high-frequency signals. An output tracking mechanism for nonminimum phase flat systems is developed to control the vertical takeoff and landing of an aircraft [23]. The underlying state-tracker works well for slightly as well as strongly nonminimum phase systems, unlike the traditional state-based approximate-linearized control schemes. A state feedback mechanism based on a backstepping control approach is developed for a two-degrees-of-freedom mobile robot. This technique introduced restrictions on the initial tracking errors and the desired velocity of the robot [1]. An observer-based fuzzy controller is employed to solve the tracking control problem of a two-link robotic system [2]. This controller used a convex optimization approach to solve the underlying linear matrix inequality problem to obtain bounded tracking errors [2]. A state feedback tracking mechanism for underactuated ships is developed in [3]. The nonlinear stabilization problem is transformed into equivalent cascaded linear control systems. The tracking error dynamics are shown to be globally K-exponentially stable provided that the reference velocity does not decay to zero.
An adaptive neural network scheme is employed to design a cooperative tracking control mechanism where the agents are interacting via a directed communication graph, and they are tracking the dynamics of a high-order non-autonomous nonlinear system [24]. The graph is assumed to be strongly connected and the cooperative control solution is implemented in a distributed fashion. Adaptive backstepping tracking control technique is adopted to control a class of nonlinear systems with arbitrary switching forms in [4]. It includes an adaptive mechanism to overcome the over-parameterization of the underlying state feedback control laws. A tracking control strategy is developed for a class of Multi-Input-Multi-Output (MIMO) high-order systems to compensate for the unstructured dynamics in [25]. Lyapunov proof with weak assumptions emphasized semi-global asymptotic tracking characteristics of the controller. Fuzzy adaptive state feedback and observer-based output feedback tracking control architecture is developed for Single-Input-Single-Output (SISO) nonlinear systems in [26]. This structure employed backstepping approach to design the tracking control law for uncertain non-strict feedback systems.
Machine learning platforms present implementation kits of the derived optimal control mathematical solution frameworks. These use artificial intelligence tools such as Reinforcement Learning (RL) and Neural Networks to solve the Approximate Dynamic Programming problems (ADP) [27][28][29][30][31][32][33]. The optimization frameworks provide various optimal solution structures which enable solutions of different categories of the approximate dynamic programming problems such as Heuristic Dynamic Programming (HDP), Dual Heuristic Dynamic Programming (DHP), Action Dependent Heuristic Dynamic Programming (ADHDP), and Action-Dependent Dual Heuristic Dynamic Programming (ADDHP) [34,35]. These forms in turn are solved using different two-step temporal difference solution structures. ADP approaches provide means to solve the curse of dimensionality in the states and action spaces of the dynamic programming problems. Reinforcement learning frameworks suggest processes that can implement solutions for the different approximate dynamic programming structures. These are concerned with solving the Hamilton-Jacobi-Bellman equations or Bellman optimality equations of the underlying dynamical structures [36][37][38]. Reinforcement learning approaches employ dynamic learning environment to decide the best actions associated with the state-combinations to minimize the overall cumulative cost. The designs of the cost or reward functions reflect the optimization objectives of the problem and play crucial role to find suitable temporal difference solutions [39][40][41]. This is done using two-step processes, where one solves the temporal difference equation and the other solves for the associated optimal control strategies. Value and policy iteration methods are among the various approaches that are used to implement these steps. 
The main differences between the two approaches are related to the sequence of how the solving value functions are evaluated, and the associated control strategies are updated.
Recently, innovative robust policy and value iteration techniques have been developed for single and multi-agent systems, where the associated computational complexities are alleviated by the adoption of model-free features [42]. A completely distributed model-free policy iteration approach is proposed to solve the graphical games in [21]. Online policy iteration control solutions are developed for flexible wing aircraft, where approximate dynamic programming forms with gradient structures are used [43,44]. Deep reinforcement learning approaches enable agents to derive optimal policies for high-dimensional environments [45]. Furthermore, they promote multi-agent collaboration to achieve structured and complex tasks. The augmented Algebraic Riccati Equation (ARE) for the linear quadratic tracking problem is solved using a Q-learning approach in [46]. The reference trajectory is generated using a linear generator command system. A neural network scheme based on a reinforcement learning approach is developed for a class of affine Multi-Input-Multi-Output (MIMO) nonlinear systems in [47]. This approach customized the number of updated parameters irrespective of the complexity of the underlying systems. An integral reinforcement learning scheme is employed to solve the Linear-Quadratic-Regulator (LQR) problem for optimized assistive Human-Robot Interaction (HRI) applications in [48]. The LQR scheme optimizes the closed-loop features for a given task to minimize the human efforts without acquiring information about their dynamical models. A solution framework based on a combined model predictive control and reinforcement learning scheme is developed for robotic applications in [6]. This mechanism uses a guided policy search technique, and the model predictive controller generates the training data using the underlying dynamical environment with full state observations.
Adaptive control approach based on a model-based structure is adopted to solve the optimal tracking infinite horizon problem for affine systems in [5]. In order to effectively explore the dynamical environment, a concurrent system identification learning scheme is adopted to approximate the underlying Bellman approximation errors.
A reinforcement learning approach based on deep neural networks is used to develop a time-varying control scheme for a formation of unmanned aerial vehicles in [49]. The complexity of the multi-agent structure is tackled by training an individual vehicle and then generalizing the learning outcome of that agent to the formation scheme. Deep Q-Networks are used to develop generic multi-objective reinforcement learning scheme in [50]. This approach employed single-policy as well as multi-policy structures and it is shown to converge effectively to optimal Pareto solutions. Reinforcement Learning approaches based on deterministic policy gradient, proximal policy optimization, and trust region policy optimization approaches are proposed to overcome the PID control limitations of the inner attitude control loop of the unmanned aerial vehicles in [51]. The cooperative multi-agent learning systems use the interactions among the agents to accomplish joint tasks in [52]. The complexity of these problems depends on the scalability of the underlying system of agents along with their behavioral objectives. Action coordination mechanism based on a distributed constraint optimization approach is developed for multi-agent systems in [53]. It uses an interaction index to trade-off between the beneficial coordination among the agents and the communication cost. This approach enables non-sequenced coupled adaptations of the coordination set and the policy learning processes for the agents. The mapping of single-agent deep reinforcement learning to multi-agent schemes is complicated due to the underlying scaling dilemma [54]. The experience replay memory associated with deep Q-learning problems is tackled using a multi-agent sampling mechanism which is based on a variant of importance mechanism in [54].
The adaptive critics approaches are employed to advise various neural network solutions for optimal control problems. They implement two-step reinforcement learning processes using separate neural network approximation schemes. The solution for Bellman optimality equation or the Hamilton-Jacobi-Bellman equation is implemented using a feedforward neural structure described by the critic structure. On the other hand, the optimal control strategy is approximated using an additional feedforward neural network structure called the actor structure. The update processes of the actor and critic weights are interactive and coupled in the sense that the actor weights are tuned when the critic weights are updated following reward/punish assessments of the dynamic learning environment [28,30,33,37,40]. The sequences of the actor and critic weights-updates follow those advised by the respective value or policy iteration algorithms [28,37]. Reinforcement learning solutions are implemented in continuous-time platforms as well as discrete-time platforms, where integral forms of Bellman equations are used [55,61]. These structures are applied to multi-agent systems as well as single-agent systems, where each agent has its own actor-critic structure [34,35]. The adaptive critics are employed to provide neural network solutions for the dual heuristic dynamic programming problems for multi-agent systems [19,20]. These structures solve the underlying graphical games in a distributed fashion where the neighbor information is used. Actor-critic solution implementation for an optimal control problem with nonlinear cost function is introduced in [55]. The adaptive critics implementations for feedback control systems are highlighted in [56]. A PD scheme is combined with a reinforcement learning mechanism to control the tip-deflection and trajectory-tracking operation of a two-link flexible manipulator in [57]. 
The adopted actor-critic learning structure compensates for the variations in the payload. An adaptive trajectory-tracking control approach based on actor-critic neural networks is developed for a fully autonomous underwater vehicle in [58]. The nonlinearities in the control input signals are compensated for during the adaptive control process.
The contributions of this work are four-fold:
1. An online control mechanism is developed to solve the tracking problem in uncertain dynamical environments without acquiring any knowledge about the dynamical models of the underlying systems.
2. An innovative temporal difference solution is developed using a reformulation of the Bellman optimality equation. This form does not require the existence of admissible initial policies, and it is computationally simple and easy to apply.
3. The developed learning approach solves the tracking problem for each dynamical process using separate interactive linear feedback control laws. These optimize the tracking as well as the overall dynamical behavior.
4. The outcomes of the proposed architecture can be generalized smoothly to structured dynamical problems, since the learning approach is suitable for discrete-time control environments and is applicable to complex coupled dynamical problems.
The paper is structured as follows: Section 2 is dedicated to the formulation of the optimal tracking control problem along with the model-free temporal difference solution forms. Model-free adaptive learning processes are developed in Section 3, and their real-time adaptive critics or neural network implementations are presented in Section 4. Digital simulation outcomes for an autonomous controller of a flexible wing aircraft are analyzed in Section 5. The implications of the developed machine learning processes in practical applications and some future research directions are highlighted in Section 6. Finally, concluding remarks about the adaptive learning mechanisms are presented in Section 7.

Formulation of the Optimal Tracking Control Problem
Optimal tracking control theory is used to lay out the mathematical foundation of various adaptive learning solution frameworks. However, many adaptive mechanisms employ complicated control strategies that are difficult to implement in discrete-time solution environments. In addition, many tracking control schemes are model-dependent, which raises concerns about their performance in unstructured dynamical environments [17]. This section tackles these challenges by mapping the optimization objectives of the underlying tracking problem using machine learning solution tools.

Combined Optimization Mechanism
The optimal tracking control problem, in terms of the operation, can be divided broadly into two main objectives [17]. The first is concerned with asymptotically stabilizing the tracking error dynamics of the system, and the second optimizes the overall energy during the tracking process. Herein, the outcomes of the online adaptive learning processes are two linear feedback control laws. The adaptive approach uses simple linear quadratic utility or cost functions to evaluate the real-time optimal control strategies. The proposed approach tackles many challenges associated with the traditional tracking problems [17]. First, it allows an online model-free mechanism to solve the tracking control problem. Second, it allows several flexible tracking control configurations which are adaptable with the complexity of the dynamical systems. Finally, it allows interactive adaptations for both the tracker and optimizer feedback control laws.
The learning approach does not employ any information about the dynamics of the underlying system. The selected online measurements can be represented symbolically using the following form

$$X_{k+1} = F(X_k, U_k),$$

where $X \in \mathbb{R}^{n \times 1}$ is a vector of selected measurements (i.e., the sufficient or observable online measurements), $U \in \mathbb{R}^{m \times 1}$ is a vector of control signals, $k$ is a discrete-time index, and $F$ represents the model that generates the online measurements of the dynamical system, which may be linear or nonlinear.

The tracking segment of the overall tracking control scheme generates the optimal tracking control signal $C_{k\{i\}} \in \mathbb{R}$, $\forall k$, using a linear feedback control law that depends on the sequence of tracking errors $e_{k\{i\}}, e_{k-1\{i\}}, e_{k-2\{i\}}$, where each error signal $e_{k\{i\}}$ is associated with the $i$-th state or measured variable of the vector $X_k$ (i.e., $X_{k\{i\}}$). The error is defined by $e_{k\{i\}} = T_{k\{i\}} - X_{k\{i\}}$, where $T_{k\{i\}}$ is the reference signal of the state or measured variable $X_{k\{i\}}$. On the one hand, the number of online tracking control loops is determined by the number of reference variables or states: each reference signal $T_{k\{i\}}$ has its own tracking evaluation loop. In this development, a feedback control law that uses a combination of three errors (i.e., $e_{k\{i\}}, e_{k-1\{i\}}, e_{k-2\{i\}}$) is considered in order to mimic the mechanism of a Proportional-Integral-Derivative (PID) controller in discrete time, with the tracking gains adapted online in real time. On the other hand, each scalar tracking control law $C_{k\{i\}}$ can be formulated for any combination of error samples (i.e., $e_{k\{i\}}, e_{k-1\{i\}}, e_{k-2\{i\}}, e_{k-3\{i\}}, \dots, e_{k-N\{i\}}$). Thus, the proposed tracking structure enables higher-order difference schemes that can be realized smoothly in discrete-time environments.
To simplify the tracking notation, $e_k$ and $C_k$ are used to refer to the tracking error signal $e_{k\{i\}}$ and the tracking control signal $C_{k\{i\}}$ of each individual tracking loop, respectively. Herein, each scalar actuating tracking control signal $C_{k\{i\}}$ simultaneously adjusts all relevant or applicable actuation control signals $U_{k\{j\}}$, $j \in \{1, \dots, m\}$.
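One way to see why three consecutive error samples are enough to mimic a discrete-time PID controller is the standard velocity (incremental) form of PID, which is a linear combination of $e_k$, $e_{k-1}$ and $e_{k-2}$. The sketch below illustrates this for a single tracking loop. It is only an illustration: the text does not state whether $C_k$ acts as an increment or as an absolute command, and the names `pid_to_three_tap`, `ErrorWindow`, `tracking_increment` and all numerical gains are hypothetical. In the proposed scheme the gains are learned online rather than derived from PID constants.

```python
from collections import deque


def pid_to_three_tap(kp, ki, kd, ts):
    """Map PID gains to gains on (e_k, e_{k-1}, e_{k-2}) in velocity form.

    Velocity-form PID: du_k = kp*(e_k - e_{k-1}) + ki*ts*e_k
                              + (kd/ts)*(e_k - 2*e_{k-1} + e_{k-2}).
    """
    g0 = kp + ki * ts + kd / ts
    g1 = -kp - 2.0 * kd / ts
    g2 = kd / ts
    return (g0, g1, g2)


class ErrorWindow:
    """Rolling window E_k = [e_k, e_{k-1}, e_{k-2}] for one tracking loop."""

    def __init__(self, n_errors=3):
        self.buf = deque([0.0] * n_errors, maxlen=n_errors)

    def push(self, reference, measurement):
        # e_k = T_k - X_k; the newest sample is stored first
        self.buf.appendleft(reference - measurement)
        return list(self.buf)


def tracking_increment(gains, errors):
    """Linear feedback on the error window: sum_i g_i * e_{k-i}."""
    return sum(g * e for g, e in zip(gains, errors))


# Hypothetical gains and samples, for illustration only
gains = pid_to_three_tap(kp=2.0, ki=0.5, kd=0.1, ts=0.1)
window = ErrorWindow()
window.push(reference=1.0, measurement=0.0)
errors = window.push(reference=1.0, measurement=0.4)
increment = tracking_increment(gains, errors)
```

Extending `n_errors` beyond three gives the higher-order error combinations mentioned above without changing the structure of the law.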
The overall layout of the control mechanism (i.e., considering the optimizing and tracking features) is sketched in Figure 1, where $\phi_{desired}$ denotes a desired reference signal (i.e., each $T_{k\{i\}}$) and $\phi_{actual}$ refers to the actual measured signal (i.e., each $X_{k\{i\}}$) for each individual tracking loop. The goals of the optimization problem are to find the optimal linear feedback control laws, or the optimal control signals $U_k^*$ and $C_k^*$, $\forall k$, using model-free machine learning schemes. The underlying objective utility functions are mapped into different temporal difference solution forms. As indicated above, since linear feedback control laws are used, linear quadratic utility functions are employed to evaluate the optimality conditions in real time. The objectives of the optimization problem are detailed as follows:
(1) A measure index of the overall dynamical performance, the performance index $J$, is minimized to calculate the optimal control signal $U_k^*$.
(2) A tracking error index, the performance index $P$, is optimized to evaluate the optimal tracking control signal $C_k^*$. The associated objective cost function $D(E_k, C_k)$ is a linear quadratic function of the tracking error vector $E_k$ and the tracking control signal $C_k$, where $S \in \mathbb{R}^{3 \times 3} > 0$ is a symmetric positive definite weighting matrix and $M \in \mathbb{R} > 0$. The choice of the tracking error vector $E$ is flexible with respect to the number of memorized tracking error signals $N_e$, such that it collects $e_{k-\ell}$, $\ell = 0, 1, \dots, N_e$.
Herein, the choice of the optimized policy structure $U_k^*$ as a function of the states $X_k$ is not meant to achieve asymptotic stability in a standalone operation (i.e., all the states $X_k$, $\forall k$, going to zero). Instead, it is incorporated into the overall control architecture, where it can select the minimum energy path during the tracking process. Hence, it creates an asymptotically stable performance around the desired reference trajectory.
Later, the performance of the standalone tracker is contrasted against that of the combined tracking control scheme to highlight this energy exchange minimization outcome.
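For orientation, linear quadratic utilities of the kind described above are commonly written as shown below. This is a sketch and not a transcription of the original equations: the symbols $O$, $Q$ and $R$, the factor $1/2$, and the summation limits are assumptions introduced here, while $D$, $S$, $M$, $E_k$, $J$ and $P$ follow the text.

```latex
% Overall dynamical performance (assumed form; O, Q, R are hypothetical symbols)
O(X_k, U_k) = \tfrac{1}{2}\left( X_k^{T} Q X_k + U_k^{T} R U_k \right),
\qquad
J = \sum_{k=0}^{\infty} O(X_k, U_k)

% Tracking performance with three memorized errors (assumed form)
E_k = \begin{bmatrix} e_k & e_{k-1} & e_{k-2} \end{bmatrix}^{T},
\qquad
D(E_k, C_k) = \tfrac{1}{2}\left( E_k^{T} S E_k + M C_k^{2} \right),
\qquad
P = \sum_{k=0}^{\infty} D(E_k, C_k)
```

With a scalar $C_k$, the control penalty reduces to $M C_k^2$, which matches the stated dimension $M \in \mathbb{R}$, and the $3 \times 3$ size of $S$ matches an error vector of three samples.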

Optimal Control Formulation
Various optimal control formulations of the tracking problem promote multiple temporal difference solution frameworks [17,18]. These use Bellman equations or Hamilton-Jacobi-Bellman structures or even gradient forms of Bellman optimality equations [19,20,35]. The manner at which the cost or objective function is selected plays a crucial role in forming the underlying temporal difference solution and hence its associated optimal control strategy form. This work provides a generalizable machine learning solution framework, where the optimal control solutions are found by solving the underlying Bellman optimality equations of the dynamical systems. These can be implemented using policy iteration approaches with model-based schemes. However, these processes necessitate having initial admissible policies, which is essential to ensure admissibility of the future policies. This is further faced by computational limitations, for example, the reliance of the solutions on least square approaches with possible singularities-related calculation risks. This urged for flexible developments such as online value iteration processes where they do not encounter these problems.
Value iteration processes based on two temporal difference solution forms are developed to solve the tracking control problem. These are equivalent to solving the underlying Hamilton-Jacobi-Bellman equation of the optimal tracking control problem [17,46]. The problem under consideration requires two temporal difference equations: one solves for the optimal control strategies that minimize the tracking efforts, and the other selects the supporting control signals that minimize the energy exchanges during the tracking process. To this end, two solving value functions related to the main objectives are proposed: $\Gamma(\dots)$ approximates the overall minimized dynamical performance, and $\Xi(\dots)$ approximates the optimal tracking performance. These performance indices yield the Bellman, or temporal difference, Equations (2) and (3). The optimal control strategies associated with both Bellman equations are then calculated. The optimal policy for the overall optimized performance, (4), is given by

$$U_k^* = -H_{UU}^{-1} H_{UX} X_k,$$

and, in a similar fashion, the optimal policy for the optimized tracking performance, (5), is given by

$$C_k^* = -\Pi_{CC}^{-1} \Pi_{CE} E_k,$$

where $H_{UU}$, $H_{UX}$ and $\Pi_{CC}$, $\Pi_{CE}$ are sub-blocks of the symmetric matrices $H$ and $\Pi$ associated with $\Gamma(\dots)$ and $\Xi(\dots)$, respectively. Substituting the optimal policies (4) and (5) into Bellman Equations (2) and (3), respectively, yields the Bellman optimality equations, or temporal difference equations, (6) and (7), where $\Gamma^*(\dots)$ and $\Xi^*(\dots)$ are the optimal solutions. Solving Bellman optimality Equations (6) or (7) is equivalent to solving the underlying Hamilton-Jacobi-Bellman equations of the optimal tracking control problem.
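The linear policies used in this work, $-H_{UU}^{-1} H_{UX} X$ and $-\Pi_{CC}^{-1} \Pi_{CE} E$, are what one obtains when the solving value functions are quadratic in the stacked state-action vectors. The short derivation below is a sketch under that assumption. The factor $1/2$, the block partition of $H$, and the utility symbol $O$ are introduced here for illustration and are not taken from the text.

```latex
% Assumed quadratic solving value function and its temporal difference form
\Gamma(X_k, U_k) = \tfrac{1}{2}
\begin{bmatrix} X_k \\ U_k \end{bmatrix}^{T}
\begin{bmatrix} H_{XX} & H_{XU} \\ H_{UX} & H_{UU} \end{bmatrix}
\begin{bmatrix} X_k \\ U_k \end{bmatrix},
\qquad
\Gamma(X_k, U_k) = O(X_k, U_k) + \Gamma(X_{k+1}, U_{k+1})

% Stationarity with respect to the control signal (H symmetric, H_{UU} > 0)
\frac{\partial \Gamma(X_k, U_k)}{\partial U_k} = H_{UX} X_k + H_{UU} U_k = 0
\;\Longrightarrow\;
U_k^{*} = -H_{UU}^{-1} H_{UX} X_k

% The tracking loop follows the same pattern with (E_k, C_k) and \Pi
\frac{\partial \Xi(E_k, C_k)}{\partial C_k} = \Pi_{CE} E_k + \Pi_{CC} C_k = 0
\;\Longrightarrow\;
C_k^{*} = -\Pi_{CC}^{-1} \Pi_{CE} E_k
```

Because $C_k$ is scalar, $\Pi_{CC}$ is a scalar as well, so the tracking policy only requires a division rather than a matrix inversion.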

Remark 1.
Model-free value iteration processes employ temporal difference solution forms that arise directly from Bellman optimality Equations (6) or (7), in order to solve the proposed optimal tracking control problem. This learning platform shows how to enable Action-Dependent Heuristic Dynamic Programming (ADHDP) solution, a class of approximate dynamic programming that employs a solving value function that is dependent on a state-action structure, in order to solve the optimal tracking problem in an online fashion [37,59].

Online Model-Free Adaptive Learning Processes
Bellman optimality Equations (6) and (7) are used to develop online value iteration processes. Herein, two adaptive learning algorithms are developed using these optimality equations. They share the ability to produce control strategies while they learn the dynamic environment in real time and the strategies do not depend on the dynamical model of the system under consideration.

Direct Value Iteration Process
The first model-free value iteration algorithm uses direct forms of (6) and (7), as follows.
Algorithm 1. Model-free direct value iteration process.
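The flavor of a direct value iteration recursion can be illustrated offline on a small linear quadratic problem. The sketch below is not the online model-free algorithm of this paper: it uses a known, hypothetical linear model (`A`, `B`) and hypothetical weights (`Q`, `R`) purely to evaluate the next-step value, whereas the proposed scheme relies on measured data. It shows two properties discussed in the text: the recursion starts from a zero value function, so no admissible initial policy is needed, and the greedy policy has the form $-H_{UU}^{-1} H_{UX} X$.

```python
import numpy as np


def q_value_iteration(A, B, Q, R, iters=500):
    """Value iteration on an action-dependent quadratic kernel H (illustrative only)."""
    n = A.shape[0]
    P = np.zeros((n, n))  # value kernel starts at zero: no initial stabilizing policy
    history = []
    for _ in range(iters):
        # H^{r+1}: stage cost plus the previous value evaluated at X_{k+1} = A X + B U
        H = np.block([
            [Q + A.T @ P @ A, A.T @ P @ B],
            [B.T @ P @ A, R + B.T @ P @ B],
        ])
        Hxx, Hxu = H[:n, :n], H[:n, n:]
        Hux, Huu = H[n:, :n], H[n:, n:]
        K = np.linalg.solve(Huu, Hux)  # greedy policy U = -K X = -Huu^{-1} Hux X
        P = Hxx - Hxu @ K              # value kernel under the greedy policy
        history.append(P.copy())
    return H, K, history


# Hypothetical discretized double integrator; all numbers are for illustration
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
H, K, history = q_value_iteration(A, B, Q, R)
```

In this setting the successive kernels in `history` form a non-decreasing sequence (in the positive semidefinite sense) that settles at the solution of the discrete-time Riccati equation, which mirrors the monotonicity property claimed for the value iteration processes.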

Modified Value Iteration Process
Another adaptive learning algorithm based on an indirect value iteration process is proposed. This algorithm reformulates, or modifies, the way the Bellman optimality Equations (6) and (7) are solved. A modified value iteration process based on these reformulations is structured as follows.
Algorithm 2. Modified model-free value iteration process.
1. Initialize $\Gamma^0(X_0, U_0)$, $\Xi^0(E_0, C_0)$, $U_0^0$, and $C_0^0$.
2. Update the solving value functions $\Gamma(\dots)$ and $\Xi(\dots)$ using the reformulated Bellman optimality equations.
3. Extract the optimal strategies.
4. Terminate the updates of the solving value functions when the stopping condition on $H^{r+1}(\dots)$ is satisfied.
This value iteration process solves the Bellman optimality equation in a way that does not require initial stabilizing policies. Unlike the policy iteration mechanisms, this solution framework does not entail computational difficulties related to the evaluations of $\Gamma(\dots)$ and $\Xi(\dots)$ at the different evaluation steps.
The proposed value iteration processes optimize the overall dynamical performance towards the tracking objectives. This means that the two optimization objectives are interacting and coupled along the variables of interest. This is done in real time without acquiring any information about the dynamics of the underlying system.

Comparison to a Standard Policy Iteration Process
The value iteration process, as explained earlier, employs two steps, one is concerned with evaluating the optimal value function (i.e., solving Bellman optimality Equations (6) or (7)) and the second extracts the optimal policy given this value function (i.e., (4) or (5)). On the other hand, the policy iteration mechanism starts with a policy evaluation step that solves for a value function that is relevant to an attempted policy using Bellman equation (i.e., (2) or (3)) and this is followed by a policy improvement step that results in a strictly better policy compared to the preceding policy unless it is optimal [37,60,61].
To formulate a policy iteration process for the optimization problem at hand (i.e., the overall energy and tracking error minimization), the control signals $U^H$ and $C^\Pi$ are evaluated using the linear policies $-H_{UU}^{-1} H_{UX} X$ and $-\Pi_{CC}^{-1} \Pi_{CE} E$, respectively. The policy iteration process uses (2) and (3) repeatedly in order to perform a single policy evaluation step, where the symbols $j$ and $h$ refer to the calculation instances leading to a policy evaluation step for each dynamical operation.
In other words, the solving value function $\Gamma(\dots)$ is updated after collecting the $\nu$ necessary samples, i.e., $\bar{Z}_X^{j=1}(X_{k,k+1}, U^H_{k,k+1}), \bar{Z}_X^{j=2}(X_{k+1,k+2}, U^H_{k+1,k+2}), \dots, \bar{Z}_X^{j=\nu}(X_{k+\nu-1,k+\nu}, U^H_{k+\nu-1,k+\nu})$, where $\nu = (n+m)(n+m+1)/2$ designates the number of entries of the upper/lower triangular block of the matrix $H \in \mathbb{R}^{(n+m) \times (n+m)}$ and $\bar{Z}_X$ is a vector associated with the vector transformation of the upper/lower triangular block of the symmetric matrix $H$ [60,61]. This process lasts for at least the real-time interval from $k$ to $k + \nu$ in order to collect sufficient information to fulfill the policy evaluation step [60,61]. Similarly, the solving value function $\Xi(\dots)$ is updated at the end of each online interval from $k$ to $k + 10$, where 10 samples are repeatedly collected in order to evaluate the adopted tracking policy (10 is the number of entries of the upper/lower triangular block of the matrix $\Pi \in \mathbb{R}^{4 \times 4}$), i.e., $\bar{Z}_E^{h=1}, \dots, \bar{Z}_E^{h=10}$, where the vector $\bar{Z}_E^{h}$ is structured in a similar manner to $\bar{Z}_X$. The approach taken to construct the vector $\bar{Z}_X$ or $\bar{Z}_E$ is detailed in [60,61]. The policy iteration solution results in a decreasing sequence of the solving value functions that is lower-bounded by zero.
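The sample count $\nu$ and the vector transformation of a symmetric kernel can be made concrete with a few lines of code. The sketch below uses one common convention (off-diagonal products doubled, upper triangle stacked row by row); the exact construction used in [60,61] may differ, and the function names are hypothetical.

```python
def num_kernel_entries(n, m):
    """Entries in the upper (or lower) triangle of a symmetric (n+m) x (n+m) matrix."""
    d = n + m
    return d * (d + 1) // 2


def quadratic_basis(z):
    """Products z_i * z_j for i <= j; off-diagonal terms are doubled."""
    d = len(z)
    return [(1.0 if i == j else 2.0) * z[i] * z[j]
            for i in range(d) for j in range(i, d)]


def upper_triangle(H):
    """Stack the upper-triangular entries of H row by row."""
    d = len(H)
    return [H[i][j] for i in range(d) for j in range(i, d)]


def quadratic_form(H, z):
    """Direct evaluation of z^T H z, used here only as a cross-check."""
    d = len(z)
    return sum(z[i] * H[i][j] * z[j] for i in range(d) for j in range(d))
```

With three memorized errors and one scalar tracking signal, `num_kernel_entries(3, 1)` gives the 10 unknowns of the $4 \times 4$ matrix $\Pi$. Under this convention, the dot product of `upper_triangle(H)` and `quadratic_basis(z)` equals `quadratic_form(H, z)` for symmetric `H`, which is why each collected sample contributes one linear equation in the unknown entries.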
The policy iteration process requires the existence of an initial admissible policy and could encounter mathematical risks when evaluating the underlying policies [60,61]. On the other hand, Algorithms 1 and 2 do not require initial admissible policies, and the optimal value functions $\Gamma(\dots)$ and $\Xi(\dots)$ are updated simultaneously at each real-time instance $r = k$, as explained by (8) and (12). The value iteration process retains a simpler and more flexible adaptation mechanism than the above policy iteration formulation, where the policy evaluation steps could occur at uncorrelated time instances.

Convergence and Stability Results of the Adaptive Learning Mechanism
The convergence analysis and stability characteristics of the value iteration processes, based on the action-dependent heuristic dynamic programming solution, are introduced for single and multi-agent systems and for continuous as well as discrete-time environments [20,35,59,62,63]. The adaptive learning value iteration processes result in non-decreasing sequences of the solving value functions that are bounded above by $\Gamma^{*}(\ldots)$ and $\Xi^{*}(\ldots)$, the optimal solutions of the Bellman optimality equations.
The sequences of the resultant control strategies $U_k^r$ and $C_k^r$, $\forall k, r$, are stabilizing and hence admissible. In a similar fashion, the corresponding inequalities hold for these sequences and are bounded above using the same concepts adopted in [20,35,59,62,63]. The simulation results highlight the evolution of the solving value functions using Algorithms 1 and 2 in real time. Furthermore, they assess the importance of Algorithm 2 in terms of the convergence speed and optimality of the solving value functions.

Neural Network Implementations
Adaptive critics are employed to implement the proposed adaptive learning solutions in real time. Each algorithm involves two steps. The first is concerned with solving a Bellman optimality equation, and the other approximates the optimal control strategy. Each step is implemented using a neural network approximation structure. The solving value function Γ(. . . ) or Ξ(. . . ) is approximated using a critic structure, while the associated optimal control policy is approximated using an actor structure. These represent coupled tuning processes with different objectives. The solving algorithms employ update processes to tune the critic weights, where they have different forms of the temporal difference equations. However, the way the actor is approximated for both adaptive algorithms is achieved in the same fashion. A full adaptive critics solution structure for the tracking control problem is shown in Figure 2.

Neural Network Implementation of Algorithm 1
The actor-critic adaptations for Algorithm 1 are done in real time using separate neural network structures as follows.
The solving value functions $\Gamma(\ldots)$ and $\Xi(\ldots)$ are approximated using the neural network structures $\hat{\Gamma}$ and $\hat{\Xi}$, each parameterized by a critic approximation weights matrix. The optimal strategies $U^{*}$ and $C^{*}$ are approximated by $\hat{U}$ and $\hat{C}$, where $\Upsilon_a^T \in \mathbb{R}^{m\times 1}$ and $\Omega_a^T \in \mathbb{R}^{3\times 1}$ are the approximation weights of the actors.

Dynamic Optimizer Critic (Solving Value Function Approximation)
Tracker System Actor (Optimal Control Strategy Approximation)
The tuning processes are interactive, and the weights of each structure are updated using a gradient descent approach. The update laws (14) for the critic weights of this algorithm use a critic learning rate $0 < \alpha_c < 1$ and the vectors $Z_X = [X_k^T \; (U_k^r)^T]^T$ and $Z_E = [E_k^T \; (C_k^r)^T]^T$, and they drive the approximations toward the target values $\Gamma^{target}(\ldots)$ and $\Xi^{target}(\ldots)$. In a similar fashion, the approximation weights of the optimal control strategies are updated using the rules (15), where $0 < \alpha_a < 1$ defines the actor learning rate and the optimal policy approximations $\hat{U}_k$ and $\hat{C}_k$ are driven toward their respective target values. Consequently, the critic and actor update laws are given by (14) and (15), respectively, and they form the implementation platforms of the solution steps (8) and (9) in Algorithm 1.
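The gradient descent tuning described above can be illustrated with a minimal sketch. It is not the paper's update laws (14) and (15), which are not reproduced here: it assumes a quadratic critic of the form 0.5 z^T H z, a squared-error loss, and a linear actor, and all function names are hypothetical.

```python
import numpy as np


def critic_value(H, z):
    """Assumed quadratic critic: 0.5 * z^T H z with z = [x; u]."""
    return 0.5 * z @ H @ z


def critic_step(H, z, target, alpha_c):
    """One gradient-descent step on 0.5 * (critic_value - target)**2 w.r.t. H."""
    err = critic_value(H, z) - target
    grad = err * 0.5 * np.outer(z, z)  # symmetric, so H stays symmetric
    return H - alpha_c * grad


def greedy_gain(H, n):
    """Gain K = -H_uu^{-1} H_ux, so that the target policy is u = K x."""
    H_ux = H[n:, :n]
    H_uu = H[n:, n:]
    return -np.linalg.solve(H_uu, H_ux)


def actor_step(W, x, u_target, alpha_a):
    """One gradient-descent step on 0.5 * ||W x - u_target||^2 w.r.t. W."""
    err = W @ x - u_target
    return W - alpha_a * np.outer(err, x)


# Toy numbers (hypothetical): two states, one control input.
x = np.array([1.0, -0.5])
u = np.array([0.2])
z = np.concatenate([x, u])
H = np.eye(3)
H_new = critic_step(H, z, target=2.0, alpha_c=0.1)
W_new = actor_step(np.zeros((1, 2)), x, greedy_gain(H_new, n=2) @ x, alpha_a=0.1)
```

With learning rates in (0, 1) and moderately sized signals, each step shrinks the corresponding approximation error, which is the behavior the interactive tuning relies on.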

Remark 2.
The gradient descent approach employs actor-critic learning rates which take positive values less than 1. In the proposed development, the actor-critic learning rates are tied to the sampling time used to generate the online measurements in the discrete-time environment. This is done to achieve smooth tuning of the actor-critic weights relative to the changes in the dynamics of the system. Gradient descent approaches do not come with guaranteed convergence criteria. However, as will be shown below, the simulation cases emphasize the usefulness of this approach even when a challenging dynamical environment is considered, where one of the challenging scenarios applies random actor-critic learning rates at each evaluation step of the real-time processes.

Neural Network Implementation of Algorithm 2
The following development introduces the neural network implementation of the modified value iteration solution presented in Algorithm 2.
The solving value functions are approximated by $\tilde{\Gamma}(\cdot|\Delta_c)$ and $\tilde{\Xi}(\cdot|\Lambda_c)$, where $\Delta_c$ and $\Lambda_c$ are the critic approximation weights matrices. The approximations of the optimal control strategies $U^{*}$ and $C^{*}$ use the actor neural network weights $\Delta_a^T \in \mathbb{R}^{m\times 1}$ and $\Lambda_a^T \in \mathbb{R}^{3\times 1}$. The tuning of the critic weights for both optimization loops uses a critic learning rate $0 < \eta_c < 1$, where $\bar{\Delta}_c$ and $\bar{\Lambda}_c$ are vector transformations of the upper triangle section of the symmetric solution matrices $\Delta_c$ and $\Lambda_c$, respectively, and $\bar{Z}_X$ and $\bar{Z}_E$ are the respective vector-to-vector transformations of $\tau_X^r$ and $\tau_E^r$. The critic weights are driven toward the target values $\tilde{\Gamma}^{target}(\ldots)$ and $\tilde{\Xi}^{target}(\ldots)$. The update of the actor weights for this solution algorithm follows a structure similar to that of Algorithm 1, where $0 < \eta_a < 1$ is an actor learning rate and the actor outputs are driven toward the target values $\tilde{U}^{target}(\ldots)$ and $\tilde{C}_k^{target}(\ldots)$.

Autonomous Flexible Wing Aircraft Controller
The proposed online adaptive learning approaches are employed to design an autonomous trajectory-tracking controller for a flexible wing aircraft. The flexible wing aircraft functions as a two-body system (i.e., the pilot/fuselage and wing systems) [10,13–16]. Unlike fixed wing systems, flexible wing aircraft do not have exact aerodynamic models, due to the deformations that continuously occur in the wings [13,64,65]. Aerodynamic modeling attempts rely on semi-experimental results with no exact models, which complicates the autonomous control task and makes it very challenging [13]. Recently, these aircraft have attracted increasing attention as candidates for the unmanned aerial vehicles family due to their low-cost operation, uncomplicated design, and simple fabrication process [44]. The maneuvers are achieved by changing the relative centers of gravity between the pilot and wing systems. In order to change the orientation of the wing with respect to the pilot/fuselage system, the control bar of the aircraft takes different pitch-roll commands to achieve the desired trajectory. The pitch/roll maneuvers are achieved by applying directional forces on the control bar of the flexible wing system in order to create or alter the desired orientation of the wing with respect to the pilot/fuselage system [65,66].
The objective of the autonomous aircraft controller design is to use the proposed online adaptive learning structures to achieve the roll-trajectory-tracking objectives and to minimize the energy paths (the dynamics of the aircraft) during the tracking process. The energy minimization is crucial for the economics of flying systems that share the same optimization objectives. The motions of the flexible wing aircraft are decoupled into longitudinal and lateral frames [13,64]. The lateral motion frame is harder to control than the pitch motion frame, which has inherent stability. A lateral motion frame of a flexible wing aircraft is shown in Figure 3.

Assessment Criteria for the Adaptive Learning Algorithms
The effectiveness of the proposed online model-free adaptive learning mechanisms is assessed based on the following criteria:
• The convergence of the online adaptation processes (i.e., tuning of the actor and critic weights achieved using Algorithms 1 and 2) and, consequently, the resulting trajectory-tracking error characteristics.
• The performance of the standalone tracking system versus the overall or combined tracking control scheme.
• The stability results of the online combined tracking control scheme (i.e., the aircraft is required to achieve the trajectory-tracking objective in addition to minimizing the energy exchanges during the tracking process).
• The benefits of the attempted adaptive learning approaches in improving the closed-loop time-characteristics of the aircraft during the navigation process.
Additionally, the simulation cases are designed to show how well Algorithm 2 (i.e., the newly modified Bellman temporal difference framework) performs against Algorithm 1.

Generation of the Online Measurements
To apply the proposed adaptive approaches to the lateral motion frame, a simulation environment is needed to generate the online measurements. The different control methodologies do not use all the available measurements to control the aircraft [13,65]. Thus, the proposed approach is flexible in the selection of the key measurements. Hence, a lateral aerodynamic model at a trim speed, based on a semi-experimental study, is employed to generate the measurements as follows [13]: $\dot{X} = A X + B U_T$, where the lateral state vector of the wing system is given by $X = [v_l \; \dot{\phi} \; \dot{\psi} \; \phi \; \psi]^T$ and $U_T$ is the lateral control signal applied to the control bar.
The control signal $U_T$ is the overall combined control strategy decided by the tracker system and the optimizer system (i.e., $U_{Tk} = U_k + C_k$). In this example, the banking control signal dynamically aggregates the scalar signals $U_k \in \mathbb{R}$ and $C_k \in \mathbb{R}$ in real time in order to get an equivalent control signal $U_{Tk}$ that is applied to the control bar in order to optimize the motion following a trajectory-tracking command. The optimizer decides the state feedback control policy $U_k = f(X_k)$ using the measurements $X_k$, where the linear state feedback optimizer control gains $\Omega_a, \Lambda_a \in \mathbb{R}^{1\times 5}$ are decided by the proposed adaptive learning algorithms. Similarly, the tracking system decides the linear tracking feedback control policy $C_k$ based on the error signals $(e_k, e_{k-1}, e_{k-2})$, where $e_k = \phi_k^{desired} - \phi_k^{actual}$, $\forall k$. The linear feedback tracking control gains $\Upsilon_a, \Delta_a \in \mathbb{R}^{1\times 3}$ are adapted in real time using the online reinforcement learning algorithms.
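The way the two feedback laws combine into one control bar command can be sketched as follows. The class name, the zero-initialized error history, and the gain values are hypothetical; only the structure (a 1×5 state feedback gain, a 1×3 gain on the last three roll errors, and their sum) follows the description above.

```python
from collections import deque

import numpy as np


class CombinedController:
    """Sum of an optimizer state-feedback law and a tracking error-feedback law."""

    def __init__(self, optimizer_gain, tracker_gain):
        self.k_opt = np.asarray(optimizer_gain, dtype=float).reshape(1, 5)
        self.k_trk = np.asarray(tracker_gain, dtype=float).reshape(1, 3)
        # Error history ordered as (e_k, e_{k-1}, e_{k-2}); assumed to start at zero.
        self.errors = deque([0.0, 0.0, 0.0], maxlen=3)

    def step(self, x, phi_desired, phi_actual):
        """Return U_Tk = U_k + C_k for the current measurements."""
        # appendleft on a full deque drops the oldest error from the right end.
        self.errors.appendleft(phi_desired - phi_actual)
        e_vec = np.array(self.errors)
        u_k = (self.k_opt @ np.asarray(x, dtype=float)).item()  # optimizer part
        c_k = (self.k_trk @ e_vec).item()                        # tracker part
        return u_k + c_k


# Hypothetical gains, used only to exercise the structure.
controller = CombinedController([0, 0, 0, -1, 0], [2.0, 1.0, 0.5])
u_total = controller.step([0, 0, 0, 0.1, 0], phi_desired=1.0, phi_actual=0.1)
```

In the adaptive scheme, both gain vectors would be overwritten at each step by the actor weights, while the summation itself stays unchanged.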
Notably, the proposed online learning solutions do not employ any information about the dynamics (i.e., the drift dynamics A and the control input matrix B), so they function like black-box mechanisms. Moreover, the control objectives are implemented in an online fashion, where only real-time measurements are considered. In other words, the control mechanism for the roll maneuver generates the real-time control strategy for the roll motion frame regardless of what is occurring in the pitch direction, and vice versa.

Simulation Environment
As described earlier, a state space model captured at a trim flight condition is used to generate online measurements [13].
The learning parameters for the adaptive learning algorithms are given by $\eta_a = \eta_c = \alpha_a = \alpha_c = 0.0001$. The learning parameters are selected to be comparable to the sampling time to have smooth adjustments for the adapted weights. Later, random learning rates are superimposed at each evaluation step.
The initial conditions are set to X 0 = 40 1. The desired roll-tracking trajectory consists of two smooth opposite turns represented by a sinusoidal reference signal such that $\phi^{desired}(t) = 25 \sin(2\pi t/10)$ deg (i.e., right and left turns with maximum amplitudes of 25 deg).

Simulation Outcomes
The simulation scenarios tackle the performance of the standalone tracker, then the characteristics of the overall or combined adaptive control approach. Finally, a third scenario is considered to discuss the performance of the adaptive learning algorithms under an unstructured dynamical environment and uncertain learning parameters. These simulation cases are detailed as follows:
1. Standalone tracker: The adaptive learning algorithms are tested to achieve only the trajectory-tracking objective (i.e., no overall dynamical optimization is included, and they are denoted by STA1 and STA2 for Algorithms 1 and 2, respectively). In the standalone tracking operation mode, the Bellman equations concerning the optimized overall performance, and hence the associated optimal control strategies, are omitted from the overall adaptive learning structure.
2. Combined control scheme: This case combines the adaptive tracking control and optimizer schemes (i.e., the tracking control objective is considered along with the overall dynamical optimization using Algorithms 1 and 2, which are referred to as OTA1 and OTA2, respectively).
3. Operation under uncertain dynamical and learning environments: The proposed online reinforcement learning approaches are validated using a challenging dynamical environment, where the dynamics of the aircraft (i.e., matrices A and B) are allowed to vary at each evaluation step by ±50% around their nominal values at a normal trim condition. The aircraft is required to follow a complicated trajectory to highlight the capabilities of the adaptive learning processes during this maneuver. Additionally, the actor-critic learning rates are allowed to vary at each iteration index or solution step.

Adaptation of the Actor-Critic Weights
The tuning processes of the actor and critic weights are shown to converge when they follow solution Algorithms 1 and 2, as shown in Figures 4-6. This is observed both when the tracker is used in a standalone situation and when it is operated within the combined or overall dynamical optimizer. The actor and critic weights for the tracking component of the optimization process converge in less than 0.1 s, as shown in Figures 4 and 5. The tuning of the critic weights in the case of the optimized tracker took a longer time due to the number of involved states and the objective of the overall dynamical optimization problem, as shown in Figure 6. It is worth noting that the tracker part of the controller uses the tracking error signals as inputs, which facilitates the tracking optimization process. These results highlight the capability of the adaptive learning algorithms to converge in real time.

Stability and Tracking Error Measures
The adaptive learning algorithms, under the different scenarios or modes of operation, stabilize the flexible wing system along the desired trajectory as shown in Figures 7 and 8. The lateral motion dynamics eventually follow the desired trajectory. In this case, the lateral variables are not supposed to decay to zero, since the aircraft is following a desired trajectory. The tracking scheme leads this process side by side with the overall energy optimization process, which improves the closed-loop characteristics of the aircraft towards minimal energy behavior. It is noticed that Algorithm 2 outperforms Algorithm 1 under both the standalone tracking mode and the overall optimized tracking mode. In order to quantify these effects numerically and graphically, the average accumulated tracking errors obtained using the proposed adaptive learning algorithms are shown in Figure 9a,b, respectively. These indicate that the optimized tracker modes of operation (i.e., OTA1 and OTA2) give lower errors compared to those achieved during the standalone modes of operation (i.e., STA1 and STA2), emphasizing the importance of adding the overall optimization scheme to the tracking system. Adaptive learning Algorithm 2, using the optimized tracking mode, achieves the lowest average of accumulated errors as shown in Figure 9b. An additional measure index is used, where the overall normalized dynamical effects are evaluated using a Normalized Accumulated Cost Index (NACI). The normalization values are the squares of the maximum measured values of $X_k$ and $U_{Tk}$. The adaptive algorithm (OTA2) achieves the lowest overall dynamical cost or effort as shown by Figure 10. The final control laws achieved by using the different algorithms under the above modes of operation (i.e., STA1, STA2, OTA1, and OTA2) are listed in Table 1.
The online value iteration processes result in increasing bounded sequences of the solving value functions $\Gamma^r(\ldots)$ and $\Xi^r(\ldots)$, $\forall r$, which is aligned with the convergence properties of typical value iteration mechanisms. The online learning outcomes of the value iteration processes $\Gamma^r(\ldots)$ (i.e., using Algorithms 1 and 2) are evaluated for five random initial conditions as shown by Figure 11. The initial solving value functions evaluated by Algorithms 1 and 2 start from the same positions using the same vector of initial conditions. It is observed that Algorithm 2 (solid lines) outperforms Algorithm 1 (dashed lines) in terms of the updated solving value function obtained using the attempted random initial conditions. Although both algorithms show a generally increasing and converging evolution pattern of the solving value functions, value iteration Algorithm 2 exhibits a more rapid increase and quicker settlement to lower values compared to Algorithm 1.

Closed-Loop Characteristics
To examine the time-characteristics of the adaptive learning algorithms, their closed-loop performances under the optimized tracking operation mode (i.e., OTA1 and OTA2) are plotted in Figure 12. Apart from the tracking feedback control laws, the optimizer state feedback control laws directly affect the closed-loop system. The analysis shows (1) how the aircraft system initially starts (i.e., the open-loop system); (2) the evolution of the closed-loop poles during the learning process; and (3) the final closed-loop characteristics when the actor weights finally converge. The trace of the closed-loop poles achieved using OTA2 (i.e., the • marks) shows a more compact and faster stable behavior than that obtained using OTA1 (i.e., the • indicators), and clearly faster than the open-loop characteristics. The dominant open-loop pole is moved further into the stability region when the overall dynamical optimizer is included, as listed in Table 2. These results emphasize the stability and superior time-response characteristics achieved using the adaptive learning approaches, especially Algorithm 2.
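A pole trace of this kind can be produced by evaluating the eigenvalues of the closed-loop matrix for each snapshot of the optimizer gain. The sketch below assumes a continuous-time model with state feedback $U = K X$; the matrices are small hypothetical placeholders, not the aircraft model from [13].

```python
import numpy as np


def closed_loop_poles(A, B, K):
    """Eigenvalues of A + B K for the state feedback u = K x."""
    return np.linalg.eigvals(A + B @ K)


def is_stable_continuous(poles):
    """True when every pole lies strictly in the open left half of the s-plane."""
    return bool(np.all(np.real(poles) < 0.0))


# Hypothetical two-state example with one unstable open-loop pole.
A = np.array([[0.0, 1.0], [2.0, -1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-6.0, -4.0]])  # hypothetical converged gain

open_loop = np.linalg.eigvals(A)
closed_loop = closed_loop_poles(A, B, K)
print(is_stable_continuous(open_loop), is_stable_continuous(closed_loop))
```

Calling `closed_loop_poles` with the actor weights recorded at successive learning steps gives the evolution of the poles between the open-loop start and the converged gains.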

Performance in Uncertain Dynamical Environment
This simulation scenario challenges the performance of the online adaptive controller in an uncertain dynamical environment. The continuous-time aircraft aerodynamic model (i.e., the aircraft state space model with the drift dynamics matrix A and control input matrix B) is forced to involve unstructured dynamics [13]. These disturbances have amplitudes of ±50% around the nominal values at the trim condition and are generated from a Gaussian distribution as shown in Figure 13c,d. Additionally, the sampling time is set to $T_s = 0.005$ s and the actor-critic learning rates are allowed to vary at each evaluation step as shown by Figure 13a,b to test a band of learning parameters. Finally, a challenging desired trajectory is proposed such that $\phi^{desired}(t) = (25 \sin(6\pi t/10) + 15 \cos(16\pi t/10))\, e^{-3t/10}$ deg. These coexisting factors challenge the effectiveness of the controller. The randomness in these coexisting dynamical learning situations provides a rich exploration environment for the adaptive learning processes. The dynamic variations occur at each evaluation step, which provides a degree of generalization for the dynamical processes under consideration. Figure 14a-d shows that the adaptive learning Algorithms 1 and 2 (i.e., OTA1 and OTA2) are able to achieve the trajectory-tracking objectives. The actor weights are shown to converge despite the co-occurring uncertainties. The adaptation processes respond effectively to the acting disturbances, although a relatively longer time is needed to converge to the proper control gains. The tracking feedback control gains took a shorter time to converge, as shown by Figure 14c,d, since the tracking feedback control law depends only on the state $\phi_k$ and, implicitly, its derivative. Algorithm 2 exhibited better trajectory-tracking features compared to those obtained using Algorithm 1, as shown by Figure 15a.
Figure 15b, when compared to Figure 12, shows how the open-loop poles, represented by F marks (recorded disturbances at each iteration k), spread all over the s-plane. The adaptive learning Algorithms 1 and 2 exhibited stable behavior similar to that observed in the earlier scenarios. However, a longer time was needed to reach asymptotic stability around the desired reference trajectory. This can be observed by examining the spread of the closed-loop poles obtained using OTA1 (• notations) and OTA2 (• symbols). These results highlight the robustness of the proposed adaptive learning approaches to different uncertainties in the dynamic learning environments.
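The test conditions of this scenario can be approximated with a short sketch. The reference trajectory follows the expression given above; the perturbation routine is an assumption (Gaussian samples with a hypothetical standard deviation, clipped to ±50% and applied element-wise), since the exact sampling scheme is not specified here, and the nominal matrix is a placeholder.

```python
import numpy as np


def desired_roll_deg(t):
    """Challenging reference: decaying sum of a sine and a cosine, in degrees."""
    t = np.asarray(t, dtype=float)
    envelope = np.exp(-3.0 * t / 10.0)
    return (25.0 * np.sin(6.0 * np.pi * t / 10.0)
            + 15.0 * np.cos(16.0 * np.pi * t / 10.0)) * envelope


def perturb(nominal, rng, max_fraction=0.5, sigma=0.25):
    """Scale each entry by (1 + noise), noise Gaussian and clipped to +/- max_fraction.

    sigma is a hypothetical choice; it is not taken from the source.
    """
    noise = rng.normal(0.0, sigma, size=nominal.shape)
    noise = np.clip(noise, -max_fraction, max_fraction)
    return nominal * (1.0 + noise)


rng = np.random.default_rng(0)
A_nominal = np.array([[-1.0, 0.5], [0.2, -2.0]])  # placeholder, not the model of [13]
T_s = 0.005                                       # sampling time in seconds
time_grid = np.arange(0.0, 1.0, T_s)
reference = desired_roll_deg(time_grid)
A_k = perturb(A_nominal, rng)                     # one fresh draw per evaluation step
```

Drawing a new perturbed matrix at every evaluation step is what spreads the recorded open-loop poles across the s-plane in this kind of experiment.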

Implications in Practical Applications and Future Research Developments
The proposed combined adaptive learning approach can be integrated into various complex robotic or nonlinear system applications using flexible adaptive learning black-box mechanisms. These aim to optimize the performance of the actuation devices while maintaining the tracking control mission in an online fashion. At a minimum, the approach can enable complicated distributed tracking solutions for structured robotic systems using simple adaptation laws with affordable computational costs compared to existing adaptive approaches. It can work in unstructured dynamical environments where it is difficult to obtain full dynamical models of the underlying systems. The proposed adaptive learning algorithms can be deployed directly into the control units, where the only precautions concern (1) matching the sampling frequency (imposed by the sensory devices) to the learning parameters and (2) conditioning the weighting matrices in the utility or cost functions according to the actuation signals and the measured variables. The proposed learning approach is adaptive to the selection of the measured states, which makes it convenient to use in many real-world applications, since it does not rely on complicated adaptive learning constraints.
Future research directions may extend other reinforcement learning tools, such as policy iteration schemes, in order to develop combined adaptive tracking processes. This direction should find means to tackle the admissibility requirements of the initial policies along with relaxing the computational efforts required to accomplish these processes. The proposed adaptive learning approaches can be adopted for multi-agent applications. Taking into consideration the complexity of the multi-agent structures, this would involve further research investigations which tackle connectivity, communication costs, and stabilizability of the coupled control schemes as well as the convergence conditions for the adaptive learning solutions. These ideas may consider structures based on Bellman equations as well as the Hamilton-Jacobi-Bellman equations. Additional directions may investigate the use of other approximate dynamic programming classes which employ gradient-based solving forms to solve the optimal tracking control problem [17,37]. These involve solutions for the Dual Heuristic Dynamic Programming and Action-Dependent Dual Heuristic Dynamic Programming problems. These developments should handle the dependence of the temporal difference solutions on the complete dynamical model information.

Conclusions
A class of tracking control problems is solved using online model-free reinforcement learning processes. The formulation of the optimal control problem tackles the tracking as well as the overall dynamical processes by formulating the respective Bellman optimality or temporal difference equations. Two separate linear feedback control laws are adapted simultaneously in real time, where the first linear feedback law decides the optimal control gains associated with a flexible tracking error structure and the second law optimizes the overall dynamical performance during the tracking process. The proposed approach is employed to solve the challenging trajectory-tracking control problem of a flexible wing aircraft, where the aerodynamics of the wing are unknown and difficult to capture in a dynamical model. An aggressive learning environment that involves a complicated reference trajectory, an uncertain dynamical system, and flexible learning rates is adopted to show the usefulness of the developed learning approach. The complete optimized tracker revealed better closed-loop characteristics than those obtained using the standalone tracker.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:
Variables
$v_l$: Lateral velocity in the wing's frame of motion.
$\phi$, $\psi$: Roll and yaw angles in the wing's frame of motion.