Elsevier

Information Sciences

Volume 328, 20 January 2016, Pages 435-454

Online approximate solution of HJI equation for unknown constrained-input nonlinear continuous-time systems

https://doi.org/10.1016/j.ins.2015.09.001

Abstract

This paper is concerned with the approximate solution of the Hamilton–Jacobi–Isaacs (HJI) equation for constrained-input nonlinear continuous-time systems with unknown dynamics. We develop a novel online adaptive dynamic programming (ADP)-based algorithm to learn the solution of the HJI equation. The algorithm is implemented via an identifier-critic architecture consisting of two neural networks (NNs): an identifier NN estimates the unknown system dynamics, and a critic NN obtains the approximate solution of the HJI equation. An advantage of the proposed architecture is that the identifier NN and the critic NN are tuned simultaneously. By introducing two additional terms, namely, a stabilizing term and a robustifying term, into the critic NN update law, the initial stabilizing control is no longer required. Meanwhile, the developed critic tuning rule not only ensures convergence of the critic to the optimal saddle point but also guarantees stability of the closed-loop system. Moreover, the uniform ultimate boundedness of the weights of the identifier NN and the critic NN is proved using Lyapunov's direct method. Finally, two simulation examples are provided to illustrate the effectiveness and applicability of the developed approach.

Introduction

Over the past several decades, H∞ optimal control problems for nonlinear systems have attracted intensive attention, and many remarkable results have been obtained in this field [3], [4], [5], [10], [30], [31], especially those reported in [4], [31]. In [4], Basar and Bernhard showed that the H∞ optimal control problem is equivalent to a minimax optimization problem, termed a two-player zero-sum game, where the controller is the minimizing player and the exogenous disturbance is the maximizing one. In [31], by using the theory of dissipative systems, van der Schaft transformed the H∞ optimal control problem into the L2-gain optimal control problem. Nevertheless, a bottleneck remains for applying H∞ optimal control theory in practice, mainly because solving two-player zero-sum games and L2-gain optimal control problems often requires solving the associated Hamilton–Jacobi–Isaacs (HJI) equations. It is well known that HJI equations for nonlinear systems are nonlinear first-order partial differential equations (PDEs), which are difficult or impossible to solve analytically.

Since accurate solutions of HJI equations are intractable to obtain, an increasing number of researchers have turned their attention to deriving approximate solutions of this kind of equation. In the past few years, adaptive dynamic programming (ADP) methods have been successfully used to solve HJI equations. The ADP approach was first introduced by Werbos [36]; after that, various ADP methods were proposed (see the surveys [15], [35]). A distinct feature of the ADP approach is that it employs neural networks (NNs) to derive the approximate optimal control forward in time. Owing to this feature, the curse of dimensionality can be avoided when applying ADP approaches to solve Hamilton–Jacobi–Bellman/HJI equations [27]. In light of this advantage, ADP methods have been extensively utilized to solve HJI equations.

For discrete-time nonlinear systems, Mehraeen et al. [20] presented an offline ADP-based iterative approach to solve the HJI equation for two-player zero-sum games. By using the proposed method and a Taylor series expansion, a sufficient condition for convergence to the saddle point was obtained. After that, Liu et al. [16] developed a greedy iterative ADP algorithm to solve the HJI equations associated with two-player zero-sum games; in this algorithm, three NNs, referred to as the action NN, the critic NN, and the disturbance NN, approximate the optimal control, the optimal value, and the worst-case disturbance, respectively. Later, Zhang et al. [45] proposed an online ADP-based algorithm to learn the solution of the HJI equation for a class of H∞ control problems; with the algorithm given in [45], prior knowledge of the nonlinear system is not required.

For continuous-time (CT) nonlinear systems, the HJI equations are often approximately solved by using reinforcement learning (RL), which Werbos [37] considered a special case of ADP. Abu-Khalaf et al. [2] introduced an offline RL-based algorithm to obtain the approximate solution of the HJI equation for constrained-input nonlinear systems. After that, Luo et al. [19] proposed an off-policy RL method to solve the HJI equation of H∞ control problems; unlike [2], the algorithm given in [19] generates system data with arbitrary policies rather than with the policies being evaluated. Recently, Vamvoudakis and Lewis [33] introduced an online RL-based algorithm to solve the HJI equation for two-player zero-sum games, in which the actor, critic, and disturbance NNs are tuned simultaneously. Distinct from the above online RL-based algorithm, Dierks and Jagannathan [7] developed a single online approximator-based scheme to solve the HJI equation; only a single critic NN is employed to learn the solution, and no initial stabilizing control is required. It should be mentioned that prior knowledge of the system dynamics is required in [7], [33]. After that, Luo and Wu [38] presented a simultaneous policy update algorithm to solve the HJI equation arising in nonlinear H∞ control problems, for which the internal dynamics of the nonlinear system are not required. Later, Liu et al. [17] employed the simultaneous policy update algorithm to obtain the approximate solution of the HJI equations for multi-player nonzero-sum games with completely unknown dynamics. More recently, Johnson et al. [12] developed a projection algorithm to obtain the approximate solution of the coupled HJI equations for uncertain nonlinear CT systems.

To the best of the authors' knowledge, no ADP-based algorithm has yet been proposed to solve the HJI equation for constrained-input nonlinear CT systems with unknown dynamics. In this paper, we develop a novel online ADP-based algorithm to learn the solution of the HJI equation for such systems. The algorithm is implemented via an identifier-critic architecture consisting of two neural networks (NNs): an identifier NN estimates the unknown system dynamics, and a critic NN obtains the approximate solution of the HJI equation. An advantage of this architecture is that the identifier NN and the critic NN are tuned simultaneously. By introducing two additional terms, namely, a stabilizing term and a robustifying term, into the critic NN update law, no initial stabilizing control is required. Meanwhile, the developed critic tuning rule not only ensures convergence of the critic to the optimal saddle point but also guarantees stability of the closed-loop system. In addition, Lyapunov's direct method is utilized to demonstrate the uniform ultimate boundedness of the weights of the identifier NN and the critic NN.

It is significant to point out that, though our methodology is in a similar spirit to [7], this paper extends the work of [7] to obtain an online approximate solution of the HJI equation for constrained-input nonlinear CT systems with unknown dynamics. Solving the HJI equation for unknown constrained-input nonlinear CT systems is more intractable than solving it when the system dynamics are known, regardless of the control constraints.

The rest of the paper is organized as follows. Section 2 provides preliminaries of H∞ optimal control problems for constrained-input nonlinear CT systems. Section 3 presents the design of identifier NNs for unknown controlled systems, together with a stability proof. Section 4 develops a single critic NN to approximate the solution of the HJI equation. Section 5 presents the stability analysis. Section 6 gives two numerical examples to verify the effectiveness of the developed method. Finally, Section 7 offers several concluding remarks and potential future extensions.

Notations: R represents the set of all real numbers. R^m denotes the Euclidean space of all real m-vectors. R^{n×m} denotes the space of all n × m real matrices. I_n represents the n × n identity matrix. The superscript T denotes transposition. C^m represents the class of functions having continuous mth derivatives. For ξ̄ = [ξ̄_1, …, ξ̄_m]^T ∈ R^m, ‖ξ̄‖ = (∑_{i=1}^m |ξ̄_i|²)^{1/2} denotes the Euclidean norm of ξ̄. For A ∈ R^{m×m}, ‖A‖ = (λ_max(A^T A))^{1/2} denotes the 2-norm of A, where λ_max(A^T A) represents the maximum eigenvalue of A^T A.
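The two norms defined above can be checked numerically; the following sketch uses an arbitrary vector and matrix (not taken from the paper) and verifies the matrix 2-norm against NumPy's built-in spectral norm.

```python
import numpy as np

# Numerical illustration of the norms in the notation section; the vector
# and matrix below are arbitrary examples, not from the paper.
xi = np.array([3.0, -4.0])                     # a vector in R^2
euclid = np.sqrt(np.sum(np.abs(xi) ** 2))      # (sum_i |xi_i|^2)^(1/2) = 5.0

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))  # lambda_max(A^T A)
two_norm = np.sqrt(lam_max)                    # (lambda_max(A^T A))^(1/2)

# the eigenvalue formula agrees with NumPy's spectral norm
print(np.isclose(two_norm, np.linalg.norm(A, 2)))  # True
```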


Preliminaries and problem statement

Consider the nonlinear CT system described by
ẋ = f(x) + g(x)u + k(x)ω,
z = h(x) + p(x)u,
where x(t) ∈ R^n is the state, u(t) ∈ U ⊂ R^m is the control input with U = {u ∈ R^m : |u_i| ≤ κ, i = 1, …, m} and saturating bound κ > 0, ω(t) ∈ R^{q1} is the exogenous disturbance, and z(t) ∈ R^{q2} is the fictitious output. Here f(x) ∈ R^n, g(x) ∈ R^{n×m}, k(x) ∈ R^{n×q1}, h(x) ∈ R^{q2}, p(x) ∈ R^{q2×m}, with f(0) = 0, so that x = 0 is an equilibrium point of the system.
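A minimal sketch of simulating a system of this form under the input constraint |u_i| ≤ κ is given below. The particular f, g, k, feedback law, and disturbance signal are illustrative assumptions, not the paper's examples.

```python
import numpy as np

# Forward-Euler simulation of x_dot = f(x) + g(x)u + k(x)w with a
# saturated input. All dynamics below are assumed toy choices.
kappa = 1.0                                   # saturating bound

def f(x): return -x + 0.5 * np.sin(x)         # internal dynamics (assumed)
def g(x): return np.ones_like(x)              # input gain (assumed)
def k(x): return 0.1 * np.ones_like(x)        # disturbance gain (assumed)

def step(x, u, w, dt=0.01):
    u_sat = np.clip(u, -kappa, kappa)         # enforce u in U = {|u_i| <= kappa}
    return x + dt * (f(x) + g(x) * u_sat + k(x) * w)

x = np.array([1.0])
for i in range(2000):
    w = np.exp(-0.01 * i)                     # decaying (L2-type) disturbance
    x = step(x, u=-2.0 * x, w=w)              # a simple saturated feedback
# x has been driven close to the equilibrium x = 0
```

The saturation is enforced by clipping before integration, mirroring the constraint set U rather than the paper's tanh-based constrained-control design.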

Assumption 1

f(x), g(x), and k(x) are unknown smooth functions defined on R^n. ω(t) ∈ L2[0, ∞), which implies that there

Identifier design via dynamic NNs

According to [42], the first equation of system (1) can be represented by a dynamic NN as
ẋ = Ax + W_f^T φ(x) + W_g^T ρ(x)u + W_k^T ψ(x)ω + ε(x),
where A ∈ R^{n×n} is a Hurwitz matrix; W_f ∈ R^{n×n}, W_g ∈ R^{n×n}, and W_k ∈ R^{n×n} are ideal NN weight matrices; and ε(x) ∈ R^n is the NN function reconstruction error. The vector function φ(x) ∈ R^n is assumed to be n-dimensional with monotonically increasing elements. The matrix function ρ(x) ∈ R^{n×m} is assumed to take the form ρ(x) = [ρ_1(ζ_1^T x), …, ρ_n(ζ_n^T x)]^T, where ζ_i ∈ R^{n×m} is a constant matrix and ρ_i( · )
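The idea of identifying unknown drift dynamics online can be sketched as follows: a series-parallel identifier with Hurwitz feedback and a gradient weight update driven by the state estimation error. The true drift, basis functions, gains, and update law below are illustrative assumptions, not the paper's identifier tuning rules.

```python
import numpy as np

# Scalar online identifier sketch: x_hat_dot = a(x_hat - x) + Wf^T phi(x) + u,
# with gradient tuning of Wf on the error e = x - x_hat. The "unknown" plant
# is used only to generate data, as the identifier never reads W_star.
dt, a, gamma = 0.005, -5.0, 20.0              # step size, Hurwitz scalar, gain

def phi(x):                                   # monotone basis functions (assumed)
    return np.array([np.tanh(x), np.tanh(2.0 * x), x])

W_star = np.array([1.5, 0.0, -0.5])           # "unknown" true drift weights
def true_f(x):
    return W_star @ phi(x)

Wf = np.zeros(3)                              # identifier weight estimate
x, x_hat = 0.5, 0.0
for i in range(20000):
    t = i * dt
    u = np.sin(0.25 * t) + 0.5 * np.sin(1.3 * t)  # exciting probe input
    e = x - x_hat                             # identification error
    x_hat += dt * (a * (x_hat - x) + Wf @ phi(x) + u)
    Wf += dt * gamma * phi(x) * e             # gradient update on e
    x += dt * (true_f(x) + u)                 # plant step
# the state estimation error |x - x_hat| has become small
```

Because a < 0, the error dynamics ė = a·e + W̃^T φ(x) are stable for bounded weight error, so the estimate tracks the plant even before the weights fully converge.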

Approximate solution of the HJI equation via a single critic NN

According to the universal approximation property of NNs, the value function V*(x̂) given in (30) can be represented by a single-layer NN on a compact set Ω as
V*(x̂) = W_c^T σ(x̂) + ε_c(x̂),
where W_c ∈ R^{N0} is the ideal NN weight vector, σ(x̂) = [σ_1(x̂), σ_2(x̂), …, σ_{N0}(x̂)]^T ∈ R^{N0} is the activation function with σ_j(x̂) ∈ C^1 and σ_j(0) = 0, the set {σ_j(x̂)}_1^{N0} is often selected to be linearly independent, N0 is the number of neurons, and ε_c(x̂) is the NN function reconstruction error. Meanwhile, the derivative of V*(x̂
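The single-layer critic representation V*(x) ≈ W_c^T σ(x) can be illustrated with a toy fit: here W_c is obtained offline by least squares against a known quadratic value function, purely to show the approximation structure. The basis and value function are assumed examples; the paper instead tunes W_c online with its critic update rule.

```python
import numpy as np

# Fit Wc so that Wc^T sigma(x) matches a toy value function on samples
# from a compact set. Since V is exactly quadratic and the basis spans
# the quadratics in (x1, x2), the fit is exact.
rng = np.random.default_rng(1)

def sigma(X):                                  # basis with sigma_j(0) = 0 and
    x1, x2 = X[:, 0], X[:, 1]                  # linearly independent elements
    return np.stack([x1**2, x1 * x2, x2**2], axis=1)

def V_true(X):                                 # toy value function (assumed)
    return 0.5 * X[:, 0]**2 + X[:, 1]**2

X = rng.uniform(-1.0, 1.0, size=(200, 2))      # samples from a compact set
Wc, *_ = np.linalg.lstsq(sigma(X), V_true(X), rcond=None)
# Wc recovers [0.5, 0, 1] for this quadratic V
```

The gradient ∇V*(x) ≈ (∇σ(x))^T W_c used in the control and disturbance policies follows by differentiating the same representation.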

Stability analysis

Before demonstrating the main theorems, we present several required assumptions. These assumptions have been used in [18], [21], [22], [24], [32], [34].

Assumption 5

The ideal NN weight W_c is bounded by a known positive constant W_cM, i.e., ‖W_c‖ ≤ W_cM. There exist known constants b_εc > 0 and b_εcx̂ > 0 such that ‖ε_c(x̂)‖ < b_εc and ‖∇ε_c(x̂)‖ < b_εcx̂ for every x̂ ∈ Ω. In addition, there exist known constants b_εu* > 0 and b_εω* > 0 such that ‖ε_u*‖ ≤ b_εu* and ‖ε_ω*‖ ≤ b_εω* for every x̂ ∈ Ω.

Assumption 6

There exist known constants bσ > 0 and bσx^>0 such

Simulation results

In this section, two examples are provided to illustrate the effectiveness of the developed theoretical results.

Conclusions

In this paper, we have presented a new ADP-based algorithm that solves the HJI equation for constrained-input affine nonlinear CT systems in the presence of unknown dynamics. The algorithm employs an identifier-critic architecture in which the identifier NN and the critic NN are tuned simultaneously, and no initial stabilizing control is required. A limitation of the present algorithm is that the system state is required to be available. In our future work, we

References (46)

  • M. Abu-Khalaf et al., Policy iterations on the Hamilton–Jacobi–Isaacs equation for state feedback control with input saturation, IEEE Trans. Autom. Control (2006)
  • M. Abu-Khalaf et al., Neurodynamic programming and zero-sum games for constrained control systems, IEEE Trans. Neural Netw. (2008)
  • M. Aliyu, Nonlinear H∞ Control, Hamiltonian Systems and Hamilton–Jacobi Equations (2011)
  • T. Basar et al., H∞ Optimal Control and Related Minimax Design Problems (1995)
  • R.W. Beard et al., Successive Galerkin approximation algorithms for nonlinear optimal and robust control, Int. J. Control (1998)
  • T. Dierks et al., Optimal control of affine nonlinear continuous-time systems using an online Hamilton–Jacobi–Isaacs formulation, in: Proceedings of the 49th IEEE Conference on Decision and Control, Atlanta, GA, USA (2010)
  • B.A. Finlayson, The Method of Weighted Residuals and Variational Principles (1972)
  • J. Huang, An algorithm to solve the discrete HJI equation arising in the L2 gain optimization problem, Int. J. Control (1999)
  • M. Johnson et al., Nonlinear two-player zero-sum game approximate solution using a policy iteration algorithm, in: Proceedings of the IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA (2011)
  • M. Johnson et al., Approximate N-player nonzero-sum game solution for an uncertain continuous nonlinear system, IEEE Trans. Neural Netw. Learn. Syst. (2015)
  • F.L. Lewis et al., Neural Network Control of Robot Manipulators and Nonlinear Systems (1999)
  • F.L. Lewis et al., Reinforcement Learning and Approximate Dynamic Programming for Feedback Control (2013)
  • F.L. Lewis et al., Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits Syst. Mag. (2009)

    This work was supported in part by the National Natural Science Foundation of China under Grants 61034002, 61233001, 61273140, 61304086, and 61374105, in part by Beijing Natural Science Foundation under Grant 4132078, and in part by the Early Career Development Award of the State Key Laboratory of Management and Control for Complex Systems (SKLMCCS).
