
Automatica

Volume 138, April 2022, 110153

Brief paper
Homotopic policy iteration-based learning design for unknown linear continuous-time systems

https://doi.org/10.1016/j.automatica.2021.110153

Abstract

Recent results have shown that policy iteration is a powerful reinforcement learning tool for designing a stabilizing control policy for continuous-time systems with unknown system dynamics. Policy iteration involves a model-based initialization stage, i.e., seeking an initial stabilizing control policy, which is, however, dependent on the full system dynamics, including the drift dynamics and the system input matrix. To remove such model requirements, this paper utilizes a homotopy-based initialization strategy for policy iteration, wherein a stabilizing control policy for continuous-time systems is obtained by gradually moving a stable system to the original system. We propose two homotopic policy iteration-based stabilizing control schemes, namely, a model-based design and a model-free design using system data, both of which are proved to place the unstable poles into a stable region. The effectiveness of the proposed designs is validated through an illustrative example.

Introduction

Finding a stabilizing control without knowing the accurate system dynamics has been a long-standing goal in the field of systems and control, and it has generated fruitful results on adaptive control (see, e.g., Åström and Wittenmark, 2013, Goodwin and Sin, 2014, Ioannou and Sun, 2012, Krstić et al., 1995, Sastry and Bodson, 2011, and the references therein). In practical scenarios, it is often important to design controllers with user-defined performance, which requires adaptive controllers to guarantee both system stability and performance. To this end, indirect adaptive controllers identify the system parameters, based on which an optimal control is derived within a model-based framework (Ioannou & Fidan, 2006). Extremum-seeking schemes with an analytical footing are documented in Ariyur and Krstić (2003) and Scheinker and Krstić (2017) for improving system performance. Inverse optimality conditions are considered for adaptive controllers in, e.g., Freeman and Kokotović (2008) and Krstić and Tsiotras (1999).

As a method that leverages interactions between an agent and its environment, reinforcement learning (RL), e.g., (Abbasi-Yadkori and Szepesvári, 2011, Bradtke et al., 1994, Chen et al., 2021, Ibrahimi et al., 2012, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998, Vrabie et al., 2013), or approximate/adaptive dynamic programming (ADP), e.g., (Bertsekas, 1995, Jiang, Fan, et al., 2020, Jiang, Kiumarsi, et al., 2020, Lewis, Vrabie, and Vamvoudakis, 2012, Liu et al., 2017, Powell, 2007, Zhang et al., 2012), has recently been utilized to design control policies for both continuous-time (CT) and discrete-time (DT) systems. One structure for implementing RL algorithms is policy iteration, wherein the performance of the current control policy is evaluated and an improved control policy is then derived from that evaluation (Kleinman, 1968, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998). RL algorithms via policy iteration can adaptively learn a stabilizing controller that minimizes a given performance index while guaranteeing system stability.
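To make the policy-evaluation/policy-improvement cycle concrete for the LQR case, the sketch below implements Kleinman's classical model-based policy iteration: the current gain is evaluated by solving a Lyapunov equation, and an improved gain is then computed through the input matrix. The system matrices, weights, and the initial stabilizing gain below are illustrative placeholders, not values taken from this paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman_policy_iteration(A, B, Q, R, K0, n_iter=20):
    """Model-based policy iteration (Kleinman, 1968) for the LQR problem.
    K0 must stabilize A - B @ K0; each sweep evaluates the current policy
    via a Lyapunov equation and then improves the gain."""
    K = K0
    for _ in range(n_iter):
        Ak = A - B @ K                            # closed loop under the current policy
        Qk = Q + K.T @ R @ K                      # running cost of the current policy
        P = solve_continuous_lyapunov(Ak.T, -Qk)  # policy evaluation: Ak'P + P Ak + Qk = 0
        K = np.linalg.solve(R, B.T @ P)           # policy improvement: K <- R^{-1} B' P
    return K, P

# Illustrative second-order example (not from the paper)
A = np.array([[0.0, 1.0], [1.0, -1.0]])           # unstable open loop
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K0 = np.array([[3.0, 2.0]])                       # assumed initial stabilizing gain
K_opt, P_opt = kleinman_policy_iteration(A, B, Q, R, K0)
print(K_opt)
```

Note that the iteration cannot even start without a stabilizing K0, which is exactly the model-based initialization issue that the homotopic designs discussed in this paper are meant to remove.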

This paper focuses on the stabilizing control design for CT systems without prior knowledge of the system dynamics. To obviate the requirement on the system dynamics in RL-based results (Doya, 2000, Murray et al., 2001), the Integral Reinforcement Learning (IRL) algorithm was proposed for CT systems without requiring prior knowledge of the drift dynamics (Lewis and Vrabie, 2009, Vrabie and Lewis, 2009, Vrabie et al., 2008, Vrabie et al., 2009). In Lewis and Vrabie (2009) and Vrabie et al. (2009), IRL algorithms are implemented via policy iteration for solving Linear Quadratic Regulator (LQR) problems and also for nonlinear systems. The IRL-based method learns the optimal control solution directly from real-time data collected along the system trajectories (Lewis, Vrabie, & Syrmos, 2012). Further details of the IRL algorithms and their implementation for CT systems can be found in Chen et al., 2020, Chen, Modares, et al., 2019, Chen, Xie, et al., 2019, Jiang and Jiang, 2012, Lewis, Vrabie, and Syrmos, 2012 and Lewis, Vrabie, and Vamvoudakis (2012) and the references therein. Of note, value iteration-based results for the LQR design of CT systems are reported in Bian and Jiang, 2016, Bian and Jiang, 2021 and Vrabie et al. (2013).

Note that policy iteration-based designs for CT systems require an initial stabilizing control policy, which is largely dependent on prior knowledge of the full system dynamics, including the drift dynamics and the input matrix (see Kleinman, 1968, Kleinman, 1970, Vrabie et al., 2009 and the references therein). This results in a model-based initialization stage when implementing policy iteration. One way to approach a stabilizing control is a homotopy-based method (Feng and Lavaei, 2020a, Lamperski, 2020, Mobahi and Fisher, 2015). Homotopy can be used as an initialization strategy, an idea that was studied in Broussard and Halyo (1983). Recently, Feng and Lavaei (2020b) considered improving locally optimal solutions for CT systems via optimal decentralized control, and building on Feng and Lavaei (2020b), Lamperski (2020) studied the control policy design for DT systems. However, it remains unclear how to obtain the optimal CT solution without knowing the system dynamics; indeed, a model-based projected gradient descent method was used in Feng and Lavaei (2020b) as the local search algorithm.

This paper aims to design a stabilizing control for linear CT systems without knowing the accurate system dynamics. We propose a homotopy-based policy iteration for linear CT systems, wherein a stabilizing control policy is obtained by gradually moving an artificial stable system to the original control system. Note that existing policy iteration-based results for CT systems usually require a model-based initialization stage, in which the full system dynamics, including the drift dynamics and the system input matrix, may be used for the design. In this paper, we utilize a homotopy-based strategy to remove such model requirements and propose homotopic policy iteration by integrating homotopy with the policy iteration technique for CT systems. Using the homotopic policy iteration, we establish model-based and model-free designs for CT systems that place the unstable closed-loop poles into a stable region, thereby providing a method to improve the convergence rate of unknown linear CT systems.

The remainder of the paper is organized as follows. In Section 2, we formulate a stable closed-loop poles-oriented learning design problem for CT systems and review standard policy iteration-based techniques for LQR design. In Section 3, we give a homotopy-based policy iteration design for CT systems to obtain stable closed-loop poles within a model-based framework. In Section 4, we extend the model-based result of Section 3 to propose a model-free version using system data collected along the trajectories of the system states. In Section 5, we present an illustrative example. In Section 6, the main results of this paper are summarized.

Notations: The notation $\mathbb{C}^{-}$ denotes the open left-half complex plane. Given a square matrix $X$, $\lambda(X)$ denotes its spectrum, and $\sigma_{\min}(X)$ and $\sigma_{\max}(X)$ are, respectively, its minimum and maximum singular values. The notation $\otimes$ indicates the Kronecker product. Given a matrix $X \in \mathbb{R}^{m \times n}$, $\mathrm{vec}(X)=[x_1^T, x_2^T, \ldots, x_i^T, \ldots, x_{n-1}^T, x_n^T]^T \in \mathbb{R}^{mn}$ denotes a vector with $x_i \in \mathbb{R}^m$ representing the $i$th column of the matrix $X$. Given a symmetric matrix $X \in \mathbb{R}^{m \times m}$, $\mathrm{vecs}(X)=[x_{1,1}, 2x_{1,2}, \ldots, 2x_{1,m}, x_{2,2}, 2x_{2,3}, \ldots, 2x_{m-1,m}, x_{m,m}]^T \in \mathbb{R}^{\frac{1}{2}m(m+1)}$ with $x_{i,j}$ being an entry of $X$. Given a vector $x_i \in \mathbb{R}^m$, $\mathrm{vecv}(x_i)=[x_{i,1}^2, x_{i,1}x_{i,2}, \ldots, x_{i,1}x_{i,m}, x_{i,2}^2, x_{i,2}x_{i,3}, \ldots, x_{i,m-1}x_{i,m}, x_{i,m}^2]^T \in \mathbb{R}^{\frac{1}{2}m(m+1)}$. Let $I_m$ be an identity matrix of size $m \times m$.
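A minimal sketch of the vecs and vecv operators, following the orderings defined above (the choice of Python here is ours, not the paper's):

```python
import numpy as np

def vecs(X):
    """vecs of a symmetric m x m matrix: each diagonal entry once, each
    off-diagonal entry doubled, taken row by row above the diagonal."""
    m = X.shape[0]
    return np.concatenate([np.r_[X[i, i], 2.0 * X[i, i + 1:]] for i in range(m)])

def vecv(x):
    """vecv of a vector in R^m: all monomials x_i * x_j with i <= j,
    in the same row-by-row ordering as vecs."""
    m = x.shape[0]
    return np.concatenate([np.r_[x[i] ** 2, x[i] * x[i + 1:]] for i in range(m)])
```

With these orderings, $x^T X x = \mathrm{vecv}(x)^T \mathrm{vecs}(X)$ for any symmetric $X$, which is the identity that allows quadratic value functions to be estimated from data in the later sections.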


Problem formulation

It is often desirable to find a control law $u$ that stabilizes a linear continuous-time system given by
$$\dot{x} = Ax + Bu, \tag{1}$$
where $x \in \mathbb{R}^{n}$ is the system state, $u \in \mathbb{R}^{m}$ is the control input, and $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$ are constant but unknown matrices. The matrix $A$ is not necessarily Hurwitz stable, i.e., $\lambda(A) \not\subset \mathbb{C}^{-}$.

The system setting in (1) can be regarded as an adaptive control problem, which has attracted much attention in the field of system and control. The need for finding a stabilizing control also exists in RL-based

Model-based homotopic policy iteration design for stable closed-loop poles

In this section, we detail a model-based design for seeking a desired stabilizing control policy. The results presented in this section will pave the way for the model-free design in the next section.

The model-based design in this section assumes that prior knowledge of the system dynamics, $A$ and $B$, is available, which allows us to make the following assumption.

Assumption 1

There exists a known constant $\lambda_{\max}$ such that $\lambda_{\max} = \max\!\big(\max_i \mathrm{Re}(\lambda_i(A)),\, 0\big)$.

Note that Assumption 1 is used in this section only, and will be removed in Section 4.
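To illustrate how Assumption 1 enables a homotopy-based initialization, the sketch below starts from the artificially shifted system $A - \bar{\beta} I_n$, which is Hurwitz for any $\bar{\beta} > \lambda_{\max}$ (every eigenvalue's real part is reduced by $\bar{\beta}$), so that $K = 0$ is a legitimate initial stabilizing gain. The shift is then walked back to zero in small steps while a policy-iteration update is performed at each step. The step-size rule, the stability-margin check, and all numerical values are our own illustrative assumptions; they are not the update (14) or the iteration-length rule (15) of the paper, which choose the steps $\alpha_{k+1}$ so that progress is guaranteed.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def homotopic_policy_iteration(A, B, Q, R, lam_max, alpha=0.2, margin=1e-3, max_iters=500):
    """Model-based sketch: shift A to the left so that K = 0 is stabilizing,
    then remove the shift step by step, updating the gain by policy iteration
    and reducing the shift only while a stability margin is preserved."""
    n = A.shape[0]
    beta_bar = lam_max + 1.0            # any beta_bar > lam_max makes A - beta_bar*I Hurwitz
    shift = beta_bar                    # remaining artificial shift
    K = np.zeros((B.shape[1], n))       # K = 0 stabilizes the fully shifted system
    for _ in range(max_iters):
        if shift <= 0.0:
            return K
        A_h = A - shift * np.eye(n)     # current member of the homotopy family
        # One policy-iteration sweep on the shifted system:
        #   evaluate (A_h - B K)'P + P(A_h - B K) + Q + K'RK = 0, then K <- R^{-1} B'P
        P = solve_continuous_lyapunov((A_h - B @ K).T, -(Q + K.T @ R @ K))
        K = np.linalg.solve(R, B.T @ P)
        step = min(alpha, shift)
        # Reduce the shift only if the updated gain keeps a margin at the smaller shift
        if np.max(np.real(np.linalg.eigvals(A - (shift - step) * np.eye(n) - B @ K))) < -margin:
            shift -= step
    raise RuntimeError("shift not fully removed; try a smaller alpha")

# Illustrative use on an unstable plant (numbers not from the paper)
A = np.array([[0.5, 1.0], [0.0, 0.3]])
B = np.array([[0.0], [1.0]])
lam_max = max(np.max(np.real(np.linalg.eigvals(A))), 0.0)
K = homotopic_policy_iteration(A, B, np.eye(2), np.eye(1), lam_max)
print(np.linalg.eigvals(A - B @ K))     # closed-loop poles of the true system
```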

Model-free homotopic policy iteration design for stable closed-loop poles

The previous section gives a model-based homotopic policy iteration, wherein both the control gain matrix $K_{k+1}$ in (14) and the iteration length $\alpha_{k+1}$ in (15) require prior knowledge of the system dynamics, i.e., $\lambda_{\max}$ in Assumption 1, $A$, and $B$. In this section, we aim to remove the model requirements in solving (14), (15) and to present a completely model-free solution.

For the learning design, we add and subtract $(\bar{\beta} I_n + B K_k - \sum_{j=0}^{k}\alpha_j I_n)x$ to the system dynamics equation, i.e., $\dot{x} = (\bar{\beta} I_n + B K_k - \sum_{j=0}^{k}\alpha_j I_n)x + (A - \bar{\beta} I_n - B K_k + \sum_{j=0}^{k}\alpha_j I_n)x + Bu$.
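While the paper's own data-based equations are developed from this decomposition, the following sketch indicates, under stated assumptions, how such a model-free evaluation can be organized in the spirit of off-policy integral reinforcement learning: along trajectories of the true plant, the Lyapunov equation of the shifted system for the current gain $K_k$ yields a relation that is linear in $\mathrm{vecs}(P)$ and in $M = B^T P$, so both can be estimated by least squares from measured states and inputs without using $A$ or $B$, and the gain is then improved as $K_{k+1} = R^{-1} M$. The exploration signal, interval length, integration scheme, and the way the remaining shift $\beta$ enters are our illustrative assumptions, not the paper's exact construction; the vecv helper repeats the definition from the Notations section so the snippet is self-contained.

```python
import numpy as np

def vecv(x):
    """All monomials x_i * x_j with i <= j (same ordering as in the Notations)."""
    m = x.shape[0]
    return np.concatenate([np.r_[x[i] ** 2, x[i] * x[i + 1:]] for i in range(m)])

def collect_data(A, B, K, x0, beta, Q, R, dt=1e-3, T=0.05, n_intervals=60, seed=0):
    """Simulate xdot = A x + B u under an exploratory input and accumulate, for each
    interval of length T, the integrals used by the least-squares evaluation step.
    A and B are used only to generate data, never inside the learning equations."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    x = x0.copy()
    rows, rhs = [], []
    steps = int(round(T / dt))
    for _ in range(n_intervals):
        x_start = x.copy()
        Iv = np.zeros(n * (n + 1) // 2)     # integral of vecv(x)
        Ixu = np.zeros(m * n)               # integral of (u + K x) x' (row-stacked)
        Ic = 0.0                            # integral of x'(Q + K'RK)x
        for _ in range(steps):
            u = -K @ x + 0.5 * rng.standard_normal(m)   # behavior policy + exploration
            Iv += vecv(x) * dt
            Ixu += np.outer(u + K @ x, x).ravel() * dt
            Ic += float(x @ (Q + K.T @ R @ K) @ x) * dt
            x = x + dt * (A @ x + B @ u)                 # Euler step of the true plant
        # (vecv(x(t+T)) - vecv(x(t)) - 2*beta*Iv) vecs(P) - 2*Ixu vec(M) = -Ic
        rows.append(np.concatenate([vecv(x) - vecv(x_start) - 2.0 * beta * Iv, -2.0 * Ixu]))
        rhs.append(-Ic)
    return np.array(rows), np.array(rhs)

def model_free_step(rows, rhs, n, m, R):
    """Least-squares policy evaluation for vecs(P) and M = B'P, then gain improvement."""
    theta, *_ = np.linalg.lstsq(rows, rhs, rcond=None)
    M = theta[n * (n + 1) // 2:].reshape(m, n)
    return np.linalg.solve(R, M)            # K_{k+1} = R^{-1} B' P

# Illustrative use (numbers not from the paper): one evaluation/improvement step
# on the beta-shifted system, starting from K = 0, which stabilizes A - beta*I.
A = np.array([[0.5, 1.0], [0.0, 0.3]])      # used only to simulate measured data
B = np.array([[0.0], [1.0]])
n, m = B.shape
Q, R = np.eye(n), np.eye(m)
beta = 1.5                                  # remaining homotopy shift, beta > lam_max
K = np.zeros((m, n))
rows, rhs = collect_data(A, B, K, np.array([1.0, -1.0]), beta, Q, R)
print(model_free_step(rows, rhs, n, m, R))
```

A complete model-free homotopic design would repeat this evaluation/improvement step while gradually decreasing the remaining shift $\beta$, mirroring the model-based procedure of the previous section.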

Illustrative example

In this section, we use a fourth-order model as an example to illustrate the effectiveness of the proposed model-based result from Lemma 2 and the model-free result from Algorithm 1.

Conclusion

In this paper, we have studied homotopic policy iteration for the stabilizing control design of CT systems with unknown system dynamics. Compared to the existing policy iteration-based works, we have used the homotopy-based strategy to remove the model requirements in seeking an initial stabilizing controller design for CT systems. We have proposed two homotopic policy iteration-based schemes, model-based and model-free, the latter of which has presented a data-driven design for CT systems with completely unknown dynamics.

Acknowledgments

The authors thank the Associate Editor and anonymous reviewers for their feedback that has improved the quality of this work.


References (44)

  • Bian, T., et al. Reinforcement learning and adaptive optimal control for continuous-time nonlinear systems: A value iteration approach. IEEE Transactions on Neural Networks and Learning Systems (2021).

  • Bradtke, S. J., et al. Adaptive linear quadratic control using policy iteration.

  • Broussard, J., et al. Active flutter control using discrete optimal constrained dynamic compensators.

  • Chen, C., et al. Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control (2019).

  • Chen, C., et al. Off-policy reinforcement learning for adaptive optimal output tracking of unknown linear discrete-time systems (2021).

  • Chen, C., et al. Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning (2019).

  • Doya, K. Reinforcement learning in continuous time and space. Neural Computation (2000).

  • Feng, H., et al. Connectivity properties of the set of stabilizing static decentralized controllers. SIAM Journal on Control and Optimization (2020).

  • Feng, H., et al. Escaping locally optimal decentralized control policies via damping.

  • Freeman, R., et al. Robust nonlinear control design: State-space and Lyapunov techniques (2008).

  • Goodwin, G. C., et al. Adaptive filtering prediction and control (2014).

  • Ibrahimi, M., et al. Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems (2012).

    Ci Chen received the B.E. and Ph.D. degrees from the School of Automation, Guangdong University of Technology, Guangzhou, China, in 2011 and 2016, respectively. From 2016 to 2018, he was with The University of Texas at Arlington and The University of Tennessee at Knoxville as a Research Associate. He was awarded the Wallenberg-NTU Presidential Postdoctoral Fellowship. From 2018 to 2021, he was with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, and with the Department of Automatic Control, Lund University, Sweden, as a researcher. He is now a professor with the School of Automation, Guangdong University of Technology, Guangzhou, China. His research interests include reinforcement learning, resilient control, and computational intelligence. He is an Editor for the International Journal of Robust and Nonlinear Control and an Associate Editor for Advanced Control for Applications: Engineering and Industrial Systems.

    Frank L. Lewis is a Member of the National Academy of Inventors; a Fellow of IEEE, IFAC, AAAS, and the U.K. Institute of Measurement & Control; a Professional Engineer in Texas; and a U.K. Chartered Engineer. He is UTA Distinguished Scholar Professor, UTA Distinguished Teaching Professor, and Moncrief-O'Donnell Chair at the University of Texas at Arlington Research Institute.

    He obtained the Bachelor's Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from the Univ. of W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, cooperative control systems, and nonlinear systems. He is the author of 7 U.S. patents, numerous journal special issues and journal papers, and 20 books, including Optimal Control, Aircraft Control, Optimal Estimation, and Robot Manipulator Control, which are used as university textbooks worldwide. He received the Fulbright Research Award, the NSF Research Initiation Grant, the ASEE Terman Award, the Int. Neural Network Soc. Gabor Award, the U.K. Inst. Measurement & Control Honeywell Field Engineering Medal, the IEEE Computational Intelligence Society Neural Networks Pioneer Award, and the AIAA Intelligent Systems Award. He received the Outstanding Service Award from the Dallas IEEE Section, was selected as Engineer of the Year by the Ft. Worth IEEE Section, was listed in the Ft. Worth Business Press Top 200 Leaders in Manufacturing, and received the Texas Regents Outstanding Teaching Award in 2013.

    Bo Li received the Ph.D. degree in computer science and technology from the School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, China, in 2021. From September 2014 to June 2016, he was a master's student in the School of Engineering, Sun Yat-sen University. He is currently a Post-Doctoral Research Fellow with the School of Automation, Guangdong University of Technology, Guangzhou, China. He has published over ten papers in refereed international journals. His current research interests include intelligent transportation systems, traffic information processing, and big data technology. Dr. Li is an active reviewer for several international journals and a program committee member for many international conferences.

    This work was supported in part by the National Natural Science Foundation of China under Grants 61703112, 61973087, and U1911401, in part by the State Key Laboratory of Synthetical Automation for Process Industries, China (2020-KF-21-02), and in part by the Wallenberg-NTU Presidential Postdoctoral Fellowship. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Raul Ordonez under the direction of Editor Miroslav Krstic.
