Brief paper: Homotopic policy iteration-based learning design for unknown linear continuous-time systems☆
Introduction
Finding a stabilizing control without accurate knowledge of the system dynamics has been a long-standing goal in the field of systems and control, and has generated fruitful results on adaptive control (see, e.g., Åström and Wittenmark, 2013, Goodwin and Sin, 2014, Ioannou and Sun, 2012, Krstić et al., 1995, Sastry and Bodson, 2011, and the references therein). In practical scenarios, it is often important to design controllers with user-defined performance, which requires adaptive controllers to guarantee both system stability and performance. To this end, indirect adaptive controllers identify the system parameters, based on which an optimal control is derived within a model-based framework (Ioannou & Fidan, 2006). Extremum-seeking schemes with an analytical footing are documented in Ariyur and Krstić (2003) and Scheinker and Krstić (2017) for improving system performance. Inverse optimality conditions are considered for adaptive controllers in, e.g., Freeman and Kokotović (2008) and Krstić and Tsiotras (1999).
As a method leveraging interactions between the environment and an agent, reinforcement learning (RL), e.g., (Abbasi-Yadkori and Szepesvári, 2011, Bradtke et al., 1994, Chen et al., 2021, Ibrahimi et al., 2012, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998, Vrabie et al., 2013), or approximate/adaptive dynamic programming (ADP), e.g., (Bertsekas, 1995, Jiang, Fan, et al., 2020, Jiang, Kiumarsi, et al., 2020, Lewis, Vrabie, and Vamvoudakis, 2012, Liu et al., 2017, Powell, 2007, Zhang et al., 2012), has recently been utilized to design control policies for both continuous-time (CT) and discrete-time (DT) systems. One structure for implementing RL algorithms is policy iteration, wherein the performance of the current control policy is evaluated, and an improved control policy is then derived from that evaluation (Kleinman, 1968, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998). RL algorithms via policy iteration can adaptively learn a stabilizing controller that minimizes certain performance indices and guarantees system stability.
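For the LQR case, the evaluate-then-improve structure above corresponds to Kleinman's model-based policy iteration: each pass solves a Lyapunov equation for the current gain and then updates the gain. The sketch below is a minimal illustration; the second-order system, weights, and initial stabilizing gain are hypothetical choices, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical second-order system and LQR weights (illustration only).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def kleinman_policy_iteration(A, B, Q, R, K0, n_iter=20):
    """Model-based policy iteration (Kleinman, 1968) for the CT LQR.

    Each iteration evaluates the current gain K by solving the Lyapunov
    equation (A - B K)^T P + P (A - B K) + Q + K^T R K = 0, then improves
    the policy via K <- R^{-1} B^T P.
    """
    K = K0
    for _ in range(n_iter):
        Acl = A - B @ K
        # Policy evaluation: solve_continuous_lyapunov solves a X + X a^T = q,
        # so pass Acl^T and the negated cost term.
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy improvement.
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Initial stabilizing gain, required by the method (found here by inspection).
K0 = np.array([[0.0, 2.0]])
P, K = kleinman_policy_iteration(A, B, Q, R, K0)
```

The iterates converge monotonically to the stabilizing solution of the algebraic Riccati equation, which is exactly why a stabilizing initial gain K0 is needed.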
This paper focuses on the stabilizing control design for CT systems without prior knowledge of the system dynamics. To obviate the requirement on the system dynamics in RL-based results (Doya, 2000, Murray et al., 2001), the Integral Reinforcement Learning (IRL) algorithm was proposed for CT systems without prior knowledge of the drift dynamics (Lewis and Vrabie, 2009, Vrabie and Lewis, 2009, Vrabie et al., 2008, Vrabie et al., 2009). In Lewis and Vrabie (2009) and Vrabie et al. (2009), IRL algorithms are implemented via policy iteration for solving Linear Quadratic Regulator (LQR) problems and also for nonlinear systems. The IRL-based method directly learns the optimal control solution from real-time data along the system trajectories (Lewis, Vrabie, & Syrmos, 2012). Further details of IRL algorithms and their implementation for CT systems can be found in Chen et al., 2020, Chen, Modares, et al., 2019, Chen, Xie, et al., 2019, Jiang and Jiang, 2012, Lewis, Vrabie, and Syrmos, 2012, and Lewis, Vrabie, and Vamvoudakis (2012) and the references therein. Of note, value iteration-based results for the LQR design of CT systems are reported in Bian and Jiang, 2016, Bian and Jiang, 2021, and Vrabie et al. (2013).
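The IRL policy-evaluation idea can be illustrated as follows: the value of a fixed policy u = -Kx is fitted from the integral Bellman equation using only state data and measured running cost, so the drift matrix A enters below only to simulate the trajectories. The system, gain, horizon, and feature parameterization are hypothetical illustration choices, not the paper's algorithm.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical system, weights, and a stabilizing gain (illustration only).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[0.0, 2.0]])   # evaluated policy u = -K x

def quad_features(x):
    # Features of x^T P x for a symmetric 2x2 P: [x1^2, 2*x1*x2, x2^2].
    return np.array([x[0]**2, 2.0 * x[0] * x[1], x[1]**2])

def irl_evaluate(A, B, Q, R, K, T=0.5, n_traj=6, seed=0):
    """IRL-style policy evaluation: fit P in V(x) = x^T P x from the
    integral Bellman equation
      x(t)^T P x(t) - x(t+T)^T P x(t+T) = int_t^{t+T} x^T (Q + K^T R K) x ds,
    using least squares over several short trajectories."""
    Acl = A - B @ K
    M = Q + K.T @ R @ K
    rng = np.random.default_rng(seed)
    Phi, c = [], []
    for _ in range(n_traj):
        x0 = rng.standard_normal(2)
        # Augment the state with the running-cost integral.
        def f(t, z):
            x = z[:2]
            return np.concatenate([Acl @ x, [x @ M @ x]])
        sol = solve_ivp(f, (0.0, T), np.concatenate([x0, [0.0]]),
                        rtol=1e-10, atol=1e-10)
        xT, cost = sol.y[:2, -1], sol.y[2, -1]
        Phi.append(quad_features(x0) - quad_features(xT))
        c.append(cost)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
    p11, p12, p22 = theta
    return np.array([[p11, p12], [p12, p22]])

P = irl_evaluate(A, B, Q, R, K)
```

The recovered P coincides with the Lyapunov-equation solution for the same gain, which is the sense in which IRL replaces the model-based evaluation step with data.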
Note that policy iteration-based designs for CT systems require an initial stabilizing control policy, which, however, largely depends on prior knowledge of the full system dynamics, including the drift dynamics and the input matrix (see Kleinman, 1968, Kleinman, 1970, Vrabie et al., 2009, and the references therein). This results in a model-based initialization stage when implementing policy iteration. One way to approach a stabilizing control is a homotopy-based method (Feng and Lavaei, 2020a, Lamperski, 2020, Mobahi and Fisher, 2015). Homotopy can be used as an initialization strategy, an idea studied in Broussard and Halyo (1983). Recently, Feng and Lavaei (2020b) considered improving locally optimal solutions for CT systems via optimal decentralized control. Based on Feng and Lavaei (2020b), Lamperski (2020) studied the control policy design for DT systems. However, it is not clear how to obtain the optimal CT solution without knowing the system dynamics; in Feng and Lavaei (2020b), a model-based projected gradient descent method was used as the local search algorithm.
This paper aims to design a stabilizing control for linear CT systems without accurate knowledge of the system dynamics. We propose a homotopy-based policy iteration for linear CT systems, wherein a stabilizing control policy is obtained by gradually deforming an artificial stable system into the original control system. Note that existing policy iteration-based results for CT systems usually require a model-based initialization stage, in which the full system dynamics, including the drift dynamics and the system input matrix, may be used for design. In this paper, we utilize a homotopy-based strategy to remove such model requirements and propose homotopic policy iteration by integrating homotopy with the policy iteration technique for CT systems. Using the homotopic policy iteration, we establish model-based and model-free designs for CT systems that place unstable closed-loop poles into a stable region, thereby also providing a means to improve the convergence rate for unknown linear CT systems.
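A simplified sketch of the homotopy idea (not the paper's exact scheme) is as follows: start from the artificial stable system A - cI, which the zero gain trivially stabilizes, and deform it back to A along a path while warm-starting policy iteration at each step with the previous gain. All numerical choices below, including the path and step count, are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def homotopic_pi(A, B, Q, R, c=None, n_steps=20, pi_iters=10):
    """Simplified homotopic policy iteration sketch: move along
    A(lmb) = A - (1 - lmb) * c * I from the artificial stable system
    (lmb = 0) to the true system (lmb = 1), warm-starting Kleinman
    policy iteration at each step with the previous gain."""
    n, m = B.shape
    if c is None:
        # Any c exceeding the largest real part of eig(A) makes A - c*I Hurwitz.
        c = max(np.linalg.eigvals(A).real.max(), 0.0) + 1.0
    K = np.zeros((m, n))   # K = 0 stabilizes A - c*I, so it is a valid start.
    for lmb in np.linspace(0.0, 1.0, n_steps + 1):
        Al = A - (1.0 - lmb) * c * np.eye(n)
        for _ in range(pi_iters):
            Acl = Al - B @ K
            # Policy evaluation (Lyapunov equation), then improvement.
            P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
            K = np.linalg.solve(R, B.T @ P)
    return K

# Hypothetical unstable system (illustration only): eig(A) = {0.5, 0.3}.
A = np.array([[0.5, 1.0], [0.0, 0.3]])
B = np.array([[0.0], [1.0]])
K = homotopic_pi(A, B, np.eye(2), np.eye(1))
```

With a small enough deformation step, the gain from the previous step remains stabilizing for the next intermediate system, which is what removes the need for a model-based initial stabilizing policy for the original A.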
The remainder of the paper is organized as follows. In Section 2, we formulate a stable closed-loop poles-oriented learning design problem for CT systems and review standard policy iteration-based techniques for LQR design. In Section 3, we give a homotopy-based policy iteration design for CT systems to obtain stable closed-loop poles within a model-based framework. In Section 4, we extend the model-based result of Section 3 to propose a model-free version using system data collected along the trajectories of the system states. In Section 5, we present an illustrative example. In Section 6, the main results of this paper are summarized.
Notations: The notation ℂ⁻ denotes the open left-half complex plane. Given a square matrix A, σ(A) denotes its spectrum, and σ_min(A) and σ_max(A) are, respectively, the minimum and maximum singular values. The notation ⊗ indicates the Kronecker product. Given a matrix A, vec(A) = [a₁ᵀ, a₂ᵀ, …, a_mᵀ]ᵀ denotes a vector with a_i representing the ith column of the matrix A. Given a symmetric matrix P with p_ij being an entry, vecs(P) = [p₁₁, 2p₁₂, …, 2p₁ₙ, p₂₂, 2p₂₃, …, pₙₙ]ᵀ. Given a vector x, vecv(x) = [x₁², x₁x₂, …, x₁xₙ, x₂², x₂x₃, …, xₙ²]ᵀ. Let I_n be an identity matrix of size n.
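As a small numerical check of the Kronecker-product notation, the standard identity vec(AXB) = (Bᵀ ⊗ A) vec(X), with vec the column-stacking operator, is what turns Lyapunov equations into ordinary linear systems in policy-evaluation steps. The random matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

# Column-stacking vec operator (Fortran/column-major order).
vec = lambda M: M.flatten(order="F")

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```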
Problem formulation
It is often desirable to find a control law that stabilizes a linear continuous-time system given by

ẋ(t) = A x(t) + B u(t), (1)

where x ∈ ℝⁿ is the system state; u ∈ ℝᵐ is the control input; and A ∈ ℝⁿˣⁿ, B ∈ ℝⁿˣᵐ are constant but unknown matrices. The matrix A is not necessarily Hurwitz stable, i.e., σ(A) ⊄ ℂ⁻.
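The Hurwitz condition on A in (1) can be checked numerically from the spectrum; the matrices below are hypothetical examples, not the paper's.

```python
import numpy as np

def is_hurwitz(A):
    """A is Hurwitz stable iff every eigenvalue has a negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

# Hypothetical matrices (illustration only).
A_stable = np.array([[-1.0, 0.0], [0.0, -2.0]])
A_unstable = np.array([[0.5, 1.0], [0.0, -1.0]])   # eigenvalue 0.5 > 0
print(is_hurwitz(A_stable), is_hurwitz(A_unstable))  # True False
```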
The stabilization of the system in (1) with unknown dynamics can be regarded as an adaptive control problem, which has attracted much attention in the field of systems and control. The need for finding a stabilizing control also exists in RL-based designs.
Model-based homotopic policy iteration design for stable closed-loop poles
In this section, we detail a model-based design for seeking a desired stabilizing control policy. The results presented in this section will pave the way for the model-free design in the next section.
The model-based design in this section assumes that prior knowledge of the system dynamics, A and B, is available, which allows making the following assumption.
Assumption 1 There exists a known constant such that
Note that Assumption 1 is used in this section only, and will be removed in the model-free design of Section 4.
Model-free homotopic policy iteration design for stable closed-loop poles
The previous section gives a model-based homotopic policy iteration, wherein both the control gain matrix in (14) and the iteration length in (15) require prior knowledge of the system dynamics, i.e., the constant in Assumption 1, A, and B. In this section, we aim to remove the model requirements in solving (14), (15) and to present a completely model-free solution.
For the learning design, we add and subtract to the system dynamics equation
Illustrative example
In this section, we use a fourth-order model as an example to illustrate the effectiveness of the proposed model-based result from Lemma 2 and the model-free result from Algorithm 1.
Conclusion
In this paper, we have studied homotopic policy iteration for the stabilizing control design of CT systems with unknown system dynamics. Compared to the existing policy iteration-based works, we have used the homotopy-based strategy to remove the model requirements in seeking an initial stabilizing controller design for CT systems. We have proposed two homotopic policy iteration-based schemes, model-based and model-free, the latter of which has presented a data-driven design for completely unknown linear CT systems.
Acknowledgments
The authors thank the Associate Editor and anonymous reviewers for their feedback that has improved the quality of this work.
References (44)
- Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica (2016).
- Off-policy learning for adaptive optimal output synchronization of heterogeneous multi-agent systems. Automatica (2020).
- Cooperative adaptive optimal output regulation of discrete-time nonlinear multi-agent systems. Automatica (2020).
- Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica (2012).
- Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks (2009).
- Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (2009).
- Abbasi-Yadkori, Y., & Szepesvári, C. (2011). Regret bounds for the adaptive control of linear quadratic systems. In …
- Real-time optimization by extremum-seeking control (2003).
- Adaptive control (2013).
- Dynamic programming and optimal control, vol. 1 (1995).
- Reinforcement learning and adaptive optimal control for continuous-time nonlinear systems: A value iteration approach. IEEE Transactions on Neural Networks and Learning Systems.
- Adaptive linear quadratic control using policy iteration.
- Active flutter control using discrete optimal constrained dynamic compensators.
- Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control.
- Off-policy reinforcement learning for adaptive optimal output tracking of unknown linear discrete-time systems.
- Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning.
- Reinforcement learning in continuous time and space. Neural Computation.
- Connectivity properties of the set of stabilizing static decentralized controllers. SIAM Journal on Control and Optimization.
- Escaping locally optimal decentralized control policies via damping.
- Robust nonlinear control design: State-space and Lyapunov techniques.
- Adaptive filtering prediction and control.
- Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems.
Ci Chen received the B.E. and Ph.D. degrees from School of Automation, Guangdong University of Technology, Guangzhou, China, in 2011 and 2016, respectively. From 2016 to 2018, he has been with The University of Texas at Arlington and The University of Tennessee at Knoxville as a Research Associate. He was awarded the Wallenberg-NTU Presidential Postdoctoral Fellowship. From 2018 to 2021, he was with School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore and with Department of Automatic Control, Lund University, Sweden as a researcher. He is now a professor with School of Automation, Guangdong University of Technology, Guangzhou, China. His research interests include reinforcement learning, resilient control, and computational intelligence. He is an Editor for International Journal of Robust and Nonlinear Control and an Associate Editor for Advanced Control for Applications: Engineering and Industrial Systems.
Frank L. Lewis is Member, National Academy of Inventors. Fellow IEEE, Fellow IFAC, Fellow AAAS, Fellow U.K. Institute of Measurement & Control, PE Texas, U.K. Chartered Engineer. UTA Distinguished Scholar Professor, UTA Distinguished Teaching Professor, and Moncrief-O’Donnell Chair at the University of Texas at Arlington Research Institute.
He obtained the Bachelor’s Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, cooperative control systems, and nonlinear systems. He is author of 7 U.S. patents, numerous journal special issues, journal papers, and 20 books, including Optimal Control, Aircraft Control, Optimal Estimation, and Robot Manipulator Control which are used as university textbooks world-wide. He received the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award, U.K. Inst Measurement & Control Honeywell Field Engineering Medal, IEEE Computational Intelligence Society Neural Networks Pioneer Award, AIAA Intelligent Systems Award. Received Outstanding Service Award from Dallas IEEE Section, selected as Engineer of the year by Ft. Worth IEEE Section. Was listed in Ft. Worth Business Press Top 200 Leaders in Manufacturing. Texas Regents Outstanding Teaching Award 2013.
Bo Li received the Ph.D. degree in computer science and technology from the School of Intelligent Systems Engineering in Sun Yat-sen University, Guangzhou, China, in 2021. From September 2014 to June 2016, he was a master student in the school of engineering, Sun Yat-sen University. He is currently a Post-Doctoral Research Fellow with the School of Automation, Guangdong University of Technology, Guangzhou, China. He has published over ten papers in refereed international journals. His current research interests include intelligent transportation system, traffic information processing as well as big data technology. Dr. Li is currently an active reviewer for some international journals and a member of program committee for many international conferences.
☆ This work was supported in part by the National Natural Science Foundation of China under Grants 61703112, 61973087, and U1911401, in part by the State Key Laboratory of Synthetical Automation for Process Industries, China (2020-KF-21-02), and in part by the Wallenberg-NTU Presidential Postdoctoral Fellowship, China. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Raul Ordonez under the direction of Editor Miroslav Krstic.