Brief paper: Homotopic policy iteration-based learning design for unknown linear continuous-time systems☆
Introduction
Finding a stabilizing control without accurate knowledge of the system dynamics has been a long-standing goal in the field of systems and control, and has generated fruitful results on adaptive control (see, e.g., Åström and Wittenmark, 2013, Goodwin and Sin, 2014, Ioannou and Sun, 2012, Krstić et al., 1995, Sastry and Bodson, 2011, and the references therein). In practical scenarios, it is often important to design controllers with user-defined performance, which requires adaptive controllers to guarantee both system stability and performance. To this end, indirect adaptive controllers identify the system parameters, based on which an optimal control is derived within a model-based framework (Ioannou & Fidan, 2006). Extremum-seeking schemes with an analytical footing are documented in Ariyur and Krstić (2003) and Scheinker and Krstić (2017) for improving system performance. Inverse optimality conditions are considered for adaptive controllers in, e.g., Freeman and Kokotović (2008) and Krstić and Tsiotras (1999).
As a method leveraging interactions between the environment and an agent, reinforcement learning (RL), e.g., (Abbasi-Yadkori and Szepesvári, 2011, Bradtke et al., 1994, Chen et al., 2021, Ibrahimi et al., 2012, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998, Vrabie et al., 2013), or approximate/adaptive dynamic programming (ADP), e.g., (Bertsekas, 1995, Jiang, Fan, et al., 2020, Jiang, Kiumarsi, et al., 2020, Lewis, Vrabie, and Vamvoudakis, 2012, Liu et al., 2017, Powell, 2007, Zhang et al., 2012), has recently been utilized to design control policies for both continuous-time (CT) and discrete-time (DT) systems. One structure for implementing RL algorithms is policy iteration, wherein the performance of the current control policy is evaluated, and an improved control policy is then derived from that evaluation (Kleinman, 1968, Lewis, Vrabie, and Syrmos, 2012, Sutton and Barto, 1998). RL algorithms via policy iteration can adaptively learn a stabilizing controller that minimizes certain performance indices and guarantees system stability.
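For the LQR case, the evaluate-then-improve structure above corresponds to Kleinman's model-based policy iteration: each pass solves a Lyapunov equation for the current gain and then updates the gain. The sketch below is a minimal illustration; the second-order system, weights, and initial stabilizing gain are hypothetical choices, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical second-order system and LQR weights (illustration only).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def kleinman_policy_iteration(A, B, Q, R, K0, n_iter=20):
    """Model-based policy iteration (Kleinman, 1968) for the CT LQR.

    Each iteration evaluates the current gain K by solving the Lyapunov
    equation (A - B K)^T P + P (A - B K) + Q + K^T R K = 0, then improves
    the policy via K <- R^{-1} B^T P.
    """
    K = K0
    for _ in range(n_iter):
        Acl = A - B @ K
        # Policy evaluation: solve_continuous_lyapunov solves a X + X a^T = q,
        # so pass Acl^T and the negated cost term.
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy improvement.
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Initial stabilizing gain, required by the method (found here by inspection).
K0 = np.array([[0.0, 2.0]])
P, K = kleinman_policy_iteration(A, B, Q, R, K0)
```

The iterates converge monotonically to the stabilizing solution of the algebraic Riccati equation, which is exactly why a stabilizing initial gain K0 is needed.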
This paper focuses on the stabilizing control design for CT systems without prior knowledge of the system dynamics. To obviate the requirement on the system dynamics in RL-based results (Doya, 2000, Murray et al., 2001), the Integral Reinforcement Learning (IRL) algorithm was proposed for CT systems without prior knowledge of the drift dynamics (Lewis and Vrabie, 2009, Vrabie and Lewis, 2009, Vrabie et al., 2008, Vrabie et al., 2009). In Lewis and Vrabie (2009) and Vrabie et al. (2009), IRL algorithms are implemented via policy iteration for solving Linear Quadratic Regulator (LQR) problems and also for nonlinear systems. The IRL-based method directly learns the optimal control solution from real-time data along the system trajectories (Lewis, Vrabie, & Syrmos, 2012). Further details of IRL algorithms and their implementation for CT systems can be found in Chen et al., 2020, Chen, Modares, et al., 2019, Chen, Xie, et al., 2019, Jiang and Jiang, 2012, Lewis, Vrabie, and Syrmos, 2012, and Lewis, Vrabie, and Vamvoudakis (2012) and the references therein. Of note, value iteration-based results for the LQR design of CT systems are reported in Bian and Jiang, 2016, Bian and Jiang, 2021, and Vrabie et al. (2013).
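The IRL policy-evaluation idea can be illustrated as follows: the value of a fixed policy u = -Kx is fitted from the integral Bellman equation using only state data and measured running cost, so the drift matrix A enters below only to simulate the trajectories. The system, gain, horizon, and feature parameterization are hypothetical illustration choices, not the paper's algorithm.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical system, weights, and a stabilizing gain (illustration only).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[0.0, 2.0]])   # evaluated policy u = -K x

def quad_features(x):
    # Features of x^T P x for a symmetric 2x2 P: [x1^2, 2*x1*x2, x2^2].
    return np.array([x[0]**2, 2.0 * x[0] * x[1], x[1]**2])

def irl_evaluate(A, B, Q, R, K, T=0.5, n_traj=6, seed=0):
    """IRL-style policy evaluation: fit P in V(x) = x^T P x from the
    integral Bellman equation
      x(t)^T P x(t) - x(t+T)^T P x(t+T) = int_t^{t+T} x^T (Q + K^T R K) x ds,
    using least squares over several short trajectories."""
    Acl = A - B @ K
    M = Q + K.T @ R @ K
    rng = np.random.default_rng(seed)
    Phi, c = [], []
    for _ in range(n_traj):
        x0 = rng.standard_normal(2)
        # Augment the state with the running-cost integral.
        def f(t, z):
            x = z[:2]
            return np.concatenate([Acl @ x, [x @ M @ x]])
        sol = solve_ivp(f, (0.0, T), np.concatenate([x0, [0.0]]),
                        rtol=1e-10, atol=1e-10)
        xT, cost = sol.y[:2, -1], sol.y[2, -1]
        Phi.append(quad_features(x0) - quad_features(xT))
        c.append(cost)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
    p11, p12, p22 = theta
    return np.array([[p11, p12], [p12, p22]])

P = irl_evaluate(A, B, Q, R, K)
```

The recovered P coincides with the Lyapunov-equation solution for the same gain, which is the sense in which IRL replaces the model-based evaluation step with data.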
Note that policy iteration-based designs for CT systems require an initial stabilizing control policy, which, however, largely depends on prior knowledge of the full system dynamics, including the drift dynamics and the input matrix (see Kleinman, 1968, Kleinman, 1970, Vrabie et al., 2009, and the references therein). This results in a model-based initialization stage when implementing policy iteration. One way to approach a stabilizing control is a homotopy-based method (Feng and Lavaei, 2020a, Lamperski, 2020, Mobahi and Fisher, 2015). Homotopy can be used as an initialization strategy, an idea studied in Broussard and Halyo (1983). Recently, Feng and Lavaei (2020b) considered improving locally optimal solutions for CT systems via optimal decentralized control. Based on Feng and Lavaei (2020b), Lamperski (2020) studied the control policy design for DT systems. However, it is not clear how to obtain the optimal CT solution without knowing the system dynamics; in Feng and Lavaei (2020b), a model-based projected gradient descent method was used as the local search algorithm.
This paper aims to design a stabilizing control for linear CT systems without accurate knowledge of the system dynamics. We propose a homotopy-based policy iteration for linear CT systems, wherein a stabilizing control policy is obtained by gradually deforming an artificial stable system into the original control system. Note that existing policy iteration-based results for CT systems usually require a model-based initialization stage, in which the full system dynamics, including the drift dynamics and the system input matrix, may be used for design. In this paper, we utilize a homotopy-based strategy to remove such model requirements and propose homotopic policy iteration by integrating homotopy with the policy iteration technique for CT systems. Using the homotopic policy iteration, we establish model-based and model-free designs for CT systems that place unstable closed-loop poles into a stable region, thereby also providing a means to improve the convergence rate for unknown linear CT systems.
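A simplified sketch of the homotopy idea (not the paper's exact scheme) is as follows: start from the artificial stable system A - cI, which the zero gain trivially stabilizes, and deform it back to A along a path while warm-starting policy iteration at each step with the previous gain. All numerical choices below, including the path and step count, are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def homotopic_pi(A, B, Q, R, c=None, n_steps=20, pi_iters=10):
    """Simplified homotopic policy iteration sketch: move along
    A(lmb) = A - (1 - lmb) * c * I from the artificial stable system
    (lmb = 0) to the true system (lmb = 1), warm-starting Kleinman
    policy iteration at each step with the previous gain."""
    n, m = B.shape
    if c is None:
        # Any c exceeding the largest real part of eig(A) makes A - c*I Hurwitz.
        c = max(np.linalg.eigvals(A).real.max(), 0.0) + 1.0
    K = np.zeros((m, n))   # K = 0 stabilizes A - c*I, so it is a valid start.
    for lmb in np.linspace(0.0, 1.0, n_steps + 1):
        Al = A - (1.0 - lmb) * c * np.eye(n)
        for _ in range(pi_iters):
            Acl = Al - B @ K
            # Policy evaluation (Lyapunov equation), then improvement.
            P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
            K = np.linalg.solve(R, B.T @ P)
    return K

# Hypothetical unstable system (illustration only): eig(A) = {0.5, 0.3}.
A = np.array([[0.5, 1.0], [0.0, 0.3]])
B = np.array([[0.0], [1.0]])
K = homotopic_pi(A, B, np.eye(2), np.eye(1))
```

With a small enough deformation step, the gain from the previous step remains stabilizing for the next intermediate system, which is what removes the need for a model-based initial stabilizing policy for the original A.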
The remainder of the paper is organized as follows. In Section 2, we formulate a stable closed-loop poles-oriented learning design problem for CT systems and review standard policy iteration-based techniques for LQR design. In Section 3, we give a homotopy-based policy iteration design for CT systems to obtain stable closed-loop poles within a model-based framework. In Section 4, we extend the model-based result of Section 3 to propose a model-free version using system data collected along the trajectories of the system states. In Section 5, we present an illustrative example. In Section 6, the main results of this paper are summarized.
Notations: The notation ℂ⁻ denotes the open left-half complex plane. Given a square matrix A, σ(A) denotes its spectrum, and σ_min(A) and σ_max(A) are, respectively, the minimum and maximum singular values. The notation ⊗ indicates the Kronecker product. Given a matrix A, vec(A) = [a₁ᵀ, a₂ᵀ, …, a_mᵀ]ᵀ denotes a vector with a_i representing the ith column of the matrix A. Given a symmetric matrix P with p_ij being an entry, vecs(P) = [p₁₁, 2p₁₂, …, 2p₁ₙ, p₂₂, 2p₂₃, …, pₙₙ]ᵀ. Given a vector x, vecv(x) = [x₁², x₁x₂, …, x₁xₙ, x₂², x₂x₃, …, xₙ²]ᵀ. Let I_n be an identity matrix of size n.
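As a small numerical check of the Kronecker-product notation, the standard identity vec(AXB) = (Bᵀ ⊗ A) vec(X), with vec the column-stacking operator, is what turns Lyapunov equations into ordinary linear systems in policy-evaluation steps. The random matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

# Column-stacking vec operator (Fortran/column-major order).
vec = lambda M: M.flatten(order="F")

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```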
Problem formulation
It is often desirable to find a control law that stabilizes a linear continuous-time system given by

ẋ(t) = A x(t) + B u(t), (1)

where x ∈ ℝⁿ is the system state; u ∈ ℝᵐ is the control input; and A ∈ ℝⁿˣⁿ, B ∈ ℝⁿˣᵐ are constant but unknown matrices. The matrix A is not necessarily Hurwitz stable, i.e., σ(A) ⊄ ℂ⁻.
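The Hurwitz condition on A in (1) can be checked numerically from the spectrum; the matrices below are hypothetical examples, not the paper's.

```python
import numpy as np

def is_hurwitz(A):
    """A is Hurwitz stable iff every eigenvalue has a negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

# Hypothetical matrices (illustration only).
A_stable = np.array([[-1.0, 0.0], [0.0, -2.0]])
A_unstable = np.array([[0.5, 1.0], [0.0, -1.0]])   # eigenvalue 0.5 > 0
print(is_hurwitz(A_stable), is_hurwitz(A_unstable))  # True False
```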
The stabilization of the system in (1) with unknown dynamics can be regarded as an adaptive control problem, which has attracted much attention in the field of systems and control. The need for finding a stabilizing control also exists in RL-based designs.
Model-based homotopic policy iteration design for stable closed-loop poles
In this section, we detail a model-based design for seeking a desired stabilizing control policy. The results presented in this section will pave the way for the model-free design in the next section.
The model-based design in this section assumes that prior knowledge of the system dynamics, A and B, is available, which allows making the following assumption.
Assumption 1 There exists a known constant such that
Note that Assumption 1 is used in this section only, and will be removed in the model-free design of Section 4.
Model-free homotopic policy iteration design for stable closed-loop poles
The previous section gives a model-based homotopic policy iteration, wherein both the control gain matrix in (14) and the iteration length in (15) require prior knowledge of the system dynamics, i.e., the constant in Assumption 1, A, and B. In this section, we aim to remove the model requirements in solving (14), (15) and to present a completely model-free solution.
For the learning design, we add and subtract to the system dynamics equation
Illustrative example
In this section, we use a fourth-order model as an example to illustrate the effectiveness of the proposed model-based result from Lemma 2 and the model-free result from Algorithm 1.
Conclusion
In this paper, we have studied homotopic policy iteration for the stabilizing control design of CT systems with unknown system dynamics. Compared to the existing policy iteration-based works, we have used the homotopy-based strategy to remove the model requirements in seeking an initial stabilizing controller design for CT systems. We have proposed two homotopic policy iteration-based schemes, model-based and model-free, the latter of which has presented a data-driven design for completely unknown linear CT systems.
Acknowledgments
The authors thank the Associate Editor and anonymous reviewers for their feedback that has improved the quality of this work.
References (44)
- Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica (2016).
- Off-policy learning for adaptive optimal output synchronization of heterogeneous multi-agent systems. Automatica (2020).
- Cooperative adaptive optimal output regulation of discrete-time nonlinear multi-agent systems. Automatica (2020).
- Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica (2012).
- Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks (2009).
- Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (2009).
- Abbasi-Yadkori, Y., & Szepesvári, C. (2011). Regret bounds for the adaptive control of linear quadratic systems. In …
- Real-time optimization by extremum-seeking control (2003).
- Adaptive control (2013).
- Dynamic programming and optimal control, vol. 1 (1995).
- Reinforcement learning and adaptive optimal control for continuous-time nonlinear systems: A value iteration approach. IEEE Transactions on Neural Networks and Learning Systems.
- Adaptive linear quadratic control using policy iteration.
- Active flutter control using discrete optimal constrained dynamic compensators.
- Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control.
- Off-policy reinforcement learning for adaptive optimal output tracking of unknown linear discrete-time systems.
- Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning.
- Reinforcement learning in continuous time and space. Neural Computation.
- Connectivity properties of the set of stabilizing static decentralized controllers. SIAM Journal on Control and Optimization.
- Escaping locally optimal decentralized control policies via damping.
- Robust nonlinear control design: State-space and Lyapunov techniques.
- Adaptive filtering prediction and control.
- Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems.
Ci Chen received the B.E. and Ph.D. degrees from School of Automation, Guangdong University of Technology, Guangzhou, China, in 2011 and 2016, respectively. From 2016 to 2018, he has been with The University of Texas at Arlington and The University of Tennessee at Knoxville as a Research Associate. He was awarded the Wallenberg-NTU Presidential Postdoctoral Fellowship. From 2018 to 2021, he was with School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore and with Department of Automatic Control, Lund University, Sweden as a researcher. He is now a professor with School of Automation, Guangdong University of Technology, Guangzhou, China. His research interests include reinforcement learning, resilient control, and computational intelligence. He is an Editor for International Journal of Robust and Nonlinear Control and an Associate Editor for Advanced Control for Applications: Engineering and Industrial Systems.
Frank L. Lewis is Member, National Academy of Inventors. Fellow IEEE, Fellow IFAC, Fellow AAAS, Fellow U.K. Institute of Measurement & Control, PE Texas, U.K. Chartered Engineer. UTA Distinguished Scholar Professor, UTA Distinguished Teaching Professor, and Moncrief-O’Donnell Chair at the University of Texas at Arlington Research Institute.
He obtained the Bachelor’s Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, cooperative control systems, and nonlinear systems. He is author of 7 U.S. patents, numerous journal special issues, journal papers, and 20 books, including Optimal Control, Aircraft Control, Optimal Estimation, and Robot Manipulator Control which are used as university textbooks world-wide. He received the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award, U.K. Inst Measurement & Control Honeywell Field Engineering Medal, IEEE Computational Intelligence Society Neural Networks Pioneer Award, AIAA Intelligent Systems Award. Received Outstanding Service Award from Dallas IEEE Section, selected as Engineer of the year by Ft. Worth IEEE Section. Was listed in Ft. Worth Business Press Top 200 Leaders in Manufacturing. Texas Regents Outstanding Teaching Award 2013.
Bo Li received the Ph.D. degree in computer science and technology from the School of Intelligent Systems Engineering in Sun Yat-sen University, Guangzhou, China, in 2021. From September 2014 to June 2016, he was a master student in the school of engineering, Sun Yat-sen University. He is currently a Post-Doctoral Research Fellow with the School of Automation, Guangdong University of Technology, Guangzhou, China. He has published over ten papers in refereed international journals. His current research interests include intelligent transportation system, traffic information processing as well as big data technology. Dr. Li is currently an active reviewer for some international journals and a member of program committee for many international conferences.
☆ This work was supported in part by the National Natural Science Foundation of China under Grants 61703112, 61973087, and U1911401, in part by the State Key Laboratory of Synthetical Automation for Process Industries, China (2020-KF-21-02), and in part by the Wallenberg-NTU Presidential Postdoctoral Fellowship, China. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Raul Ordonez under the direction of Editor Miroslav Krstic.