Abstract
In this paper, a potential-based policy iteration method is proposed for the optimal control of a stochastic dynamic system with an average-cost criterion and a parameterized control law. In this method, the potential function and the optimal control parameters are obtained via a least-squares approach. The potential-estimation algorithm is derived from temporal difference learning and can be viewed as a continuous version of the least-squares policy evaluation algorithm. The policy iteration algorithm is validated in simulation on a linear-quadratic-Gaussian (LQG) problem.
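The least-squares, temporal-difference-based potential estimation mentioned in the abstract can be illustrated on a toy finite Markov chain. The chain `P`, the costs `c`, and the tabular basis below are invented for illustration only and are not the paper's parameterized stochastic system: the sketch estimates the average cost from a single sample path and then solves the empirical Poisson equation for the potential (relative value) function by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state Markov chain under a fixed policy (illustrative only).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # transition probabilities
c = np.array([1.0, 3.0])     # per-stage costs

# Simulate a single sample path.
T = 100_000
states = np.empty(T + 1, dtype=int)
x = 0
states[0] = x
for t in range(T):
    x = rng.choice(2, p=P[x])
    states[t + 1] = x

costs = c[states[:T]]
eta_hat = costs.mean()        # sample-path estimate of the average cost

# Least-squares (LSTD-style) solve of the empirical Poisson equation
#   sum_t phi(x_t) (phi(x_t) - phi(x_{t+1}))^T w = sum_t phi(x_t) (c_t - eta_hat)
phi = np.eye(2)               # tabular basis functions
A = np.zeros((2, 2))
b = np.zeros(2)
for t in range(T):
    s, s_next = states[t], states[t + 1]
    A += np.outer(phi[s], phi[s] - phi[s_next])
    b += phi[s] * (costs[t] - eta_hat)

# The potential is defined only up to an additive constant; pin g(0) = 0
# and solve for the remaining component by least squares.
g_hat = np.zeros(2)
g_hat[1:], *_ = np.linalg.lstsq(A[:, 1:], b, rcond=None)

print("eta_hat:", eta_hat, "g_hat:", g_hat)
```

For this chain the stationary distribution is (0.8, 0.2), so the true average cost is 1.4 and, with g(0) = 0, the true potential of state 1 is 4; the estimates converge to these values as the path length grows. In the paper the same least-squares idea is applied with parameterized (rather than tabular) function approximation.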
Acknowledgments
The authors would like to thank the editors and the anonymous reviewers for their constructive comments, which improved the manuscript. This work was supported by the National Natural Science Foundation of China under Grant Nos. 60874030 and 61374006, and by the Major Program of the National Natural Science Foundation of China under Grant No. 11190015.
Additional information
Communicated by Qianchuan Zhao.
Cite this article
Cheng, K., Zhang, K., Fei, S. et al. Potential-Based Least-Squares Policy Iteration for a Parameterized Feedback Control System. J Optim Theory Appl 169, 692–704 (2016). https://doi.org/10.1007/s10957-015-0809-6
Keywords
- Stochastic system
- Markov decision processes
- Performance potential
- Least-squares policy evaluation
- Policy iteration