Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems

Cheng, Kang; Fei, Shumin; Zhang, Kanjian; Liu, Xiaomei; Wei, Haikun

doi:10.1007/s10957-013-0418-1

Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems

Published: 26 September 2013

Volume 163, pages 165–180, (2014)
Cite this article

Journal of Optimization Theory and Applications Aims and scope Submit manuscript

Kang Cheng¹,
Shumin Fei¹,
Kanjian Zhang¹,
Xiaomei Liu¹ &
…
Haikun Wei¹

312 Accesses
2 Citations
Explore all metrics

Abstract

In this paper, a unified policy iteration approach is presented for the optimal control problem of stochastic system with discounted average cost and continuous state space. The approach consists of temporal difference learning-based potential function approximation algorithms and performance difference formula-based policy improvement. The approximation algorithms are derived by solving the Poisson equation-based fixed-point equation, which can be viewed as continuous versions of least squares policy evaluation algorithm and least squares temporal difference algorithm. The simulations are provided to illustrate the effectiveness of the approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An adaptive dynamic programming-based algorithm for infinite-horizon linear quadratic stochastic optimal control problems

Article 05 April 2023

Two Approaches to Stochastic Optimal Control Problems with a Final-Time Expectation Constraint

Article 10 September 2016

Potential-Based Least-Squares Policy Iteration for a Parameterized Feedback Control System

Article 29 December 2015

References

Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)
MATH Google Scholar
Bertsekas, D.P.: Dynamic Programming and Stochastic Control, vols. I and II, 3rd edn. Athena Scientific, Belmont (2007)
Google Scholar
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, New York (2007)
Book Google Scholar
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
MATH Google Scholar
Bertsekas, D.P.: Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9, 310–335 (2011)
Article MathSciNet MATH Google Scholar
Powell, W.B., Ma, J.: A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. J. Control Theory Appl. 9, 336–352 (2011)
Article MathSciNet MATH Google Scholar
Cao, X.R., Chen, H.F.: Potentials, perturbation realization, and sensitivity analysis of Markov processes. IEEE Trans. Autom. Control 42, 1382–1393 (1997)
Article MathSciNet MATH Google Scholar
Cao, X.R.: A unified approach to Markov decision problems and performance sensitivity analysis. Automatica 36, 771–774 (2000)
Article MATH Google Scholar
Cao, X.R., Guo, X.P.: A unified approach to Markov decision problems and performance sensitivity analysis with discounted and average criteria: multichain cases. Automatica 40, 1749–1759 (2004)
Article MathSciNet MATH Google Scholar
Zhang, K.J., Xu, Y.K., Chen, X., Cao, X.R.: Policy iteration based feedback control. Automatica 44, 1055–1061 (2008)
Article MathSciNet MATH Google Scholar
Cao, X.R.: Single sample path-based optimization of Markov chains. J. Optim. Theory Appl. 100, 527–548 (1999)
Article MathSciNet MATH Google Scholar
Fang, H.T., Cao, X.R.: Potential-based on-line policy iteration algorithms for Markov decision processes. IEEE Trans. Autom. Control 49, 493–505 (2004)
Article MathSciNet Google Scholar
Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988)
Google Scholar
Bradtke, S., Barto, A.: Linear least-squares algorithms for temporal difference learning. Mach. Learn. 22, 33–57 (1996)
MATH Google Scholar
Nedic, A., Bertsekas, D.P.: Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Syst. 13, 79–110 (2003)
Article MathSciNet MATH Google Scholar
Lagoudakis, M., Parr, R.: Least-squares policy iteration. J. Mach. Learn. Res. 4, 1107–1149 (2003)
MathSciNet Google Scholar
Bertsekas, D.P., Yu, H.Z.: Projected equation methods for approximate solution of large linear systems. J. Comput. Appl. Math. 227, 27–50 (2009)
Article MathSciNet MATH Google Scholar
Tsitsiklis, J.N., Roy, B.V.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Autom. Control 42, 674–690 (1997)
Article MATH Google Scholar
Ma, J., Powell, W.B.: A convergent recursive least squares approximate policy iteration algorithm for multi-dimensional Markov decision process with continuous state and action spaces. In: ADPRL 2009—Proceedings, Nashville, USA, pp. 66–73 (2009)
Google Scholar
Chan, K.S., Tong, H.: On the use of the deterministic Lyapunov function for the ergodicity of stochastic difference equations. Adv. Appl. Probab. 17, 666–678 (1985)
Article MathSciNet MATH Google Scholar
Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, London (1993)
Book MATH Google Scholar
Yu, H.Z., Bertsekas, D.P.: Convergence results for some temporal difference methods based on least squares. IEEE Trans. Autom. Control 54, 1515–1531 (2009)
Article MathSciNet Google Scholar
Cao, X.R.: Stochastic control via direct comparison. Discrete Event Dyn. Syst. 21, 11–38 (2011)
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their constructive comments that improved the manuscript. The work was supported by the National Natural Science Foundation of China under Grants Nos. 60874030 and 61374006, and the Major Program of National Natural Science Foundation of China under Grant No. 11190015.

Author information

Authors and Affiliations

School of Automation, Southeast University, Nanjing, Jiangsu, 210096, China
Kang Cheng, Shumin Fei, Kanjian Zhang, Xiaomei Liu & Haikun Wei

Authors

Kang Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Shumin Fei
View author publications
You can also search for this author in PubMed Google Scholar
Kanjian Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaomei Liu
View author publications
You can also search for this author in PubMed Google Scholar
Haikun Wei
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kanjian Zhang.

Additional information

Communicated by Qianchuan Zhao.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Cheng, K., Fei, S., Zhang, K. et al. Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems. J Optim Theory Appl 163, 165–180 (2014). https://doi.org/10.1007/s10957-013-0418-1

Download citation

Received: 02 August 2012
Accepted: 27 August 2013
Published: 26 September 2013
Issue Date: October 2014
DOI: https://doi.org/10.1007/s10957-013-0418-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems

Abstract

Access this article

Similar content being viewed by others

An adaptive dynamic programming-based algorithm for infinite-horizon linear quadratic stochastic optimal control problems

Two Approaches to Stochastic Optimal Control Problems with a Final-Time Expectation Constraint

Potential-Based Least-Squares Policy Iteration for a Parameterized Feedback Control System

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems

Abstract

Access this article

Similar content being viewed by others

An adaptive dynamic programming-based algorithm for infinite-horizon linear quadratic stochastic optimal control problems

Two Approaches to Stochastic Optimal Control Problems with a Final-Time Expectation Constraint

Potential-Based Least-Squares Policy Iteration for a Parameterized Feedback Control System

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation