
Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems

Journal of Optimization Theory and Applications

Abstract

In this paper, a unified policy iteration approach is presented for the optimal control problem of stochastic systems with discounted average cost and continuous state space. The approach consists of temporal difference learning-based potential function approximation algorithms and performance difference formula-based policy improvement. The approximation algorithms are derived by solving the Poisson equation-based fixed-point equation and can be viewed as continuous-state versions of the least squares policy evaluation algorithm and the least squares temporal difference algorithm. Simulations are provided to illustrate the effectiveness of the approach.
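The potential function approximation step can be illustrated with a simple least squares temporal difference (LSTD) computation. The following is a minimal sketch, not the paper's algorithm: it assumes a linear feature map phi, an average-cost Poisson equation of the form g = c - eta·1 + P g, and estimates eta by the sample mean of costs along the trajectory; all function and variable names are illustrative.

```python
import numpy as np

def lstd_potential(states, costs, features, avg_cost=None):
    """Sketch of an LSTD-style fit of a potential function g(x) ~ phi(x)^T w
    from one sample path, based on the average-cost Poisson equation
    g = c - eta*1 + P g. All names here are illustrative assumptions."""
    eta = np.mean(costs) if avg_cost is None else avg_cost
    k = len(features(states[0]))
    A = np.zeros((k, k))
    b = np.zeros(k)
    for t in range(len(states) - 1):
        phi_t = features(states[t])
        phi_next = features(states[t + 1])
        # Accumulate the LSTD normal equations A w = b from one-step
        # temporal differences of the Poisson equation.
        A += np.outer(phi_t, phi_t - phi_next)
        b += phi_t * (costs[t] - eta)
    # Solve in the least-squares sense; A can be ill-conditioned on short paths.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w, eta

# Hypothetical example: a scalar stochastic system x_{t+1} = 0.8*x_t + noise
# under a fixed policy, quadratic stage cost, and polynomial features.
rng = np.random.default_rng(0)
xs = [0.0]
for _ in range(5000):
    xs.append(0.8 * xs[-1] + 0.1 * rng.standard_normal())
cs = [x * x for x in xs[:-1]]
phi = lambda x: np.array([1.0, x, x * x])
w, eta = lstd_potential(xs, cs, phi)
```

The resulting weights w define an approximate potential phi(x)^T w, which would then feed the performance difference formula-based policy improvement step.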



Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their constructive comments that improved the manuscript. This work was supported by the National Natural Science Foundation of China under Grant Nos. 60874030 and 61374006, and by the Major Program of the National Natural Science Foundation of China under Grant No. 11190015.

Author information

Corresponding author

Correspondence to Kanjian Zhang.

Additional information

Communicated by Qianchuan Zhao.


About this article

Cite this article

Cheng, K., Fei, S., Zhang, K. et al. Temporal Difference-Based Policy Iteration for Optimal Control of Stochastic Systems. J Optim Theory Appl 163, 165–180 (2014). https://doi.org/10.1007/s10957-013-0418-1

