
From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning

Abstract

Perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) share a common goal: to make decisions that improve system performance based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. The performance potential plays an important role in both PA and MDPs; it also offers a clear, intuitive interpretation of many results. Reinforcement learning, TD(λ), neuro-dynamic programming, and related methods are efficient ways of estimating the performance potentials and related quantities from sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight into the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs, such as systems with correlated actions. Potential-based on-line approaches and their advantages are also discussed.
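
The quantities behind these statements can be made concrete. For an ergodic Markov chain with transition matrix P, reward function f, stationary distribution π, and average reward η = πf, the performance potential g solves the Poisson equation (up to an additive constant) and links PA, MDPs, and RL through the performance difference and derivative formulas. The notation below is standard in this literature but is supplied here for illustration, not quoted from this page:

    (I - P) g + η e = f                          (Poisson equation, e the all-ones vector)
    η' - η = π' [ (f' - f) + (P' - P) g ]        (performance difference formula)
    dη/dθ = π [ (dP/dθ) g + df/dθ ]              (performance derivative formula)

Policy iteration follows from the difference formula (choose, state by state, the action that improves the bracketed term), while PA-style gradient estimation follows from the derivative formula; both need only g, which RL methods such as TD(λ) can estimate from a single sample path. Below is a minimal sketch of one such estimator, assuming an ergodic finite-state chain; the function name and step sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def estimate_potentials_td0(states, rewards, n_states, alpha=0.05, kappa=0.01):
    """TD(0)-style estimation of average-reward performance potentials g
    from one sample path (a hedged sketch; step sizes are illustrative)."""
    g = np.zeros(n_states)   # potential estimates, defined up to an additive constant
    eta = 0.0                # running estimate of the average reward
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        eta += kappa * (rewards[t] - eta)                 # track the average reward
        td_error = rewards[t] - eta + g[s_next] - g[s]    # average-reward TD error
        g[s] += alpha * td_error                          # update the visited state's potential
    return g - g[0], eta     # pin g[0] = 0 to fix the additive constant

# Toy usage on a hypothetical 2-state chain:
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([1.0, 5.0])
rng = np.random.default_rng(0)
path = [0]
for _ in range(200_000):
    path.append(rng.choice(2, p=P[path[-1]]))
g_hat, eta_hat = estimate_potentials_td0(path, f[np.array(path[:-1])], 2)
```

The estimated potentials can then be plugged into the difference formula to compare candidate actions (policy iteration) or into the derivative formula to drive gradient-based optimization of a parameterized policy.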


Cite this article

Cao, XR. From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning. Discrete Event Dynamic Systems 13, 9–39 (2003). https://doi.org/10.1023/A:1022188803039
