
From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning

Abstract

Perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) share a common goal: to make decisions that improve system performance based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. The performance potential plays an important role in both PA and MDPs; it also offers a clear, intuitive interpretation of many results. Reinforcement learning, TD(λ), neuro-dynamic programming, and related methods are efficient ways of estimating the performance potentials and related quantities from sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight into the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs, such as systems with correlated actions. Potential-based on-line approaches and their advantages are also discussed.
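
The quantities behind these statements can be made concrete. For an ergodic Markov chain with transition matrix P, reward function f, stationary distribution π, and average reward η = πf, the performance potential g solves the Poisson equation (up to an additive constant) and links PA, MDPs, and RL through the performance difference and derivative formulas. The notation below is standard in this literature but is supplied here for illustration, not quoted from this page:

    (I - P) g + η e = f                          (Poisson equation, e the all-ones vector)
    η' - η = π' [ (f' - f) + (P' - P) g ]        (performance difference formula)
    dη/dθ = π [ (dP/dθ) g + df/dθ ]              (performance derivative formula)

Policy iteration follows from the difference formula (choose, state by state, the action that improves the bracketed term), while PA-style gradient estimation follows from the derivative formula; both need only g, which RL methods such as TD(λ) can estimate from a single sample path. Below is a minimal sketch of one such estimator, assuming an ergodic finite-state chain; the function name and step sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def estimate_potentials_td0(states, rewards, n_states, alpha=0.05, kappa=0.01):
    """TD(0)-style estimation of average-reward performance potentials g
    from one sample path (a hedged sketch; step sizes are illustrative)."""
    g = np.zeros(n_states)   # potential estimates, defined up to an additive constant
    eta = 0.0                # running estimate of the average reward
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        eta += kappa * (rewards[t] - eta)                 # track the average reward
        td_error = rewards[t] - eta + g[s_next] - g[s]    # average-reward TD error
        g[s] += alpha * td_error                          # update the visited state's potential
    return g - g[0], eta     # pin g[0] = 0 to fix the additive constant

# Toy usage on a hypothetical 2-state chain:
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([1.0, 5.0])
rng = np.random.default_rng(0)
path = [0]
for _ in range(200_000):
    path.append(rng.choice(2, p=P[path[-1]]))
g_hat, eta_hat = estimate_potentials_td0(path, f[np.array(path[:-1])], 2)
```

The estimated potentials can then be plugged into the difference formula to compare candidate actions (policy iteration) or into the derivative formula to drive gradient-based optimization of a parameterized policy.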


Cite this article

Cao, XR. From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning. Discrete Event Dynamic Systems 13, 9–39 (2003). https://doi.org/10.1023/A:1022188803039
