Part of the book series: Communications and Control Engineering (CCE)

Abstract

In Chap. 5, we consider an approximate rolling-horizon control framework for solving infinite-horizon MDPs with large state and action spaces on-line by simulation. Specifically, we consider policies in which the system (either the actual system or a simulation model of it) evolves to an observed state, and the action to take in that state is computed on-line at decision time, with particular emphasis on the use of simulation. We first present a multiplicative-weights scheme for updating a probability distribution over a restricted set of policies; by sampling this restricted policy space, the scheme estimates the optimal value function over the restricted set. This estimate lower-bounds the optimal value function and is used to construct the on-line control policies called (simulated) policy switching and parallel rollout. We also discuss an upper-bound-based method called hindsight optimization. Finally, we present an algorithm called approximate stochastic annealing, which combines Q-learning with the MARS algorithm of Sect. 4.6.1 to search the policy space directly.
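
To make the rolling-horizon idea concrete, the following is a minimal Python sketch of simulated policy switching and parallel rollout for a discounted MDP. It is an illustrative sketch under stated assumptions, not the book's implementation: the one-step simulation model simulate(state, action) -> (next_state, reward), the collection base_policies, and the constants GAMMA, HORIZON, and NUM_TRAJ are all hypothetical names introduced here.

```python
# Sketch (assumed interface, not the book's code) of two on-line control
# policies built from simulation: (simulated) policy switching and parallel
# rollout over a small set of heuristic base policies.

GAMMA = 0.95      # discount factor (assumed)
HORIZON = 30      # truncation length of each simulated trajectory (assumed)
NUM_TRAJ = 20     # number of sampled trajectories per estimate (assumed)


def rollout_q(simulate, state, action, policy):
    """Monte Carlo estimate of Q^policy(state, action): take `action` once,
    then follow `policy` for a truncated horizon of HORIZON steps."""
    total = 0.0
    for _ in range(NUM_TRAJ):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(HORIZON):
            s, r = simulate(s, a)   # one-step simulation model (assumed)
            ret += disc * r
            disc *= GAMMA
            a = policy(s)           # follow the base policy afterwards
        total += ret
    return total / NUM_TRAJ


def policy_switching_action(simulate, state, base_policies):
    """(Simulated) policy switching: at the current state, follow for one step
    the base policy whose estimated value at that state is largest."""
    best_pi = max(
        base_policies,
        key=lambda pi: rollout_q(simulate, state, pi(state), pi),
    )
    return best_pi(state)


def parallel_rollout_action(simulate, state, actions, base_policies):
    """Parallel rollout: act greedily with respect to max over base policies of
    the estimated Q-values, a lower-bound estimate of the optimal Q-value."""
    return max(
        actions,
        key=lambda a: max(rollout_q(simulate, state, a, pi)
                          for pi in base_policies),
    )
```

At each decision epoch the controller calls one of these routines with the currently observed state, applies the returned action to the actual system (or its simulation model), observes the next state, and repeats. The computational effort of choosing a good action is thus spent on-line, one observed state at a time, rather than over the entire state space in advance.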

Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I. (2013). On-Line Control Methods via Simulation. In: Simulation-Based Algorithms for Markov Decision Processes. Communications and Control Engineering. Springer, London. https://doi.org/10.1007/978-1-4471-5022-0_5

  • DOI: https://doi.org/10.1007/978-1-4471-5022-0_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5021-3

  • Online ISBN: 978-1-4471-5022-0

  • eBook Packages: Engineering (R0)
