Part of the book series: Communications and Control Engineering (CCE)

Abstract

In Chap. 5, we consider an approximate rolling-horizon control framework for solving infinite-horizon MDPs with large state and action spaces on-line by simulation. Specifically, we consider policies in which the system (either the actual system or a simulation model of it) evolves to an observed state, and the action to take in that state is computed on-line at decision time, with particular emphasis on the use of simulation. We first present a multiplicative-weights scheme for updating a probability distribution over a restricted set of policies; by sampling this restricted policy space, the scheme estimates the optimal value function over the restricted set. This estimate lower-bounds the optimal value function and is used to construct the on-line control policies called (simulated) policy switching and parallel rollout. We also discuss an upper-bound-based method called hindsight optimization. Finally, we present an algorithm called approximate stochastic annealing, which combines Q-learning with the MARS algorithm of Sect. 4.6.1 to search the policy space directly.
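
To make the rolling-horizon idea concrete, the following is a minimal Python sketch of simulated policy switching and parallel rollout for a discounted MDP. It is an illustrative sketch under stated assumptions, not the book's implementation: the one-step simulation model simulate(state, action) -> (next_state, reward), the collection base_policies, and the constants GAMMA, HORIZON, and NUM_TRAJ are all hypothetical names introduced here.

```python
# Sketch (assumed interface, not the book's code) of two on-line control
# policies built from simulation: (simulated) policy switching and parallel
# rollout over a small set of heuristic base policies.

GAMMA = 0.95      # discount factor (assumed)
HORIZON = 30      # truncation length of each simulated trajectory (assumed)
NUM_TRAJ = 20     # number of sampled trajectories per estimate (assumed)


def rollout_q(simulate, state, action, policy):
    """Monte Carlo estimate of Q^policy(state, action): take `action` once,
    then follow `policy` for a truncated horizon of HORIZON steps."""
    total = 0.0
    for _ in range(NUM_TRAJ):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(HORIZON):
            s, r = simulate(s, a)   # one-step simulation model (assumed)
            ret += disc * r
            disc *= GAMMA
            a = policy(s)           # follow the base policy afterwards
        total += ret
    return total / NUM_TRAJ


def policy_switching_action(simulate, state, base_policies):
    """(Simulated) policy switching: at the current state, follow for one step
    the base policy whose estimated value at that state is largest."""
    best_pi = max(
        base_policies,
        key=lambda pi: rollout_q(simulate, state, pi(state), pi),
    )
    return best_pi(state)


def parallel_rollout_action(simulate, state, actions, base_policies):
    """Parallel rollout: act greedily with respect to max over base policies of
    the estimated Q-values, a lower-bound estimate of the optimal Q-value."""
    return max(
        actions,
        key=lambda a: max(rollout_q(simulate, state, a, pi)
                          for pi in base_policies),
    )
```

At each decision epoch the controller calls one of these routines with the currently observed state, applies the returned action to the actual system (or its simulation model), observes the next state, and repeats. The computational effort of choosing a good action is thus spent on-line, one observed state at a time, rather than over the entire state space in advance.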

Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I. (2013). On-Line Control Methods via Simulation. In: Simulation-Based Algorithms for Markov Decision Processes. Communications and Control Engineering. Springer, London. https://doi.org/10.1007/978-1-4471-5022-0_5

  • DOI: https://doi.org/10.1007/978-1-4471-5022-0_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5021-3

  • Online ISBN: 978-1-4471-5022-0

  • eBook Packages: Engineering (R0)
