Abstract
In Chap. 5, we consider an approximate rolling-horizon control framework for solving infinite-horizon MDPs with large state/action spaces in an on-line manner via simulation. Specifically, we consider policies in which the system (either the actual system or a simulation model of it) evolves to an observed state, and the action for that state is then computed on-line at decision time, with a particular emphasis on the use of simulation. We first present a multiplicative-weights scheme for updating a probability distribution over a restricted set of policies; by sampling on the (restricted) policy space, this scheme estimates the optimal value function over the restricted set, which provides a lower bound on the true optimal value function. This lower-bound estimate is used to construct on-line control policies called (simulated) policy switching and parallel rollout. We also discuss an upper-bound-based method called hindsight optimization. Finally, we present approximate stochastic annealing, an algorithm that combines Q-learning with the MARS algorithm of Sect. 4.6.1 to search the policy space directly.
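As an illustrative sketch (not the chapter's own pseudocode), the on-line idea behind policy switching can be summarized as: at the currently observed state, simulate each policy in a small base set over a finite horizon, then act according to the policy with the best sampled return. All function names and parameters below are hypothetical; the simulator, reward, and base policies stand in for whatever problem-specific models are available.

```python
def simulate(step, reward, policy, s, horizon, gamma, runs):
    """Monte Carlo estimate of the discounted return of `policy` from state s.

    `step(x, a)` samples a next state; `reward(x, a)` is the one-step reward.
    """
    total = 0.0
    for _ in range(runs):
        x, g, disc = s, 0.0, 1.0
        for _ in range(horizon):
            a = policy(x)
            g += disc * reward(x, a)
            x = step(x, a)
            disc *= gamma
        total += g
    return total / runs

def policy_switching_action(step, reward, policies, s,
                            horizon=20, gamma=0.95, runs=50):
    """At state s, follow the base policy with the best simulated return."""
    best = max(policies,
               key=lambda p: simulate(step, reward, p, s, horizon, gamma, runs))
    return best(s)
```

Because the simulated returns of the base policies lower-bound the optimal value at each visited state, the resulting switching policy improves on every policy in the base set; parallel rollout strengthens this by maximizing over actions against the same lower-bound estimate.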
© 2013 Springer-Verlag London
Cite this chapter
Chang, H.S., Hu, J., Fu, M.C., Marcus, S.I. (2013). On-Line Control Methods via Simulation. In: Simulation-Based Algorithms for Markov Decision Processes. Communications and Control Engineering. Springer, London. https://doi.org/10.1007/978-1-4471-5022-0_5
Print ISBN: 978-1-4471-5021-3
Online ISBN: 978-1-4471-5022-0