Abstract
In reinforcement learning problems with large-scale or continuous state and action spaces, approximate reinforcement learning methods fit the value function or policy with function approximators. Least-squares approximation extracts more useful information from each sample and can be applied effectively in online algorithms. Because of the complexity of reinforcement learning problems, samples often cannot be generated from the target policy itself; instead, samples collected under a behavior policy must be used to evaluate the target policy, which calls for off-policy methods. Eligibility traces can usually accelerate the convergence of such algorithms. This paper proposes two off-policy least-squares algorithms with eligibility traces based on importance reweighting: OFP-LSPE-Q and OFP-LSTD-Q. The derivation indicates that, in the off-policy setting, the convergence of these algorithms improves as the sample size grows, compared with traditional least-squares reinforcement learning methods.
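To make the idea concrete, the following is a minimal sketch of an off-policy least-squares temporal-difference update for Q-values with eligibility traces and importance reweighting. It is an illustration of the general technique, not the paper's exact OFP-LSTD-Q algorithm: the function names, the exact placement of the importance weight in the trace update, and the small regularization term are assumptions, and published variants differ in these details.

```python
import numpy as np

def off_policy_lstd_q_lambda(transitions, phi, target_pi, behavior_pi,
                             gamma=0.95, lam=0.7):
    """Sketch of off-policy LSTD-Q(lambda) with importance reweighting.

    transitions: list of (s, a, r, s_next, a_next) tuples collected
                 under the behavior policy
    phi(s, a):   feature vector for a state-action pair
    target_pi(a, s), behavior_pi(a, s): action probabilities under the
                 target and behavior policies
    """
    k = len(phi(*transitions[0][:2]))
    A = np.zeros((k, k))
    b = np.zeros(k)
    z = np.zeros(k)  # eligibility trace over features
    for (s, a, r, s_next, a_next) in transitions:
        rho = target_pi(a, s) / behavior_pi(a, s)  # importance weight
        feat = phi(s, a)
        feat_next = phi(s_next, a_next)
        # decay the trace, add the current features, and reweight
        z = rho * (gamma * lam * z + feat)
        # accumulate the least-squares statistics A*theta = b
        A += np.outer(z, feat - gamma * feat_next)
        b += z * r
    # tiny ridge term keeps the solve well-posed on few samples
    theta = np.linalg.solve(A + 1e-6 * np.eye(k), b)
    return theta
```

The least-squares structure is what lets the method reuse every sample: all transitions are folded into the matrix `A` and vector `b`, and the value-function weights come from a single linear solve rather than many small gradient steps.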
Acknowledgements
This work is sponsored by the scientific research backbone training project of Nantong Institute of Technology, the Universities Natural Science Research Project of Jiangsu Province under Grant No. 17KJB520031 and the Universities Natural Science Research Project of Anhui Province under Grant No. KJ2016A664.
Cite this article
Zhang, H., Hong, Y. & Qiu, J. An off-policy least square algorithms with eligibility trace based on importance reweighting. Cluster Comput 20, 3475–3487 (2017). https://doi.org/10.1007/s10586-017-1165-0