Off-policy least-squares algorithms with eligibility traces based on importance reweighting

Abstract

In reinforcement learning problems with large-scale or continuous state and action spaces, approximate reinforcement learning methods use function approximation to fit the policy. Least-squares approximation extracts more useful information from the samples and can be applied effectively in online algorithms. Because of the complexity of such problems, samples often cannot be generated by the target policy itself, so off-policy methods, which evaluate the target policy from samples collected under a behavior policy, must be used. Eligibility traces usually accelerate the convergence of these algorithms. This paper proposes two off-policy least-squares algorithms with eligibility traces based on importance reweighting: OFP-LSPE-Q and OFP-LSTD-Q. The derivation indicates that, in the off-policy setting, the proposed algorithms converge faster as the sample size increases, compared with traditional least-squares reinforcement learning methods.
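
To make the importance-reweighting idea concrete, the sketch below shows a generic off-policy LSTD-Q(λ) update in which the eligibility trace is scaled by the per-step importance ratio π(a|s)/μ(a|s) before the least-squares system is accumulated. It is only a minimal illustration of the general technique, not the paper's OFP-LSTD-Q or OFP-LSPE-Q derivations; the names phi, pi_prob, mu_prob, pi_greedy and the parameters gamma, lam and reg are placeholder assumptions.

```python
import numpy as np

def off_policy_lstd_q_lambda(samples, phi, pi_prob, mu_prob, pi_greedy,
                             gamma=0.99, lam=0.8, reg=1e-6):
    """Illustrative off-policy LSTD-Q(lambda) with importance-reweighted traces.

    samples:   ordered transitions (s, a, r, s_next) collected under the
               behavior policy mu.
    phi:       feature map phi(s, a) -> np.ndarray of shape (d,).
    pi_prob:   pi_prob(a, s), probability of taking a in s under the target policy.
    mu_prob:   mu_prob(a, s), probability of taking a in s under the behavior policy.
    pi_greedy: pi_greedy(s), action the target policy selects in s.
    Returns w such that Q_pi(s, a) is approximated by phi(s, a) @ w.
    """
    d = phi(*samples[0][:2]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)                                 # eligibility trace

    for (s, a, r, s_next) in samples:
        rho = pi_prob(a, s) / mu_prob(a, s)         # importance weight pi / mu
        f = phi(s, a)
        f_next = phi(s_next, pi_greedy(s_next))     # bootstrap on the target policy's action
        z = rho * (gamma * lam * z + f)             # decay, accumulate, and reweight the trace
        A += np.outer(z, f - gamma * f_next)        # accumulate the least-squares system
        b += z * r

    # Regularised least-squares solve; the ridge term keeps A invertible
    # when the trajectory is short relative to the feature dimension d.
    return np.linalg.solve(A + reg * np.eye(d), b)
```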

Acknowledgements

This work was sponsored by the Scientific Research Backbone Training Project of Nantong Institute of Technology, the Universities Natural Science Research Project of Jiangsu Province under Grant No. 17KJB520031, and the Universities Natural Science Research Project of Anhui Province under Grant No. KJ2016A664.

Author information

Corresponding author

Correspondence to Haifei Zhang.

About this article

Cite this article

Zhang, H., Hong, Y. & Qiu, J. An off-policy least square algorithms with eligibility trace based on importance reweighting. Cluster Comput 20, 3475–3487 (2017). https://doi.org/10.1007/s10586-017-1165-0
