Off-policy least-squares algorithms with eligibility traces based on importance reweighting

Abstract

In reinforcement learning problems with large-scale or continuous state and action spaces, approximate reinforcement learning methods use function approximation to fit the policy. Least-squares approximation extracts more useful information from the samples and can be applied effectively in online algorithms. Because of the complexity of such problems, samples often cannot be generated by the target policy itself, so off-policy methods, which evaluate the target policy from samples collected under a behavior policy, must be used. Eligibility traces usually accelerate the convergence of these algorithms. This paper proposes two off-policy least-squares algorithms with eligibility traces based on importance reweighting: OFP-LSPE-Q and OFP-LSTD-Q. The derivation indicates that, in the off-policy setting, the proposed algorithms converge faster as the sample size increases, compared with traditional least-squares reinforcement learning methods.
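
To make the importance-reweighting idea concrete, the sketch below shows a generic off-policy LSTD-Q(λ) update in which the eligibility trace is scaled by the per-step importance ratio π(a|s)/μ(a|s) before the least-squares system is accumulated. It is only a minimal illustration of the general technique, not the paper's OFP-LSTD-Q or OFP-LSPE-Q derivations; the names phi, pi_prob, mu_prob, pi_greedy and the parameters gamma, lam and reg are placeholder assumptions.

```python
import numpy as np

def off_policy_lstd_q_lambda(samples, phi, pi_prob, mu_prob, pi_greedy,
                             gamma=0.99, lam=0.8, reg=1e-6):
    """Illustrative off-policy LSTD-Q(lambda) with importance-reweighted traces.

    samples:   ordered transitions (s, a, r, s_next) collected under the
               behavior policy mu.
    phi:       feature map phi(s, a) -> np.ndarray of shape (d,).
    pi_prob:   pi_prob(a, s), probability of taking a in s under the target policy.
    mu_prob:   mu_prob(a, s), probability of taking a in s under the behavior policy.
    pi_greedy: pi_greedy(s), action the target policy selects in s.
    Returns w such that Q_pi(s, a) is approximated by phi(s, a) @ w.
    """
    d = phi(*samples[0][:2]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)                                 # eligibility trace

    for (s, a, r, s_next) in samples:
        rho = pi_prob(a, s) / mu_prob(a, s)         # importance weight pi / mu
        f = phi(s, a)
        f_next = phi(s_next, pi_greedy(s_next))     # bootstrap on the target policy's action
        z = rho * (gamma * lam * z + f)             # decay, accumulate, and reweight the trace
        A += np.outer(z, f - gamma * f_next)        # accumulate the least-squares system
        b += z * r

    # Regularised least-squares solve; the ridge term keeps A invertible
    # when the trajectory is short relative to the feature dimension d.
    return np.linalg.solve(A + reg * np.eye(d), b)
```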

Acknowledgements

This work was sponsored by the Scientific Research Backbone Training Project of Nantong Institute of Technology, the Universities Natural Science Research Project of Jiangsu Province under Grant No. 17KJB520031, and the Universities Natural Science Research Project of Anhui Province under Grant No. KJ2016A664.

Author information

Corresponding author

Correspondence to Haifei Zhang.

About this article

Cite this article

Zhang, H., Hong, Y. & Qiu, J. An off-policy least square algorithms with eligibility trace based on importance reweighting. Cluster Comput 20, 3475–3487 (2017). https://doi.org/10.1007/s10586-017-1165-0
