On the worst-case analysis of temporal-difference learning algorithms

Schapire, Robert E.; Warmuth, Manfred K.

doi:10.1007/BF00114725

Published: March 1996

Machine Learning, Volume 22, pages 95–121 (1996)

Abstract

We study the behavior of a family of learning algorithms based on Sutton's method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we are able to prove general upper bounds on the performance of a slightly modified version of Sutton's so-called TD(λ) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner's observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
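
To make the setting concrete, here is a minimal sketch of the quantities the abstract refers to: the discounted sum of future reinforcements that the learner tries to estimate, and a linear predictor updated by the standard textbook form of TD(λ). This is not the slightly modified algorithm analyzed in the paper, whose details the abstract does not give. The names `discounted_returns`, `td_lambda`, `gamma`, `lam`, and `eta` are illustrative assumptions, and the whole sequence is assumed to be finite and available in advance so that the next prediction can be formed.

```python
def discounted_returns(rewards, gamma):
    """Target for trial t: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    Computed backward over a finite reward sequence (an assumption of this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_lambda(xs, rewards, gamma, lam, eta):
    """Standard linear TD(lambda) with accumulating eligibility traces.

    xs      : list of feature vectors, one per trial
    rewards : reinforcement received on each trial
    gamma   : discount factor
    lam     : trace-decay parameter lambda
    eta     : learning rate
    Returns the per-trial predictions and the final weight vector.
    """
    n = len(xs[0])
    w = [0.0] * n        # weights of the linear predictor
    trace = [0.0] * n    # eligibility trace: decayed sum of past feature vectors
    preds = []
    for t, x in enumerate(xs):
        y_hat = dot(w, x)
        preds.append(y_hat)
        trace = [gamma * lam * e + xi for e, xi in zip(trace, x)]
        # The prediction after the final trial is taken to be 0 (finite-sequence assumption).
        next_pred = dot(w, xs[t + 1]) if t + 1 < len(xs) else 0.0
        delta = rewards[t] + gamma * next_pred - y_hat  # temporal-difference error
        w = [wi + eta * delta * e for wi, e in zip(w, trace)]
    return preds, w
```

One way to use the sketch is to run `td_lambda` on a sequence and compare its predictions with `discounted_returns(rewards, gamma)`, and likewise for a fixed weight vector, which mirrors the abstract's comparison against the best linear predictor on the same sequence.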

References

Nicolò Cesa-Bianchi, Philip M. Long, & Manfred K. Warmuth. (1993). Worst-case quadratic loss bounds for a generalization of the Widrow-Hoff rule. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 429–438.
Peter Dayan. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3/4):341–362.
Peter Dayan & Terrence J. Sejnowski. (1994). TD(λ) converges with probability 1. Machine Learning, 14(3):295–301.
Roger A. Horn & Charles R. Johnson. (1985). Matrix Analysis. Cambridge University Press.
Tommi Jaakkola, Michael I. Jordan, & Satinder P. Singh. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Technical Report 9307, MIT Computational Cognitive Science.
Jyrki Kivinen & Manfred K. Warmuth. (1994). Additive versus exponentiated gradient updates for learning linear functions. Technical Report UCSC-CRL-94-16, University of California Santa Cruz, Computer Research Laboratory.
Richard S. Sutton. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.
C. J. C. H. Watkins. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.

Author information

Authors and Affiliations

Robert E. Schapire: AT&T Bell Laboratories, 600 Mountain Avenue, Room 2A-424, Murray Hill, NJ 07974
Manfred K. Warmuth: Computer and Information Sciences, University of California, Santa Cruz, CA 95064

About this article

Cite this article

Schapire, R.E., Warmuth, M.K. On the worst-case analysis of temporal-difference learning algorithms. Mach Learn 22, 95–121 (1996). https://doi.org/10.1007/BF00114725

Received: 02 November 1994
Accepted: 23 February 1995
Issue Date: March 1996
DOI: https://doi.org/10.1007/BF00114725
