Abstract
We study the behavior of a family of learning algorithms based on Sutton's method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we prove general upper bounds on the performance of a slightly modified version of Sutton's so-called TD(λ) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without any statistical assumptions about the process producing the learner's observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
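As an illustration of the quantities the abstract refers to, the following is a minimal sketch of the discounted sum of future reinforcements and of a textbook linear TD(λ) update with accumulating eligibility traces. It is not the slightly modified variant analyzed in the paper, and the function names, the zero initial weights, the fixed learning rate eta, and the convention that the prediction after the final trial is 0 are all assumptions made for the example.

```python
def discounted_returns(rewards, gamma):
    """Return y_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite sequence."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each y_t reuses y_{t+1}: y_t = r_t + gamma * y_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_lambda(xs, rewards, gamma, lam, eta):
    """Textbook linear TD(lambda) over one finite sequence (illustrative only).

    xs      -- list of instance vectors x_t (lists of floats)
    rewards -- list of reinforcements r_t, one per trial
    Returns (predictions made on each trial, final weight vector).
    """
    n = len(xs[0])
    w = [0.0] * n        # assumed initial weights
    trace = [0.0] * n    # eligibility trace: sum over k <= t of (gamma*lam)^(t-k) * x_k
    preds = []
    for t, x in enumerate(xs):
        y_hat = dot(w, x)
        preds.append(y_hat)
        # Next prediction uses the current weights; assumed 0 after the last trial.
        y_next = dot(w, xs[t + 1]) if t + 1 < len(xs) else 0.0
        trace = [gamma * lam * e + xi for e, xi in zip(trace, x)]
        delta = rewards[t] + gamma * y_next - y_hat   # temporal-difference error
        w = [wi + eta * delta * ei for wi, ei in zip(w, trace)]
    return preds, w
```

For instance, `discounted_returns([1.0, 1.0, 1.0], 0.5)` evaluates to `[1.75, 1.5, 1.0]`. The predictions returned by `td_lambda` can then be compared against these discounted sums, for example by squared error, to see how the learner tracks its target on a given sequence.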
References
Nicolò Cesa-Bianchi, Philip M. Long, & Manfred K. Warmuth. (1993). Worst-case quadratic loss bounds for a generalization of the Widrow-Hoff rule. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 429–438.
Peter Dayan. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3/4):341–362.
Peter Dayan & Terrence J. Sejnowski. (1994). TD(λ) converges with probability 1. Machine Learning, 14(3):295–301.
Roger A. Horn & Charles R. Johnson. (1985). Matrix Analysis. Cambridge University Press.
Tommi Jaakkola, Michael I. Jordan, & Satinder P. Singh. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Technical Report 9307, MIT Computational Cognitive Science.
Jyrki Kivinen & Manfred K. Warmuth. (1994). Additive versus exponentiated gradient updates for learning linear functions. Technical Report UCSC-CRL-94-16, University of California Santa Cruz, Computer Research Laboratory.
Richard S. Sutton. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.
C. J. C. H. Watkins. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.
Cite this article
Schapire, R.E., Warmuth, M.K. On the worst-case analysis of temporal-difference learning algorithms. Mach Learn 22, 95–121 (1996). https://doi.org/10.1007/BF00114725