Reinforcement Learning with Replacing Eligibility Traces

Singh, Satinder P.; Sutton, Richard S.

doi:10.1023/A:1018012322525

Published: January 1996

Machine Learning, Volume 22, pages 123–158 (1996)

Abstract

The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
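
The difference between the two kinds of trace described in the abstract can be illustrated with a small sketch. It assumes a tabular setting in which each state keeps one scalar trace; at every step all traces shrink by a decay factor, and the trace of the state just visited is either incremented by 1 (the conventional trace, often called the accumulating trace) or reset to 1 (the replacing trace). The function names `update_traces` and `run`, the decay value 0.5, and the visit sequence are illustrative assumptions, not taken from the paper.

```python
def update_traces(traces, visited_state, decay, kind):
    """Return new traces after one time step (illustrative sketch).

    traces        -- dict mapping state -> current trace value
    visited_state -- the state occupied at this step
    decay         -- per-step decay factor (assumed here to stand for gamma * lambda)
    kind          -- "accumulating" (conventional) or "replacing"
    """
    # Every existing trace fades with time since its last visit.
    new_traces = {state: decay * value for state, value in traces.items()}

    if kind == "accumulating":
        # Conventional trace: a repeated visit adds on top of what is left.
        new_traces[visited_state] = new_traces.get(visited_state, 0.0) + 1.0
    elif kind == "replacing":
        # Replacing trace: a repeated visit resets the trace to 1.
        new_traces[visited_state] = 1.0
    else:
        raise ValueError(f"unknown trace kind: {kind!r}")
    return new_traces


def run(visits, decay, kind):
    """Apply update_traces over a whole sequence of visited states."""
    traces = {}
    for state in visits:
        traces = update_traces(traces, state, decay, kind)
    return traces


if __name__ == "__main__":
    visits = ["A", "A", "A", "B"]  # state A is revisited before B is reached
    print(run(visits, 0.5, "accumulating"))  # {'A': 0.875, 'B': 1.0}
    print(run(visits, 0.5, "replacing"))     # {'A': 0.5, 'B': 1.0}
```

In this run the accumulating trace for the revisited state A ends at 0.875, while the replacing trace ends at 0.5, the value it would have after a single visit one step earlier. This is the sense in which only the conventional trace gives greater credit to repeated events.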

Cite this article

Singh, S.P., Sutton, R.S. Reinforcement Learning with Replacing Eligibility Traces. Machine Learning 22, 123–158 (1996). https://doi.org/10.1023/A:1018012322525
