
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms

Published: 9 February 2011

ABSTRACT

Contextual bandit algorithms have become popular in online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of new algorithms in these applications is critical for protecting the online user experience, but it is very challenging due to their "partial-label" nature: the log records a reward only for the article that was actually displayed, not for the articles a new algorithm might have chosen. Common practice is to create a simulator of the online environment and run the algorithm against it. However, creating the simulator itself is often difficult, and modeling bias is usually unavoidable. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and easy to adapt to different applications. More importantly, it provides provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from the Yahoo! Front Page agree well with our theoretical results. Furthermore, comparisons between offline replay and online bucket evaluations of several contextual bandit algorithms demonstrate the accuracy and effectiveness of our offline evaluation method.
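The replay idea described above is simple enough to sketch in a few lines. Below is a minimal, illustrative Python version, assuming the log was collected by a logging policy that chose arms uniformly at random (the condition under which the unbiasedness guarantee applies); the names replay_evaluate and RandomPolicy and the (context, arm, reward) event layout are our own, not the paper's.

```python
import random

class RandomPolicy:
    """Illustrative baseline policy: chooses among the arms uniformly at random."""
    def __init__(self, arms):
        self.arms = list(arms)

    def choose(self, context):
        return random.choice(self.arms)

    def update(self, context, arm, reward):
        pass  # a learning policy (e.g., LinUCB) would update its model here

def replay_evaluate(policy, logged_events):
    """Replay evaluator: estimate a policy's average per-trial reward from a
    log of (context, displayed_arm, reward) events collected under uniform-
    random arm selection. Events where the policy's choice differs from the
    logged arm are skipped; rewards on the matching events give an unbiased
    estimate of the policy's online performance."""
    total_reward = 0.0
    matches = 0
    for context, logged_arm, reward in logged_events:
        if policy.choose(context) == logged_arm:  # policy agrees with the log
            total_reward += reward
            matches += 1
            policy.update(context, logged_arm, reward)  # learn only from retained events
    return total_reward / matches if matches else 0.0

# Hypothetical usage: a synthetic log of 10,000 events over 5 arms with a ~4% click rate.
arms = range(5)
log = [({"hour": t % 24}, random.choice(list(arms)), float(random.random() < 0.04))
       for t in range(10_000)]
print(replay_evaluate(RandomPolicy(arms), log))
```

Note the cost of unbiasedness in this sketch: with K arms under uniform-random logging, only about a 1/K fraction of events match the candidate policy's choice and are retained, so a log of L events yields roughly L/K evaluation trials. This is why large-scale logs such as the Yahoo! Front Page dataset are needed for tight estimates.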


Supplemental Material

wsdm2011_li_uoe_01.mp4 (MP4, 154.4 MB)


Published in

WSDM '11: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining
February 2011, 870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


          Acceptance Rates

WSDM '11 paper acceptance rate: 83 of 372 submissions (22%). Overall acceptance rate: 498 of 2,863 submissions (17%).
