Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms

ABSTRACT
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of new algorithms in these applications is critical for protecting the online user experience, but it is very challenging because of their "partial-label" nature: for each logged event, only the reward of the item that was actually displayed is observed. Common practice is to build a simulator of the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator is itself often difficult, and modeling bias is usually unavoidably introduced. In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Unlike simulator-based approaches, our method is completely data-driven and easy to adapt to different applications. More importantly, our method provides provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from the Yahoo! Front Page agree well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluations of several contextual bandit algorithms demonstrate the accuracy and effectiveness of our offline evaluation method.
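The replay idea the abstract describes can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: it assumes logged events were collected while arms (e.g., articles) were chosen uniformly at random, and the `policy` interface (`choose`/`update`) is a hypothetical name chosen for this example. An event contributes to the estimate only when the evaluated policy picks the same arm the logger displayed; all other events are skipped, which is what makes the estimate unbiased under uniform-random logging.

```python
import random

def replay_evaluate(policy, logged_events):
    """Estimate a bandit policy's average per-trial reward from logged data.

    Each logged event is a (context, logged_arm, reward) triple recorded
    while arms were selected uniformly at random. An event counts toward
    the estimate only when the policy chooses the arm that was logged;
    all other events are discarded, handling the "partial-label" problem
    without a simulator.
    """
    total_reward = 0.0
    matched = 0
    for context, logged_arm, reward in logged_events:
        chosen = policy.choose(context)
        if chosen == logged_arm:
            total_reward += reward
            matched += 1
            # The policy learns only from events it would have generated.
            policy.update(context, logged_arm, reward)
    return total_reward / matched if matched else 0.0

class RandomPolicy:
    """Illustrative baseline policy: picks an arm uniformly at random."""
    def __init__(self, arms, seed=0):
        self.arms = arms
        self.rng = random.Random(seed)
    def choose(self, context):
        return self.rng.choice(self.arms)
    def update(self, context, arm, reward):
        pass  # a learning policy (e.g., LinUCB) would update its model here
```

With uniform-random logging over K arms, roughly a 1/K fraction of logged events match the evaluated policy's choices, so the log must be large enough for the retained subsample to give a stable estimate.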