DOI: 10.1145/3159652.3159687
research-article
Honorable Mention

Offline A/B Testing for Recommender Systems

Published: 02 February 2018

ABSTRACT

Online A/B testing evaluates the impact of a new technology by running it in a real production environment and testing its performance on a subset of the users of the platform. It is a well-known practice to run a preliminary offline evaluation on historical data to iterate faster on new ideas, and to detect poor policies in order to avoid losing money or breaking the system. For such offline evaluations, we are interested in methods that can compute offline an estimate of the potential uplift of performance generated by a new technology. Offline performance can be measured using estimators known as counterfactual or off-policy estimators. Traditional counterfactual estimators, such as capped importance sampling or normalised importance sampling, exhibit unsatisfying bias-variance compromises when experimenting on personalized product recommendation systems. To overcome this issue, we model the bias incurred by these estimators rather than bound it in the worst case, which leads us to propose a new counterfactual estimator. We provide a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
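As a rough illustration of the traditional estimators the abstract names, the sketch below computes plain, capped, and normalised importance sampling estimates on a toy log. It does not implement the new estimator proposed in the paper, which the abstract does not specify. The function names, the toy data, and the cap value are all assumptions made for illustration.

```python
from typing import List, Sequence


def importance_weights(pi_new: Sequence[float], pi_logged: Sequence[float]) -> List[float]:
    """Ratio of new-policy to logging-policy probability for each logged action."""
    return [p / q for p, q in zip(pi_new, pi_logged)]


def is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Plain importance sampling: unbiased, but variance grows with large weights."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def capped_is_estimate(rewards: Sequence[float], weights: Sequence[float], cap: float) -> float:
    """Capped importance sampling: clip each weight at `cap`, trading bias for variance."""
    return sum(min(w, cap) * r for w, r in zip(weights, rewards)) / len(rewards)


def normalised_is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Normalised importance sampling: divide by the sum of weights instead of n."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical log of four recommendations (values chosen for illustration only).
rewards = [1.0, 0.0, 1.0, 0.0]         # observed reward, e.g. click or no click
pi_logged = [0.5, 0.5, 0.125, 0.25]    # probability the logging policy gave the shown action
pi_new = [0.25, 0.5, 0.5, 0.25]        # probability the candidate policy gives the same action

weights = importance_weights(pi_new, pi_logged)
plain = is_estimate(rewards, weights)
capped = capped_is_estimate(rewards, weights, cap=2.0)
normalised = normalised_is_estimate(rewards, weights)
print(plain, capped, normalised)
```

On this toy log the third record carries a weight of 4, so capping at 2 visibly lowers the estimate. This is the kind of bias-variance trade-off the abstract refers to.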


        Published in

        WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
        February 2018
        821 pages
        ISBN: 9781450355810
        DOI: 10.1145/3159652

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        WSDM '18 Paper Acceptance Rate: 81 of 514 submissions, 16%
        Overall Acceptance Rate: 498 of 2,863 submissions, 17%
