research-article

Offline A/B Testing for Recommender Systems

Authors:
Alexandre Gilotte (Criteo Research, Paris, France)
Clément Calauzènes (Criteo Research, Paris, France)
Thomas Nedelec (Criteo Research, Paris, France)
Alexandre Abraham (Criteo Research, Paris, France)
Simon Dollé (Criteo Research, Paris, France)

WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
February 2018
Pages 198–206
https://doi.org/10.1145/3159652.3159687

Published: 02 February 2018

ABSTRACT

Online A/B testing evaluates the impact of a new technology by running it in a real production environment and testing its performance on a subset of the users of the platform. It is a well-known practice to run a preliminary offline evaluation on historical data to iterate faster on new ideas, and to detect poor policies in order to avoid losing money or breaking the system. For such offline evaluations, we are interested in methods that can compute offline an estimate of the potential uplift of performance generated by a new technology. Offline performance can be measured using estimators known as counterfactual or off-policy estimators. Traditional counterfactual estimators, such as capped importance sampling or normalised importance sampling, exhibit unsatisfying bias-variance compromises when experimenting on personalized product recommendation systems. To overcome this issue, we model the bias incurred by these estimators rather than bound it in the worst case, which leads us to propose a new counterfactual estimator. We provide a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
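
The abstract names capped and normalised importance sampling without defining them. The following is a minimal sketch of the textbook forms of these two estimators, next to plain importance sampling, assuming each logged interaction records a reward together with the probability that the logging policy and the candidate policy assign to the displayed action. The function names, the toy data, and the cap value are illustrative assumptions and do not come from the paper; this is not the new estimator the authors propose.

```python
from typing import List, Sequence


def importance_weights(target_probs: Sequence[float],
                       logging_probs: Sequence[float]) -> List[float]:
    """Weight of each logged action: pi_target(a|x) / pi_logging(a|x)."""
    return [p_t / p_l for p_t, p_l in zip(target_probs, logging_probs)]


def plain_is(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Plain importance sampling: average of weight * reward."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def capped_is(rewards: Sequence[float], weights: Sequence[float],
              cap: float) -> float:
    """Capped importance sampling: clip each weight at `cap` before averaging.

    Clipping lowers variance but introduces bias.
    """
    return sum(min(w, cap) * r for w, r in zip(weights, rewards)) / len(rewards)


def normalised_is(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Normalised importance sampling: divide by the sum of the weights
    instead of the number of samples."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical toy log: four interactions, binary rewards (e.g. click / no click).
rewards = [1.0, 0.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.125, 0.5]   # probabilities under the logging policy
target_probs = [0.25, 0.5, 0.75, 0.25]   # probabilities under the candidate policy

weights = importance_weights(target_probs, logging_probs)
estimate_plain = plain_is(rewards, weights)
estimate_capped = capped_is(rewards, weights, cap=2.0)
estimate_normalised = normalised_is(rewards, weights)
```

In this toy log a single interaction carries a large weight, so the three estimates differ noticeably. That gap illustrates the bias-variance trade-off the abstract refers to: capping and normalising both tame large weights, at the cost of bias.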

Index Terms

Offline A/B Testing for Recommender Systems
1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Learning from implicit feedback
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published in
WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
February 2018
821 pages
ISBN: 9781450355810
DOI: 10.1145/3159652
General Chairs:
Yi Chang (Jilin University, Huawei Inc.)
Chengxiang Zhai (University of Illinois Urbana-Champaign)
Program Chairs:
Yan Liu (University of Southern California)
Yoelle Maarek (Amazon)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Honorable Mention
Author Tags
counterfactual estimation
importance sampling
off-policy evaluation
recommender system
Acceptance Rates
WSDM '18 Paper Acceptance Rate: 81 of 514 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
Article Metrics
- Total Citations: 97
- Total Downloads: 1,481
- Downloads (Last 12 months): 189
- Downloads (Last 6 weeks): 29