ABSTRACT
Online A/B testing evaluates the impact of a new technology by running it in a real production environment and measuring its performance on a subset of the platform's users. It is common practice to first run a preliminary offline evaluation on historical data, both to iterate faster on new ideas and to detect poor policies before they lose money or break the system. For such offline evaluations, we are interested in methods that can estimate offline the potential uplift in performance generated by a new technology. Offline performance can be measured using estimators known as counterfactual or off-policy estimators. Traditional counterfactual estimators, such as capped importance sampling or normalised importance sampling, exhibit unsatisfactory bias-variance trade-offs when experimenting on personalized product recommendation systems. To overcome this issue, we model the bias incurred by these estimators rather than bound it in the worst case, which leads us to propose a new counterfactual estimator. We provide a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
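To make the traditional estimators named in the abstract concrete, here is a minimal Python sketch of plain, capped, and normalised importance sampling in their textbook forms. It is an illustration under stated assumptions, not the estimator proposed in the paper: the function names, the toy log, and the cap value are hypothetical, and each logged event is assumed to carry a reward together with the probabilities that the logging policy and the candidate policy assign to the logged action.

```python
from typing import List, Sequence


def importance_weights(target_probs: Sequence[float],
                       logging_probs: Sequence[float]) -> List[float]:
    """Weight w_i = pi_target(a_i | x_i) / pi_logging(a_i | x_i)."""
    return [p / q for p, q in zip(target_probs, logging_probs)]


def is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Plain importance sampling: unbiased, but high variance when weights are large."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def capped_is_estimate(rewards: Sequence[float], weights: Sequence[float],
                       cap: float) -> float:
    """Capped importance sampling: clip each weight at `cap`, trading variance for bias."""
    return sum(min(w, cap) * r for w, r in zip(weights, rewards)) / len(rewards)


def normalised_is_estimate(rewards: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Normalised importance sampling: divide by the sum of weights instead of n."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical toy log of four events (values chosen for illustration only).
rewards = [1.0, 0.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.125, 0.25]
target_probs = [0.25, 0.5, 0.75, 0.5]

weights = importance_weights(target_probs, logging_probs)
plain = is_estimate(rewards, weights)
capped = capped_is_estimate(rewards, weights, cap=2.0)
normalised = normalised_is_estimate(rewards, weights)
print(plain, capped, normalised)
```

On this toy log, a single event with a large weight dominates the plain estimate, while capping and normalisation both pull the estimate down. This is the kind of bias-variance trade-off the abstract refers to.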
Index Terms
- Offline A/B Testing for Recommender Systems