ABSTRACT
Counterfactual learning to rank (CLTR) relies on exposure-based inverse propensity scoring (IPS), a LTR-specific adaptation of IPS to correct for position bias. While IPS can provide unbiased and consistent estimates, it often suffers from high variance. Especially when little click data is available, this variance can cause CLTR to learn sub-optimal ranking behavior. Consequently, existing CLTR methods bring significant risks with them, as naively deploying their models can result in very negative user experiences.
We introduce a novel risk-aware CLTR method with theoretical guarantees for safe deployment. We apply a novel exposure-based concept of risk regularization to IPS estimation for LTR. Our risk regularization penalizes the mismatch between the ranking behavior of a learned model and a given safe model. Thereby, it ensures that learned ranking models stay close to a trusted model, when there is high uncertainty in IPS estimation, which greatly reduces the risks during deployment. Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little date is available, while also maintaining high performance at convergence. For the CLTR field, our novel exposure-based risk minimization method enables practitioners to adopt CLTR methods in a safer manner that mitigates many of the risks attached to previous methods.
- Aman Agarwal, Xuanhui Wang, Cheng Li, Michael Bendersky, and Marc Najork. 2019. Addressing Trust Bias for Unbiased Learning-to-rank. In The World Wide Web Conference. 4--14.Google Scholar
- Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to Rank Challenge Overview. In Proceedings of the learning to rank challenge. PMLR, 1--24.Google Scholar
- Olivier Chapelle and Ya Zhang. 2009. A Dynamic Bayesian Network Click Model for Web Search Ranking. In Proceedings of the 18th international conference on World wide web. 1--10.Google ScholarDigital Library
- Jia Chen, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2019. TianGong-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2485--2488.Google ScholarDigital Library
- Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. 2015. Click Models for Web Search. Morgan & Claypool Publishers.Google Scholar
- Corinna Cortes, Yishay Mansour, and Mehryar Mohri. 2010. Learning Bounds for Importance Weighting. In Proceedings of the 23rd International Conference on Neural Information Processing Systems-Volume 1. 442--450.Google Scholar
- Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An Experimental Comparison of Click Position-bias Models. In Proceedings of the 2008 international conference on web search and data mining. 87--94.Google ScholarDigital Library
- Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. 2016. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Transactions on Information Systems (TOIS), Vol. 35, 2 (2016), 1--31.Google ScholarDigital Library
- BK Ghosh. 2002. Probability Inequalities Related to Markov's Theorem. The American Statistician, Vol. 56, 3 (2002), 186--190.Google ScholarCross Ref
- Li He, Long Xia, Wei Zeng, Zhi-Ming Ma, Yihong Zhao, and Dawei Yin. 2019. Off-policy Learning for Multiple Loggers. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1184--1193.Google ScholarDigital Library
- Daniel G Horvitz and Donovan J Thompson. 1952. A Generalization of Sampling Without Replacement From a Finite Universe. Journal of the American statistical Association, Vol. 47, 260 (1952), 663--685.Google ScholarCross Ref
- Rolf Jagerman, Ilya Markov, and Maarten de Rijke. 2020. Safe Exploration for Optimizing Contextual Bandits. ACM Transactions on Information Systems (TOIS), Vol. 38, 3 (2020), 1--23.Google ScholarDigital Library
- Frederick James. 1980. Monte Carlo Theory and Practice. Reports on progress in Physics, Vol. 43, 9 (1980), 1145.Google Scholar
- Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS), Vol. 20, 4 (2002), 422--446.Google ScholarDigital Library
- Shan Jiang, Yuening Hu, Changsung Kang, Tim Daly Jr, Dawei Yin, Yi Chang, and Chengxiang Zhai. 2016. Learning Query and Document Relevance from a Web-scale Click Graph. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 185--194.Google ScholarDigital Library
- Thorsten Joachims. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 133--142.Google ScholarDigital Library
- Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 1199--1201.Google ScholarDigital Library
- Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-rank with Biased Feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 781--789.Google ScholarDigital Library
- Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th international conference on World wide web. 661--670.Google ScholarDigital Library
- Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends® in Information Retrieval, Vol. 3, 3 (2009), 225--331.Google Scholar
- Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein Bounds and Sample-Variance Penalization. In Annual Conference Computational Learning Theory.Google Scholar
- Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. Advances in neural information processing systems, Vol. 29 (2016).Google Scholar
- Harrie Oosterhuis. 2020. Learning from User Interactions with Rankings: A Unification of the Field. Ph.,D. Dissertation. Informatics Institute, University of Amsterdam.Google Scholar
- Harrie Oosterhuis. 2021. Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1023--1032.Google ScholarDigital Library
- Harrie Oosterhuis. 2022a. Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback. arXiv preprint arXiv:2203.17118 (2022).Google Scholar
- Harrie Oosterhuis. 2022b. Reaching the End of Unbiasedness: Uncovering Implicit Limitations of Click-Based Learning to Rank. In Proceedings of the 2022 ACM SIGIR International Conference on the Theory of Information Retrieval. ACM.Google ScholarDigital Library
- Harrie Oosterhuis and Maarten de Rijke. 2020a. Policy-aware Unbiased Learning to Rank for Top-k Rankings. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 489--498.Google ScholarDigital Library
- Harrie Oosterhuis and Maarten de Rijke. 2020b. Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval. 137--144.Google ScholarDigital Library
- Harrie Oosterhuis and Maarten de Rijke. 2021a. Unifying Online and Counterfactual Learning to Rank: A Novel Counterfactual Estimator that Effectively Utilizes Online Interventions. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 463--471.Google ScholarDigital Library
- Harrie Oosterhuis and Maarten de de Rijke. 2021b. Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank. In Proceedings of the Web Conference 2021. 158--170.Google Scholar
- Harrie Oosterhuis, Rolf Jagerman, and Maarten de Rijke. 2020. Unbiased Learning to Rank: Counterfactual and Online Approaches. In Companion Proceedings of the Web Conference 2020. 299--300.Google ScholarDigital Library
- Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. arXiv preprint arXiv:1306.2597 (2013).Google Scholar
- Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval. Information Retrieval, Vol. 13, 4 (2010), 346--374.Google ScholarDigital Library
- Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How Does Clickthrough Data Reflect Retrieval Quality?. In Proceedings of the 17th ACM conference on Information and knowledge management. 43--52.Google ScholarDigital Library
- Alfréd Rényi. 1961. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Vol. 1. Berkeley, California, USA.Google Scholar
- Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2020. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. arXiv preprint arXiv:2008.07146 (2020).Google Scholar
- Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. In Fifteenth ACM Conference on Recommender Systems. 828--830.Google ScholarDigital Library
- Mark Sanderson, Monica Lestari Paramita, Paul Clough, and Evangelos Kanoulas. 2010. Do User Preferences and Evaluation Measures Line Up?. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. 555--562.Google ScholarDigital Library
- Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press.Google ScholarDigital Library
- Mirco Speretta and Susan Gauch. 2005. Personalized Search Based on User Search Histories. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05). IEEE, 622--628.Google Scholar
- Adith Swaminathan and Thorsten Joachims. 2015. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. The Journal of Machine Learning Research, Vol. 16, 1 (2015), 1731--1755.Google ScholarDigital Library
- Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. 2015. High-Confidence Off-Policy Evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.Google ScholarCross Ref
- Ali Vardasbi, Harrie Oosterhuis, and Maarten de Rijke. 2020. When Inverse Propensity Scoring does not Work: Affine Corrections for Unbiased Learning to Rank. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1475--1484.Google ScholarDigital Library
- Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 115--124.Google ScholarDigital Library
- Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 610--618.Google ScholarDigital Library
- Steve Wedig and Omid Madani. 2006. A Large-Scale Analysis of Query Logs for Assessing Personalization Opportunities. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 742--747.Google ScholarDigital Library
- Ryen W White, Mikhail Bilenko, and Silviu Cucerzan. 2007. Studying the Use of Popular Destinations to Enhance Web Search Interaction. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 159--166.Google ScholarDigital Library
- Ronald J Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine learning, Vol. 8, 3 (1992), 229--256.Google ScholarDigital Library
- Hang Wu and May Wang. 2018. Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization. In International Conference on Machine Learning. PMLR, 5353--5362.Google Scholar
- Himank Yadav, Zhengxiao Du, and Thorsten Joachims. 2021. Policy-Gradient Training of Fair and Unbiased Ranking Functions. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1044--1053.Google ScholarDigital Library
Index Terms
- Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization
Recommendations
Policy-Aware Unbiased Learning to Rank for Top-k Rankings
SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information RetrievalCounterfactual Learning to Rank (LTR) methods optimize ranking systems using logged user interactions that contain interaction biases. Existing methods are only unbiased if users are presented with all relevant items in every ranking. There is currently ...
Unifying Online and Counterfactual Learning to Rank: A Novel Counterfactual Estimator that Effectively Utilizes Online Interventions
WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data MiningOptimizing ranking systems based on user interactions is a well-studied problem. State-of-the-art methods for optimizing ranking systems based on user interactions are divided into online approaches - that learn by directly interacting with users - and ...
Cascade Model-based Propensity Estimation for Counterfactual Learning to Rank
SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information RetrievalUnbiased counterfactual learning to rank (CLTR) requires click propensities to compensate for the difference between user clicks and true relevance of search results via inverse propensity scoring (IPS). Current propensity estimation methods assume that ...
Comments