ABSTRACT
Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms carry a high risk of hurting the user experience due to their aggressive exploration. Thus, there has been a rising demand for safe exploration in recent years. One approach to safe exploration is to gradually enhance the quality of an original ranking whose quality is already guaranteed to be acceptable. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We optimistically select an unranked item to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items, including the selected one. Through experiments, we demonstrate that the proposed algorithm reduces long-term regret relative to baselines without any safety violation.
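As context for the KL-UCB criterion mentioned above, the following is a minimal sketch of how a KL-UCB index can be computed for Bernoulli click feedback: the index of an item is the largest mean `q` whose KL divergence from the empirical click rate is still consistent with a `log t` confidence budget, found here by bisection. The function names, the bisection tolerance, and the `c = 0` exploration constant are illustrative assumptions, not details taken from the paper.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= log(t) + c*log(log(t))."""
    if pulls == 0:
        return 1.0  # unexplored items get the most optimistic index
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:  # bisection: bernoulli_kl(mean, .) is increasing on [mean, 1]
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

An OLTR algorithm of the kind described in the abstract would compute such an index for every unranked item and propose the one with the highest index for a safe exchange; the index shrinks toward the empirical mean as an item accumulates observations.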