DOI: 10.1145/3539618.3591985
Short Paper
Open Access

Exploration of Unranked Items in Safe Online Learning to Re-Rank

Published: 18 July 2023

ABSTRACT

Bandit algorithms for online learning to rank (OLTR) often aim to maximize long-term revenue by exploiting user feedback. In practice, however, such algorithms risk hurting the user experience through aggressive exploration, so the demand for safe exploration has been rising in recent years. One approach to safe exploration is to gradually improve an original ranking whose quality is already guaranteed to be acceptable. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We optimistically select an unranked item to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items, including the selected one. Experiments demonstrate that the proposed algorithm improves long-term regret over baselines without any safety violation.
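To make the selection step concrete, here is a minimal sketch of a KL-UCB index for Bernoulli click feedback, the standard setting for this bound. It is an illustration under assumed names (kl_bernoulli, kl_ucb_index, select_unranked) and an assumed exploration constant c, not the paper's implementation.

    import math

    def kl_bernoulli(p, q, eps=1e-12):
        # KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from {0, 1}.
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def kl_ucb_index(clicks, pulls, t, c=3.0, tol=1e-6):
        # Largest q with pulls * KL(p_hat, q) <= log(t) + c * log(log(t)), for round t >= 1.
        if pulls == 0:
            return 1.0  # never-shown items get the maximal optimistic value
        p_hat = clicks / pulls
        budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
        lo, hi = p_hat, 1.0
        while hi - lo > tol:  # bisection; KL(p_hat, q) is increasing in q on [p_hat, 1]
            mid = (lo + hi) / 2
            if kl_bernoulli(p_hat, mid) <= budget:
                lo = mid
            else:
                hi = mid
        return lo

    def select_unranked(stats, t):
        # stats maps item_id -> (clicks, pulls) for items currently outside the ranking;
        # the most optimistic unranked item is the candidate to swap into the ranking.
        return max(stats, key=lambda i: kl_ucb_index(*stats[i], t))

In the algorithm the abstract describes, such an optimistic index would pick which unranked item to try; how the resulting exchange is kept safe is the paper's contribution and is not reproduced here.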


Published in
SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

Copyright © 2023 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 792 of 3,983 submissions, 20%
