ABSTRACT
Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms carry a high risk of hurting the user experience due to their aggressive exploration. Thus, there has been a rising demand for safe exploration in recent years. One approach to safe exploration is to gradually enhance the quality of an original ranking whose quality is already guaranteed to be acceptable. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We optimistically select an unranked item to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items, including the selected one. Through experiments, we demonstrate that the proposed algorithm reduces long-term regret relative to baselines without any safety violation.
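As context for the KL-UCB criterion mentioned above, the following is a minimal sketch of how a KL-UCB index can be computed for Bernoulli click feedback: the index of an item is the largest mean `q` whose KL divergence from the empirical click rate is still consistent with a `log t` confidence budget, found here by bisection. The function names, the bisection tolerance, and the `c = 0` exploration constant are illustrative assumptions, not details taken from the paper.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= log(t) + c*log(log(t))."""
    if pulls == 0:
        return 1.0  # unexplored items get the most optimistic index
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:  # bisection: bernoulli_kl(mean, .) is increasing on [mean, 1]
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

An OLTR algorithm of the kind described in the abstract would compute such an index for every unranked item and propose the one with the highest index for a safe exchange; the index shrinks toward the empirical mean as an item accumulates observations.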