Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction

Authors:
Haruka Kiyohara, Cornell University, Ithaca, NY, USA (ORCID: 0009-0000-6378-4365)
Masahiro Nomura, CyberAgent, Inc., Tokyo, Japan (ORCID: 0000-0003-0375-6826)
Yuta Saito, Cornell University, Ithaca, NY, USA (ORCID: 0000-0003-4357-5835)

WWW '24: Proceedings of the ACM on Web Conference 2024, May 2024, Pages 3150–3161. https://doi.org/10.1145/3589334.3645343

Published: 13 May 2024

ABSTRACT

We study off-policy evaluation (OPE) in the problem of slate contextual bandits, where a policy selects multi-dimensional actions known as slates. This problem is widespread in recommender systems, search engines, marketing, and medical applications. However, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator has been introduced to mitigate the variance issue by assuming linearity in the reward function, but this can result in significant bias because the assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space, where we optimize slate abstractions to minimize the bias and variance of LIPS in a data-driven way. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions, such as linearity, on the structure of the reward function. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
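To make the contrast between slate-level and abstraction-level importance weights concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a context-free toy problem, a fixed hand-picked abstraction `phi` (the paper instead optimizes the abstraction from data), and weights of the form pi(phi(s)) / pi_0(phi(s)) computed from marginal abstraction probabilities. All names (`slate_weights`, `abstraction_weights`, `weighted_estimate`, `phi`, `q`) are hypothetical.

```python
import numpy as np

def slate_weights(pi_target, pi_logging):
    """Slate-level IPS weights pi(s) / pi_0(s), one entry per slate."""
    return pi_target / pi_logging

def abstraction_weights(pi_target, pi_logging, phi):
    """Weights pi(phi(s)) / pi_0(phi(s)), one entry per slate.

    phi[s] is the integer abstraction of slate s. It is fixed by hand here
    (an assumption); the paper learns it in a data-driven way.
    """
    n_z = int(phi.max()) + 1
    # Marginal probability of each abstraction under each policy.
    p_target = np.bincount(phi, weights=pi_target, minlength=n_z)
    p_logging = np.bincount(phi, weights=pi_logging, minlength=n_z)
    return (p_target / p_logging)[phi]

def weighted_estimate(weights, slates, rewards):
    """Importance-weighted average of the logged rewards."""
    return float(np.mean(weights[slates] * rewards))

rng = np.random.default_rng(0)
n_slates = 4 ** 3                        # 3 slots, 4 candidate items per slot
phi = np.arange(n_slates) // 16          # toy abstraction: item in the first slot
pi_logging = np.full(n_slates, 1.0 / n_slates)
pi_target = rng.dirichlet(np.full(n_slates, 0.1))
# Toy assumption: the expected reward depends on the slate only through phi(s).
q = np.array([0.1, 0.9, 0.3, 0.6])[phi]
true_value = float(pi_target @ q)

w_slate = slate_weights(pi_target, pi_logging)
w_latent = abstraction_weights(pi_target, pi_logging, phi)

# Simulate logged data collected by the logging policy.
slates = rng.choice(n_slates, size=2000, p=pi_logging)
rewards = rng.binomial(1, q[slates]).astype(float)

print(f"true value      : {true_value:.3f}")
print(f"slate-level IPS : {weighted_estimate(w_slate, slates, rewards):.3f}")
print(f"abstraction IPS : {weighted_estimate(w_latent, slates, rewards):.3f}")
print(f"max weight      : slate={w_slate.max():.1f}, abstraction={w_latent.max():.1f}")
```

In this toy setup the abstraction-level weights are averages of the slate-level weights within each group, so they can never exceed the largest slate-level weight, which is the source of the variance reduction. The estimate stays unbiased here only because the simulated reward depends on the slate solely through `phi`; with a poorly chosen abstraction that would no longer hold, which is the bias-variance trade-off the abstract refers to.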

Supplemental Material

rfp0168.mp4: supplemental video (MP4, 6.9 MB)


Index Terms

Information systems → Information retrieval → Retrieval tasks and goals → Recommender systems

Published in
WWW '24: Proceedings of the ACM on Web Conference 2024
May 2024
4826 pages
ISBN: 9798400701719
DOI: 10.1145/3589334
General Chairs: Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)
Proceedings Chair: Roy Ka-Wei Lee (Singapore University of Technology and Design)
Program Chairs: Ravi Kumar (Google), Hady W. Lauw (Singapore Management University)
Copyright © 2024 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
inverse propensity score
off-policy evaluation
slate bandits
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%