ABSTRACT
Recommender systems suggest items to users based on the users’ preferences. Such systems often deal with massive item sets and extremely sparse user-item interactions, which makes it challenging to generate high-quality personalized recommendations. Reinforcement learning (RL) is a framework for sequential decision making that naturally formulates recommender-system tasks: recommending items as actions in different user and context states to maximize long-term user experience. We investigate two RL policy parameterizations that generalize across sparse user-item interactions by leveraging the relationships between actions: parameterizing the policy over action features as either a softmax or a Gaussian distribution. Our experiments on synthetic problems suggest that the Gaussian parameterization, which is not commonly used in recommendation tasks, is more robust to the choice of action features than the softmax parameterization. Based on these promising results, we propose a more thorough investigation of the theoretical properties and empirical benefits of the Gaussian parameterization for recommender systems.
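As a rough illustration (not taken from the paper), here is a minimal sketch of the two parameterizations, assuming a linear state-to-feature map W and fixed per-item action features: the softmax policy scores every item by the inner product of the mapped state with that item's features, while the Gaussian policy samples a point in action-feature space and recommends the nearest item. All names and dimensions below are illustrative assumptions.

```python
# Illustrative sketch only; names, dimensions, and the linear map W are
# assumptions for exposition, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

n_items, feat_dim, state_dim = 1000, 16, 32
action_feats = rng.normal(size=(n_items, feat_dim))  # x_a for each item
W = 0.01 * rng.normal(size=(state_dim, feat_dim))    # policy parameters

def softmax_policy(state):
    """pi(a|s) proportional to exp(f(s)^T x_a): score every item."""
    scores = action_feats @ (state @ W)               # one logit per item
    probs = np.exp(scores - scores.max())             # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(n_items, p=probs))

def gaussian_policy(state, sigma=0.5):
    """Sample a point in action-feature space from N(f(s), sigma^2 I),
    then recommend the item whose features are nearest to the sample."""
    mu = state @ W                                    # mean in feature space
    sample = mu + sigma * rng.normal(size=feat_dim)
    dists = np.linalg.norm(action_feats - sample, axis=1)
    return int(dists.argmin())

state = rng.normal(size=state_dim)
print(softmax_policy(state), gaussian_policy(state))
```

The sketch also highlights a practical difference: the softmax policy must compute a logit for every item, whereas the Gaussian policy only needs a nearest-neighbor lookup in action-feature space after drawing a sample.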
Recommendations
Learning discriminative recommendation systems with side information
IJCAI'17: Proceedings of the 26th International Joint Conference on Artificial Intelligence
Top-N recommendation systems are useful in many real world applications such as E-commerce platforms. Most previous methods produce top-N recommendations based on the observed user purchase or recommendation activities. Recently, it has been noticed ...
Investigating serendipity in recommender systems based on real user feedback
SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing
Over the past several years, research in recommender systems has emphasized the importance of serendipity, but there is still no consensus on the definition of this concept and whether serendipitous items should be recommended is still not a well-...
New Recommendation Techniques for Multicriteria Rating Systems
Traditional single-rating recommender systems have been successful in a number of personalization applications, but the research area of multicriteria recommender systems has been largely untouched. Taking full advantage of multicriteria ratings in ...