Offline Reinforcement Learning with On-Policy Q-Function Regularization

Shi, Laixi; Dadashi, Robert; Chi, Yuejie; Castro, Pablo Samuel; Geist, Matthieu

doi:10.1007/978-3-031-43421-1_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14172)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
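As a minimal sketch of what a SARSA-style estimate of the behavior policy's Q-function can look like, the snippet below runs tabular temporal-difference updates over logged transitions, bootstrapping only from the next action that actually appears in the data. The tabular setting, the function name `sarsa_q_estimate`, the tuple layout, and the hyperparameters are all assumptions made for illustration; this is not the paper's implementation, and a practical version would likely use function approximation.

```python
from collections import defaultdict


def sarsa_q_estimate(transitions, gamma=0.99, lr=0.1, epochs=200):
    """Tabular SARSA-style evaluation of the behavior policy from logged data.

    transitions: list of (s, a, r, s_next, a_next, done) tuples, where a_next
    is the action the behavior policy actually took in s_next (hypothetical
    layout chosen for this sketch).
    """
    q = defaultdict(float)
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in transitions:
            # Bootstrap from the logged next action only; no max over actions,
            # so no out-of-dataset action is ever queried.
            target = r if done else r + gamma * q[(s_next, a_next)]
            q[(s, a)] += lr * (target - q[(s, a)])
    return dict(q)


# Toy trajectory: (s0, a0) gives reward 0, then (s1, a1) gives reward 1 and ends.
data = [
    ("s0", "a0", 0.0, "s1", "a1", False),
    ("s1", "a1", 1.0, None, None, True),
]
q_b = sarsa_q_estimate(data, gamma=0.9)
```

Because every target is built from a state-action pair present in the dataset, the estimate never evaluates the Q-function at unseen actions, which is the intuition behind the claim that it handles extrapolation error more directly.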

Notes

1. Note that \(Q^{\pi _{\textsf{b}}}\) is unknown, so we use the reward-to-go function [19] starting from any state-action pair \((s,a) \in \mathcal {D}\) as \(Q^{\pi _{\textsf{b}}}(s,a)\), i.e., \(Q^{\pi _{\textsf{b}}}(s,a) :=\sum _{t'=t}^T \gamma ^{t'-t} r(s_{t'}, a_{t'})\) with \((s_t,a_t) = (s,a)\). The estimate can be computed from the trajectories in the entire dataset \(\mathcal {D}\) with a simple Monte Carlo return estimate, in the same way as the value-function estimate used by [28].
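The reward-to-go sum in this note can be computed for every step of a trajectory with a single backward pass. The following is a minimal sketch, assuming each trajectory is available as a plain list of rewards; the function name `reward_to_go` and the list representation are illustrative choices, not taken from the paper.

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go for each step t of one trajectory.

    returns[t] = sum over t' >= t of gamma**(t' - t) * rewards[t'].
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(reward_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Applying this to every trajectory in the dataset and pairing each returned value with the corresponding \((s_t, a_t)\) gives Monte Carlo targets for \(Q^{\pi _{\textsf{b}}}\).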

References

1. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866 (2017)
2. Bradbury, J., et al.: Jax: Autograd and xla. Astrophysics Source Code Library pp. ascl-2111 (2021)
3. Brandfonbrener, D., Whitney, W., Ranganath, R., Bruna, J.: Offline RL without off-policy evaluation. Adv. Neural Inf. Process. Syst. 34, 4933–4946 (2021)
4. Buckman, J., Gelada, C., Bellemare, M.G.: The importance of pessimism in fixed-dataset policy optimization. In: International Conference on Learning Representations (2020)
5. Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst. 34, 15084–15097 (2021)
6. Chen, X., Zhou, Z., Wang, Z., Wang, C., Wu, Y., Ross, K.: Bail: best-action imitation learning for batch deep reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 18353–18363 (2020)
7. Dadashi, R., Rezaeifar, S., Vieillard, N., Hussenot, L., Pietquin, O., Geist, M.: Offline reinforcement learning with pseudometric learning. In: International Conference on Machine Learning, pp. 2307–2318. PMLR (2021)
8. Fakoor, R., Mueller, J.W., Asadi, K., Chaudhari, P., Smola, A.J.: Continuous doubly constrained batch reinforcement learning. Adv. Neural Inf. Process. Syst. 34, 11260–11273 (2021)
9. Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S.: D4rl: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020)
10. Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. Adv. Neural Inf. Process. Syst. 34, 20132–20145 (2021)
11. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR (2018)
12. Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning, pp. 2052–2062. PMLR (2019)
13. Garg, D., Hejna, J., Geist, M., Ermon, S.: Extreme q-learning: maxent RL without entropy. arXiv preprint arXiv:2301.02328 (2023)
14. Ghasemipour, S.K.S., Schuurmans, D., Gu, S.S.: EMaQ: expected-max Q-learning operator for simple yet effective offline and online RL. In: International Conference on Machine Learning, pp. 3682–3691. PMLR (2021)
15. Gulcehre, C., et al.: Regularized behavior value estimation. arXiv preprint arXiv:2103.09575 (2021)
16. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR (2018)
17. Harris, C.R., et al.: Array programming with numpy. Nature 585(7825), 357–362 (2020)
18. Hoffman, M., et al.: Acme: a research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979 (2020)
19. Janner, M., Li, Q., Levine, S.: Offline reinforcement learning as one big sequence modeling problem. Adv. Neural Inf. Process. Syst. 34, 1273–1286 (2021)
20. Kostrikov, I., Fergus, R., Tompson, J., Nachum, O.: Offline reinforcement learning with fisher divergence critic regularization. In: International Conference on Machine Learning, pp. 5774–5783. PMLR (2021)
21. Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169 (2021)
22. Kumar, A., Fu, J., Soh, M., Tucker, G., Levine, S.: Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
23. Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 1179–1191 (2020)
24. Lee, B.J., Lee, J., Kim, K.E.: Representation balancing offline model-based reinforcement learning. In: International Conference on Learning Representations (2020)
25. Levine, S.: Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909 (2018)
26. Levine, S., Kumar, A., Tucker, G., Fu, J.: Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)
27. Lyu, J., Ma, X., Li, X., Lu, Z.: Mildly conservative Q-learning for offline reinforcement learning. Adv. Neural Inf. Process. Syst. 35, 1711–1724 (2022)
28. Peng, X.B., Kumar, A., Zhang, G., Levine, S.: Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177 (2019)
29. Rezaeifar, S., et al.: Offline reinforcement learning as anti-exploration. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8106–8114 (2022)
30. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
31. Vinyals, O., et al.: Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
32. Wang, Z., Hunt, J.J., Zhou, M.: Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193 (2022)
33. Wang, Z., et al.: Critic regularized regression. Adv. Neural Inf. Process. Syst. 33, 7768–7778 (2020)
34. Wu, Y., Tucker, G., Nachum, O.: Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361 (2019)
35. Yang, S., Wang, Z., Zheng, H., Feng, Y., Zhou, M.: A regularized implicit policy for offline reinforcement learning. arXiv preprint arXiv:2202.09673 (2022)
36. Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., Finn, C.: Combo: conservative offline model-based policy optimization. Adv. Neural Inf. Process. Syst. 34, 28954–28967 (2021)
37. Zhang, G., Kashima, H.: Behavior estimation from multi-source data for offline reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 11201–11209 (2023)

Acknowledgment

Part of this work was completed when L. Shi was an intern at Google Research, Brain Team. The work of L. Shi and Y. Chi is supported in part by the grants NSF CCF-2106778 and CNS-2148212. L. Shi is also gratefully supported by the Leo Finzi Memorial Fellowship, Wei Shen and Xuehong Zhang Presidential Fellowship, and Liang Ji-Dian Graduate Fellowship at Carnegie Mellon University. The authors would like to thank Alexis Jacq for reviewing an early version of the paper. The authors would like to thank the anonymous reviewers for valuable feedback and suggestions. We would also like to thank the Python and RL community for useful tools that are widely used in this work, including Acme [18], Numpy [17], and JAX [2].

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Laixi Shi & Yuejie Chi
Google Research, Brain Team, Pittsburgh, USA
Robert Dadashi, Pablo Samuel Castro & Matthieu Geist

Corresponding author

Correspondence to Laixi Shi.

Editor information

Editors and Affiliations

University of Michigan, Ann Arbor, MI, USA
Danai Koutra
University of Vienna, Vienna, Austria
Claudia Plant
Max Planck Institute for Software Systems, Kaiserslautern, Germany
Manuel Gomez Rodriguez
Politecnico di Torino, Turin, Italy
Elena Baralis
CENTAI, Turin, Italy
Francesco Bonchi

Ethics declarations

Ethical Statement

Offline RL methods may benefit social applications where collecting new data is infeasible due to cost, privacy, or safety, for example, learning to diagnose from historical medical records or designing recommendations from existing click records for advertisements. As for negative social impact, offline methods may enable discriminatory pricing based on big data, yielding unfair markets, or improve recommendation techniques in ways that make more people addicted to social media. However, our proposed methods are more concerned with scientific ideas and investigation and do not target such applications. Additionally, this work uses only public benchmarks and data, so no personal data is acquired or inferred.

About this paper

Cite this paper

Shi, L., Dadashi, R., Chi, Y., Castro, P.S., Geist, M. (2023). Offline Reinforcement Learning with On-Policy Q-Function Regularization. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_27

DOI: https://doi.org/10.1007/978-3-031-43421-1_27
Published: 18 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)
