OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning

Authors

  • Fan Wu, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, UCAS, Beijing, China
  • Rui Zhang, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
  • Qi Yi, University of Science and Technology of China, USTC, Hefei, China
  • Yunkai Gao, University of Science and Technology of China, USTC, Hefei, China
  • Jiaming Guo, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
  • Shaohui Peng, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
  • Siming Lan, University of Science and Technology of China, USTC, Hefei, China
  • Husheng Han, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, UCAS, Beijing, China
  • Yansong Pan, University of Chinese Academy of Sciences, UCAS, Beijing, China
  • Kaizhao Yuan, University of Chinese Academy of Sciences, UCAS, Beijing, China
  • Pengwei Jin, University of Chinese Academy of Sciences, UCAS, Beijing, China; SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
  • Ruizhi Chen, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
  • Yunji Chen, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, UCAS, Beijing, China
  • Ling Li, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, UCAS, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v38i14.29520

Keywords:

ML: Reinforcement Learning

Abstract

Model-based offline reinforcement learning (RL) algorithms have emerged as a promising paradigm for offline RL. These algorithms typically learn a dynamics model from a static dataset of transitions, use the model to generate synthetic trajectories, and perform conservative policy optimization within these trajectories. However, our observations indicate that the policy optimization methods used in these model-based offline RL algorithms do not explore the learned model effectively and induce biased exploration, which ultimately impairs algorithm performance. To address this issue, we propose Offline Conservative ExplorAtioN (OCEAN), a novel rollout approach for model-based offline RL. Our method incorporates additional exploration techniques and introduces three conservative constraints based on uncertainty estimation to mitigate the potential impact of large dynamics errors arising from exploratory transitions. OCEAN is a plug-in method and can be combined with classical model-based offline RL algorithms, such as MOPO, COMBO, and RAMBO. Experimental results on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms.
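To make the general recipe the abstract describes concrete (an ensemble dynamics model, uncertainty-penalized synthetic rollouts, and conservative constraints that curb exploratory transitions with large model error), the following is a minimal illustrative sketch. It is not OCEAN or the authors' implementation: the class and function names (EnsembleDynamics, rollout), the toy linear models, and the specific disagreement-based penalty and truncation rule are all assumptions chosen for brevity.

```python
# Minimal sketch (assumptions, not the paper's code): synthetic rollouts from an
# ensemble dynamics model, with a reward penalty and early truncation driven by
# ensemble disagreement as a stand-in uncertainty estimate.
import numpy as np

class EnsembleDynamics:
    """Toy stand-in for a learned ensemble of dynamics models."""
    def __init__(self, n_members, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Each member m predicts s' = A[m] @ [s, a] and reward = w[m] @ [s, a].
        self.A = rng.normal(scale=0.1, size=(n_members, state_dim, state_dim + action_dim))
        self.w = rng.normal(scale=0.1, size=(n_members, state_dim + action_dim))

    def predict(self, s, a):
        x = np.concatenate([s, a])
        next_states = np.einsum("mij,j->mi", self.A, x)  # one prediction per member
        rewards = self.w @ x                             # one reward per member
        return next_states, rewards

def rollout(model, policy, s0, horizon, penalty_coef=1.0, max_uncertainty=0.5):
    """Generate a synthetic trajectory; penalize the reward by ensemble
    disagreement and stop early when the model is too uncertain."""
    s, transitions = s0, []
    for _ in range(horizon):
        a = policy(s)
        next_states, rewards = model.predict(s, a)
        # Uncertainty estimate: disagreement among ensemble members.
        uncertainty = np.linalg.norm(next_states.std(axis=0))
        if uncertainty > max_uncertainty:
            break  # conservative constraint: drop overly uncertain transitions
        r = rewards.mean() - penalty_coef * uncertainty  # penalized (conservative) reward
        s_next = next_states.mean(axis=0)
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

# Usage: roll out a stand-in exploratory policy from a dataset state.
model = EnsembleDynamics(n_members=5, state_dim=3, action_dim=2)
policy = lambda s: np.random.default_rng(1).normal(size=2)
batch = rollout(model, policy, s0=np.zeros(3), horizon=10)
print(f"kept {len(batch)} penalized synthetic transitions")
```

In this sketch the penalty and the truncation threshold play the role of the conservative constraints: exploratory actions are allowed, but transitions where the learned model disagrees with itself are down-weighted or discarded before they reach policy optimization.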

Published

2024-03-24

How to Cite

Wu, F., Zhang, R., Yi, Q., Gao, Y., Guo, J., Peng, S., Lan, S., Han, H., Pan, Y., Yuan, K., Jin, P., Chen, R., Chen, Y., & Li, L. (2024). OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 15897-15905. https://doi.org/10.1609/aaai.v38i14.29520

Issue

Vol. 38 No. 14 (2024)

Section

AAAI Technical Track on Machine Learning V