Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

Xu, Linjie; Jiang, Zhengyao; Wang, Jinyu; Song, Lei; Bian, Jiang

Abstract: Offline reinforcement learning (RL) methods constrain the policy to stay close to the behavior policy, which stabilizes value learning and reduces the selection of out-of-distribution (OOD) actions at test time. Conventional approaches apply the same constraint to both value learning and test-time inference. However, our findings indicate that a constraint suitable for value estimation may be excessively restrictive for action selection at test time. To address this issue, we propose a Mildly Constrained Evaluation Policy (MCEP) for test-time inference, used alongside a more constrained target policy for value estimation. Since a target policy is already used in various prior approaches, MCEP can be integrated with them as a plug-in. We instantiate MCEP on top of the TD3-BC [Fujimoto and Gu, 2021] and AWAC [Nair et al., 2020] algorithms. Empirical results on MuJoCo locomotion tasks show that MCEP significantly outperforms the target policy and achieves results competitive with state-of-the-art offline RL methods. The code is open-sourced at this https URL.
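To make the two-policy idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a TD3-BC-style actor objective, in which a normalized Q term is traded off against a behavior-cloning (BC) penalty through a coefficient `alpha`, and it assumes that the target policy and the evaluation policy differ only in that coefficient. The function names and the values `ALPHA_TARGET` and `ALPHA_EVAL` are hypothetical and chosen for illustration only.

```python
import numpy as np


def q_weight(q_values, alpha):
    """Normalized weight on the Q term: alpha / mean(|Q|), as in TD3-BC-style losses."""
    q_values = np.asarray(q_values, dtype=float)
    return alpha / (np.mean(np.abs(q_values)) + 1e-8)


def td3bc_style_actor_loss(q_values, policy_actions, dataset_actions, alpha):
    """Actor loss = -lambda * mean(Q) + mean squared BC error.

    A smaller alpha makes the BC term dominate (stronger constraint);
    a larger alpha makes the Q term dominate (milder constraint).
    """
    q_values = np.asarray(q_values, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    dataset_actions = np.asarray(dataset_actions, dtype=float)
    lam = q_weight(q_values, alpha)
    bc_penalty = np.mean((policy_actions - dataset_actions) ** 2)
    return -lam * np.mean(q_values) + bc_penalty


# Hypothetical coefficients: the target policy is constrained more strongly
# than the evaluation policy. These are illustrative values, not the paper's.
ALPHA_TARGET = 2.0
ALPHA_EVAL = 8.0

# Toy batch: critic values for the policy's actions, plus policy and dataset actions.
q_batch = [1.0, 3.0]
pi_actions = [[0.5], [0.5]]
data_actions = [[0.0], [0.0]]

# Only the target policy would be used inside the critic's bootstrap target;
# the evaluation policy is trained against the same critic and used at test time.
target_loss = td3bc_style_actor_loss(q_batch, pi_actions, data_actions, ALPHA_TARGET)
eval_loss = td3bc_style_actor_loss(q_batch, pi_actions, data_actions, ALPHA_EVAL)
```

In this sketch both losses share one critic; the only difference is how strongly each policy is pulled toward the dataset actions, which is the separation between value estimation and test-time action selection that the abstract describes.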

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2306.03680 [cs.LG]
	(or arXiv:2306.03680v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.03680
