Long-Term Safe Reinforcement Learning with Binary Feedback

Authors

  • Akifumi Wachi LINE Corporation
  • Wataru Hashimoto Osaka University
  • Kazumune Hashimoto Osaka University

DOI:

https://doi.org/10.1609/aaai.v38i19.30164

Keywords:

General

Abstract

Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for every state. To address these issues, we propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize reward while guaranteeing long-term safety: with high probability, the agent executes only safe state-action pairs throughout each episode. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and, at every time step, conservatively takes only safe actions while inferring their effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL satisfies the long-term safety constraint with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.
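The abstract describes the mechanism only at a high level. The sketch below is a minimal, self-contained illustration (not the authors' implementation) of the two ingredients it names: fitting a logistic GLM to binary safety feedback, and filtering actions by a pessimistic (lower-confidence-bound) estimate of their safety probability. All names here (`feature_map`, `pessimistic_safety`, `SAFETY_LEVEL`, `beta`) are hypothetical, and the single-step filter omits LoBiSaRL's reasoning about future safety over the rest of the episode.

```python
# Illustrative sketch of GLM-based safety estimation with conservative
# action filtering. Hypothetical names and thresholds; not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def feature_map(state, action):
    """Hypothetical feature vector phi(s, a) for the GLM."""
    return np.array([1.0, state, action, state * action])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_glm(X, y, iters=50):
    """Fit logistic-regression weights by Newton's method (IRLS)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        W = p * (1.0 - p)                       # per-sample IRLS weights
        H = X.T @ (X * W[:, None]) + 1e-6 * np.eye(X.shape[1])
        w += np.linalg.solve(H, X.T @ (y - p))  # Newton step
    return w, H

def pessimistic_safety(w, H, phi, beta=1.0):
    """Lower confidence bound on P(safe | s, a): shrink the GLM logit by a
    confidence width beta * ||phi||_{H^{-1}} before squashing."""
    width = np.sqrt(phi @ np.linalg.solve(H, phi))
    return sigmoid(phi @ w - beta * width)

# Toy usage: generate binary safety labels where larger actions are less
# safe, fit the GLM, then keep only actions whose pessimistic safety
# estimate clears a (hypothetical) required level.
states = rng.uniform(-1, 1, size=200)
actions = rng.uniform(-1, 1, size=200)
X = np.array([feature_map(s, a) for s, a in zip(states, actions)])
true_w = np.array([2.0, 0.0, -3.0, 0.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)

w, H = fit_glm(X, y)
SAFETY_LEVEL = 0.5
candidates = np.linspace(-1, 1, 9)
safe = [a for a in candidates
        if pessimistic_safety(w, H, feature_map(0.3, a)) >= SAFETY_LEVEL]
print("conservatively safe actions:", np.round(safe, 2))
```

Subtracting the confidence width before applying the sigmoid is what makes the filter conservative: an action is allowed only if it is estimated safe even under the least favorable weights consistent with the data so far.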

Published

2024-03-24

How to Cite

Wachi, A., Hashimoto, W., & Hashimoto, K. (2024). Long-Term Safe Reinforcement Learning with Binary Feedback. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19), 21656-21663. https://doi.org/10.1609/aaai.v38i19.30164

Section

AAAI Technical Track on Safe, Robust and Responsible AI Track