Abstract
Reinforcement learning with an appropriately designed reward signal can solve many sequential decision problems. In practice, however, reinforcement learning algorithms can break in unexpected, counterintuitive ways. One such failure mode is reward hacking, which occurs when the reward function allows the agent to obtain a high return in an unintended way. This behavior may subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to mitigate reward hacking. Unlike traditional algorithms, the proposed method uses a new return function, which alters the discount applied to future rewards and no longer treats the immediate reward as the dominant factor when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method can alleviate the negative impact of reward hacking and substantially improve the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method can also be applied successfully to problems with continuous state spaces.
References
Amin K, Jiang N, Singh S (2017) Repeated inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1815–1824
Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. arXiv:1606.06565
An Y, Ding S, Shi S, Li J (2018) Discrete space reinforcement learning algorithm based on support vector machine classification. Pattern Recogn Lett 111:30–35
Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA (2017) Deep reinforcement learning: a brief survey. IEEE Signal Proc Mag 34(6):26–38
Aslund H, Mhamdi EME, Guerraoui R, Maurer A (2018) Virtuously safe reinforcement learning. arXiv:1805.11447
Bragg J, Habli I (2018) What is acceptably safe for reinforcement learning. In: International workshop on artificial intelligence safety engineering
De Asis K, Hernandez-Garcia JF, Holland GZ, Sutton RS (2017) Multi-step reinforcement learning: a unifying algorithm. arXiv:1703.01327
Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput 12(1):219–245
Everitt T, Krakovna V, Orseau L, Hutter M, Legg S (2017) Reinforcement learning with a corrupted reward channel. In: International joint conferences on artificial intelligence (IJCAI), pp 4705–4713
Fernandez-Gauna B, Osa JL, Graña M (2017) Experiments of conditioned reinforcement learning in continuous space control tasks. Neurocomputing 271:38–47
Garcia J, Femandez F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16:1437–1480
Hadfield-Menell D, Milli S, Abbeel P, Russell SJ, Dragan A (2017) Inverse reward design. In: Advances in neural information processing systems (NIPS), pp 6765–6774
Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2017) Rainbow: combining improvements in deep reinforcement learning. arXiv:1710.02298
Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, Van Hasselt H, Silver D (2018) Distributed prioritized experience replay. arXiv:1803.00933
Jaakkola T, Jordan MI, Singh SP (1993) Convergence of stochastic iterative dynamic programming algorithms. Neural Comput 6(6):1185–1201
Orseau L, Armstrong S (2016) Safely interruptible agents. In: Conference on uncertainty in artificial intelligence (UAI)
Leike J, Martic M, Krakovna V, Ortega PA, Everitt T, Lefrancq A, Orseau L, Legg S (2017) AI safety gridworlds. arXiv:1711.09883
Ludvig EA, Sutton RS, Kehoe EJ (2012) Evaluating the TD model of classical conditioning. Learning & Behavior 40(3):305–319
Marco D (2009) Markov random processes are neither bandlimited nor recoverable from samples or after quantization. IEEE Trans Inf Theory 55(2):900–905
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing atari with deep reinforcement learning. In: Annual Conference on Neural Information Processing Systems (NIPS)
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland A, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning (ICML), pp 1928–1937
Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480
Murphy SA (2005) A generalization error for Q-learning. J Mach Learn Res 6:1073–1097
Pakizeh E, Pedram MM, Palhang M (2015) Multi-criteria expertness based cooperative method for SARSA and eligibility trace algorithms. Appl Intell 43:487–498
Pathak S, Pulina L, Tacchella A (2017) Verification and repair of control policies for safe reinforcement learning. Appl Intell 1:886–908
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190
Freedman RG, Zilberstein S (2016) Safety in AI-HRI: challenges complementing user experience quality. In: AAAI conference on artificial intelligence (AAAI)
Seijen HV, Mahmood AR, Pilarski PM, Machado MC, Sutton RS (2016) True online temporal-difference learning. J Mach Learn Res 17(1):5057–5096
Singh S, Jaakkola T, Littman ML, Szepesvari C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38(3):287–308
Suri RE (2002) TD models of reward predictive responses in dopamine neurons. Neural Netw 15(4-6):523–533
Sutton RS, Barto AG (2017) Reinforcement learning: an introduction, 2nd edn (in preparation). MIT Press
Sutton RS (2016) Tile coding software – reference manual, version 3 beta. http://incompleteideas.net/tiles/tiles3.html
Van Seijen H, Van Hasselt H, Whiteson S, Wiering M (2009) A theoretical and empirical analysis of expected SARSA. In: Proceedings of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp 177–184
Xu X, Zuo L, Huang Z (2014) Reinforcement learning algorithms with function approximation: recent advances and applications. Inf Sci 261:1–31
Zhao X, Ding S, An Y (2018) A new asynchronous architecture for tabular reinforcement learning algorithms. In: Proceedings of the 8th international conference on extreme learning machines, pp 172–180
Zhao X, Ding S, An Y, Jia W (2018) Asynchronous reinforcement learning algorithms for solving discrete space path planning problems. Appl Intell 48(12):4889–4904
Acknowledgment
The authors would like to thank the editors and reviewers for the time and effort spent in handling this work, and for the detailed and constructive comments provided to further improve its presentation and quality. This work was supported in part by the National Natural Science Foundation of China under Grants 61836003, 61573150, 61573152, and 61633010.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Convergence proof for expected n-step SARSA
In this appendix, we prove that expected n-step SARSA converges to the optimal policy under some straightforward conditions. The proof largely follows the convergence proofs for SARSA [30] and Q-learning [15]. It relies on the following lemma, which was also used to prove the convergence of SARSA and Q-learning.
Lemma 1
Consider a stochastic process \((\alpha_{t}, {\Delta}_{t}, F_{t})\), where \(\alpha_{t}, {\Delta}_{t}, F_{t}: X \to \mathbb{R}\) satisfy the equation
$${\Delta}_{t+1}(x_{t}) = (1-\alpha_{t}(x_{t})){\Delta}_{t}(x_{t}) + \alpha_{t}(x_{t})F_{t}(x_{t}),$$
where \(x_{t} \in X\) and \(t = 0, 1, 2, \dots\). Then \({\Delta}_{t}(x_{t})\) converges to 0 with probability 1 if the following properties hold:
1) The set X is finite.

2) \(\alpha_{t}(x_{t}) \in [0, 1]\), \({\sum}_{t} \alpha_{t}(x_{t}) = \infty\), \({\sum}_{t} (\alpha_{t}(x_{t}))^{2} < \infty\) with probability 1, and \(\forall x \neq x_{t}: \alpha_{t}(x) = 0\).

3) \(||\mathbb{E}[F_{t}|P_{t}]|| \le \kappa ||{\Delta}_{t}|| + c_{t}\), where \(\kappa \in [0, 1)\) and \(c_{t}\) converges to 0 with probability 1.

4) \(\text{Var}[F_{t}(x_{t})|P_{t}] \le K(1 + \kappa ||{\Delta}_{t}||)^{2}\), where K is some constant.

Here, \(||\cdot||\) denotes the maximum norm, \(P_{t}\) is a sequence of increasing σ-fields that includes the past of the process, and \(\alpha_{t}, {\Delta}_{t}, F_{t-1} \in P_{t}\).
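The behavior guaranteed by Lemma 1 can be illustrated numerically. The sketch below is a toy construction, not the algorithm of the paper: we hand-build an \(F_{t}\) whose conditional mean is a κ-contraction of \({\Delta}_{t}\) plus a vanishing bias \(c_{t}\) (condition 3) and whose noise has bounded variance (condition 4), pick learning rates \(\alpha_{t} = 1/t\) satisfying condition 2, and check that the max-norm of \({\Delta}_{t}\) shrinks toward 0.

```python
import numpy as np

# Toy numerical check of Lemma 1; all constants here are illustrative
# assumptions, not values from the paper.
rng = np.random.default_rng(0)
n_states, kappa = 5, 0.4
delta = rng.uniform(-1.0, 1.0, n_states)         # Delta_0, arbitrary start

for t in range(1, 100_001):
    alpha = 1.0 / t                  # sum alpha = inf, sum alpha^2 < inf
    c_t = 1.0 / t                    # bias term converging to 0 (condition 3)
    noise = 0.1 * rng.standard_normal(n_states)  # zero-mean, bounded variance
    f = kappa * delta + c_t + noise  # ||E[F_t|P_t]|| <= kappa ||Delta_t|| + c_t
    delta = (1 - alpha) * delta + alpha * f      # the update rule of Lemma 1

print(float(np.abs(delta).max()))    # max-norm of Delta_t, close to 0
```

Because κ < 1 and the bias vanishes, the iterate contracts toward the zero fixed point while the \(1/t\) step sizes average the noise away.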
To prove the convergence of expected n-step SARSA, we need to associate this algorithm with the converging stochastic process defined in Lemma 1. We define \(X=\mathcal {S} \times \mathcal {A}\), \(P_{t}= \{\hat {Q}_{0},s_{0},a_{0},\hat {r}^{(n)}_{0},s_{1},a_{1},\hat {r}^{(n)}_{1},\dots ,s_{t},a_{t}\}\), \(x_{t} = (s_{t},a_{t})\), \(\alpha_{t}(x_{t}) = \alpha_{t}(s_{t},a_{t})\), and \({\Delta }_{t}(s_{t},a_{t}) = \hat {Q}_{t}(s_{t},a_{t})-\hat {Q}^{*}(s_{t},a_{t})\). If we can prove that \({\Delta}_{t}(s_{t},a_{t})\) converges to 0 with probability 1, then the proposed algorithm converges to the optimal policy. Therefore, we have the following theorem:
Theorem 1
Expected n-step SARSA converges to the optimal value function if the following assumptions hold:
1) The state space \(\mathcal {S}\) and action space \(\mathcal {A}\) are finite.

2) The learning rates satisfy \(\alpha_{t}(s_{t},a_{t}) \in [0, 1]\), \({\sum }_{t} \alpha _{t}(s_{t},a_{t}) = \infty\), \({\sum }_{t} (\alpha _{t}(s_{t},a_{t}))^{2} < \infty\) with probability 1, and \(\forall (s,a) \neq (s_{t},a_{t}): \alpha_{t}(s,a) = 0\).

3) The policy is greedy in the limit with infinite exploration (GLIE).

4) The reward function is bounded.

5) \(\gamma \in [0, 1)\).
Proof of Theorem 1
To prove this theorem, we note that conditions 1), 2), and 4) of Theorem 1 correspond to conditions 1), 2), and 4) of Lemma 1, respectively. Therefore, in the following we only need to show that the proposed algorithm satisfies condition 3) of Lemma 1. According to (13) and \({\Delta }_{t}(x_{t}) = \hat {Q}_{t}(s_{t},a_{t})-\hat {Q}^{*}(s_{t},a_{t})\), we have
$${\Delta}_{t+1}(s_{t},a_{t}) = (1-\alpha_{t}(s_{t},a_{t})){\Delta}_{t}(s_{t},a_{t}) + \alpha_{t}(s_{t},a_{t})F_{t}(s_{t},a_{t}),$$
where we define \(F_{t}(s_{t},a_{t})\) as
$$F_{t}(s_{t},a_{t}) = \hat{r}^{(n)}_{t} + \gamma \sum_{a} \pi(a|s_{t+n}) \hat{Q}_{t}(s_{t+n},a) - \hat{Q}^{*}(s_{t},a_{t}).$$
Therefore, we have
$$||\mathbb{E}[F_{t}|P_{t}]|| \le \gamma ||{\Delta}_{t}|| + \gamma \max_{s} \Big|\sum_{a} \pi(a|s) \hat{Q}_{t}(s,a) - \max_{a}\hat{Q}_{t}(s,a)\Big|.$$
We identify \(c_{t}=\gamma \max _{s} |\sum_{a} \pi(a|s) \hat {Q}_{t}(s,a) -\max _{a}\hat {Q}_{t}(s,a)|\) and \(\kappa = \gamma\). According to the proof for SARSA [30], \(c_{t}\) converges to 0 with probability 1 for policies that are greedy in the limit. Therefore, the proposed algorithm satisfies condition 3) of Lemma 1, and the convergence of expected n-step SARSA is proved. □
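The term \(c_{t}\) has a concrete interpretation for ε-greedy behavior policies: it measures how far the policy's expected value is from the greedy value, scaled by γ, and it vanishes as ε → 0, which is exactly the GLIE requirement. A small sketch (the Q-table and the ε values are illustrative assumptions of this example, not quantities from the paper):

```python
import numpy as np

# For an epsilon-greedy policy, compute
#   c = gamma * max_s | sum_a pi(a|s) Q(s,a) - max_a Q(s,a) |
# and observe that it shrinks to 0 as epsilon -> 0 (the GLIE limit).
rng = np.random.default_rng(1)
gamma, n_states, n_actions = 0.9, 4, 3
q = rng.uniform(0.0, 1.0, (n_states, n_actions))  # arbitrary Q-table

def c_term(q, eps, gamma):
    greedy = q.argmax(axis=1)
    pi = np.full_like(q, eps / q.shape[1])         # exploration mass
    pi[np.arange(q.shape[0]), greedy] += 1 - eps   # greedy mass
    expected = (pi * q).sum(axis=1)                # sum_a pi(a|s) Q(s,a)
    return gamma * np.abs(expected - q.max(axis=1)).max()

cs = [c_term(q, eps, gamma) for eps in (0.5, 0.1, 0.01, 0.0)]
print(cs)  # shrinks with epsilon; exactly 0.0 for the fully greedy policy
```

Per state, the gap equals ε times the difference between the greedy value and the mean value, so \(c_{t}\) is linear in ε and vanishes under any GLIE schedule.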
Appendix B: The pseudocode of expected n-step SARSA
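The algorithm can be sketched in Python as follows. This is a minimal illustration using the standard n-step return with an expected-SARSA bootstrap and an ε-greedy GLIE-style schedule; it does not reproduce the paper's modified discounting, and the chain MDP, the Gym-style reset()/step() interface, and all hyperparameters are assumptions of the sketch.

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain MDP (an assumption of this sketch): action 1 moves
    right, action 0 moves left, every step costs -1, and the episode
    terminates on reaching the rightmost state."""
    def __init__(self, n=5):
        self.n = n
    def reset(self):
        self.s = 0
        return self.s, {}
    def step(self, a):
        self.s = min(self.s + 1, self.n - 1) if a == 1 else max(self.s - 1, 0)
        terminated = self.s == self.n - 1
        return self.s, -1.0, terminated, False, {}

def epsilon_greedy_probs(q_row, eps):
    """Action probabilities of an epsilon-greedy policy for one state."""
    probs = np.full(len(q_row), eps / len(q_row))
    probs[int(np.argmax(q_row))] += 1.0 - eps
    return probs

def expected_n_step_sarsa(env, n_states, n_actions, n=3, episodes=300,
                          alpha=0.1, gamma=0.95, eps_decay=0.95,
                          eps_min=0.05, max_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for ep in range(episodes):
        eps = max(eps_min, eps_decay ** ep)   # GLIE-style decay of epsilon
        s, _ = env.reset()
        states, actions, rewards, done = [s], [], [], False
        while not done and len(rewards) < max_steps:   # roll out one episode
            a = int(rng.choice(n_actions,
                               p=epsilon_greedy_probs(q[states[-1]], eps)))
            s2, r, terminated, truncated, _ = env.step(a)
            states.append(s2); actions.append(a); rewards.append(r)
            done = terminated or truncated
        T = len(rewards)
        for t in range(T):                    # n-step backups
            g = sum(gamma ** (i - t) * rewards[i]
                    for i in range(t, min(t + n, T)))
            if t + n < T:                     # expected-SARSA bootstrap
                sb = states[t + n]
                g += gamma ** n * float(
                    np.dot(epsilon_greedy_probs(q[sb], eps), q[sb]))
            q[states[t], actions[t]] += alpha * (g - q[states[t], actions[t]])
    return q

q = expected_n_step_sarsa(ChainEnv(), n_states=5, n_actions=2)
print(q[0])  # moving right should have the higher value in the start state
```

The bootstrap averages \(\hat{Q}\) over the current policy's action probabilities rather than sampling a single next action, which is what distinguishes expected SARSA from ordinary SARSA in the n-step target.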
Yuan, Y., Yu, Z.L., Gu, Z. et al. A novel multi-step reinforcement learning method for solving reward hacking. Appl Intell 49, 2874–2888 (2019). https://doi.org/10.1007/s10489-019-01417-4