A novel multi-step reinforcement learning method for solving reward hacking

Abstract

Reinforcement learning with an appropriately designed reward signal can be used to solve many sequential decision problems. In practice, however, reinforcement learning algorithms can break in unexpected, counterintuitive ways. One of these failure modes is reward hacking, which typically occurs when the reward function allows the agent to obtain a high return in an unintended way. Such behaviour may subvert the designer’s intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to address reward hacking. Unlike traditional algorithms, the proposed method uses a new return function that alters the discounting of future rewards, so that the immediate reward is no longer the dominant influence when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method alleviates the negative impact of reward hacking and greatly improves the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method can also be applied successfully to problems with continuous state spaces.

References

  1. Amin K, Jiang N, Singh S (2017) Repeated inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1815–1824

  2. Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. arXiv:1606.06565

  3. An Y, Ding S, Shi S, Li J (2018) Discrete space reinforcement learning algorithm based on support vector machine classification. Pattern Recogn Lett 111:30–35

  4. Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA (2017) Deep reinforcement learning: a brief survey. IEEE Signal Proc Mag 34(6):26–38

  5. Aslund H, Mhamdi EME, Guerraoui R, Maurer A (2018) Virtuously safe reinforcement learning. arXiv:1805.11447

  6. Bragg J, Habli I (2018) What is acceptably safe for reinforcement learning. In: International workshop on artificial intelligence safety engineering

  7. De Asis K, Hernandez-Garcia JF, Holland GZ, Sutton RS (2017) Multi-step reinforcement learning: a unifying algorithm. arXiv:1703.01327

  8. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput 12(1):219–245

  9. Everitt T, Krakovna V, Orseau L, Hutter M, Legg S (2017) Reinforcement learning with a corrupted reward channel. In: International joint conferences on artificial intelligence (IJCAI), pp 4705–4713

  10. Fernandez-Gauna B, Osa JL, Graña M (2017) Experiments of conditioned reinforcement learning in continuous space control tasks. Neurocomputing 271:38–47

  11. García J, Fernández F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16:1437–1480

  12. Hadfield-Menell D, Milli S, Abbeel P, Russell SJ, Dragan A (2017) Inverse reward design. In: Advances in neural information processing systems (NIPS), pp 6765–6774

  13. Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2017) Rainbow: combining improvements in deep reinforcement learning. arXiv:1710.02298

  14. Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, Van Hasselt H, Silver D (2018) Distributed prioritized experience replay. arXiv:1803.00933

  15. Jaakkola T, Jordan MI, Singh SP (1993) Convergence of stochastic iterative dynamic programming algorithms. Neural Comput 6(6):1185–1201

  16. Orseau L, Armstrong S (2016) Safely interruptible agents. In: Conference on Uncertainty in Artificial Intelligence (UAI)

  17. Leike J, Martic M, Krakovna V, Ortega PA, Everitt T, Lefrancq A, Orseau L, Legg S (2017) AI safety gridworlds. arXiv:1711.09883

  18. Ludvig EA, Sutton RS, Kehoe EJ (2012) Evaluating the TD model of classical conditioning. Learning & Behavior 40(3):305–319

  19. Marco D (2009) Markov random processes are neither bandlimited nor recoverable from samples or after quantization. IEEE Trans Inf Theory 55(2):900–905

  20. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. In: Annual Conference on Neural Information Processing Systems (NIPS)

  21. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland A, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  22. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning (ICML), pp 1928–1937

  23. Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480

  24. Murphy SA (2005) A generalization error for Q-learning. J Mach Learn Res 6:1073–1097

  25. Pakizeh E, Pedram MM, Palhang M (2015) Multi-criteria expertness based cooperative method for SARSA and eligibility trace algorithms. Appl Intell 43:487–498

  26. Pathak S, Pulina L, Tacchella A (2017) Verification and repair of control policies for safe reinforcement learning. Appl Intell 1:886–908

  27. Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190

  28. Freedman RG, Zilberstein S (2016) Safety in AI-HRI: challenges complementing user experience quality. In: AAAI Fall Symposium Series

  29. Seijen HV, Mahmood AR, Pilarski PM, Machado MC, Sutton RS (2016) True online temporal-difference learning. J Mach Learn Res 17(1):5057–5096

  30. Singh S, Jaakkola T, Littman ML, Szepesvari C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38(3):287–308

  31. Suri RE (2002) TD models of reward predictive responses in dopamine neurons. Neural Netw 15(4-6):523–533

  32. Sutton R, Barto A (2017) Reinforcement learning: an introduction (2nd edn, in preparation). MIT Press

  33. Sutton RS (2016) Tile coding software – reference manual, version 3 beta. http://incompleteideas.net/tiles/tiles3.html

  34. Van Seijen H, Van Hasselt H, Whiteson S, Wiering M (2009) A theoretical and empirical analysis of expected SARSA. In: Proceedings of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp 177–184

  35. Xu X, Zuo L, Huang Z (2014) Reinforcement learning algorithms with function approximation: recent advances and applications. Inf Sci 261:1–31

  36. Zhao X, Ding S, An Y (2018) A new asynchronous architecture for tabular reinforcement learning algorithms. In: Proceedings of the 8th international conference on extreme learning machines, pp 172–180

  37. Zhao X, Ding S, An Y, Jia W (2018) Asynchronous reinforcement learning algorithms for solving discrete space path planning problems. Appl Intell 48(12):4889–4904

Acknowledgment

The authors would like to thank the editors and reviewers for the time and effort spent in handling this work, and for the detailed and constructive comments provided to further improve its presentation and quality. This work was supported in part by the National Natural Science Foundation of China under Grants 61836003, 61573150, 61573152 and 61633010.

Author information

Correspondence to Zhu Liang Yu.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Convergence proof for expected n-step SARSA

In this appendix, we prove that expected n-step SARSA converges to the optimal policy under some straightforward conditions. The proof largely follows the convergence proofs of SARSA [30] and Q-learning [15], and relies on the following lemma, which was also used in those proofs.

Lemma 1

Consider a stochastic process \((\alpha_{t}, {\Delta}_{t}, F_{t})\), where \(\alpha_{t}, {\Delta}_{t}, F_{t}: X \to \mathbb{R}\) satisfy the equation:

$$ {\Delta}_{t + 1}(x_{t})=(1-\alpha_{t}(x_{t})){\Delta}_{t}(x_{t}) + \alpha_{t}(x_{t})F_{t}(x_{t}), $$
(14)

where \(x_{t} \in X\) and t = 0, 1, 2,…. Then \({\Delta}_{t}(x_{t})\) converges to 0 with probability 1 if the following properties hold:

  1. 1)

    The set X is finite.

  2. 2)

    \(\alpha_{t}(x_{t}) \in [0, 1]\), \({\sum}_{t} \alpha_{t}(x_{t}) = \infty\), \({\sum}_{t} (\alpha_{t}(x_{t}))^{2} < \infty\) with probability 1, and \(\forall x \neq x_{t}: \alpha_{t}(x) = 0\).

  3. 3)

    \(||\mathbb {E}[F_{t}|P_{t}]||\le \kappa ||{\Delta }_{t}(x_{t})|| + c_{t}\), where κ ∈ [0, 1) and ct converges to 0 with probability 1.

  4. 4)

    \(\text{Var}[F_{t}(x_{t})|P_{t}] \le K(1 + \kappa||{\Delta}_{t}(x_{t})||)^{2}\), where K is some constant.

Here, ||⋅|| denotes the maximum norm, \(P_{t}\) is a sequence of increasing σ-fields that includes the past of the process, and \(\alpha_{t}, {\Delta}_{t}, F_{t-1} \in P_{t}\).

To prove the convergence of expected n-step SARSA, we need to associate this algorithm with the convergent stochastic process defined in Lemma 1. We define \(X=\mathcal{S} \times \mathcal{A}\), \(P_{t}= \{\hat{Q}_{0},s_{0},a_{0},\hat{r}^{(n)}_{0},s_{1},a_{1},\hat{r}^{(n)}_{1},\dots,s_{t},a_{t}\}\), \(x_{t} = (s_{t},a_{t})\), \(\alpha_{t}(x_{t}) = \alpha_{t}(s_{t},a_{t})\) and \({\Delta}_{t}(s_{t},a_{t}) = \hat{Q}_{t}(s_{t},a_{t})-\hat{Q}^{*}(s_{t},a_{t})\). If \({\Delta}_{t}(s_{t},a_{t})\) converges to 0 with probability 1, then the proposed algorithm converges to the optimal policy. This leads to the following theorem:

Theorem 1

Expected n-step SARSA converges to the optimal value function if the following assumptions hold:

  1. 1)

    The state space \(\mathcal {S}\) and action space \(\mathcal {A}\) are finite.

  2. 2)

    The learning rates satisfy \(\alpha_{t}(s_{t},a_{t}) \in [0, 1]\), \({\sum}_{t} \alpha_{t}(s_{t},a_{t}) = \infty\), \({\sum}_{t} (\alpha_{t}(s_{t},a_{t}))^{2} < \infty\) with probability 1, and \(\forall (s,a)\neq(s_{t},a_{t}): \alpha_{t}(s,a) = 0\).

  3. 3)

    The policy is greedy in the limit with infinite exploration.

  4. 4)

    The reward function is bounded.

  5. 5)

    γ ∈ [0, 1).

Proof of Theorem 1

To prove this theorem, we note that conditions 1), 2) and 4) of Lemma 1 correspond to conditions 1), 2) and 4) of Theorem 1, respectively. Therefore, in the following, we only need to show that the proposed algorithm satisfies condition 3) of Lemma 1. According to (13) and \({\Delta}_{t}(x_{t}) = \hat{Q}_{t}(s_{t},a_{t})-\hat{Q}^{*}(s_{t},a_{t})\), we have

$$\begin{aligned} {\Delta}_{t + 1}(s_{t},a_{t}) &= \hat{Q}_{t + 1}(s_{t},a_{t})-\hat{Q}^{*}(s_{t},a_{t})\\ &= \hat{Q}_{t}(s_{t},a_{t})+\alpha_{t}(s_{t},a_{t}) \left(\hat{r}_{t}^{(n)} + \gamma \hat{Q}_{t}(s_{t + 1},a_{t + 1}) -\hat{Q}_{t}(s_{t},a_{t})\right)-\hat{Q}^{*}(s_{t},a_{t})\\ &= (1-\alpha_{t}(s_{t},a_{t}))\left(\hat{Q}_{t}(s_{t},a_{t})-\hat{Q}^{*}(s_{t},a_{t})\right) +\alpha_{t}(s_{t},a_{t})\left(\hat{r}_{t}^{(n)}+\gamma \hat{Q}_{t}(s_{t + 1},a_{t + 1})-\hat{Q}^{*}(s_{t},a_{t})\right)\\ &= (1-\alpha_{t}(s_{t},a_{t})){\Delta}_{t}(s_{t},a_{t})+ \alpha_{t}(s_{t},a_{t}) F_{t}(s_{t},a_{t}) \end{aligned}$$
(15)

where we define Ft(st,at) as:

$$\begin{aligned} F_{t}(s_{t},a_{t}) &= \hat{r}_{t}^{(n)}+\gamma \hat{Q}_{t}(s_{t + 1},a_{t + 1})-\hat{Q}^{*}(s_{t},a_{t})\\ &= \hat{r}_{t}^{(n)}+\gamma \max_{a}\hat{Q}_{t}(s_{t + 1},a)-\hat{Q}^{*}(s_{t},a_{t}) +\gamma \left(\hat{Q}_{t}(s_{t + 1},a_{t + 1}) -\max_{a}\hat{Q}_{t}(s_{t + 1},a)\right) \end{aligned}$$
(16)

Therefore, we have

$$\begin{aligned} ||\mathbb{E}[F_{t}(s_{t},a_{t})]|| &\leq ||\mathbb{E}[\hat{r}_{t}^{(n)} + \gamma \max_{a}\hat{Q}_{t}(s_{t + 1},a) - \hat{Q}^{*}(s_{t},a_{t})]|| + \gamma ||\mathbb{E}[\hat{Q}_{t}(s_{t + 1},a_{t + 1}) - \max_{a}\hat{Q}_{t}(s_{t + 1},a)]||\\ &= \gamma ||\mathbb{E}[\max_{a}\hat{Q}_{t}(s_{t + 1},a) - \max_{a}\hat{Q}^{*}(s_{t + 1},a)]|| + \gamma ||\mathbb{E}[\hat{Q}_{t}(s_{t + 1},a_{t + 1}) - \max_{a}\hat{Q}_{t}(s_{t + 1},a)]||\\ &\le \gamma \max_{s}\max_{a}|\hat{Q}_{t}(s,a) - \hat{Q}^{*}(s,a)| + \gamma \max_{s} |\hat{Q}_{t}(s,a) - \max_{a}\hat{Q}_{t}(s,a)|\\ &= \gamma ||{\Delta}_{t}|| + \gamma \max_{s} |\hat{Q}_{t}(s,a) - \max_{a}\hat{Q}_{t}(s,a)| \end{aligned}$$
(17)

We identify \(c_{t}=\gamma \max_{s} |\hat{Q}_{t}(s,a) -\max_{a}\hat{Q}_{t}(s,a)|\) and κ = γ. According to the convergence proof of SARSA [30], \(c_{t}\) converges to 0 with probability 1 for policies that are greedy in the limit. Therefore, the proposed algorithm satisfies condition 3) of Lemma 1, and the convergence of expected n-step SARSA is proved. □
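
Conditions 2) and 3) of Theorem 1 are the familiar Robbins–Monro and GLIE requirements. As an illustration (not taken from the paper), per-visit schedules such as \(\alpha_{t}(s,a)=1/N_{t}(s,a)\) for the learning rate and \(\epsilon_{t}(s)=1/N_{t}(s)\) for an ε-greedy behaviour policy satisfy both conditions; a minimal Python sketch with hypothetical counter and function names follows.

from collections import defaultdict

# Illustrative schedules (not from the paper) satisfying conditions 2) and 3)
# of Theorem 1 for state-action pairs that are visited infinitely often.
sa_visits = defaultdict(int)   # N_t(s, a): visit count per state-action pair
s_visits = defaultdict(int)    # N_t(s): visit count per state

def learning_rate(s, a):
    # alpha_t(s, a) = 1 / N_t(s, a): the harmonic series diverges while its
    # squares sum to a finite value, so condition 2) (Robbins-Monro) holds.
    sa_visits[(s, a)] += 1
    return 1.0 / sa_visits[(s, a)]

def exploration_rate(s):
    # epsilon_t(s) = 1 / N_t(s): epsilon decays to zero while every action is
    # still tried infinitely often, so the epsilon-greedy policy is greedy in
    # the limit with infinite exploration (condition 3)).
    s_visits[s] += 1
    return 1.0 / s_visits[s]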

Appendix B: The pseudocode of expected n-step SARSA

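A minimal Python sketch of tabular expected n-step SARSA with an ε-greedy behaviour policy is given below for orientation. It uses the conventional discounted n-step return with an expected (policy-weighted) bootstrap, in the spirit of De Asis et al. [7]; the proposed method instead uses the modified return \(\hat{r}_{t}^{(n)}\) defined in the main text. The Gym-style environment interface and all function and parameter names here are illustrative assumptions, not part of the original listing.

import numpy as np

def expected_nstep_sarsa(env, n_states, n_actions, n=3, episodes=500,
                         alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular expected n-step SARSA sketch with a fixed epsilon-greedy policy.
    # Assumes a Gym-style env: reset() -> state, step(a) -> (state, reward, done, info),
    # with integer-indexed states and actions.
    Q = np.zeros((n_states, n_actions))

    def policy_probs(s):
        # epsilon-greedy action probabilities under the current Q estimate
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[s])] += 1.0 - epsilon
        return probs

    for _ in range(episodes):
        s = env.reset()
        a = np.random.choice(n_actions, p=policy_probs(s))
        states, actions, rewards = [s], [a], [0.0]   # rewards[0] is a placeholder
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done, _ = env.step(actions[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(np.random.choice(n_actions, p=policy_probs(s_next)))
            tau = t - n + 1          # time step whose estimate is updated now
            if tau >= 0:
                # conventional n-step return: discounted rewards plus an
                # expected (policy-weighted) bootstrap after n steps
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    s_boot = states[tau + n]
                    G += gamma ** n * np.dot(policy_probs(s_boot), Q[s_boot])
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q

For the Mountain Car experiment mentioned in the abstract, the continuous state would first have to be discretised (for example with the tile-coding software of [33]) before a tabular sketch like this one applies.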

Cite this article

Yuan, Y., Yu, Z.L., Gu, Z. et al. A novel multi-step reinforcement learning method for solving reward hacking. Appl Intell 49, 2874–2888 (2019). https://doi.org/10.1007/s10489-019-01417-4
