Abstract
Reinforcement learning with an appropriately designed reward signal can solve many sequential decision problems. In practice, however, reinforcement learning algorithms can break in unexpected, counterintuitive ways. One such failure mode is reward hacking, which occurs when the reward function allows the agent to obtain a high return in an unintended way. This behavior may subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to mitigate reward hacking. Unlike traditional algorithms, the proposed method uses a new return function, which alters the discount applied to future rewards and no longer treats the immediate reward as the dominant factor when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method can alleviate the negative impact of reward hacking and substantially improve the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method can also be applied successfully to problems with continuous state spaces.
References
Amin K, Jiang N, Singh S (2017) Repeated inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1815–1824
Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. arXiv:1606.06565
An Y, Ding S, Shi S, Li J (2018) Discrete space reinforcement learning algorithm based on support vector machine classification. Pattern Recogn Lett 111:30–35
Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA (2017) Deep reinforcement learning: a brief survey. IEEE Signal Proc Mag 34(6):26–38
Aslund H, Mhamdi EME, Guerraoui R, Maurer A (2018) Virtuously safe reinforcement learning. arXiv:1805.11447
Bragg J, Habli I (2018) What is acceptably safe for reinforcement learning. In: International workshop on artificial intelligence safety engineering
De Asis K, Hernandez-Garcia JF, Holland GZ, Sutton RS (2017) Multi-step reinforcement learning: a unifying algorithm. arXiv:1703.01327
Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput 12(1):219–245
Everitt T, Krakovna V, Orseau L, Hutter M, Legg S (2017) Reinforcement learning with a corrupted reward channel. In: International joint conferences on artificial intelligence (IJCAI), pp 4705–4713
Fernandez-Gauna B, Osa JL, Graña M (2017) Experiments of conditioned reinforcement learning in continuous space control tasks. Neurocomputing 271:38–47
Garcia J, Femandez F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16:1437–1480
Hadfield-Menell D, Milli S, Abbeel P, Russell SJ, Dragan A (2017) Inverse reward design. In: Advances in neural information processing systems (NIPS), pp 6765–6774
Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2017) Rainbow: combining improvements in deep reinforcement learning. arXiv:1710.02298
Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, Van Hasselt H, Silver D (2018) Distributed prioritized experience replay. arXiv:1803.00933
Jaakkola T, Jordan MI, Singh SP (1993) Convergence of stochastic iterative dynamic programming algorithms. Neural Comput 6(6):1185–1201
Orseau L, Armstrong S (2016) Safely interruptible agents. In: Conference on uncertainty in artificial intelligence (UAI)
Leike J, Martic M, Krakovna V, Ortega PA, Everitt T, Lefrancq A, Orseau L, Legg S (2017) AI safety gridworlds. arXiv:1711.09883
Ludvig EA, Sutton RS, Kehoe EJ (2012) Evaluating the TD model of classical conditioning. Learning & Behavior 40(3):305–319
Marco D (2009) Markov random processes are neither bandlimited nor recoverable from samples or after quantization. IEEE Trans Inf Theory 55(2):900–905
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing atari with deep reinforcement learning. In: Annual Conference on Neural Information Processing Systems (NIPS)
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland A, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning (ICML), pp 1928–1937
Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480
Murphy SA (2005) A generalization error for Q-learning. J Mach Learn Res 6:1073–1097
Pakizeh E, Pedram MM, Palhang M (2015) Multi-criteria expertness based cooperative method for SARSA and eligibility trace algorithms. Appl Intell 43:487–498
Pathak S, Pulina L, Tacchella A (2017) Verification and repair of control policies for safe reinforcement learning. Appl Intell 1:886–908
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190
Freedman RG, Zilberstein S (2016) Safety in AI-HRI: challenges complementing user experience quality. In: AAAI conference on artificial intelligence (AAAI)
Seijen HV, Mahmood AR, Pilarski PM, Machado MC, Sutton RS (2016) True online temporal-difference learning. J Mach Learn Res 17(1):5057–5096
Singh S, Jaakkola T, Littman ML, Szepesvari C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38(3):287–308
Suri RE (2002) TD models of reward predictive responses in dopamine neurons. Neural Netw 15(4-6):523–533
Sutton RS, Barto AG (2017) Reinforcement learning: an introduction, 2nd edn (in preparation). MIT Press
Sutton RS (2016) Tile coding software – reference manual, version 3 beta. http://incompleteideas.net/tiles/tiles3.html
Van Seijen H, Van Hasselt H, Whiteson S, Wiering M (2009) A theoretical and empirical analysis of expected SARSA. In: Proceedings of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp 177–184
Xu X, Zuo L, Huang Z (2014) Reinforcement learning algorithms with function approximation: recent advances and applications. Inf Sci 261:1–31
Zhao X, Ding S, An Y (2018) A new asynchronous architecture for tabular reinforcement learning algorithms. In: Proceedings of the 8th international conference on extreme learning machines, pp 172–180
Zhao X, Ding S, An Y, Jia W (2018) Asynchronous reinforcement learning algorithms for solving discrete space path planning problems. Appl Intell 48(12):4889–4904
Acknowledgment
The authors would like to thank the editors and reviewers for the time and effort spent in handling this work, and for the detailed and constructive comments provided to further improve its presentation and quality. This work was supported in part by the National Natural Science Foundation of China under Grants 61836003, 61573150, 61573152, and 61633010.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Convergence proof for expected n-step SARSA
In this appendix, we prove that expected n-step SARSA converges to the optimal policy under some straightforward conditions. The proof largely follows the convergence proofs for SARSA [30] and Q-learning [15]. It relies on the following lemma, which was also used to prove the convergence of SARSA and Q-learning.
Lemma 1
Consider a stochastic process \((\alpha_{t}, {\Delta}_{t}, F_{t})\), where \(\alpha_{t}, {\Delta}_{t}, F_{t}: X \to \mathbb{R}\) satisfy the equation
$${\Delta}_{t+1}(x_{t}) = (1-\alpha_{t}(x_{t})){\Delta}_{t}(x_{t}) + \alpha_{t}(x_{t})F_{t}(x_{t}),$$
where \(x_{t} \in X\) and \(t = 0, 1, 2, \dots\). Then \({\Delta}_{t}(x_{t})\) converges to 0 with probability 1 if the following properties hold:
1) The set X is finite.

2) \(\alpha_{t}(x_{t}) \in [0, 1]\), \({\sum}_{t} \alpha_{t}(x_{t}) = \infty\), \({\sum}_{t} (\alpha_{t}(x_{t}))^{2} < \infty\) with probability 1, and \(\forall x \neq x_{t}: \alpha_{t}(x) = 0\).

3) \(||\mathbb{E}[F_{t}|P_{t}]|| \le \kappa ||{\Delta}_{t}|| + c_{t}\), where \(\kappa \in [0, 1)\) and \(c_{t}\) converges to 0 with probability 1.

4) \(\text{Var}[F_{t}(x_{t})|P_{t}] \le K(1 + \kappa ||{\Delta}_{t}||)^{2}\), where K is some constant.

Here, \(||\cdot||\) denotes the maximum norm, \(P_{t}\) is a sequence of increasing σ-fields that includes the past of the process, and \(\alpha_{t}, {\Delta}_{t}, F_{t-1} \in P_{t}\).
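The behavior guaranteed by Lemma 1 can be illustrated numerically. The sketch below is a toy construction, not the algorithm of the paper: we hand-build an \(F_{t}\) whose conditional mean is a κ-contraction of \({\Delta}_{t}\) plus a vanishing bias \(c_{t}\) (condition 3) and whose noise has bounded variance (condition 4), pick learning rates \(\alpha_{t} = 1/t\) satisfying condition 2, and check that the max-norm of \({\Delta}_{t}\) shrinks toward 0.

```python
import numpy as np

# Toy numerical check of Lemma 1; all constants here are illustrative
# assumptions, not values from the paper.
rng = np.random.default_rng(0)
n_states, kappa = 5, 0.4
delta = rng.uniform(-1.0, 1.0, n_states)         # Delta_0, arbitrary start

for t in range(1, 100_001):
    alpha = 1.0 / t                  # sum alpha = inf, sum alpha^2 < inf
    c_t = 1.0 / t                    # bias term converging to 0 (condition 3)
    noise = 0.1 * rng.standard_normal(n_states)  # zero-mean, bounded variance
    f = kappa * delta + c_t + noise  # ||E[F_t|P_t]|| <= kappa ||Delta_t|| + c_t
    delta = (1 - alpha) * delta + alpha * f      # the update rule of Lemma 1

print(float(np.abs(delta).max()))    # max-norm of Delta_t, close to 0
```

Because κ < 1 and the bias vanishes, the iterate contracts toward the zero fixed point while the \(1/t\) step sizes average the noise away.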
To prove the convergence of expected n-step SARSA, we need to associate this algorithm with the converging stochastic process defined in Lemma 1. We define \(X=\mathcal {S} \times \mathcal {A}\), \(P_{t}= \{\hat {Q}_{0},s_{0},a_{0},\hat {r}^{(n)}_{0},s_{1},a_{1},\hat {r}^{(n)}_{1},\dots ,s_{t},a_{t}\}\), \(x_{t} = (s_{t},a_{t})\), \(\alpha_{t}(x_{t}) = \alpha_{t}(s_{t},a_{t})\), and \({\Delta }_{t}(s_{t},a_{t}) = \hat {Q}_{t}(s_{t},a_{t})-\hat {Q}^{*}(s_{t},a_{t})\). If we can prove that \({\Delta}_{t}(s_{t},a_{t})\) converges to 0 with probability 1, then the proposed algorithm converges to the optimal policy. Therefore, we have the following theorem:
Theorem 1
Expected n-step SARSA converges to the optimal value function if the following assumptions hold:
1) The state space \(\mathcal {S}\) and action space \(\mathcal {A}\) are finite.

2) The learning rates satisfy \(\alpha_{t}(s_{t},a_{t}) \in [0, 1]\), \({\sum }_{t} \alpha _{t}(s_{t},a_{t}) = \infty\), \({\sum }_{t} (\alpha _{t}(s_{t},a_{t}))^{2} < \infty\) with probability 1, and \(\forall (s,a) \neq (s_{t},a_{t}): \alpha_{t}(s,a) = 0\).

3) The policy is greedy in the limit with infinite exploration (GLIE).

4) The reward function is bounded.

5) \(\gamma \in [0, 1)\).
Proof of Theorem 1
To prove this theorem, we note that conditions 1), 2), and 4) of Theorem 1 correspond to conditions 1), 2), and 4) of Lemma 1, respectively. Therefore, in the following we only need to show that the proposed algorithm satisfies condition 3) of Lemma 1. According to (13) and \({\Delta }_{t}(x_{t}) = \hat {Q}_{t}(s_{t},a_{t})-\hat {Q}^{*}(s_{t},a_{t})\), we have
$${\Delta}_{t+1}(s_{t},a_{t}) = (1-\alpha_{t}(s_{t},a_{t})){\Delta}_{t}(s_{t},a_{t}) + \alpha_{t}(s_{t},a_{t})F_{t}(s_{t},a_{t}),$$
where we define \(F_{t}(s_{t},a_{t})\) as
$$F_{t}(s_{t},a_{t}) = \hat{r}^{(n)}_{t} + \gamma \sum_{a} \pi(a|s_{t+n}) \hat{Q}_{t}(s_{t+n},a) - \hat{Q}^{*}(s_{t},a_{t}).$$
Therefore, we have
$$||\mathbb{E}[F_{t}|P_{t}]|| \le \gamma ||{\Delta}_{t}|| + \gamma \max_{s} \Big|\sum_{a} \pi(a|s) \hat{Q}_{t}(s,a) - \max_{a}\hat{Q}_{t}(s,a)\Big|.$$
We identify \(c_{t}=\gamma \max _{s} |\sum_{a} \pi(a|s) \hat {Q}_{t}(s,a) -\max _{a}\hat {Q}_{t}(s,a)|\) and \(\kappa = \gamma\). According to the proof for SARSA [30], \(c_{t}\) converges to 0 with probability 1 for policies that are greedy in the limit. Therefore, the proposed algorithm satisfies condition 3) of Lemma 1, and the convergence of expected n-step SARSA is proved. □
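The term \(c_{t}\) has a concrete interpretation for ε-greedy behavior policies: it measures how far the policy's expected value is from the greedy value, scaled by γ, and it vanishes as ε → 0, which is exactly the GLIE requirement. A small sketch (the Q-table and the ε values are illustrative assumptions of this example, not quantities from the paper):

```python
import numpy as np

# For an epsilon-greedy policy, compute
#   c = gamma * max_s | sum_a pi(a|s) Q(s,a) - max_a Q(s,a) |
# and observe that it shrinks to 0 as epsilon -> 0 (the GLIE limit).
rng = np.random.default_rng(1)
gamma, n_states, n_actions = 0.9, 4, 3
q = rng.uniform(0.0, 1.0, (n_states, n_actions))  # arbitrary Q-table

def c_term(q, eps, gamma):
    greedy = q.argmax(axis=1)
    pi = np.full_like(q, eps / q.shape[1])         # exploration mass
    pi[np.arange(q.shape[0]), greedy] += 1 - eps   # greedy mass
    expected = (pi * q).sum(axis=1)                # sum_a pi(a|s) Q(s,a)
    return gamma * np.abs(expected - q.max(axis=1)).max()

cs = [c_term(q, eps, gamma) for eps in (0.5, 0.1, 0.01, 0.0)]
print(cs)  # shrinks with epsilon; exactly 0.0 for the fully greedy policy
```

Per state, the gap equals ε times the difference between the greedy value and the mean value, so \(c_{t}\) is linear in ε and vanishes under any GLIE schedule.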
Appendix B: The pseudocode of expected n-step SARSA
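The algorithm can be sketched in Python as follows. This is a minimal illustration using the standard n-step return with an expected-SARSA bootstrap and an ε-greedy GLIE-style schedule; it does not reproduce the paper's modified discounting, and the chain MDP, the Gym-style reset()/step() interface, and all hyperparameters are assumptions of the sketch.

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain MDP (an assumption of this sketch): action 1 moves
    right, action 0 moves left, every step costs -1, and the episode
    terminates on reaching the rightmost state."""
    def __init__(self, n=5):
        self.n = n
    def reset(self):
        self.s = 0
        return self.s, {}
    def step(self, a):
        self.s = min(self.s + 1, self.n - 1) if a == 1 else max(self.s - 1, 0)
        terminated = self.s == self.n - 1
        return self.s, -1.0, terminated, False, {}

def epsilon_greedy_probs(q_row, eps):
    """Action probabilities of an epsilon-greedy policy for one state."""
    probs = np.full(len(q_row), eps / len(q_row))
    probs[int(np.argmax(q_row))] += 1.0 - eps
    return probs

def expected_n_step_sarsa(env, n_states, n_actions, n=3, episodes=300,
                          alpha=0.1, gamma=0.95, eps_decay=0.95,
                          eps_min=0.05, max_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for ep in range(episodes):
        eps = max(eps_min, eps_decay ** ep)   # GLIE-style decay of epsilon
        s, _ = env.reset()
        states, actions, rewards, done = [s], [], [], False
        while not done and len(rewards) < max_steps:   # roll out one episode
            a = int(rng.choice(n_actions,
                               p=epsilon_greedy_probs(q[states[-1]], eps)))
            s2, r, terminated, truncated, _ = env.step(a)
            states.append(s2); actions.append(a); rewards.append(r)
            done = terminated or truncated
        T = len(rewards)
        for t in range(T):                    # n-step backups
            g = sum(gamma ** (i - t) * rewards[i]
                    for i in range(t, min(t + n, T)))
            if t + n < T:                     # expected-SARSA bootstrap
                sb = states[t + n]
                g += gamma ** n * float(
                    np.dot(epsilon_greedy_probs(q[sb], eps), q[sb]))
            q[states[t], actions[t]] += alpha * (g - q[states[t], actions[t]])
    return q

q = expected_n_step_sarsa(ChainEnv(), n_states=5, n_actions=2)
print(q[0])  # moving right should have the higher value in the start state
```

The bootstrap averages \(\hat{Q}\) over the current policy's action probabilities rather than sampling a single next action, which is what distinguishes expected SARSA from ordinary SARSA in the n-step target.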
Yuan, Y., Yu, Z.L., Gu, Z. et al. A novel multi-step reinforcement learning method for solving reward hacking. Appl Intell 49, 2874–2888 (2019). https://doi.org/10.1007/s10489-019-01417-4