Computer Science ›› 2023, Vol. 50 ›› Issue (12): 314-321. doi: 10.11896/jsjkx.221100096
XU Yapeng1, LIU Quan1,2, LI Junwei1
Abstract: Option-based hierarchical reinforcement learning (O-HRL) algorithms feature temporal abstraction, which lets them handle complex problems that are hard for standard reinforcement learning, such as long horizons and sparse rewards. Current research on O-HRL focuses mainly on data efficiency, improving the agent's sampling efficiency and exploration ability so as to maximize the probability of collecting high-quality experience. In terms of policy stability, however, only state information is considered when the high-level policy guides low-level actions; option information is therefore underused, and the low-level policy becomes unstable. To address this problem, a hierarchical reinforcement learning method based on trajectory information (THRL) is proposed. THRL uses different types of information from option trajectories to guide low-level action selection, and generates an inferred option from the resulting extended trajectory information. A discriminator then takes the inferred option and the original option as input to produce an intrinsic reward, making low-level action selection more consistent with the current option policy and thus resolving the instability of the low-level policy. THRL and several state-of-the-art deep reinforcement learning algorithms are evaluated on MuJoCo environments; experimental results show that THRL achieves better stability and performance, verifying its effectiveness.
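The discriminator-based intrinsic reward described above can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the paper's implementation: the class name OptionDiscriminator, the fixed trajectory-embedding dimension, and the log-likelihood form of the reward are all assumptions made for the example. It only shows the general idea of rewarding the low-level policy when its trajectory segment is recognizable as the option the high-level policy actually issued.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionDiscriminator(nn.Module):
    """Hypothetical sketch: classifies which option a trajectory
    segment came from. The intrinsic reward is the log-likelihood
    the classifier assigns to the option that actually generated
    the segment, so low-level actions that match the commanded
    option policy are rewarded."""
    def __init__(self, traj_dim: int, num_options: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),
        )

    def forward(self, traj_feat: torch.Tensor) -> torch.Tensor:
        # logits over the discrete option set
        return self.net(traj_feat)

    def intrinsic_reward(self, traj_feat, option_idx):
        # log q(original option | trajectory features)
        log_q = F.log_softmax(self.forward(traj_feat), dim=-1)
        return log_q.gather(-1, option_idx.unsqueeze(-1)).squeeze(-1)

# usage sketch (all shapes assumed for illustration)
disc = OptionDiscriminator(traj_dim=32, num_options=4)
traj_feat = torch.randn(8, 32)          # batch of trajectory embeddings
option_idx = torch.randint(0, 4, (8,))  # options issued by the high-level policy
r_int = disc.intrinsic_reward(traj_feat, option_idx)
```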