Computer Science ›› 2023, Vol. 50 ›› Issue (12): 314-321. doi: 10.11896/jsjkx.221100096

• Artificial Intelligence •

Hierarchical Reinforcement Learning Method Based on Trajectory Information

XU Yapeng1, LIU Quan1,2, LI Junwei1   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  2. Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2022-11-10  Revised: 2023-03-28  Online: 2023-12-15  Published: 2023-12-07
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author: XU Yapeng, born in 1996, postgraduate (20205227104@stu.suda.edu.cn). His main research interests include hierarchical reinforcement learning and deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Abstract: Option-based hierarchical reinforcement learning (O-HRL) exploits temporal abstraction, which allows it to handle problems that are difficult for standard reinforcement learning, such as long horizons and sparse rewards. Existing O-HRL research focuses mainly on improving data efficiency, raising the agent's sampling efficiency and exploration ability so as to maximize the probability of collecting high-quality experience. In terms of policy stability, however, the high-level policy guides low-level actions using only state information, so option information is underused and the low-level policy becomes unstable. To address this problem, a hierarchical reinforcement learning method based on trajectory information (THRL) is proposed. THRL uses different types of information from option trajectories to guide low-level action selection, and generates inferred options from the resulting extended trajectory information. A discriminator then takes the inferred options and the original options as inputs and produces internal rewards, which makes low-level action selection more consistent with the current option policy and thus resolves the instability of the low-level policy. THRL and several state-of-the-art deep reinforcement learning algorithms are evaluated on MuJoCo environments, and experimental results show that THRL achieves better stability and performance, verifying its effectiveness.
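
To make the discriminator-based internal reward concrete, the following is a minimal sketch of one way the idea described in the abstract could be realized in PyTorch. All names (TrajectoryEncoder, OptionDiscriminator, intrinsic_reward), the network sizes, and the log-probability reward form are illustrative assumptions rather than the authors' reference implementation; the sketch only shows how a discriminator that infers the active option from an option trajectory can be turned into a reward signal that encourages low-level actions to stay consistent with the current option.

```python
# Hypothetical sketch of a discriminator-based internal reward for option-based HRL.
# Module names, dimensions and the reward form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, OPTION_DIM, HIDDEN = 17, 6, 4, 128  # hypothetical sizes


class TrajectoryEncoder(nn.Module):
    """Encodes an option trajectory (a sequence of state-action pairs) into a fixed-size summary."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(STATE_DIM + ACTION_DIM, HIDDEN, batch_first=True)

    def forward(self, traj):                 # traj: [batch, T, STATE_DIM + ACTION_DIM]
        _, h = self.gru(traj)                # h: [1, batch, HIDDEN]
        return h.squeeze(0)                  # [batch, HIDDEN]


class OptionDiscriminator(nn.Module):
    """Predicts (infers) which option generated a trajectory; outputs logits over options."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, OPTION_DIM),
        )

    def forward(self, traj_summary):
        return self.net(traj_summary)        # logits of the inferred option


def intrinsic_reward(disc_logits, option_onehot):
    """Internal reward: log-probability of the original (active) option under the discriminator.
    Higher when the trajectory is consistent with the option the high-level policy selected."""
    log_probs = F.log_softmax(disc_logits, dim=-1)
    return (log_probs * option_onehot).sum(dim=-1)


# Usage sketch: the internal reward would be weighted and added to the environment
# reward when updating the low-level policy (coefficient and update rule are assumptions).
encoder, disc = TrajectoryEncoder(), OptionDiscriminator()
traj = torch.randn(32, 10, STATE_DIM + ACTION_DIM)                   # a batch of option trajectories
option = F.one_hot(torch.randint(0, OPTION_DIM, (32,)), OPTION_DIM).float()
r_int = intrinsic_reward(disc(encoder(traj)), option)                # [32] internal rewards
```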

Key words: Option, Hierarchical reinforcement learning, Trajectory information, Discriminator, Deep reinforcement learning

CLC number: TP181