First return, then explore

Abstract

Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse1 and deceptive2 feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly ‘remembering’ promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games1, with orders-of-magnitude improvements on the grand challenges of Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration—an insight that may prove critical to the creation of truly intelligent learning agents.
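The core loop behind these principles can be summarized in code. The sketch below is illustrative only: it assumes a gym-style environment that additionally exposes save_state and restore_state for restoring simulator states, and cell_repr stands in for a downscaled state representation; none of these names come from the authors' implementation, and selecting the least-visited cell is just one simple count-based heuristic standing in for whatever selection weighting is used in practice.

```python
# Minimal sketch of the 'remember, return, explore' loop described in the abstract.
# All interfaces (save_state, restore_state, cell_repr) are assumptions, not the
# authors' implementation.
import random

def go_explore(env, cell_repr, iterations=1000, explore_steps=100):
    """env is assumed gym-like, with extra save_state/restore_state methods."""
    obs = env.reset()
    archive = {cell_repr(obs): {"state": env.save_state(), "score": 0.0, "visits": 1}}
    best_score = 0.0

    for _ in range(iterations):
        # Select: favour rarely visited cells (a simple count-based heuristic).
        cell = min(archive, key=lambda c: archive[c]["visits"])
        entry = archive[cell]
        entry["visits"] += 1

        # Return: restore the simulator state saved for this cell.
        env.restore_state(entry["state"])
        score = entry["score"]

        # Explore: take random actions and archive any new or improved cells.
        for _ in range(explore_steps):
            obs, reward, done, _ = env.step(env.action_space.sample())
            score += reward
            best_score = max(best_score, score)
            c = cell_repr(obs)
            if c not in archive or score > archive[c]["score"]:
                archive[c] = {"state": env.save_state(), "score": score, "visits": 1}
            if done:
                break
    return archive, best_score
```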


Fig. 1: Overview of Go-Explore.
Fig. 2: Performance of robustified Go-Explore on Atari games.
Fig. 3: Human-normalized performance of the exploration phase and state-of-the-art algorithms on all Atari games.
Fig. 4: Go-Explore can solve a challenging, sparse-reward, simulated robotics task.
Fig. 5: Policy-based Go-Explore with domain knowledge outperforms state-of-the-art and average human performance in Montezuma’s Revenge and Pitfall.

Data availability

The data that support the findings of this study (including the raw data for all figures and tables in the manuscript, Extended Data, Supplementary Information, as well as the demonstration trajectories used in robustification) are available from the corresponding authors upon reasonable request.

Code availability

The Go-Explore code is available at https://github.com/uber-research/go-explore.

References

  1. Bellemare, M. et al. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NIPS 2016) (eds Lee, D. et al.) 1471–1479 (2016).

  2. Lehman, J. & Stanley, K. O. Novelty search and the problem with objectives. In Genetic Programming Theory and Practice IX (eds Riolo, R. et al.) 37–56 (2011).

  3. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

  4. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).

  5. OpenAI. Dota 2 with large-scale deep reinforcement learning. Preprint at https://arxiv.org/abs/1912.06680 (2019).

  6. Merel, J. et al. Hierarchical visuomotor control of humanoids. In Int. Conf. Learning Representations https://openreview.net/forum?id=BJfYvo09Y7 (2019).

  7. OpenAI. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39, 3–20 (2020).

  8. Lehman, J. et al. The surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities. Artif. Life 26, 274–306 (2020).

  9. Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

  10. Smart, W. D. & Kaelbling, L. P. Effective reinforcement learning for mobile robots. In Proc. 2002 IEEE Int. Conf. Robotics and Automation 3404–3410 (IEEE, 2002).

  11. Lehman, J. & Stanley, K. O. Abandoning objectives: evolution through the search for novelty alone. Evol. Comput. 19, 189–223 (2011).

  12. Conti, E. et al. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (eds Bengio S. et al.) 5027–5038 (2018).

  13. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).

  14. Puigdomènech Badia, A. et al. Agent57: outperforming the Atari human benchmark. In Int. Conf. Machine Learning 507–517 (PMLR, 2020).

  15. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  16. Aytar, Y. et al. Playing hard exploration games by watching YouTube. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (eds Bengio, S. et al.) 2930–2941 (2018).

  17. Machado, M. C. et al. Revisiting the Arcade Learning Environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).

  18. Lipovetzky, N., Ramirez, M. & Geffner, H. Classical planning with simulators: results on the Atari video games. In IJCAI’15 Proc. 24th Int. Conf. Artificial Intelligence (eds Yang, Q. & Woolridge, M.) 1610–1616 (2015).

  19. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (Bradford, 1998).

  20. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. 33rd Int. Conf. Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 1928–1937 (2016).

  21. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).

  22. Cully, A., Clune, J., Tarapore, D. & Mouret, J.-B. Robots that can adapt like animals. Nature 521, 503–507 (2015).

  23. Peng, X. B., Andrychowicz, M., Zaremba, W. & Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE Int. Conf. Robotics and Automation (ICRA) (ed. Lynch, K.) 3803–3817 (IEEE, 2018).

  24. Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In Proc. Robotics: Science and Systems (eds Kress-Gazit, H. et al.) https://doi.org/10.15607/RSS.2018.XIV.010 (2018).

  25. Hester, T. et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conf. Artificial Intelligence 3223–3230 (2018).

  26. Guo, X., Singh, S. P., Lee, H., Lewis, R. L. & Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27 (NIPS 2014) (eds Ghahramani, Z. et al.) 3338–3346 (2014).

  27. Horgan, D. et al. Distributed prioritized experience replay. In Int. Conf. Learning Representations https://openreview.net/forum?id=H1Dy---0Z (2018).

  28. Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 1407–1416 (2018).

  29. Salimans, T. & Chen, R. Learning Montezuma’s Revenge from a single demonstration. Preprint at https://arxiv.org/abs/1812.03381 (2018).

  30. Van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V. & Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems 29 (NIPS 2016) (eds Lee, D. et al.) 4287–4295 (2016).

  31. Puigdomènech Badia, A. et al. Never give up: learning directed exploration strategies. In Int. Conf. Learning Representations https://openreview.net/forum?id=Sye57xStvB (2020).

  32. Brockman, G. et al. OpenAI gym. Preprint at https://arxiv.org/abs/1606.01540 (2016).

  33. ATARI VCS/2600 Scoreboard. Atari Compendium http://www.ataricompendium.com/game_library/high_scores/high_scores.html (accessed 6 January 2020).

  34. Guo, Y. et al. Efficient exploration with self-imitation learning via trajectory-conditioned policy. Preprint at https://arxiv.org/abs/1907.10247 (2019).

  35. Wise, M., Ferguson, M., King, D., Diehr, E. & Dymesich, D. Fetch and freight: standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots of the Intl Joint Conf. Artificial Intelligence (2016).

  36. Eysenbach, B., Salakhutdinov, R. R. & Levine, S. Search on the replay buffer: bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 15220–15231 (2019).

  37. Oh, J., Guo, Y., Singh, S. & Lee, H. Self-imitation learning. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 3878–3887 (2018).

  38. Madotto, A. et al. Exploration-based language learning for text-based games. Preprint at https://arxiv.org/abs/2001.08868 (2020).

  39. Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).

  40. Alvernaz, S. & Togelius, J. Autoencoder-augmented neuroevolution for visual Doom playing. In 2017 IEEE Conf. Computational Intelligence and Games (CIG) 1–8 (IEEE, 2017).

  41. Cuccu, G., Togelius, J. & Cudré-Mauroux, P. Playing Atari with six neurons. In Proc. 18th Intl Conf. Autonomous Agents and MultiAgent Systems 998–1006 (2019).

  42. Oord, A. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  43. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In Int. Conf. Learning Representations https://openreview.net/forum?id=SJ6yPD5xg (2017).

  44. Chaslot, G., Bakkes, S., Szita, I. & Spronck, P. Monte-Carlo tree search: a new framework for game AI. In AIIDE'08: Proc. Fourth AAAI Conf. Artificial Intelligence and Interactive Digital Entertainment (eds Darken, C. & Mateas, M.) 216–217 (2008).

  45. LaValle, S. M. Rapidly-Exploring Random Trees: A New Tool for Path Planning. Technical Report No. 98-11 (Iowa State Univ., 1998).

  46. Hart, P. E., Nilsson, N. J. & Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4, 100–107 (1968).

  47. Smith, D. E. & Weld, D. S. Conformant Graphplan. In AAAI '98/IAAI '98: Proc. 15th Natl/10th Conf. Artificial Intelligence/Innovative Applications of Artificial Intelligence (eds Mostow, J. et al.) 889–896 (1998).

  48. Castro, P. S., Moitra, S., Gelada, C., Kumar, S. & Bellemare, M. G. Dopamine: a research framework for deep reinforcement learning. Preprint at https://arxiv.org/abs/1812.06110 (2018).

  49. Toromanoff, M., Wirbel, E. & Moutarde, F. Is deep reinforcement learning really superhuman on Atari? In Deep Reinforcement Learning Workshop of 33rd Conf. Neural Information Processing Systems (NeurIPS 2019) (2019).

  50. Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. In Int. Conf. Learning Representations https://openreview.net/forum?id=H1lJJnR5Ym (2019).

  51. Choi, J. et al. Contingency-aware exploration in reinforcement learning. In Int. Conf. Learning Representations https://openreview.net/forum?id=HyxGB2AcY7 (2019).

  52. Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).

  53. Taiga, A. A., Fedus, W., Machado, M. C., Courville, A. & Bellemare, M. G. On bonus based exploration methods in the Arcade Learning Environment. In Int. Conf. Learning Representations https://openreview.net/forum?id=BJewlyStDr (2020).

  54. Tang, Y., Valko, M. & Munos, R. Taylor expansion policy optimization. In Proc. 37th Int. Conf. Machine Learning (eds Daumé III, H. & Singh, A.) 9397–9406 (2020).

  55. Ostrovski, G., Bellemare, M. G., van den Oord, A. & Munos, R. Count-based exploration with neural density models. In Proc. 34th Int. Conf. Machine Learning (eds Precup, D. & Teh, Y. W.) 2721–2730 (2017).

  56. Martin, J., Sasikumar, S. N., Everitt, T. & Hutter, M. Count-based exploration in feature space for reinforcement learning. In IJCAI’17: Proc. 26th Int. Joint Conf. Artificial Intelligence (ed. Sierra, C.) 2471–2478 (2017).

  57. O’Donoghue, B., Osband, I., Munos, R. & Mnih, V. The uncertainty Bellman equation and exploration. In Proc. 35th Int. Conf. Machine Learning (eds Dy, J. & Krause, A.) 3839–3848 (2018).

  58. Goldenberg, A., Benhabib, B. & Fenton, R. A complete generalized solution to the inverse kinematics of robots. IEEE J. Robot. Autom. 1, 14–20 (1985).

  59. Spong, M. W., Hutchinson, S. & Vidyasagar, M. Robot Modeling and Control (Wiley, 2006).

  60. Zhao, Z.-Q., Zheng, P., Xu, S.-t. & Wu, X. Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30, 3212–3232 (2019).

  61. Todorov, E., Erez, T. & Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ Int. Conf. Intelligent Robots and Systems 5026–5033 (IEEE, 2012).

  62. Kocsis, L. & Szepesvári, C. Bandit-based Monte Carlo planning. In European Conf. Machine Learning ECML 2006 (eds Fürnkranz, J. et al.) 282–293 (Springer, 2006).

  63. Strehl, A. L. & Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 74, 1309–1331 (2008).

  64. Tang, H. et al. #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (eds Guyon, I. et al.) 2750–2759 (2017).

  65. Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th Int. Conf. Machine Learning (eds Bratko, I. & Džeroski, S.) 278–287 (1999).

  66. Hussein, A., Gaber, M. M., Elyan, E. & Jayne, C. Imitation learning: a survey of learning methods. ACM Comput. Surv. 50, 21 (2017).

  67. Plappert, M. et al. Multi-goal reinforcement learning: challenging robotics environments and request for research. Preprint at https://arxiv.org/abs/1802.09464 (2018).

  68. Cho, K., Van Merriënboer, B., Bahdanau, D. & Bengio, Y. On the properties of neural machine translation: encoder-decoder approaches. In Proc. SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation 103–111 (Association for Computational Linguistics, 2014).

Acknowledgements

We thank A. Edwards, S. Kapoor, F. Petroski Such and J. Zhi for their ideas, feedback, technical support and work on aspects of Go-Explore not presented in this work. We are grateful to the Colorado Data Center and OpusStack Teams at Uber for providing our computing platform. We thank V. Kumar for creating the MuJoCo files that served as the basis for our robotics environment (https://github.com/vikashplus/fetch).

Author information

Contributions

A.E. and J.H. contributed equally and are responsible for the technical work (J.H. focused primarily on policy-based Go-Explore and A.E. on most other technical contributions) as well as the initial draft of the paper. J.C. and K.O.S. led the team. All authors (A.E., J.H., J.L., K.O.S. and J.C.) significantly contributed to ideation, experimental design, analysing data, strategic decisions, developing the philosophical motivation for the algorithm and editing the paper.

Corresponding authors

Correspondence to Adrien Ecoffet, Joost Huizinga or Jeff Clune.

Ethics declarations

Competing interests

Uber Technologies, Inc. has filed a publicly available provisional patent application 16/696,893 about some Go-Explore variants featuring a deep reinforcement learning model, with all authors (A.E., J.H., J.L., K.O.S. and J.C.) listed as inventors.

Additional information

Peer review information Nature thanks Julian Togelius and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Neural network architectures.

a, The Atari architecture is based on the architecture provided with the backward algorithm implementation. The input consists of the RGB channels of the last four frames (rescaled to 80 by 105 pixels) concatenated, resulting in 12 input channels. The network consists of three convolutional layers (C), two fully connected layers (FC), and a layer of gated recurrent units (GRUs)68. The network has a policy head π(a_t | s_t) and a value head V(s_t). b, For the robotics problem, the architecture consists of two separate networks, each with two fully connected layers and a GRU layer. One network specifies the policy π(a_t | s_t) by returning a mean μ_t and variance σ_t for the actuator torques of the arm and the desired position of each of the two fingers of the gripper (gripper fingers are implemented as MuJoCo position actuators61 with kp = 10^4 and a control range of [0, 0.05]). The other network implements the value function V(s_t). c, The architecture for policy-based Go-Explore is identical to the Atari architecture, except that the goal representation g_t is concatenated with the input of the first fully connected layer. Activation functions (Act.) are: the rectified linear unit (ReLU), the exponential function (Exp) and the softmax function (Softmax). Layers can also include layer normalization (Layer norm), which transforms the output of the layer by subtracting the mean and dividing by the standard deviation of the layer.
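For concreteness, a hedged PyTorch sketch of the Atari network in a is given below. The three convolutional layers, two fully connected layers, GRU layer and the two heads follow the caption; the kernel sizes, strides and hidden widths are assumptions, since the exact hyperparameters are not restated here.

```python
# Illustrative sketch of the Atari policy/value network described above.
# Layer widths, kernel sizes and strides are assumptions, not the authors'
# exact hyperparameters.
import torch
import torch.nn as nn

class AtariPolicyValueNet(nn.Module):
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        # Three convolutional layers over 12-channel (4 RGB frames) 105x80 inputs.
        self.conv = nn.Sequential(
            nn.Conv2d(12, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        with torch.no_grad():
            conv_out = self.conv(torch.zeros(1, 12, 105, 80)).flatten(1).shape[1]
        # Two fully connected layers followed by a GRU layer.
        self.fc = nn.Sequential(
            nn.Linear(conv_out, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.gru = nn.GRUCell(hidden, hidden)
        # Separate policy and value heads.
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, frames, h):
        x = self.fc(self.conv(frames).flatten(1))
        h = self.gru(x, h)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h), h
```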

Extended Data Fig. 2 Maximum end-of-episode score found by the exploration phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge, compared to the downscaled (no domain knowledge) representation. Because only scores achieved at the episode end are reported, the plots for some games (for example, Solaris) begin after the start of the run, when the episode end is first reached. In a, averaging is over 50 runs for the 11 focus games and five runs for other games. In b, averaging is over 100 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples. Avg. Human, average human performance; SOTA, state-of-the-art performance; M, ×10^6; K, ×10^3.
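These captions report 95% bootstrap confidence intervals of the mean with 1,000 samples. A minimal NumPy sketch of that statistic (not taken from the authors' plotting code) is:

```python
# 95% bootstrap CI of the mean with 1,000 resamples, as reported in the captions.
import numpy as np

def bootstrap_ci_mean(values, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```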

Extended Data Fig. 3 Number of cells in archive during the exploration phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge. In a, archive size can decrease when the representation is recomputed. Previous archives are converted to the new format when the representation is recomputed, possibly leading to an archive with a size larger than 50,000. In this case, one iteration of the exploration phase runs and the representation is recomputed again. In a, averaging is over 50 runs for the 11 focus games and five runs for other games. In b, averaging is over 100 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

Extended Data Fig. 4 Progress of robustification phase on Atari.

a, Exploration phase without domain knowledge. b, Exploration phase with domain knowledge. Shown are the scores achieved by robustifying agents across training time for the exploration phase without domain-knowledge representations (a) and with representations informed by domain knowledge (b). In particular, the rolling mean is shown for performance across the past 100 episodes when starting from the virtual demonstration (which corresponds to the domain’s traditional starting state). Note that in a, averaging is over five independent runs, whereas in b, averaging is over 10 runs. Because the final performance is obtained by testing the highest-performing network checkpoint for each run over 1,000 additional episodes, rather than directly extracted from the curves above, the performance reported in Fig. 2b does not necessarily match any particular point along these curves (Methods). Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

Extended Data Fig. 5 Progress of the exploration phase in the robotics environment.

a, Runs with successful trajectories. b, Length of the shortest successful trajectory. In a, the exploration phase quickly achieves a 100% success rate for all shelves in the robotics environment. However, b shows that, although success is achieved quickly, it is useful to keep the exploration phase running longer to reduce the length of the successful trajectories, thus making robustification easier. Lines show the mean over 50 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

Extended Data Fig. 6 Policy-based Go-Explore overview.

With respect to their practical implementation, the main difference between policy-based Go-Explore and Go-Explore when restoring a simulator state is that in policy-based Go-Explore there exist separate actors that each have an internal loop switching between the ‘select’, ‘go’, and ‘explore’ steps, rather than one outer loop in which the ‘select’, ‘go’, and ‘explore’ steps are executed in synchronized batches. This structure allows policy-based Go-Explore to be easily combined with popular reinforcement learning algorithms like A3C20, PPO21 or DQN15, which already divide data-gathering over many actors.
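A hedged sketch of one such actor's internal loop is shown below. The archive and policy interfaces (archive.cell, archive.maybe_add, policy.act) are invented for illustration and are not the authors' API; the point is only that each actor cycles through 'select', 'go' and 'explore' on its own rather than in synchronized batches.

```python
# Illustrative per-actor loop for policy-based Go-Explore: each actor selects a
# goal cell, 'goes' there with the goal-conditioned policy, then 'explores'.
# All interfaces below are assumptions for the sake of the sketch.
import random

def actor_loop(env, policy, archive, episodes=100, explore_steps=100):
    for _ in range(episodes):
        obs = env.reset()
        # Select: pick a goal cell from the shared archive.
        goal = random.choice(list(archive))

        # Go: follow the goal-conditioned policy until the goal cell is reached.
        done = False
        while not done and archive.cell(obs) != goal:
            obs, _, done, _ = env.step(policy.act(obs, goal))

        # Explore: sample the policy (or, occasionally, random actions) to find
        # new cells; the 0.9/0.1 mix is arbitrary, chosen only for illustration.
        for _ in range(explore_steps):
            if done:
                break
            if random.random() < 0.9:
                action = policy.act(obs, goal)
            else:
                action = env.action_space.sample()
            obs, _, done, _ = env.step(action)
            archive.maybe_add(obs)
```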

Extended Data Fig. 7 Method by which cells are found.

a, b, In both Montezuma’s Revenge (a) and Pitfall (b), sampling from the goal-conditioned policy results in the discovery of roughly four times more cells than when taking random actions. At the start of training there is effectively no difference between random actions and sampling from the policy, supporting the intuition that sampling from the policy only becomes more efficient than random actions after the policy has acquired the basic skills for moving towards the indicated goal. Lastly, the number of cells that are discovered while returning is about twice that of the cells discovered when taking random actions after returning, indicating that the frames spent while returning to a previously visited cell are not just overhead required for moving towards the frontier of yet-undiscovered states and training the policy network, but actually provide a substantial contribution towards exploration as well. Lines show the mean over 10 runs. Shaded areas show 95% bootstrap CIs of the mean with 1,000 samples.

Extended Data Table 1 Hyperparameters
Extended Data Table 2 Robotics state representation
Extended Data Table 3 Full scores on Atari

Supplementary information

Supplementary Information

The Supplementary Information is made up of a single PDF file containing 13 Supplementary Figures, 2 Supplementary Tables, and additional sections.

About this article

Cite this article

Ecoffet, A., Huizinga, J., Lehman, J. et al. First return, then explore. Nature 590, 580–586 (2021). https://doi.org/10.1038/s41586-020-03157-9
