
Online Learning in Iterated Prisoner’s Dilemma to Mimic Human Behavior

  • Conference paper

In: PRICAI 2022: Trends in Artificial Intelligence (PRICAI 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13631)


Abstract

As an important psychological and social experiment, the Iterated Prisoner’s Dilemma (IPD) treats the choice to cooperate or defect as an atomic action. We propose to study the behaviors of online learning algorithms in the IPD game, investigating the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in an IPD tournament in which multiple agents compete sequentially. This allows us to analyze the dynamics of the policies learned by multiple self-interested, independent, reward-driven agents, and also to study how well these algorithms fit human behavior. The results suggest that agents that base their decisions only on the current situation perform worst in this kind of social dilemma game. We report multiple findings on online learning behaviors, together with clinical validations, as an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions.

The data and code to reproduce all the empirical results can be accessed at https://github.com/doerlbh/dilemmaRL.
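To make the setup concrete, here is a minimal, illustrative sketch (not the authors' implementation; see the repository above) of how one of the simplest agent families studied, a context-free multi-armed bandit, can be wired into an IPD loop. An epsilon-greedy bandit treats cooperate and defect as two arms and plays against a fixed Tit-for-Tat opponent; the class and function names and the standard (T, R, P, S) = (5, 3, 1, 0) payoff values are assumptions for illustration and may differ from the paper's tournament.

```python
# Minimal sketch, NOT the authors' implementation (see the linked repository).
# An epsilon-greedy multi-armed bandit treats cooperate/defect as two arms
# and plays Iterated Prisoner's Dilemma against a fixed Tit-for-Tat opponent.
# The (T, R, P, S) = (5, 3, 1, 0) payoffs are the standard values, assumed
# here; the paper's tournament may use a different matrix.
import random

COOPERATE, DEFECT = 0, 1
# PAYOFF[my_action][opponent_action] -> my reward
PAYOFF = [
    [3, 0],  # I cooperate: (R, S)
    [5, 1],  # I defect:    (T, P)
]

class EpsilonGreedyBandit:
    """Context-free bandit: ignores history, tracks mean reward per arm."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0, 0]      # times each arm was played
        self.values = [0.0, 0.0]  # running mean reward of each arm

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(2)  # explore
        return max((COOPERATE, DEFECT), key=lambda a: self.values[a])  # exploit

    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental running-mean update.
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def tit_for_tat(opponent_history):
    """Cooperate on the first round, then copy the opponent's last move."""
    return COOPERATE if not opponent_history else opponent_history[-1]

def play(rounds=1000, seed=0):
    random.seed(seed)
    agent = EpsilonGreedyBandit()
    agent_moves, total = [], 0
    for _ in range(rounds):
        a = agent.act()
        b = tit_for_tat(agent_moves)  # TFT reacts to the bandit's history
        r = PAYOFF[a][b]
        agent.update(a, r)
        agent_moves.append(a)
        total += r
    return total, agent.values

if __name__ == "__main__":
    total, values = play()
    print(f"total reward: {total}, mean arm values (C, D): {values}")
```

Whether such a myopic, context-free learner sustains cooperation against Tit-for-Tat is exactly the kind of question the tournament probes: the bandit's per-arm reward depends on its own previous move, so the immediate gain from defecting can mask the delayed punishment.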



Author information

Correspondence to Baihan Lin.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, B., Bouneffouf, D., Cecchi, G. (2022). Online Learning in Iterated Prisoner’s Dilemma to Mimic Human Behavior. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_10


  • DOI: https://doi.org/10.1007/978-3-031-20868-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20867-6

  • Online ISBN: 978-3-031-20868-3

  • eBook Packages: Computer Science, Computer Science (R0)
