Abstract
The Iterated Prisoner's Dilemma (IPD), an important psychological and social experiment, treats the choice to cooperate or defect as an atomic action. We study the behaviors of online learning algorithms in the IPD game, investigating the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in an IPD tournament in which multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of the policies learned by multiple self-interested, independent, reward-driven agents, and also to study how well these algorithms fit human behaviors. Results suggest that making decisions based only on the current situation performs worst in this kind of social dilemma game. We report multiple findings on online learning behaviors, together with clinical validations, as an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions.
The data and code to reproduce all the empirical results are available at https://github.com/doerlbh/dilemmaRL.
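To make the setup concrete, the following is a minimal sketch of the kind of experiment the abstract describes: a tabular Q-learning agent whose state is the opponent's last move, playing iterated prisoner's dilemma against a fixed tit-for-tat strategy under the standard payoff matrix (T=5, R=3, P=1, S=0). All names here (`QLearner`, `run_match`, the hyperparameters) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import random

# Standard IPD payoff matrix: (my_reward, their_reward) for (my_move, their_move).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """Cooperate on the first round, then mirror the learner's previous move."""
    return "C" if not history else history[-1][0]

class QLearner:
    """Tabular Q-learning agent conditioning only on the opponent's last move."""
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = {(s, a): 0.0 for s in ("start", "C", "D") for a in ("C", "D")}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Epsilon-greedy action selection over the two atomic actions.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(("C", "D"))
        return max(("C", "D"), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step Q-learning backup.
        best_next = max(self.q[(next_state, a)] for a in ("C", "D"))
        self.q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(state, action)])

def run_match(agent, rounds=500):
    """Play the learner against tit-for-tat; return the per-round history."""
    history = []          # list of (agent_move, opponent_move) pairs
    state = "start"
    for _ in range(rounds):
        a = agent.act(state)
        o = tit_for_tat(history)
        r, _ = PAYOFF[(a, o)]
        agent.update(state, a, r, next_state=o)
        history.append((a, o))
        state = o          # the only "context" is the opponent's latest move
    return history

history = run_match(QLearner())
agent_score = sum(PAYOFF[h][0] for h in history)
```

Extending this loop to a round-robin tournament over multiple agent types (bandit, contextual bandit, full RL) yields the sequential competition setting the paper evaluates.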
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lin, B., Bouneffouf, D., Cecchi, G. (2022). Online Learning in Iterated Prisoner’s Dilemma to Mimic Human Behavior. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_10
Print ISBN: 978-3-031-20867-6
Online ISBN: 978-3-031-20868-3